
Merged Uni-kbs #39

exyw wants to merge 9 commits into master from refactor/uni-kb-vs


@exyw exyw commented Jan 31, 2026

Summary
This PR migrates our university knowledge base (KB) retrieval from multiple per-university vector stores (and file-per-chunk uploads) to a single centralized vector store using document/section-level ingestion units. This significantly reduces the number of OpenAI files, improves retrieval precision, and simplifies routing/filtering across universities and sources.

Problem
Previously we maintained separate vector stores per university + corpus (student union vs official), and uploaded chunk-level markdown files into each store. This caused:

  • Extremely large numbers of OpenAI file_ids (e.g., hundreds per page for large docs)
  • Hard-to-debug sync behavior (multiple syncs → many file_ids tied to one canonical doc)
  • Complex routing across many vector stores
  • Operational overhead when adding new universities or new KB categories

What changed

  1. Centralized vector store
  • We now attach all university KB content to a single vector store (e.g., vs_central_2026).
  • Retrieval queries use this store and filter by metadata (uni/corpus/year) rather than choosing a store at runtime.
  2. Doc/section-level ingestion units (no more file-per-chunk)
  • We continue to store chunk-level content in Supabase for UI/citations.
  • But we no longer upload every chunk as an OpenAI file.
  • Instead we generate vector-store ingestion files from the original page-level markdown:
    • Normal pages → 1 file per page (section_key="full")
    • Large pages → split by headings (H2) and optionally into parts (section_key="workshop", workshop__part_0, etc.)
  • This reduces file counts and prevents a single large page from dominating retrieval.
  3. New Supabase mapping table: kb_document_files
  • Added/used kb_document_files keyed by (kb_slug, doc_id, section_key) to map ingestion units to OpenAI ids:
    • vector_store_id, file_id, vector_store_file_id
    • content_hash, is_active, last_fetched_at
    • plus metadata like canonical_url, uni, corpus, year, site
  • This is now the join layer to resolve retrieval hits back to canonical docs/sections.
  4. New scripts
  • build-vs-files-from-pages.ts
    • Builds ingestion files + manifest.json from page-level markdown (splits large pages into section units)
  • sync-central-vector-store-vs-files.ts
    • Reads manifest.json, uploads ingestion files, attaches them to the central store with attributes, and upserts kb_document_files
    • Deactivates stale rows / detaches old vector store files based on content_hash
  5. Pipeline update
  • KB search now queries the centralized store and filters by {uni, corpus, year}.
  • Retrieval results are resolved via kb_document_files (file_id → doc_id/section_key/canonical_url), then chunks are loaded from kb_chunks for UI/citations.
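The doc/section-level splitting described above can be sketched roughly as follows. This is a simplified illustration, not the actual build-vs-files-from-pages.ts code: the helper names, the size threshold, and the fallback `"intro"` heading are all assumptions.

```typescript
// Sketch: small pages become one "full" ingestion unit; large pages are
// split on H2 headings, and oversized sections are cut into fixed-size parts.

interface IngestionUnit {
  sectionKey: string; // "full", a heading slug, or "<slug>__part_N"
  content: string;
}

function slugify(heading: string): string {
  return heading
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function buildIngestionUnits(markdown: string, maxChars = 8000): IngestionUnit[] {
  // Normal page: one file per page with section_key="full".
  if (markdown.length <= maxChars) {
    return [{ sectionKey: "full", content: markdown }];
  }

  // Large page: split before each H2 heading; text before the first H2
  // becomes its own unit (labelled "intro" here as a placeholder).
  const parts = markdown.split(/^(?=## )/m);
  const units: IngestionUnit[] = [];
  for (const part of parts) {
    const heading = part.match(/^## (.+)$/m)?.[1] ?? "intro";
    const slug = slugify(heading);
    if (part.length <= maxChars) {
      units.push({ sectionKey: slug, content: part });
    } else {
      // Still too large: cut into parts (section_key="<slug>__part_N").
      for (let i = 0, n = 0; i < part.length; i += maxChars, n++) {
        units.push({ sectionKey: `${slug}__part_${n}`, content: part.slice(i, i + maxChars) });
      }
    }
  }
  return units;
}
```

The point of the part suffix is that each unit stays small enough that no single large page dominates retrieval.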
Metadata / filtering
  • Each attached vector-store file now includes attributes (for filtering and joins):
    • uni (e.g., rmit, uwa, monash)
    • corpus (su | official)
    • year (2026)
    • kb_slug ({uni}_{corpus}, e.g. rmit_su)
    • doc_id (sha1(canonical_url))
    • site (source site code, e.g. msa, rusu)
    • section_key (full or section slug / part slug)
    • content_hash (sha1(normalized content))
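Assembling that attribute object is mechanical; a minimal sketch using Node's crypto for the sha1 fields (the normalize step here is a stand-in, the real script's normalization rules may differ):

```typescript
import { createHash } from "node:crypto";

const sha1 = (s: string): string => createHash("sha1").update(s).digest("hex");

// Stand-in normalization; the actual normalization used for content_hash may differ.
const normalize = (s: string): string => s.replace(/\r\n/g, "\n").trim();

interface FileAttributes {
  uni: string;
  corpus: "su" | "official";
  year: number;
  kb_slug: string;      // {uni}_{corpus}
  doc_id: string;       // sha1(canonical_url)
  site: string;
  section_key: string;  // "full" or a section/part slug
  content_hash: string; // sha1(normalized content)
}

function buildAttributes(opts: {
  uni: string;
  corpus: "su" | "official";
  year: number;
  site: string;
  canonicalUrl: string;
  sectionKey: string;
  content: string;
}): FileAttributes {
  return {
    uni: opts.uni,
    corpus: opts.corpus,
    year: opts.year,
    kb_slug: `${opts.uni}_${opts.corpus}`,
    doc_id: sha1(opts.canonicalUrl),
    site: opts.site,
    section_key: opts.sectionKey,
    content_hash: sha1(normalize(opts.content)),
  };
}
```

Because doc_id is derived from the canonical URL, re-syncing the same page never mints a new identity; only content_hash changes when the content does.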

How to test
Generate ingestion files

  • tsx scripts/kb/build-vs-files-from-pages.ts

Sync into central vector store

  • tsx scripts/kb/sync-central-vector-store-vs-files.ts <CENTRAL_VS_ID>
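Conceptually, the sync step is a diff keyed on (kb_slug, doc_id, section_key) with content_hash deciding staleness. A minimal sketch of that decision logic (function and field names are hypothetical, not the actual sync-central-vector-store-vs-files.ts internals):

```typescript
// One ingestion unit, identified by a composite key such as
// `${kb_slug}/${doc_id}/${section_key}`, plus its content hash.
interface Entry {
  key: string;
  contentHash: string;
}

function planSync(manifest: Entry[], existingActive: Entry[]) {
  const current = new Map(existingActive.map(e => [e.key, e.contentHash]));
  const wanted = new Map(manifest.map(e => [e.key, e.contentHash]));

  // Upload anything new or whose content changed.
  const toUpload = manifest.filter(e => current.get(e.key) !== e.contentHash);

  // Deactivate rows that disappeared from the manifest or were superseded.
  const toDeactivate = existingActive.filter(e => {
    const h = wanted.get(e.key);
    return h === undefined || h !== e.contentHash;
  });

  return { toUpload, toDeactivate };
}
```

Unchanged units (same key, same content_hash) are skipped entirely, which is what keeps repeated syncs from multiplying file_ids for one canonical doc.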

App verification

  • Run a KB query for a known uni (e.g. RMIT).
  • Confirm:
  • only the central store is queried
  • filters apply (uni/corpus/year)
  • results map via kb_document_files
  • citations/snippets still come from kb_chunks
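For the filter and mapping checks above, the shapes involved look roughly like this. The filter object assumes the eq/and comparison-filter format of the OpenAI vector-store search API; the row shape is a subset of kb_document_files, and the helper names are illustrative only:

```typescript
type Comparison = { type: "eq"; key: string; value: string | number };
type Compound = { type: "and"; filters: Comparison[] };

// Build the {uni, corpus, year} attribute filter for the central store query.
function centralStoreFilter(uni: string, corpus: "su" | "official", year: number): Compound {
  return {
    type: "and",
    filters: [
      { type: "eq", key: "uni", value: uni },
      { type: "eq", key: "corpus", value: corpus },
      { type: "eq", key: "year", value: year },
    ],
  };
}

// Subset of a kb_document_files row used to resolve retrieval hits.
interface KbDocumentFileRow {
  file_id: string;
  doc_id: string;
  section_key: string;
  canonical_url: string;
}

// Map returned file_ids back to canonical docs/sections, dropping unknown ids.
function resolveHits(hitFileIds: string[], rows: KbDocumentFileRow[]): KbDocumentFileRow[] {
  const byFileId = new Map(rows.map(r => [r.file_id, r]));
  return hitFileIds
    .map(id => byFileId.get(id))
    .filter((r): r is KbDocumentFileRow => r !== undefined);
}
```

The resolved doc_id/section_key pairs are then used to load the matching rows from kb_chunks for citations.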

Notes / follow-ups
We may later add stricter section → chunk filtering by introducing a stable section_key field on kb_chunks during chunk generation (optional enhancement).
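If we pursue that, the stable section_key on kb_chunks could be derived from the chunk's heading path with the same slug rule the splitter uses, so the two sides agree by construction. A hypothetical sketch:

```typescript
// Derive a stable section_key from a chunk's heading path. Using only the
// top-level (H2) heading keeps the key aligned with the ingestion unit's slug.
// This is a proposed sketch, not existing code.
function sectionKeyFromHeadingPath(pathParts: string[]): string {
  if (pathParts.length === 0) return "full";
  return pathParts[0]
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}
```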

After validation, we can optionally clean up old OpenAI files by deleting orphaned file-... objects (separate script; to be done cautiously, since deletion is destructive).

