
Merged Uni-kbs #39

exyw wants to merge 9 commits into master from refactor/uni-kb-vs


@exyw exyw commented Jan 31, 2026

Summary
This PR migrates our university knowledge base (KB) retrieval from multiple per-university vector stores (and file-per-chunk uploads) to a single centralized vector store using document/section-level ingestion units. This significantly reduces the number of OpenAI files, improves retrieval precision, and simplifies routing/filtering across universities and sources.

Problem
Previously we maintained separate vector stores per university + corpus (student union vs official), and uploaded chunk-level markdown files into each store. This caused:

  • Extremely large numbers of OpenAI file_ids (e.g., hundreds per page for large docs)
  • Hard-to-debug sync behavior (multiple syncs → many file_ids tied to one canonical doc)
  • Complex routing across many vector stores
  • Operational overhead when adding new universities or new KB categories

What changed

  1. Centralized vector store
  • We now attach all university KB content to a single vector store (e.g., vs_central_2026).
  • Retrieval queries use this store and filter by metadata (uni/corpus/year) rather than choosing a store at runtime.
  2. Doc/section-level ingestion units (no more file-per-chunk)
  • We continue to store chunk-level content in Supabase for UI/citations.
  • But we no longer upload every chunk as an OpenAI file.
  • Instead we generate vector-store ingestion files from the original page-level markdown:
    • Normal pages → 1 file per page (section_key="full")
    • Large pages → split by headings (H2) and optionally into parts (section_key="workshop", workshop__part_0, etc.)
  • This reduces file counts and prevents a single large page from dominating retrieval.
  3. New Supabase mapping table: kb_document_files
  • Added/used kb_document_files keyed by (kb_slug, doc_id, section_key) to map ingestion units to OpenAI ids:
    • vector_store_id, file_id, vector_store_file_id
    • content_hash, is_active, last_fetched_at
    • plus metadata like canonical_url, uni, corpus, year, site
  • This is now the join layer to resolve retrieval hits back to canonical docs/sections.
  4. New scripts
  • build-vs-files-from-pages.ts
    • Builds ingestion files + manifest.json from page-level markdown (splits large pages into section units)
  • sync-central-vector-store-vs-files.ts
    • Reads manifest.json, uploads ingestion files, attaches them to the central store with attributes, and upserts kb_document_files
    • Deactivates stale rows / detaches old vector store files based on content_hash
  5. Pipeline update
  • KB search now queries the centralized store and filters by {uni, corpus, year}.
  • Retrieval results are resolved via kb_document_files (file_id → doc_id/section_key/canonical_url), then chunks are loaded from kb_chunks for UI/citations.
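The doc/section-level splitting described above can be sketched roughly as follows. This is a simplified illustration, not the actual build-vs-files-from-pages.ts code: the helper names, the size threshold, and the fallback `"intro"` heading are all assumptions.

```typescript
// Sketch: small pages become one "full" ingestion unit; large pages are
// split on H2 headings, and oversized sections are cut into fixed-size parts.

interface IngestionUnit {
  sectionKey: string; // "full", a heading slug, or "<slug>__part_N"
  content: string;
}

function slugify(heading: string): string {
  return heading
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function buildIngestionUnits(markdown: string, maxChars = 8000): IngestionUnit[] {
  // Normal page: one file per page with section_key="full".
  if (markdown.length <= maxChars) {
    return [{ sectionKey: "full", content: markdown }];
  }

  // Large page: split before each H2 heading; text before the first H2
  // becomes its own unit (labelled "intro" here as a placeholder).
  const parts = markdown.split(/^(?=## )/m);
  const units: IngestionUnit[] = [];
  for (const part of parts) {
    const heading = part.match(/^## (.+)$/m)?.[1] ?? "intro";
    const slug = slugify(heading);
    if (part.length <= maxChars) {
      units.push({ sectionKey: slug, content: part });
    } else {
      // Still too large: cut into parts (section_key="<slug>__part_N").
      for (let i = 0, n = 0; i < part.length; i += maxChars, n++) {
        units.push({ sectionKey: `${slug}__part_${n}`, content: part.slice(i, i + maxChars) });
      }
    }
  }
  return units;
}
```

The point of the part suffix is that each unit stays small enough that no single large page dominates retrieval.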
Metadata / filtering
  • Each attached vector-store file now includes attributes (for filtering and joins):
    • uni (e.g., rmit, uwa, monash)
    • corpus (su | official)
    • year (2026)
    • kb_slug ({uni}_{corpus}, e.g. rmit_su)
    • doc_id (sha1(canonical_url))
    • site (source site code, e.g. msa, rusu)
    • section_key (full or section slug / part slug)
    • content_hash (sha1(normalized content))
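Assembling that attribute object is mechanical; a minimal sketch using Node's crypto for the sha1 fields (the normalize step here is a stand-in, the real script's normalization rules may differ):

```typescript
import { createHash } from "node:crypto";

const sha1 = (s: string): string => createHash("sha1").update(s).digest("hex");

// Stand-in normalization; the actual normalization used for content_hash may differ.
const normalize = (s: string): string => s.replace(/\r\n/g, "\n").trim();

interface FileAttributes {
  uni: string;
  corpus: "su" | "official";
  year: number;
  kb_slug: string;      // {uni}_{corpus}
  doc_id: string;       // sha1(canonical_url)
  site: string;
  section_key: string;  // "full" or a section/part slug
  content_hash: string; // sha1(normalized content)
}

function buildAttributes(opts: {
  uni: string;
  corpus: "su" | "official";
  year: number;
  site: string;
  canonicalUrl: string;
  sectionKey: string;
  content: string;
}): FileAttributes {
  return {
    uni: opts.uni,
    corpus: opts.corpus,
    year: opts.year,
    kb_slug: `${opts.uni}_${opts.corpus}`,
    doc_id: sha1(opts.canonicalUrl),
    site: opts.site,
    section_key: opts.sectionKey,
    content_hash: sha1(normalize(opts.content)),
  };
}
```

Because doc_id is derived from the canonical URL, re-syncing the same page never mints a new identity; only content_hash changes when the content does.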

How to test
Generate ingestion files

  • tsx scripts/kb/build-vs-files-from-pages.ts

Sync into central vector store

  • tsx scripts/kb/sync-central-vector-store-vs-files.ts <CENTRAL_VS_ID>
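Conceptually, the sync step is a diff keyed on (kb_slug, doc_id, section_key) with content_hash deciding staleness. A minimal sketch of that decision logic (function and field names are hypothetical, not the actual sync-central-vector-store-vs-files.ts internals):

```typescript
// One ingestion unit, identified by a composite key such as
// `${kb_slug}/${doc_id}/${section_key}`, plus its content hash.
interface Entry {
  key: string;
  contentHash: string;
}

function planSync(manifest: Entry[], existingActive: Entry[]) {
  const current = new Map(existingActive.map(e => [e.key, e.contentHash]));
  const wanted = new Map(manifest.map(e => [e.key, e.contentHash]));

  // Upload anything new or whose content changed.
  const toUpload = manifest.filter(e => current.get(e.key) !== e.contentHash);

  // Deactivate rows that disappeared from the manifest or were superseded.
  const toDeactivate = existingActive.filter(e => {
    const h = wanted.get(e.key);
    return h === undefined || h !== e.contentHash;
  });

  return { toUpload, toDeactivate };
}
```

Unchanged units (same key, same content_hash) are skipped entirely, which is what keeps repeated syncs from multiplying file_ids for one canonical doc.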

App verification

  • Run a KB query for a known uni (e.g. RMIT).
  • Confirm:
  • only the central store is queried
  • filters apply (uni/corpus/year)
  • results map via kb_document_files
  • citations/snippets still come from kb_chunks
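For the filter and mapping checks above, the shapes involved look roughly like this. The filter object assumes the eq/and comparison-filter format of the OpenAI vector-store search API; the row shape is a subset of kb_document_files, and the helper names are illustrative only:

```typescript
type Comparison = { type: "eq"; key: string; value: string | number };
type Compound = { type: "and"; filters: Comparison[] };

// Build the {uni, corpus, year} attribute filter for the central store query.
function centralStoreFilter(uni: string, corpus: "su" | "official", year: number): Compound {
  return {
    type: "and",
    filters: [
      { type: "eq", key: "uni", value: uni },
      { type: "eq", key: "corpus", value: corpus },
      { type: "eq", key: "year", value: year },
    ],
  };
}

// Subset of a kb_document_files row used to resolve retrieval hits.
interface KbDocumentFileRow {
  file_id: string;
  doc_id: string;
  section_key: string;
  canonical_url: string;
}

// Map returned file_ids back to canonical docs/sections, dropping unknown ids.
function resolveHits(hitFileIds: string[], rows: KbDocumentFileRow[]): KbDocumentFileRow[] {
  const byFileId = new Map(rows.map(r => [r.file_id, r]));
  return hitFileIds
    .map(id => byFileId.get(id))
    .filter((r): r is KbDocumentFileRow => r !== undefined);
}
```

The resolved doc_id/section_key pairs are then used to load the matching rows from kb_chunks for citations.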

Notes / follow-ups
We may later add stricter section → chunk filtering by introducing a stable section_key field on kb_chunks during chunk generation (optional enhancement).
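If we pursue that, the stable section_key on kb_chunks could be derived from the chunk's heading path with the same slug rule the splitter uses, so the two sides agree by construction. A hypothetical sketch:

```typescript
// Derive a stable section_key from a chunk's heading path. Using only the
// top-level (H2) heading keeps the key aligned with the ingestion unit's slug.
// This is a proposed sketch, not existing code.
function sectionKeyFromHeadingPath(pathParts: string[]): string {
  if (pathParts.length === 0) return "full";
  return pathParts[0]
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}
```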

After validation, we can optionally clean up old OpenAI files by deleting orphaned file-... objects (separate script; to be done cautiously, since deletion is destructive).

