This repository contains two coordinated scripts for embedding and analyzing free-text descriptions of KG2 concept nodes using BioLinkBERT (model documentation) and ChromaDB. The goal is to generate a reusable, persistent biomedical vector store that supports concept canonicalization, cross-reference discovery, and downstream reasoning or visualization.
| File | Purpose | Typical Usage |
|---|---|---|
| `embed_kg2nodes.py` | Builds embeddings from KG2 node descriptions and stores them in a persistent ChromaDB collection (locally). | Run once per KG version to generate a new vector store. |
| `analyze_embeddings.py` | Connects to an existing vector store for validation, inspection, and visualization of the embeddings. | Run after embeddings are generated to explore semantic structure and relationships. |
| `run_embedder.sh` | Executes the KG2 node embedding pipeline using `nodes_cleaned.tsv` as input. | Intended for quick testing, proof-of-concept runs, and general demonstration. |

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python embed_kg2nodes.py -i nodes_cleaned.tsv -c kg2103 -o chromadb -m new -v

If you'd prefer not to supply arguments and just want an MVP vector store to kick the tires, run this instead:

    bash run_embedder.sh -v

Once a vector store exists, inspect it with `analyze_embeddings.py`:

    python analyze_embeddings.py -c kg2103 -d chromadb -m info
    python analyze_embeddings.py -c kg2103 -d chromadb -m similar -q "lactulose"
    python analyze_embeddings.py -c kg2103 -d chromadb -m umap

Inputs for `run_embedder.sh`:
- Optional: include the `-v` flag to enable verbose output from the embedding pipeline.
Processing:
- Automatically runs the full KG2 node embedding pipeline using nodes_cleaned.tsv as input.
- Requires no additional configuration.
Inputs for `embed_kg2nodes.py`:
- TSV file containing `CURIE`, `Name`, and `Description` fields. Must be tab-delimited for the script to work.
- Output directory.
- Collection name to create, update, or overwrite.
- Mode indicating whether you are creating, updating, or overwriting a collection.
Processing:
- Uses `BioLinkBERT-base` to generate normalized embeddings.
- Falls back to `Name` or `CURIE` when `Description` is missing.
- Metadata (CURIE + name) is preserved for traceability.
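
The input handling and fallback rule above can be sketched as follows. This is a minimal illustration, not the code in `embed_kg2nodes.py`: the helper names `text_for_node` and `load_nodes` are hypothetical, the precedence `Description`, then `Name`, then `CURIE` is an assumption about how the fallback is ordered, the file is assumed to have a header row, and the sample rows are made up.

```python
import csv
import io


def text_for_node(row):
    """Pick the text to embed: Description, else Name, else CURIE (assumed order)."""
    for field in ("Description", "Name", "CURIE"):
        value = (row.get(field) or "").strip()
        if value:
            return value
    raise ValueError("row has no CURIE, Name, or Description")


def load_nodes(handle):
    """Yield (curie, name, text) tuples from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row["CURIE"], row.get("Name") or "", text_for_node(row)


# Made-up rows standing in for nodes_cleaned.tsv.
sample = io.StringIO(
    "CURIE\tName\tDescription\n"
    "EX:1\tlactulose\tA synthetic disaccharide used as a laxative.\n"
    "EX:2\taspirin\t\n"
    "EX:3\t\t\n"
)

nodes = list(load_nodes(sample))
for curie, name, text in nodes:
    print(curie, "->", text)
```

Keeping the CURIE and name alongside the chosen text mirrors the metadata that the script preserves for traceability.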
Vector Store Configuration:

    "hnsw": {
        "space": "cosine",
        "ef_construction": 200,
        "ef_search": 150,
        "max_neighbors": 32
    }

Modes:
- `new`: create a new collection (default)
- `add`: append to an existing collection
- `overwrite`: replace an existing collection
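
For orientation, a command-line interface with these modes could be declared roughly as below. Only the short flags (`-i`, `-c`, `-o`, `-m`, `-v`), the three mode names, and the `new` default come from this README; the long option names, the help strings, and which arguments are required are assumptions, so treat this as a sketch rather than the script's actual parser.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in the example command (long names assumed)."""
    parser = argparse.ArgumentParser(
        description="Embed KG2 node descriptions into a ChromaDB collection."
    )
    parser.add_argument("-i", "--input", required=True, help="tab-delimited nodes file")
    parser.add_argument("-c", "--collection", required=True, help="collection name")
    parser.add_argument("-o", "--output-dir", required=True, help="ChromaDB directory")
    parser.add_argument(
        "-m",
        "--mode",
        choices=["new", "add", "overwrite"],
        default="new",  # matches the documented default
        help="create, append to, or replace the collection",
    )
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
    return parser


# Omitting -m falls back to the default mode.
args = build_parser().parse_args(
    ["-i", "nodes_cleaned.tsv", "-c", "kg2103", "-o", "chromadb"]
)
print(args.mode)  # new
```

Restricting `-m` with `choices` makes a mistyped mode fail at parse time instead of partway through an embedding run.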
Output:
- Embeddings stored in a Chroma collection named after the current KG version (e.g., `kg2103`).
- The vector store is created as a `PersistentClient` and saved locally in the directory specified by `output_dir`.
Inputs for `analyze_embeddings.py`:
- `-c` (Collection name): name of the Chroma collection to analyze.
- `-d` (Directory): path to the ChromaDB persistent store.
- `-m` (Mode): specifies the type of analysis to run.
Available Modes:
- `info`: Summarizes collection statistics and sample embeddings.
- `similar`: Finds concepts semantically similar to the input query using existing embeddings.
  - Example: find concepts semantically similar to `'lactulose'`.
  - Uses the embedding already stored for `'lactulose'` rather than embedding the query text.
- `pair`: Computes cosine similarity between two named concepts.
- `clusters`: Runs PCA → KMeans to identify semantic clusters, outputs representative concepts, and saves cluster visualizations.
- `umap`: Performs UMAP (or t-SNE if UMAP is not installed) dimensionality reduction and generates labeled 2D visualizations of embedding space.
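
The idea behind the `similar` and `pair` modes can be illustrated without ChromaDB. The sketch below uses a plain dictionary as a stand-in for the collection; the function names `cosine` and `similar`, the parameter `k`, and the toy 3-dimensional vectors are all invented for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar(store, query_name, k=3):
    """Rank other concepts against the vector already stored for query_name."""
    query_vec = store[query_name]  # reuse the stored embedding; no model call
    scores = [
        (name, cosine(query_vec, vec))
        for name, vec in store.items()
        if name != query_name
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors, not real BioLinkBERT output.
store = {
    "lactulose": [0.9, 0.1, 0.0],
    "lactose": [0.8, 0.2, 0.1],
    "aspirin": [0.0, 0.1, 0.9],
}

ranked = similar(store, "lactulose", k=2)          # "similar"-style lookup
pair_score = cosine(store["lactulose"], store["aspirin"])  # "pair"-style comparison
print(ranked[0][0], round(pair_score, 4))
```

Looking up the stored vector means the query must already exist in the collection, which is the trade-off of not re-embedding the query text.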
- Model: BioLinkBERT-base
- Embedding Normalization: All embeddings are L2-normalized to ensure consistent cosine similarity comparisons.
- Backend: Persistent ChromaDB store using the HNSW index with cosine distance.
- Dependencies: `chromadb`, `torch`, `sentence-transformers`, `scikit-learn`, `numpy`, `matplotlib`, `umap-learn` (optional).
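
To see why L2 normalization pairs well with a cosine index, consider the small standard-library sketch below. It assumes the usual definition of cosine distance as one minus cosine similarity; the helper names `l2_normalize` and `cosine_distance` are hypothetical and the vectors are toy values.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Cosine distance for unit vectors: 1 minus their dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
print(u)                                # [0.6, 0.8]
print(round(cosine_distance(u, v), 4))  # 0.04
```

Once every vector has unit length, cosine similarity reduces to a dot product, so scores are comparable across concepts regardless of the raw magnitude the model produced.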