Takes a product description string and returns the best-matching Harmonized System (HS) trade codes.
from hs_classifier import init_index, init_classifier, classify_row
init_index() # one-time: build FAISS index from Atlas DB
classifier = init_classifier() # load FAISS index + S-BERT model
result = classify_row({"product_description": "organic bananas"}, classifier)
print(result)
{
"codes": ["0803", "0806"],
"descriptions": ["Bananas, including plantains, fresh or dried", "Grapes, fresh or dried"],
"reason": "The product is explicitly bananas, matching HS 0803...",
"search_terms": ["fresh fruit tropical", "banana plantain", ...],
"detected_language": "en"
}
Requires Python 3.12+.
uv venv && source .venv/bin/activate
pip install "hs-classifier[google] @ git+https://github.com/karandaryanani/panjiva-hscode.git"
cp .env.example .env # fill in API keys, Atlas DB credentials, and model choices
Pick the extra that matches your LLM provider:
| Extra | Provider | API key env var |
|---|---|---|
| [anthropic] | Anthropic | ANTHROPIC_API_KEY |
| [google] | Google Gemini | GOOGLE_API_KEY |
| [openai] | OpenAI | OPENAI_API_KEY |
| [cohere] | Cohere | COHERE_API_KEY |
| [all] | All of the above | n/a |
To run the example.ipynb notebook, add the notebook extra:
pip install "hs-classifier[google,notebook] @ git+https://github.com/karandaryanani/panjiva-hscode.git"
The example.ipynb notebook walks through the full workflow: index setup, classification, eval sampling, labeling, evaluation, and tuning. Start there.
flowchart TD
subgraph setup ["Setup (run once)"]
DB[("Atlas DB")] --> EMB["S-BERT embeddings"]
EMB --> IDX[("hs12_4_index.parquet")]
DB --> CH[("hs2_chapters.parquet")]
end
subgraph classify ["classify_row()"]
A["row dict"] --> BQ["build query + context"]
BQ --> T["detect language"]
T -->|not English| TR["translate"]
T -->|English| ST
TR --> ST["generate search terms\n(LLM + HS2 chapters)"]
CH --> ST
ST --> R["FAISS retrieval\n~25 candidates"]
IDX --> R
R --> RR["LLM reranking"]
RR --> OUT["top N codes + reasoning"]
end
Language detection — Input text is checked with Lingua. Non-English text is translated via the translators package (Google backend).
Search term generation — The LLM receives the product string, shipping context, and the 97 HS2 chapter descriptions. It generates 5-8 search terms using HS vocabulary to match well in the embedding space.
Retrieval — The original query and each generated term are embedded with S-BERT and searched against a FAISS index. Results are pooled and deduplicated.
Reranking — The LLM receives the candidate shortlist and selects the top N codes with a short justification.
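The pooling and deduplication in the retrieval step can be sketched as follows. This is a minimal illustration, not the package's code: a brute-force cosine search stands in for the FAISS index, the two-dimensional vectors stand in for S-BERT embeddings, and the helper names (`search`, `pool_candidates`) are hypothetical. Keeping the best score per code is one plausible way to aggregate duplicates and is an assumption here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, k):
    """Brute-force top-k search; a stand-in for a FAISS index lookup."""
    scored = [(code, cosine(vec, query_vec)) for code, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def pool_candidates(index, query_vecs, k_per_query, top_k_total):
    """Search once per query, then keep each code once with its best score."""
    best = {}
    for query_vec in query_vecs:
        for code, score in search(index, query_vec, k_per_query):
            if score > best.get(code, -math.inf):
                best[code] = score  # assumption: duplicates keep the max score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k_total]


# Toy index: HS4 code -> made-up 2-D "embedding"
index = {
    "0803": [1.0, 0.0],
    "0806": [0.8, 0.6],
    "0901": [0.0, 1.0],
}
# One vector for the original query, one for a generated search term
queries = [[1.0, 0.1], [0.9, 0.3]]

candidates = pool_candidates(index, queries, k_per_query=2, top_k_total=25)
codes = [code for code, _ in candidates]
print(codes)
```

Both queries return the same two codes, so the pooled shortlist contains each code once. That deduplicated shortlist is what the reranking step would receive.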
The eval splitter produces a representative sample for labeling and evaluation. The approach follows Dell (2025), who argues that embedding-based stratified sampling avoids two common pitfalls: keyword-based sampling, which fails to place positive probability on all instances; and active learning, which undersamples rare classes under severe class imbalance.
flowchart LR
A["Descriptions"] --> B["S-BERT"] --> C["UMAP"] --> D["HDBSCAN"] --> E["Stratified\nsample"]
The result is a sample that covers the full diversity of your data, including rare product types that keyword filters or random sampling would miss.
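The final sampling stage can be sketched in plain Python, assuming cluster labels have already been produced by the S-BERT, UMAP, and HDBSCAN stages. The function name `stratified_sample`, the `min_per_cluster` floor, and the choice to treat every label as its own stratum are illustrative assumptions, not the package's actual behavior.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels, sample_frac, seed=0, min_per_cluster=1):
    """Draw sample_frac of each cluster, with at least min_per_cluster rows.

    labels[i] is the cluster label of row i. Returns sorted row indices.
    """
    rng = random.Random(seed)

    # Group row indices by cluster label
    by_cluster = defaultdict(list)
    for row_index, label in enumerate(labels):
        by_cluster[label].append(row_index)

    sample = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        n = max(min_per_cluster, math.ceil(len(members) * sample_frac))
        n = min(n, len(members))  # never ask for more rows than exist
        sample.extend(rng.sample(members, n))
    return sorted(sample)


# Imbalanced toy data: one large cluster, one medium, one rare
labels = [0] * 80 + [1] * 16 + [2] * 4
picked = stratified_sample(labels, sample_frac=0.25)

print(len(picked), sorted({labels[i] for i in picked}))
```

Because each cluster is sampled separately, the rare cluster contributes at least one row, whereas a small uniform random sample could miss it entirely.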
example.ipynb # Full walkthrough: classify, split, evaluate
run_init.py # One-time setup: build lookup index from Atlas DB
run_pipeline.py # CLI: classify a single row (--row_index, --csv_path)
run_splitter.py # CLI: generate eval sample (--csv_path, --sample_frac)
hs_classifier/
├── __init__.py # init_index(), init_classifier(), classify_row()
├── init_lookup_index.py # DB connection, S-BERT encoding, save index parquet
├── build_query.py # Build one classifier query from one raw row
├── translator.py # Lingua language detection + Google translation backend
├── search_terms.py # LLM search term generation (Instructor + Pydantic)
├── retrieval.py # Load index parquet, FAISS search, aggregate and deduplicate
├── reranker.py # LLM reranking of candidates (Instructor + Pydantic)
├── splitter.py # S-BERT + UMAP + HDBSCAN clustering, stratified sampling
└── evaluator.py # Classification metrics with readable counts + summary
data/
├── raw/ # Sample CSV data
└── intermediate/ # Parquet artifacts + splitter outputs under samples/
All configuration lives in .env (see .env.example for annotated defaults). Model and retrieval parameters can also be overridden per call via classify_row() keyword arguments.
| Variable | Role | Per-call override | Default |
|---|---|---|---|
| EMBEDDING_MODEL | S-BERT model for encoding descriptions and queries | n/a (rebuild index) | dell-research-harvard/lt-un-data-fine-fine-en |
| SEARCH_TERM_MODEL | LLM that generates search terms | search_term_model= | google/gemini-2.5-flash-lite |
| RERANKER_MODEL | LLM that picks the top N codes from candidates | reranker_model= | google/gemini-2.5-flash-lite |
| TOP_K_TOTAL | Total FAISS candidates retrieved | top_k_total= | 25 |
| TOP_K_BERT | Candidates allocated to the raw query | top_k_bert= | 10 |
| LLM_TEMPERATURE | Temperature for LLM calls | temperature= | 0.1 |
| INTERMEDIATE_DATA_DIR | Directory for parquet artifacts | intermediate_data_dir= | data/intermediate |
Database and credential variables (ATLAS_*, HF_TOKEN, provider API keys) are documented in .env.example.
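The relationship between .env values and per-call overrides can be illustrated with a small sketch. It assumes a precedence of per-call argument, then environment variable, then built-in default; the helper `resolve_setting` is hypothetical and is not part of the package. The defaults are the ones listed in the table above.

```python
import os

# Defaults copied from the configuration table
DEFAULTS = {
    "TOP_K_TOTAL": "25",
    "TOP_K_BERT": "10",
    "LLM_TEMPERATURE": "0.1",
}


def resolve_setting(name, override=None, cast=str, env=None):
    """Return the per-call override if given, else the env value, else the default."""
    if override is not None:
        return override
    env = os.environ if env is None else env
    return cast(env.get(name, DEFAULTS[name]))


# Simulated environment, as if .env set only TOP_K_TOTAL
env = {"TOP_K_TOTAL": "40"}

top_k_total = resolve_setting("TOP_K_TOTAL", cast=int, env=env)  # from env
top_k_bert = resolve_setting("TOP_K_BERT", cast=int, env=env)  # falls back to default
temperature = resolve_setting(
    "LLM_TEMPERATURE", override=0.0, cast=float, env=env
)  # per-call override wins

print(top_k_total, top_k_bert, temperature)
```

With the package itself, the per-call form would presumably look like `classify_row(row, classifier, top_k_total=40, temperature=0.0)`, using the keyword names from the table.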
- Support NAICS: The core pipeline is taxonomy-agnostic in principle. Extending to NAICS would mainly require a new index. The econ-embeddings work is relevant here — embeddings trained across economic taxonomies would enable better cross-domain retrieval and reranking.
- Batch classification: A classify_batch() that batches LLM calls for bulk runs.
- Vector DB: FAISS works well at ~1,200 HS4 codes. A managed vector DB would only matter at much larger scale.