harvard-growth-lab/hscode-mapping

hs-classifier

Takes a product description string and returns the best-matching Harmonized System (HS) trade codes.

from hs_classifier import init_index, init_classifier, classify_row

init_index()                     # one-time: build FAISS index from Atlas DB
classifier = init_classifier()   # load FAISS index + S-BERT model

result = classify_row({"product_description": "organic bananas"}, classifier)
print(result)
{
  "codes": ["0803", "0806"],
  "descriptions": ["Bananas, including plantains, fresh or dried", "Grapes, fresh or dried"],
  "reason": "The product is explicitly bananas, matching HS 0803...",
  "search_terms": ["fresh fruit tropical", "banana plantain", ...],
  "detected_language": "en"
}

Installation

Requires Python 3.12+.

uv venv && source .venv/bin/activate
uv pip install "hs-classifier[google] @ git+https://github.com/karandaryanani/panjiva-hscode.git"   # uv venv does not seed pip by default
cp .env.example .env   # fill in API keys, Atlas DB credentials, and model choices

Pick the extra that matches your LLM provider:

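| Extra       | Provider         | API key env var   |
|-------------|------------------|-------------------|
| [anthropic] | Anthropic        | ANTHROPIC_API_KEY |
| [google]    | Google Gemini    | GOOGLE_API_KEY    |
| [openai]    | OpenAI           | OPENAI_API_KEY    |
| [cohere]    | Cohere           | COHERE_API_KEY    |
| [all]       | All of the above |                   |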

To run the example.ipynb notebook, add the notebook extra:

uv pip install "hs-classifier[google,notebook] @ git+https://github.com/karandaryanani/panjiva-hscode.git"

Quick start

The example.ipynb notebook walks through the full workflow: index setup, classification, eval sampling, labeling, evaluation, and tuning. Start there.

How it works

Classification

flowchart TD
    subgraph setup ["Setup (run once)"]
        DB[("Atlas DB")] --> EMB["S-BERT embeddings"]
        EMB --> IDX[("hs12_4_index.parquet")]
        DB --> CH[("hs2_chapters.parquet")]
    end

    subgraph classify ["classify_row()"]
        A["row dict"] --> BQ["build query + context"]
        BQ --> T["detect language"]
        T -->|not English| TR["translate"]
        T -->|English| ST
        TR --> ST["generate search terms\n(LLM + HS2 chapters)"]
        CH --> ST
        ST --> R["FAISS retrieval\n~25 candidates"]
        IDX --> R
        R --> RR["LLM reranking"]
        RR --> OUT["top N codes + reasoning"]
    end

Language detection — Input text is checked with Lingua. Non-English text is translated via the translators package (Google backend).

Search term generation — The LLM receives the product string, shipping context, and the 97 HS2 chapter descriptions. It generates 5-8 search terms using HS vocabulary to match well in the embedding space.

Retrieval — The original query and each generated term are embedded with S-BERT and searched against a FAISS index. Results are pooled and deduplicated.

Reranking — The LLM receives the candidate shortlist and selects the top N codes with a short justification.
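
The retrieval step above pools hits from several queries before reranking. Here is a minimal sketch of one way that pooling could work, in plain Python with no FAISS dependency. Each query contributes a ranked list of (HS code, similarity) pairs, the raw query gets a fixed share of the candidate budget, the generated terms split the remainder, and duplicate codes keep their best score. The function name `pool_candidates`, the even split across terms, and the best-score rule are assumptions made for illustration, not the logic in retrieval.py. Only the parameter names `top_k_total` and `top_k_bert` come from the configuration table.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (HS4 code, similarity score; higher is better)


def pool_candidates(
    raw_hits: List[Hit],
    term_hits: List[List[Hit]],
    top_k_total: int = 25,
    top_k_bert: int = 10,
) -> List[Hit]:
    """Merge ranked hit lists into one deduplicated shortlist (illustrative only)."""
    best: Dict[str, float] = {}

    def absorb(hits: List[Hit], limit: int) -> None:
        # Keep the highest score seen for each code.
        for code, score in hits[:limit]:
            if code not in best or score > best[code]:
                best[code] = score

    # The raw query gets its own allocation of the candidate budget.
    absorb(raw_hits, top_k_bert)

    # Assumption: the remaining budget is split evenly across generated terms.
    if term_hits:
        per_term = max(1, (top_k_total - top_k_bert) // len(term_hits))
        for hits in term_hits:
            absorb(hits, per_term)

    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k_total]


# Toy hit lists standing in for FAISS search results.
raw = [("0803", 0.91), ("0806", 0.62), ("0804", 0.55)]
terms = [
    [("0803", 0.88), ("0810", 0.60)],
    [("0804", 0.70), ("0803", 0.65)],
]
shortlist = pool_candidates(raw, terms, top_k_total=4, top_k_bert=2)
print(shortlist)
```

The reranker would then see each code once, however many queries surfaced it.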

Eval sampling

The eval splitter produces a representative sample for labeling and evaluation. The approach follows Dell (2025), who argues that embedding-based stratified sampling avoids two common pitfalls: keyword-based sampling, which fails to place positive probability on all instances; and active learning, which undersamples rare classes under severe class imbalance.

flowchart LR
    A["Descriptions"] --> B["S-BERT"] --> C["UMAP"] --> D["HDBSCAN"] --> E["Stratified\nsample"]

The result is a sample that covers the full diversity of your data, including rare product types that keyword filters or random sampling would miss.
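
The last stage of the flow above, stratified sampling over cluster labels, can be sketched in plain Python. This assumes the S-BERT, UMAP, and HDBSCAN steps have already assigned each description a cluster label (HDBSCAN marks noise points as -1). The helper `stratified_sample`, the at-least-one-row-per-cluster rule, and the choice to treat noise as its own stratum are illustrative assumptions, not the behavior of splitter.py.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(
    labels: Sequence[int], sample_frac: float, seed: int = 0
) -> List[int]:
    """Return row indices sampled proportionally from every cluster."""
    rng = random.Random(seed)

    # Group row indices by cluster label; -1 (HDBSCAN noise) is kept as a stratum.
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    chosen: List[int] = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # Proportional allocation, rounded up so small clusters still get one row.
        k = min(len(members), max(1, math.ceil(sample_frac * len(members))))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# 10 rows in a large cluster, 2 in a rare cluster, 1 noise point.
labels = [0] * 10 + [1] * 2 + [-1]
picked = stratified_sample(labels, sample_frac=0.2)
print(picked)
```

With a 20% fraction, the rare cluster and the noise point each still contribute one row. A purely random draw of the same size does not guarantee that.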

Developer

Project structure

example.ipynb             # Full walkthrough: classify, split, evaluate
run_init.py               # One-time setup: build lookup index from Atlas DB
run_pipeline.py           # CLI: classify a single row (--row_index, --csv_path)
run_splitter.py           # CLI: generate eval sample (--csv_path, --sample_frac)

hs_classifier/
├── __init__.py           # init_index(), init_classifier(), classify_row()
├── init_lookup_index.py  # DB connection, S-BERT encoding, save index parquet
├── build_query.py        # Build one classifier query from one raw row
├── translator.py         # Lingua language detection + Google translation backend
├── search_terms.py       # LLM search term generation (Instructor + Pydantic)
├── retrieval.py          # Load index parquet, FAISS search, aggregate and deduplicate
├── reranker.py           # LLM reranking of candidates (Instructor + Pydantic)
├── splitter.py           # S-BERT + UMAP + HDBSCAN clustering, stratified sampling
└── evaluator.py          # Classification metrics with readable counts + summary

data/
├── raw/                  # Sample CSV data
└── intermediate/         # Parquet artifacts + splitter outputs under samples/

Configuration

All configuration lives in .env (see .env.example for annotated defaults). Model and retrieval parameters can also be overridden per call via classify_row() keyword arguments.

| Variable | Role | Per-call override | Default |
|----------|------|-------------------|---------|
| EMBEDDING_MODEL | S-BERT model for encoding descriptions and queries | none (rebuild index) | dell-research-harvard/lt-un-data-fine-fine-en |
| SEARCH_TERM_MODEL | LLM that generates search terms | search_term_model= | google/gemini-2.5-flash-lite |
| RERANKER_MODEL | LLM that picks the top N codes from candidates | reranker_model= | google/gemini-2.5-flash-lite |
| TOP_K_TOTAL | Total FAISS candidates retrieved | top_k_total= | 25 |
| TOP_K_BERT | Candidates allocated to the raw query | top_k_bert= | 10 |
| LLM_TEMPERATURE | Temperature for LLM calls | temperature= | 0.1 |
| INTERMEDIATE_DATA_DIR | Directory for parquet artifacts | intermediate_data_dir= | data/intermediate |

Database and credential variables (ATLAS_*, HF_TOKEN, provider API keys) are documented in .env.example.
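
The precedence implied by this section (per-call keyword argument first, then the value from .env, then the documented default) can be sketched as follows. The helper `resolve_setting` is hypothetical and only illustrates the lookup order. The defaults are copied from the table, and the package's actual resolution code may differ.

```python
import os
from typing import Any, Callable, Optional

# Defaults copied from the configuration table.
DEFAULTS = {
    "TOP_K_TOTAL": "25",
    "TOP_K_BERT": "10",
    "LLM_TEMPERATURE": "0.1",
}


def resolve_setting(
    name: str, override: Optional[Any] = None, cast: Callable[[str], Any] = str
) -> Any:
    """Hypothetical helper: per-call override > environment variable > default."""
    if override is not None:
        return override
    return cast(os.environ.get(name, DEFAULTS[name]))


os.environ["TOP_K_TOTAL"] = "40"         # as if set in .env
os.environ.pop("LLM_TEMPERATURE", None)  # not set, so the default applies
os.environ.pop("TOP_K_BERT", None)

top_k_total = resolve_setting("TOP_K_TOTAL", cast=int)            # from the environment
temperature = resolve_setting("LLM_TEMPERATURE", cast=float)      # from the defaults
top_k_bert = resolve_setting("TOP_K_BERT", override=5, cast=int)  # per-call override
print(top_k_total, temperature, top_k_bert)
```

Going by the table, the per-call form in the package is a keyword argument on classify_row(), for example top_k_total=40.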

Nice to have

  • Support NAICS: The core pipeline is taxonomy-agnostic in principle. Extending to NAICS would mainly require a new index. The econ-embeddings work is relevant here — embeddings trained across economic taxonomies would enable better cross-domain retrieval and reranking.
  • Batch classification: A classify_batch() that batches LLM calls for bulk runs.
  • Vector DB: FAISS works well at ~1,200 HS4 codes. A managed vector DB would only matter at much larger scale.

About

Refactored codebase for a tool that chains LLMs to build concordances from free-text strings to HS code descriptions.
