hs-classifier

Takes a product description string and returns the best-matching Harmonized System (HS) trade codes.

from hs_classifier import init_index, init_classifier, classify_row

init_index()                     # one-time: build FAISS index from Atlas DB
classifier = init_classifier()   # load FAISS index + S-BERT model

result = classify_row({"product_description": "organic bananas"}, classifier)
print(result)

{
  "codes": ["0803", "0806"],
  "descriptions": ["Bananas, including plantains, fresh or dried", "Grapes, fresh or dried"],
  "reason": "The product is explicitly bananas, matching HS 0803...",
  "search_terms": ["fresh fruit tropical", "banana plantain", ...],
  "detected_language": "en"
}

Installation

Requires Python 3.12+.

uv venv && source .venv/bin/activate
uv pip install "hs-classifier[google] @ git+https://github.com/karandaryanani/panjiva-hscode.git"   # uv venv does not put pip in the venv
cp .env.example .env   # fill in API keys, Atlas DB credentials, and model choices

Pick the extra that matches your LLM provider:

| Extra | Provider | API key env var |
| --- | --- | --- |
| `[anthropic]` | Anthropic | `ANTHROPIC_API_KEY` |
| `[google]` | Google Gemini | `GOOGLE_API_KEY` |
| `[openai]` | OpenAI | `OPENAI_API_KEY` |
| `[cohere]` | Cohere | `COHERE_API_KEY` |
| `[all]` | All of the above | n/a |

To run the example.ipynb notebook, add the notebook extra:

uv pip install "hs-classifier[google,notebook] @ git+https://github.com/karandaryanani/panjiva-hscode.git"

Quick start

The example.ipynb notebook walks through the full workflow: index setup, classification, eval sampling, labeling, evaluation, and tuning. Start there.

How it works

Classification

flowchart TD
    subgraph setup ["Setup (run once)"]
        DB[("Atlas DB")] --> EMB["S-BERT embeddings"]
        EMB --> IDX[("hs12_4_index.parquet")]
        DB --> CH[("hs2_chapters.parquet")]
    end

    subgraph classify ["classify_row()"]
        A["row dict"] --> BQ["build query + context"]
        BQ --> T["detect language"]
        T -->|not English| TR["translate"]
        T -->|English| ST
        TR --> ST["generate search terms\n(LLM + HS2 chapters)"]
        CH --> ST
        ST --> R["FAISS retrieval\n~25 candidates"]
        IDX --> R
        R --> RR["LLM reranking"]
        RR --> OUT["top N codes + reasoning"]
    end

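Language detection: Lingua detects the language of the input text, which is reported as detected_language in the output. Non-English text is translated to English with the translators package (Google backend) before search terms are generated.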
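
The detect-then-translate branch can be sketched as follows. This is an illustration of the control flow only, assuming language codes such as "en"; `to_english`, `fake_detect`, and `fake_translate` are hypothetical names, and the two stand-in functions take the place of the Lingua and translators calls so the sketch runs offline.

```python
from typing import Callable


def to_english(
    text: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str], str],
) -> tuple[str, str]:
    """Return (english_text, detected_language_code)."""
    lang = detect(text)
    if lang == "en":
        # English input passes through untouched.
        return text, lang
    return translate(text, lang), lang


# Stand-ins so the sketch runs without Lingua or network access.
def fake_detect(text: str) -> str:
    return "es" if "plátanos" in text else "en"


def fake_translate(text: str, source_lang: str) -> str:
    return {"plátanos orgánicos": "organic bananas"}.get(text, text)


print(to_english("plátanos orgánicos", fake_detect, fake_translate))
print(to_english("organic bananas", fake_detect, fake_translate))
```

Keeping the detector and translator as injected callables means the branch can be tested without a translation backend.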

Search term generation: The LLM receives the product string, shipping context, and the 97 HS2 chapter descriptions. It generates 5 to 8 search terms phrased in HS vocabulary, so that they land close to the indexed HS descriptions in the embedding space.
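
As a rough illustration of what the search term step hands to the LLM, the sketch below assembles the three inputs into one prompt. The function name `build_search_term_prompt` and the prompt wording are assumptions made for this example; the actual prompt used by the package may differ.

```python
def build_search_term_prompt(
    product: str,
    context: str,
    chapters: dict[str, str],
    min_terms: int = 5,
    max_terms: int = 8,
) -> str:
    """Combine product, shipping context, and HS2 chapters into one prompt."""
    # One line per chapter, ordered by chapter code.
    chapter_lines = "\n".join(
        f"{code}: {desc}" for code, desc in sorted(chapters.items())
    )
    return (
        f"Product description: {product}\n"
        f"Shipping context: {context}\n\n"
        f"HS2 chapters:\n{chapter_lines}\n\n"
        f"Write {min_terms} to {max_terms} search terms in HS vocabulary."
    )


# Two chapters only, to keep the example short.
chapters = {
    "08": "Edible fruit and nuts; peel of citrus fruit or melons",
    "07": "Edible vegetables and certain roots and tubers",
}
prompt = build_search_term_prompt("organic bananas", "reefer container", chapters)
print(prompt)
```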

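Retrieval: The original query and each generated term are embedded with S-BERT and searched against a FAISS index. Results are pooled and deduplicated into a shortlist of about 25 candidates (TOP_K_TOTAL), with 10 of those slots allocated to the original query by default (TOP_K_BERT).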
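
One way to pool and deduplicate the per-query hits is sketched below. It is an assumption about the mechanics, not the package's code: `pool_candidates` is a hypothetical helper, scores are taken to be similarities (higher is closer), and the raw query simply gets the first `top_k_bert` slots before the generated terms fill the rest.

```python
def pool_candidates(
    raw_hits: list[tuple[str, float]],
    term_hits: list[list[tuple[str, float]]],
    top_k_total: int = 25,
    top_k_bert: int = 10,
) -> list[tuple[str, float]]:
    """Merge ranked (code, score) lists into one deduplicated shortlist."""
    pooled: dict[str, float] = {}
    # 1) Reserve slots for the raw query (hits are assumed best-first).
    for code, score in raw_hits[:top_k_bert]:
        pooled[code] = max(score, pooled.get(code, float("-inf")))
    # 2) Fill the remaining slots from the generated terms, best score first.
    merged = sorted(
        (hit for hits in term_hits for hit in hits),
        key=lambda hit: hit[1],
        reverse=True,
    )
    for code, score in merged:
        if code in pooled:
            # Duplicate code: keep the better of the two scores.
            pooled[code] = max(pooled[code], score)
        elif len(pooled) < top_k_total:
            pooled[code] = score
    return sorted(pooled.items(), key=lambda item: item[1], reverse=True)


raw = [("0803", 0.9), ("0804", 0.5)]
terms = [[("0803", 0.95), ("0806", 0.7)], [("0810", 0.6), ("0806", 0.65)]]
print(pool_candidates(raw, terms, top_k_total=3, top_k_bert=1))
```

Keeping the best score per code is one of several reasonable merge rules; summing or averaging scores across queries would favor codes that many terms agree on.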

Reranking: The LLM receives the candidate shortlist, selects the top N codes, and returns them with a short justification (the reason field in the output).
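
Because the reranker is an LLM, it can be useful to check its answer against the shortlist it was given. The sketch below shows one such guard; it is an illustrative assumption rather than the package's behavior, and it uses a plain dataclass (`RerankResult`, a hypothetical name) instead of the Pydantic models mentioned in the project structure so that it runs with the standard library alone.

```python
from dataclasses import dataclass


@dataclass
class RerankResult:
    codes: list[str]
    reason: str


def validate_rerank(
    result: RerankResult, candidates: list[str], top_n: int = 2
) -> RerankResult:
    """Drop codes that were not on the shortlist, remove repeats, cut to top_n."""
    allowed = set(candidates)
    kept: list[str] = []
    for code in result.codes:
        # Preserve the model's ordering while filtering.
        if code in allowed and code not in kept:
            kept.append(code)
    return RerankResult(codes=kept[:top_n], reason=result.reason)


llm_answer = RerankResult(
    codes=["0803", "9999", "0803", "0806", "0810"],
    reason="The product is explicitly bananas.",
)
checked = validate_rerank(llm_answer, ["0803", "0806", "0810"], top_n=2)
print(checked.codes)
```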

Eval sampling

The eval splitter produces a representative sample for labeling and evaluation. The approach follows Dell (2025), who argues that embedding-based stratified sampling avoids two common pitfalls: keyword-based sampling, which fails to place positive probability on all instances; and active learning, which undersamples rare classes under severe class imbalance.

flowchart LR
    A["Descriptions"] --> B["S-BERT"] --> C["UMAP"] --> D["HDBSCAN"] --> E["Stratified\nsample"]

The result is a sample that covers the full diversity of your data, including rare product types that keyword filters or random sampling would miss.
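
The last stage of the pipeline above, drawing a stratified sample from cluster labels, can be sketched with the standard library. This is a minimal sketch under stated assumptions: `stratified_sample` is a hypothetical helper, the labels would come from the clustering step, the noise label -1 is treated as its own stratum, and every cluster contributes at least one row.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels: list[int], sample_frac: float, seed: int = 0) -> list[int]:
    """Return sorted row indices, sampling sample_frac of each cluster."""
    rng = random.Random(seed)
    by_cluster: dict[int, list[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)
    chosen: list[int] = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # At least one row per cluster, never more than the cluster holds.
        k = min(len(members), max(1, math.ceil(len(members) * sample_frac)))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Eight rows in cluster 0, two in cluster 1, one noise point.
labels = [0] * 8 + [1] * 2 + [-1]
sample = stratified_sample(labels, sample_frac=0.25)
print(sample)
```

The per-cluster minimum is what keeps rare product types in the sample; plain random sampling at the same fraction would often skip the two smallest groups here.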

Developer

Project structure

example.ipynb             # Full walkthrough: classify, split, evaluate
run_init.py               # One-time setup: build lookup index from Atlas DB
run_pipeline.py           # CLI: classify a single row (--row_index, --csv_path)
run_splitter.py           # CLI: generate eval sample (--csv_path, --sample_frac)

hs_classifier/
├── __init__.py           # init_index(), init_classifier(), classify_row()
├── init_lookup_index.py  # DB connection, S-BERT encoding, save index parquet
├── build_query.py        # Build one classifier query from one raw row
├── translator.py         # Lingua language detection + Google translation backend
├── search_terms.py       # LLM search term generation (Instructor + Pydantic)
├── retrieval.py          # Load index parquet, FAISS search, aggregate and deduplicate
├── reranker.py           # LLM reranking of candidates (Instructor + Pydantic)
├── splitter.py           # S-BERT + UMAP + HDBSCAN clustering, stratified sampling
└── evaluator.py          # Classification metrics with readable counts + summary

data/
├── raw/                  # Sample CSV data
└── intermediate/         # Parquet artifacts + splitter outputs under samples/

Configuration

All configuration lives in .env (see .env.example for annotated defaults). Model and retrieval parameters can also be overridden per call via classify_row() keyword arguments.

| Variable | Role | Per-call override | Default |
| --- | --- | --- | --- |
| `EMBEDDING_MODEL` | S-BERT model for encoding descriptions and queries | none (rebuild index) | `dell-research-harvard/lt-un-data-fine-fine-en` |
| `SEARCH_TERM_MODEL` | LLM that generates search terms | `search_term_model=` | `google/gemini-2.5-flash-lite` |
| `RERANKER_MODEL` | LLM that picks the top N codes from candidates | `reranker_model=` | `google/gemini-2.5-flash-lite` |
| `TOP_K_TOTAL` | Total FAISS candidates retrieved | `top_k_total=` | 25 |
| `TOP_K_BERT` | Candidates allocated to the raw query | `top_k_bert=` | 10 |
| `LLM_TEMPERATURE` | Temperature for LLM calls | `temperature=` | 0.1 |
| `INTERMEDIATE_DATA_DIR` | Directory for parquet artifacts | `intermediate_data_dir=` | `data/intermediate` |

Database and credential variables (ATLAS_*, HF_TOKEN, provider API keys) are documented in .env.example.
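
The precedence implied by the configuration table (per-call override first, then the environment, then the default) can be sketched as below. `DEFAULTS` and `resolve` are hypothetical names for this illustration, and falling back to the table's default when a variable is unset is an assumption about the package's behavior.

```python
import os

# Defaults copied from the configuration table, stored as strings like env values.
DEFAULTS = {
    "SEARCH_TERM_MODEL": "google/gemini-2.5-flash-lite",
    "RERANKER_MODEL": "google/gemini-2.5-flash-lite",
    "TOP_K_TOTAL": "25",
    "TOP_K_BERT": "10",
    "LLM_TEMPERATURE": "0.1",
}


def resolve(name: str, override=None, env=None, cast=str):
    """A per-call override beats the environment, which beats the default."""
    if override is not None:
        return override
    env = os.environ if env is None else env
    return cast(env.get(name, DEFAULTS[name]))


print(resolve("TOP_K_TOTAL", env={}, cast=int))
print(resolve("TOP_K_TOTAL", env={"TOP_K_TOTAL": "40"}, cast=int))
print(resolve("TOP_K_TOTAL", override=5, env={"TOP_K_TOTAL": "40"}, cast=int))
```

In the package itself the override is passed as a keyword argument to classify_row(), for example `top_k_total=` or `temperature=` as listed in the table.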

Nice to have

Support NAICS: The core pipeline is taxonomy-agnostic in principle. Extending to NAICS would mainly require a new index. The econ-embeddings work is relevant here — embeddings trained across economic taxonomies would enable better cross-domain retrieval and reranking.
Batch classification: A classify_batch() that batches LLM calls for bulk runs.
Vector DB: FAISS works well at ~1,200 HS4 codes. A managed vector DB would only matter at much larger scale.
