Cross-lingual legal information retrieval for Swiss law. Given an English legal question, retrieve the most relevant German-language citation strings (statutes & court decisions) from a corpus of ~170K laws and ~2.4M court considerations.
Harrier-27B (multilingual MTEB v2 #1) domain-adapted via two-stage LoRA fine-tuning, combined with sparse retrieval and a fine-tuned reranker in a hybrid pipeline.
- Hard negative mining — TopK-PercPos (NV-Retriever): TF-IDF retrieves the top-50 candidates; those scoring below 95% of the positive's score are kept as hard negatives (5 per query)
- Two-stage Harrier-27B LoRA — Stage 1: in-batch negatives → Stage 2: hard negatives (k=3). 4-bit NF4 + LoRA (rank=64, 454M trainable params) + DDP on 6× RTX 5090
- BGE-Reranker fine-tuning — Full fine-tune of BAAI/bge-reranker-v2-m3 (570M) on the same training pairs
- Hybrid retrieval + RRF fusion — Dense (fine-tuned Harrier top-100) + Sparse (TF-IDF laws top-100 + court top-100) → Reciprocal Rank Fusion → top-60
- Reranker re-rank → dynamic cutoff based on score gap → final citation set
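
The hard negative rule in the first bullet can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `scripts/01_build_training_data.py`: the function name `mine_hard_negatives`, its parameters, and the placeholder document IDs are all hypothetical, and the positive's score is assumed to come from the same TF-IDF scorer that ranked the candidates.

```python
from typing import List, Sequence, Set, Tuple


def mine_hard_negatives(
    candidates: Sequence[Tuple[str, float]],
    positive_ids: Set[str],
    positive_score: float,
    perc_pos: float = 0.95,   # keep only candidates scoring below 95% of the positive
    num_negatives: int = 5,   # 5 hard negatives per query, as in the list above
) -> List[str]:
    """Select hard negatives from candidates sorted by score, best first."""
    threshold = perc_pos * positive_score
    negatives: List[str] = []
    for doc_id, score in candidates:
        if doc_id in positive_ids:
            continue  # a gold citation is never used as a negative
        if score >= threshold:
            continue  # too close to the positive, so possibly an unlabeled positive
        negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives


# Placeholder IDs and scores, for illustration only.
candidates = [("doc_a", 0.82), ("doc_b", 0.80), ("doc_c", 0.74),
              ("doc_d", 0.60), ("doc_e", 0.55)]
negatives = mine_hard_negatives(candidates, {"doc_b"}, positive_score=0.80,
                                num_negatives=2)
print(negatives)
```

Filtering by a fraction of the positive's score, rather than simply taking the top-ranked non-positives, is meant to reduce the chance of training against false negatives.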
| Model | Description |
|---|---|
| microsoft/harrier-oss-v1 (27B) | LoRA fine-tuned 2-stage, 4-bit NF4 |
| BAAI/bge-reranker-v2-m3 (570M) | Full fine-tune, legal domain |
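
At inference time the two models are combined with sparse retrieval: ranked lists are fused with Reciprocal Rank Fusion, and the reranker's scores are then cut off at a score gap. The sketch below illustrates both steps under stated assumptions. The RRF constant `k=60` is the conventional default and is not confirmed by this repository (the "top-60" above refers to the fused list size, here `top_n`). The "largest gap" rule, the `min_keep`/`max_keep` bounds, and all function names are hypothetical stand-ins for whatever `scripts/07_inference_v3.py` actually does.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


def cut_at_largest_gap(
    scored: Sequence[Tuple[str, float]], min_keep: int = 1, max_keep: int = 20
) -> List[str]:
    """Keep the top documents up to the largest drop between adjacent scores."""
    ranked = sorted(scored, key=lambda pair: -pair[1])
    if len(ranked) <= min_keep:
        return [doc_id for doc_id, _ in ranked]
    limit = min(max_keep, len(ranked) - 1)
    best_cut, best_gap = min_keep, -1.0
    for i in range(min_keep, limit + 1):
        gap = ranked[i - 1][1] - ranked[i][1]  # drop between item i-1 and item i
        if gap > best_gap:
            best_cut, best_gap = i, gap
    return [doc_id for doc_id, _ in ranked[:best_cut]]


# Placeholder IDs and scores, for illustration only.
dense = ["a", "b", "c"]
sparse_laws = ["b", "d"]
sparse_court = ["c", "b"]
fused = rrf_fuse([dense, sparse_laws, sparse_court])

reranked = [("b", 0.93), ("c", 0.88), ("a", 0.41), ("d", 0.35)]
final = cut_at_largest_gap(reranked)
print(fused, final)
```

RRF uses only ranks, so the dense and TF-IDF scores never need to be put on a common scale. A gap-based cutoff lets the number of returned citations vary per query instead of fixing it in advance.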
# Clone
git clone https://github.com/tensor2023/Harrier-SwissLaw-Retrieval.git
cd Harrier-SwissLaw-Retrieval
# Install
pip install -r Omnilex-Agentic-Retrieval-Competition-main/requirements.txt
pip install transformers peft bitsandbytes accelerate faiss-gpu sentence-transformers
# Download data (Kaggle competition)
# Place laws_de.csv, court_considerations.csv, train.csv in data/

# 1. Build training data with hard negatives
python scripts/01_build_training_data.py
# 2. Fine-tune Harrier-27B (two-stage)
python scripts/02_finetune_harrier.py # ~8h on 6× RTX 5090
# 3. Fine-tune BGE Reranker
python scripts/06_finetune_reranker.py # ~1h
# 4. Build FAISS index for laws (~170K)
python scripts/03_build_index.py
# 5. Run inference
python scripts/07_inference_v3.py # generates submission.csv

Omnilex Agentic Retrieval Competition: Macro F1 on hidden test set.
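
For local validation, a Macro F1 over citation sets can be approximated as below. This is a sketch under assumptions: it treats Macro F1 as the mean of per-query F1 between predicted and gold citation sets, and scores an empty prediction against an empty gold set as 1.0. The competition's exact definition is not stated here and may differ, and the function names are hypothetical.

```python
from typing import Iterable, Sequence


def citation_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between one query's predicted and gold citation sets."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: nothing to find and nothing predicted counts as correct
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions: Sequence[Iterable[str]], golds: Sequence[Iterable[str]]) -> float:
    """Average the per-query F1 over all queries."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        return 0.0
    per_query = [citation_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(per_query) / len(per_query)


# Placeholder citation IDs, for illustration only.
predictions = [["a", "b"], ["c"]]
golds = [["a"], ["c", "d"]]
score = macro_f1(predictions, golds)
print(round(score, 4))
```

Under this definition, both over-predicting and under-predicting citations for a query lower its F1, which is why the size of the final citation set matters.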