Harrier-SwissLaw-Retrieval

Cross-lingual legal information retrieval for Swiss law. Given an English legal question, retrieve the most relevant German-language citation strings (statutes & court decisions) from a corpus of ~170K laws and ~2.4M court considerations.

Method

Harrier-27B (#1 on multilingual MTEB v2) is domain-adapted via two-stage LoRA fine-tuning, then combined with sparse retrieval and a fine-tuned reranker in a hybrid pipeline.
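The two fine-tuning stages differ mainly in where the negatives come from. The following is a minimal pure-Python sketch of an InfoNCE-style contrastive loss that covers both cases. It is not the repository's training code: the function name `info_nce`, the temperature value, the assumption that embeddings are L2-normalized, and the choice to keep in-batch negatives alongside hard negatives in stage 2 are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, positives, hard_negatives=None, temperature=0.05):
    """Mean InfoNCE loss; the positive for queries[i] is positives[i].

    Stage 1 (hard_negatives is None): the other rows' positives are the negatives.
    Stage 2: hard_negatives[i] is a list of extra negative vectors for query i
    (assumed here to be used in addition to the in-batch negatives).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in positives]
        if hard_negatives is not None:
            logits += [dot(q, n) / temperature for n in hard_negatives[i]]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(queries)
```

Adding hard negatives can only enlarge the softmax denominator, so for the same embeddings the stage 2 loss is never lower than the stage 1 loss. That is the sense in which stage 2 is the harder objective.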

Pipeline

  1. Hard negative mining: TopK-PercPos (from NV-Retriever). TF-IDF retrieves the top-50 candidates per query; candidates scoring below 95% of the positive's score are kept as hard negatives (5 per query).
  2. Two-stage Harrier-27B LoRA fine-tuning: Stage 1 trains with in-batch negatives, Stage 2 trains with hard negatives (k=3). Setup: 4-bit NF4 quantization + LoRA (rank=64, 454M trainable params) + DDP on 6× RTX 5090.
  3. BGE-Reranker fine-tuning: full fine-tune of BAAI/bge-reranker-v2-m3 (570M) on the same training pairs.
  4. Hybrid retrieval + RRF fusion: dense (fine-tuned Harrier, top-100) + sparse (TF-IDF, laws top-100 + court considerations top-100) → Reciprocal Rank Fusion → top-60.
  5. Reranking: the fine-tuned reranker re-scores the fused candidates, then a dynamic cutoff based on the score gap selects the final citation set.
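Steps 4 and 5 can be illustrated with a small sketch. Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over the ranked lists it appears in. The constant `k=60` is the conventional default, not a value confirmed by this repository. The README does not spell out the cutoff rule, so `gap_cutoff` below uses one plausible reading (cut at the largest drop between consecutive reranker scores); the function names, `min_keep`, and `max_keep` are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


def gap_cutoff(scored, min_keep=1, max_keep=20):
    """Keep top reranked items, cutting at the largest score drop.

    `scored` is a list of (doc_id, reranker_score) pairs. This rule is an
    assumption; it always drops at least one item when there are more than
    `min_keep` candidates.
    """
    ranked = sorted(scored, key=lambda pair: -pair[1])
    if len(ranked) <= min_keep:
        return [doc for doc, _ in ranked]
    last_cut = min(max_keep, len(ranked) - 1)
    best_cut, best_gap = min_keep, float("-inf")
    for i in range(min_keep, last_cut + 1):
        gap = ranked[i - 1][1] - ranked[i][1]  # drop just after the i-th item
        if gap > best_gap:
            best_gap, best_cut = gap, i
    return [doc for doc, _ in ranked[:best_cut]]
```

In the pipeline described above, `rankings` would hold three lists (dense, sparse laws, sparse court considerations), and the fused top-60 would be scored by the reranker before `gap_cutoff` is applied.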

Models

| Model | Description |
| --- | --- |
| microsoft/harrier-oss-v1 (27B) | LoRA fine-tuned in two stages, 4-bit NF4 |
| BAAI/bge-reranker-v2-m3 (570M) | Full fine-tune, legal domain |
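The 4-bit NF4 + LoRA setup for the embedding model could be configured roughly as follows with `transformers` and `peft`. Only the NF4 quantization type and the rank of 64 come from this README. The compute dtype, `lora_alpha`, `lora_dropout`, and `target_modules` values are assumptions chosen for illustration, and may not reproduce the reported 454M trainable parameters.

```python
import torch
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights (NF4 is from the README).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute
)

# LoRA adapter configuration; only r=64 is stated in the README.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,               # assumption
    lora_dropout=0.05,            # assumption
    target_modules="all-linear",  # assumption: adapt every linear layer
    task_type=TaskType.FEATURE_EXTRACTION,
)
```

These objects would typically be passed to `AutoModel.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.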

Setup

# Clone
git clone https://github.com/tensor2023/Harrier-SwissLaw-Retrieval.git
cd Harrier-SwissLaw-Retrieval

# Install
pip install -r Omnilex-Agentic-Retrieval-Competition-main/requirements.txt
pip install transformers peft bitsandbytes accelerate faiss-gpu sentence-transformers

# Download data (Kaggle competition)
# Place laws_de.csv, court_considerations.csv, train.csv in data/

Usage

# 1. Build training data with hard negatives
python scripts/01_build_training_data.py

# 2. Fine-tune Harrier-27B (two-stage)
python scripts/02_finetune_harrier.py    # ~8h on 6× RTX 5090

# 3. Fine-tune BGE Reranker
python scripts/06_finetune_reranker.py   # ~1h

# 4. Build FAISS index for laws (~170K)
python scripts/03_build_index.py

# 5. Run inference
python scripts/07_inference_v3.py        # generates submission.csv
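The hard-negative filter applied when building training data (usage step 1) can be sketched in a few lines. This is an illustration of the TopK-PercPos rule as described in the pipeline, not the contents of `scripts/01_build_training_data.py`. The function name, its signature, and the use of a single positive score per query are assumptions.

```python
def select_hard_negatives(candidates, positive_ids, pos_score,
                          top_k=50, perc_pos=0.95, n_neg=5):
    """Pick hard negatives from retrieval candidates (TopK-PercPos style).

    candidates:   list of (doc_id, score) pairs for one query, e.g. TF-IDF scores
    positive_ids: set of doc ids that are labeled relevant for the query
    pos_score:    retrieval score of the positive document
    """
    # Only the top_k highest-scoring candidates are considered.
    ranked = sorted(candidates, key=lambda pair: -pair[1])[:top_k]
    # Candidates scoring too close to (or above) the positive are skipped,
    # since they are more likely to be unlabeled positives.
    threshold = perc_pos * pos_score
    negatives = [doc for doc, score in ranked
                 if doc not in positive_ids and score < threshold]
    return negatives[:n_neg]
```

For example, with a positive scored at 0.80 the threshold is 0.76, so a candidate at 0.79 is discarded as a likely false negative while candidates at 0.70 and 0.40 are kept.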

Leaderboard

Omnilex Agentic Retrieval Competition: scored by Macro F1 on a hidden test set.
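For local validation, a Macro F1 over citation sets can be approximated as below. The exact competition metric is not defined in this README, so this sketch assumes a per-query F1 between the predicted and gold citation sets, averaged over queries. The function names and the handling of empty sets are assumptions.

```python
def citation_f1(predicted, gold):
    """F1 between the predicted and gold citation sets for one query."""
    pred, true = set(predicted), set(gold)
    if not pred and not true:
        return 1.0  # assumption: two empty sets count as a perfect match
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions, references):
    """Average per-query F1; both arguments map query id -> list of citations."""
    scores = [citation_f1(predictions.get(qid, []), gold)
              for qid, gold in references.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Because each query contributes equally to the average, over-predicting citations on one query lowers that query's precision without affecting the others. This is one motivation for a score-based cutoff rather than a fixed number of citations.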
