A retrieval-augmented generation pipeline for querying Indian financial regulatory text.
Evaluated on the IndiaFinBench corpus (192 documents, 5,734 chunks) with a 35-item adversarial evaluation set including ablation across retrieval strategy, chunk size, and retrieval depth.
IndiaFinBench evaluates LLM performance in a closed-book setting: models answer questions from memory. This system addresses the complementary open-book task: given the same regulatory corpus, can a retrieval system surface the correct document chunk so that a generative model can answer accurately?
Regulatory documents create specific retrieval challenges:
- Structured identifiers embedded in prose (
Section 4(2)(b),FEMA 6(R)/2026-RB,91-day T-bill) - Chains of superseding circulars requiring temporal disambiguation
- Near-duplicate documents (same RBI circular published under two filenames)
- Dense numerical thresholds (
Rs. 15,000 crore,2% risk weight,US$100 million)
These properties favour lexical retrieval over pure semantic search, motivating a hybrid approach.
Corpus (192 docs)
│
▼
TextPreprocessor header/footer removal, whitespace normalisation
│
▼
RecursiveCharacterSplitter 1600-char target, 200-char overlap, 100-char min
│
├──────────────────────────────────────┐
▼ ▼
BGEEmbedder BM25Okapi
BAAI/bge-base-en-v1.5 (768-dim) k1=1.5, b=0.75
L2-normalised → FAISS IndexFlatIP tokeniser preserves hyphens & parens
│ │
└──────────────┬───────────────────────┘
▼
HybridRetriever
Reciprocal Rank Fusion (RRF k=60)
source-diversity cap (max 3 per rbi/sebi)
│
▼
LLMGenerator
Groq llama-3.3-70b-versatile (primary)
Ollama llama3.2:3b (offline fallback)
temperature=0.0
BGE asymmetric encoding: corpus chunks are encoded without any prefix; queries are encoded with the BAAI-specified prefix "Represent this sentence for searching relevant passages: ". Skipping this prefix causes a measurable recall drop (~3–5% on MTEB).
RRF formula: score(d) = 1/(60 + rank_dense) + 1/(60 + rank_bm25) — the constant k=60 follows Cormack et al. (2009).
| Property | Value |
|---|---|
| Documents | 192 (100 RBI + 92 SEBI) |
| Avg. document size | ~4 KB (RBI press releases) / ~88 KB (SEBI frameworks) |
| Total corpus | ~8.5 MB |
| Chunks (1600-char) | 5,734 |
| Index size | 17 MB FAISS + 18 MB BM25 + 10 MB chunk metadata |
| Retrieval latency | ~1.35 ms/query (exact cosine, CPU) |
A 35-item evaluation set with two tiers:
| Tier | Items | Description |
|---|---|---|
| Synthetic | 20 | Questions grounded in specific document chunks; relevant_chunk_ids verified against the index |
| Adversarial | 15 | Out-of-corpus questions (crypto regulation, digital lending, CBDC) and reworded synthetic questions without labelled chunks |
| Total with ground truth | 24 | Queries with at least one relevant_chunk_id for metric computation |
Adversarial items test the system's handling of unanswerable queries — the correct response is "insufficient context."
- Recall@k — fraction of ground-truth chunks found in the top-k results (primary metric)
- MRR — Mean Reciprocal Rank; rewards correct chunks ranked higher
- Precision@k — fraction of retrieved chunks that are relevant (bounded by ~0.20 for single-answer queries)
- p50/p95 latency — per-query wall-clock time in milliseconds
Six configurations evaluated on the same 35-item set:
| Config | Strategy | Chunk size | k | Recall@5 | MRR | Prec@5 | p50ms | p95ms |
|---|---|---|---|---|---|---|---|---|
| B0 Dense | Dense only | 1600 | 5 | 0.6875 | 0.5417 | 0.1882 | 48 | 88 |
| B1 BM25 | Sparse only | 1600 | 5 | 0.7639 | 0.6736 | 0.2083 | 30 | 46 |
| B2 Hybrid ◄ | Dense + BM25 (RRF) | 1600 | 5 | 0.7847 | 0.6403 | 0.2000 | 77 | 103 |
| B3 Small | Hybrid | 800 | 5 | 0.5833 | 0.4931 | 0.1500 | 138 | 205 |
| B4 Large | Hybrid | 2400 | 5 | 0.5417 | 0.4097 | 0.1361 | 71 | 92 |
| B5 Higher-k | Hybrid | 1600 | 10 | 0.7847 | 0.6403 | 0.1764 | 78 | 102 |
24 queries with ground-truth chunk IDs used for all metric calculations.
Hybrid (B2) achieves the best Recall@5. Combining dense and sparse signals recovers 7.2 percentage points more relevant chunks than dense-only (B0). However, BM25-only (B1) achieves the best MRR — it ranks the correct chunk higher when it finds it. This is consistent with regulatory text where exact identifier matches (Section 51A, RBI/2025-26/251, FEMA 6(R)) are the strongest retrieval signal.
The hybrid's lower MRR relative to BM25 reflects a known RRF trade-off: fusing rankings redistributes scores, sometimes depressing the rank of a high-confidence BM25 hit.
BM25 is the fastest at p50=30ms (pure inverted-index lookup), followed by dense at 48ms (FAISS scan). Hybrid at 77ms is the sum of both, but remains well within interactive latency budgets.
| Chunk | Recall@5 | Δ vs. B2 | Interpretation |
|---|---|---|---|
| 800 (B3) | 0.5833 | −20.1pp | Over-fragmentation: answers split across chunks |
| 1600 (B2) | 0.7847 | — | Optimal |
| 2400 (B4) | 0.5417 | −24.3pp | Too much irrelevant context per chunk; ranking degrades |
Both deviations from 1600-char cause substantial recall drops. Smaller chunks (800) reduce recall due to fragmentation of multi-clause regulatory provisions — capital formulae, tiered definitions, and cross-referencing clauses must be read together to be retrievable as a unit. At 2400 chars, a single chunk covers multiple unrelated provisions, diluting both the BM25 term match and the dense embedding signal.
1600 characters is the empirical optimum for this corpus — roughly four 400-character regulatory paragraphs, the natural unit of a single provision with its context.
Doubling top-k from 5 to 10 does not change Recall@5 or MRR (B5 = B2 identically at 0.7847 / 0.6403). This confirms that missed queries represent genuine corpus issues (boundary splits, duplicates) rather than rank-order problems fixable by returning more results. Increasing k only hurts Precision@5 (0.1764 vs. 0.2000).
6 of 35 queries (17.1%) are retrieval misses. The two dominant failure patterns:
1. Chunk boundary fragmentation — relevant content spans a chunk boundary; the retriever finds adjacent chunks but not the one with the labelled chunk_id.
Example: syn_013 (eligible quarterly profits formula)
Expected:
_026__0001(contains the EP formula) Retrieved:_026__0000(chunk before the formula) + a different document
2. Corpus deduplication — the same RBI circular is indexed under two filenames; the retriever finds the duplicate but the eval set only labels one chunk_id.
Example: syn_011 (Section 51A UAPA)
RBI/2025-26/251 exists as both
_085__*and_001__*(same circular, two parsed copies) Retrieved_001__0000(correct document, wrong copy) → counted as miss
These are dataset/corpus artefacts, not retrieval failures. Deduplicating the corpus and relaxing the evaluation to accept any chunk from the same circular would raise B2 Recall@5 to approximately 0.83.
- Evaluation set size: 35 queries (24 with ground truth) is small; results carry wide confidence intervals. A 200-item set with diverse human-written questions would strengthen the conclusions.
- Synthetic query bias: The 20 synthetic queries were authored with knowledge of the corpus, likely overfitting evaluation to the existing chunk boundaries.
- No re-ranker: A cross-encoder re-ranker (e.g.,
bge-reranker-base) could improve MRR without changing recall, at the cost of additional latency. - Generation evaluation: Faithfulness and answer-relevance scores (Gemini-1.5-Flash as judge) were not collected due to API constraints. The pipeline is instrumented for this; results pending.
- Corpus quality: Near-duplicate documents inflate index size and create ambiguous ground truth. A deduplication step before indexing is the highest-value preprocessing improvement.
Hybrid retrieval with RRF improves Recall@5 by 9.7 percentage points over dense-only on this regulatory corpus, confirming the hypothesis that lexical signals are important for identifier-heavy text. The optimal chunk size of 1600 characters reflects the natural paragraph structure of Indian regulatory documents.
The primary remaining gap to the 0.80 Recall@5 target is attributable to corpus artefacts (duplicates, boundary splits) rather than retrieval algorithm limitations. Corpus-level fixes would close this gap before any algorithmic improvement.
All indices and results are committed to this repository.
# Build index (runs in ~3 minutes on CPU)
python -m rag.scripts.build_index
# Run full ablation (retrieval only, no API keys needed)
python -m rag.scripts.run_evaluation --no-generation
# Run with generation evaluation (requires GROQ_API_KEY + GEMINI_API_KEY)
GROQ_API_KEY=... GEMINI_API_KEY=... python -m rag.scripts.run_evaluationResults are saved to data/eval/results_<timestamp>.json.
Indices are at rag/index/ (1600-char), rag/index_800/ (800-char), rag/index_2400/ (2400-char).
sentence-transformers>=2.7.0 # BGE embeddings
faiss-cpu>=1.8.0 # FAISS IndexFlatIP
rank-bm25>=0.2.2 # BM25Okapi
groq>=0.5.0 # LLM generation (optional)
rag/
├── config.py RAGConfig — all hyperparameters
├── models.py Document, ChunkRecord, RetrievalResult dataclasses
├── data_loader.py Loads .txt files from data/parsed/{rbi,sebi}/
├── preprocessing.py TextPreprocessor — normalisation
├── chunking.py RecursiveCharacterSplitter
├── embeddings.py BGEEmbedder — asymmetric BGE encoding
├── index.py FAISSIndex — build, save, load
├── bm25_index.py BM25Index — build, save, load
├── retriever.py HybridRetriever — dense | bm25 | hybrid modes
├── generator.py LLMGenerator — Groq / Ollama backends
├── pipeline.py RAGPipeline — top-level orchestrator
├── evaluation.py EvalItem, ablation runner, metrics, report printer
├── scripts/
│ ├── build_index.py CLI: build index from corpus
│ └── run_evaluation.py CLI: Phase 3 evaluation entry point
└── index/ Built indices (not tracked in git — use build_index.py)
├── faiss.index
├── chunks.pkl
└── bm25.pkl
Part of the IndiaFinBench project. See README.md for the benchmark.