Hybrid RAG System for IndiaFinBench

A retrieval-augmented generation pipeline for querying Indian financial regulatory text.

Evaluated on the IndiaFinBench corpus (192 documents, 5,734 chunks) with a 35-item evaluation set (20 synthetic, 15 adversarial) and an ablation across retrieval strategy, chunk size, and retrieval depth.


1. Motivation

IndiaFinBench evaluates LLM performance in a closed-book setting: models answer questions from memory. This system addresses the complementary open-book task: given the same regulatory corpus, can a retrieval system surface the correct document chunk so that a generative model can answer accurately?

Regulatory documents create specific retrieval challenges:

  • Structured identifiers embedded in prose (Section 4(2)(b), FEMA 6(R)/2026-RB, 91-day T-bill)
  • Chains of superseding circulars requiring temporal disambiguation
  • Near-duplicate documents (same RBI circular published under two filenames)
  • Dense numerical thresholds (Rs. 15,000 crore, 2% risk weight, US$100 million)

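These properties reward exact lexical matching, which dense semantic search alone handles poorly, while paraphrased questions still benefit from semantic similarity. This motivates a hybrid approach.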


2. System Architecture

Corpus (192 docs)
      │
      ▼
 TextPreprocessor          header/footer removal, whitespace normalisation
      │
      ▼
RecursiveCharacterSplitter  1600-char target, 200-char overlap, 100-char min
      │
      ├──────────────────────────────────────┐
      ▼                                      ▼
BGEEmbedder                            BM25Okapi
BAAI/bge-base-en-v1.5 (768-dim)       k1=1.5, b=0.75
L2-normalised → FAISS IndexFlatIP     tokeniser preserves hyphens & parens
      │                                      │
      └──────────────┬───────────────────────┘
                     ▼
             HybridRetriever
        Reciprocal Rank Fusion (RRF k=60)
        source-diversity cap (max 3 per rbi/sebi)
                     │
                     ▼
             LLMGenerator
        Groq llama-3.3-70b-versatile (primary)
        Ollama llama3.2:3b (offline fallback)
        temperature=0.0
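
The diagram notes that the BM25 tokeniser preserves hyphens and parentheses. A minimal sketch of what such a tokeniser could look like, assuming a regex-based approach: the pattern, the function name `tokenize`, and the decision to also keep slashes are illustrative assumptions, not the project's actual code.

```python
import re

# Hypothetical pattern: a token starts with letters/digits and may continue with
# "-" or "/" joined segments or parenthesised groups, so identifiers such as
# "4(2)(b)" or "6(R)/2026-RB" survive as single tokens.
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-/][A-Za-z0-9]+|\([A-Za-z0-9]+\))*")


def tokenize(text: str) -> list[str]:
    """Return lower-cased tokens, keeping structured identifiers intact."""
    return [match.group(0).lower() for match in _TOKEN.finditer(text)]


tokens = tokenize("Section 4(2)(b) of FEMA 6(R)/2026-RB and the 91-day T-bill")
```

A whitespace-and-punctuation tokeniser would instead break `6(R)/2026-RB` into fragments such as `6`, `r`, `2026`, `rb`, which match many unrelated chunks.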

BGE asymmetric encoding: corpus chunks are encoded without any prefix; queries are encoded with the BAAI-specified prefix "Represent this sentence for searching relevant passages: ". Skipping this prefix causes a measurable recall drop (~3–5% on MTEB).
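
The asymmetry can be sketched without loading the model. The helper names below (`prepare_query`, `prepare_chunk`, `l2_normalise`, `inner_product`) are hypothetical, and the vectors are hand-written stand-ins for real embeddings. The sketch also shows why an inner-product index over L2-normalised vectors yields cosine similarity.

```python
import math

# Prefix quoted in the paragraph above; it is applied to queries only.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_query(text: str) -> str:
    """Queries are encoded with the retrieval instruction prepended."""
    return QUERY_PREFIX + text


def prepare_chunk(text: str) -> str:
    """Corpus chunks are encoded as-is, with no prefix."""
    return text


def l2_normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d vectors standing in for 768-dim embeddings.
q = l2_normalise([1.0, 1.0])
c = l2_normalise([1.0, 0.0])
similarity = inner_product(q, c)  # equals the cosine of the angle between q and c
```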

RRF formula: score(d) = 1/(60 + rank_dense) + 1/(60 + rank_bm25) — the constant k=60 follows Cormack et al. (2009).
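
A minimal sketch of the fusion step, assuming each retriever returns a best-first list of chunk ids. The function name `rrf_fuse`, the tie-break rule, and the handling of chunks that appear in only one list (they simply receive no contribution from the other) are assumptions. The source-diversity cap from the diagram is not modelled.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse best-first rankings of chunk ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical chunk ids.
dense_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c3", "c1", "c4"]
fused = rrf_fuse([dense_ranking, bm25_ranking])
```

Because only ranks enter the formula, the raw BM25 and cosine scores never need to be put on a common scale.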


3. Corpus

| Property | Value |
| --- | --- |
| Documents | 192 (100 RBI + 92 SEBI) |
| Avg. document size | ~4 KB (RBI press releases) / ~88 KB (SEBI frameworks) |
| Total corpus | ~8.5 MB |
| Chunks (1600-char) | 5,734 |
| Index size | 17 MB FAISS + 18 MB BM25 + 10 MB chunk metadata |
| Retrieval latency | ~1.35 ms/query (exact cosine, CPU) |

4. Evaluation Design

4.1 Evaluation Set

A 35-item evaluation set with two tiers:

| Tier | Items | Description |
| --- | --- | --- |
| Synthetic | 20 | Questions grounded in specific document chunks; relevant_chunk_ids verified against the index |
| Adversarial | 15 | Out-of-corpus questions (crypto regulation, digital lending, CBDC) and reworded synthetic questions without labelled chunks |
| Total with ground truth | 24 | Queries with at least one relevant_chunk_id for metric computation |

The out-of-corpus adversarial items test the system's handling of unanswerable queries: the correct response is "insufficient context."

4.2 Metrics

  • Recall@k — fraction of ground-truth chunks found in the top-k results (primary metric)
  • MRR — Mean Reciprocal Rank; rewards correct chunks ranked higher
  • Precision@k — fraction of retrieved chunks that are relevant (bounded by ~0.20 for single-answer queries)
  • p50/p95 latency — per-query wall-clock time in milliseconds
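
The retrieval metrics above can be sketched as follows. This is an illustrative implementation, not the code in evaluation.py: the function names are assumptions, precision is taken over the chunks actually returned in the top-k, and queries without labelled chunks are assumed to be excluded before averaging.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of ground-truth chunk ids that appear in the top-k results."""
    relevant = set(relevant)
    return len(relevant & set(retrieved[:k])) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are ground-truth chunks."""
    top = retrieved[:k]
    return sum(chunk_id in relevant for chunk_id in top) / len(top) if top else 0.0


def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Two toy queries with hypothetical chunk ids: (retrieved list, ground truth).
runs = [
    (["a", "b", "c", "d", "e"], {"c"}),  # relevant chunk found at rank 3
    (["f", "g", "h", "i", "j"], {"z"}),  # relevant chunk not retrieved
]
recall = mean(recall_at_k(r, rel, 5) for r, rel in runs)
mrr = mean(reciprocal_rank(r, rel) for r, rel in runs)
precision = mean(precision_at_k(r, rel, 5) for r, rel in runs)
```

With a single labelled chunk per query, a hit in the top 5 contributes at most 1/5 to precision, which is where the ~0.20 bound comes from.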

5. Ablation Study

Six configurations evaluated on the same 35-item set:

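| Config | Strategy | Chunk size | k | Recall@5 | MRR | Prec@5 | p50 (ms) | p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B0 Dense | Dense only | 1600 | 5 | 0.6875 | 0.5417 | 0.1882 | 48 | 88 |
| B1 BM25 | Sparse only | 1600 | 5 | 0.7639 | 0.6736 | 0.2083 | 30 | 46 |
| B2 Hybrid | Dense + BM25 (RRF) | 1600 | 5 | 0.7847 | 0.6403 | 0.2000 | 77 | 103 |
| B3 Small | Hybrid | 800 | 5 | 0.5833 | 0.4931 | 0.1500 | 138 | 205 |
| B4 Large | Hybrid | 2400 | 5 | 0.5417 | 0.4097 | 0.1361 | 71 | 92 |
| B5 Higher-k | Hybrid | 1600 | 10 | 0.7847 | 0.6403 | 0.1764 | 78 | 102 |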

24 queries with ground-truth chunk IDs used for all metric calculations.


6. Analysis

6.1 Retrieval Strategy (B0–B2)

Hybrid (B2) achieves the best Recall@5. Combining dense and sparse signals recovers 9.7 percentage points more relevant chunks than dense-only (B0: 0.6875 → 0.7847) and 2.1 points more than BM25-only (B1: 0.7639). However, BM25-only achieves the best MRR (0.6736 vs. 0.6403): it ranks the correct chunk higher when it finds it. This is consistent with regulatory text where exact identifier matches (Section 51A, RBI/2025-26/251, FEMA 6(R)) are the strongest retrieval signal.

The hybrid's lower MRR relative to BM25 reflects a known RRF trade-off: fusing rankings redistributes scores, sometimes depressing the rank of a high-confidence BM25 hit.
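
A worked instance of this trade-off, using hypothetical ranks with the RRF formula from Section 2: chunk A is the top BM25 hit but only 40th in the dense ranking, while chunk B is 3rd in both.

```latex
\begin{align*}
\mathrm{score}(A) &= \frac{1}{60 + 40} + \frac{1}{60 + 1} \approx 0.0100 + 0.0164 = 0.0264 \\
\mathrm{score}(B) &= \frac{1}{60 + 3} + \frac{1}{60 + 3} = \frac{2}{63} \approx 0.0317
\end{align*}
```

B outranks A after fusion even though BM25 alone would have placed A first, which lowers the reciprocal rank for that query whenever A is the labelled chunk.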

BM25 is the fastest at p50=30ms (pure inverted-index lookup), followed by dense at 48ms (FAISS scan). Hybrid at 77ms is roughly the sum of both (30 + 48 = 78ms), but remains well within interactive latency budgets.

6.2 Chunk Size (B3–B4 vs. B2)

| Chunk size (chars) | Recall@5 | Δ vs. B2 | Interpretation |
| --- | --- | --- | --- |
| 800 (B3) | 0.5833 | −20.1pp | Over-fragmentation: answers split across chunks |
| 1600 (B2) | 0.7847 | | Optimal |
| 2400 (B4) | 0.5417 | −24.3pp | Too much irrelevant context per chunk; ranking degrades |

Both deviations from 1600-char cause substantial recall drops. Smaller chunks (800) reduce recall due to fragmentation of multi-clause regulatory provisions — capital formulae, tiered definitions, and cross-referencing clauses must be read together to be retrievable as a unit. At 2400 chars, a single chunk covers multiple unrelated provisions, diluting both the BM25 term match and the dense embedding signal.

1600 characters is the empirical optimum for this corpus — roughly four 400-character regulatory paragraphs, the natural unit of a single provision with its context.
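
To make the three chunking parameters concrete, here is a deliberately simplified fixed-window splitter. It is not the RecursiveCharacterSplitter itself, which presumably prefers natural separators such as paragraph breaks. The function name `split_fixed` and the policy of dropping a too-short trailing fragment are assumptions.

```python
def split_fixed(text, size=1600, overlap=200, min_len=100):
    """Split text into windows of `size` chars that share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        # Drop a trailing fragment below the minimum length, unless it is
        # the only chunk the text produces.
        if len(piece) < min_len and chunks:
            break
        chunks.append(piece)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 4000-character toy document yields windows starting at 0, 1400 and 2800.
sample = "".join(str(i % 10) for i in range(4000))
chunks = split_fixed(sample)
```

The 200-character overlap is what gives a provision that straddles a boundary a chance of appearing whole in at least one chunk.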

6.3 Retrieval Depth (B5 vs. B2)

Doubling top-k from 5 to 10 does not change Recall@5 or MRR (B5 = B2 identically at 0.7847 / 0.6403). This suggests that the missed queries reflect corpus issues (boundary splits, duplicates) rather than rank-order problems fixable by returning more results. The only metric that changes is Precision@5, which drops from 0.2000 to 0.1764.

6.4 Failure Analysis (B2 Hybrid)

6 of 35 queries (17.1%) are retrieval misses. The two dominant failure patterns:

1. Chunk boundary fragmentation — relevant content spans a chunk boundary; the retriever finds adjacent chunks but not the one with the labelled chunk_id.

Example: syn_013 (eligible quarterly profits formula)

Expected: _026__0001 (contains the EP formula)
Retrieved: _026__0000 (chunk before the formula) + a different document

2. Corpus deduplication — the same RBI circular is indexed under two filenames; the retriever finds the duplicate but the eval set only labels one chunk_id.

Example: syn_011 (Section 51A UAPA)

RBI/2025-26/251 exists as both _085__* and _001__* (same circular, two parsed copies)
Retrieved _001__0000 (correct document, wrong copy) → counted as miss

These are dataset/corpus artefacts, not retrieval failures. Deduplicating the corpus and relaxing the evaluation to accept any chunk from the same circular would raise B2 Recall@5 to approximately 0.83.
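
The relaxed evaluation could be sketched as a document-level hit test. Everything here is an assumption for illustration: the alias table `DUPLICATE_OF`, the helper names, the convention that a chunk id is a document prefix and a chunk index joined by a double underscore, and the labelled chunk index used for syn_011 (the text gives only the `_085__*` prefix).

```python
# Hypothetical alias table: duplicate document prefix -> canonical prefix.
DUPLICATE_OF = {"_085": "_001"}


def doc_key(chunk_id: str) -> str:
    """Map a chunk id to its canonical document (assumed '<doc>__<index>' ids)."""
    prefix = chunk_id.rsplit("__", 1)[0]
    return DUPLICATE_OF.get(prefix, prefix)


def hit_same_document(retrieved, relevant, k=5):
    """True if any top-k chunk comes from a document holding a labelled chunk."""
    wanted = {doc_key(chunk_id) for chunk_id in relevant}
    return any(doc_key(chunk_id) in wanted for chunk_id in retrieved[:k])


# syn_011 pattern: the duplicate copy is retrieved instead of the labelled one.
strict_hit = "_001__0000" in {"_085__0000"}
relaxed_hit = hit_same_document(["_001__0000"], {"_085__0000"})

# syn_013 pattern: the neighbouring chunk of the right document is retrieved.
boundary_hit = hit_same_document(["_026__0000"], {"_026__0001"})
```

Document-level matching is more lenient than chunk-level recall, so scores computed this way are not directly comparable to the table in Section 5.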


7. Limitations

  • Evaluation set size: 35 queries (24 with ground truth) is small; results carry wide confidence intervals. A 200-item set with diverse human-written questions would strengthen the conclusions.
  • Synthetic query bias: The 20 synthetic queries were authored with knowledge of the corpus, likely overfitting evaluation to the existing chunk boundaries.
  • No re-ranker: A cross-encoder re-ranker (e.g., bge-reranker-base) could improve MRR without changing recall, at the cost of additional latency.
  • Generation evaluation: Faithfulness and answer-relevance scores (Gemini-1.5-Flash as judge) were not collected due to API constraints. The pipeline is instrumented for this; results pending.
  • Corpus quality: Near-duplicate documents inflate index size and create ambiguous ground truth. A deduplication step before indexing is the highest-value preprocessing improvement.
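
A minimal sketch of the deduplication step mentioned in the last bullet, assuming documents are held as a filename-to-text mapping. It only catches copies that are identical after whitespace and case normalisation. True near-duplicates would need a fuzzier method such as shingling. The function names and filenames are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text after collapsing whitespace and lower-casing."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first document seen for each fingerprint."""
    seen = set()
    kept = {}
    for name, text in docs.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            kept[name] = text
    return kept


# Hypothetical filenames: the second file is a re-parsed copy of the first.
docs = {
    "circular_a.txt": "Section 51A of the UAPA\nrequires designated lists.",
    "circular_a_copy.txt": "Section 51A  of the UAPA requires designated lists.\n",
    "circular_b.txt": "A different circular.",
}
unique_docs = deduplicate(docs)
```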

8. Conclusion

Hybrid retrieval with RRF improves Recall@5 by 9.7 percentage points over dense-only on this regulatory corpus (and by 2.1 points over BM25-only), supporting the hypothesis that lexical signals are important for identifier-heavy text. The optimal chunk size of 1600 characters reflects the natural paragraph structure of Indian regulatory documents.

The primary remaining gap to the 0.80 Recall@5 target is attributable to corpus artefacts (duplicates, boundary splits) rather than retrieval algorithm limitations. Corpus-level fixes would close this gap before any algorithmic improvement.


9. Reproducibility

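Evaluation results are committed to this repository. The built indices are not tracked in git (see System Files) and are regenerated by the build step below.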

# Build index (runs in ~3 minutes on CPU)
python -m rag.scripts.build_index

# Run full ablation (retrieval only, no API keys needed)
python -m rag.scripts.run_evaluation --no-generation

# Run with generation evaluation (requires GROQ_API_KEY + GEMINI_API_KEY)
GROQ_API_KEY=... GEMINI_API_KEY=... python -m rag.scripts.run_evaluation

Results are saved to data/eval/results_<timestamp>.json. Indices are at rag/index/ (1600-char), rag/index_800/ (800-char), rag/index_2400/ (2400-char).

Dependencies

sentence-transformers>=2.7.0   # BGE embeddings
faiss-cpu>=1.8.0               # FAISS IndexFlatIP
rank-bm25>=0.2.2               # BM25Okapi
groq>=0.5.0                    # LLM generation (optional)

10. System Files

rag/
├── config.py          RAGConfig — all hyperparameters
├── models.py          Document, ChunkRecord, RetrievalResult dataclasses
├── data_loader.py     Loads .txt files from data/parsed/{rbi,sebi}/
├── preprocessing.py   TextPreprocessor — normalisation
├── chunking.py        RecursiveCharacterSplitter
├── embeddings.py      BGEEmbedder — asymmetric BGE encoding
├── index.py           FAISSIndex — build, save, load
├── bm25_index.py      BM25Index — build, save, load
├── retriever.py       HybridRetriever — dense | bm25 | hybrid modes
├── generator.py       LLMGenerator — Groq / Ollama backends
├── pipeline.py        RAGPipeline — top-level orchestrator
├── evaluation.py      EvalItem, ablation runner, metrics, report printer
├── scripts/
│   ├── build_index.py      CLI: build index from corpus
│   └── run_evaluation.py   CLI: Phase 3 evaluation entry point
└── index/             Built indices (not tracked in git — use build_index.py)
    ├── faiss.index
    ├── chunks.pkl
    └── bm25.pkl

Part of the IndiaFinBench project. See README.md for the benchmark.