
Rule Batching Optimization (Implemented & Disabled) #54

Draft
ayo6706 wants to merge 11 commits into main from ft/rule-batch-optimaztion

Conversation


ayo6706 commented Jan 13, 2026

This PR implements the infrastructure for Rule Batching Optimization, a strategy to evaluate multiple rules in a single LLM call to reduce token usage and latency.

Status: The feature is DISABLED BY DEFAULT (BatchRules=false).

Reason: Extensive validation confirmed that batching significantly degrades accuracy due to LLM cognitive load limitations ("Lost in the Middle" phenomenon).


Implementation Details

1. New Core Components

  • BatchedCheckEvaluator: A new evaluator class that:
    • Chunks content once.
    • Constructs a single "Multi-Rule" prompt.
    • Parses a new JSON schema (BatchedCheckLLMResult) to map violations back to specific Rule IDs.
  • BatchedPromptBuilder: Dynamically assembles multiple rule definitions into a single system prompt using a "Task-Based" structure.
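To make the "Task-Based" structure concrete, here is a minimal sketch of how a multi-rule prompt can be assembled. The names (`RuleContext`, `buildBatchedPrompt`) and the exact preamble wording are illustrative, not the actual `BatchedPromptBuilder` API.

```typescript
// Illustrative sketch of task-based multi-rule prompt assembly.
// Interface and function names are assumptions, not the real API.
interface RuleContext {
  id: string;
  name: string;
  body: string;
}

function buildBatchedPrompt(rules: RuleContext[]): string {
  const tasks = rules
    .map((r, i) => `## Task ${i + 1}: ${r.name} (rule id: ${r.id})\n${r.body}`)
    .join("\n\n");
  return [
    "You are evaluating a document against several independent rules.",
    "Treat each task in isolation; do not let one rule influence another.",
    "Report every violation together with the rule id it belongs to.",
    tasks,
  ].join("\n\n");
}

const prompt = buildBatchedPrompt([
  { id: "ai-pattern", name: "AIPattern", body: "Flag negation-contrast phrasing." },
  { id: "repetition", name: "Repetition", body: "Flag repeated sentence openers." },
]);
```

The "treat each task in isolation" instruction is exactly the part that, per the validation below, the model fails to honor at larger batch sizes.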

2. Orchestrator Updates

  • Modified evaluateFile in src/cli/orchestrator.ts to:
    • Partition rules into Batchable (Check rules, Base evaluator) and Non-Batchable (Judge rules, Custom evaluators).
    • Route batchable rules to the new evaluator.
    • Gracefully fall back to individual evaluation if the batch request fails.
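The partition-and-fallback flow can be sketched as follows; the `Rule` shape and evaluator signatures are simplified assumptions, not the real orchestrator types.

```typescript
// Sketch of the orchestrator changes: partition rules, then batch with
// per-rule fallback on failure. Types here are illustrative.
interface Rule {
  id: string;
  kind: "check" | "judge";
  evaluator: "base" | "custom";
}

function partitionRules(rules: Rule[]): { batchable: Rule[]; individual: Rule[] } {
  const isBatchable = (r: Rule) => r.kind === "check" && r.evaluator === "base";
  return {
    batchable: rules.filter(isBatchable),
    individual: rules.filter((r) => !isBatchable(r)),
  };
}

async function evaluateWithFallback(
  rules: Rule[],
  batchEval: (rs: Rule[]) => Promise<string[]>,
  singleEval: (r: Rule) => Promise<string[]>,
): Promise<string[]> {
  try {
    return await batchEval(rules);
  } catch {
    // Batch request failed: degrade gracefully to one call per rule.
    const results = await Promise.all(rules.map(singleEval));
    return results.flat();
  }
}
```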

3. Configuration

  • Added BatchRules (boolean) and MaxRulesPerBatch (number) to .vectorlint.ini.
  • Default: false (Safety first).
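For reference, a `.vectorlint.ini` fragment exercising these keys might look like this (the key names and the 1-20 range come from this PR; the comments are mine):

```ini
; Batched rule evaluation. Disabled by default pending accuracy fixes.
BatchRules = false
; Rules per LLM call when batching is enabled (valid range: 1-20).
MaxRulesPerBatch = 5
```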

Validation & Analysis

Conducted validation experiments to measure accuracy before enabling the feature.

Methodology

  • Test File: tests/fixtures/ai-pattern/negation-pattern.md (complex content with negation patterns)
  • Model: gpt-5.1
  • Process: Compared BatchRules=false (Baseline) vs BatchRules=true (Experiment)
  • Metric: Intersection of findings by Rule ID + Line Number
  • Targets: >95% overlap with baseline, >50% token reduction
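The overlap metric above can be sketched as a set intersection keyed on `ruleId:line`; the key format is an assumption based on the "Rule ID + Line Number" description, and the real logic lives in `scripts/measure-batching-accuracy.ts`.

```typescript
// Sketch of the overlap metric: what fraction of baseline findings
// the batched run also reports, keyed by rule id and line number.
interface Finding {
  ruleId: string;
  line: number;
}

function overlapPercent(baseline: Finding[], batched: Finding[]): number {
  const key = (f: Finding) => `${f.ruleId}:${f.line}`;
  const batchedKeys = new Set(batched.map(key));
  const hits = baseline.filter((f) => batchedKeys.has(key(f))).length;
  return baseline.length === 0 ? 100 : (hits / baseline.length) * 100;
}
```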

Experiments & Findings

| Report | Batch Size | Overlap | Token Reduction | Hallucinations | Status |
|---|---|---|---|---|---|
| V5 | 4 rules | 37.5% | 37% | ~9% | FAIL |
| V6 | 2 rules | 59% | 21% | ~5% | FAIL |
| Baseline | 1 rule | 100% | 0% | 0% | PASS |

Key Metrics Comparison

| Metric | Baseline | Batch=4 (V5) | Batch=2 (V6) |
|---|---|---|---|
| Warnings Found | 32 | 34 | 37 |
| Input Tokens | 50,570 | 31,996 | 39,868 |
| LLM Requests | 24 | 6 | 12 |
| Missed Findings | 0 | 18 | 13 |
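As a sanity check, the token-reduction percentages in the experiment summary follow directly from the input-token counts above:

```typescript
// Token reduction relative to the baseline input-token count.
const reduction = (baseline: number, batched: number): number =>
  Math.round(((baseline - batched) / baseline) * 100);

reduction(50570, 31996); // Batch=4 (V5) → 37
reduction(50570, 39868); // Batch=2 (V6) → 21
```

Both fall short of the >50% reduction target, because each batch still pays the full content-chunk cost plus a longer multi-rule system prompt.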

Root Cause Analysis

Why Batching Fails

  1. Lost in the Middle: Complex rules (negation-contrast patterns, structural analysis) are systematically missed when multiple rules compete for attention in one prompt.

  2. Context Bleed (Hallucination): The model applies the logic of one rule to another, creating false positives where the "sentiment" of Rule A bleeds into Rule B's evaluation.

  3. Inconsistent Rule Application: The Repetition rule found 7 issues in batched mode vs. 2 in the baseline; the model's interpretation varies significantly with prompt structure.

Batch Size Impact

| Batch Size | Observation |
|---|---|
| 4 rules | Missed 18 findings (56% loss), especially negation patterns on lines 3, 37, 58, 130, 136 |
| 2 rules | Missed 13 findings (41% loss); recovered the negation pattern on line 3, but still missed lines 37, 58, 130, 136 |
| 1 rule (baseline) | All findings recovered; no "lost in the middle" effect |

Detailed Examples of Missed Findings

| Line:Col | Rule | Quoted Text | Description | Missed By |
|---|---|---|---|---|
| 1:102 | AIPattern | "don't just need tools, they need integrated platforms" | Rhetorical structure adds flair but no substance | Both V5 & V6 |
| 3:15 | AIPattern | "doesn't simply improve productivity" | Introduces and dismisses an idea never discussed | V5 only |
| 37:1 | AIPattern | "isn't just a comment bot" | Negation-contrast without prior setup | Both V5 & V6 |
| 58:323 | AIPattern | "doesn't have X, but lacks Y" | Redundant negative contrasts | Both V5 & V6 |
| 130:50 | AIPattern | "Instead of trying to do everything" | Artificial contrast | Both V5 & V6 |
| 136:31 | AIPattern | "doesn't generate, doesn't provide, doesn't help" | Repeated "doesn'ts", templated AI phrasing | Both V5 & V6 |

Process Note: "Implement to Validate"

The original plan required validation before core engineering. However, to scientifically validate the batching hypothesis, the batching infrastructure (Evaluator, Prompt Builder, Schema) had to be built first to run the experiments.

Therefore, this PR includes the full implementation of the optimization, but respects the quality gate by shipping it in a disabled state. This preserves the engineering work for future R&D (e.g., when models improve) without risking production quality.


Future Improvements

If batching is revisited, consider:

  1. Rule-type-aware batching: Only batch simple buzzword rules together; keep complex structural rules individual
  2. Hybrid approach: Use batching for first-pass scanning, then verify edge cases individually
  3. Prompt engineering: Experiment with stronger rule separation in the prompt format
  4. Model improvements: Test with newer models that may have better "lost in the middle" handling
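Improvement #1 could be sketched as a batch planner that only groups "simple" rules and gives structural rules dedicated calls. The `complexity` tag is hypothetical; nothing in the current rule schema carries it.

```typescript
// Hedged sketch of rule-type-aware batching (future improvement #1).
// The complexity classification is an assumption, not an existing field.
interface TaggedRule {
  id: string;
  complexity: "simple" | "structural";
}

function planBatches(rules: TaggedRule[], maxPerBatch: number): TaggedRule[][] {
  const simple = rules.filter((r) => r.complexity === "simple");
  const structural = rules.filter((r) => r.complexity === "structural");
  const batches: TaggedRule[][] = [];
  // Simple buzzword-style rules share LLM calls, up to maxPerBatch each.
  for (let i = 0; i < simple.length; i += maxPerBatch) {
    batches.push(simple.slice(i, i + maxPerBatch));
  }
  // Complex structural rules (e.g. negation-contrast) always run alone.
  for (const r of structural) batches.push([r]);
  return batches;
}
```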

Artifacts

  • BATCHING_COMPARISON_REPORT_V5.md: Batch size 4 analysis with full findings and descriptions
  • BATCHING_COMPARISON_REPORT_V6.md: Batch size 2 analysis showing improved overlap (59% vs 37.5%)
  • scripts/measure-batching-accuracy.ts: Reusable validation tool

Summary

| Criterion | Target | V5 (Batch=4) | V6 (Batch=2) | Verdict |
|---|---|---|---|---|
| Overlap with Baseline | >95% | 37.5% | 59% | FAIL |
| Token Reduction | >50% | 37% | 21% | FAIL |
| Hallucination Rate | 0% | ~9% | ~5% | MARGINAL |

Recommendation: Feature remains DISABLED (BatchRules=false). The infrastructure is preserved for future experimentation when LLM capabilities improve.

- Add batchRules and maxRulesPerBatch options to EvaluationOptions interface
- Create EvaluateBatchedRulesParams interface for batched evaluation parameters
- Create EvaluateBatchedRulesResult interface to track batched evaluation results
- Implement buildBatchedCheckLLMSchema() function to generate JSON schema for multiple rules
- Add BatchedCheckLLMResult type to represent LLM output for batched evaluations
- Enable evaluating multiple rules in a single LLM call for improved efficiency
- Create new BatchedPromptBuilder module to combine multiple Check rules into single prompts
- Implement formatRuleForBatch() to format individual rules with clear task numbering
- Add buildBatchedCheckPrompt() to construct system prompts with rule preamble and verification
- Implement extractBatchedRuleContexts() to extract essential rule metadata from PromptFile objects
- Add groupIntoBatches() utility to partition rules into configurable batch sizes
- Define BatchedRuleContext interface for rule metadata (id, name, body)
- Include comprehensive system preamble with mission, protocol, and output format instructions
- Enables efficient batch evaluation of multiple rules in a single LLM call
- Implement BatchedCheckEvaluator class to process multiple Check-type rules in single LLM calls
- Add support for batching rules to reduce token usage by sending content only once per batch
- Implement rule batching logic with configurable max rules per batch to mitigate "lost in the middle" problem
- Add content chunking with configurable thresholds (600 word threshold, 500 word max chunk size)
- Integrate with batched prompt builder and schema for structured LLM responses
- Add violation merging and score calculation across multiple chunks and batches
- Include comprehensive TODO noting that batching showed 60-90% accuracy loss in validation and is currently disabled by default
- Add token usage aggregation and distribution across evaluated rules
- Support document-level evaluation mode that disables chunking when required
- Add `batchRules` boolean flag to enable/disable batched rule evaluation
- Add `maxRulesPerBatch` integer option to control batch size (1-20, default 5)
- These options allow users to optimize rule evaluation performance by processing multiple rules in parallel batches
- Add BatchRules and MaxRulesPerBatch configuration keys to config loader
- Parse BatchRules as boolean from config file (accepts "true", "1", or "false")
- Parse MaxRulesPerBatch as integer with validation (must be between 1 and 20)
- Propagate batch configuration options through CLI commands to evaluation context
- Export BatchedCheckEvaluator and related utilities from evaluators module
- Update module documentation to reference batched rule evaluation capability
- Enables users to configure batch evaluation behavior via configuration file
- Create new script to compare batched vs non-batched rule evaluation results
- Implement violation key normalization for accurate comparison across runs
- Add support for automatic baseline and batched evaluation execution
- Include detailed accuracy reporting with overlap percentage metrics
- Support both auto mode (runs evaluations) and manual mode (provides instructions)
- Add JSON output format for programmatic result processing
- Include verbose mode for detailed violation comparison analysis
- Validate that batching optimization doesn't degrade evaluation quality
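The chunking thresholds mentioned in the commit notes (chunk only above 600 words, at most 500 words per chunk) can be illustrated with a simple word-based splitter; the real logic inside `BatchedCheckEvaluator` may differ in how it counts and splits.

```typescript
// Illustrative word-count chunker using the thresholds from the commit
// notes. This is a sketch, not the evaluator's actual implementation.
function chunkWords(text: string, threshold = 600, maxChunk = 500): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  // Documents at or under the threshold are evaluated whole.
  if (words.length <= threshold) return [text];
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxChunk) {
    chunks.push(words.slice(i, i + maxChunk).join(" "));
  }
  return chunks;
}
```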

coderabbitai bot commented Jan 13, 2026

Important

Review skipped

Draft detected.

