Background
Building a 5-tier context compiler for agent dispatch revealed that context quality dominates context quantity — 205-word summaries outperform 24K-word full trajectories (SWE Context Bench, Oxford 2026). But nobody benchmarks the tradeoff systematically.
Proposed Addition
Add a benchmark mode that measures how context compression affects task completion:
Configurations to test
- No compression — full file contents, all rules, no filtering
- Manifest only — file paths + line counts, no content
- Tiered — submodular tier assignment (full/skeleton/summary/manifest)
- Minimal — just the task description, no file context
Metrics per configuration
- Task completion rate (ok/partial/fail)
- Token usage (input + output)
- Cost
- Wall-clock time
- Turns to completion
Implementation
pawbench --context-mode none # baseline: no context injection
pawbench --context-mode full # all files inline
pawbench --context-mode manifest # file manifest only
pawbench --context-mode tiered # submodular tier assignment
Each mode runs all scenarios and produces a comparison table showing the quality-cost tradeoff.
Why this matters
No published benchmark measures context composition strategy vs task success. This would be the first, and would validate (or invalidate) techniques like manifest-first, skeleton compression, and selective retrieval.
Background
Building a 5-tier context compiler for agent dispatch revealed that context quality dominates context quantity — 205-word summaries outperform 24K-word full trajectories (SWE Context Bench, Oxford 2026). But nobody benchmarks the tradeoff systematically.
Proposed Addition
Add a benchmark mode that measures how context compression affects task completion:
Configurations to test
Metrics per configuration
Implementation
Each mode runs all scenarios and produces a comparison table showing the quality-cost tradeoff.
Why this matters
No published benchmark measures context composition strategy vs task success. This would be the first, and would validate (or invalidate) techniques like manifest-first, skeleton compression, and selective retrieval.