
Benchmark mode: measure context compression vs task success tradeoff #8

@zenprocess


Background

Building a 5-tier context compiler for agent dispatch revealed that context quality dominates context quantity: 205-word summaries outperform 24K-word full trajectories (SWE Context Bench, Oxford 2026). However, no existing benchmark measures this tradeoff systematically.

Proposed Addition

Add a benchmark mode that measures how context compression affects task completion:

Configurations to test

  1. No compression — full file contents, all rules, no filtering
  2. Manifest only — file paths + line counts, no content
  3. Tiered — submodular tier assignment (full/skeleton/summary/manifest)
  4. Minimal — just the task description, no file context
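As a rough illustration of how the four configurations could differ, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `SourceFile` type, the `relevance` score, the word-based budget, and the declaration-only skeleton heuristic. A plain greedy "best tier that still fits" pass stands in for the submodular tier assignment, which this issue does not specify, and the summary tier is left out because it would need a model call.

```python
from dataclasses import dataclass


@dataclass
class SourceFile:
    path: str
    text: str
    relevance: float  # hypothetical relevance score, higher = more important


def full(f: SourceFile) -> str:
    return f"# {f.path}\n{f.text}"


def skeleton(f: SourceFile) -> str:
    # Assumed heuristic: keep only lines that look like Python declarations.
    prefixes = ("def ", "class ", "import ", "from ")
    keep = [ln for ln in f.text.splitlines() if ln.lstrip().startswith(prefixes)]
    return f"# {f.path} (skeleton)\n" + "\n".join(keep)


def manifest_line(f: SourceFile) -> str:
    return f"{f.path} ({len(f.text.splitlines())} lines)"


def word_count(s: str) -> int:
    return len(s.split())


def tiered_context(files: list[SourceFile], budget_words: int) -> str:
    # Greedy stand-in: visit files by descending relevance and give each
    # the richest tier that fits the remaining budget. The manifest line
    # is always emitted, even if it overshoots the budget slightly.
    parts = []
    remaining = budget_words
    for f in sorted(files, key=lambda f: f.relevance, reverse=True):
        for render in (full, skeleton, manifest_line):
            piece = render(f)
            cost = word_count(piece)
            if cost <= remaining or render is manifest_line:
                parts.append(piece)
                remaining -= cost
                break
    return "\n\n".join(parts)


def build_context(mode: str, task: str, files: list[SourceFile],
                  budget_words: int = 200) -> str:
    if mode == "none":
        body = ""
    elif mode == "full":
        body = "\n\n".join(full(f) for f in files)
    elif mode == "manifest":
        body = "\n".join(manifest_line(f) for f in files)
    elif mode == "tiered":
        body = tiered_context(files, budget_words)
    else:
        raise ValueError(f"unknown context mode: {mode}")
    return task if not body else f"{task}\n\n{body}"
```

Keeping all four modes behind one `build_context` function means the benchmark runner only varies a single argument per configuration, so any difference in outcome can be attributed to the context itself.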

Metrics per configuration

  • Task completion rate (share of runs ending ok/partial/fail)
  • Token usage (input + output)
  • Cost
  • Wall-clock time
  • Turns to completion
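The metrics above could be captured in one record per run and then aggregated per configuration. The sketch below is only one possible shape; the `RunResult` field names, the USD cost unit, and the choice of means versus totals are assumptions, not part of pawbench.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    mode: str          # context mode used for this run
    scenario: str      # scenario identifier
    outcome: str       # "ok", "partial", or "fail"
    input_tokens: int
    output_tokens: int
    cost_usd: float    # assumed unit
    wall_seconds: float
    turns: int


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Aggregate run results into one metrics row per context mode."""
    by_mode: dict[str, list[RunResult]] = {}
    for r in results:
        by_mode.setdefault(r.mode, []).append(r)

    summary = {}
    for mode, runs in by_mode.items():
        outcomes = Counter(r.outcome for r in runs)
        n = len(runs)
        summary[mode] = {
            "runs": n,
            "ok_rate": outcomes["ok"] / n,
            "partial_rate": outcomes["partial"] / n,
            "fail_rate": outcomes["fail"] / n,
            "mean_tokens": mean(r.input_tokens + r.output_tokens for r in runs),
            "total_cost_usd": sum(r.cost_usd for r in runs),
            "mean_wall_seconds": mean(r.wall_seconds for r in runs),
            "mean_turns": mean(r.turns for r in runs),
        }
    return summary
```

Reporting ok, partial, and fail as three separate rates avoids collapsing partial successes into either bucket, which matters if compression mostly shifts runs from ok to partial rather than to outright failure.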

Implementation

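pawbench --context-mode none       # minimal baseline: task description only, no context injection
pawbench --context-mode full       # no compression: all files inline, all rules
pawbench --context-mode manifest   # manifest only: file paths + line counts
pawbench --context-mode tiered     # submodular tier assignment (full/skeleton/summary/manifest)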
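For illustration only, the flag could be parsed along these lines if the CLI were written in Python with `argparse`. How pawbench actually handles arguments is not stated here, and the default of `none` is an assumption.

```python
import argparse

# The four modes proposed in this issue.
CONTEXT_MODES = ("none", "full", "manifest", "tiered")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="pawbench")
    parser.add_argument(
        "--context-mode",
        choices=CONTEXT_MODES,
        default="none",  # assumed default: the no-injection baseline
        help="context composition strategy to benchmark",
    )
    return parser.parse_args(argv)
```

Using `choices` makes an unknown mode fail at parse time instead of midway through a benchmark run.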

Each mode runs all scenarios and produces a comparison table showing the quality-cost tradeoff.
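A minimal sketch of the comparison table, assuming per-mode aggregates are already available as a dict of dicts. The column names are placeholders chosen by the caller, and the plain-text layout is just one option.

```python
def format_cell(value) -> str:
    # Floats get two decimals; everything else is printed as-is.
    return f"{value:.2f}" if isinstance(value, float) else str(value)


def comparison_table(summary: dict[str, dict], columns: list[str]) -> str:
    """Render one row per context mode with the requested metric columns."""
    header = ["mode", *columns]
    rows = [
        [mode, *(format_cell(stats[c]) for c in columns)]
        for mode, stats in summary.items()
    ]
    table = [header, *rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join(lines)
```

Putting a quality column (for example an ok rate) next to a cost column (tokens or cost) in the same row is what makes the quality-cost tradeoff readable at a glance.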

Why this matters

To our knowledge, no published benchmark measures context composition strategy against task success. This mode would fill that gap and would validate (or invalidate) techniques like manifest-first, skeleton compression, and selective retrieval.


Labels: enhancement
