Fix: stop leaking held-out reference answers to the meta/feedback agents #36

Open
francisco-perez-sorrosal wants to merge 1 commit into hexo-ai:main from francisco-perez-sorrosal:fix/feedback-gold-leak

@francisco-perez-sorrosal

Fix: stop leaking held-out reference answers to the meta/feedback agents

The problem

In the self-improvement loop, the feedback context is built in
sia/orchestrator.py::_build_feedback_context by dumping the entire
results.json into the feedback-agent prompt:

eval_data = json.load(f)
eval_results_section = f"""
**EVALUATION RESULTS**:
```json
{json.dumps(eval_data, indent=2)}
```
"""

When a task's grader writes per-item records that include the held-out
reference answers (the answer key), those answers flow straight into the
feedback-agent prompt — and, because the meta/feedback agent can read the
filesystem, the held-out ground truth is effectively visible to the very agent
whose job is to improve the solution.

This contradicts the project's own stated contract:

private/ — Held-out eval data; never exposed to the agent (README.md)

The LLM is not told about data/private/ during evaluation. This
prevents the agent from cheating and ensures fair scoring. — docs/walkthrough.md

This is reproducible on a bundled task today. The longcot-chess grader
writes results.json as {"summary": {...}, "results": [{"question_id", "expected": <correct answer>, "predicted", "status"}, ...]}. The orchestrator
runs it via evaluate.py --gen-dir <gen> (so it lands at <gen>/results.json),
then _build_feedback_context dumps the whole file — including every
results[].expected answer — into gen_N/feedback_prompt. (For lawbench the
exposure is milder — only aggregate per_class accuracy; gpqa writes its gold
to a different filename and isn't dumped.) One concrete leaking bundled task is
enough to violate the invariant above.

This is a reward-hacking / contamination surface: the agent can optimize
toward matching the answer key rather than toward a genuinely better,
generalizing solution. It silently undermines the integrity of every
self-improvement run on a task that grades against held-out gold.

The fix — three composable, task-agnostic layers

Two distinct leak vectors, named up front. The reported bug is harness-side:
the orchestrator itself hands the agent the answer key by dumping results.json
into the prompt it builds. That is the vector that is reproducible today, and
Layer 1 is the actual fix for it — it curates what the harness emits, so the gold
never enters the prompt. Layers 2 and 3 are defense-in-depth against a second,
latent vector the prompt-side fix doesn't touch: the agent reaching into
data/private on its own during its run (read/glob/cat). Keeping the two
vectors separate is what lets the seal stay task-agnostic: Layer 1 closes the
trusted channel (the harness), Layers 2–3 narrow the agent's own channel.

Each layer is generic (no task-specific coupling) and defaults to today's
behavior where applicable, so existing tasks keep working.

Layer 1 — curated, reference-answer-free eval summary

_build_feedback_context no longer dumps results.json. It calls a new
_build_eval_summary that emits only:

  • Every scalar metric, at any nesting depth, rendered as a JSON object in the
    original results.json shape and field names (top-level accuracy/correct/
    total, or a nested summary block — whatever the grader wrote). This is a
    near drop-in for the old dump for a grader that emits only scalars.
  • A generic, opt-in items[] contract: each item is an open dict, of which
    only id / status / group / category / input / output / detail are
    framework-recognized and rendered.

The reference-answer-free guarantee comes from dropping every list: per-item
record arrays — results / details, where the bundled graders put the answer
key — are excluded wholesale; nested dicts (aggregate blocks) are recursed for
their scalars. The only per-item channel is the items[] array, surfaced solely
through the render whitelist, so no task-specific key can ride along. Failing
items are sampled (capped, diversified across status and group). A bounded
summarize_items digest is also stored in context.md metrics.
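
The list-dropping recursion described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `_collect_scalars`: the name `collect_scalars`, the choice to treat strings and `None` as scalars, and the sample `eval_data` are all assumptions made for the example.

```python
import json
from typing import Any

# Assumption: these leaf types count as "scalar" for the summary.
SCALAR_TYPES = (str, int, float, bool, type(None))


def collect_scalars(node: Any) -> dict:
    """Return a copy of a results dict keeping scalar leaves only; lists are dropped."""
    if not isinstance(node, dict):
        return {}
    kept = {}
    for key, value in node.items():
        if isinstance(value, list):
            continue  # per-item arrays (e.g. results/details) are excluded wholesale
        if isinstance(value, dict):
            nested = collect_scalars(value)  # recurse into aggregate blocks
            if nested:  # skip blocks left empty after filtering
                kept[key] = nested
        elif isinstance(value, SCALAR_TYPES):
            kept[key] = value
    return kept


# Made-up input in the nested summary/results shape described in this PR.
eval_data = {
    "summary": {"accuracy": 0.25, "correct": 1, "total": 4},
    "results": [{"question_id": "q1", "expected": "GOLD-SENTINEL", "status": "WRONG"}],
}
summary = collect_scalars(eval_data)
rendered = json.dumps(summary, indent=2)
```

Because the filter keys off the value's type rather than a field name, it needs no knowledge of which task produced the file; the trade-off is that a grader placing gold inside a scalar string would still pass through, which is why the per-item channel is restricted separately.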

Concrete case (longcot-chess, a bundled task): its grader writes
results.json as {"summary": {...}, "results": [{"expected": <correct answer>, …}]},
and the old code dumped the whole file — every held-out answer — into the
feedback prompt. After the fix, the feedback agent sees the nested summary
scalars and none of the results[].expected answers.

Opt-in + graceful degradation: a grader opts into rich, reference-free
feedback by emitting items[] with input/output but not the reference
answer. Graders that emit only scalars still get the original JSON summary — no
breakage, no format change.
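
As a rough sketch of how a render whitelist keeps a reference answer from riding along on an item, assuming a hypothetical helper `render_item` (the key list is the one named in this description; everything else is illustrative):

```python
from typing import Any

# The framework-recognized item keys listed in this PR description.
ITEM_RENDER_KEYS = ("id", "status", "group", "category", "input", "output", "detail")


def render_item(item: dict[str, Any]) -> dict[str, Any]:
    """Project one items[] entry onto the whitelist; any other key is dropped."""
    return {key: item[key] for key in ITEM_RENDER_KEYS if key in item}


# Made-up item: "expected" stands in for a task-specific reference-answer key.
item = {"id": "q7", "status": "WRONG", "output": "Qh5", "expected": "Nf3"}
rendered_item = render_item(item)
```

An allow-list is the safer default here: a grader that adds a new answer-bearing key later is still sealed without any change to the harness.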

Two opt-in env vars (SIA_VERIFIER_PASS_STATUSES, SIA_FEEDBACK_FAILURE_SAMPLES)
tune the failed-item sample; both default to current behavior. They are
documented in docs/configuration.md.
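
One plausible way to get a capped sample that is diversified across status and group is a round-robin over (status, group) buckets. The sketch below assumes that strategy; the function name `sample_failures` and the bucket ordering are hypothetical, and the PR's actual sampler may differ.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Any

# Mirrors the documented default of SIA_VERIFIER_PASS_STATUSES.
PASS_STATUSES = {"CORRECT", "PASS", "correct"}


def sample_failures(items: list[dict[str, Any]], cap: int = 20) -> list[dict[str, Any]]:
    """Pick at most `cap` failing items, round-robin across (status, group) buckets."""
    if cap <= 0:
        return []
    buckets: dict[tuple, list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        if item.get("status") not in PASS_STATUSES:
            buckets[(item.get("status"), item.get("group"))].append(item)
    picked: list[dict[str, Any]] = []
    # zip_longest yields the 1st item of every bucket, then the 2nd, and so on.
    for layer in zip_longest(*buckets.values()):
        for item in layer:
            if item is None:  # this bucket is already exhausted
                continue
            picked.append(item)
            if len(picked) == cap:
                return picked
    return picked
```

With five `WRONG` items in one group and a single `TIMEOUT` in another, a cap of 3 surfaces the timeout alongside two wrong answers instead of three near-identical failures.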

Layer 2 — held-out-integrity prompt notice

A generic, path-free instruction is injected into both the meta and feedback
prompts, telling the agent that grader-only ground truth lives in a private
directory and must not be read/listed/globbed. It names no concrete path, so it
applies to any task.

Layer 3 — protected-path defense-in-depth (two complementary guards)

During the meta and feedback agent runs, the task's held-out dir
(<task_dir>/data/private) is protected by two mechanisms, both keyed off the same
generic compute_protected_paths(task_dir):

3a — claude-impl PreToolUse deny hook. A claude_agent_sdk hook denies any
Bash / Read / Edit / Glob / … tool call that resolves inside the held-out
dir, wired through a protected_paths kwarg on run_agent. It defaults to None
(no hook registered), so behavior is byte-identical when unused, and is honored only
by the claude impl (others ignore the kwarg).
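
The path check behind such a hook might look roughly like this. It is a sketch only: `protected_paths_for` is a stand-in for the PR's `compute_protected_paths`, `is_protected` is a hypothetical name, and the wiring into the claude_agent_sdk hook is deliberately left out.

```python
from pathlib import Path


def protected_paths_for(task_dir: str) -> list[Path]:
    """Stand-in for compute_protected_paths: the task's held-out dir, resolved."""
    return [(Path(task_dir) / "data" / "private").resolve()]


def is_protected(candidate: str, protected: list[Path], cwd: str = ".") -> bool:
    """True if `candidate` resolves to a protected dir or anything beneath it."""
    # resolve() collapses ".." segments and follows symlinks that exist.
    resolved = (Path(cwd) / candidate).resolve()
    return any(resolved == root or root in resolved.parents for root in protected)


# Made-up task location, used only to exercise the matcher.
protected = protected_paths_for("/srv/task")
direct_hit = is_protected("/srv/task/data/private/answers.json", protected)
dotdot_hit = is_protected("data/public/../private", protected, cwd="/srv/task")
public_ok = is_protected("data/public/evaluate.py", protected, cwd="/srv/task")
```

Resolving before comparing is what lets relative and `..` forms match; an empty protected list makes every call return `False`, consistent with the no-op default.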

3b — impl-agnostic OS-level guard. A restricted_access context manager strips
all filesystem permissions on data/private (chmod 000) for the duration of the
sequential meta/feedback agent run, restoring the original mode on exit (including on
error). It is wrapped only around the agent calls — never around grading, which
legitimately reads the dir. Gated by SIA_PRIVATE_DIR_GUARD (default on).
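
A strip-then-restore guard of this kind can be sketched as below, assuming a POSIX filesystem. The signature (a `path` plus an `enabled` flag) is an assumption for illustration; the PR's `restricted_access` may take different arguments.

```python
import os
import stat
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def restricted_access(path: Path, enabled: bool = True) -> Iterator[None]:
    """Strip all permission bits on `path` for the block, then restore them."""
    if not enabled or not path.exists():
        yield  # guard disabled or nothing to protect: behave as a no-op
        return
    original_mode = stat.S_IMODE(path.stat().st_mode)
    os.chmod(path, 0o000)
    try:
        yield
    finally:
        os.chmod(path, original_mode)  # runs on normal exit and on exceptions


# Usage on a throwaway directory standing in for <task_dir>/data/private.
with tempfile.TemporaryDirectory() as tmp:
    private = Path(tmp) / "data" / "private"
    private.mkdir(parents=True)
    os.chmod(private, 0o750)
    with restricted_access(private):
        mode_during = stat.S_IMODE(private.stat().st_mode)
    mode_after = stat.S_IMODE(private.stat().st_mode)
```

The `finally` clause covers exceptions but not a hard kill, which matches the SIGKILL caveat noted in this description.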

3b closes two gaps that 3a alone leaves open:

  • Cross-impl. 3a only protects the claude impl; the openai / pydantic-ai /
    openhands impls ignore the kwarg entirely. The OS guard is enforced in the
    orchestrator, so it protects the held-out dir regardless of agent impl.
  • Evasion. 3a's Bash branch is a path-string match — robust against a stray
    cat .../data/private/…, but not against an obfuscated shell or a subprocess. The
    OS guard fails the read at the filesystem layer, which a string-match cannot be
    talked around.

Honest caveat: the OS guard is process-global (it chmods a shared dir), so it is
applied only around the sequential agent calls. It does not defend against a
root process or a copy-out-then-read, and a hard kill (SIGKILL) mid-run leaves the
dir stripped until restored. All of Layer 3 is defense-in-depth behind the primary
seal (Layer 1), not a security guarantee on its own; container-level sandboxing
(--sandbox docker) remains the strongest boundary.

Why SIA's existing Docker sandbox doesn't close this

SIA already ships a configurable execution sandbox (SANDBOX_MODE = none | docker,
with DOCKER_IMAGE / DOCKER_MEMORY_LIMIT / … and a --sandbox docker invocation), so
it's worth saying explicitly why it does not solve the gold leak. First, the
primary leak (Layer 1) is harness-side: the orchestrator itself dumps the
gold-bearing results.json into the prompt text it builds, so the answers ride into the
feedback prompt whether or not the agent's code runs inside a container — an execution
sandbox sees none of it. Second, for the secondary self-exfiltration vector (an agent
reading data/private on its own), Docker is indeed the strongest containment — but it
is off by default (SANDBOX_MODE="none") and execution-focused: a heavy, opt-in,
run-level mechanism. That is exactly the gap Layer 3b fills — a lightweight, default-on,
per-agent-run chmod 000 backstop that needs no container. The two are complements,
not substitutes: Docker is the proper OS-level isolation when a run opts into it; 3b is
the in-process, always-on default.

Configuration

Three env vars, all documented in docs/configuration.md:

  • SIA_VERIFIER_PASS_STATUSES (default CORRECT,PASS,correct) — the set of
    items[] statuses that count as passing; any other status is a failure.
  • SIA_FEEDBACK_FAILURE_SAMPLES (default 20) — cap on failing items surfaced
    in the summary.
  • SIA_PRIVATE_DIR_GUARD (default on) — toggles Layer 3b's OS-level guard.
    Set 0 to disable (e.g. if a hard-kill left a dir stripped and you want to opt
    out while debugging).

The first two default to current behavior; the third adds the OS-level backstop
on by default and is the only runtime-behavior change for tasks that ship a
data/private dir.
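
For orientation, the three knobs could be read along these lines. The defaults are the ones stated above; the helper name `load_seal_config`, the returned dict shape, and the rule that only the literal `0` disables the guard are assumptions, so check docs/configuration.md for the actual parsing.

```python
import os
from typing import Any, Mapping


def load_seal_config(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Read the three env vars with the defaults stated in this description."""
    raw_statuses = env.get("SIA_VERIFIER_PASS_STATUSES", "CORRECT,PASS,correct")
    return {
        "pass_statuses": {s.strip() for s in raw_statuses.split(",") if s.strip()},
        "failure_samples": int(env.get("SIA_FEEDBACK_FAILURE_SAMPLES", "20")),
        # Assumption: only the literal "0" turns the OS-level guard off.
        "private_dir_guard": env.get("SIA_PRIVATE_DIR_GUARD", "1") != "0",
    }


defaults = load_seal_config({})
guard_off = load_seal_config({"SIA_PRIVATE_DIR_GUARD": "0"})
```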

Tests

  • Leak guard: feed eval_data whose results[] contains a gold sentinel;
    assert the built feedback context contains none of it.

  • Render whitelist: non-whitelisted keys (e.g. a reference answer) are dropped.

  • Stratified sampling: failures spread across status/group under the cap.

  • Protected-path matcher matrix: file tools, Bash, ../glob resolution,
    relative-form matching, empty-protected-list no-op.

  • OS-level guard: restricted_access strips then restores permissions; is a
    no-op when disabled; restores on exception; tolerates a missing dir.

  • Prompt notice: present in meta + feedback prompts; names no concrete path.

  • Source-level genericity guards: the seal code paths reference no
    task-specific identifiers.

  • Format preservation: scalar-only graders produce the original JSON shape
    (incl. total); the feedback_context_* goldens are unchanged.

Golden snapshots: only the three prompt goldens change (the Layer-2 notice).
The feedback_context_success_* goldens are byte-identical to before — a
scalar-only results.json renders exactly as it did, just with answer-bearing
arrays excluded.

End-to-end proof (real code path, no mocks)

I verified the fix against the actual code path on the bundled longcot-chess
task, with no API keys or mocks:

  1. Ran the real grader (sia/tasks/longcot-chess/data/public/evaluate.py) against
    its real data/private/answers.json, with a deliberately-wrong submission, to
    produce a genuine results.json whose results[].expected fields are the real
    held-out answers (50 questions, 57 distinct gold tokens).
  2. Fed that same results.json through the gold-free summary builder
    (_build_eval_summary, which now replaces the old json.dumps(results.json)
    dump in _build_feedback_context) and measured what survives.

The honest contamination unit is the recoverable association (question_id →
its gold answer): an agent can only cheat if it can tell which question has
which answer. A bare token is not enough — a numeric chess answer can coincide
with an aggregate count or a digit of the grader's timestamp, revealing nothing.

| Metric (same input results.json) | base 38250cb (full dump) | this branch (_build_eval_summary) |
| --- | --- | --- |
| question_id → gold-answer associations recoverable | 50 / 50 🔴 | 0 / 50 🟢 |
| raw gold-token substring hits | 57 / 57 | 0 / 57 * |
| expected key / results[] array present | yes | no |
| context size | ~10.9 KB | ~0.5 KB |

* Stable at 0 across repeated runs; a substring check can occasionally flicker to
~1/57 when a numeric answer (e.g. 44) lands inside the timestamp (…T11:44:34) —
a coincidence with no question_id attached, so no answer is recoverable.

Same input, opposite outcome: every held-out answer was recoverable before, none
is after. The fix also shrinks the eval section ~20× by dropping the per-item dump (a
turn-pressure / token-cost win).
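
The two leak measures can be approximated with a few lines. This is not the script used for the numbers above: `leak_metrics` is a hypothetical helper, the association count is a coarse upper-bound proxy (id and answer both occurring anywhere in the text), and the records and timestamp are made up.

```python
import json


def leak_metrics(results: list[dict], context: str) -> dict[str, int]:
    """Coarse leak measures for `context` against a grader's per-item records."""
    gold_tokens = {str(r["expected"]) for r in results}
    token_hits = sum(token in context for token in gold_tokens)
    # Upper-bound proxy: count an association if the id and its answer both occur.
    associations = sum(
        str(r["question_id"]) in context and str(r["expected"]) in context
        for r in results
    )
    return {"associations": associations, "token_hits": token_hits}


# Made-up records and contexts, shaped like the before/after comparison.
results = [
    {"question_id": "q1", "expected": "Nf3", "status": "WRONG"},
    {"question_id": "q2", "expected": "44", "status": "WRONG"},
]
full_dump = json.dumps({"summary": {"total": 2}, "results": results})
curated = json.dumps({"summary": {"total": 2, "timestamp": "2026-01-01T11:44:34"}})
before = leak_metrics(results, full_dump)
after = leak_metrics(results, curated)
```

In this toy case the curated context still registers one raw token hit, because `44` appears inside the timestamp, while the association count is zero since no question id survives. That is the same distinction the footnote draws.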

Scope

This is purely the harness-side leak fix. It is task-agnostic — no coupling to
any specific task. Graders that already avoid emitting gold are unaffected;
graders that want richer feedback can adopt the items[] contract.

Validation

pytest tests/, ruff check, ruff format --check, ty check sia/, and
python -m build all pass.

This contradicts the project's own stated contract — README ("private/ ...
never exposed to the agent") and docs/walkthrough.md ("the LLM is not told about
data/private/ ... prevents the agent from cheating"). It is reproducible on a
bundled task: the `longcot-chess` grader writes results.json as
`{"summary": {...}, "results": [{"expected": <correct answer>, ...}]}`, and
`_build_feedback_context` dumps the whole file — every held-out answer — into the
feedback-agent prompt via `json.dumps(eval_data)`. The feedback agent (told to
improve the target agent, and able to read the filesystem) can then optimize
toward the answer key instead of toward a genuinely better solution — a
reward-hacking / contamination surface. (lawbench exposes only aggregate
per-class accuracy; gpqa writes its gold to a different filename and isn't
dumped — one leaking bundled task is enough.)

This fixes the leak with three composable, task-agnostic layers:

Layer 1 — curated, reference-answer-free eval summary.
  `_build_feedback_context` no longer dumps results.json. It calls a new
  `_build_eval_summary` that, via `_collect_scalars`, emits every SCALAR metric at
  any nesting depth as a JSON object — preserving the original results.json shape
  and field names (top-level `accuracy`/`correct`/`total`, or a nested `summary`
  block; whatever the grader wrote). The reference-answer-free guarantee comes
  from DROPPING every list: per-item record arrays (`results`/`details`, where the
  bundled graders put the answer key) are excluded wholesale; nested dicts are
  recursed for their scalars. The only per-item channel is the generic, opt-in
  `items[]` array, surfaced solely through a fixed render whitelist
  (id/status/group/category/input/output/detail) so no task-specific key can ride
  along; when it carries failures, a capped sample (diversified across status and
  group) plus an anti-reward-hack framing line are appended. A grader that emits
  only scalars gets output equivalent to the original dump, minus answer-bearing
  arrays. `summarize_items` also stores a bounded items digest in context.md.

Layer 2 — held-out-integrity prompt notice.
  A generic, path-free instruction injected into the meta and feedback prompts
  telling the agent that grader-only ground truth exists in a private directory
  and must not be accessed.

Layer 3 — protected-path defense-in-depth (two complementary guards), keyed off
the same generic `compute_protected_paths(task_dir)`:

  3a — claude-impl PreToolUse deny hook.
    Denies any Bash/Read/Edit/Glob/... tool call resolving inside the task's
    held-out dir (`<task_dir>/data/private`), wired via a new `protected_paths`
    kwarg. Defaults to None (no hook), so behavior is byte-identical when unused.
    The Bash check is a path-string match, and only the `claude` impl honors the
    kwarg.

  3b — impl-agnostic OS-level guard.
    A `restricted_access` context manager strips all filesystem permissions on
    `data/private` (chmod 000) for the duration of the sequential meta/feedback
    agent run, restoring the original mode on exit (including on error). It wraps
    only the agent calls — never grading, which legitimately reads the dir. Gated
    by `SIA_PRIVATE_DIR_GUARD` (default on). 3b closes two gaps 3a leaves open:
    it protects the held-out dir for the openai/pydantic-ai/openhands impls that
    ignore the kwarg, and it fails the read at the filesystem layer, which an
    obfuscated shell or subprocess cannot talk around. Residual caveats: a root
    process or a copy-out-then-read still bypasses it, and a SIGKILL mid-run
    leaves the dir stripped until restored — container sandboxing (--sandbox
    docker) remains the strongest boundary.

Three config knobs (`SIA_VERIFIER_PASS_STATUSES`, `SIA_FEEDBACK_FAILURE_SAMPLES`,
`SIA_PRIVATE_DIR_GUARD`) are documented in docs/configuration.md; the first two
default to today's behavior, the third turns on the OS-level guard by default.
Tests cover the leak guard (reference-answer sentinel in results[] never
appears), the real nested-`summary`/`results[]` grader shape, scalar recursion,
the render whitelist, the protected-path matcher matrix, the OS-level guard
(strip/restore, no-op when disabled, restore-on-error, missing-path tolerance),
the prompt notice, and source-level genericity guards. Only the three prompt
goldens change (the Layer-2 notice); the feedback-context goldens are unchanged.
selvamHexo force-pushed the fix/feedback-gold-leak branch from 751b2a4 to e93c7b5 on June 16, 2026 15:50
selvamHexo requested a review from yogendrahexo on June 16, 2026 15:51