Fix: stop leaking held-out reference answers to the meta/feedback agents#36
Open
francisco-perez-sorrosal wants to merge 1 commit into
Open
Conversation
8RON8
approved these changes
Jun 12, 2026
This contradicts the project's own stated contract — README ("private/ ...
never exposed to the agent") and docs/walkthrough.md ("the LLM is not told about
data/private/ ... prevents the agent from cheating"). It is reproducible on a
bundled task: the `longcot-chess` grader writes results.json as
`{"summary": {...}, "results": [{"expected": <correct answer>, ...}]}`, and
`_build_feedback_context` dumps the whole file — every held-out answer — into the
feedback-agent prompt via `json.dumps(eval_data)`. The feedback agent (told to
improve the target agent, and able to read the filesystem) can then optimize
toward the answer key instead of toward a genuinely better solution — a
reward-hacking / contamination surface. (lawbench exposes only aggregate
per-class accuracy; gpqa writes its gold to a different filename and isn't
dumped — one leaking bundled task is enough.)
This fixes the leak with three composable, task-agnostic layers:
Layer 1 — curated, reference-answer-free eval summary.
`_build_feedback_context` no longer dumps results.json. It calls a new
`_build_eval_summary` that, via `_collect_scalars`, emits every SCALAR metric at
any nesting depth as a JSON object — preserving the original results.json shape
and field names (top-level `accuracy`/`correct`/`total`, or a nested `summary`
block; whatever the grader wrote). The reference-answer-free guarantee comes
from DROPPING every list: per-item record arrays (`results`/`details`, where the
bundled graders put the answer key) are excluded wholesale; nested dicts are
recursed for their scalars. The only per-item channel is the generic, opt-in
`items[]` array, surfaced solely through a fixed render whitelist
(id/status/group/category/input/output/detail) so no task-specific key can ride
along; when it carries failures, a capped sample (diversified across status and
group) plus an anti-reward-hack framing line are appended. A grader that emits
only scalars gets output equivalent to the original dump, minus answer-bearing
arrays. `summarize_items` also stores a bounded items digest in context.md.
Layer 2 — held-out-integrity prompt notice.
A generic, path-free instruction injected into the meta and feedback prompts
telling the agent that grader-only ground truth exists in a private directory
and must not be accessed.
Layer 3 — protected-path defense-in-depth (two complementary guards), keyed off
the same generic `compute_protected_paths(task_dir)`:
3a — claude-impl PreToolUse deny hook.
Denies any Bash/Read/Edit/Glob/... tool call resolving inside the task's
held-out dir (`<task_dir>/data/private`), wired via a new `protected_paths`
kwarg. Defaults to None (no hook), so behavior is byte-identical when unused.
The Bash check is a path-string match, and only the `claude` impl honors the
kwarg.
3b — impl-agnostic OS-level guard.
A `restricted_access` context manager strips all filesystem permissions on
`data/private` (chmod 000) for the duration of the sequential meta/feedback
agent run, restoring the original mode on exit (including on error). It wraps
only the agent calls — never grading, which legitimately reads the dir. Gated
by `SIA_PRIVATE_DIR_GUARD` (default on). 3b closes two gaps 3a leaves open:
it protects the held-out dir for the openai/pydantic-ai/openhands impls that
ignore the kwarg, and it fails the read at the filesystem layer, which an
obfuscated shell or subprocess cannot talk around. Residual caveats: a root
process or a copy-out-then-read still bypasses it, and a SIGKILL mid-run
leaves the dir stripped until restored — container sandboxing (--sandbox
docker) remains the strongest boundary.
Three config knobs (`SIA_VERIFIER_PASS_STATUSES`, `SIA_FEEDBACK_FAILURE_SAMPLES`,
`SIA_PRIVATE_DIR_GUARD`) are documented in docs/configuration.md; the first two
default to today's behavior, the third turns on the OS-level guard by default.
Tests cover the leak guard (reference-answer sentinel in results[] never
appears), the real nested-`summary`/`results[]` grader shape, scalar recursion,
the render whitelist, the protected-path matcher matrix, the OS-level guard
(strip/restore, no-op when disabled, restore-on-error, missing-path tolerance),
the prompt notice, and source-level genericity guards. Only the three prompt
goldens change (the Layer-2 notice); the feedback-context goldens are unchanged.
751b2a4 to
e93c7b5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix: stop leaking held-out reference answers to the meta/feedback agents
The problem
In the self-improvement loop, the feedback context is built in
sia/orchestrator.py::_build_feedback_contextby dumping the entireresults.jsoninto the feedback-agent prompt:When a task's grader writes per-item records that include the held-out
reference answers (the answer key), those answers flow straight into the
feedback-agent prompt — and, because the meta/feedback agent can read the
filesystem, the held-out ground truth is effectively visible to the very agent
whose job is to improve the solution.
This contradicts the project's own stated contract:
This is reproducible on a bundled task today. The
longcot-chessgraderwrites
results.jsonas{"summary": {...}, "results": [{"question_id", "expected": <correct answer>, "predicted", "status"}, ...]}. The orchestratorruns it via
evaluate.py --gen-dir <gen>(so it lands at<gen>/results.json),then
_build_feedback_contextdumps the whole file — including everyresults[].expectedanswer — intogen_N/feedback_prompt. (Forlawbenchtheexposure is milder — only aggregate
per_classaccuracy;gpqawrites its goldto a different filename and isn't dumped.) One concrete leaking bundled task is
enough to violate the invariant above.
This is a reward-hacking / contamination surface: the agent can optimize
toward matching the answer key rather than toward a genuinely better,
generalizing solution. It silently undermines the integrity of every
self-improvement run on a task that grades against held-out gold.
The fix — three composable, task-agnostic layers
Two distinct leak vectors, named up front. The reported bug is harness-side:
the orchestrator itself hands the agent the answer key by dumping
results.jsoninto the prompt it builds. That is the vector that is reproducible today, and
Layer 1 is the actual fix for it — it curates what the harness emits, so the gold
never enters the prompt. Layers 2 and 3 are defense-in-depth against a second,
latent vector the prompt-side fix doesn't touch — the agent reaching into
data/privateon its own during its run (read/glob/cat). Keeping the twovectors separate is what lets the seal stay task-agnostic: Layer 1 closes the
trusted channel (the harness), Layers 2–3 narrow the agent's own channel.
Each layer is generic (no task-specific coupling) and defaults to today's
behavior where applicable, so existing tasks keep working.
Layer 1 — curated, reference-answer-free eval summary
_build_feedback_contextno longer dumpsresults.json. It calls a new_build_eval_summarythat emits only:original
results.jsonshape and field names (top-levelaccuracy/correct/total, or a nestedsummaryblock — whatever the grader wrote). This is anear drop-in for the old dump for a grader that emits only scalars.
items[]contract: each item is an open dict, of whichonly
id / status / group / category / input / output / detailareframework-recognized.
The reference-answer-free guarantee comes from dropping every list: per-item
record arrays —
results/details, where the bundled graders put the answerkey — are excluded wholesale; nested dicts (aggregate blocks) are recursed for
their scalars. The only per-item channel is the
items[]array, surfaced solelythrough the render whitelist, so no task-specific key can ride along. Failing
items are sampled (capped, diversified across
statusandgroup). A boundedsummarize_itemsdigest is also stored incontext.mdmetrics.Concrete case (
longcot-chess, a bundled task): its grader writesresults.jsonas{"summary": {...}, "results": [{"expected": <correct answer>, …}]},and the old code dumped the whole file — every held-out answer — into the
feedback prompt. After the fix, the feedback agent sees the nested
summaryscalars and none of the
results[].expectedanswers.Opt-in + graceful degradation: a grader opts into rich, reference-free
feedback by emitting
items[]withinput/outputbut not the referenceanswer. Graders that emit only scalars still get the original JSON summary — no
breakage, no format change.
Two opt-in env vars (
SIA_VERIFIER_PASS_STATUSES,SIA_FEEDBACK_FAILURE_SAMPLES)tune the failed-item sample; both default to current behavior. They are
documented in
docs/configuration.md.Layer 2 — held-out-integrity prompt notice
A generic, path-free instruction is injected into both the meta and feedback
prompts, telling the agent that grader-only ground truth lives in a private
directory and must not be read/listed/globbed. It names no concrete path, so it
applies to any task.
Layer 3 — protected-path defense-in-depth (two complementary guards)
During the meta and feedback agent runs, the task's held-out dir
(
<task_dir>/data/private) is protected by two mechanisms, both keyed off the samegeneric
compute_protected_paths(task_dir):3a — claude-impl
PreToolUsedeny hook. Aclaude_agent_sdkhook denies anyBash/Read/Edit/Glob/ … tool call that resolves inside the held-outdir, wired through a
protected_pathskwarg onrun_agent. It defaults toNone(no hook registered), so behavior is byte-identical when unused, and is honored only
by the
claudeimpl (others ignore the kwarg).3b — impl-agnostic OS-level guard. A
restricted_accesscontext manager stripsall filesystem permissions on
data/private(chmod 000) for the duration of thesequential meta/feedback agent run, restoring the original mode on exit (including on
error). It is wrapped only around the agent calls — never around grading, which
legitimately reads the dir. Gated by
SIA_PRIVATE_DIR_GUARD(default on).3b closes two gaps that 3a alone leaves open:
claudeimpl; theopenai/pydantic-ai/openhandsimpls ignore the kwarg entirely. The OS guard is enforced in theorchestrator, so it protects the held-out dir regardless of agent impl.
Bashbranch is a path-string match — robust against a straycat .../data/private/…, but not against an obfuscated shell or a subprocess. TheOS guard fails the read at the filesystem layer, which a string-match cannot be
talked around.
Honest caveat: the OS guard is process-global (it chmods a shared dir), so it is
applied only around the sequential agent calls. It does not defend against a
root process or a copy-out-then-read, and a hard kill (
SIGKILL) mid-run leaves thedir stripped until restored. All of Layer 3 is defense-in-depth behind the primary
seal (Layer 1); container-level sandboxing (
--sandbox docker) remains thestrongest boundary, not a security guarantee on its own.
Why SIA's existing Docker sandbox doesn't close this
SIA already ships a configurable execution sandbox (
SANDBOX_MODE=none|docker,with
DOCKER_IMAGE/DOCKER_MEMORY_LIMIT/ … and a--sandbox dockerinvocation), soit's worth saying explicitly why it does not solve the gold leak. First, the
primary leak (Layer 1) is harness-side: the orchestrator itself dumps the
gold-bearing
results.jsoninto the prompt text it builds, so the answers ride into thefeedback prompt whether or not the agent's code runs inside a container — an execution
sandbox sees none of it. Second, for the secondary self-exfiltration vector (an agent
reading
data/privateon its own), Docker is indeed the strongest containment — but itis off by default (
SANDBOX_MODE="none") and execution-focused: a heavy, opt-in,run-level mechanism. That is exactly the gap Layer 3b fills — a lightweight, default-on,
per-agent-run
chmod 000backstop that needs no container. The two are complements,not substitutes: Docker is the proper OS-level isolation when a run opts into it; 3b is
the in-process, always-on default.
Configuration
Three env vars, all documented in
docs/configuration.md:SIA_VERIFIER_PASS_STATUSES(defaultCORRECT,PASS,correct) — the set ofitems[]statuses that count as passing; any other status is a failure.SIA_FEEDBACK_FAILURE_SAMPLES(default20) — cap on failing items surfacedin the summary.
SIA_PRIVATE_DIR_GUARD(default on) — toggles Layer 3b's OS-level guard.Set
0to disable (e.g. if a hard-kill left a dir stripped and you want to optout while debugging).
The first two default to current behavior; the third adds the OS-level backstop
on by default and is the only runtime-behavior change for tasks that ship a
data/privatedir.Tests
Leak guard: feed
eval_datawhoseresults[]contains a gold sentinel;assert the built feedback context contains none of it.
Render whitelist: non-whitelisted keys (e.g. a reference answer) are dropped.
Stratified sampling: failures spread across status/group under the cap.
Protected-path matcher matrix: file tools, Bash,
../glob resolution,relative-form matching, empty-protected-list no-op.
OS-level guard:
restricted_accessstrips then restores permissions; is ano-op when disabled; restores on exception; tolerates a missing dir.
Prompt notice: present in meta + feedback prompts; names no concrete path.
Source-level genericity guards: the seal code paths reference no
task-specific identifiers.
Format preservation: scalar-only graders produce the original JSON shape
(incl.
total); thefeedback_context_*goldens are unchanged.Golden snapshots: only the three prompt goldens change (the Layer-2 notice).
The
feedback_context_success_*goldens are byte-identical to before — ascalar-only
results.jsonrenders exactly as it did, just with answer-bearingarrays excluded.
End-to-end proof (real code path, no mocks)
I verified the fix against the actual code path on the bundled
longcot-chesstask, with no API keys or mocks:
sia/tasks/longcot-chess/data/public/evaluate.py) againstits real
data/private/answers.json, with a deliberately-wrong submission, toproduce a genuine
results.jsonwhoseresults[].expectedfields are the realheld-out answers (50 questions, 57 distinct gold tokens).
results.jsonthrough the gold-free summary builder(
_build_eval_summary, which now replaces the oldjson.dumps(results.json)dump in
_build_feedback_context) and measured what survives.The honest contamination unit is the recoverable association (
question_id→its gold answer): an agent can only cheat if it can tell which question has
which answer. A bare token is not enough — a numeric chess answer can coincide
with an aggregate count or a digit of the grader's
timestamp, revealing nothing.results.json)38250cb(full dump)_build_eval_summary)question_id→ gold-answer associations recoverableexpectedkey /results[]array present* Stable at 0 across repeated runs; a substring check can occasionally flicker to
~1/57 when a numeric answer (e.g.
44) lands inside the timestamp (…T11:44:34) —a coincidence with no
question_idattached, so no answer is recoverable.Same input, opposite outcome: every held-out answer was recoverable before, none
is after. The fix also shrinks the eval section ~20× by dropping the per-item dump (a
turn-pressure / token-cost win).
Scope
This is purely the harness-side leak fix. It is task-agnostic — no coupling to
any specific task. Graders that already avoid emitting gold are unaffected;
graders that want richer feedback can adopt the
items[]contract.Validation
pytest tests/,ruff check,ruff format --check,ty check sia/, andpython -m buildall pass.