Fix: stop leaking held-out reference answers to the meta/feedback agents by francisco-perez-sorrosal · Pull Request #36 · hexo-ai/sia

francisco-perez-sorrosal · 2026-06-09T19:07:15Z

Fix: stop leaking held-out reference answers to the meta/feedback agents

The problem

In the self-improvement loop, the feedback context is built in
sia/orchestrator.py::_build_feedback_context by dumping the entire
results.json into the feedback-agent prompt:

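    # f is the open handle on <gen>/results.json
    eval_data = json.load(f)
    # The whole file, per-item records included, is interpolated into the prompt
    eval_results_section = f"""
    **EVALUATION RESULTS**:
    ```json
    {json.dumps(eval_data, indent=2)}
    ```
    """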

When a task's grader writes per-item records that include the held-out
reference answers (the answer key), those answers flow straight into the
feedback-agent prompt — and, because the meta/feedback agent can read the
filesystem, the held-out ground truth is effectively visible to the very agent
whose job is to improve the solution.

This contradicts the project's own stated contract:

private/ — Held-out eval data; never exposed to the agent — README.md

The LLM is not told about data/private/ during evaluation. This
prevents the agent from cheating and ensures fair scoring. — docs/walkthrough.md

This is reproducible on a bundled task today. The longcot-chess grader
writes results.json as {"summary": {...}, "results": [{"question_id", "expected": <correct answer>, "predicted", "status"}, ...]}. The orchestrator
runs it via evaluate.py --gen-dir <gen> (so it lands at <gen>/results.json),
then _build_feedback_context dumps the whole file — including every
results[].expected answer — into gen_N/feedback_prompt. (For lawbench the
exposure is milder — only aggregate per_class accuracy; gpqa writes its gold
to a different filename and isn't dumped.) One concrete leaking bundled task is
enough to violate the invariant above.

This is a reward-hacking / contamination surface: the agent can optimize
toward matching the answer key rather than toward a genuinely better,
generalizing solution. It silently undermines the integrity of every
self-improvement run on a task that grades against held-out gold.

The fix — three composable, task-agnostic layers

Two distinct leak vectors, named up front. The reported bug is harness-side:
the orchestrator itself hands the agent the answer key by dumping results.json
into the prompt it builds. That is the vector that is reproducible today, and
Layer 1 is the actual fix for it — it curates what the harness emits, so the gold
never enters the prompt. Layers 2 and 3 are defense-in-depth against a second,
latent vector the prompt-side fix doesn't touch — the agent reaching into
data/private on its own during its run (read/glob/cat). Keeping the two
vectors separate is what lets the seal stay task-agnostic: Layer 1 closes the
trusted channel (the harness), Layers 2–3 narrow the agent's own channel.

Each layer is generic (no task-specific coupling) and defaults to today's
behavior where applicable, so existing tasks keep working.

Layer 1 — curated, reference-answer-free eval summary

_build_feedback_context no longer dumps results.json. It calls a new
_build_eval_summary that emits only:

- Every scalar metric, at any nesting depth, rendered as a JSON object in the
  original results.json shape and field names (top-level accuracy/correct/total,
  or a nested summary block; whatever the grader wrote). This is a near drop-in
  for the old dump for a grader that emits only scalars.
- A generic, opt-in items[] contract: each item is an open dict, of which
  only id / status / group / category / input / output / detail are
  framework-recognized.

The reference-answer-free guarantee comes from dropping every list: per-item
record arrays — results / details, where the bundled graders put the answer
key — are excluded wholesale; nested dicts (aggregate blocks) are recursed for
their scalars. The only per-item channel is the items[] array, surfaced solely
through the render whitelist, so no task-specific key can ride along. Failing
items are sampled (capped, diversified across status and group). A bounded
summarize_items digest is also stored in context.md metrics.
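The scalar-only recursion described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `collect_scalars` and `build_eval_summary` are hypothetical stand-ins for `_collect_scalars` and `_build_eval_summary`, the sample data and the `GOLD_SENTINEL` strings are invented, and the separate items[] channel is left out.

```python
import json
from typing import Any

# Types treated as scalar metrics (an assumption for this sketch).
SCALAR_TYPES = (str, int, float, bool, type(None))


def collect_scalars(node: Any) -> dict:
    """Keep scalars, recurse into nested dicts, and drop every list."""
    if not isinstance(node, dict):
        return {}
    kept = {}
    for key, value in node.items():
        if isinstance(value, list):
            continue  # per-item arrays (results/details) are excluded wholesale
        if isinstance(value, dict):
            nested = collect_scalars(value)
            if nested:  # skip aggregate blocks that end up empty
                kept[key] = nested
        elif isinstance(value, SCALAR_TYPES):
            kept[key] = value
    return kept


def build_eval_summary(eval_data: dict) -> str:
    """Render the surviving scalars in the original shape and field names."""
    return json.dumps(collect_scalars(eval_data), indent=2)


# Hypothetical grader output in the nested summary/results shape.
eval_data = {
    "summary": {"accuracy": 0.0, "correct": 0, "total": 2},
    "results": [
        {"question_id": "q1", "expected": "GOLD_SENTINEL_1", "status": "WRONG"},
        {"question_id": "q2", "expected": "GOLD_SENTINEL_2", "status": "WRONG"},
    ],
}
summary_text = build_eval_summary(eval_data)
```

Because lists are skipped rather than filtered key by key, the sketch does not need to know which field a grader uses for its answer key, which is what keeps the approach task-agnostic.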

Concrete case (longcot-chess, a bundled task): its grader writes
results.json as {"summary": {...}, "results": [{"expected": <correct answer>, …}]},
and the old code dumped the whole file — every held-out answer — into the
feedback prompt. After the fix, the feedback agent sees the nested summary
scalars and none of the results[].expected answers.

Opt-in + graceful degradation: a grader opts into rich, reference-free
feedback by emitting items[] with input/output but not the reference
answer. Graders that emit only scalars still get the original JSON summary — no
breakage, no format change.

Two opt-in env vars (SIA_VERIFIER_PASS_STATUSES, SIA_FEEDBACK_FAILURE_SAMPLES)
tune the failed-item sample; both default to current behavior. They are
documented in docs/configuration.md.
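A rough sketch of how a render whitelist and a capped, diversified failure sample might fit together. The function names, the round-robin strategy, the sample items, and the `reference` key are all assumptions made for illustration; only the whitelist fields and the default pass statuses come from the description above.

```python
from collections import defaultdict
from itertools import zip_longest

# Framework-recognized item fields, as listed in the PR description.
RENDER_WHITELIST = ("id", "status", "group", "category", "input", "output", "detail")


def render_item(item: dict) -> dict:
    """Copy only whitelisted keys, so no task-specific key can ride along."""
    return {key: item[key] for key in RENDER_WHITELIST if key in item}


def sample_failures(items: list, pass_statuses: set, cap: int) -> list:
    """Pick up to `cap` failing items, spread across (status, group) buckets."""
    buckets = defaultdict(list)
    for item in items:
        if item.get("status") not in pass_statuses:
            buckets[(item.get("status"), item.get("group"))].append(item)
    picked = []
    # Round-robin over the buckets so one large bucket cannot fill the cap.
    for row in zip_longest(*buckets.values()):
        for item in row:
            if item is not None and len(picked) < cap:
                picked.append(render_item(item))
    return picked


# Hypothetical items; "reference" stands in for any non-whitelisted key.
items = [
    {"id": "a", "status": "CORRECT", "group": "openings"},
    {"id": "b", "status": "WRONG", "group": "openings", "reference": "SECRET"},
    {"id": "c", "status": "WRONG", "group": "openings"},
    {"id": "d", "status": "WRONG", "group": "endgames"},
    {"id": "e", "status": "TIMEOUT", "group": "endgames"},
]
sampled = sample_failures(items, {"CORRECT", "PASS", "correct"}, cap=3)
```

With a cap of 3, the sample takes one item from each of the three failure buckets instead of two from the first, and the `reference` key is dropped at render time.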

Layer 2 — held-out-integrity prompt notice

A generic, path-free instruction is injected into both the meta and feedback
prompts, telling the agent that grader-only ground truth lives in a private
directory and must not be read/listed/globbed. It names no concrete path, so it
applies to any task.

Layer 3 — protected-path defense-in-depth (two complementary guards)

During the meta and feedback agent runs, the task's held-out dir
(<task_dir>/data/private) is protected by two mechanisms, both keyed off the same
generic compute_protected_paths(task_dir):

3a — claude-impl PreToolUse deny hook. A claude_agent_sdk hook denies any
Bash / Read / Edit / Glob / … tool call that resolves inside the held-out
dir, wired through a protected_paths kwarg on run_agent. It defaults to None
(no hook registered), so behavior is byte-identical when unused, and is honored only
by the claude impl (others ignore the kwarg).
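The path check behind such a hook can be approximated as below. This sketch covers only the resolution of a file-tool path argument; the hook registration through claude_agent_sdk and the Bash string match are not shown, and `protected_dirs` and `touches_protected` are hypothetical names, not the PR's `compute_protected_paths`.

```python
import tempfile
from pathlib import Path


def protected_dirs(task_dir: str) -> list:
    """Held-out dir for a task: <task_dir>/data/private, fully resolved."""
    return [Path(task_dir, "data", "private").resolve()]


def touches_protected(candidate: str, cwd: str, protected: list) -> bool:
    """True if `candidate`, resolved against `cwd`, is a protected dir or lies inside one."""
    resolved = Path(cwd, candidate).resolve()
    return any(resolved == p or p in resolved.parents for p in protected)


# A throwaway task dir so the example does not depend on a real checkout.
task_dir = tempfile.mkdtemp()
protected = protected_dirs(task_dir)

expected = {
    "data/private/answers.json": True,
    "data/public/../private/answers.json": True,  # ".." is resolved before matching
    "data/public/evaluate.py": False,
    "data/private_notes.txt": False,  # sibling name that a naive prefix match would flag
}
observed = {path: touches_protected(path, task_dir, protected) for path in expected}
```

Comparing resolved path components rather than raw string prefixes is a design choice of the sketch: it handles `..` segments and avoids false positives on sibling names that merely start with `private`.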

3b — impl-agnostic OS-level guard. A restricted_access context manager strips
all filesystem permissions on data/private (chmod 000) for the duration of the
sequential meta/feedback agent run, restoring the original mode on exit (including on
error). It is wrapped only around the agent calls — never around grading, which
legitimately reads the dir. Gated by SIA_PRIVATE_DIR_GUARD (default on).
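A minimal sketch of a strip-and-restore context manager in the spirit of 3b, assuming a POSIX filesystem. `restricted_access_sketch` and its `enabled` parameter are hypothetical; the real `restricted_access` may differ in signature and details.

```python
import os
import stat
import tempfile
from contextlib import contextmanager


@contextmanager
def restricted_access_sketch(path: str, enabled: bool = True):
    """Strip all permissions on `path` for the duration of the block, then restore them."""
    if not enabled or not os.path.isdir(path):
        yield  # no-op when disabled or when the dir is missing
        return
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, 0o000)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, but not after SIGKILL.
        os.chmod(path, original_mode)


def current_mode(path: str) -> int:
    return stat.S_IMODE(os.stat(path).st_mode)


private_dir = tempfile.mkdtemp()
os.chmod(private_dir, 0o750)

with restricted_access_sketch(private_dir):
    mode_during = current_mode(private_dir)  # the agent calls would run here
mode_after = current_mode(private_dir)

try:
    with restricted_access_sketch(private_dir):
        raise RuntimeError("simulated agent failure")
except RuntimeError:
    pass
mode_after_error = current_mode(private_dir)
```

The `finally` clause is what gives restore-on-error, and it is also why a hard kill is a residual risk: nothing runs to put the mode back.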

3b closes two gaps that 3a alone leaves open:

- Cross-impl. 3a only protects the claude impl; the openai / pydantic-ai /
  openhands impls ignore the kwarg entirely. The OS guard is enforced in the
  orchestrator, so it protects the held-out dir regardless of agent impl.
- Evasion. 3a's Bash branch is a path-string match: robust against a stray
  cat .../data/private/…, but not against an obfuscated shell or a subprocess.
  The OS guard fails the read at the filesystem layer, which an obfuscated
  command cannot talk its way around.

Honest caveat: the OS guard is process-global (it chmods a shared dir), so it is
applied only around the sequential agent calls. It does not defend against a
root process or a copy-out-then-read, and a hard kill (SIGKILL) mid-run leaves the
dir stripped until restored. All of Layer 3 is defense-in-depth behind the primary
seal (Layer 1), not a security guarantee on its own; container-level sandboxing
(--sandbox docker) remains the strongest boundary.

Why SIA's existing Docker sandbox doesn't close this

SIA already ships a configurable execution sandbox (SANDBOX_MODE = none | docker,
with DOCKER_IMAGE / DOCKER_MEMORY_LIMIT / … and a --sandbox docker invocation), so
it's worth saying explicitly why it does not solve the gold leak. First, the
primary leak (the one Layer 1 closes) is harness-side: the orchestrator itself dumps the
gold-bearing results.json into the prompt text it builds, so the answers ride into the
feedback prompt whether or not the agent's code runs inside a container — an execution
sandbox sees none of it. Second, for the secondary self-exfiltration vector (an agent
reading data/private on its own), Docker is indeed the strongest containment — but it
is off by default (SANDBOX_MODE="none") and execution-focused: a heavy, opt-in,
run-level mechanism. That is exactly the gap Layer 3b fills — a lightweight, default-on,
per-agent-run chmod 000 backstop that needs no container. The two are complements,
not substitutes: Docker is the proper OS-level isolation when a run opts into it; 3b is
the in-process, on-by-default backstop.

Configuration

Three env vars, all documented in docs/configuration.md:

- SIA_VERIFIER_PASS_STATUSES (default CORRECT,PASS,correct): the set of
  items[] statuses that count as passing; any other status is a failure.
- SIA_FEEDBACK_FAILURE_SAMPLES (default 20): cap on failing items surfaced
  in the summary.
- SIA_PRIVATE_DIR_GUARD (default on): toggles Layer 3b's OS-level guard.
  Set it to 0 to disable (e.g. if a hard kill left a dir stripped and you want
  to opt out while debugging).

The first two default to current behavior; the third adds the OS-level backstop
on by default and is the only runtime-behavior change for tasks that ship a
data/private dir.
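One way the three variables could be read, shown as a hedged sketch. The `SealConfig` dataclass and `load_seal_config` are hypothetical, and treating only the literal value 0 as "off" is an assumption; docs/configuration.md is the authority on accepted values.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SealConfig:
    pass_statuses: frozenset
    failure_samples: int
    private_dir_guard: bool


def load_seal_config(env: Mapping = os.environ) -> SealConfig:
    """Read the three knobs, falling back to the defaults listed above."""
    raw_statuses = env.get("SIA_VERIFIER_PASS_STATUSES", "CORRECT,PASS,correct")
    return SealConfig(
        pass_statuses=frozenset(s.strip() for s in raw_statuses.split(",") if s.strip()),
        failure_samples=int(env.get("SIA_FEEDBACK_FAILURE_SAMPLES", "20")),
        # Assumption: the guard is on unless the variable is exactly "0".
        private_dir_guard=env.get("SIA_PRIVATE_DIR_GUARD", "1").strip() != "0",
    )


defaults = load_seal_config({})
debugging = load_seal_config(
    {"SIA_PRIVATE_DIR_GUARD": "0", "SIA_FEEDBACK_FAILURE_SAMPLES": "5"}
)
```

Passing the environment as a mapping keeps the parsing testable without mutating `os.environ`.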

Tests

- Leak guard: feed eval_data whose results[] contains a gold sentinel;
  assert the built feedback context contains none of it.
- Render whitelist: non-whitelisted keys (e.g. a reference answer) are dropped.
- Stratified sampling: failures spread across status/group under the cap.
- Protected-path matcher matrix: file tools, Bash, ../glob resolution,
  relative-form matching, empty-protected-list no-op.
- OS-level guard: restricted_access strips then restores permissions; is a
  no-op when disabled; restores on exception; tolerates a missing dir.
- Prompt notice: present in meta + feedback prompts; names no concrete path.
- Source-level genericity guards: the seal code paths reference no
  task-specific identifiers.
- Format preservation: scalar-only graders produce the original JSON shape
  (incl. total); the feedback_context_* goldens are unchanged.

Golden snapshots: only the three prompt goldens change (the Layer-2 notice).
The feedback_context_success_* goldens are byte-identical to before — a
scalar-only results.json renders exactly as it did, just with answer-bearing
arrays excluded.

End-to-end proof (real code path, no mocks)

I verified the fix against the actual code path on the bundled longcot-chess
task, with no API keys or mocks:

1. Ran the real grader (sia/tasks/longcot-chess/data/public/evaluate.py) against
   its real data/private/answers.json, with a deliberately-wrong submission, to
   produce a genuine results.json whose results[].expected fields are the real
   held-out answers (50 questions, 57 distinct gold tokens).
2. Fed that same results.json through the gold-free summary builder
   (_build_eval_summary, which now replaces the old json.dumps(results.json)
   dump in _build_feedback_context) and measured what survives.

The honest contamination unit is the recoverable association (question_id →
its gold answer): an agent can only cheat if it can tell which question has
which answer. A bare token is not enough — a numeric chess answer can coincide
with an aggregate count or a digit of the grader's timestamp, revealing nothing.

| Metric (same input `results.json`) | base `38250cb` (full dump) | this branch (`_build_eval_summary`) |
| --- | --- | --- |
| `question_id` → gold-answer associations recoverable | 50 / 50 🔴 | 0 / 50 🟢 |
| raw gold-token substring hits | 57 / 57 | 0 / 57 * |
| `expected` key / `results[]` array present | yes | no |
| context size | ~10.9 KB | ~0.5 KB |

* Stable at 0 across repeated runs; a substring check can occasionally flicker to
~1/57 when a numeric answer (e.g. 44) lands inside the timestamp (…T11:44:34) —
a coincidence with no question_id attached, so no answer is recoverable.

Same input, opposite outcome: every held-out answer was recoverable before, none
is after. The fix also shrinks the eval section ~20× by dropping the per-item dump (a
turn-pressure / token-cost win).
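The difference between a bare substring hit and a recoverable association can be illustrated with a crude measurement sketch. This is not the script used for the table above: `measure_leak`, the character-window heuristic, and the sample records are all assumptions, and the sketch counts per record rather than per distinct token.

```python
import json


def measure_leak(context: str, records: list, window: int = 120) -> tuple:
    """Return (raw gold-token hits, question_id -> gold associations) found in `context`.

    Heuristic: an association counts as recoverable when the gold answer
    appears within `window` characters after an occurrence of its question_id.
    """
    token_hits = 0
    associations = 0
    for record in records:
        qid, gold = str(record["question_id"]), str(record["expected"])
        if gold in context:
            token_hits += 1
        start = context.find(qid)
        while start != -1:
            if gold in context[start:start + window]:
                associations += 1
                break
            start = context.find(qid, start + 1)
    return token_hits, associations


# Hypothetical data: the gold "44" happens to equal the aggregate total.
records = [
    {"question_id": "chess_001", "expected": "44"},
    {"question_id": "chess_002", "expected": "Qh5#"},
]
summary = {"correct": 0, "total": 44}
full_dump = json.dumps({"summary": summary, "results": records})
curated = json.dumps({"summary": summary})

before = measure_leak(full_dump, records)
after = measure_leak(curated, records)
```

In the curated context the token `44` still matches the aggregate count, so the raw hit count is 1, but with no question_id attached the association count stays at 0, which is the distinction the footnote draws.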

Scope

This is purely the harness-side leak fix. It is task-agnostic — no coupling to
any specific task. Graders that already avoid emitting gold are unaffected;
graders that want richer feedback can adopt the items[] contract.

Validation

pytest tests/, ruff check, ruff format --check, ty check sia/, and
python -m build all pass.

This contradicts the project's own stated contract — README ("private/ ... never exposed to the agent") and docs/walkthrough.md ("the LLM is not told about data/private/ ... prevents the agent from cheating"). It is reproducible on a bundled task: the `longcot-chess` grader writes results.json as `{"summary": {...}, "results": [{"expected": <correct answer>, ...}]}`, and `_build_feedback_context` dumps the whole file — every held-out answer — into the feedback-agent prompt via `json.dumps(eval_data)`. The feedback agent (told to improve the target agent, and able to read the filesystem) can then optimize toward the answer key instead of toward a genuinely better solution — a reward-hacking / contamination surface. (lawbench exposes only aggregate per-class accuracy; gpqa writes its gold to a different filename and isn't dumped — one leaking bundled task is enough.) This fixes the leak with three composable, task-agnostic layers: Layer 1 — curated, reference-answer-free eval summary. `_build_feedback_context` no longer dumps results.json. It calls a new `_build_eval_summary` that, via `_collect_scalars`, emits every SCALAR metric at any nesting depth as a JSON object — preserving the original results.json shape and field names (top-level `accuracy`/`correct`/`total`, or a nested `summary` block; whatever the grader wrote). The reference-answer-free guarantee comes from DROPPING every list: per-item record arrays (`results`/`details`, where the bundled graders put the answer key) are excluded wholesale; nested dicts are recursed for their scalars. The only per-item channel is the generic, opt-in `items[]` array, surfaced solely through a fixed render whitelist (id/status/group/category/input/output/detail) so no task-specific key can ride along; when it carries failures, a capped sample (diversified across status and group) plus an anti-reward-hack framing line are appended. 
A grader that emits only scalars gets output equivalent to the original dump, minus answer-bearing arrays. `summarize_items` also stores a bounded items digest in context.md. Layer 2 — held-out-integrity prompt notice. A generic, path-free instruction injected into the meta and feedback prompts telling the agent that grader-only ground truth exists in a private directory and must not be accessed. Layer 3 — protected-path defense-in-depth (two complementary guards), keyed off the same generic `compute_protected_paths(task_dir)`: 3a — claude-impl PreToolUse deny hook. Denies any Bash/Read/Edit/Glob/... tool call resolving inside the task's held-out dir (`<task_dir>/data/private`), wired via a new `protected_paths` kwarg. Defaults to None (no hook), so behavior is byte-identical when unused. The Bash check is a path-string match, and only the `claude` impl honors the kwarg. 3b — impl-agnostic OS-level guard. A `restricted_access` context manager strips all filesystem permissions on `data/private` (chmod 000) for the duration of the sequential meta/feedback agent run, restoring the original mode on exit (including on error). It wraps only the agent calls — never grading, which legitimately reads the dir. Gated by `SIA_PRIVATE_DIR_GUARD` (default on). 3b closes two gaps 3a leaves open: it protects the held-out dir for the openai/pydantic-ai/openhands impls that ignore the kwarg, and it fails the read at the filesystem layer, which an obfuscated shell or subprocess cannot talk around. Residual caveats: a root process or a copy-out-then-read still bypasses it, and a SIGKILL mid-run leaves the dir stripped until restored — container sandboxing (--sandbox docker) remains the strongest boundary. Three config knobs (`SIA_VERIFIER_PASS_STATUSES`, `SIA_FEEDBACK_FAILURE_SAMPLES`, `SIA_PRIVATE_DIR_GUARD`) are documented in docs/configuration.md; the first two default to today's behavior, the third turns on the OS-level guard by default. 
Tests cover the leak guard (reference-answer sentinel in results[] never appears), the real nested-`summary`/`results[]` grader shape, scalar recursion, the render whitelist, the protected-path matcher matrix, the OS-level guard (strip/restore, no-op when disabled, restore-on-error, missing-path tolerance), the prompt notice, and source-level genericity guards. Only the three prompt goldens change (the Layer-2 notice); the feedback-context goldens are unchanged.

8RON8 approved these changes Jun 12, 2026


selvamHexo force-pushed the fix/feedback-gold-leak branch from 751b2a4 to e93c7b5 Compare June 16, 2026 15:50

selvamHexo requested a review from yogendrahexo June 16, 2026 15:51

francisco-perez-sorrosal wants to merge 1 commit into hexo-ai:main from francisco-perez-sorrosal:fix/feedback-gold-leak


2 participants
