Four levels of self-improving code agents, from the simplest loop to a full adversarial arena with self-modifying agents. Each level adds one key idea.
Independent research comparing existing approaches to self-improving code agents and proposing new ones.
- Level 1 -- AutoResearch Loop: LLM improves code against a benchmark (a la Karpathy's AutoResearch).
- Level 2 -- Feedback Loop: A reviewer explains WHY it failed and suggests specific fixes (my contribution, built independently in Feb 2026 -- before AutoResearch and HyperAgents were published).
- Level 3 -- HyperAgent Loop: The agent rewrites its own source code (inspired by Meta's HyperAgents).
- Level 4 -- Arena Loop: Adversarial co-evolution -- code agents and test agents push each other to improve. The benchmark itself evolves (my proposed next step).
- 4a -- Arena Single: 1 code agent vs 1 test agent (isolates the adversarial signal)
- 4b -- Arena Loop: 4 agents compete, worst replaced by mutated winners (adds population dynamics)
| Snake Game AI | Email Validation |
|---|---|
![]() |
![]() |
| Customer Support |
|---|
![]() |
- Fixed benchmarks create false confidence -- Levels 1-3 scored 90-100% on email validation's original tests, but dropped to 62-66% on Arena Loop's adversarial suite. Arena Loop held at 70%. The solutions that looked perfect were brittle.
- Feedback Loop is the best cost-performance tradeoff -- wins snake (31.1 vs Arena's 29.2), ties Arena on support (72.86 vs 69.86, within noise), but costs 20x less ($0.03 vs $0.63 per task)
- Arena Loop builds robustness -- scored 78.16 on support against 28 adversarial questions (vs 10 original), climbing every round as the test suite grew. On the same 10 questions, Feedback Loop (72.86) and Arena Loop (69.86) are within scoring noise of each other.
- Each level has its sweet spot -- simple tasks need simple agents (Level 1), quality tasks need structured feedback (Level 2), and robustness against unseen edge cases needs adversarial testing (Level 4)
See experiment-results.md for full results and cross-validation data.
pip install -r requirements.txtCreate a .env file in the repo root:
GEMINI_API_KEY=your-api-key-from-https://aistudio.google.com/apikey
# Run any level (default task: snake)
python autoresearch/run.py
python feedback-loop/run.py
python hyperagent/run.py
python arena-loop/run.py
# Different tasks
python autoresearch/run.py --task snake # Snake game AI
python autoresearch/run.py --task support # Customer support (LLM-as-judge)
python arena-loop/run.py --task email_validation # Adversarial email validation
# Run experiments with logging and analysis
python autoresearch/experiment.py --task snake --iters 6
# Run ALL experiments across all levels + generate comparison analysis
python run_all.py
# Or pick specific levels/tasks
python run_all.py --levels hyperagent arena-loop --tasks snake
python run_all.py --fresh # ignore previous results, start clean
# Cross-level comparison only (after experiments are done)
python analyze_results.pyAll experiments support checkpoint/resume -- if interrupted, re-run the same
command and it picks up where it left off. Checkpoints are managed by the shared
tasks/checkpoint.py module with atomic writes.
After running a snake experiment, you can watch the AI play in your terminal:
# Watch any level's best solution
python tasks/snake/play.py autoresearch/results/snake/solutions/best.py
python tasks/snake/play.py feedback-loop/results/snake/solutions/best.py
python tasks/snake/play.py hyperagent/results/snake/solutions/best.py
python tasks/snake/play.py arena-loop/results/snake/solutions/best.py
# Slower playback (default is 0.05s per step)
python tasks/snake/play.py arena-loop/results/snake/solutions/best.py --speed 0.1The simplest self-improving loop. One agent, one task, one feedback signal.
LLM proposes code -> write to file -> run & benchmark -> keep if better -> repeat
This implements the verifiable rewards pattern from reinforcement learning (AlphaGo, DeepSeek-R1 RLVR), applied at the agent level: the benchmark provides a machine-checkable reward, enabling fully autonomous operation. Bad proposals revert instantly -- rejection is free, only improvements accumulate.
Adds a reviewer agent that explains WHY something failed.
Instead of just "rejected," the reviewer returns structured feedback:
{
"issue_type": "performance",
"severity": "critical",
"fix_suggestion": "First-element pivot is O(n^2) on reversed input. Use random pivot.",
"confidence": 0.97,
"pattern_detected": "persistent_regression"
}The worker stays focused (small prompt, just the code). The reviewer sees everything (full history, all previous feedback) and spots cross-iteration patterns the worker would miss.
Key idea: asymmetric information -- match context size to role.
True code-rewriting self-improvement. The meta-agent doesn't just update a
text strategy -- it rewrites the actual source code of task_agent.py and
meta_agent.py (including its own code).
seed/ Original agent code (immutable reference)
agent_code/ Live working copies (rewritten by meta-agent)
generations/ Versioned snapshots (gen_000, gen_001, ...)
Two nested loops:
- Inner loop:
task_agent.pyproposes code improvements, benchmark, keep or revert - Outer loop:
meta_agent.pyreads evaluation results and rewrites the agent source
A 3-stage crash recovery validates every rewrite before accepting it:
- Compile check -- catches syntax errors
- Import check -- catches missing dependencies, runtime init errors
- Signature check -- ensures required functions still exist with correct parameters
Invalid rewrites revert to the last valid generation. Every generation is saved
to generations/ with metadata for full traceability.
Inspired by Meta's DGM-H (HyperAgents, arxiv 2603.19461). Simplified: no Docker containers, folder-based versioning instead.
Isolates the adversarial co-evolution mechanism: 1 code agent vs 1 test agent. No tournament, no population dynamics -- just the arms race.
Round 1: Code agent reaches 100% accuracy.
Test agent adds tricky edge cases.
Score DROPS to 80%.
Round 2: Code agent adapts, recovers to 96%.
Test agent finds new weaknesses.
Score DROPS to 80% again.
Round 3: Code agent handles those too. 94%.
Test agent keeps pushing. Repeat.
The arms race IS the training signal. Like GANs (Generative Adversarial Networks) for code -- two agents push each other to improve.
Run with: python arena-loop/experiment.py --code 1 --test 1 --label single
Adds tournament selection on top of the adversarial mechanism. Multiple code agents and test agents compete; worst are replaced by mutated winners every K rounds. Strategies evolve through competition.
Code agents are mini-HyperAgents that can mutate their own propose() function
-- the code that generates code improvements is itself subject to evolution.
Resume is fully supported via serialize/deserialize on all agent objects.
Read arena-loop/CONCEPT.md for the full architectural writeup.
Tasks live in the shared tasks/ folder. Each is a folder with real runnable files:
| Task | Type | Best showcase |
|---|---|---|
snake |
Deterministic (score) | All levels (gradual improvement) |
support |
LLM-as-judge (quality) | Levels 2-3 (structured feedback shines) |
email_validation |
Adversarial (accuracy) | Level 4 (arms race proves robustness) |
The framework is task-agnostic. To add a new task, create a folder in tasks/ with three files:
tasks/your_task/
config.py # TASK_NAME, METRIC_NAME, HIGHER_IS_BETTER, PROMPT_TEMPLATE, build_prompt()
initial_solution.py # Starting code (the baseline the agents will improve)
benchmark.py # Runs the solution, prints "metric_name:value" to stdout
config.py defines what the task is and how to prompt the LLM:
TASK_NAME = "your_task"
METRIC_NAME = "score" # what the benchmark prints
HIGHER_IS_BETTER = True # True = maximize, False = minimize
PERFECT_SCORE = 100.0 # optional: stop early when reached
def build_prompt(code, metric):
return f"Improve this code. Current {METRIC_NAME}: {metric}\n\n{code}"benchmark.py runs the solution and prints the metric:
# Usage: python benchmark.py <solution_file>
# Must print: score:42.5 (or whatever METRIC_NAME is)Then run any level with --task your_task:
python autoresearch/run.py --task your_task
python run_all.py --tasks your_taskFor LLM-as-judge tasks (subjective quality), set USES_LLM_JUDGE = True in config.py
and have benchmark.py print answers: JSON instead of a metric. For more stable scoring,
generate a boolean rubric:
python tasks/generate_rubric.py --task your_taskThis analyzes your test cases and knowledge base, then generates rubric_checks.json
(per-question boolean fact checks with weights) and rubric.py (scoring engine).
Review the generated checks, add USES_RUBRIC = True to config.py, and run your
experiment. See tasks/support/ for an example.
tasks/ Shared task definitions
snake/ initial_solution.py, benchmark.py, config.py, play.py
support/ initial_solution.py, benchmark.py, config.py, ...
email_validation/ initial_solution.py, benchmark.py, config.py, ...
task_runner.py Central: load_task, write_solution, run_solution
checkpoint.py Shared checkpoint/resume (atomic writes, all levels)
autoresearch/ Level 1: AutoResearch Loop
run.py, llm.py, experiment.py
feedback-loop/ Level 2: Feedback Loop
run.py, worker.py, reviewer.py, llm.py, experiment.py
hyperagent/ Level 3: HyperAgent Loop
run.py, llm.py, experiment.py
seed/ Original agent code (immutable)
task_agent.py Seed task agent
meta_agent.py Seed meta-agent
agent_code/ Live working copies (rewritten each generation)
generations/ Versioned snapshots (gen_000/, gen_001/, ...)
arena-loop/ Level 4: Arena (both 4a single and 4b tournament)
run.py, code_agent.py, test_agent.py, arena.py, llm.py, experiment.py, CONCEPT.md
results/ 4b tournament results
results-single/ 4a single (1v1) results
run_all.py Run all experiments + generate analysis (single command)
analyze_results.py Cross-level comparison of experiment results
All levels use a shared checkpoint module (tasks/checkpoint.py) for
atomic, resumable checkpointing. If an experiment is interrupted (crash, API
timeout, Ctrl+C), re-run the same command and it resumes from the last
checkpoint. Checkpoints are sequence-numbered JSON files with atomic writes
(write to .tmp, then os.replace). Only the last 3 checkpoints are kept.
See PRIOR_WORK.md for the full context: verifiable rewards lineage (AlphaGo, RLVR, AutoResearch), independent convergence across groups in early 2026, and how this project's adversarial approach differs from prior code/test co-evolution work.
| Level | Adds | Key insight | Lineage |
|---|---|---|---|
| 1 | The loop | Rejection is free. Only improvements accumulate. | Verifiable rewards (AlphaGo, RLVR) |
| 2 | Structured feedback | Know WHY it failed, not just that it failed. | Asymmetric context windows |
| 3 | Code-rewriting | The agent rewrites its own source code. | Meta's HyperAgents (DGM-H) |
| 4 | Adversarial co-evolution | A fixed benchmark gets gamed. An evolving benchmark doesn't. | GANs, co-evolutionary algorithms |






