An illustrative run: each tmux pane is an independent coding agent editing, evaluating, and evolving solutions.
EvE wraps existing, highly capable coding agents into a decentralized evolutionary ensemble that co-evolves two populations: a solver population of functional components within a repository, and an agent population whose guidance and skills are continuously self-refined.
Use EvE for challenging tasks where results can be tested or judged, such as designing algorithms, improving code, or solving mathematical problems.
To run EvE, you provide:
- a working environment, such as a codebase or a mathematical problem in a GitHub repository or a local folder;
- the solution files or folders that agents are allowed to edit, such as code files or a proof draft;
- scoring steps that evaluate each solution, such as shell scripts, agent judge prompts, or a combination of both.
EvE then searches for strong solutions without requiring a task-specific workflow or hand-crafted skills.
Important: EvE orchestrates third-party coding agents; it does not provide unlimited access to any AI service. Each agent session consumes subscription quota or API credits on your own account. EvE does not bypass or modify any provider's authentication, rate limits, or usage restrictions.
- Install dependencies.
  uv sync
- Agent authentication. The current public release uses Codex as the default agent backend. Install and authenticate Codex with your own login, subscription, or API credentials.
- Hook trust (for Codex >= 0.130.0). EvE uses hooks for workspace sandboxing and budget prompts. Run once per machine from the repository root if you are using Codex:
  uv run python -m scaling_evolve.providers.agent.codex_hooks
  codex  # Type /hooks -> press t to trust all -> Esc -> Ctrl-C
- Verify. Run a short smoke test using Codex to confirm everything works:
  uv run python -m scaling_evolve.algorithms.eve.runner \
    --config-name=circle_packing.smoke
  This runs a short circle packing demo with headless Codex agents.
To try EvE on a mathematical problem, start Codex from the repository root and ask it to use the Math Proof quickstart:
codex
Run the Math Proof quickstart for this problem:
<paste the problem statement here>
You may ask Codex for more details.
(a) One-shot code proposal. (b) A coding agent works inside a repository. (c) EvE runs many coding agents across many candidate solutions, scores the results, and carries the useful history into the next round.
EvE maintains two co-evolving populations: a solver population containing functional components in a code repository, and an agent population where each agent carries cumulative working logs and an Elo-based score. EvE fixes the base agent substrate and focuses on evolving the cumulative guidance and skills that dictate agent behaviors.
Each agent operates in a dedicated workspace with all dependencies included, and its modification scope is explicitly restricted to designated files and enforced by post-generation checks. Solver improvement and self-referential agent optimization happen in one unified step: agents improve guidance and skills while editing code repositories, and this guidance is then repeatedly evaluated during future iterations, with concrete scores that drive sampling probability.
The formal procedure is given in the algorithm below.
In each iteration, EvE samples a set of high-performing working agents, along with reference sets of solvers and agents, which are combined with the base code repository to provide context. A synchronous race is then conducted: each working agent operates within its own workspace on the same reference set, producing a new solver candidate and a potentially revised agent. By forcing all agents to refine the same references, variance in solver quality is directly attributed to the effectiveness of each agent's strategy. After evaluation, a pairwise win-loss matrix is constructed and agent Elo ratings are updated. Agents that revised their guidance are integrated back into the population, preserving new strategies and their procedural evidence.
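The rating step of one race can be sketched as follows. This is a minimal illustration using the standard Elo formula, not EvE's own evaluator: the function names, the dict-based inputs, and the convention that a higher solver score wins are all assumptions made for the example. The K-factor of 32 and the starting rating of 1500 mirror the values in the optimizer config shown in this README.

```python
def win_matrix(solver_scores):
    """Pairwise outcomes: 1.0 if row agent's solver beat column agent's, 0.5 on a tie, else 0.0."""
    agents = list(solver_scores)
    matrix = {}
    for a in agents:
        matrix[a] = {}
        for b in agents:
            if a == b:
                continue
            if solver_scores[a] > solver_scores[b]:
                matrix[a][b] = 1.0
            elif solver_scores[a] == solver_scores[b]:
                matrix[a][b] = 0.5
            else:
                matrix[a][b] = 0.0
    return matrix


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, solver_scores, k_factor=32.0):
    """Batch update: every pairing is scored against the pre-race ratings."""
    outcomes = win_matrix(solver_scores)
    updated = dict(ratings)
    for a, row in outcomes.items():
        for b, outcome in row.items():
            updated[a] += k_factor * (outcome - expected_score(ratings[a], ratings[b]))
    return updated


# Hypothetical race: three agents refined the same references (assumed: higher score is better).
ratings = {"agent_a": 1500.0, "agent_b": 1500.0, "agent_c": 1500.0}
scores = {"agent_a": 0.9, "agent_b": 0.5, "agent_c": 0.7}
new_ratings = update_elo(ratings, scores)
print(new_ratings)
```

Because every agent starts from the same references, the rating change here depends only on how the produced solvers rank against each other, and the total rating mass is conserved across the race.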
To run EvE on your own codebase, you need a source repository, an application config, an evaluation config, initial guidance, and a top-level experiment config.
Point EvE at your code repository and declare which files agents may edit:
# configs/eve/application/your_task.yaml
application:
  name: your-task
  github_url: https://github.com/your-org/your-repo
  commit: abc123...  # pin to a specific commit
  editable:
    files:
      - src/model.py
      - configs/params.yaml
    folders: []
  boundary_failure_score:
    score: 0.0
    summary: boundary check failed
Agents are strictly confined to editable.files and editable.folders; any
modification outside this surface is rejected by the boundary checker.
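To make the editable surface concrete, here is a minimal sketch of the kind of check a boundary checker could perform. It is not EvE's implementation: the function names and the list of changed paths are hypothetical, and the rule (exact match for files, containment for folders, no absolute paths or `..` segments) is an assumption about reasonable behavior.

```python
from pathlib import PurePosixPath


def is_editable(path, files, folders):
    """Return True if a repo-relative path lies inside the declared editable surface."""
    p = PurePosixPath(path)
    # Reject anything that could escape the workspace.
    if p.is_absolute() or ".." in p.parts:
        return False
    if p in {PurePosixPath(f) for f in files}:
        return True
    # A file is inside an editable folder if that folder is one of its ancestors.
    return any(PurePosixPath(folder) in p.parents for folder in folders)


def boundary_violations(changed_paths, files, folders):
    """List every changed path that falls outside the editable surface."""
    return [p for p in changed_paths if not is_editable(p, files, folders)]


# Hypothetical set of files an agent touched during one iteration.
changed = ["src/model.py", "README.md", "configs/params.yaml"]
bad = boundary_violations(
    changed,
    files=["src/model.py", "configs/params.yaml"],
    folders=[],
)
print(bad)
```

In this sketch a non-empty result would correspond to the candidate receiving the configured boundary_failure_score.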
For a local task tree instead of a Git checkout, set application.path and omit
github_url and commit; EvE accepts either path or the pair
github_url/commit, but not both.
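The either/or rule for the task source can be expressed as a small validation function. This is only a sketch of the rule as stated above; the function name, the plain-dict input, and the error messages are assumptions, and EvE's own config validation may behave differently.

```python
def resolve_source(application):
    """Return "local" or "git" for an application mapping, or raise on an invalid mix."""
    has_path = application.get("path") is not None
    has_url = application.get("github_url") is not None
    has_commit = application.get("commit") is not None

    if has_path and (has_url or has_commit):
        raise ValueError("set either path or github_url/commit, not both")
    if has_path:
        return "local"
    if has_url and has_commit:
        return "git"
    raise ValueError("set path, or set both github_url and commit")


local_kind = resolve_source({"name": "your-task", "path": "tasks/your_task"})
git_kind = resolve_source(
    {
        "name": "your-task",
        "github_url": "https://github.com/your-org/your-repo",
        "commit": "abc123...",
    }
)
print(local_kind, git_kind)
```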
Provide one or more evaluation steps. A shell step runs inside the candidate
workspace and must write a score.yaml file with at least score: <float> and
summary: <string> to the evaluation log directory.
Evaluation steps can also be judge-agent steps written as {prompt, immutable?} mappings; see
implement-evaluation-steps for the
full evaluation-step surface.
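As an illustration of the score.yaml contract, the helper below writes the two required keys using only the Python standard library. It is a sketch under assumptions: the helper name is made up, and how EvE tells a step where the evaluation log directory lives is not covered here, so the directory is passed in explicitly (a temporary directory stands in for it in the demo).

```python
import json
import tempfile
from pathlib import Path


def write_score(log_dir, score, summary):
    """Write score.yaml with the required `score` and `summary` keys."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    out = log_dir / "score.yaml"
    # json.dumps emits a double-quoted, escaped string, which is also a valid YAML scalar.
    out.write_text(
        f"score: {float(score)}\nsummary: {json.dumps(summary)}\n",
        encoding="utf-8",
    )
    return out


# Demo only: a temporary directory stands in for the evaluation log directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_score(tmp, 0.8731, "example summary")
    content = written.read_text(encoding="utf-8")

print(content)
```

A shell step could call a script like this as its last action, after computing the score for the candidate workspace.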
# configs/eve/evaluation/your_task.yaml
evaluation:
  steps:
    - configs/eve/evaluation/your_task/evaluation.sh
  failure_score:
    score: 0.0
    summary: evaluation failed
  seed_solver_score: null
  seed_solver_skip_evaluation: false
Write Markdown documents and optional skills that describe the task and search
directions for the agents. Place them in an initial_guidance/ directory:
configs/eve/optimizer/your_task/initial_guidance/
  docs/                   # task context, search directions, background knowledge
  skills/
    your-skill/SKILL.md   # reusable instructions agents will read and evolve
Configure the initial guidance, immutable prompt assets, and worker prompts:
# configs/eve/optimizer/your_task.yaml
optimizer:
  initial_guidance: configs/eve/optimizer/your_task/initial_guidance
  workers:
    selection: random
    items:
      - name: normal
        weight: 1.0
        immutable: configs/eve/optimizer/your_task/immutable
        prompt: configs/eve/optimizer/your_task/prompt
  immutable_renderer:
    _target_: scaling_evolve.algorithms.eve.workspace.immutable_renderers.default.DefaultRenderer
  evaluation:
    _target_: scaling_evolve.algorithms.eve.populations.evaluators.elo.ScalarEloEvaluator
    k_factor: 32.0
    initial_score:
      elo: 1500.0
Files under initial_guidance/ seed the initial agent population. As agents
discover what works and what does not, they revise the guidance for future
iterations.
The default loop controls how many workers run per iteration and how many population examples each worker sees:
# configs/eve/loop/default.yaml
loop:
  max_iterations: 25
  n_workers_phase2: 2
  n_solver_examples_phase2: 4
  n_optimizer_examples_phase2: 4
  produce_optimizer_in_phase2: ${loop.n_workers_phase2}
At each iteration, EvE launches n_workers_phase2 solver workers in parallel.
Each worker gets sampled solver references under solver_examples/ and sampled
guidance references under guidance_examples/. Workers race on the same
iteration without communicating directly; the resulting solvers are evaluated,
and produced guidance candidates are kept according to
produce_optimizer_in_phase2.
Compose everything with Hydra and launch:
# configs/eve/your_task.yaml
defaults:
  - runtime: default
  - loop: default
  - driver: codex_max
  - logger: many_loggers
  - application: your_task
  - optimizer: your_task
  - evaluation: your_task
  - _self_
label: your-task
uv run python -m scaling_evolve.algorithms.eve.runner --config-name=your_task
See configs/eve/circle_packing.yaml, configs/eve/icon.yaml, and
configs/eve/math_proof_quickstart.yaml for complete working examples.
Applied to ICON (In-Context
Operator Networks), EvE autonomously discovered a novel positional-encoding
mechanism that reduced generalization error by over 80% compared to the
hand-designed baseline, turning a catastrophic out-of-distribution failure into
robust performance. See examples/icon/README.md for reproduction instructions.
We compare three experimental conditions, each run twice independently under identical compute budgets:
- EvE: the full ensemble with continuous agent evolution.
- Static-Initial: the initial agent is used throughout the entire search, with no agent evolution.
- Static-Final: the single best-rated agent from the corresponding completed EvE run is extracted and frozen for a fresh search.
Search trajectories for all variants (two independent runs each). The y-axis is the running minimum of mean error (lower is better); the x-axis is cumulative equivalent tokens in millions. The gray dashed line marks the Seed baseline.
The two EvE runs descend in near-lockstep, converging to almost identical final errors. The Static-Initial runs diverge: one eventually approaches EvE while the other plateaus at a higher level. Static-Final, despite starting from a higher-rated agent, suffers from phase mismatch: the frozen agent was optimized for the late stage of the original EvE run, but a fresh search requires early-stage exploration strategies that this agent no longer carries. Continuous evolution is indispensable for both performance and robustness.
The complete raw search traces for all six runs (every solver's source code, agent conversations, guidance updates, and evaluation scores) are available in the v0.1.0 release.
Questions, feedback, or running into issues? Join our Slack workspace.
Please cite our paper if you use EvE in your research:
@article{yu2026eve,
  title = {Evolutionary Ensemble of Agents},
  author = {Yu, Zongmin and Yang, Liu},
  year = {2026},
  url = {https://arxiv.org/abs/2605.09018},
  eprint = {2605.09018},
  archivePrefix = {arXiv},
  primaryClass = {cs.NE}
}
EvE has been applied to a growing set of scientific problems.
See additional papers using EvE and their BibTeX entries in docs/papers.md.
If you use EvE in your work, we welcome pull requests to add your paper to the list.
This project is part of the Scientific Computing and Intelligence Group (Scaling Group) at the National University of Singapore.
Please include the NOTICE file (already included in this repository) in your code base that uses this repository.





