scaling-group/eve

An illustrative run: each tmux pane is an independent coding agent editing, evaluating, and evolving solutions.

Overview

EvE wraps existing, highly capable coding agents into a decentralized evolutionary ensemble that co-evolves two populations: a solver population of functional components within a repository, and an agent population whose guidance and skills are continuously self-refined.

Use EvE for challenging tasks whose results can be tested or judged: designing algorithms, improving code, or solving mathematical problems.

To run EvE, you provide:

  • a working environment, such as a codebase or a mathematical problem in a GitHub repository or a local folder;

  • the solution files or folders that agents are allowed to edit, such as code files or a proof draft;

  • scoring steps that evaluate each solution, such as shell scripts, agent judge prompts, or a combination of them.

EvE then searches for strong solutions without requiring a task-specific workflow or hand-crafted skills.

Quick Start

Important

EvE orchestrates third-party coding agents; it does not provide unlimited access to any AI service. Each agent session consumes subscription quota or API credits on your own account. EvE does not bypass or modify any provider's authentication, rate limits, or usage restrictions.

First-time setup

  1. Install dependencies.

    uv sync

  2. Agent authentication. The current public release uses Codex as the default agent backend. Install and authenticate Codex with your own login, subscription, or API credentials.

  3. Hook trust (for Codex >= 0.130.0). EvE uses hooks for workspace sandboxing and budget prompts. Run once per machine from the repository root if you are using Codex:

    uv run python -m scaling_evolve.providers.agent.codex_hooks
    codex
    # Type /hooks -> press t to trust all -> Esc -> Ctrl-C
    
  4. Verify. Run a short smoke test using Codex to confirm everything works:

    uv run python -m scaling_evolve.algorithms.eve.runner \
      --config-name=circle_packing.smoke
    

    This runs a short circle packing demo with headless Codex agents.

Math Proof quickstart

To try EvE on a mathematical problem, start Codex from the repository root and ask it to use the Math Proof quickstart:

codex
Run the Math Proof quickstart for this problem:

<paste the problem statement here>

You may ask Codex for more details.

How It Works

(a) One-shot code proposal. (b) A coding agent works inside a repository. (c) EvE runs many coding agents across many candidate solutions, scores the results, and carries the useful history into the next round.

EvE maintains two co-evolving populations: a solver population containing functional components in a code repository, and an agent population where each agent carries cumulative working logs and an Elo-based score. EvE fixes the base agent substrate and focuses on evolving the cumulative guidance and skills that dictate agent behaviors.

Each agent operates in a dedicated workspace with all dependencies included, and its modification scope is explicitly restricted to designated files and enforced by post-generation checks. Solver improvement and self-referential agent optimization happen in one unified step: agents improve guidance and skills while editing code repositories, and this guidance is then repeatedly evaluated during future iterations, with concrete scores that drive sampling probability.

The formal procedure is given in the algorithm below.

Algorithm: Evolutionary Ensemble of Agents

In each iteration, EvE samples a set of high-performing working agents, along with reference sets of solvers and agents, which are combined with the base code repository to provide context. A synchronous race is then conducted: each working agent operates within its own workspace on the same reference set, producing a new solver candidate and a potentially revised agent. By forcing all agents to refine the same references, variance in solver quality is directly attributed to the effectiveness of each agent's strategy. After evaluation, a pairwise win-loss matrix is constructed and agent Elo ratings are updated. Agents that revised their guidance are integrated back into the population, preserving new strategies and their procedural evidence.
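
The rating step described above can be illustrated with a minimal sketch. It assumes the textbook Elo formula, a higher-is-better solver score, and that every pair of agents in a race is compared once; the function names `expected_score` and `update_elo` are hypothetical and are not taken from EvE's code base. The default `k_factor` of 32 mirrors the value shown in the optimizer config.

```python
from itertools import combinations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B (value in 0..1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: dict[str, float],
    solver_scores: dict[str, float],
    k_factor: float = 32.0,
) -> dict[str, float]:
    """Return new agent ratings after one race.

    Assumption: each agent is judged by the score of the solver it produced,
    a higher score wins, and equal scores count as a draw. All deltas are
    computed from the pre-race ratings and applied together.
    """
    deltas = {name: 0.0 for name in solver_scores}
    for a, b in combinations(solver_scores, 2):
        if solver_scores[a] > solver_scores[b]:
            outcome_a = 1.0
        elif solver_scores[a] < solver_scores[b]:
            outcome_a = 0.0
        else:
            outcome_a = 0.5
        exp_a = expected_score(ratings[a], ratings[b])
        deltas[a] += k_factor * (outcome_a - exp_a)
        # B's update is the mirror image, so each pairwise update is zero-sum.
        deltas[b] -= k_factor * (outcome_a - exp_a)
    # Agents that did not race keep their rating unchanged.
    return {name: ratings[name] + deltas.get(name, 0.0) for name in ratings}


if __name__ == "__main__":
    before = {"agent_a": 1500.0, "agent_b": 1500.0}
    after = update_elo(before, {"agent_a": 0.9, "agent_b": 0.4})
    print(after)  # {'agent_a': 1516.0, 'agent_b': 1484.0}
```

Because all agents in a race refine the same references, a win in this sketch reflects the agent's strategy rather than a luckier starting point.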

Set Up Your Own Task

To run EvE on your own codebase, you need a source repository, an application config, an evaluation config, initial guidance, and a top-level experiment config.

1. Application config

Point EvE at your code repository and declare which files agents may edit:

# configs/eve/application/your_task.yaml
application:
  name: your-task
  github_url: https://github.com/your-org/your-repo
  commit: abc123... # pin to a specific commit
  editable:
    files:
      - src/model.py
      - configs/params.yaml
    folders: []
  boundary_failure_score:
    score: 0.0
    summary: boundary check failed

Agents are strictly confined to editable.files and editable.folders; any modification outside this surface is rejected by the boundary checker. For a local task tree instead of a Git checkout, set application.path and omit github_url and commit; EvE accepts either path or the pair github_url/commit, but not both.
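
As an illustration of the editable-surface rule, the sketch below checks a list of changed paths against `editable.files` and `editable.folders`. It is not EvE's boundary checker: the function name `boundary_violations` and the choice to reject absolute paths and `..` components are assumptions made for this example.

```python
from pathlib import PurePosixPath


def boundary_violations(
    changed_paths: list[str],
    editable_files: list[str],
    editable_folders: list[str],
) -> list[str]:
    """Return the changed paths that fall outside the editable surface.

    Paths are repository-relative and use forward slashes.
    """
    files = {PurePosixPath(p) for p in editable_files}
    folders = [PurePosixPath(p) for p in editable_folders]
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Assumption: absolute paths and ".." components are never allowed,
        # since they could escape the workspace.
        if path.is_absolute() or ".." in path.parts:
            violations.append(raw)
            continue
        in_file = path in files
        in_folder = any(folder in path.parents for folder in folders)
        if not (in_file or in_folder):
            violations.append(raw)
    return violations


if __name__ == "__main__":
    changed = ["src/model.py", "README.md", "src/../secrets.txt"]
    allowed_files = ["src/model.py", "configs/params.yaml"]
    print(boundary_violations(changed, allowed_files, []))
    # ['README.md', 'src/../secrets.txt']
```

A non-empty result would correspond to the case in which the candidate receives `boundary_failure_score`.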

2. Evaluation config

Provide one or more evaluation steps. A shell step runs inside the candidate workspace and must write a score.yaml file with at least score: <float> and summary: <string> to the evaluation log directory. Evaluation steps can also be judge-agent steps written as {prompt, immutable?} mappings; see implement-evaluation-steps for the full evaluation-step surface.

# configs/eve/evaluation/your_task.yaml
evaluation:
  steps:
    - configs/eve/evaluation/your_task/evaluation.sh
  failure_score:
    score: 0.0
    summary: evaluation failed
  seed_solver_score: null
  seed_solver_skip_evaluation: false
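
A shell step can delegate the scoring to a small Python helper. The sketch below writes a `score.yaml` with the two required keys using only the standard library. How EvE passes the evaluation log directory to a step is not described here, so the `log_dir` argument and the helper name `write_score` are assumptions for illustration.

```python
import json
from pathlib import Path


def write_score(log_dir: str, score: float, summary: str) -> Path:
    """Write score.yaml with the required `score` and `summary` keys."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)
    target = log_path / "score.yaml"
    # json.dumps produces a double-quoted string, which is also a valid
    # YAML scalar, so colons or quotes in the summary cannot break the file.
    body = f"score: {float(score)}\nsummary: {json.dumps(summary)}\n"
    target.write_text(body, encoding="utf-8")
    return target


# Example call from an evaluation script (the directory is a placeholder):
# write_score("path/to/evaluation/logs", 0.73, "packing valid, radius sum 0.73")
```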

3. Initial guidance

Write Markdown documents and optional skills that describe the task and search directions for the agents. Place them in an initial_guidance/ directory:

configs/eve/optimizer/your_task/initial_guidance/
  docs/                          # task context, search directions, background knowledge
  skills/
    your-skill/SKILL.md          # reusable instructions agents will read and evolve

Configure the initial guidance, immutable prompt assets, and worker prompts:

# configs/eve/optimizer/your_task.yaml
optimizer:
  initial_guidance: configs/eve/optimizer/your_task/initial_guidance
  workers:
    selection: random
    items:
      - name: normal
        weight: 1.0
        immutable: configs/eve/optimizer/your_task/immutable
        prompt: configs/eve/optimizer/your_task/prompt
  immutable_renderer:
    _target_: scaling_evolve.algorithms.eve.workspace.immutable_renderers.default.DefaultRenderer
  evaluation:
    _target_: scaling_evolve.algorithms.eve.populations.evaluators.elo.ScalarEloEvaluator
    k_factor: 32.0
    initial_score:
      elo: 1500.0

Files under initial_guidance/ seed the initial agent population. As agents discover what works and what does not, they revise the guidance for future iterations.

4. Loop config

The default loop controls how many workers run per iteration and how many population examples each worker sees:

# configs/eve/loop/default.yaml
loop:
  max_iterations: 25
  n_workers_phase2: 2
  n_solver_examples_phase2: 4
  n_optimizer_examples_phase2: 4
  produce_optimizer_in_phase2: ${loop.n_workers_phase2}

At each iteration, EvE launches n_workers_phase2 solver workers in parallel. Each worker gets sampled solver references under solver_examples/ and sampled guidance references under guidance_examples/. Workers race on the same iteration without communicating directly; the resulting solvers are evaluated, and produced guidance candidates are kept according to produce_optimizer_in_phase2.
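
The race structure can be pictured with a toy sketch in which each worker is a plain Python callable rather than an agent session in its own workspace. The names `run_race`, `workers`, and `evaluate` are hypothetical; the sketch only shows that all workers start from the same references, run concurrently without exchanging data, and are ranked afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_race(
    workers: dict[str, Callable[[list, list], Any]],
    solver_examples: list,
    guidance_examples: list,
    evaluate: Callable[[Any], float],
) -> list[tuple[str, float]]:
    """Run every worker on the same references; return (name, score), best first."""

    def run_one(item: tuple[str, Callable[[list, list], Any]]) -> tuple[str, float]:
        name, worker = item
        # Each worker gets its own copies, so workers cannot influence
        # each other through shared mutable state.
        candidate = worker(list(solver_examples), list(guidance_examples))
        return name, evaluate(candidate)

    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        results = list(pool.map(run_one, workers.items()))
    # Assumption: a higher score is better.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    toy_workers = {
        "worker_0": lambda solvers, guidance: len(solvers) + 1,
        "worker_1": lambda solvers, guidance: len(solvers) + 2,
    }
    print(run_race(toy_workers, [10, 20], [], float))
    # [('worker_1', 4.0), ('worker_0', 3.0)]
```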

5. Experiment config

Compose everything with Hydra and launch:

# configs/eve/your_task.yaml
defaults:
  - runtime: default
  - loop: default
  - driver: codex_max
  - logger: many_loggers
  - application: your_task
  - optimizer: your_task
  - evaluation: your_task
  - _self_

label: your-task

uv run python -m scaling_evolve.algorithms.eve.runner --config-name=your_task

See configs/eve/circle_packing.yaml, configs/eve/icon.yaml, and configs/eve/math_proof_quickstart.yaml for complete working examples.

Example: Model Positional Embedding Design

Applied to ICON (In-Context Operator Networks), EvE autonomously discovered a novel positional-encoding mechanism that reduced generalization error by over 80% compared to the hand-designed baseline, turning a catastrophic out-of-distribution failure into robust performance. See examples/icon/README.md for reproduction instructions.

We compare three experimental conditions, each run twice independently under identical compute budgets:

  • EvE: the full ensemble with continuous agent evolution.
  • Static-Initial: the initial agent is used throughout the entire search, with no agent evolution.
  • Static-Final: the single best-rated agent from the corresponding completed EvE run is extracted and frozen for a fresh search.

Search trajectories for all variants (two independent runs each). The y-axis is the running minimum of mean error (lower is better); the x-axis is cumulative equivalent tokens in millions. The gray dashed line marks the Seed baseline.

The two EvE runs descend in near-lockstep, converging to almost identical final errors. The Static-Initial runs diverge: one eventually approaches EvE while the other plateaus at a higher error. Static-Final, despite starting from a higher-rated agent, suffers from phase mismatch: the frozen agent was optimized for the late stage of the original EvE run, but a fresh search requires early-stage exploration strategies that this agent no longer carries. These results indicate that continuous evolution is essential for both performance and robustness.

The complete raw search traces for all six runs (every solver's source code, agent conversations, guidance updates, and evaluation scores) are available in the v0.1.0 release.

Community

Questions, feedback, or running into issues? Join our Slack workspace.

Citation

Please cite our paper if you use EvE in your research:

@article{yu2026eve,
  title         = {Evolutionary Ensemble of Agents},
  author        = {Yu, Zongmin and Yang, Liu},
  year          = {2026},
  url           = {https://arxiv.org/abs/2605.09018},
  eprint        = {2605.09018},
  archivePrefix = {arXiv},
  primaryClass  = {cs.NE}
}

Papers Using EvE

EvE has been applied to a growing set of scientific problems. See additional papers using EvE and their BibTeX entries in docs/papers.md. If you use EvE in your work, we welcome pull requests to add your paper to the list.

Acknowledgement

This project is part of the Scientific Computing and Intelligence Group (Scaling Group) at the National University of Singapore.

If your codebase uses this repository, please include the NOTICE file (provided in this repository) in it.

About

EvE is an open-source framework for co-evolving ensembles of coding agents. Wrap any coding agent — Codex, Claude Code — into an evolutionary loop that co-evolves solvers and agent guidance.
