aims-foundations/evarium

AI Evaluation Ecosystem Simulation

A generative agent-based model (GABM) of the AI evaluation ecosystem: model providers, evaluators, consumers, regulators, funders, and media. Provider strategy emerges from portfolio allocation (R&D / safety / product) and per-benchmark focus weights. Goodhart-style score–satisfaction divergence arises when benchmark-need weights misalign with consumer-need weights, not from any explicit gaming lever.
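The divergence mechanism can be illustrated with a toy calculation. This is a minimal sketch under made-up numbers: the two-need setup, the weight vectors, the provider names, and the plain weighted sum are all assumptions for illustration and are not the scoring formula documented in docs/stakeholders.md.

```python
# Toy illustration only: every number and name here is hypothetical,
# and the weighted sum is NOT the simulation's actual scoring formula.

def weighted_sum(weights, capabilities):
    """Dot product of per-need weights and per-need capabilities."""
    return sum(w * c for w, c in zip(weights, capabilities))

# Two needs. The benchmark mostly measures need 0; consumers mostly want need 1.
benchmark_weights = [0.8, 0.2]
consumer_weights = [0.2, 0.8]

# Per-need capability of two hypothetical providers.
providers = {
    "benchmark_focused": [0.9, 0.3],
    "consumer_focused": [0.5, 0.7],
}

scores = {name: weighted_sum(benchmark_weights, caps) for name, caps in providers.items()}
satisfaction = {name: weighted_sum(consumer_weights, caps) for name, caps in providers.items()}

top_by_score = max(scores, key=scores.get)
top_by_satisfaction = max(satisfaction, key=satisfaction.get)

# The leaderboard winner and the provider consumers prefer are different,
# even though neither provider games the benchmark.
print(top_by_score, top_by_satisfaction)
```

When the two weight vectors coincide, the two rankings coincide as well, which is why the misalignment itself is enough to produce the divergence in this toy setting.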

The simulation supports two execution modes:

  • Heuristic mode — deterministic rules at every actor; no-LLM baseline.
  • LLM mode — LLM-driven planning at provider, regulator, funder, and (optionally) evaluator actors.

The project follows in the lineage of Generative Agents and the broader generative agent-based modeling tradition.

Explore runs interactively

A static-first browser for the bundled paper runs lives under website/ — condition card grid, single-run animation, multi-run comparison. Once deployed, it will live at https://aimslab.stanford.edu/evaluarium/. See website/README.md for pages, URL formats, and local-serving instructions.

Installation

pip install -r requirements.txt

For LLM mode, configure a provider in .env:

  • LLM_PROVIDER=anthropic + ANTHROPIC_API_KEY=...
  • LLM_PROVIDER=openai + OPENAI_API_KEY=...
  • LLM_PROVIDER=gemini + GOOGLE_API_KEY=...
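Before launching an LLM-mode run, it can help to confirm that the provider and its key are paired correctly. The sketch below is not part of the repository: the helper name `missing_llm_config` is hypothetical, and it checks a plain mapping (such as `os.environ`), so it assumes the .env values have already been loaded into the process environment.

```python
import os

# Provider names and key variables as listed above.
REQUIRED_KEY = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def missing_llm_config(env):
    """Return a list of problems with the LLM settings in `env` (empty if none).

    Hypothetical helper for illustration; the repository may validate
    its configuration differently.
    """
    provider = env.get("LLM_PROVIDER", "")
    if provider not in REQUIRED_KEY:
        return [f"LLM_PROVIDER must be one of {sorted(REQUIRED_KEY)}, got {provider!r}"]
    key_name = REQUIRED_KEY[provider]
    if not env.get(key_name):
        return [f"{key_name} is not set for provider {provider!r}"]
    return []

if __name__ == "__main__":
    # Assumes the .env values were exported into the environment beforehand.
    for problem in missing_llm_config(os.environ):
        print(problem)
```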

Running an experiment

# heuristic baseline, 40-round single seed
python scripts/run_experiment.py --condition baseline --mode heuristic --seed 42

# LLM mode (Sonnet 4.6 default)
python scripts/run_experiment.py --condition baseline --mode llm --seed 42

# privacy-ladder ablation
python scripts/run_experiment.py --condition private_only --mode llm --seed 42

Output lands at hf_data_staging/<bucket>/.../seed_N/ in the canonical 7-file slim shape (config.json, metadata.json, rounds.jsonl, summary.json, ground_truth.json, game_log.md, dashboard.png).
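A seed directory can be sanity-checked against the seven-file shape listed above. This is a rough sketch, not a script shipped with the repository: the function names are hypothetical, and the only assumption about rounds.jsonl is that it holds one JSON value per line, as the extension suggests. The fields inside each record are not assumed.

```python
import json
from pathlib import Path

# File names taken from the seven-file shape described above.
EXPECTED_FILES = [
    "config.json",
    "metadata.json",
    "rounds.jsonl",
    "summary.json",
    "ground_truth.json",
    "game_log.md",
    "dashboard.png",
]

def missing_files(seed_dir):
    """Return the expected file names that are absent from seed_dir."""
    seed_dir = Path(seed_dir)
    return [name for name in EXPECTED_FILES if not (seed_dir / name).is_file()]

def load_rounds(seed_dir):
    """Parse rounds.jsonl, assuming one JSON value per non-empty line."""
    rounds = []
    with open(Path(seed_dir) / "rounds.jsonl", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rounds.append(json.loads(line))
    return rounds
```

Calling `missing_files("path/to/seed_dir")` on a complete run should return an empty list; anything else names the files that were not written.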

Run python scripts/run_experiment.py --help for the full condition list.

Reproducing paper results

The dataset of paper-supporting runs is hosted on HuggingFace at evaluation-ecosystem/evaluation-ecosystem-data, released under CC-BY-4.0.

To regenerate plots or re-run a saved config:

# Re-run a saved experiment from its config.json
python scripts/rerun_experiment.py path/to/seed_dir/config.json

# Regenerate plots for an existing seed dir
python scripts/replot_experiment.py path/to/seed_dir

Conditions reported in the paper:

  • Privacy ladder — 5 conditions (public_only, baseline, private_dominant, private_only, iid_holdout) × 10 Sonnet seeds with paired Opus / GPT-5.5 cross-model coverage
  • Sequence robustness — 9 ecosystem-composition counterfactuals (s0–s8) × 5 privacy rungs
  • Structural ablations — no_incidents, no_funders, no_regulator, no_media, no_opensource, homogeneous_consumers, initial_uniform_capability, plus interaction conditions
  • Evaluator capture — case study at evaluation_lag=0
  • Exogenous shock — ev1_deepseek_shock capability-release validation
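A privacy-ladder sweep amounts to a grid of conditions and seeds passed to the runner. The sketch below only builds the command lines and does not execute them. It uses the --condition, --mode, and --seed flags shown earlier; the seed values 0 through 9 are an assumption (the actual paper seeds are not stated here), and it assumes every ladder rung is accepted as a --condition value, which `--help` can confirm.

```python
import itertools

# Condition names from the privacy-ladder bullet above.
PRIVACY_LADDER = [
    "public_only",
    "baseline",
    "private_dominant",
    "private_only",
    "iid_holdout",
]

def sweep_commands(conditions, seeds, mode="llm"):
    """Build one run_experiment.py command per (condition, seed) pair."""
    return [
        ["python", "scripts/run_experiment.py",
         "--condition", condition, "--mode", mode, "--seed", str(seed)]
        for condition, seed in itertools.product(conditions, seeds)
    ]

# Seeds 0..9 are placeholders, not the seeds used in the paper.
commands = sweep_commands(PRIVACY_LADDER, range(10))
print(len(commands))  # 5 conditions x 10 seeds = 50
```

Each entry can be handed to `subprocess.run(cmd, check=True)` from the repository root to execute the run.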

Project structure

src/                  Core simulation library (actors, scoring, visibility, incidents, LLM integration)
scripts/              CLI entry points + analysis tools
  run_experiment.py       main experiment runner
  rerun_experiment.py     re-run a saved config
  build_hf_metadata.py    regenerate dataset registry + manifest
  plots/                  paper-figure generators
  dashboard/              single-run 9-panel dashboard generator
  aggregate/              cross-run aggregation helpers
docs/
  stakeholders.md         canonical architecture reference
  case_studies/           policy-adjacent case study designs
website/              Interactive explorer (static; future Pyodide / BYOK LLM tiers)
  client/                 HTML/JS/CSS + bundled run data
external-validation/  Real-world data + validation scripts

Documentation

  • docs/stakeholders.md — full architecture reference (actors, state, scoring formula, parameters, design rationale)
  • docs/case_studies/ — designed and worked case studies: privacy ladder, evaluator capture, media shadow, benchmark sponsorship, audit & verification, transparency mandates, EU vs US regulation

Citation

Citation details are withheld during double-blind review.
