AI Evaluation Ecosystem Simulation

A generative agent-based model (GABM) of the AI evaluation ecosystem: model providers, evaluators, consumers, regulators, funders, and media. Provider strategy emerges from portfolio allocation (R&D / safety / product) and per-benchmark focus weights. Goodhart-style divergence between benchmark scores and consumer satisfaction arises when benchmark-need weights misalign with consumer-need weights, not from any explicit gaming lever.
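The divergence mechanism can be illustrated with a toy calculation. The sketch below is not taken from the simulation's source: the function name, the two needs, and all numbers are hypothetical, and the real scoring formula (documented in docs/stakeholders.md) may differ. It only shows how a single capability profile can rank high under benchmark weights and low under consumer weights.

```python
# Hypothetical illustration; names and numbers are not from the simulation's source.

def weighted_score(capabilities, weights):
    """Weighted average of per-need capabilities."""
    total = sum(weights.values())
    return sum(capabilities[need] * w for need, w in weights.items()) / total

# Assumed setup: the benchmark emphasizes "reasoning",
# while consumers mostly care about "reliability".
benchmark_weights = {"reasoning": 0.8, "reliability": 0.2}
consumer_weights = {"reasoning": 0.2, "reliability": 0.8}

# Two providers with the same total capability, allocated differently.
benchmark_focused = {"reasoning": 0.9, "reliability": 0.3}
consumer_focused = {"reasoning": 0.3, "reliability": 0.9}

for name, caps in [("benchmark_focused", benchmark_focused),
                   ("consumer_focused", consumer_focused)]:
    score = weighted_score(caps, benchmark_weights)
    satisfaction = weighted_score(caps, consumer_weights)
    print(f"{name}: score={score:.2f} satisfaction={satisfaction:.2f} "
          f"gap={score - satisfaction:+.2f}")
```

In this toy setup the benchmark-focused provider tops the leaderboard while delivering lower consumer satisfaction, and no gaming term appears anywhere in the calculation.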

The simulation supports two execution modes:

Heuristic mode — deterministic rules at every actor; no-LLM baseline.
LLM mode — LLM-driven planning at provider, regulator, funder, and (optionally) evaluator actors.

This project follows in the lineage of Generative Agents and the broader generative agent-based modeling tradition.

Explore runs interactively

A static-first browser for the bundled paper runs lives under website/: condition card grid, single-run animation, and multi-run comparison. Once deployed, it will live at https://aimslab.stanford.edu/evaluarium/. See website/README.md for pages, URL formats, and local-serving instructions.

Installation

pip install -r requirements.txt

For LLM mode, configure a provider in .env:

LLM_PROVIDER=anthropic + ANTHROPIC_API_KEY=...
LLM_PROVIDER=openai + OPENAI_API_KEY=...
LLM_PROVIDER=gemini + GOOGLE_API_KEY=...
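Before launching an LLM-mode run, it can help to confirm that the provider and its matching key are both set. The check below is a minimal sketch: the provider-to-key mapping comes from the list above, but the function name, the lowercase normalization, and the error messages are assumptions, and it reads the process environment rather than parsing .env (how the repository itself loads .env is not shown here).

```python
import os

# Provider-to-key mapping taken from the list above; the check itself is hypothetical.
REQUIRED_KEY = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def missing_llm_settings(env):
    """Return a list of problems with the LLM settings in `env` (empty if none)."""
    # Assumption: provider names are compared case-insensitively.
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if not provider:
        return ["LLM_PROVIDER is not set"]
    if provider not in REQUIRED_KEY:
        return [f"unknown LLM_PROVIDER: {provider!r}"]
    key = REQUIRED_KEY[provider]
    if not env.get(key):
        return [f"{key} is not set for provider {provider!r}"]
    return []

if __name__ == "__main__":
    # Report any problems found in the current process environment.
    for problem in missing_llm_settings(os.environ):
        print(problem)
```

Passing the mapping in as an argument keeps the check testable with a plain dict instead of the real environment.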

Running an experiment

# heuristic baseline, 40-round single seed
python scripts/run_experiment.py --condition baseline --mode heuristic --seed 42

# LLM mode (Sonnet 4.6 default)
python scripts/run_experiment.py --condition baseline --mode llm --seed 42

# privacy-ladder ablation
python scripts/run_experiment.py --condition private_only --mode llm --seed 42

Output lands at hf_data_staging/<bucket>/.../seed_N/ in the canonical 7-file slim shape (config.json, metadata.json, rounds.jsonl, summary.json, ground_truth.json, game_log.md, dashboard.png).
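A seed directory can be inspected with the standard library alone. The sketch below assumes that rounds.jsonl holds one JSON object per line (the usual JSON Lines convention) and that config.json and summary.json are ordinary JSON files; the helper names are hypothetical, and no field names are assumed because the schema is not described here.

```python
import json
from pathlib import Path

def read_jsonl(lines):
    """Parse an iterable of JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def load_run(seed_dir):
    """Load config.json, summary.json, and rounds.jsonl from one seed directory."""
    seed_dir = Path(seed_dir)
    config = json.loads((seed_dir / "config.json").read_text(encoding="utf-8"))
    summary = json.loads((seed_dir / "summary.json").read_text(encoding="utf-8"))
    with (seed_dir / "rounds.jsonl").open(encoding="utf-8") as fh:
        rounds = read_jsonl(fh)
    return config, summary, rounds
```

For example, `config, summary, rounds = load_run("path/to/seed_dir")` followed by `print(len(rounds))` reports how many round records the run produced.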

Run python scripts/run_experiment.py --help for the full condition list.

Reproducing paper results

The dataset of paper-supporting runs is hosted on HuggingFace at evaluation-ecosystem/evaluation-ecosystem-data, released under CC-BY-4.0.

To regenerate plots or re-run a saved config:

# Re-run a saved experiment from its config.json
python scripts/rerun_experiment.py path/to/seed_dir/config.json

# Regenerate plots for an existing seed dir
python scripts/replot_experiment.py path/to/seed_dir

Conditions reported in the paper:

Privacy ladder — 5 conditions (public_only, baseline, private_dominant, private_only, iid_holdout) × 10 Sonnet seeds with paired Opus / GPT-5.5 cross-model coverage
Sequence robustness — 9 ecosystem-composition counterfactuals (s0–s8) × 5 privacy rungs
Structural ablations — no_incidents, no_funders, no_regulator, no_media, no_opensource, homogeneous_consumers, initial_uniform_capability, plus interaction conditions
Evaluator capture — case study at evaluation_lag=0
Exogenous shock — ev1_deepseek_shock capability-release validation
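A condition-by-seed sweep such as the privacy ladder can be enumerated as a list of run_experiment.py invocations. This is a sketch, not the project's own sweep tooling (reproduce.sh or the scripts may already cover this): the condition names and the --condition, --mode, and --seed flags come from this README, while the helper name and the seed values 0 to 9 are placeholders, since the paper's actual seeds are not listed here. The commands are only printed, not executed.

```python
import itertools
import shlex

# Condition names as listed for the privacy ladder above.
PRIVACY_LADDER = ["public_only", "baseline", "private_dominant",
                  "private_only", "iid_holdout"]

# Placeholder seeds: the paper's actual seed values are not listed in this README.
SEEDS = range(10)

def build_commands(conditions, seeds, mode="llm"):
    """Return one argv list per (condition, seed) pair."""
    return [
        ["python", "scripts/run_experiment.py",
         "--condition", condition, "--mode", mode, "--seed", str(seed)]
        for condition, seed in itertools.product(conditions, seeds)
    ]

# Print the commands for review instead of running them.
for cmd in build_commands(PRIVACY_LADDER, SEEDS):
    print(shlex.join(cmd))
```

Keeping each command as an argv list makes it straightforward to hand to subprocess.run later without shell quoting concerns.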

Project structure

src/                  Core simulation library (actors, scoring, visibility, incidents, LLM integration)
scripts/              CLI entry points + analysis tools
  run_experiment.py       main experiment runner
  rerun_experiment.py     re-run a saved config
  build_hf_metadata.py    regenerate dataset registry + manifest
  plots/                  paper-figure generators
  dashboard/              single-run 9-panel dashboard generator
  aggregate/              cross-run aggregation helpers
docs/
  stakeholders.md         canonical architecture reference
  case_studies/           policy-adjacent case study designs
website/              Interactive explorer (static; future Pyodide / BYOK LLM tiers)
  client/                 HTML/JS/CSS + bundled run data
external-validation/  Real-world data + validation scripts

Documentation

docs/stakeholders.md — full architecture reference (actors, state, scoring formula, parameters, design rationale)
docs/case_studies/ — designed and worked case studies: privacy ladder, evaluator capture, media shadow, benchmark sponsorship, audit & verification, transparency mandates, EU vs US regulation

Citation

Citation details are withheld during double-blind review.
