A generative agent-based model (GABM) of the AI evaluation ecosystem: model providers, evaluators, consumers, regulators, funders, and media. Provider strategy emerges from portfolio allocation (R&D / safety / product) and per-benchmark focus weights. Goodhart-style score--satisfaction divergence arises when benchmark-need weights misalign with consumer-need weights, not from any explicit gaming lever.
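The divergence mechanism described above can be illustrated with a toy calculation. This is a minimal sketch, not the simulator's scoring formula (see docs/stakeholders.md for that): the three needs, the weight values, and the helper names `weighted_outcome` and `invest` are all hypothetical. It only shows that investing along benchmark weights raises the score while lowering satisfaction when the two weight vectors disagree, with no gaming lever involved.

```python
def weighted_outcome(capability, weights):
    """Weighted mean of per-need capability (weights normalized to sum to 1)."""
    total = sum(weights)
    return sum(c * w / total for c, w in zip(capability, weights))


def invest(capability, focus, budget=1.0):
    """Spread a fixed budget across needs in proportion to the focus weights."""
    total = sum(focus)
    return [c + budget * f / total for c, f in zip(capability, focus)]


# Three hypothetical needs; the numbers are illustrative only.
benchmark_weights = [0.7, 0.2, 0.1]  # what the benchmark rewards
consumer_weights = [0.1, 0.2, 0.7]   # what consumers actually value
start = [1.0, 1.0, 1.0]

chasing_benchmark = invest(start, focus=benchmark_weights)
serving_consumers = invest(start, focus=consumer_weights)

# Positive: the benchmark-focused provider wins on score.
score_gap = (weighted_outcome(chasing_benchmark, benchmark_weights)
             - weighted_outcome(serving_consumers, benchmark_weights))
# Negative: the same provider loses on consumer satisfaction.
satisfaction_gap = (weighted_outcome(chasing_benchmark, consumer_weights)
                    - weighted_outcome(serving_consumers, consumer_weights))

print(f"score gap: {score_gap:+.2f}, satisfaction gap: {satisfaction_gap:+.2f}")
```

If the two weight vectors are set equal, both gaps collapse to zero, which is the sense in which the divergence comes from misalignment alone.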
The simulation supports two execution modes:
- Heuristic mode — deterministic rules at every actor; no-LLM baseline.
- LLM mode — LLM-driven planning at provider, regulator, funder, and (optionally) evaluator actors.
This work follows the lineage of Generative Agents and the broader generative agent-based modeling tradition.
A static-first browser for the bundled paper runs lives under website/, with a condition card grid, single-run animation, and multi-run comparison. Once deployed, it will live at https://aimslab.stanford.edu/evaluarium/. See website/README.md for pages, URL formats, and local-serving instructions.
pip install -r requirements.txt

For LLM mode, configure a provider in .env:
- LLM_PROVIDER=anthropic + ANTHROPIC_API_KEY=...
- LLM_PROVIDER=openai + OPENAI_API_KEY=...
- LLM_PROVIDER=gemini + GOOGLE_API_KEY=...
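Before launching an LLM-mode run, it can help to confirm that the provider and its key are paired correctly. The following is a minimal sketch: the helper `missing_llm_settings` is hypothetical and not part of the repository, and it assumes the .env values have already been loaded into the process environment and that provider names are matched exactly as written above.

```python
import os

# Provider -> API key variable, taken from the three pairings listed above.
REQUIRED_KEY = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_llm_settings(env):
    """Return a list of problems with the LLM settings in `env` (empty if none)."""
    provider = env.get("LLM_PROVIDER", "").strip()
    if provider not in REQUIRED_KEY:
        return [f"LLM_PROVIDER must be one of {sorted(REQUIRED_KEY)}, got {provider!r}"]
    key_name = REQUIRED_KEY[provider]
    if not env.get(key_name):
        return [f"{key_name} is not set for provider {provider!r}"]
    return []


# Check the current process environment; an empty list means the pairing is complete.
problems = missing_llm_settings(os.environ)
```

The function takes a mapping rather than reading `os.environ` directly so that it can be exercised with a plain dict.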
# heuristic baseline, 40-round single seed
python scripts/run_experiment.py --condition baseline --mode heuristic --seed 42
# LLM mode (Sonnet 4.6 default)
python scripts/run_experiment.py --condition baseline --mode llm --seed 42
# privacy-ladder ablation
python scripts/run_experiment.py --condition private_only --mode llm --seed 42

Output lands at hf_data_staging/<bucket>/.../seed_N/ in the canonical 7-file slim shape (config.json, metadata.json, rounds.jsonl, summary.json, ground_truth.json, game_log.md, dashboard.png).
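To inspect a finished run, the JSON files in a seed directory can be read with the standard library. This is a sketch under stated assumptions: `load_seed_dir` is a hypothetical helper, the .json files are assumed to each hold a single JSON document, and rounds.jsonl is assumed to follow the usual JSON Lines convention of one JSON value per line. No field names are assumed, since the schema is not described here.

```python
import json
from pathlib import Path


def load_seed_dir(seed_dir):
    """Load config.json, summary.json, and rounds.jsonl from one seed directory."""
    seed_dir = Path(seed_dir)
    config = json.loads((seed_dir / "config.json").read_text(encoding="utf-8"))
    summary = json.loads((seed_dir / "summary.json").read_text(encoding="utf-8"))

    rounds = []
    with (seed_dir / "rounds.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines, e.g. a trailing newline
                rounds.append(json.loads(line))
    return config, summary, rounds


# Example call, with the path left as a placeholder:
# config, summary, rounds = load_seed_dir("path/to/seed_dir")
# print(len(rounds), "round records")
```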
Run python scripts/run_experiment.py --help for the full condition list.
The dataset of paper-supporting runs is hosted on HuggingFace at evaluation-ecosystem/evaluation-ecosystem-data, released under CC-BY-4.0.
To regenerate plots or re-run a saved config:
# Re-run a saved experiment from its config.json
python scripts/rerun_experiment.py path/to/seed_dir/config.json
# Regenerate plots for an existing seed dir
python scripts/replot_experiment.py path/to/seed_dir

Conditions reported in the paper:
- Privacy ladder: 5 conditions (public_only, baseline, private_dominant, private_only, iid_holdout) × 10 Sonnet seeds with paired Opus / GPT-5.5 cross-model coverage
- Sequence robustness: 9 ecosystem-composition counterfactuals (s0–s8) × 5 privacy rungs
- Structural ablations: no_incidents, no_funders, no_regulator, no_media, no_opensource, homogeneous_consumers, initial_uniform_capability, plus interaction conditions
- Evaluator capture: case study at evaluation_lag=0
- Exogenous shock: ev1_deepseek_shock capability-release validation
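As an illustration of the privacy-ladder grid size, the five conditions crossed with ten seeds give 50 runs. The sketch below only builds and prints the command lines, using the `--condition`, `--mode`, and `--seed` flags shown earlier. The seed values 0 to 9 are placeholders, since the seeds actually used for the paper are not listed here, and `ladder_commands` is a hypothetical helper.

```python
import itertools
import shlex

PRIVACY_LADDER = [
    "public_only",
    "baseline",
    "private_dominant",
    "private_only",
    "iid_holdout",
]
SEEDS = range(10)  # assumption: placeholder seeds, not the paper's actual seeds


def ladder_commands(conditions=PRIVACY_LADDER, seeds=SEEDS, mode="llm"):
    """Build one run_experiment.py argument list per (condition, seed) pair."""
    return [
        ["python", "scripts/run_experiment.py",
         "--condition", condition, "--mode", mode, "--seed", str(seed)]
        for condition, seed in itertools.product(conditions, seeds)
    ]


commands = ladder_commands()
print(len(commands))  # 5 conditions x 10 seeds = 50
for argv in commands[:3]:
    print(shlex.join(argv))  # printed only; nothing is executed
```

Each argument list can be passed to `subprocess.run` if actually launching the runs is desired; printing first makes it easy to review the grid before spending LLM budget.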
src/                       Core simulation library (actors, scoring, visibility, incidents, LLM integration)
scripts/                   CLI entry points + analysis tools
  run_experiment.py        main experiment runner
  rerun_experiment.py      re-run a saved config
  build_hf_metadata.py     regenerate dataset registry + manifest
  plots/                   paper-figure generators
  dashboard/               single-run 9-panel dashboard generator
  aggregate/               cross-run aggregation helpers
docs/
  stakeholders.md          canonical architecture reference
  case_studies/            policy-adjacent case study designs
website/                   Interactive explorer (static; future Pyodide / BYOK LLM tiers)
  client/                  HTML/JS/CSS + bundled run data
external-validation/       Real-world data + validation scripts
- docs/stakeholders.md: full architecture reference (actors, state, scoring formula, parameters, design rationale)
- docs/case_studies/: designed and worked case studies (privacy ladder, evaluator capture, media shadow, benchmark sponsorship, audit & verification, transparency mandates, EU vs US regulation)
Citation details are withheld during double-blind review.