This directory contains all simulation code for the paper.
experiments/
├── reproduce.sh # End-to-end build: pools + 5 simulations in parallel
├── simulations/
│ ├── core.py # Core primitives (task pool, best response, utility, correction)
│ ├── one_shot.py # One-shot evaluation game
│ ├── repeated.py # Repeated game with correction
│ ├── closedform.py # Closed-form k* diagnostics
│ ├── semisynthetic.py # MMLU-Pro pool builder (FIML + rotation + sparsify)
│ ├── synthetic_pool.py # Fully-synthetic stratified pool builder
│ └── __init__.py
├── figures/ # Generated figures (.pdf) and pool caches (.npz)
└── README.md
All commands should be run from the experiment/ directory:
cd experiment
The canonical entry point is reproduce.sh, which builds both pools (cached if present) and runs all five simulations in parallel:
bash reproduce.sh # full MMLU-Pro pool, T=200
FAST=1 bash reproduce.sh # pre-FIML subsample to 400 (faster fit)
REBUILD_POOL=1 bash reproduce.sh # force-rebuild the cached pools
This produces 12 figures in figures/ (and copies them to ../figures/ for the paper).
semisynthetic.py pulls long-form binary responses from aims-foundations/measurement-db-private on the HuggingFace Hub (requires huggingface-cli login for the private repo), then runs FIML + varimax rotation + positive-unit-ball projection + threshold sparsification + axis mass-sort. See simulations/semisynthetic.py for the full pipeline.
# Full pool (FIML on all ~13.5k items; needs GPU)
python -m simulations.semisynthetic --benchmarks mmlupro \
--device cuda --fa_max_epochs 3000 --n_pool 400 --save_heatmaps
# Smoke test (subsample 400 items pre-FIML)
python -m simulations.semisynthetic --benchmarks mmlupro --subsample 400 --n_pool 400
Key flags:
- --benchmarks: comma-separated benchmark slugs (required)
- --n_factors auto: use parallel analysis to choose d (default)
- --tau: relative-contribution threshold for sparsification (default 0.1)
- --n_pool: post-sparsification subsample size (default: keep all)
- --subsample: pre-FIML item subsample (for fast iteration)
- --device cuda: GPU acceleration for FIML
- --save_heatmaps: also write loading-stage heatmaps and the parallel-analysis scree plot
For the fully-synthetic stratified baseline (used in the appendix), use simulations.synthetic_pool instead:
python -m simulations.synthetic_pool --d 8 --n_per_layer 80
The full pipeline (MMLU-Pro: 48 subjects × 13,542 items, ~565k observations) requires a GPU node for the FIML step. Parallel analysis is fast (<1s) thanks to the dual eigenvalue trick (see below), so the bottleneck is FIML fitting.
What we know:
- MMLU-Pro: 48 subjects, 13,542 items
- Parallel analysis (first-crossing rule) → d = 8 factors (see figures/fig_parallel_analysis.pdf)
- FIML fitting on 13.5k items × 48 subjects × 8 factors needs GPU
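The first-crossing rule mentioned above can be illustrated with a short sketch. It assumes the usual reading of the rule: keep factors up to the first position where the observed eigenvalue no longer exceeds the reference eigenvalue from random data. The function name first_crossing and the toy numbers are hypothetical; the real logic lives in parallel_analysis in simulations/semisynthetic.py and may differ in detail.

```python
import numpy as np

def first_crossing(observed, reference):
    """Count leading factors whose observed eigenvalue exceeds the reference.

    Both inputs are eigenvalues sorted in descending order; `reference` would
    typically be a high percentile of eigenvalues obtained from random data.
    """
    observed = np.asarray(observed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    below = observed <= reference  # True where the observed curve has crossed
    # Stop at the first crossing; if there is none, retain every factor.
    return int(np.argmax(below)) if below.any() else len(observed)

# Toy eigenvalues (illustrative only, not taken from MMLU-Pro).
obs = [5.0, 3.0, 1.2, 1.15, 1.1]
ref = [2.0, 1.5, 1.3, 1.1, 1.0]
d = first_crossing(obs, ref)
print(f"Retained factors: {d}")
```

In this toy case the observed curve rises above the reference again after the first crossing. The first-crossing rule ignores those later eigenvalues, whereas simply counting every eigenvalue above the reference would retain more factors.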
The cluster invocation matches the reproduce.sh pool-build step:
python -m simulations.semisynthetic \
--benchmarks mmlupro \
--device cuda --fa_max_epochs 3000 \
--n_pool 400 --save_heatmaps \
--pool_path figures/pool_mmlupro.npz
This will:
- Run parallel analysis to determine d (instant via dual eigenvalue trick, see below)
- Fit FIML with d=8 factors (GPU-accelerated via --device cuda)
- Apply quartimin (or varimax fallback) rotation for simple structure
- Project rotated loadings to positive unit ball → z-vectors
- Threshold-sparsify (zero entries with relative squared contribution < --tau) and renormalize
- Mass-sort axes (so axis 0 is the most heavily loaded skill)
- Optionally subsample post-FIML to --n_pool items
- Export figures/pool_mmlupro.npz for downstream scripts
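The post-rotation steps in the list above (projection, sparsification, mass-sort) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in simulations/semisynthetic.py. The function name loadings_to_z is hypothetical. The projection is assumed to mean clipping negative loadings to zero and scaling each row to unit L2 norm. "Axis mass" is assumed to be the column sum of squared entries.

```python
import numpy as np

def _unit_rows(a):
    """Scale each row to unit L2 norm, leaving all-zero rows untouched."""
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    return a / np.where(norms > 0, norms, 1.0)

def loadings_to_z(loadings, tau=0.1):
    """Turn rotated loadings (n_items x d) into sparse, non-negative z-vectors."""
    # Assumed projection: keep the positive part, then normalize each row.
    z = _unit_rows(np.clip(np.asarray(loadings, dtype=float), 0.0, None))
    # Relative squared contribution of each axis within its row.
    sq = z ** 2
    total = sq.sum(axis=1, keepdims=True)
    rel = sq / np.where(total > 0, total, 1.0)
    # Zero the entries below the threshold, then renormalize.
    z = _unit_rows(np.where(rel < tau, 0.0, z))
    # Mass-sort: reorder columns so axis 0 carries the most squared loading.
    order = np.argsort(-(z ** 2).sum(axis=0), kind="stable")
    return z[:, order]

# Toy rotated loadings for three items and three factors (illustrative only).
toy = np.array([[0.90, -0.20, 0.10],
                [0.10,  0.80, 0.00],
                [0.05,  0.70, 0.60]])
z = loadings_to_z(toy, tau=0.1)
print(np.round(z, 3))
```

With tau = 0.1, small cross-loadings such as the 0.10 entries are dropped, so each item ends up loading on only one or two axes.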
After the pool is exported, copy pool_mmlupro.npz back to your local machine and regenerate all figures via reproduce.sh (which runs the five downstream scripts in parallel and copies output to ../figures/).
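Before regenerating figures, it can help to confirm that the copied pool file loads cleanly. The sketch below lists the arrays stored in an .npz file with their shapes and dtypes. The helper describe_pool is hypothetical, and the array names in the stand-in file ("z", "weights") are placeholders that need not match what semisynthetic.py actually writes.

```python
import tempfile
from pathlib import Path

import numpy as np

def describe_pool(path):
    """Map each array name in an .npz file to its (shape, dtype)."""
    with np.load(path) as pool:
        return {name: (pool[name].shape, str(pool[name].dtype))
                for name in pool.files}

# Stand-in file so the sketch runs anywhere; array names are hypothetical.
toy_path = Path(tempfile.mkdtemp()) / "pool_toy.npz"
np.savez(toy_path, z=np.zeros((400, 8)), weights=np.ones(400))

for name, (shape, dtype) in describe_pool(toy_path).items():
    print(name, shape, dtype)
```

To check the real pool, call describe_pool("figures/pool_mmlupro.npz") from the directory where the commands above are run.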
Parallel analysis eigendecomposes the item correlation matrix (n_items × n_items) 20+ times. For MMLU-Pro this would be a 13k × 13k matrix — expensive. But with only 48 subjects, the correlation matrix has rank ≤ 48. We exploit this exactly (not an approximation): the non-zero eigenvalues of Z Z^T (13k × 13k) are identical to those of Z^T Z (48 × 48), by the SVD. This reduces cost from O(n_items³) to O(n_subjects³), making parallel analysis instant (~0.2s for MMLU-Pro).
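The eigenvalue identity can be checked numerically on a small example. The sketch below uses a random complete matrix with toy dimensions. It only demonstrates the linear-algebra fact, not the repository's handling of standardization or missing responses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_subjects = 300, 12  # toy sizes; MMLU-Pro is roughly 13.5k x 48
Z = rng.standard_normal((n_items, n_subjects))

# Large problem: n_items x n_items, rank at most n_subjects.
big = np.sort(np.linalg.eigvalsh(Z @ Z.T))[::-1]
# Dual problem: n_subjects x n_subjects.
small = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]

# The leading n_subjects eigenvalues agree; the rest of `big` are numerically zero.
same_spectrum = np.allclose(big[:n_subjects], small)
rest_is_zero = np.allclose(big[n_subjects:], 0.0, atol=1e-6)
print(same_spectrum, rest_is_zero)
```

Only the small matrix needs to be decomposed in practice. The large decomposition appears here purely to verify the equality.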
To check the recommended number of factors without running the full pipeline (works locally, no GPU needed):
from simulations.semisynthetic import load_responses_hf, index_long_form, parallel_analysis
import numpy as np
long_df = load_responses_hf(["mmlupro"])
ix = index_long_form(long_df)
d = parallel_analysis(ix["subject_idx"], ix["item_idx"], ix["response"],
ix["n_subjects"], ix["n_items"],
n_replications=20, percentile=95,
rng=np.random.default_rng(0))
print(f"Recommended factors: {d}") # Expected: 8
Compares ERM at varying developer awareness vs deterministic transparency.
python -m simulations.one_shot --seed 0 --pool_path figures/pool_mmlupro.npz
Outputs: fig_oneshot_transparency.pdf, fig_oneshot_erm_vs_transparency.pdf, fig_oneshot_asymmetry.pdf
Runs the T-round repeated game with gap-targeted Gaussian correction. Both the main and synthetic versions just take --pool_path; the --suffix _synthetic flag controls the output filename.
# Main (MMLU-Pro semisynthetic pool)
python -m simulations.repeated --seed 0 --pool_path figures/pool_mmlupro.npz
# Appendix (fully-synthetic stratified pool, built via simulations.synthetic_pool)
python -m simulations.repeated --seed 0 --pool_path figures/pool_synthetic.npz --suffix _synthetic
Outputs: fig_repeated_delta_vs_k.pdf, fig_repeated_rho_k_colormap.pdf, fig_repeated_rho_k_colormap_total.pdf, fig_repeated_validation.pdf
Validates the closed-form optimal sample size against empirical loss.
# Main (MMLU-Pro)
python -m simulations.closedform --seed 0 --pool_path figures/pool_mmlupro.npz --T 200 --rhos 0.01 0.1
# Appendix (synthetic stratified)
python -m simulations.closedform --seed 0 --pool_path figures/pool_synthetic.npz --T 200 --rhos 0.01 0.1 --suffix _synthetic
Outputs: fig_repeated_closedform_cumulative_rho_*.pdf
| Paper Figure | File | Pool |
|---|---|---|
| Fig 2 (one-shot ERM vs transparency, wrapped) | fig_oneshot_erm_vs_transparency.pdf | MMLU-Pro semisynthetic |
| Fig 3 ( | fig_repeated_delta_vs_k.pdf | MMLU-Pro semisynthetic |
| Fig 4 ( | fig_repeated_rho_k_colormap.pdf | MMLU-Pro semisynthetic |
| App: parallel-analysis scree | fig_parallel_analysis.pdf | MMLU-Pro semisynthetic |
| App: loading-stage heatmaps | fig_loadings_{1..4}.pdf | MMLU-Pro semisynthetic |
| App: one-shot transparency | fig_oneshot_transparency.pdf | MMLU-Pro semisynthetic |
| App: one-shot asymmetry | fig_oneshot_asymmetry.pdf | MMLU-Pro semisynthetic |
| App: bound validation | fig_repeated_validation.pdf | MMLU-Pro semisynthetic |
| App: total-cost colormap | fig_repeated_rho_k_colormap_total.pdf | MMLU-Pro semisynthetic |
| App: closed-form | fig_repeated_closedform_cumulative_rho_*.pdf | MMLU-Pro semisynthetic |
| App: synthetic baselines | *_synthetic.pdf variants | Synthetic stratified ( |
reproduce.sh copies all PDFs from figures/ to ../figures/ automatically. To copy manually:
cp figures/fig_*.pdf ../figures/