Experiments Directory

This directory contains all simulation code for the paper.

Structure

experiments/
├── reproduce.sh             # End-to-end build: pools + 5 simulations in parallel
├── simulations/
│   ├── core.py              # Core primitives (task pool, best response, utility, correction)
│   ├── one_shot.py          # One-shot evaluation game
│   ├── repeated.py          # Repeated game with correction
│   ├── closedform.py        # Closed-form k* diagnostics
│   ├── semisynthetic.py     # MMLU-Pro pool builder (FIML + rotation + sparsify)
│   ├── synthetic_pool.py    # Fully-synthetic stratified pool builder
│   └── __init__.py
├── figures/                 # Generated figures (.pdf) and pool caches (.npz)
└── README.md

Running the Simulations

All commands should be run from the experiments/ directory:

cd experiments

Reproducing all figures

The canonical entry point is reproduce.sh, which builds both pools (cached if present) and runs all five simulations in parallel:

bash reproduce.sh                # full MMLU-Pro pool, T=200
FAST=1 bash reproduce.sh         # pre-FIML subsample to 400 (faster fit)
REBUILD_POOL=1 bash reproduce.sh # force-rebuild the cached pools

This produces 12 figures in figures/ (and copies them to ../figures/ for the paper).

Semisynthetic pool generation

semisynthetic.py pulls long-form binary responses from aims-foundations/measurement-db-private on the Hugging Face Hub (the repo is private, so run huggingface-cli login first). It then runs FIML, rotation (quartimin, with a varimax fallback), positive-unit-ball projection, threshold sparsification, and an axis mass-sort. See simulations/semisynthetic.py for the full pipeline.

# Full pool (FIML on all ~13.5k items; needs GPU)
python -m simulations.semisynthetic --benchmarks mmlupro \
    --device cuda --fa_max_epochs 3000 --n_pool 400 --save_heatmaps

# Smoke test (subsample 400 items pre-FIML)
python -m simulations.semisynthetic --benchmarks mmlupro --subsample 400 --n_pool 400

Key flags:

--benchmarks: comma-separated benchmark slugs (required)
--n_factors auto: use parallel analysis to choose d (default)
--tau: relative-contribution threshold for sparsification (default 0.1)
--n_pool: post-sparsification subsample size (default: keep all)
--subsample: pre-FIML item subsample (for fast iteration)
--device cuda: GPU acceleration for FIML
--save_heatmaps: also write loading-stage heatmaps and the parallel-analysis scree plot

For the fully-synthetic stratified baseline (used in the appendix), use simulations.synthetic_pool instead:

python -m simulations.synthetic_pool --d 8 --n_per_layer 80

Running on the cluster

The full pipeline (MMLU-Pro: 48 subjects × 13,542 items, ~565k observations) requires a GPU node for the FIML step. Parallel analysis is fast (<1s) thanks to the dual eigenvalue trick (see below), so the bottleneck is FIML fitting.

What we know:

MMLU-Pro: 48 subjects, 13,542 items
Parallel analysis (first-crossing rule) → d = 8 factors (see figures/fig_parallel_analysis.pdf)
FIML fitting on 13.5k items × 48 subjects × 8 factors needs GPU
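
To illustrate the first-crossing rule named above, here is a minimal sketch; it is not the code in simulations/semisynthetic.py. It assumes the rule keeps factors until the first observed eigenvalue that fails to exceed the per-rank percentile of the null eigenvalues. The function name first_crossing and the array layout (one row per replication, eigenvalues sorted in descending order) are hypothetical.

```python
import numpy as np

def first_crossing(observed, null_eigs, percentile=95):
    """Count leading observed eigenvalues that beat the null threshold,
    stopping at the first one that does not (hypothetical helper)."""
    observed = np.sort(np.asarray(observed, dtype=float))[::-1]
    # Per-rank threshold: percentile across replications (rows = replications).
    threshold = np.percentile(np.asarray(null_eigs, dtype=float), percentile, axis=0)
    below = observed <= threshold[: observed.size]
    # argmax returns the index of the first True; if nothing crosses, keep all.
    return int(np.argmax(below)) if below.any() else int(observed.size)

# Toy spectrum: the third eigenvalue is the first to fall under its threshold.
observed = [5.0, 3.0, 0.9, 0.8]
null_eigs = np.array([[1.5, 1.2, 1.0, 0.7],
                      [1.4, 1.3, 1.1, 0.6]])
print(first_crossing(observed, null_eigs))  # 2
```

Note that the fourth toy eigenvalue exceeds its own threshold but is still excluded, which is what distinguishes a first-crossing rule from simply counting every eigenvalue above the null percentile.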

The cluster invocation matches the reproduce.sh pool-build step:

python -m simulations.semisynthetic \
    --benchmarks mmlupro \
    --device cuda --fa_max_epochs 3000 \
    --n_pool 400 --save_heatmaps \
    --pool_path figures/pool_mmlupro.npz

This will:

Run parallel analysis to determine d (instant via dual eigenvalue trick, see below)
Fit FIML with d=8 factors (GPU-accelerated via --device cuda)
Apply quartimin (or varimax fallback) rotation for simple structure
Project rotated loadings to positive unit ball → z-vectors
Threshold-sparsify (zero entries with relative squared contribution < --tau) and renormalize
Mass-sort axes (so axis 0 is the most heavily loaded skill)
Optionally subsample post-FIML to --n_pool items
Export figures/pool_mmlupro.npz for downstream scripts
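
The sparsification and mass-sort steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: it assumes the input rows are already non-negative (that is, the projection step has run), that "renormalize" means rescaling each row to unit Euclidean norm, and that an axis's "mass" is its column sum of squared loadings. The name sparsify_and_sort is hypothetical.

```python
import numpy as np

def sparsify_and_sort(Z, tau=0.1):
    """Z: (n_items, d) non-negative loadings, one row per item (assumed layout)."""
    Z = np.asarray(Z, dtype=float)
    sq = Z ** 2
    row_mass = sq.sum(axis=1, keepdims=True)
    # Relative squared contribution of each axis to its item.
    rel = np.divide(sq, row_mass, out=np.zeros_like(sq), where=row_mass > 0)
    # Zero every entry whose relative contribution is below tau.
    Z_sparse = np.where(rel < tau, 0.0, Z)
    # Assumption: renormalize each row to unit Euclidean norm.
    norms = np.linalg.norm(Z_sparse, axis=1, keepdims=True)
    Z_sparse = np.divide(Z_sparse, norms, out=np.zeros_like(Z_sparse), where=norms > 0)
    # Assumption: axis mass = column sum of squares; heaviest axis becomes axis 0.
    order = np.argsort(-(Z_sparse ** 2).sum(axis=0), kind="stable")
    return Z_sparse[:, order]
```

Because the largest relative contribution in a row is always at least 1/d, any tau at or below 1/d leaves every non-zero item with at least one surviving entry.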

After the pool is exported, copy pool_mmlupro.npz back to your local machine and regenerate all figures via reproduce.sh (which runs the five downstream scripts in parallel and copies output to ../figures/).
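
Before launching the downstream scripts, it can be worth checking that the copied pool file is intact. The sketch below only lists whatever arrays the archive contains and their shapes; it makes no assumption about the key names that semisynthetic.py writes. It does assume the file holds plain numeric arrays (NumPy refuses to load pickled objects by default), and summarize_npz is a hypothetical helper name.

```python
import numpy as np

def summarize_npz(path):
    """Map each array name stored in an .npz archive to its shape."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}

# Example call (path taken from the cluster invocation above):
# print(summarize_npz("figures/pool_mmlupro.npz"))
```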

Parallel analysis: dual eigenvalue trick

Parallel analysis eigendecomposes the item correlation matrix (n_items × n_items) 20+ times. For MMLU-Pro this would be a roughly 13k × 13k matrix, which is expensive. With only 48 subjects, however, the correlation matrix has rank ≤ 48. We exploit this exactly (it is not an approximation): for an n_items × n_subjects matrix Z, the non-zero eigenvalues of Z Z^T (13k × 13k) are identical to those of Z^T Z (48 × 48), because both are the squared singular values of Z. This reduces the eigendecomposition cost from O(n_items³) to O(n_subjects³), making parallel analysis near-instant (~0.2s for MMLU-Pro).
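
The identity is easy to check numerically. The snippet below is a toy demonstration on a random matrix with made-up dimensions, not the routine used in simulations/semisynthetic.py; it compares the spectrum of the large Gram matrix with that of the small one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_subjects = 300, 12          # toy sizes, not the MMLU-Pro dimensions
Z = rng.standard_normal((n_items, n_subjects))

big = np.linalg.eigvalsh(Z @ Z.T)      # n_items x n_items problem
small = np.linalg.eigvalsh(Z.T @ Z)    # n_subjects x n_subjects problem

# eigvalsh returns eigenvalues in ascending order, so the last n_subjects
# entries of the big spectrum are its only non-zero eigenvalues.
top_big = big[-n_subjects:]
print(np.allclose(top_big, small))                       # spectra agree
print(np.allclose(big[:-n_subjects], 0.0, atol=1e-8))    # the rest are numerically zero
```

Only the small eigendecomposition is needed in practice, so the large product never has to be formed.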

Parallel analysis only (standalone)

To check the recommended number of factors without running the full pipeline (works locally, no GPU needed):

from simulations.semisynthetic import load_responses_hf, index_long_form, parallel_analysis
import numpy as np

long_df = load_responses_hf(["mmlupro"])
ix = index_long_form(long_df)
d = parallel_analysis(ix["subject_idx"], ix["item_idx"], ix["response"],
                      ix["n_subjects"], ix["n_items"],
                      n_replications=20, percentile=95,
                      rng=np.random.default_rng(0))
print(f"Recommended factors: {d}")  # Expected: 8

One-shot game (Figure 2)

Compares ERM at varying levels of developer awareness against deterministic transparency.

python -m simulations.one_shot --seed 0 --pool_path figures/pool_mmlupro.npz

Outputs: fig_oneshot_transparency.pdf, fig_oneshot_erm_vs_transparency.pdf, fig_oneshot_asymmetry.pdf

Repeated game (Figures 3–4)

Runs the T-round repeated game with gap-targeted Gaussian correction. The main and synthetic versions differ only in the --pool_path they are given; passing --suffix _synthetic appends that suffix to the output filenames (the *_synthetic.pdf variants).

# Main (MMLU-Pro semisynthetic pool)
python -m simulations.repeated --seed 0 --pool_path figures/pool_mmlupro.npz

# Appendix (fully-synthetic stratified pool, built via simulations.synthetic_pool)
python -m simulations.repeated --seed 0 --pool_path figures/pool_synthetic.npz --suffix _synthetic

Outputs: fig_repeated_delta_vs_k.pdf, fig_repeated_rho_k_colormap.pdf, fig_repeated_rho_k_colormap_total.pdf, fig_repeated_validation.pdf

Closed-form k* diagnostics (Appendix)

Validates the closed-form optimal sample size against empirical loss.

# Main (MMLU-Pro)
python -m simulations.closedform --seed 0 --pool_path figures/pool_mmlupro.npz --T 200 --rhos 0.01 0.1

# Appendix (synthetic stratified)
python -m simulations.closedform --seed 0 --pool_path figures/pool_synthetic.npz --T 200 --rhos 0.01 0.1 --suffix _synthetic

Outputs: fig_repeated_closedform_cumulative_rho_*.pdf

Figure-to-Paper Mapping

Paper Figure	File	Pool
Fig 2 (one-shot ERM vs transparency, wrapped)	`fig_oneshot_erm_vs_transparency.pdf`	MMLU-Pro semisynthetic
Fig 3 ($\Delta_t$ vs time)	`fig_repeated_delta_vs_k.pdf`	MMLU-Pro semisynthetic
Fig 4 ($\rho, k$ colormap, wrapped)	`fig_repeated_rho_k_colormap.pdf`	MMLU-Pro semisynthetic
App: parallel-analysis scree	`fig_parallel_analysis.pdf`	MMLU-Pro semisynthetic
App: loading-stage heatmaps	`fig_loadings_{1..4}.pdf`	MMLU-Pro semisynthetic
App: one-shot transparency	`fig_oneshot_transparency.pdf`	MMLU-Pro semisynthetic
App: one-shot asymmetry	`fig_oneshot_asymmetry.pdf`	MMLU-Pro semisynthetic
App: bound validation	`fig_repeated_validation.pdf`	MMLU-Pro semisynthetic
App: total-cost colormap	`fig_repeated_rho_k_colormap_total.pdf`	MMLU-Pro semisynthetic
App: closed-form $k^*$	`fig_repeated_closedform_cumulative_rho_*.pdf`	MMLU-Pro semisynthetic
App: synthetic baselines	`*_synthetic.pdf` variants	Synthetic stratified ($d=8$, $

Copying Figures to Paper

reproduce.sh copies all PDFs from figures/ to ../figures/ automatically. To copy manually:

cp figures/fig_*.pdf ../figures/
