aims-foundations/strategic_evaluation

Experiments Directory

This directory contains all simulation code for the paper.

Structure

experiments/
├── reproduce.sh             # End-to-end build: pools + 5 simulations in parallel
├── simulations/
│   ├── core.py              # Core primitives (task pool, best response, utility, correction)
│   ├── one_shot.py          # One-shot evaluation game
│   ├── repeated.py          # Repeated game with correction
│   ├── closedform.py        # Closed-form k* diagnostics
│   ├── semisynthetic.py     # MMLU-Pro pool builder (FIML + rotation + sparsify)
│   ├── synthetic_pool.py    # Fully-synthetic stratified pool builder
│   └── __init__.py
├── figures/                 # Generated figures (.pdf) and pool caches (.npz)
└── README.md

Running the Simulations

All commands should be run from the experiments/ directory:

cd experiments

Reproducing all figures

The canonical entry point is reproduce.sh, which builds both pools (cached if present) and runs all five simulations in parallel:

bash reproduce.sh                # full MMLU-Pro pool, T=200
FAST=1 bash reproduce.sh         # pre-FIML subsample to 400 (faster fit)
REBUILD_POOL=1 bash reproduce.sh # force-rebuild the cached pools

This produces 12 figures in figures/ (and copies them to ../figures/ for the paper).

Semisynthetic pool generation

semisynthetic.py pulls long-form binary responses from aims-foundations/measurement-db-private on the HuggingFace Hub (the repo is private, so run huggingface-cli login first). It then runs FIML factor analysis, rotation (quartimin, with varimax as a fallback), positive-unit-ball projection, threshold sparsification, and an axis mass-sort. See simulations/semisynthetic.py for the full pipeline.

# Full pool (FIML on all ~13.5k items; needs GPU)
python -m simulations.semisynthetic --benchmarks mmlupro \
    --device cuda --fa_max_epochs 3000 --n_pool 400 --save_heatmaps

# Smoke test (subsample 400 items pre-FIML)
python -m simulations.semisynthetic --benchmarks mmlupro --subsample 400 --n_pool 400

Key flags:

  • --benchmarks: comma-separated benchmark slugs (required)
  • --n_factors auto: use parallel analysis to choose d (default)
  • --tau: relative-contribution threshold for sparsification (default 0.1)
  • --n_pool: post-sparsification subsample size (default: keep all)
  • --subsample: pre-FIML item subsample (for fast iteration)
  • --device cuda: GPU acceleration for FIML
  • --save_heatmaps: also write loading-stage heatmaps and the parallel-analysis scree plot

For the fully-synthetic stratified baseline (used in the appendix), use simulations.synthetic_pool instead:

python -m simulations.synthetic_pool --d 8 --n_per_layer 80

Running on the cluster

The full pipeline (MMLU-Pro: 48 subjects × 13,542 items, ~565k observations) requires a GPU node for the FIML step. Parallel analysis is fast (<1s) thanks to the dual eigenvalue trick (see below), so the bottleneck is FIML fitting.

What we know:

  • MMLU-Pro: 48 subjects, 13,542 items
  • Parallel analysis (first-crossing rule) → d = 8 factors (see figures/fig_parallel_analysis.pdf)
  • FIML fitting on 13.5k items × 48 subjects × 8 factors needs GPU
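As a rough illustration of the first-crossing rule, the sketch below runs a permutation-based parallel analysis on a small, dense toy matrix. It is not the repository's parallel_analysis: the name parallel_analysis_dense, the toy data, and the per-item permutation null are assumptions made here, and the sketch does not handle the missing responses present in the real data.

```python
import numpy as np


def parallel_analysis_dense(X, n_replications=20, percentile=95, rng=None):
    """Hypothetical dense-data sketch of parallel analysis (first-crossing rule).

    X has shape (n_subjects, n_items) with no missing entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_items = X.shape[1]

    # Observed eigenvalues of the item correlation matrix, largest first.
    observed = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

    # Null distribution: permute each item column independently, which keeps
    # every item's marginal distribution but destroys cross-item correlation.
    null = np.empty((n_replications, n_items))
    for r in range(n_replications):
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(n_items)])
        null[r] = np.linalg.eigvalsh(np.corrcoef(Xp, rowvar=False))[::-1]

    threshold = np.percentile(null, percentile, axis=0)

    # First crossing: count leading eigenvalues that stay above the threshold.
    below = np.flatnonzero(observed <= threshold)
    return int(below[0]) if below.size else n_items


# Toy binary responses with two planted factors (sizes are illustrative only).
rng = np.random.default_rng(0)
n_subjects, n_items = 200, 30
loadings = np.zeros((2, n_items))
loadings[0, :15] = 0.8
loadings[1, 15:] = 0.8
latent = rng.standard_normal((n_subjects, 2)) @ loadings
latent += 0.6 * rng.standard_normal((n_subjects, n_items))
X = (latent > 0).astype(float)

d = parallel_analysis_dense(X, n_replications=20, percentile=95, rng=rng)
print(f"Recommended factors on toy data: {d}")
```

On this toy matrix the two planted factors stand well above the permutation threshold, so the rule should stop after the second eigenvalue.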

The cluster invocation matches the reproduce.sh pool-build step:

python -m simulations.semisynthetic \
    --benchmarks mmlupro \
    --device cuda --fa_max_epochs 3000 \
    --n_pool 400 --save_heatmaps \
    --pool_path figures/pool_mmlupro.npz

This will:

  1. Run parallel analysis to determine d (instant via dual eigenvalue trick, see below)
  2. Fit FIML with d=8 factors (GPU-accelerated via --device cuda)
  3. Apply quartimin (or varimax fallback) rotation for simple structure
  4. Project rotated loadings to positive unit ball → z-vectors
  5. Threshold-sparsify (zero entries with relative squared contribution < --tau) and renormalize
  6. Mass-sort axes (so axis 0 is the most heavily loaded skill)
  7. Optionally subsample post-FIML to --n_pool items
  8. Export figures/pool_mmlupro.npz for downstream scripts
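Steps 4 to 6 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code in simulations/semisynthetic.py: the function names are hypothetical, "projection to the positive unit ball" is taken to mean clipping negatives to zero and rescaling rows whose norm exceeds 1, and "renormalize" is taken to mean restoring each row's pre-sparsification norm.

```python
import numpy as np


def project_positive_unit_ball(loadings):
    # Assumption: clip negative loadings to zero, then shrink any row whose
    # Euclidean norm exceeds 1 back onto the unit sphere.
    Z = np.clip(loadings, 0.0, None)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1.0)


def threshold_sparsify(Z, tau=0.1):
    # Zero entries whose share of the row's squared norm is below tau.
    sq = Z ** 2
    total = sq.sum(axis=1, keepdims=True)
    rel = np.divide(sq, total, out=np.zeros_like(sq), where=total > 0)
    Zs = np.where(rel >= tau, Z, 0.0)
    # Assumption: "renormalize" restores each row's original norm.
    old_norm = np.sqrt(total)
    new_norm = np.linalg.norm(Zs, axis=1, keepdims=True)
    factor = np.divide(old_norm, new_norm, out=np.ones_like(old_norm),
                       where=new_norm > 0)
    return Zs * factor


def mass_sort(Z):
    # Reorder axes so that axis 0 carries the largest total squared loading.
    order = np.argsort(-(Z ** 2).sum(axis=0), kind="stable")
    return Z[:, order], order


# Tiny made-up loading matrix: 3 items, 3 factors.
rotated = np.array([[0.9, 0.1, -0.3],
                    [0.2, 0.8, 0.05],
                    [0.1, 0.7, 0.6]])
Z_ball = project_positive_unit_ball(rotated)
Z_sparse = threshold_sparsify(Z_ball, tau=0.1)
Z_sorted, order = mass_sort(Z_sparse)
print(order)
print(np.round(Z_sorted, 3))
```

With tau = 0.1, small cross-loadings are dropped while the two comparable loadings in the last row both survive; the axis with the most mass is then moved to position 0.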

After the pool is exported, copy pool_mmlupro.npz back to your local machine and regenerate all figures via reproduce.sh (which runs the five downstream scripts in parallel and copies output to ../figures/).

Parallel analysis: dual eigenvalue trick

Parallel analysis eigendecomposes the item correlation matrix (n_items × n_items) 20+ times. For MMLU-Pro this would be a 13k × 13k matrix, which is expensive. With only 48 subjects, however, the correlation matrix has rank ≤ 48. We exploit this exactly (it is not an approximation): for the n_items × n_subjects data matrix Z, the non-zero eigenvalues of Z Z^T (13k × 13k) are identical to those of Z^T Z (48 × 48), by the SVD. This reduces the eigendecomposition cost from O(n_items³) to O(n_subjects³), so parallel analysis takes about 0.2s for MMLU-Pro.
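The identity is easy to check numerically. The following sketch uses a toy matrix with 500 items instead of about 13.5k; the sizes, the random continuous data, and the variable names are assumptions for illustration and do not come from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_subjects = 500, 48  # toy sizes; the real pool has ~13.5k items

# Toy continuous "responses", items in rows and subjects in columns.
R = rng.standard_normal((n_items, n_subjects))

# Standardize each item across subjects so that Z @ Z.T / n_subjects
# is the item correlation matrix.
Z = (R - R.mean(axis=1, keepdims=True)) / R.std(axis=1, keepdims=True)

# Large problem (500 x 500) and its dual (48 x 48).
eig_items = np.linalg.eigvalsh(Z @ Z.T / n_subjects)
eig_subjects = np.linalg.eigvalsh(Z.T @ Z / n_subjects)

# eigvalsh returns eigenvalues in ascending order; reverse to largest first.
top_items = eig_items[::-1][:n_subjects]
top_subjects = eig_subjects[::-1]

max_abs_diff = float(np.max(np.abs(top_items - top_subjects)))
rest_max = float(np.max(np.abs(eig_items[::-1][n_subjects:])))
print(f"largest gap between the two spectra: {max_abs_diff:.2e}")
print(f"largest remaining item-side eigenvalue: {rest_max:.2e}")
```

Both printed values should sit at floating-point noise level: the leading 48 eigenvalues agree, and the remaining item-side eigenvalues are numerically zero.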

Parallel analysis only (standalone)

To check the recommended number of factors without running the full pipeline (works locally, no GPU needed):

from simulations.semisynthetic import load_responses_hf, index_long_form, parallel_analysis
import numpy as np

long_df = load_responses_hf(["mmlupro"])
ix = index_long_form(long_df)
d = parallel_analysis(ix["subject_idx"], ix["item_idx"], ix["response"],
                      ix["n_subjects"], ix["n_items"],
                      n_replications=20, percentile=95,
                      rng=np.random.default_rng(0))
print(f"Recommended factors: {d}")  # Expected: 8

One-shot game (Figure 2)

Compares ERM at varying levels of developer awareness against deterministic transparency.

python -m simulations.one_shot --seed 0 --pool_path figures/pool_mmlupro.npz

Outputs: fig_oneshot_transparency.pdf, fig_oneshot_erm_vs_transparency.pdf, fig_oneshot_asymmetry.pdf

Repeated game (Figures 3–4)

Runs the T-round repeated game with gap-targeted Gaussian correction. The main and synthetic runs differ only in --pool_path; passing --suffix _synthetic appends _synthetic to the output filenames.

# Main (MMLU-Pro semisynthetic pool)
python -m simulations.repeated --seed 0 --pool_path figures/pool_mmlupro.npz

# Appendix (fully-synthetic stratified pool, built via simulations.synthetic_pool)
python -m simulations.repeated --seed 0 --pool_path figures/pool_synthetic.npz --suffix _synthetic

Outputs: fig_repeated_delta_vs_k.pdf, fig_repeated_rho_k_colormap.pdf, fig_repeated_rho_k_colormap_total.pdf, fig_repeated_validation.pdf

Closed-form k* diagnostics (Appendix)

Validates the closed-form optimal sample size against empirical loss.

# Main (MMLU-Pro)
python -m simulations.closedform --seed 0 --pool_path figures/pool_mmlupro.npz --T 200 --rhos 0.01 0.1

# Appendix (synthetic stratified)
python -m simulations.closedform --seed 0 --pool_path figures/pool_synthetic.npz --T 200 --rhos 0.01 0.1 --suffix _synthetic

Outputs: fig_repeated_closedform_cumulative_rho_*.pdf
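For readers who prefer a Python driver to a shell script, the five invocations documented in this README can be launched concurrently with subprocess. This is a hypothetical sketch: build_commands and run_all are names introduced here, and whether reproduce.sh runs exactly these five commands is an assumption. Only flags that appear in this README are used.

```python
import subprocess
import sys

# Pool files as named in this README; the key is the --suffix value ("" = none).
POOLS = {
    "": "figures/pool_mmlupro.npz",
    "_synthetic": "figures/pool_synthetic.npz",
}


def build_commands(seed=0):
    """Collect the five documented invocations as argument lists."""
    py = sys.executable
    commands = [
        [py, "-m", "simulations.one_shot",
         "--seed", str(seed), "--pool_path", POOLS[""]],
    ]
    for suffix, pool in POOLS.items():
        extra = ["--suffix", suffix] if suffix else []
        commands.append(
            [py, "-m", "simulations.repeated",
             "--seed", str(seed), "--pool_path", pool] + extra)
        commands.append(
            [py, "-m", "simulations.closedform",
             "--seed", str(seed), "--pool_path", pool,
             "--T", "200", "--rhos", "0.01", "0.1"] + extra)
    return commands


def run_all(commands):
    """Start every command, then wait for all of them; returns exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


commands = build_commands(seed=0)
for cmd in commands:
    print(" ".join(cmd[1:]))  # preview only; nothing is executed here

# To actually launch the runs from the experiments directory:
#     exit_codes = run_all(commands)
```

The preview loop only prints the argument lists, so the block is safe to run without the pools in place; call run_all once both .npz files exist.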

Figure-to-Paper Mapping

| Paper Figure | File | Pool |
|---|---|---|
| Fig 2 (one-shot ERM vs transparency, wrapped) | fig_oneshot_erm_vs_transparency.pdf | MMLU-Pro semisynthetic |
| Fig 3 ($\Delta_t$ vs time) | fig_repeated_delta_vs_k.pdf | MMLU-Pro semisynthetic |
| Fig 4 ($\rho, k$ colormap, wrapped) | fig_repeated_rho_k_colormap.pdf | MMLU-Pro semisynthetic |
| App: parallel-analysis scree | fig_parallel_analysis.pdf | MMLU-Pro semisynthetic |
| App: loading-stage heatmaps | fig_loadings_{1..4}.pdf | MMLU-Pro semisynthetic |
| App: one-shot transparency | fig_oneshot_transparency.pdf | MMLU-Pro semisynthetic |
| App: one-shot asymmetry | fig_oneshot_asymmetry.pdf | MMLU-Pro semisynthetic |
| App: bound validation | fig_repeated_validation.pdf | MMLU-Pro semisynthetic |
| App: total-cost colormap | fig_repeated_rho_k_colormap_total.pdf | MMLU-Pro semisynthetic |
| App: closed-form $k^*$ | fig_repeated_closedform_cumulative_rho_*.pdf | MMLU-Pro semisynthetic |
| App: synthetic baselines | *_synthetic.pdf variants | Synthetic stratified ($d=8$, $ |

Copying Figures to Paper

reproduce.sh copies all PDFs from figures/ to ../figures/ automatically. To copy manually:

cp figures/fig_*.pdf ../figures/
