Item Response Scaling Laws

Item Response Scaling Laws: A Measurement Theory Approach to Generalizable Neural Performance Prediction

This repository contains the code for our ICML 2026 submission applying Item Response Theory (IRT) from educational testing to model LLM scaling behavior.

Overview

Classical neural scaling laws describe how LLM performance improves with increased compute, but they typically treat benchmarks as homogeneous and overlook individual question characteristics. We propose Item Response Scaling Laws, which leverage IRT to model:

Pre-training Downstream Scaling: How model performance on downstream tasks improves as a function of computational effort (FLOPs) during pre-training
Test-time Scaling: How inference performance grows with the number of independent samples at test time

By modeling per-question characteristics (difficulty, discrimination), our approach produces scaling laws that are:

Interpretable: Question difficulties and model abilities have clear meanings
Generalizable: Transfer across questions with different difficulty levels
Reliable: More stable estimates with lower variance than aggregate metrics

Key Contributions

Binary-IRT: Apply IRT with Bernoulli loss to binary response matrices
Beta-IRT: Extend IRT with Beta loss to model empirical probability responses
Large-scale validation:
- Pre-training: 25 models from 6 families, up to 359 checkpoints, 15 NLP datasets
- Test-time: Two datasets with 8+12 models, 10 NLP datasets, up to 10,000 samples per question

Repository Structure

.
├── README.md                              # This file
├── requirements.txt                       # Python dependencies
├── utils.py                              # Core IRT implementations (calibration, CAT, losses)
│
├── 67ff391e1fb5f3e66c8a79c9/             # LaTeX manuscript (ICML 2026 submission)
│   ├── main.tex                          # Main document
│   ├── sections/                         # Paper sections (abstract, intro, method, etc.)
│   └── figures/                          # Paper figures
│
├── downstream/                           # Pre-training downstream scaling experiments
│   ├── pretrain_cat_helm_binary.py      # Binary-IRT CAT (Figure 1)
│   ├── pretrain_metric_tv_binary.py     # Total variation metrics
│   ├── pretrain_organize_and_plotnaivelaw_resmat2.py  # Naive scaling laws (Figure 2)
│   ├── pretrain_cat_resmat2.py          # Beta-IRT CAT (Figure 3)
│   ├── pretrain_metric_resmat2.py       # Beta-IRT metrics
│   ├── downstream_scaleup.py            # Generalization experiments (Figures 4, 7)
│   ├── downstream_scaleup_plot.py       # Plot generation
│   └── downstream_scaleup_plot_heatmap.py  # Heatmap visualization
│
├── monkey/                               # Test-time scaling experiments
│   ├── monkey_query/                    # Query LLMs → ResMAT1 (10k samples, 8 models)
│   ├── monkey_query_2/                  # Query LLMs → ResMAT2 (2.5k samples, 12 models)
│   ├── monkey_gather_data/              # Collect data → ResMAT1 (HELM validation)
│   ├── monkey_gather_data_2/            # Collect data → ResMAT2 (recent models)
│   └── monkey_analysis/                 # Analysis and plotting (uses both ResMAT1 & ResMAT2)
│       ├── testtime_calibrate.py        # Calibrate item parameters
│       ├── testtime_cat.py              # Run CAT on test-time data
│       ├── testtime_aggregate_results.py  # Generate Figures 5, 8, 9, 10
│       └── monkey_analysis_denoise_data.py  # Denoising analysis (Figure 6)
│
├── ai2/                                  # Allen AI DataDecide integration
│   ├── DataDecide/                      # Full DataDecide project (25 corpora, 14 model sizes)
│   │   ├── eval/                        # Evaluation scripts for checkpoints
│   │   ├── pretraining/                 # Training configurations
│   │   ├── release/                     # Model release utilities
│   │   ├── scaling_laws/                # Scaling law analysis notebooks
│   │   ├── viz/                         # Visualization notebooks
│   │   └── README.md                    # DataDecide documentation
│   └── clean_datadecide/                # Data preprocessing pipeline
│       ├── 1_download_and_unzip.py     # Download raw data
│       ├── 2_get_long_table.py         # Convert to long format
│       ├── 3_clean_and_pivot.py        # Clean and pivot data
│       ├── 4_calibrate.py              # Calibrate IRT parameters
│       └── 5_cat.py                    # Run CAT
│
├── gather_ckpt_data/                    # Checkpoint evaluation data collection
│   ├── pretrain_helm_json2csv.py       # Convert HELM JSON outputs to CSV
│   ├── pretrain_helm_csv2matrix.py     # Aggregate into response matrices
│   └── ...
│
└── gather_helm_data/                    # HELM benchmark data collection
    ├── json2csv.py                      # Convert HELM JSON to CSV
    ├── csv2matrix.py                    # Create response matrices
    ├── irt_calibrate.py                 # Calibrate IRT on HELM data
    └── ...

Note on Directory Names

67ff391e1fb5f3e66c8a79c9/: Auto-generated directory name (likely from Overleaf) containing the LaTeX paper
monkey/: Named after "monkey testing" or exhaustive sampling with repeated queries
monkey_*_2/: Not version 2, but generates dataset 2 (ResMAT2) as described in paper Section 4.2

Installation

Prerequisites

⚠️ Important Requirements:

Python 3.10 (Required - Python 3.11+ has dependency compatibility issues)
LaTeX (Required for plot generation with matplotlib)
CUDA (Optional but recommended for faster computation)

System Dependencies

LaTeX is required for rendering plots with mathematical notation. Install before proceeding:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y texlive-latex-base texlive-latex-extra cm-super dvipng

macOS:

brew install --cask mactex
# Or for a minimal installation:
brew install basictex

Windows:

Install MiKTeX or TeX Live

Python Environment Setup

# Create conda environment with Python 3.10
conda create --name deval python=3.10 -y
conda activate deval

# Install dependencies
pip install -r requirements.txt

Verify Installation

Use the provided validation script to check if everything is set up correctly:

python scripts/check_environment.py

This will check:

Python version (should be 3.10.x)
LaTeX installation
Required Python packages
PyTorch and CUDA availability
HuggingFace Hub access
Available disk space

If all checks pass, you're ready to run experiments!

Dependencies

Key packages include:

torch>=2.4.0 - PyTorch for model training and IRT optimization
transformers>=4.30.0 - Hugging Face transformers for LLM inference
scikit-learn>=1.3.0 - Machine learning utilities
matplotlib>=3.7.0, seaborn>=0.12.0, tueplots>=0.2.0 - Visualization
statsmodels>=0.14.0 - Statistical modeling
scipy>=1.10.0 - Scientific computing (for Spearman correlation, etc.)

Optional packages for inference acceleration (requires CUDA):

flash-attn>=2.5.0 - Fast attention implementation
vllm>=0.6.3 - Fast LLM inference

See requirements.txt for complete list.

Quick Start

1. Understanding the Core IRT Implementation

The main IRT functions are in utils.py:

calibrate(probmat, ...): Calibrate item difficulty parameters from probability response matrix
cat_binary_1pl(ys, zs, ...): Run Binary-IRT CAT (Computerized Adaptive Testing); returns (thetas, selected_zs)
beta_nll(...): Beta negative log-likelihood loss for probability responses
IRT models: Rasch (1PL), 2PL models with sigmoid link functions

2. Pre-training Downstream Scaling

To reproduce the pre-training scaling experiments:

cd downstream

# Binary-IRT experiments (Figure 1)
python pretrain_cat_helm_binary.py

# Naive probability scaling laws (Figure 2)
python pretrain_organize_and_plotnaivelaw_resmat2.py

# Beta-IRT experiments (Figure 3)
python pretrain_cat_resmat2.py
python pretrain_metric_resmat2.py

# Generalization experiments (Figures 4, 7)
python downstream_scaleup.py
python downstream_scaleup_plot.py
python downstream_scaleup_plot_heatmap.py

3. Test-time Scaling

To reproduce the test-time scaling experiments:

cd monkey/monkey_analysis

# Run the complete pipeline (Figures 5, 8, 9, 10)
python testtime_calibrate.py      # Step 1: Calibrate item parameters
python testtime_cat.py             # Step 2: Run CAT
python testtime_aggregate_results.py  # Step 3: Aggregate and plot

# Denoising analysis (Figure 6)
python monkey_analysis_denoise_data.py

Reproducing Paper Figures

Pre-training Downstream Scaling

Figure	Description	Script(s)
Figure 1	Binary-IRT vs Accuracy curves	`downstream/pretrain_cat_helm_binary.py`
Figure 2	Naive scaling behavior of p(Correct Choice)	`downstream/pretrain_organize_and_plotnaivelaw_resmat2.py`
Figure 3	Beta-IRT CAT scaling curves	`downstream/pretrain_cat_resmat2.py` → `pretrain_metric_resmat2.py`
Figure 4, 7	Generalization across datasets	`downstream/downstream_scaleup.py` → `downstream_scaleup_plot.py` → `downstream_scaleup_plot_heatmap.py`
Total Variation	Evaluation stability metrics	`downstream/pretrain_metric_tv_binary.py`

Test-time Scaling

Figure	Description	Script(s)
Figure 5, 8, 9, 10	Test-time scaling laws	`monkey/monkey_analysis/testtime_calibrate.py` → `testtime_cat.py` → `testtime_aggregate_results.py`
Figure 6	Denoising analysis	`monkey/monkey_analysis/monkey_analysis_denoise_data.py`

Data

Pre-training Experiments

The experiments use response matrices from:

HELM binary responses: 13 models from 3 families (Pythia, Amber, SmolLM) with up to 359 checkpoints on 13 datasets
Probability responses: 5 model families on 12 multiple-choice datasets (from Schaeffer et al. 2024)

Pre-calibrated item difficulties are obtained from:

HELM leaderboard (42-91 diverse LLMs)
Open LLM Leaderboard (102 diverse LLMs)

Data is hosted on Hugging Face: stair-lab/irsl_downstream_resmat

Test-time Experiments

The paper uses TWO test-time datasets (as described in Section 4.2):

Dataset 1: $P^{\text{testtime}}_1$ (ResMAT1)

Purpose: HELM validation, z-score correlation analysis
Models: 8 LLMs (Llama-3, Qwen3, DeepSeek, Gemma)
Questions: 552 questions from 6 benchmarks (commonsense, legalbench, med_qa, bbq, lsat_qa, legal_support)
Samples: 10,000 per question
Prompting: Few-shot, temperature 1.0
Data source: stair-lab/irsl_testtime_resmat1
Generated by: monkey/monkey_query/ + monkey/monkey_gather_data/

Dataset 2: $P^{\text{testtime}}_2$ (ResMAT2)

Purpose: Recent model experiments, main paper figures
Models: 12 latest LLMs (DeepSeek-R1, Qwen3, QwQ, Gemma-3)
Questions: 120 questions from 4 latest benchmarks (AIME 2024, AIME 2025, MMLU Pro, Global MMLU Lite)
Samples: ~2,500 per question (actual: 2,560 in most cases)
Prompting: Zero-shot chain-of-thought, temperature 0.6
Data source: stair-lab/irsl_testtime_resmat2
Generated by: monkey/monkey_query_2/ + monkey/monkey_gather_data_2/
Quality control: 12 models excluded, 19 MMLU Pro questions excluded (see README)

DataDecide Integration

The ai2/DataDecide/ directory integrates with the DataDecide project:

25 pretraining corpora (Dolma, C4, FineWeb, Falcon, DCLM, etc.)
14 model sizes (4M to 1B parameters)
30k+ model checkpoints
Evaluated on OLMES suite of 10 benchmarks

See ai2/DataDecide/README.md for more details.

Key Concepts

Item Response Theory (IRT)

IRT models the interaction between test-takers (LLMs) and questions:

Rasch Model (1PL):

P(correct | θ, z) = σ(θ - z)

θ (theta): Model ability (higher = more capable)
z: Question difficulty (higher = more difficult)
σ: Sigmoid function

2PL Model:

P(correct | θ, z, a) = σ(a(θ - z))

a: Discrimination parameter (how well the question differentiates abilities)

Binary-IRT vs Beta-IRT

Binary-IRT: Uses Bernoulli loss on binary responses (correct/incorrect)
- Good for: Traditional evaluation with single-pass inference
- Limitation: Binary responses can show emergent behaviors
Beta-IRT: Uses Beta loss on empirical probability responses
- Good for: Modeling p(Correct Choice) from model logits or repeated sampling
- Advantage: Smoother scaling curves, less emergence, better predictability

Computerized Adaptive Testing (CAT)

CAT efficiently estimates model ability with fewer questions:

Select most informative question (highest Fisher information)
Query model and observe response
Update ability estimate
Repeat until budget exhausted

Model Families

Pre-training Experiments

Family	Sizes	Checkpoints	Source
Pythia	14M - 12B	143	EleutherAI
Amber	6.7B	358	LLM360
SmolLM2	135M - 1.7B	125-240	HuggingFace

Test-time Experiments

15 models across various families (see paper for complete list)

Citation

If you use this code or find our work helpful, please cite:

@article{truong2026irsl,
    title={Item Response Scaling Laws: A Measurement Theory Approach to Generalizable Neural Performance Prediction},
    author={[Authors TBD]},
    year={2026},
    journal={ICML},
}

Project Status

This is research code for an ICML 2026 submission. The repository contains:

✅ Complete experiment scripts for reproducing all paper figures
✅ Core IRT implementation in utils.py
✅ Integration with DataDecide checkpoint suite
⚠️ Data files not included (download from Hugging Face)
⚠️ Results directory structure not tracked in git

Related Work

DataDecide: allenai/DataDecide
HELM: HELM Benchmark
IRT for LLMs: Reliable, Efficient CAT
Scaling Laws: Chinchilla, Pythia

Troubleshooting

Common Issues

1. PyTorch 2.6+ `weights_only` Error

Error:

_pickle.UnpicklingError: Weights only load failed

Solution: PyTorch 2.6+ changed the default for torch.load(). If you see this error, either:

Use PyTorch 2.5 or earlier: pip install torch==2.5.0
Or update scripts to add weights_only=False to all torch.load() calls

2. LaTeX Not Found

Error:

RuntimeError: Failed to process string with tex because latex could not be found

Solution: Install LaTeX system dependencies (see Installation section above)

3. Python Version Issues

Error: Package installation fails or import errors

Solution: Ensure you're using Python 3.10:

python --version  # Should show 3.10.x

If using wrong version, recreate environment:

conda create --name deval python=3.10 -y
conda activate deval

4. CUDA Device Errors

Error:

RuntimeError: CUDA error: invalid device ordinal

Solution: Some scripts have hard-coded CUDA devices (e.g., cuda:7). If you have fewer GPUs, edit the script to use cuda:0 or cpu.

5. HuggingFace Download Issues

Error: Timeouts or connection errors when downloading data

Solution:

# Pre-download datasets
huggingface-cli download stair-lab/irsl_downstream_resmat --repo-type dataset
huggingface-cli download stair-lab/irsl_testtime_resmat2 --repo-type dataset

Contact

For questions about the code or paper, please open an issue in this repository.

License

[License TBD - Add before public release]

Name		Name	Last commit message	Last commit date
Latest commit History 171 Commits
ai2		ai2
clean_pretrain_downstream		clean_pretrain_downstream
clean_synthetic		clean_synthetic
clean_testtime		clean_testtime
downstream		downstream
gather_ckpt_data		gather_ckpt_data
gather_helm_data		gather_helm_data
monkey		monkey
rebuttal_figures		rebuttal_figures
scripts		scripts
.gitignore		.gitignore
README.md		README.md
README_resmat1_responses.md		README_resmat1_responses.md
README_resmat2_responses.md		README_resmat2_responses.md
additional_data.py		additional_data.py
check_z_convention.py		check_z_convention.py
data.py		data.py
one_question.ipynb		one_question.ipynb
requirements.txt		requirements.txt
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

Item Response Scaling Laws

Overview

Key Contributions

Repository Structure

Note on Directory Names

Installation

Prerequisites

System Dependencies

Ubuntu/Debian:

macOS:

Windows:

Python Environment Setup

Verify Installation

Dependencies

Quick Start

1. Understanding the Core IRT Implementation

2. Pre-training Downstream Scaling

3. Test-time Scaling

Reproducing Paper Figures

Pre-training Downstream Scaling

Test-time Scaling

Data

Pre-training Experiments

Test-time Experiments

Dataset 1: $P^{\text{testtime}}_1$ (ResMAT1)

Dataset 2: $P^{\text{testtime}}_2$ (ResMAT2)

DataDecide Integration

Key Concepts

Item Response Theory (IRT)

Binary-IRT vs Beta-IRT

Computerized Adaptive Testing (CAT)

Model Families

Pre-training Experiments

Test-time Experiments

Citation

Project Status

Related Work

Troubleshooting

Common Issues

1. PyTorch 2.6+ weights_only Error

2. LaTeX Not Found

3. Python Version Issues

4. CUDA Device Errors

5. HuggingFace Download Issues

Contact

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. PyTorch 2.6+ `weights_only` Error

Packages