Calibration Under API Logprob Truncation

Evidence from Multiple-Choice QA Across 16 Model-Provider Conditions

Robby Sneiderman, April 2026

LLM inference APIs typically return only the top-k token log-probabilities (k = 5 or 20 is common), discarding the entries for the remaining ~99.99% of the vocabulary. This project asks whether that truncation degrades confidence calibration for multiple-choice question answering (MCQA).

Short answer: no. Top-5 truncation yields calibration that is statistically equivalent to full-vocabulary logprobs on every model where full-vocabulary access was available.

The longer answer is more interesting: calibration varies dramatically across models and providers (ECE from 0.066 to 0.564), worsens on harder questions, and depends heavily on how you aggregate multi-token confidences.

Read the paper (PDF) in paper/.

Key Results

| Finding | Detail |
| --- | --- |
| Truncation doesn't matter for MCQA | TOST equivalence confirmed at margin 0.02 ECE (p < 0.001) across 4 vLLM models with full-vocab access |
| Calibration heterogeneity | ECE spans 0.066 (DeepSeek-V3.1) to 0.564 (Gemma-3n) on CalProbe-167 |
| Difficulty dependence | Mean ECE rises from 0.218 (CalProbe-167) to 0.385 (GPQA Diamond) |
| Selective prediction works | 95% accuracy at 97% coverage using logprob confidence for abstention |
| Aggregator choice matters | For multi-token answers, aggregation method swings ECE by 0.15 to 0.29 |
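
To make the aggregator row concrete: when an answer spans several tokens, the per-token logprobs have to be collapsed into a single confidence, and different rules can give very different numbers for the same answer. The sketch below compares three common choices (joint product, geometric mean, minimum). This README does not say which aggregators Experiment H compares, so the three methods and the helper name `aggregate_confidence` are illustrative assumptions, not the repository's code.

```python
import math

def aggregate_confidence(token_logprobs, method="product"):
    """Collapse the per-token logprobs of one answer into a confidence in [0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    if method == "product":    # joint probability of the whole answer
        return math.exp(sum(token_logprobs))
    if method == "geo_mean":   # length-normalized joint probability
        return math.exp(sum(token_logprobs) / len(token_logprobs))
    if method == "min":        # probability of the weakest single token
        return math.exp(min(token_logprobs))
    raise ValueError(f"unknown method: {method}")

# Hypothetical three-token answer with token probabilities 0.9, 0.8, 0.5.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
for method in ("product", "geo_mean", "min"):
    print(method, round(aggregate_confidence(logprobs, method), 3))
```

On this toy answer the product gives 0.36 while the geometric mean gives about 0.71, which is the kind of spread that can move ECE substantially.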

Figures

Truncation has no effect on MCQA calibration

ECE is identical for k >= 5 across all models. The API's top-20 limit introduces zero calibration distortion.
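
One way to see why truncation can be harmless here: if MCQA confidence is computed by renormalizing over the answer letters only, and those letters sit inside the top 5, then dropping the rest of the vocabulary changes nothing. The following is a minimal sketch under that assumption. The function name `option_confidence`, the floor value for missing options, and the toy distribution are all hypothetical and not taken from the repository.

```python
import math

def option_confidence(logprobs, options=("A", "B", "C", "D"), k=None, floor=-20.0):
    """Return (predicted option, confidence) from a token -> logprob mapping.

    If k is given, only the k most likely tokens are kept, mimicking an API's
    top-k limit. Options missing after truncation get `floor` as their logprob.
    """
    items = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        items = items[:k]
    kept = dict(items)
    scores = {o: kept.get(o, floor) for o in options}
    # Renormalize over the answer options only (log-sum-exp for stability).
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    probs = {o: math.exp(s - m) / z for o, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Toy next-token distribution: the four letters plus two filler tokens.
full = {"B": math.log(0.70), "A": math.log(0.15), "C": math.log(0.08),
        "D": math.log(0.04), "The": math.log(0.02), " B": math.log(0.01)}

print(option_confidence(full))        # full distribution
print(option_confidence(full, k=5))   # top-5: all four letters survive
print(option_confidence(full, k=2))   # top-2: C and D fall back to the floor
```

In this toy case the full and top-5 results coincide, while k = 2 inflates the confidence because two options lose their probability mass.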

Calibration varies widely across models and providers

Reliability diagrams for 10 model-provider conditions on CalProbe-167. OpenAI models cluster near the diagonal (well-calibrated); Together AI models show bimodal overconfidence.
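
For reference, ECE summarizes a reliability diagram as the sample-weighted gap between mean confidence and accuracy within each confidence bin. A minimal sketch with equal-width bins follows. The bin count of 10 and the function name are assumptions; the implementation in scripts/common.py may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

# Toy data: an overconfident model that reports 0.95 but is right 3 times in 5.
confs = [0.95] * 5
hits = [1, 1, 1, 0, 0]
print(round(expected_calibration_error(confs, hits), 3))  # 0.35
```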

Harder questions break calibration

The same models on GPQA Diamond (graduate-level science). Nearly all shift toward overconfidence, and mean ECE rises from 0.218 on CalProbe-167 to 0.385.

Logprob confidence enables selective prediction

Abstaining on low-confidence predictions reaches 95% accuracy while still answering 97% of questions.
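
The abstention rule itself is simple: answer only when confidence clears a threshold, then report coverage (share of questions answered) and accuracy on the answered subset. A minimal sketch, with a hypothetical helper name and toy data that are not from the paper:

```python
def selective_accuracy(confidences, correct, threshold):
    """Answer only when confidence >= threshold; return (coverage, accuracy on answered)."""
    answered = [hit for conf, hit in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = sum(answered) / len(answered) if answered else float("nan")
    return coverage, accuracy

# Toy data: low-confidence answers are mostly wrong.
confs = [0.99, 0.97, 0.95, 0.90, 0.85, 0.60, 0.55, 0.40]
hits = [1, 1, 1, 1, 0, 1, 0, 0]

for t in (0.0, 0.5, 0.9):
    cov, acc = selective_accuracy(confs, hits, t)
    print(f"threshold={t:.1f} coverage={cov:.2f} accuracy={acc:.2f}")
```

Sweeping the threshold traces out the accuracy-coverage curve; a useful confidence signal is one where accuracy climbs quickly as coverage drops only slightly.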

Experimental Setup

16 model-provider conditions: 11 models across OpenAI, Together AI, Groq, and self-hosted vLLM
3 benchmarks: CalProbe-167 (custom MCQA), MMLU-Pro (500 questions, reduced to 4-choice), GPQA Diamond (198 graduate-level science)
Self-hosted: 8x H100 80GB SXM via RunPod, vLLM serving in FP8 or FP16 precision, full-vocabulary logprobs (~99K tokens)
Statistical methods: TOST equivalence testing (margin 0.02 ECE), BCa bootstrap CIs (10,000 resamples), permutation tests
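
As a rough illustration of the equivalence logic: TOST at level alpha is equivalent to checking that the (1 - 2*alpha) confidence interval for the difference lies entirely inside the margin. The sketch below applies that check to the ECE difference between two conditions using a plain percentile bootstrap over questions. This is a simplification: the project reports BCa intervals, the actual procedure lives in scripts/rigorous_stats.py, and the function names, the shared correctness labels, and the toy data here are assumptions.

```python
import random

def ece(confs, hits, n_bins=10):
    """Equal-width-bin expected calibration error."""
    sums = [[0, 0.0, 0] for _ in range(n_bins)]  # count, confidence sum, hit sum
    for c, h in zip(confs, hits):
        b = sums[min(int(c * n_bins), n_bins - 1)]
        b[0] += 1
        b[1] += c
        b[2] += h
    n = len(confs)
    # (count / n) * |hits/count - conf/count| simplifies to |hits - conf| / n.
    return sum(abs(s[2] - s[1]) / n for s in sums if s[0])

def ece_equivalent(confs_a, confs_b, hits, margin=0.02, alpha=0.05,
                   n_boot=1000, seed=0):
    """CI-inclusion form of TOST on ECE(a) - ECE(b).

    Assumes both conditions share the same correctness labels `hits`.
    Returns (is_equivalent, (ci_low, ci_high)).
    """
    rng = random.Random(seed)
    n = len(hits)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resample of questions
        h = [hits[i] for i in idx]
        diffs.append(ece([confs_a[i] for i in idx], h)
                     - ece([confs_b[i] for i in idx], h))
    diffs.sort()
    k = int(alpha * n_boot)
    lo, hi = diffs[k], diffs[n_boot - 1 - k]
    return (-margin < lo and hi < margin), (lo, hi)

# Toy data: 300 questions with roughly calibrated confidences.
rng = random.Random(1)
full = [rng.uniform(0.5, 1.0) for _ in range(300)]
hits = [1 if rng.random() < c else 0 for c in full]
top5 = list(full)            # truncation leaves confidences unchanged
saturated = [1.0] * 300      # a deliberately distorted condition

print(ece_equivalent(full, top5, hits))
print(ece_equivalent(full, saturated, hits))
```

The first comparison passes trivially because the two conditions are identical; the second fails because always-maximal confidence shifts ECE far beyond the 0.02 margin.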

Models Tested

| Provider | Models |
| --- | --- |
| OpenAI | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini |
| Together AI | Qwen3-235B, DeepSeek-V3.1, Gemma-3n-E4B, LFM2-24B |
| Groq | Qwen3-32B (CoT) |
| vLLM (self-hosted) | Qwen3-235B (FP8), Llama-3.3-70B (FP16) |

Repository Structure

paper/                 # LaTeX source and compiled PDF
scripts/               # All experiment and analysis code
  common.py            # Shared utilities (ECE, Brier, plotting)
  rigorous_stats.py    # Bootstrap CIs, TOST, power analysis
  exp_a_*.py           # Experiment A: Truncation sweep
  exp_b_*.py           # Experiment B: Token variant analysis
  exp_c_*.py           # Experiment C: GroupKFold cross-validation
  exp_d_*.py           # Experiment D: Selective prediction
  exp_e_*.py           # Experiment E: Entropy bounds
  exp_f_*.py           # Experiment F: MMLU-Pro 10-choice
  exp_g_*.py           # Experiment G: Temperature sweep
  exp_h_*.py           # Experiment H: Aggregator comparison
  calibration_v3_*.py  # CalProbe-167 data collection and analysis
  gpqa_*.py            # GPQA Diamond experiments
  boundary_*.py        # Boundary experiments (letter vs fulltext)
figures/               # All generated plots and analysis reports
runpod/                # 8x H100 vLLM setup and experiment runners

Reproducing Results

pip install -r requirements.txt

# Set API keys for data collection (OpenAI shown; Together AI and Groq runs need their own keys)
export OPENAI_API_KEY="sk-..."

# Run analysis on existing data (figures/ directory)
python scripts/calibration_v3_analyze.py
python scripts/gpqa_analyze.py
python scripts/exp_a_truncation_sweep.py
python scripts/exp_equivalence_tests.py

# Collect new data (requires API keys and/or vLLM)
python scripts/calibration_v3_collect.py
python scripts/gpqa_collect_openai.py

License

Research code. Contact author for collaboration or reuse.
