Evidence from Multiple-Choice QA Across 16 Model-Provider Conditions
Robby Sneiderman, April 2026
LLM inference APIs return only the top-k token log-probabilities (typically k = 5 or 20) and discard the rest of the vocabulary, which is well over 99.9% of the tokens. This project asks whether that truncation degrades confidence calibration for multiple-choice question answering (MCQA).
Short answer: no. Top-5 truncation produces calibration statistically equivalent to full-vocabulary logprobs for every model tested with full-vocabulary access.
The longer answer is more interesting: calibration varies dramatically across models and providers (ECE from 0.066 to 0.564), worsens on harder questions, and depends heavily on how you aggregate multi-token confidences.
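To make the truncation question concrete, here is a minimal sketch of how an MCQA confidence can be derived from top-k logprobs. It assumes single-letter answer tokens (`A` to `D`) and renormalizes probability mass over those options. `truncate_top_k` and `option_probs` are illustrative names, not functions from this repo, and the toy distribution is made up.

```python
import math

def truncate_top_k(logprobs, k):
    """Keep the k highest-logprob tokens, mimicking an API's top-k limit."""
    top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

def option_probs(logprobs, options=("A", "B", "C", "D")):
    """Renormalize probability mass over the answer-option tokens only.

    Options missing from the (possibly truncated) logprobs get probability 0.
    """
    mass = {o: math.exp(logprobs[o]) if o in logprobs else 0.0 for o in options}
    total = sum(mass.values())
    if total == 0.0:
        # No option token survived truncation: fall back to uniform.
        return {o: 1.0 / len(options) for o in options}
    return {o: m / total for o, m in mass.items()}

# Toy next-token distribution (hypothetical values, not real model output).
toy = {
    "A": math.log(0.70),
    "B": math.log(0.15),
    "C": math.log(0.08),
    "The": math.log(0.04),
    "D": math.log(0.02),
    " A": math.log(0.01),
}

full = option_probs(toy)                     # all tokens available
top5 = option_probs(truncate_top_k(toy, 5))  # all four options survive
top3 = option_probs(truncate_top_k(toy, 3))  # option D is cut off

print(full["A"], top5["A"], top3["A"])
```

In this toy case every option token sits inside the top 5, so the renormalized confidence is the same as with the full distribution. It only shifts once truncation is aggressive enough to drop an option token.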
| Finding | Detail |
|---|---|
| Truncation doesn't matter for MCQA | TOST equivalence confirmed at margin 0.02 ECE (p < 0.001) across 4 vLLM models with full-vocab access |
| Calibration heterogeneity | ECE spans 0.066 (DeepSeek-V3.1) to 0.564 (Gemma-3n) on CalProbe-167 |
| Difficulty dependence | Mean ECE rises from 0.218 (CalProbe-167) to 0.385 (GPQA Diamond) |
| Selective prediction works | 95% accuracy at 97% coverage using logprob confidence for abstention |
| Aggregator choice matters | For multi-token answers, aggregation method swings ECE by 0.15-0.29 |
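As an illustration of why aggregation matters for multi-token answers, the sketch below applies four common ways of collapsing per-token logprobs into one confidence. The aggregator names and the function `aggregate` are assumptions made for this example. The aggregators actually compared in Experiment H may differ.

```python
import math

def aggregate(token_logprobs, method):
    """Collapse per-token logprobs of one answer into a single confidence in [0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    if method == "product":
        # Joint probability of the whole token sequence.
        return math.exp(sum(token_logprobs))
    if method == "geometric_mean":
        # Length-normalized joint probability.
        return math.exp(sum(token_logprobs) / len(token_logprobs))
    if method == "min":
        # Probability of the least certain token.
        return math.exp(min(token_logprobs))
    if method == "first":
        # Probability of the first token only.
        return math.exp(token_logprobs[0])
    raise ValueError(f"unknown method: {method}")

# Hypothetical three-token answer with per-token probabilities 0.9, 0.8, 0.5.
lps = [math.log(0.9), math.log(0.8), math.log(0.5)]
for name in ("product", "geometric_mean", "min", "first"):
    print(f"{name:>15}: {aggregate(lps, name):.3f}")
```

The same answer gets a confidence anywhere from 0.36 (product) to 0.90 (first token), which is the kind of spread that can move ECE substantially.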
ECE is identical for k >= 5 across all models. The API's top-20 limit introduces zero calibration distortion.
Reliability diagrams for 10 model-provider conditions on CalProbe-167. OpenAI models cluster near the diagonal (well-calibrated); Together AI models show bimodal overconfidence.
Same models on GPQA Diamond (graduate-level science). Nearly all models shift toward overconfidence.
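For readers unfamiliar with the metric, the following is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins, the same binning a reliability diagram uses. The bin count of 10 and the function name are assumptions. The implementation in `scripts/common.py` may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per sample; right=True makes bins (lo, hi], so 1.0 falls in the last bin.
    idx = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Overconfident toy example: 90% stated confidence, 75% actual accuracy.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A model that sits on the diagonal of a reliability diagram has a per-bin gap near zero and therefore a low ECE. Overconfident models sit below the diagonal and accumulate a large gap in the high-confidence bins.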
Abstaining on low-confidence predictions reaches 95% accuracy at 97% coverage.
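Selective prediction can be sketched as a simple confidence threshold: answer when the confidence clears it, abstain otherwise, then report coverage and accuracy on the answered subset. The function name, the threshold values, and the toy data are illustrative assumptions, not the code or data behind Experiment D.

```python
import numpy as np

def selective_metrics(confidences, correct, threshold):
    """Return (coverage, accuracy) when abstaining below a confidence threshold."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=bool)
    answered = conf >= threshold
    coverage = float(answered.mean())
    # Accuracy is undefined if the model abstains on everything.
    accuracy = float(corr[answered].mean()) if answered.any() else float("nan")
    return coverage, accuracy

# Toy data: the one wrong answer among confident predictions has confidence 0.6.
conf = [0.99, 0.95, 0.90, 0.60, 0.40]
corr = [1, 1, 1, 0, 1]

for t in (0.0, 0.5, 0.8):
    cov, acc = selective_metrics(conf, corr, t)
    print(f"threshold={t:.1f} coverage={cov:.2f} accuracy={acc:.2f}")
```

Sweeping the threshold traces out an accuracy-coverage curve. Abstention helps only to the extent that low confidence actually predicts errors.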
- 16 model-provider conditions: 11 models across OpenAI, Together AI, Groq, and self-hosted vLLM
- 3 benchmarks: CalProbe-167 (custom MCQA), MMLU-Pro (500 questions, reduced to 4-choice), GPQA Diamond (198 graduate-level science)
- Self-hosted: 8x H100 80GB SXM via RunPod, vLLM serving in FP8 or FP16 precision, full-vocabulary logprobs (~99K tokens)
- Statistical methods: TOST equivalence testing (margin 0.02 ECE), BCa bootstrap CIs (10,000 resamples), permutation tests
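The equivalence logic can be sketched with a paired bootstrap over questions. TOST at level alpha is equivalent to checking that the (1 - 2·alpha) confidence interval for the ECE difference lies entirely inside ±margin. This sketch uses a plain percentile interval rather than BCa to stay short, and all names and the synthetic data are assumptions, so it is not the procedure in `scripts/rigorous_stats.py`.

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Equal-width-bin expected calibration error."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1], right=True)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

def ece_equivalence(conf_a, conf_b, correct_a, correct_b,
                    margin=0.02, alpha=0.05, n_boot=2000, seed=0):
    """Paired bootstrap of ECE(a) - ECE(b); equivalent if the CI is inside ±margin."""
    conf_a = np.asarray(conf_a, dtype=float)
    conf_b = np.asarray(conf_b, dtype=float)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(conf_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample questions, keeping pairs aligned
        diffs[i] = ece(conf_a[idx], correct_a[idx]) - ece(conf_b[idx], correct_b[idx])
    lo, hi = np.quantile(diffs, [alpha, 1.0 - alpha])  # (1 - 2*alpha) percentile CI
    return float(lo), float(hi), bool(-margin < lo and hi < margin)

# Synthetic, roughly calibrated confidences (not data from this project).
rng = np.random.default_rng(1)
conf_full = rng.uniform(0.5, 1.0, size=300)
correct = (rng.uniform(size=300) < conf_full).astype(float)
conf_top5 = conf_full.copy()  # hypothetical: truncation leaves confidences unchanged

lo, hi, equivalent = ece_equivalence(conf_full, conf_top5, correct, correct)
print(f"90% CI for ECE difference: [{lo:.4f}, {hi:.4f}], equivalent={equivalent}")
```

Unlike an ordinary difference test, this setup puts the burden of proof on equivalence: a wide or off-center interval fails the check even if it contains zero.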
| Provider | Models |
|---|---|
| OpenAI | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini |
| Together AI | Qwen3-235B, DeepSeek-V3.1, Gemma-3n-E4B, LFM2-24B |
| Groq | Qwen3-32B (CoT) |
| vLLM (self-hosted) | Qwen3-235B (FP8), Llama-3.3-70B (FP16) |
paper/                     # LaTeX source and compiled PDF
scripts/                   # All experiment and analysis code
    common.py              # Shared utilities (ECE, Brier, plotting)
    rigorous_stats.py      # Bootstrap CIs, TOST, power analysis
    exp_a_*.py             # Experiment A: Truncation sweep
    exp_b_*.py             # Experiment B: Token variant analysis
    exp_c_*.py             # Experiment C: GroupKFold cross-validation
    exp_d_*.py             # Experiment D: Selective prediction
    exp_e_*.py             # Experiment E: Entropy bounds
    exp_f_*.py             # Experiment F: MMLU-Pro 10-choice
    exp_g_*.py             # Experiment G: Temperature sweep
    exp_h_*.py             # Experiment H: Aggregator comparison
    calibration_v3_*.py    # CalProbe-167 data collection and analysis
    gpqa_*.py              # GPQA Diamond experiments
    boundary_*.py          # Boundary experiments (letter vs fulltext)
figures/                   # All generated plots and analysis reports
runpod/                    # 8x H100 vLLM setup and experiment runners
pip install -r requirements.txt
# Set API keys for data collection
export OPENAI_API_KEY="sk-..."
# Run analysis on existing data (figures/ directory)
python scripts/calibration_v3_analyze.py
python scripts/gpqa_analyze.py
python scripts/exp_a_truncation_sweep.py
python scripts/exp_equivalence_tests.py
# Collect new data (requires API keys and/or vLLM)
python scripts/calibration_v3_collect.py
python scripts/gpqa_collect_openai.py

Research code. Contact author for collaboration or reuse.



