Robby955/logprobe-calibration


Calibration Under API Logprob Truncation

Evidence from Multiple-Choice QA Across 16 Model-Provider Conditions

Robby Sneiderman, April 2026


LLM inference APIs return log-probabilities for only the top-k tokens at each position (typically k = 5 or 20), discarding the values for the remaining ~99.99% of vocabulary entries. This project asks whether that truncation degrades confidence calibration for multiple-choice question answering (MCQA).

Short answer: no. Top-5 truncation produces calibration that is statistically equivalent to full-vocabulary logprobs on every model where full-vocabulary access was available (the self-hosted vLLM models).

The longer answer is more interesting: calibration varies dramatically across models and providers (ECE from 0.066 to 0.564), worsens on harder questions, and depends heavily on how you aggregate multi-token confidences.

Read the paper (PDF)


Key Results

| Finding | Detail |
| --- | --- |
| Truncation doesn't matter for MCQA | TOST equivalence confirmed at margin 0.02 ECE (p < 0.001) across 4 vLLM models with full-vocab access |
| Calibration heterogeneity | ECE spans 0.066 (DeepSeek-V3.1) to 0.564 (Gemma-3n) on CalProbe-167 |
| Difficulty dependence | Mean ECE rises from 0.218 (CalProbe-167) to 0.385 (GPQA Diamond) |
| Selective prediction works | 95% accuracy at 97% coverage using logprob confidence for abstention |
| Aggregator choice matters | For multi-token answers, aggregation method swings ECE by 0.15-0.29 |
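To make the aggregator finding concrete, here is a minimal sketch of three common ways to collapse the per-token log-probabilities of a multi-token answer into one confidence score. The function name and the three methods are illustrative assumptions; the aggregators actually compared in Experiment H may differ.

```python
import math

def aggregate_confidence(token_logprobs, method="product"):
    """Collapse per-token log-probabilities of one answer into a single confidence.

    NOTE: hypothetical helper for illustration, not taken from scripts/.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if method == "product":
        # Joint probability of the whole answer; shrinks as answers get longer.
        return math.exp(sum(token_logprobs))
    if method == "geometric_mean":
        # Length-normalized joint probability.
        return math.exp(sum(token_logprobs) / len(token_logprobs))
    if method == "min":
        # Probability of the weakest single token.
        return math.exp(min(token_logprobs))
    raise ValueError(f"unknown method: {method}")

# A three-token answer with token probabilities 0.9, 0.8 and 0.5.
answer = [math.log(0.9), math.log(0.8), math.log(0.5)]
for name in ("product", "geometric_mean", "min"):
    print(name, round(aggregate_confidence(answer, name), 3))
```

For any answer the three scores are ordered product <= min <= geometric mean, so the same predictions can look underconfident or overconfident depending on the aggregator, which is one way a large ECE swing can arise.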

Figures

Truncation has no effect on MCQA calibration

ECE is identical for k >= 5 across all models. The API's top-20 limit introduces zero calibration distortion.

[Figure: ECE vs truncation level k]
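One plausible reading of this result: with four answer letters, any k >= 5 usually keeps every letter in the returned list, so renormalizing over the letters gives the same confidence as the full vocabulary. The sketch below illustrates that under assumptions: the `{token: logprob}` input format, the function name, and the whitespace-variant handling are hypothetical and not taken from the repository scripts.

```python
import math

def choice_confidence(logprobs, choices=("A", "B", "C", "D"), k=None):
    """Renormalize token log-probabilities over the answer letters.

    logprobs: dict mapping token string -> log-probability (assumed format).
    k: keep only the k most likely tokens first (None keeps everything).
    Returns (predicted_choice, confidence), or (None, 0.0) if no letter survives.
    """
    items = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        items = items[:k]
    # Sum probability mass per letter, merging variants such as " A" into "A".
    mass = {c: 0.0 for c in choices}
    for token, lp in items:
        letter = token.strip()
        if letter in mass:
            mass[letter] += math.exp(lp)
    total = sum(mass.values())
    if total == 0.0:
        return None, 0.0
    best = max(mass, key=mass.get)
    return best, mass[best] / total

# Toy next-token distribution (made-up numbers, not measured data).
example = {
    "A": math.log(0.70), "B": math.log(0.15), " A": math.log(0.05),
    "C": math.log(0.04), "D": math.log(0.03), "the": math.log(0.02),
    "E": math.log(0.01),
}
print(choice_confidence(example))        # full distribution
print(choice_confidence(example, k=5))   # all four letters survive: same confidence
print(choice_confidence(example, k=3))   # C and D are dropped: confidence inflates
```

In this toy case the confidence only changes once k is small enough to cut off one of the answer letters.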

Calibration varies widely across models and providers

Reliability diagrams for 10 model-provider conditions on CalProbe-167. OpenAI models cluster near the diagonal (well-calibrated); Together AI models show bimodal overconfidence.

[Figure: Cross-provider reliability diagrams]
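For readers unfamiliar with reliability diagrams, the following is a minimal sketch of how the bins and the resulting expected calibration error (ECE) are commonly computed. It assumes 10 equal-width bins; the binning used in `scripts/common.py` may differ, and the function names here are illustrative.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns a list of (mean_confidence, accuracy, count) for non-empty bins.
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += conf
        sums[idx][1] += 1.0 if ok else 0.0
        sums[idx][2] += 1
    return [(s / n, a / n, n) for s, a, n in sums if n > 0]

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: count-weighted mean of |accuracy - confidence| over the bins."""
    total = len(confidences)
    return sum(n / total * abs(acc - conf)
               for conf, acc, n in reliability_bins(confidences, correct, n_bins))

# Toy data: four confident answers (half wrong) and two coin-flip answers.
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
right = [True, True, False, False, True, False]
print(reliability_bins(confs, right))
print(expected_calibration_error(confs, right))
```

A perfectly calibrated model has every bin on the diagonal (accuracy equal to mean confidence) and an ECE of 0; points below the diagonal indicate overconfidence.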

Harder questions break calibration

Same models on GPQA Diamond (graduate-level science). Nearly all models shift toward overconfidence.

[Figure: GPQA reliability diagrams]

Logprob confidence enables selective prediction

Abstaining on low-confidence predictions achieves near-perfect accuracy with minimal coverage loss.

[Figure: Accuracy vs coverage]
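An accuracy-coverage curve of this kind can be sketched as follows: sort predictions by confidence, answer only the most confident ones, and track accuracy on what remains. The function names are hypothetical and the data is a toy example, not results from the paper.

```python
def accuracy_coverage_curve(confidences, correct):
    """Accuracy and coverage when answering only the n most confident questions.

    Returns a list of (threshold, coverage, accuracy), one entry per prefix of
    the predictions sorted by descending confidence. Tied confidences are kept
    in input order, so thresholds are only exact when confidences are distinct.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    curve, hits = [], 0
    for n, (conf, ok) in enumerate(ranked, start=1):
        hits += 1 if ok else 0
        curve.append((conf, n / len(ranked), hits / n))
    return curve

def max_coverage_at_accuracy(confidences, correct, target):
    """Largest coverage whose retained predictions reach the target accuracy."""
    eligible = [cov for _, cov, acc in accuracy_coverage_curve(confidences, correct)
                if acc >= target]
    return max(eligible, default=0.0)

# Toy data: the single error sits among the low-confidence answers.
confs = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40]
right = [True, True, True, True, False, True]
print(max_coverage_at_accuracy(confs, right, target=0.9))
```

Abstention helps only when errors concentrate at low confidence, which is why the quality of the confidence ranking matters here and not just the ECE.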


Experimental Setup

  • 16 model-provider conditions: 11 models across OpenAI, Together AI, Groq, and self-hosted vLLM
  • 3 benchmarks: CalProbe-167 (custom MCQA), MMLU-Pro (500 questions, reduced to 4-choice), GPQA Diamond (198 graduate-level science)
  • Self-hosted: 8x H100 80GB SXM via RunPod, vLLM serving in FP8 or FP16 precision, full-vocabulary logprobs (~99K tokens)
  • Statistical methods: TOST equivalence testing (margin 0.02 ECE), BCa bootstrap CIs (10,000 resamples), permutation tests
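The equivalence logic can be sketched as a confidence-interval check: two conditions are declared equivalent when the interval for the ECE difference lies entirely inside the margin. This sketch swaps in a plain percentile bootstrap for the BCa intervals used in the paper, uses fewer resamples, and assumes a paired design (the same questions under both conditions). All names are illustrative and not taken from `scripts/rigorous_stats.py`.

```python
import random

def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width bins (assumed binning)."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of conf, n correct, count]
    for conf, ok in zip(confidences, correct):
        b = bins[min(int(conf * n_bins), n_bins - 1)]
        b[0] += conf
        b[1] += 1 if ok else 0
        b[2] += 1
    # n/total * |acc - mean_conf| simplifies to |n_correct - sum_conf| / total.
    return sum(abs(b[1] - b[0]) for b in bins if b[2]) / len(confidences)

def equivalence_by_bootstrap(rows, margin=0.02, n_boot=2000, alpha=0.05, seed=0):
    """rows: list of (conf_a, correct_a, conf_b, correct_b), one per question.

    Returns (low, high, equivalent): [low, high] is the (1 - 2*alpha) percentile
    interval of ECE_a - ECE_b, and `equivalent` is True when that interval lies
    strictly inside (-margin, +margin), the CI form of a TOST decision.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample whole questions so the pairing between conditions is kept.
        sample = [rows[rng.randrange(len(rows))] for _ in rows]
        ca, oa, cb, ob = zip(*sample)
        diffs.append(ece(ca, oa) - ece(cb, ob))
    diffs.sort()
    cut = int(alpha * n_boot)
    low, high = diffs[cut], diffs[n_boot - 1 - cut]
    return low, high, (-margin < low and high < margin)

# Toy check: identical confidences under both conditions are trivially equivalent.
same = [(0.9, i % 10 != 0, 0.9, i % 10 != 0) for i in range(100)]
print(equivalence_by_bootstrap(same, n_boot=500))
```

Using a 90% interval for a 5% test is the standard correspondence between the two one-sided tests and a single interval; the margin of 0.02 ECE matches the one stated above.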

Models Tested

| Provider | Models |
| --- | --- |
| OpenAI | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini |
| Together AI | Qwen3-235B, DeepSeek-V3.1, Gemma-3n-E4B, LFM2-24B |
| Groq | Qwen3-32B (CoT) |
| vLLM (self-hosted) | Qwen3-235B (FP8), Llama-3.3-70B (FP16) |

Repository Structure

paper/                 # LaTeX source and compiled PDF
scripts/               # All experiment and analysis code
  common.py            # Shared utilities (ECE, Brier, plotting)
  rigorous_stats.py    # Bootstrap CIs, TOST, power analysis
  exp_a_*.py           # Experiment A: Truncation sweep
  exp_b_*.py           # Experiment B: Token variant analysis
  exp_c_*.py           # Experiment C: GroupKFold cross-validation
  exp_d_*.py           # Experiment D: Selective prediction
  exp_e_*.py           # Experiment E: Entropy bounds
  exp_f_*.py           # Experiment F: MMLU-Pro 10-choice
  exp_g_*.py           # Experiment G: Temperature sweep
  exp_h_*.py           # Experiment H: Aggregator comparison
  calibration_v3_*.py  # CalProbe-167 data collection and analysis
  gpqa_*.py            # GPQA Diamond experiments
  boundary_*.py        # Boundary experiments (letter vs fulltext)
figures/               # All generated plots and analysis reports
runpod/                # 8x H100 vLLM setup and experiment runners

Reproducing Results

pip install -r requirements.txt

# Set API keys for data collection
export OPENAI_API_KEY="sk-..."

# Run analysis on existing data (figures/ directory)
python scripts/calibration_v3_analyze.py
python scripts/gpqa_analyze.py
python scripts/exp_a_truncation_sweep.py
python scripts/exp_equivalence_tests.py

# Collect new data (requires API keys and/or vLLM)
python scripts/calibration_v3_collect.py
python scripts/gpqa_collect_openai.py

License

Research code. Contact author for collaboration or reuse.
