feat(diagnostics): add item-level diagnostics module (Fantastic Bugs [arxiv.org/abs/2511.16842]) by sparkyluvscode · Pull Request #39 · aims-foundations/torch_measure

sparkyluvscode · 2026-05-28T15:55:05Z

Summary

Adds a new torch_measure.diagnostics subpackage that implements the item-flagging procedure from Fantastic Bugs and Where to Find Them in AI Benchmarks (Truong et al., NeurIPS 2025, arXiv:2511.16842).

The design was discussed with the maintainer over email prior to implementation and was approved as-is ("the plan looks reasonable, please go ahead whenever you have a chance"). No issue exists for this work.

The module is a thin orchestration layer over the existing PyTorch metrics in torch_measure.metrics. The three statistical signals from the paper (tetrachoric correlation, Mokken scalability, item-total correlation) are already implemented there, so this PR reuses them rather than re-implementing them in NumPy. This keeps the library single-stack on PyTorch and avoids duplicating the math.

What is included

New module — src/torch_measure/diagnostics/

_signals.py: re-exports the three signals from torch_measure.metrics and adds average_tetrachoric_correlation, which averages each item's row of the tetrachoric matrix over its off-diagonal entries. item_scalability is a thin wrapper around mokken_scalability that returns the per-item H_items tensor.
_ensemble.py: gaussian_rank() implements the rank transform A_i = Φ⁻¹((r_i - 0.5)/N) with average-rank tie-breaking and NaN-safe behaviour (NaN inputs map to 0, so they neither flag nor protect an item). flag_items() is the public entry point. It computes the chosen signals, negates them so that the anomaly direction matches the paper's "non-negative under Rasch" framing, applies the Gaussian rank transform, and combines the per-signal anomalies via one of four rules: mean, or, and, majority. It returns a pandas.DataFrame sorted by ensemble anomaly score, descending.
_judge.py: defines ItemJudge, a runtime_checkable Protocol with signature (item_text, item_idx, anomaly_score) -> str. Any callable with that signature satisfies it; no specific LLM provider is assumed.
__init__.py: exposes flag_items, gaussian_rank, ItemJudge, plus the three signals and average_tetrachoric_correlation.
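
To make the rank transform in _ensemble.py concrete, here is a minimal standard-library sketch of A_i = Φ⁻¹((r_i - 0.5)/N) with average-rank tie-breaking and NaN inputs mapped to 0. It is not the PR's PyTorch implementation: the name gaussian_rank_sketch is hypothetical, and treating N as the number of non-NaN entries is an assumption, since the description does not say how N is counted when NaNs are present.

```python
import math
from statistics import NormalDist


def gaussian_rank_sketch(values):
    """Map scores to A_i = Phi^{-1}((r_i - 0.5) / N) using average ranks.

    Hypothetical helper for illustration only.
    Assumption: N counts the non-NaN entries.
    """
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    n = len(valid)
    out = [0.0] * len(values)  # NaN positions keep the neutral value 0
    if n == 0:
        return out

    # Sort the valid indices by value, then give tied values their average rank.
    order = sorted(valid, key=lambda i: values[i])
    ranks = {}
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = ((start + 1) + (end + 1)) / 2.0  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1

    # (r - 0.5) / n lies strictly inside (0, 1), so inv_cdf is always defined.
    normal = NormalDist()
    for i in valid:
        out[i] = normal.inv_cdf((ranks[i] - 0.5) / n)
    return out
```

Under the sign convention described above, the raw signals would be negated before this transform so that larger outputs mean "more anomalous".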

torch_measure/__init__.py now imports torch_measure.diagnostics and lists it in __all__.
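
For the average_tetrachoric_correlation export, the row-averaging step can be sketched in plain Python as follows. This is only an illustration of "average each row over its off-diagonal entries": the function name is hypothetical, it takes a nested list rather than a tensor, and it does not attempt whatever NaN handling the PyTorch version may apply.

```python
def average_off_diagonal(matrix):
    """Average each row of a square matrix, skipping the diagonal entry.

    Hypothetical helper for illustration; the PR operates on tensors.
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two items to average off-diagonal entries")
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(matrix)
    ]
```

An item whose average is low or negative correlates poorly with the rest of the benchmark, which is the direction the flagging procedure treats as anomalous.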

DataFrame schema returned by `flag_items`

| Column | When present | Description |
| --- | --- | --- |
| `item_idx` | always | Position in the original response matrix. |
| `item_name` | if `item_names` supplied | User-supplied label. |
| `tetrachoric_score` | if `tetrachoric` in signals | Raw per-item average tetrachoric correlation. |
| `scalability_score` | if `scalability` in signals | Raw per-item Mokken `H_j`. |
| `item_total_score` | if `item_total` in signals | Raw item-total correlation. |
| `ensemble_score` | always | Mean of Gaussian-rank anomalies across selected signals. |
| `flagged` | always | Boolean decision under the chosen ensemble rule. |
| `judge_output` | if `judge` supplied | String returned by the judge; `None` for unflagged rows. |
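
The relationship between `ensemble_score` and `flagged` can be illustrated with a small sketch of the four rules (mean, or, and, majority). The per-signal cutoff is an assumption: the description does not state how an individual anomaly is thresholded, so the `threshold` parameter, its default, and the function name `combine_anomalies` are all hypothetical.

```python
def combine_anomalies(anomalies, method="mean", threshold=1.0):
    """Combine per-signal anomaly scores into (ensemble_scores, flags).

    anomalies: list of per-signal lists, each with one value per item.
    threshold: hypothetical cutoff; the real default is not given in the PR text.
    """
    n_signals = len(anomalies)
    n_items = len(anomalies[0])
    scores, flags = [], []
    for j in range(n_items):
        column = [anomalies[s][j] for s in range(n_signals)]
        score = sum(column) / n_signals           # ensemble_score is the mean
        votes = sum(a > threshold for a in column)  # signals voting "anomalous"
        if method == "mean":
            flagged = score > threshold
        elif method == "or":
            flagged = votes >= 1
        elif method == "and":
            flagged = votes == n_signals
        elif method == "majority":
            flagged = votes > n_signals / 2
        else:
            raise ValueError(f"unknown ensemble_method: {method!r}")
        scores.append(score)
        flags.append(flagged)
    return scores, flags
```

Whatever the cutoff, any item flagged by the and rule is also flagged by the or rule, which is the superset property the test suite checks.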

Tests

tests/test_diagnostics/test_diagnostics.py — 18 tests, all passing locally:

Rasch-positivity sanity checks for all three signals (item_total, item_scalability, average_tetrachoric_correlation).
Negative item-total detection on an injected anti-correlated item.
Tetrachoric matrix shape, symmetry, and unit diagonal.
Gaussian rank monotonicity, finiteness, NaN handling.
flag_items returns a DataFrame with the documented columns, ranks an injected bad item into the top 5, sorts rows descending by ensemble_score, propagates item_names, populates judge_output only on flagged rows, and calls the judge exactly once per flagged item. The items flagged under the or rule are also checked to be a superset of those flagged under the and rule.
Signal subsetting and three input-validation errors (unknown signal, unknown ensemble_method, mismatched item_names length).
ItemJudge Protocol matches a plain function via isinstance.
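
The judge-related tests above can be pictured with a small sketch of a runtime_checkable Protocol and a plain function that satisfies it. The argument types (str, int, float), the toy length_judge rule, and the apply_judge helper are assumptions for illustration; only the parameter names and the str return type come from the description.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ItemJudge(Protocol):
    # Parameter names follow the PR description; the types are assumed.
    def __call__(self, item_text: str, item_idx: int, anomaly_score: float) -> str: ...


def length_judge(item_text: str, item_idx: int, anomaly_score: float) -> str:
    """Hypothetical rule-based judge; a real one might call an LLM instead."""
    verdict = "suspiciously short" if len(item_text) < 20 else "no obvious issue"
    return f"item {item_idx} (anomaly {anomaly_score:.2f}): {verdict}"


def apply_judge(judge, item_texts, scores, flags):
    """Call the judge once per flagged item; unflagged rows get None."""
    return [
        judge(text, idx, score) if flagged else None
        for idx, (text, score, flagged) in enumerate(zip(item_texts, scores, flags))
    ]
```

Note that isinstance against a runtime_checkable Protocol only checks that the object has a `__call__` attribute; it does not validate the signature, so a mismatched callable would fail only when invoked.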

Verification

pytest tests/test_diagnostics/ -v   # 18 passed
pytest tests/ -q                     # 334 passed, 0 failed
ruff check src/torch_measure/diagnostics/ tests/test_diagnostics/
ruff format --check src/torch_measure/diagnostics/ tests/test_diagnostics/

feat(diagnostics): add item-level diagnostics module (Fantastic Bugs)

f484652

sparkyluvscode mentioned this pull request May 29, 2026

feat(tutorials): GSM8K benchmark diagnostics notebook reproducing Truong et al. 2025 #42

Open

sparkyluvscode wants to merge 1 commit into aims-foundations:main from sparkyluvscode:feat/diagnostics-fantastic-bugs
