
feat(diagnostics): add item-level diagnostics module (Fantastic Bugs [arxiv.org/abs/2511.16842])#39

Open
sparkyluvscode wants to merge 1 commit into aims-foundations:main from sparkyluvscode:feat/diagnostics-fantastic-bugs

@sparkyluvscode

Summary

Adds a new torch_measure.diagnostics subpackage that implements the item-flagging procedure from Fantastic Bugs and Where to Find Them in AI Benchmarks (Truong et al., NeurIPS 2025, arXiv:2511.16842).

The design was discussed with the maintainer over email prior to implementation and was approved as-is ("the plan looks reasonable, please go ahead whenever you have a chance"). No issue exists for this work.

The module is a thin orchestration layer over the existing PyTorch metrics in torch_measure.metrics. The three statistical signals from the paper (tetrachoric correlation, Mokken scalability, item-total correlation) are already implemented there, so this PR reuses them rather than re-implementing them in NumPy. This keeps the library single-stack on PyTorch and avoids duplicating the math.

What is included

New module: src/torch_measure/diagnostics/

  • _signals.py - re-exports the three signals from torch_measure.metrics and adds average_tetrachoric_correlation, which averages each item's row of the tetrachoric matrix over its off-diagonal entries. item_scalability is a thin wrapper around mokken_scalability that returns the per-item H_items tensor.
  • _ensemble.py - gaussian_rank() implements the rank transform A_i = Φ⁻¹((r_i - 0.5)/N) with average-rank tie-breaking and NaN-safe behaviour (NaN inputs map to 0, so they neither flag nor protect an item). flag_items() is the public entry point. It computes the chosen signals, negates them so that a larger value means a more anomalous item (matching the paper's "non-negative under Rasch" framing), applies the Gaussian rank transform, and combines the per-signal anomalies via one of four rules: mean, or, and, majority. It returns a pandas.DataFrame sorted by ensemble anomaly, descending.
  • _judge.py - defines ItemJudge, a runtime_checkable Protocol with signature (item_text, item_idx, anomaly_score) -> str. Any callable with that signature satisfies it; no specific LLM provider is assumed.
  • __init__.py - exposes flag_items, gaussian_rank, ItemJudge, plus the three signals and average_tetrachoric_correlation.
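The rank transform described for gaussian_rank() can be illustrated with a minimal standard-library sketch. This is not the PR's implementation (which operates on PyTorch tensors); the name gaussian_rank_sketch is hypothetical, and taking N to be the number of non-NaN entries is an assumption.

```python
import math
from statistics import NormalDist


def gaussian_rank_sketch(values):
    """Map values to Phi^-1((r_i - 0.5) / N) using average ranks for ties.

    Hypothetical illustration only. Assumption: N counts non-NaN entries.
    """
    # Indices of non-NaN entries; NaN entries are excluded from ranking.
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    n = len(valid)
    out = [0.0] * len(values)  # NaN inputs stay at 0, a neutral score
    if n == 0:
        return out

    # Sort valid indices by value, then give tied values their average rank.
    order = sorted(valid, key=lambda i: values[i])
    ranks = {}
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1

    # (r - 0.5) / n always lies strictly inside (0, 1), so inv_cdf is defined.
    std_normal = NormalDist()
    for i in valid:
        out[i] = std_normal.inv_cdf((ranks[i] - 0.5) / n)
    return out


scores = gaussian_rank_sketch([0.3, float("nan"), -0.1, 0.3])
```

In this example the two tied values share rank 2.5 and receive the same score, the smallest value receives a negative score, and the NaN entry maps to 0.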

torch_measure/__init__.py now imports torch_measure.diagnostics and lists it in __all__.

DataFrame schema returned by flag_items

| Column | When present | Description |
| --- | --- | --- |
| item_idx | always | Position in the original response matrix. |
| item_name | if item_names supplied | User-supplied label. |
| tetrachoric_score | if tetrachoric in signals | Raw per-item average tetrachoric correlation. |
| scalability_score | if scalability in signals | Raw per-item Mokken H_j. |
| item_total_score | if item_total in signals | Raw item-total correlation. |
| ensemble_score | always | Mean of Gaussian-rank anomalies across selected signals. |
| flagged | always | Boolean decision under the chosen ensemble rule. |
| judge_output | if judge supplied | String returned by the judge; None for unflagged rows. |
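To make the flagged column concrete, here is a rough sketch of how the four ensemble rules could turn per-signal anomaly scores into one Boolean decision for a single item. The function name combine_flags, the threshold parameter, and its default value are all assumptions made for illustration; the PR description does not state the cutoff or the exact semantics that flag_items uses.

```python
def combine_flags(anomalies, method="mean", threshold=1.0):
    """Combine one item's per-signal anomaly scores into a flag.

    Hypothetical sketch: the threshold and its default are assumptions.
    """
    if not anomalies:
        raise ValueError("at least one signal is required")

    # One vote per signal: does this signal alone exceed the cutoff?
    votes = [a > threshold for a in anomalies]

    if method == "mean":
        return sum(anomalies) / len(anomalies) > threshold
    if method == "or":
        return any(votes)
    if method == "and":
        return all(votes)
    if method == "majority":
        return sum(votes) > len(votes) / 2
    raise ValueError(f"unknown ensemble_method: {method!r}")


item_scores = [2.0, 0.5, 1.5]  # made-up anomalies for three signals
decisions = {m: combine_flags(item_scores, m) for m in ("mean", "or", "and", "majority")}
```

Under any per-signal thresholding of this kind, every item flagged by and is also flagged by or, which is the superset property the test suite checks.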

Tests

tests/test_diagnostics/test_diagnostics.py — 18 tests, all passing locally:

  • Rasch-positivity sanity checks for all three signals (item_total, item_scalability, average_tetrachoric_correlation).
  • Negative item-total detection on an injected anti-correlated item.
  • Tetrachoric matrix shape, symmetry, and unit diagonal.
  • Gaussian rank monotonicity, finiteness, NaN handling.
  • flag_items returns a DataFrame with the documented columns, ranks an injected bad item into the top 5, sorts rows by ensemble_score in descending order, propagates item_names, populates judge_output only on flagged rows, and calls the judge exactly once per flagged item. The set of items flagged under the or rule is a superset of the set flagged under the and rule.
  • Signal subsetting and three input-validation errors (unknown signal, unknown ensemble_method, mismatched item_names length).
  • ItemJudge Protocol matches a plain function via isinstance.
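The judge contract can be sketched with a standard-library Protocol. The class below is a stand-in written for illustration, not the PR's ItemJudge; the names ItemJudgeSketch and echo_judge are hypothetical, and only the (item_text, item_idx, anomaly_score) -> str signature comes from the description above.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ItemJudgeSketch(Protocol):
    """Stand-in for ItemJudge: any callable with this signature qualifies."""

    def __call__(self, item_text: str, item_idx: int, anomaly_score: float) -> str: ...


def echo_judge(item_text: str, item_idx: int, anomaly_score: float) -> str:
    # A trivial judge with no LLM behind it: it just formats its inputs.
    return f"item {item_idx} (score {anomaly_score:.2f}): review {item_text!r}"


is_judge = isinstance(echo_judge, ItemJudgeSketch)  # plain functions define __call__
verdict = echo_judge("2 + 2 = 5?", 3, 1.234)
```

One caveat of runtime_checkable protocols is that isinstance only checks that the member exists (here, __call__), not its parameters or return type, so signature mismatches are caught by a static type checker rather than at runtime.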

Verification

pytest tests/test_diagnostics/ -v   # 18 passed
pytest tests/ -q                     # 334 passed, 0 failed
ruff check src/torch_measure/diagnostics/ tests/test_diagnostics/
ruff format --check src/torch_measure/diagnostics/ tests/test_diagnostics/
