test(qwen3): HF-golden logits gate, prune low-value tests across crates #194
Merged
Code Review
This pull request introduces a new "logits golden gate" correctness testing framework for the Qwen3-4B model, replacing the fragile exact-text greedy regression tests with a tolerance-based check against a stored HuggingFace bf16 golden. Key additions include the golden gate methodology documentation, a Python tool to dump the reference safetensors, and new integration tests (hf_golden_gate.rs and scheduler_robustness.rs). Additionally, numerous obsolete tests across several crates (including pegainfer-core, deepseek-v2-lite, kernels, and kimi-k2) have been removed or consolidated. There are no review comments to assess, so I have no feedback to provide on the review itself.
Replace Qwen3-4B's hardware-bound exact-text e2e / logprob-hash / executor-equivalence tests with tests/hf_golden_gate.rs: a tolerance check against a stored HuggingFace bf16 logits golden.

It teacher-forces 48 seed-fixed sequences (816 positions) and asserts pegainfer's logprobs stay at the bf16 noise floor of HF across bs=1 / batched eager / CUDA-graph, with strict guards calibrated from the measured floor:

- structural regret check on the argmax (<= 0.20 nat)
- mean delta <= 0.06, p99 delta <= 0.20
- absolute max printed but not asserted (it grows with coverage)

The old exact batch==sequential invariant is false: batch composition changes the reduction order and drifts logits ~1 ULP, so the gate absorbs that benign noise instead of red-lining on it.

Golden generated by tools/accuracy/dump_qwen3_4b_hf_golden.py (HF bf16, device_map=auto). Methodology in docs/subsystems/correctness/logits-golden-gate.md; Qwen3-4B specifics in docs/models/qwen3/accuracy-gate.md.

Also prune low-value unit tests across pegainfer-core, kernels, deepseek-v2-lite, kimi-k2, sim, and the vllm-frontend, and add focused scheduler admission/plan tests that exercise real state transitions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
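For readers who want a feel for the guards listed in the commit message, here is a minimal Python sketch of a tolerance gate of this shape. It is not the code in tests/hf_golden_gate.rs. The function names, the nearest-rank percentile, and the exact definition of "regret" (golden logprob lost by taking our argmax instead of the golden's) are assumptions made for illustration; only the three thresholds come from the commit message.

```python
import math

# Thresholds quoted in the commit message; everything else is illustrative.
MAX_ARGMAX_REGRET_NAT = 0.20
MAX_MEAN_DELTA = 0.06
MAX_P99_DELTA = 0.20


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def gate(ours, golden, targets):
    """Compare per-position logits against a golden.

    ours, golden: one list of logits per position.
    targets: the teacher-forced token id at each position.
    Returns (passed, stats).
    """
    deltas = []
    worst_regret = 0.0
    for o, g, t in zip(ours, golden, targets):
        lo, lg = log_softmax(o), log_softmax(g)
        # Logprob gap on the teacher-forced token.
        deltas.append(abs(lo[t] - lg[t]))
        # Assumed regret: golden logprob given up by picking our argmax.
        ours_top = max(range(len(lo)), key=lo.__getitem__)
        worst_regret = max(worst_regret, max(lg) - lg[ours_top])
    stats = {
        "mean": sum(deltas) / len(deltas),
        "p99": percentile(deltas, 99),
        "max": max(deltas),  # reported only, never asserted
        "regret": worst_regret,
    }
    passed = (
        stats["regret"] <= MAX_ARGMAX_REGRET_NAT
        and stats["mean"] <= MAX_MEAN_DELTA
        and stats["p99"] <= MAX_P99_DELTA
    )
    return passed, stats
```

Under this definition an argmax flip between two near-tied tokens costs almost no regret and passes, while a flip to a clearly worse token fails even if the mean delta stays small.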
ad456c3 to bdd297b
What
Cleans up the test suite, model-by-model (Qwen3-4B first), replacing hardware-bound exact-output tests with a portable, strict numerical gate.
Qwen3-4B logits golden gate
Replaces the exact-text e2e.rs, the logprob hash, and the executor_equivalence bit-identity check (all of which false-positive across GPUs) with tests/hf_golden_gate.rs: a tolerance check against a stored HuggingFace bf16 logits golden.
The old batched == sequential bit-identity invariant is false (batch composition changes the reduction order → ~1 ULP drift); the gate absorbs that benign noise instead of red-lining on it.
Golden generated by tools/accuracy/dump_qwen3_4b_hf_golden.py (HF bf16, device_map=auto so it scales to large models).
Docs
docs/subsystems/correctness/logits-golden-gate.md: the reusable methodology (the "why"), so other model lines can adopt the same pattern.
docs/models/qwen3/accuracy-gate.md: the Qwen3-4B specifics (constants, verified noise-floor table, regen/run commands).
Test pruning
Removes low-value unit tests across pegainfer-core, pegainfer-kernels, deepseek-v2-lite, kimi-k2, pegainfer-sim, and the vllm-frontend, and adds focused scheduler admission/plan tests that exercise real state transitions rather than mocked internals.
Verification (RTX 5070 Ti, sm_120)
hf_golden_gate green in 26s; mean/p99 flat across all four passes.
cargo test --release --lib green for all default-feature crates in the changeset; fmt + clippy clean (pre-commit hook passed).
🤖 Generated with Claude Code
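The reduction-order point in the description is easy to reproduce on a CPU. The sketch below is only an illustration: it uses Python float64 as a stand-in for GPU bf16/fp32 accumulation, and the two helper functions are hypothetical names, not anything in pegainfer. It sums the same three values left to right and as a tree, the way different batch shapes can regroup a reduction.

```python
import math


def sum_left_to_right(xs):
    """Accumulate sequentially, as a single-thread loop would."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_pairwise(xs):
    """Accumulate as a binary tree, as a parallel reduction might."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])


values = [0.1, 0.2, 0.3]
sequential = sum_left_to_right(values)  # (0.1 + 0.2) + 0.3
tree = sum_pairwise(values)             # 0.1 + (0.2 + 0.3)

# Same inputs, different grouping: the results are not bit-identical,
# yet they differ by only about one unit in the last place.
print(sequential == tree)
print(abs(sequential - tree) / math.ulp(tree))
```

Both results are equally valid roundings of the same sum, which is why an exact-equality assertion between batched and sequential execution fails on benign noise while a tolerance gate does not.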