test(qwen3): HF-golden logits gate, prune low-value tests across crates by xiaguan · Pull Request #194 · xiaguan/pegainfer

xiaguan · 2026-05-29T10:29:39Z

What

Cleans up the test suite, model-by-model (Qwen3-4B first), replacing hardware-bound exact-output tests with a portable, strict numerical gate.

Qwen3-4B logits golden gate

Replaces the exact-text e2e.rs, the logprob hash, and the executor_equivalence bit-identity check — all of which false-positive across GPUs — with tests/hf_golden_gate.rs: a tolerance check against a stored HuggingFace bf16 logits golden.
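A toy illustration of why bit-identity is fragile at bf16, assuming nothing about pegainfer's kernels: the helper below emulates bfloat16 rounding in pure Python and sums the same three values in two orders. Rounding after every add is a deliberate exaggeration (real GEMMs usually accumulate in higher precision and round once), and `to_bf16` / `bf16_sum` are hypothetical names, but the effect is the same kind: a different reduction order lands on a neighbouring representable value.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add just under half of the dropped 16 bits, plus the tie-breaking bit,
    # then truncate the low 16 bits. Not meant for inf/NaN inputs.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Sum left to right, rounding to bf16 after every add (exaggerated on purpose)."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


# At 256 the bf16 spacing is 2, so adding 1.0 twice is lost in one order
# and kept in the other.
forward = bf16_sum([256.0, 1.0, 1.0])   # 256.0
backward = bf16_sum([1.0, 1.0, 256.0])  # 258.0
print(forward, backward, backward - forward)
```

The two results differ by exactly one bf16 step at that magnitude, which is enough to flip an exact-text or hash baseline without indicating any bug.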

- HF is the numerical golden truth. Bit-wise baselines (text/hash) drift across cards because per-GPU bf16 GEMM differs in the low mantissa bits. The gate instead asserts pegainfer lands within HF's bf16 noise floor, which every numerically-correct GPU satisfies.
- Teacher-forced, 48 seed-fixed sequences / 816 positions, prompts 1–256 tokens (up to 16 KV blocks) + 16 decode tokens.
- Strict guards, calibrated from the measured floor, not guessed:
  - structural regret check on the argmax (≤ 0.20 nat): magnitude-independent, catches a confidently-wrong token even if it's absent from HF's top-K
  - mean delta ≤ 0.06 (≈2× floor), p99 delta ≤ 0.20 (≈1.6× floor)
  - absolute max printed but not asserted: it grows with sample count (irreducible bf16 tail), while mean/p99 stay flat
- Replayed across the failure surfaces: bs=1 (+ determinism rerun) · batched eager (cross-request isolation) · CUDA-graph at bucket straddles 9→16 / 5→8 (padding-slot leaks).
- The old batched == sequential bit-identity invariant is false (batch composition changes the reduction order → ~1 ULP drift); the gate absorbs that benign noise instead of red-lining on it.
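The guards above can be sketched roughly as follows. This is not the code in tests/hf_golden_gate.rs; it is a minimal Python sketch under stated assumptions: the golden is taken to store HF top-K logprobs per position, "delta" is taken to be the largest absolute logprob difference over those stored tokens, and regret is computed inside the engine's own distribution (so it needs no HF value for the engine's pick). Only the three thresholds come from the PR text; every function name and the data layout are hypothetical.

```python
import math

MEAN_MAX = 0.06    # from the PR text
P99_MAX = 0.20     # from the PR text
REGRET_MAX = 0.20  # nats, from the PR text


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def position_stats(engine_logits, hf_topk):
    """hf_topk maps token id -> HF logprob (assumed golden layout)."""
    lp = log_softmax(engine_logits)
    # Assumed definition of "delta": worst disagreement on the stored tokens.
    delta = max(abs(lp[t] - ref) for t, ref in hf_topk.items())
    hf_best = max(hf_topk, key=hf_topk.get)
    engine_best = max(range(len(lp)), key=lp.__getitem__)
    # How much the engine prefers its own pick over HF's pick, in nats.
    regret = lp[engine_best] - lp[hf_best]
    return delta, regret


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def gate(positions):
    """positions: iterable of (engine_logits, hf_topk) pairs."""
    stats = [position_stats(logits, topk) for logits, topk in positions]
    deltas = [d for d, _ in stats]
    report = {
        "mean": sum(deltas) / len(deltas),
        "p99": percentile(deltas, 0.99),
        "max": max(deltas),  # reported only, never asserted
        "regret": max(r for _, r in stats),
    }
    ok = (report["mean"] <= MEAN_MAX
          and report["p99"] <= P99_MAX
          and report["regret"] <= REGRET_MAX)
    return ok, report
```

Keeping `max` in the report but out of the pass condition mirrors the PR's reasoning: the tail grows with the number of positions, while the mean and p99 are stable enough to hold a tight bound.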

Golden generated by tools/accuracy/dump_qwen3_4b_hf_golden.py (HF bf16, device_map=auto so it scales to large models).

Docs

docs/subsystems/correctness/logits-golden-gate.md — the reusable methodology (the "why"), so other model lines can adopt the same pattern.
docs/models/qwen3/accuracy-gate.md — the Qwen3-4B specifics (constants, verified noise-floor table, regen/run commands).

Test pruning

Removes low-value unit tests across pegainfer-core, pegainfer-kernels, deepseek-v2-lite, kimi-k2, pegainfer-sim, and the vllm-frontend, and adds focused scheduler admission/plan tests that exercise real state transitions rather than mocked internals.

Verification (RTX 5070 Ti, sm_120)

hf_golden_gate green in 26s; mean/p99 flat across all four passes:

| Pass | positions | mean | p99 | max |
| --- | --- | --- | --- | --- |
| bs=1 eager | 816 | 0.0317 | 0.1196 | 0.3749 |
| batched eager (9) | 153 | 0.0337 | 0.1297 | 0.4374 |
| graph (9 padded) | 153 | 0.0337 | 0.1297 | 0.4374 |
| graph (5 padded) | 85 | 0.0316 | 0.1080 | 0.1410 |
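As a quick arithmetic check of the table against the stated guards (mean ≤ 0.06, p99 ≤ 0.20), the snippet below recomputes the headroom per pass. The numbers are copied from the table; the helper names are hypothetical. It also notes that every position count is a multiple of 17, which is consistent with 17 scored positions per sequence (48, 9, and 5 sequences). That per-sequence reading is an inference from the counts, not something the PR states.

```python
MEAN_MAX, P99_MAX = 0.06, 0.20  # guards from the PR text

# (pass, positions, mean, p99, max) copied from the verification table
ROWS = [
    ("bs=1 eager", 816, 0.0317, 0.1196, 0.3749),
    ("batched eager (9)", 153, 0.0337, 0.1297, 0.4374),
    ("graph (9 padded)", 153, 0.0337, 0.1297, 0.4374),
    ("graph (5 padded)", 85, 0.0316, 0.1080, 0.1410),
]


def headroom(rows):
    """Return {pass: (mean guard / measured mean, p99 guard / measured p99)}."""
    out = {}
    for name, _positions, mean, p99, _max in rows:
        assert mean <= MEAN_MAX and p99 <= P99_MAX, name
        out[name] = (round(MEAN_MAX / mean, 2), round(P99_MAX / p99, 2))
    return out


def sequences_if_17_per_sequence(rows):
    """Inferred, not stated in the PR: 17 scored positions per sequence."""
    return [positions // 17 for _, positions, *_ in rows if positions % 17 == 0]


for name, (mean_x, p99_x) in headroom(ROWS).items():
    print(f"{name}: mean guard is {mean_x}x measured, p99 guard is {p99_x}x measured")
print(sequences_if_17_per_sequence(ROWS))  # [48, 9, 9, 5]
```

The mean guard sits about 1.8–1.9× above the measured values and the p99 guard about 1.5–1.9×, in line with the "≈2× floor" and "≈1.6× floor" calibration. Three of the four measured maxima exceed 0.20, which is why the max is printed rather than asserted.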

cargo test --release --lib green for all default-feature crates in the changeset; fmt + clippy clean (pre-commit hook passed).

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces a new "logits golden gate" correctness testing framework for the Qwen3-4B model, replacing the fragile exact-text greedy regression tests with a tolerance-based check against a stored HuggingFace bf16 golden. Key additions include the golden gate methodology documentation, a Python tool to dump the reference safetensors, and new integration tests (hf_golden_gate.rs and scheduler_robustness.rs). Additionally, numerous obsolete tests across several crates (including pegainfer-core, deepseek-v2-lite, kernels, and kimi-k2) have been removed or consolidated. There are no review comments to assess, so I have no feedback to provide on the review itself.

Replace Qwen3-4B's hardware-bound exact-text e2e / logprob-hash / executor-equivalence tests with tests/hf_golden_gate.rs: a tolerance check against a stored HuggingFace bf16 logits golden.

It teacher-forces 48 seed-fixed sequences (816 positions) and asserts pegainfer's logprobs stay at the bf16 noise floor of HF across bs=1 / batched eager / CUDA-graph, with strict guards calibrated from the measured floor:
- structural regret check on the argmax (<= 0.20 nat)
- mean delta <= 0.06, p99 delta <= 0.20
- absolute max printed but not asserted (it grows with coverage)

The old exact batch==sequential invariant is false: batch composition changes the reduction order and drifts logits ~1 ULP, so the gate absorbs that benign noise instead of red-lining on it.

Golden generated by tools/accuracy/dump_qwen3_4b_hf_golden.py (HF bf16, device_map=auto). Methodology in docs/subsystems/correctness/logits-golden-gate.md; Qwen3-4B specifics in docs/models/qwen3/accuracy-gate.md.

Also prune low-value unit tests across pegainfer-core, kernels, deepseek-v2-lite, kimi-k2, sim, and the vllm-frontend, and add focused scheduler admission/plan tests that exercise real state transitions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 29, 2026

xiaguan force-pushed the chore/prune-low-value-tests branch from ad456c3 to bdd297b Compare May 29, 2026 10:41

xiaguan merged commit 8eab9b3 into main May 29, 2026
1 check passed

xiaguan deleted the chore/prune-low-value-tests branch May 29, 2026 10:44

xiaguan merged 1 commit into main from chore/prune-low-value-tests
