test(qwen3): HF-golden logits gate, prune low-value tests across crates #194

Merged: xiaguan merged 1 commit into main from chore/prune-low-value-tests on May 29, 2026
@xiaguan (Owner) commented May 29, 2026

What

Cleans up the test suite, model-by-model (Qwen3-4B first), replacing hardware-bound exact-output tests with a portable, strict numerical gate.

Qwen3-4B logits golden gate

Replaces the exact-text e2e.rs, the logprob hash, and the executor_equivalence bit-identity check with tests/hf_golden_gate.rs, a tolerance check against a stored HuggingFace bf16 logits golden. All three removed tests produce spurious failures when run on a different GPU.

  • HF is the numerical ground truth. Bit-wise baselines (text/hash) drift across cards because per-GPU bf16 GEMM differs in the low mantissa bits. The gate instead asserts that pegainfer lands within HF's bf16 noise floor, which every numerically correct GPU satisfies.
  • Teacher-forced, 48 seed-fixed sequences / 816 positions, prompts 1–256 tokens (up to 16 KV blocks) + 16 decode tokens.
  • Strict guards, calibrated from the measured floor — not guessed:
    • structural regret check on the argmax (≤ 0.20 nat) — magnitude-independent, catches a confidently-wrong token even if it's absent from HF's top-K
    • mean delta ≤ 0.06 (≈2× floor), p99 delta ≤ 0.20 (≈1.6× floor)
    • absolute max printed but not asserted — it grows with sample count (irreducible bf16 tail), while mean/p99 stay flat
  • Replayed across the failure surfaces: bs=1 (+ determinism rerun) · batched eager (cross-request isolation) · CUDA-graph at bucket straddles 9→16 / 5→8 (padding-slot leaks).
  • The old batched == sequential bit-identity invariant is false (batch composition changes the reduction order → ~1 ULP drift); the gate absorbs that benign noise instead of red-lining on it.
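
The guards listed above can be sketched in a few lines. This is an illustration in Python, not the Rust code in tests/hf_golden_gate.rs, which this page does not show. The thresholds are the ones quoted in this PR. Everything else is an assumption made for the example: the function names, the tuple layout, taking "delta" to mean the absolute logprob difference on the teacher-forced token, and taking "regret" to mean how much HF logprob is given up by choosing our argmax instead of HF's best token.

```python
import math

MEAN_MAX = 0.06    # thresholds quoted in this PR
P99_MAX = 0.20
REGRET_MAX = 0.20  # nats

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]

def regret(hf_topk, our_argmax):
    """HF logprob gap between HF's best token and the token we picked.

    hf_topk maps token id -> HF logprob for HF's K most likely tokens
    (hypothetical layout). If our argmax is missing from the map, its HF
    logprob is at most the smallest stored value, so the result is a
    lower bound on the true gap.
    """
    best = max(hf_topk.values())
    ours = hf_topk.get(our_argmax, min(hf_topk.values()))
    return best - ours

def gate(positions):
    """positions: list of (our_logprob, hf_logprob, hf_topk, our_argmax)."""
    deltas = [abs(ours - hf) for ours, hf, _, _ in positions]
    stats = {
        "mean": sum(deltas) / len(deltas),
        "p99": p99(deltas),
        "max": max(deltas),  # reported only, never asserted
        "regret": max(regret(topk, tok) for _, _, topk, tok in positions),
    }
    ok = (stats["mean"] <= MEAN_MAX
          and stats["p99"] <= P99_MAX
          and stats["regret"] <= REGRET_MAX)
    return ok, stats
```

Because the maximum is only reported, a single outlier in the bf16 tail does not fail the run, whereas a wrong argmax fails the regret check regardless of how small the per-token deltas are.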

Golden generated by tools/accuracy/dump_qwen3_4b_hf_golden.py (HF bf16, device_map=auto so it scales to large models).
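
For readers unfamiliar with teacher forcing: both HF and pegainfer score the same fixed token sequence, so one divergent token cannot cascade into the rest of the sequence. The quantity compared at each position is the log-probability of the forced next token. The sketch below shows only that arithmetic in plain Python. It does not reproduce dump_qwen3_4b_hf_golden.py or the golden file layout, neither of which appears on this page, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def teacher_forced_logprobs(step_logits, tokens):
    """Logprob of each forced token.

    step_logits[i] is the logits row that predicts tokens[i + 1];
    the model is always fed the reference tokens, never its own output.
    """
    out = []
    for logits, target in zip(step_logits, tokens[1:]):
        out.append(log_softmax(logits)[target])
    return out
```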

Docs

  • docs/subsystems/correctness/logits-golden-gate.md — the reusable methodology (the "why"), so other model lines can adopt the same pattern.
  • docs/models/qwen3/accuracy-gate.md — the Qwen3-4B specifics (constants, verified noise-floor table, regen/run commands).

Test pruning

Removes low-value unit tests across pegainfer-core, pegainfer-kernels, deepseek-v2-lite, kimi-k2, pegainfer-sim, and the vllm-frontend, and adds focused scheduler admission/plan tests that exercise real state transitions rather than mocked internals.

Verification (RTX 5070 Ti, sm_120)

hf_golden_gate green in 26s; mean/p99 flat across all four passes:

| Pass | Positions | Mean | p99 | Max |
|---|---|---|---|---|
| bs=1 eager | 816 | 0.0317 | 0.1196 | 0.3749 |
| batched eager (9) | 153 | 0.0337 | 0.1297 | 0.4374 |
| graph (9 padded) | 153 | 0.0337 | 0.1297 | 0.4374 |
| graph (5 padded) | 85 | 0.0316 | 0.1080 | 0.1410 |
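
As a quick cross-check of the numbers above against the stated thresholds, the following Python snippet recomputes the headroom per pass. The row values are copied from the table; the helper name is hypothetical. It also confirms that every position count divides into 17 positions per sequence (816 / 48, 153 / 9, 85 / 5). That would fit one position at the end of the prompt plus 16 decode tokens, but this reading is an inference from the numbers, not something the PR states.

```python
MEAN_MAX, P99_MAX = 0.06, 0.20  # thresholds quoted in this PR

# (pass, sequences, positions, mean, p99, max) copied from the table
ROWS = [
    ("bs=1 eager", 48, 816, 0.0317, 0.1196, 0.3749),
    ("batched eager (9)", 9, 153, 0.0337, 0.1297, 0.4374),
    ("graph (9 padded)", 9, 153, 0.0337, 0.1297, 0.4374),
    ("graph (5 padded)", 5, 85, 0.0316, 0.1080, 0.1410),
]

def headroom(rows):
    """Return threshold / measured ratios for mean and p99, per pass."""
    out = {}
    for name, _, _, mean, p99, _ in rows:
        # Both asserted metrics must sit under their thresholds.
        assert mean <= MEAN_MAX and p99 <= P99_MAX, name
        out[name] = (round(MEAN_MAX / mean, 2), round(P99_MAX / p99, 2))
    return out

for name, (mean_ratio, p99_ratio) in headroom(ROWS).items():
    print(f"{name}: mean headroom {mean_ratio}x, p99 headroom {p99_ratio}x")

# Positions per sequence in each pass.
print({name: positions // seqs for name, seqs, positions, *_ in ROWS})
```

Note that three of the four max values exceed 0.20, which is consistent with the max being printed rather than asserted.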

cargo test --release --lib green for all default-feature crates in the changeset; fmt + clippy clean (pre-commit hook passed).

🤖 Generated with Claude Code

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a new "logits golden gate" correctness testing framework for the Qwen3-4B model, replacing the fragile exact-text greedy regression tests with a tolerance-based check against a stored HuggingFace bf16 golden. Key additions include the golden gate methodology documentation, a Python tool to dump the reference safetensors, and new integration tests (hf_golden_gate.rs and scheduler_robustness.rs). Additionally, numerous obsolete tests across several crates (including pegainfer-core, deepseek-v2-lite, kernels, and kimi-k2) have been removed or consolidated. There are no review comments to assess, so I have no feedback to provide on the review itself.

Replace Qwen3-4B's hardware-bound exact-text e2e / logprob-hash /
executor-equivalence tests with tests/hf_golden_gate.rs: a tolerance
check against a stored HuggingFace bf16 logits golden. It teacher-forces
48 seed-fixed sequences (816 positions) and asserts pegainfer's logprobs
stay at the bf16 noise floor of HF across bs=1 / batched eager /
CUDA-graph, with strict guards calibrated from the measured floor:

  - structural regret check on the argmax (<= 0.20 nat)
  - mean delta <= 0.06, p99 delta <= 0.20
  - absolute max printed but not asserted (it grows with coverage)

The old exact batch==sequential invariant is false: batch composition
changes the reduction order and drifts logits ~1 ULP, so the gate
absorbs that benign noise instead of red-lining on it.

Golden generated by tools/accuracy/dump_qwen3_4b_hf_golden.py (HF bf16,
device_map=auto). Methodology in
docs/subsystems/correctness/logits-golden-gate.md; Qwen3-4B specifics in
docs/models/qwen3/accuracy-gate.md.

Also prune low-value unit tests across pegainfer-core, kernels,
deepseek-v2-lite, kimi-k2, sim, and the vllm-frontend, and add focused
scheduler admission/plan tests that exercise real state transitions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xiaguan xiaguan force-pushed the chore/prune-low-value-tests branch from ad456c3 to bdd297b Compare May 29, 2026 10:41
@xiaguan xiaguan merged commit 8eab9b3 into main May 29, 2026
1 check passed
@xiaguan xiaguan deleted the chore/prune-low-value-tests branch May 29, 2026 10:44