fix(sampling): fused_topk_topp PDL race causing IMA#536
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 28dfcc1172
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
When pdl_enabled() is false, the new flag only disables the launches in this file; this call still enters air_top_p::launchRadixOnly, whose local launcher hard-codes cudaLaunchAttributeProgrammaticStreamSerialization (air_top_p.cuh:505-516). That means --disable-pdl or non-Hopper NVIDIA runs still issue PDL launches whenever the fused path runs, so the intended fallback can still hit unsupported/disabled PDL instead of behaving like the rest of the gated kernels. Pass enable_pdl through to launchRadixOnly and use it for those cudaLaunchKernelEx calls too.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Summary
applyKernel(launched withcudaLaunchAttributeProgrammaticStreamSerialization) readtop_k_idx[]beforeair_topk_11bits_fused_lastfinished writing it. Under heavy HBM contention (kvstore D2H writeback), the race window widened enough thatapplyKernelread the uninitialised sentinel valueINT32_MAXas a token index, producing an out-of-bounds store toout_probs + 130 TB→ Illegal Memory Access (IMA) crash.cudaGridDependencySynchronize()atapplyKernelentry so the kernel correctly waits for its PDL producer before reading top-k outputs.launchPDL→launchKernel(enable_pdl, …)and thread the globalpdl_enabled()flag from Python callers throughfused_topk_topp_renorm(enable_pdl=)down to everycudaLaunchKernelExcall, consistent with the rest of the codebase.idx < vocab_sizecheck in theapplyKernelwrite loop as a belt-and-suspenders guard against any future sentinel leak.Test Plan