
fix(sampling): fused_topk_topp PDL race causing IMA #536

Merged
lightseek-bot merged 4 commits into main from jay/fix-pdl-for-fused-topk-topp
Jun 27, 2026

Conversation

@jaywme jaywme commented Jun 26, 2026

Collaborator

Summary

  • Root cause: applyKernel (launched with cudaLaunchAttributeProgrammaticStreamSerialization) read top_k_idx[] before air_topk_11bits_fused_last finished writing it. Under heavy HBM contention (kvstore D2H writeback), the race window widened enough that applyKernel read the uninitialised sentinel value INT32_MAX as a token index. The resulting out-of-bounds store to out_probs + 130 TB caused an Illegal Memory Access (IMA) crash.
  • Fix: Add cudaGridDependencySynchronize() at applyKernel entry so the kernel correctly waits for its PDL producer before reading top-k outputs.
  • pdl_enabled() toggle: Rename launchPDL → launchKernel(enable_pdl, …) and thread the global pdl_enabled() flag from Python callers through fused_topk_topp_renorm(enable_pdl=) down to every cudaLaunchKernelEx call, consistent with the rest of the codebase.
  • Bounds guard: Add idx < vocab_size check in the applyKernel write loop as a belt-and-suspenders guard against any future sentinel leak.
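
As a rough illustration of the fix and the bounds guard, a consumer kernel under PDL might look like the sketch below. The kernel name, parameter list, and memory layout are assumptions made for illustration and are not taken from this PR's diff; only the cudaGridDependencySynchronize() call at kernel entry and the idx < vocab_size check come from the summary above. The extra idx >= 0 test is an addition of this sketch.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-in for applyKernel. One block handles one batch row and
// scatters k selected probabilities back into a dense [batch, vocab_size] row.
__global__ void applyKernelSketch(const int* __restrict__ top_k_idx,   // [batch, k], written by the producer
                                  const float* __restrict__ top_k_val, // [batch, k], written by the producer
                                  float* __restrict__ out_probs,       // [batch, vocab_size]
                                  int k, int vocab_size)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // Wait until the prerequisite (producer) grid has completed and its
    // global-memory writes are visible. Without this, a PDL-launched kernel
    // may start reading top_k_idx[] while the producer is still writing it.
    cudaGridDependencySynchronize();
#endif

    const int row = blockIdx.x;
    for (int i = threadIdx.x; i < k; i += blockDim.x) {
        const int idx = top_k_idx[row * k + i];
        // Bounds guard: a leaked sentinel such as INT32_MAX is skipped
        // instead of being used as a store offset.
        if (idx >= 0 && idx < vocab_size) {
            out_probs[static_cast<size_t>(row) * vocab_size + idx] = top_k_val[row * k + i];
        }
    }
}
```

The guard only limits the damage of a stale index; the synchronisation call is what removes the race, so the two are complementary rather than interchangeable.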

Test Plan

@jaywme jaywme requested a review from a team as a code owner June 26, 2026 19:35
@jaywme jaywme requested a review from yweng0828 June 26, 2026 19:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28dfcc1172

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

air_top_p::launchRadixOnly<float>(toppCounters, toppHistograms, toppCountHistograms,
toppBuf1, toppBuf2, batchSize, vocabSize, msStream);

[P1] Thread enable_pdl into the top-p radix launch

When pdl_enabled() is false, the new flag only disables the launches in this file; this call still enters air_top_p::launchRadixOnly, whose local launcher hard-codes cudaLaunchAttributeProgrammaticStreamSerialization (air_top_p.cuh:505-516). That means --disable-pdl or non-Hopper NVIDIA runs still issue PDL launches whenever the fused path runs, so the intended fallback can still hit unsupported/disabled PDL instead of behaving like the rest of the gated kernels. Pass enable_pdl through to launchRadixOnly and use it for those cudaLaunchKernelEx calls too.
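
A minimal sketch of what the review asks for, assuming a launch helper shaped like the launchKernel(enable_pdl, …) named in the PR summary. Every name ending in Sketch is hypothetical, and the real launchRadixOnly signature in air_top_p.cuh is not reproduced here; the point is only that the nested launcher receives the flag instead of hard-coding the PDL attribute.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>

// Hypothetical helper: attaches the PDL launch attribute only when enable_pdl is true.
template <typename... KernelArgs, typename... Args>
cudaError_t launchKernelSketch(bool enable_pdl, void (*kernel)(KernelArgs...),
                               dim3 grid, dim3 block, size_t smem_bytes,
                               cudaStream_t stream, Args&&... args)
{
    cudaLaunchConfig_t config = {};
    config.gridDim = grid;
    config.blockDim = block;
    config.dynamicSmemBytes = smem_bytes;
    config.stream = stream;

    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    // With the flag off, no attribute is passed and the launch is an ordinary
    // stream-ordered launch.
    config.attrs = enable_pdl ? attr : nullptr;
    config.numAttrs = enable_pdl ? 1u : 0u;

    return cudaLaunchKernelEx(&config, kernel, std::forward<Args>(args)...);
}

// Hypothetical kernel standing in for one radix pass.
__global__ void radixStepSketch(float* buf, int n)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    cudaGridDependencySynchronize();  // wait for the upstream grid before touching buf
#endif
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[i] = 0.0f;
    }
}

// Hypothetical nested launcher: it takes enable_pdl from its caller and
// forwards it, so a global "PDL off" switch reaches this launch as well.
cudaError_t launchRadixOnlySketch(float* buf, int n, cudaStream_t stream, bool enable_pdl)
{
    const dim3 block(256);
    const dim3 grid((n + 255) / 256);
    return launchKernelSketch(enable_pdl, radixStepSketch, grid, block, 0, stream, buf, n);
}
```

Routing every launch through a single helper like this keeps the decision in one place, which makes it harder for an individual launcher to bypass the toggle.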


@lightseek-bot lightseek-bot merged commit 2ed8197 into main Jun 27, 2026
33 of 38 checks passed
@lightseek-bot lightseek-bot deleted the jay/fix-pdl-for-fused-topk-topp branch June 27, 2026 05:12

2 participants