
Conversation

@Daisy-Ma-coder (Contributor) commented Oct 17, 2025

Bugfix for Flash Attention MLA with full CUDA graph: illegal memory access (IMA) following PR #25490

Ran into an illegal memory access error when testing some prompts with prefix caching enabled on the Flash Attention MLA backend.

The log below was generated with CUDA_LAUNCH_BLOCKING=1, which indicates the failure is in Flash Attention MLA.

INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:[1;36m(EngineCore_0 pid=481)[0;0m ERROR 10-13 10:51:40 [multiproc_executor.py:146] Worker proc VllmWorker-5 died unexpectedly, shutting down executor.
...

It turned out to be the same root cause as #25490: get_scheduler_metadata was being called with a different max_num_splits than the one passed to FlashAttnMLAMetadata.
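For context, a minimal Python sketch of the fixed pattern (the names below, such as fake_get_scheduler_metadata and FakeMLAMetadata, are illustrative stand-ins, not the actual vLLM code): decide max_num_splits once, up front, and hand that single value to both the scheduler-metadata call and the attention metadata object, so the kernel never sizes its split buffers from one value while being told another at launch time.

from dataclasses import dataclass

# Hypothetical stand-ins for the real vLLM pieces; names and signatures
# here are illustrative only, not the actual vLLM API.

@dataclass
class FakeMLAMetadata:
    scheduler_metadata: dict
    max_num_splits: int

def fake_get_scheduler_metadata(num_reqs: int, max_num_splits: int) -> dict:
    # The scheduler metadata sizes the kernel's intermediate split buffers
    # from this value, so it must match what the attention metadata reports.
    return {"num_reqs": num_reqs, "max_num_splits": max_num_splits}

def build_metadata(num_reqs: int, use_full_cudagraph: bool,
                   cudagraph_max_splits: int) -> FakeMLAMetadata:
    # Fixed pattern: compute max_num_splits once, before it is consumed,
    # then pass the same value to both consumers. The bug was deciding it
    # in two places that could disagree when full CUDA graphs were enabled.
    max_num_splits = cudagraph_max_splits if use_full_cudagraph else 0
    sched = fake_get_scheduler_metadata(num_reqs, max_num_splits=max_num_splits)
    return FakeMLAMetadata(scheduler_metadata=sched, max_num_splits=max_num_splits)

md = build_metadata(num_reqs=8, use_full_cudagraph=True, cudagraph_max_splits=16)
assert md.scheduler_metadata["max_num_splits"] == md.max_num_splits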

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to fix an illegal memory access error in Flash Attention MLA with full CUDA graph support by ensuring get_scheduler_metadata and FlashAttnMLAMetadata receive the same max_num_splits value. The changes correctly refactor the logic to calculate max_num_splits before it's used. However, I've identified a remaining logic issue where a similar discrepancy can occur when vllm_is_batch_invariant() is true, which could lead to the same bug under different conditions. I've provided a suggestion to fully resolve this.
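To illustrate the reviewer's remaining concern, here is a hedged sketch (the helper below is a hypothetical stand-in, not the actual vLLM code): if batch-invariant mode overrides the split count, that override has to be applied once, before either consumer reads the value; otherwise the same scheduler-metadata/FlashAttnMLAMetadata mismatch can reappear under that mode.

# Hypothetical sketch of the remaining concern; not the actual vLLM code.
def fake_is_batch_invariant() -> bool:
    # Stand-in for the batch-invariance feature flag.
    return True

def resolve_max_num_splits(requested: int) -> int:
    # Apply the batch-invariant override once, up front, so that both the
    # scheduler-metadata call and the attention metadata object see the
    # identical value instead of one applying the override independently.
    return 1 if fake_is_batch_invariant() else requested

max_num_splits = resolve_max_num_splits(16)
assert max_num_splits == 1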

qqma added 2 commits October 17, 2025 15:36
Signed-off-by: qqma <[email protected]>
Signed-off-by: qqma <[email protected]>
@LucasWilkinson (Collaborator) left a comment

LGTM; thanks!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) October 17, 2025 23:00
@github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Oct 17, 2025
@Daisy-Ma-coder (Contributor, Author) commented
Seems like the failed tests are unrelated; is it fine to still merge it?

@LucasWilkinson LucasWilkinson merged commit 5beacce into vllm-project:main Oct 22, 2025
47 checks passed

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1