make flash_attn ViT upgrade opt-in #27124
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request makes the FlashAttention backend upgrade for Vision Transformer (ViT) models an opt-in behavior, addressing an issue where it was unconditionally attempted, causing problems on AMD platforms. The change is implemented by introducing a `try_switch_to_fa` flag in `maybe_get_vit_flash_attn_backend` and updating the call sites in various models.

The overall approach is sound and correctly addresses the reported issue. However, I've identified a critical bug in the new implementation that could lead to crashes on platforms not supporting FlashAttention, like XPU. I've also pointed out a high-severity maintainability issue regarding the modification of function parameters, which could make the code harder to reason about. Addressing these points will improve the robustness and clarity of the code.
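For illustration, here is a minimal sketch of the opt-in shape described above. This is not the exact code from the PR: the real helper in `vllm/attention/layer.py` has a richer signature and return value, and the import path for `_Backend` may differ between vLLM versions.

```python
from vllm.platforms import current_platform
# NOTE: the location of _Backend varies across vLLM versions; adjust the import.
from vllm.attention.backends.registry import _Backend


def maybe_get_vit_flash_attn_backend(attn_backend, try_switch_to_fa: bool = False):
    """Return the ViT attention backend, optionally upgrading it to FlashAttention."""
    # With the default of False, whatever backend was selected (e.g. via
    # VLLM_ATTENTION_BACKEND or the platform default) is returned unchanged;
    # only call sites that pass try_switch_to_fa=True get the FA upgrade.
    if (
        try_switch_to_fa
        and attn_backend != _Backend.FLASH_ATTN
        and (current_platform.is_cuda() or current_platform.is_rocm())
    ):
        attn_backend = _Backend.FLASH_ATTN
    return attn_backend
```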
vllm/attention/layer.py
Outdated
```python
if try_switch_to_fa and not is_fa_backend(attn_backend):
    attn_backend = _Backend.FLASH_ATTN
```
The current logic unconditionally switches the backend to `FLASH_ATTN` if `try_switch_to_fa` is true. This can cause a crash on platforms that do not support FlashAttention, such as XPU, because the subsequent import of `vllm.vllm_flash_attn` will fail. The switch should be guarded to only occur on supported platforms (CUDA and ROCm).
Suggested change:

```diff
-if try_switch_to_fa and not is_fa_backend(attn_backend):
-    attn_backend = _Backend.FLASH_ATTN
+if try_switch_to_fa and not is_fa_backend(attn_backend) and (
+        current_platform.is_cuda() or current_platform.is_rocm()):
+    attn_backend = _Backend.FLASH_ATTN
```
vllm/attention/layer.py
Outdated
```python
        attn_backend == _Backend.FLASH_ATTN:
    # Always try upstream on ROCM.
    logger.info_once("maybe_get_vit_flash_attn_backend: forcing upstream FlashAttn on ROCM.")
    try_use_upstream_fa = True
```
Modifying an input parameter `try_use_upstream_fa` directly is confusing and can lead to unexpected side effects. It's better to use a local variable to track the state within the function. For example, you could introduce `use_upstream_fa = try_use_upstream_fa` at the beginning of the function and then modify and use `use_upstream_fa`.
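A minimal sketch of that pattern, using the ROCm branch from the snippet above (everything outside that branch is simplified; the real function does more):

```python
from vllm.logger import init_logger
from vllm.platforms import current_platform
# NOTE: the location of _Backend varies across vLLM versions; adjust the import.
from vllm.attention.backends.registry import _Backend

logger = init_logger(__name__)


def maybe_get_vit_flash_attn_backend(attn_backend, try_use_upstream_fa: bool):
    # Copy the parameter into a local so the caller's argument is never mutated.
    use_upstream_fa = try_use_upstream_fa

    if current_platform.is_rocm() and attn_backend == _Backend.FLASH_ATTN:
        # Always try upstream FlashAttention on ROCm.
        logger.info_once(
            "maybe_get_vit_flash_attn_backend: forcing upstream FlashAttn on ROCM.")
        use_upstream_fa = True

    # ...downstream logic reads use_upstream_fa instead of the parameter...
    return attn_backend, use_upstream_fa
```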
💡 Codex Review
Here are some automated review suggestions for this pull request.
Summary: In vllm-project#26104, some changes were made in layer.py that resulted in always trying to switch to the FA backend for ViT, even when `VLLM_ATTENTION_BACKEND` is set. This broke Meta's internal AMD pipelines, as it is neither desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt in to this behavior. Differential Revision: D84946967

Force-pushed from cf6b734 to 3405c66.
Updated to try and mimic #26104 as closely as possible to make this an equivalent change. I'm not sure the behavior in the original PR is good or should be preserved, though.
This PR also fixes existing AMD failures (example).
Summary:
In #26104, some changes were made in layer.py that resulted in always trying to switch to the FA backend for ViT, even when `VLLM_ATTENTION_BACKEND` is set. This broke Meta's internal AMD pipelines, as it is neither desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt in to this behavior.
Differential Revision: D84946967
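For reference, a hypothetical call site written against the simplified helper sketched earlier in this conversation; the real call sites in the affected models may pass additional arguments, and `_Backend.TORCH_SDPA` here is just a stand-in for whatever backend was originally selected.

```python
# NOTE: the location of _Backend varies across vLLM versions; adjust the import.
from vllm.attention.backends.registry import _Backend

# Suppose the platform (or VLLM_ATTENTION_BACKEND) selected torch SDPA for this ViT.
attn_backend = _Backend.TORCH_SDPA

# Explicit opt-in: without try_switch_to_fa=True, the selected backend is left
# untouched, which is the new default behavior introduced by this PR.
attn_backend = maybe_get_vit_flash_attn_backend(
    attn_backend,
    try_switch_to_fa=True,
)
```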