Conversation
Summary of Changes

Hello @XOR-op, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request upgrades the FlashAttention backend by integrating support for FlashAttention 4, aiming to leverage the latest performance enhancements. It establishes a flexible version detection and fallback system to maintain compatibility across various FlashAttention installations. Additionally, the changes adapt the attention function's output processing to align with potential API changes in newer FlashAttention versions.
Code Review
This pull request adds support for FlashAttention v4 by updating the import logic to dynamically detect and use the appropriate version of `flash_attn_func`. It also correctly handles the output of the attention function, which may be a tuple. The changes look good, but I have a couple of suggestions to improve code clarity and reduce duplication, which should enhance maintainability.
The import logic under review:

```python
fa_ver = 2
try:
    from flash_attn.cute import flash_attn_func

    fa_ver = 4
except ImportError:
    try:
        from flash_attn_interface import flash_attn_func as flash_attn_3_func

        # flash_attn 3 no longer has a different API, see the following commit:
        # https://github.com/Dao-AILab/flash-attention/commit/ed209409acedbb2379f870bbd03abce31a7a51b7
        flash_attn_func = flash_attn_3_func
        fa_ver = 3
    except ImportError:
        flash_attn_func = flash_attn_2_func

from fastvideo.attention.backends.abstract import (AttentionBackend,
                                                   AttentionImpl,
                                                   AttentionMetadata,
                                                   AttentionMetadataBuilder)
from fastvideo.logger import init_logger
```
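As the review notes, the attention function's return value may be a tuple in newer FlashAttention versions. A minimal sketch of the normalization, assuming `flash_attn_func` is whichever symbol the block above resolved; `run_attention` and the `(output, lse)` layout are illustrative assumptions, not the PR's actual code:

```python
import torch

def run_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # flash_attn_func: resolved by the version-detection block above.
    out = flash_attn_func(q, k, v)
    # FA3/FA4 interfaces may return a tuple such as (output, lse);
    # normalize so callers always receive the attention output tensor.
    if isinstance(out, tuple):
        out = out[0]
    return out
```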
The logic for detecting the FlashAttention version can be slightly simplified for better readability. By setting `fa_ver` within each block where a version is successfully imported, you can remove the initial assignment and make it clearer which version is being used.
Suggested change:

```python
try:
    from flash_attn.cute import flash_attn_func

    fa_ver = 4
except ImportError:
    try:
        from flash_attn_interface import flash_attn_func as flash_attn_3_func

        # flash_attn 3 no longer has a different API, see the following commit:
        # https://github.com/Dao-AILab/flash-attention/commit/ed209409acedbb2379f870bbd03abce31a7a51b7
        flash_attn_func = flash_attn_3_func
        fa_ver = 3
    except ImportError:
        flash_attn_func = flash_attn_2_func
        fa_ver = 2
```
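Setting `fa_ver` together with the symbol it describes inside each branch also keeps the version tag and the imported function in sync if another fallback level is added later.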
Converted to WIP: after discussion with @jzhang38, we decided we should add a new FA4 backend instead and wrap it in a custom op.

Do we need to gate FA4 behind CUDA arch?
Added the gate |
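For reference, a minimal sketch of what such a gate can look like; `fa4_supported` is a hypothetical helper, and the SM90 (Hopper) threshold is an assumption rather than the PR's actual cutoff:

```python
import torch

def fa4_supported() -> bool:
    # FA4's CuTe kernels target recent NVIDIA architectures, so only
    # report support when a sufficiently new GPU is present.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 9  # assumption: SM90+ (Hopper); the real gate may differ
```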
Install FA4 from https://github.com/Dao-AILab/flash-attention/tree/main/flash_attn/cute
What's in this PR:
- `torch.compile` support (see the sketch below)
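Given the plan above to add a dedicated FA4 backend wrapped in a custom op, a minimal sketch of how `torch.compile` support can be wired up with `torch.library.custom_op` (PyTorch ≥ 2.4); the op name `fastvideo::fa4_attention`, the signature, and the fake-tensor shape are assumptions, not the PR's actual registration:

```python
import torch
from flash_attn.cute import flash_attn_func  # FA4 interface used in this PR

@torch.library.custom_op("fastvideo::fa4_attention", mutates_args=())
def fa4_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    out = flash_attn_func(q, k, v)
    # Normalize a possible (output, lse) tuple to the output tensor.
    return out[0] if isinstance(out, tuple) else out

@fa4_attention.register_fake
def _(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Fake (meta) implementation: lets torch.compile trace shapes and
    # dtypes without launching the kernel.
    return torch.empty_like(q)
```

Registering the kernel as an opaque custom op keeps `torch.compile` from tracing into the FA4 internals, avoiding graph breaks while still calling the fused kernel at runtime.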