feat: added low VRAM flash attention backend #314
base: main
Conversation
How much performance improvement does flex attention offer in comparison?

I ran a comparison on 8xH200 and added it to the benchmarks section. I saw a slight improvement over flex attention in both VRAM usage and speed (25% faster). We could probably push it further by supporting [...]. I was not able to get flex-attention to compile on B200, which was one of the core motivations for this feature.
Force-pushed from 0279e03 to c9dbc1f.
Thanks! I was not able to use flex-attention on B200 either. In the meantime, could you run pre-commit on your code?
Force-pushed from d660579 to 0e0bba9.
There is still a conflict with the main branch.
torch.manual_seed(0)

def assert_similar(ref, out):
It looks like there are more significant numeric differences between the two approaches. How much of a difference is there if we do a point-to-point comparison between ref and out?
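(For reference, a point-to-point comparison of the kind asked about here could look like the sketch below; `ref` and `out` stand for the SDPA and flash-attention outputs from the test, and the snippet is illustrative rather than anyone's actual code.)

```python
# ref, out: same-shaped attention outputs from the SDPA reference and the flash backend
diff = (ref.float() - out.float()).abs()
rel = diff / ref.float().abs().clamp_min(1e-6)
print(f"max abs: {diff.max().item():.3e}  "
      f"mean abs: {diff.mean().item():.3e}  "
      f"max rel: {rel.max().item():.3e}")
```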
I trained qwen2.5-vl-7B-eagle3 using the latest specforge 0.1.1 and sglang 0.5.5, and encountered "AttributeError: 'Qwen2_5_VLForConditionalGeneration' object has no attribute 'set_aux_hidden_states_layers'". I didn't have this issue with the version before the fix. What could be the reason?
@Abigbigbig This looks like a different issue from this PR. Let's move it to a separate issue. I can point you to the fix.
Motivation
Both existing attention backends exhibit inefficiencies that hurt the training experience:

- The `sdpa` backend materializes the full `bsz x num_heads x q_len x kv_len` attention score matrix in VRAM, severely limiting the maximum sequence length (for example, with 32 heads at a 32K sequence length in bf16, that matrix alone is about 64 GiB per sample).
- The `flex_attention` backend is very particular about the Linux environment and often requires different compilation flags depending on package versions. We were not able to get this kernel to compile reliably on `torch==2.8.0`.

Using a log-sum-exp trick, we can avoid materializing any attention matrix while handling the TTT KV cache with very minimal overhead. We support this via the flash attention backend, since it readily provides an LSE tensor alongside the O tensor. Flash Attention 4 is also SOTA for training on Blackwell, and while porting FA4 is out of scope for this PR, supporting the flash attention interface is a first step.
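For illustration, here is a minimal sketch of the log-sum-exp merge described above. It is not the code in this PR; the function name, tensor shapes, and layout are assumptions, and the real backend obtains the partial O and LSE tensors directly from the flash attention kernel rather than computing them in PyTorch.

```python
import torch

def merge_attention(o_a, lse_a, o_b, lse_b):
    """Combine attention computed independently over two disjoint KV segments.

    o_*   : [bsz, num_heads, q_len, head_dim] partial attention outputs
    lse_* : [bsz, num_heads, q_len] log-sum-exp of the attention logits,
            as flash attention returns it alongside the O tensor
    """
    lse = torch.logaddexp(lse_a, lse_b)           # normalizer over both segments
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)    # softmax mass falling in segment A
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)    # softmax mass falling in segment B
    return w_a * o_a + w_b * o_b, lse             # exact full-softmax output and its LSE
```

Because only the per-query LSE vector is carried between segments, nothing of size `q_len x kv_len` is ever materialized.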
Modifications
- Added a new `LlamaFlashAttention` module which has the same API as `LlamaAttention` (using a manual hidden cache). Within the forward pass, we combine the per-segment flash attention outputs through their LSE tensors, as described in the Motivation, so that no attention score matrix is materialized.
- Added a test file `test_flash_attention.py` which verifies equivalence with the SDPA backend (up to bf16 numerical stability).

Related Issues
Accuracy Test
Ran `python -m tests.test_utils.test_flash_attention`:
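(The test's console output is not reproduced here. As a rough sketch, the check compares the flash-attention output against the SDPA reference with bf16-scale tolerances, along the lines below; the helper name matches the diff snippet above, but the tolerance values are illustrative assumptions rather than the ones used in the test.)

```python
import torch

def assert_similar(ref, out, rtol=1.6e-2, atol=1e-3):
    # bf16 keeps only ~8 mantissa bits, so tolerances are necessarily loose;
    # these thresholds are placeholders, not the test's actual values.
    torch.testing.assert_close(out.float(), ref.float(), rtol=rtol, atol=atol)
```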
Benchmark & Profiling
Trained a speculator on custom data for GLM 4.5 on 8xH200 with a per-GPU batch size of 1 and a sequence length of 32K. Here are the performance comparisons against flex attention:
We also trained for one epoch on perfectblend and achieved an accept length of 3 on GSM8K with a chain spec of 3 steps.
GLM 4.5 support was added in a custom branch built on top of this PR here.
Checklist