[llama3-8B] add flex_attention model flavor #1884

XilunWu · 2025-10-15T17:46:27Z

Stack from ghstack (oldest at bottom):

Summary
Add an 8B_flex_attn model flavor so that the llama3-8B model can use flex_attention.

Test
CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_train.sh --model.flavor "8B_flex_attn" --activation_checkpoint.mode "full" --parallelism.context_parallel_degree 2

[rank3]:[titan] 2025-10-15 10:37:01,390 - root - INFO - step:  1  loss: 12.2465  grad_norm:  4.5253  memory: 33.21GiB(34.95%)  tps: 86  tflops: 4.98  mfu: 0.50%
[rank0]:[titan] 2025-10-15 10:37:01,391 - root - INFO - step:  1  loss: 12.2465  grad_norm:  4.5253  memory: 33.21GiB(34.95%)  tps: 87  tflops: 5.02  mfu: 0.51%
[rank1]:[titan] 2025-10-15 10:37:01,391 - root - INFO - step:  1  loss: 12.2465  grad_norm:  4.5253  memory: 33.21GiB(34.95%)  tps: 87  tflops: 5.03  mfu: 0.51%
[rank2]:[titan] 2025-10-15 10:37:01,391 - root - INFO - step:  1  loss: 12.2465  grad_norm:  4.5253  memory: 33.21GiB(34.95%)  tps: 86  tflops: 4.99  mfu: 0.50%
[rank2]:[titan] 2025-10-15 10:37:07,957 - root - INFO - step: 10  loss: 10.2391  grad_norm: 16.1413  memory: 44.84GiB(47.19%)  tps: 5,615  tflops: 325.20  mfu: 32.88%
[rank3]:[titan] 2025-10-15 10:37:07,957 - root - INFO - step: 10  loss: 10.2391  grad_norm: 16.1413  memory: 44.84GiB(47.19%)  tps: 5,615  tflops: 325.20  mfu: 32.88%
[rank0]:[titan] 2025-10-15 10:37:07,957 - root - INFO - step: 10  loss: 10.2391  grad_norm: 16.1413  memory: 44.84GiB(47.19%)  tps: 5,616  tflops: 325.22  mfu: 32.88%
[rank1]:[titan] 2025-10-15 10:37:07,956 - root - INFO - step: 10  loss: 10.2391  grad_norm: 16.1413  memory: 44.84GiB(47.19%)  tps: 5,615  tflops: 325.20  mfu: 32.88%
[rank3]:[titan] 2025-10-15 10:37:15,308 - root - INFO - step: 20  loss:  8.5913  grad_norm:  7.1942  memory: 44.84GiB(47.19%)  tps: 5,573  tflops: 322.73  mfu: 32.63%
[rank0]:[titan] 2025-10-15 10:37:15,308 - root - INFO - step: 20  loss:  8.5913  grad_norm:  7.1942  memory: 44.84GiB(47.19%)  tps: 5,573  tflops: 322.74  mfu: 32.63%
[rank1]:[titan] 2025-10-15 10:37:15,308 - root - INFO - step: 20  loss:  8.5913  grad_norm:  7.1942  memory: 44.84GiB(47.19%)  tps: 5,573  tflops: 322.73  mfu: 32.63%
[rank2]:[titan] 2025-10-15 10:37:15,308 - root - INFO - step: 20  loss:  8.5913  grad_norm:  7.1942  memory: 44.84GiB(47.19%)  tps: 5,573  tflops: 322.73  mfu: 32.63%
[rank3]:[titan] 2025-10-15 10:37:22,689 - root - INFO - step: 30  loss:  7.7257  grad_norm:  2.7261  memory: 44.84GiB(47.19%)  tps: 5,550  tflops: 321.44  mfu: 32.50%
[rank0]:[titan] 2025-10-15 10:37:22,689 - root - INFO - step: 30  loss:  7.7257  grad_norm:  2.7261  memory: 44.84GiB(47.19%)  tps: 5,550  tflops: 321.44  mfu: 32.50%
[rank1]:[titan] 2025-10-15 10:37:22,689 - root - INFO - step: 30  loss:  7.7257  grad_norm:  2.7261  memory: 44.84GiB(47.19%)  tps: 5,550  tflops: 321.44  mfu: 32.50%
[rank2]:[titan] 2025-10-15 10:37:22,689 - root - INFO - step: 30  loss:  7.7257  grad_norm:  2.7261  memory: 44.84GiB(47.19%)  tps: 5,550  tflops: 321.44  mfu: 32.50%
[rank3]:[titan] 2025-10-15 10:37:30,028 - root - INFO - step: 40  loss:  7.4543  grad_norm:  3.6042  memory: 44.84GiB(47.19%)  tps: 5,582  tflops: 323.26  mfu: 32.69%
[rank0]:[titan] 2025-10-15 10:37:30,028 - root - INFO - step: 40  loss:  7.4543  grad_norm:  3.6042  memory: 44.84GiB(47.19%)  tps: 5,582  tflops: 323.27  mfu: 32.69%
[rank1]:[titan] 2025-10-15 10:37:30,028 - root - INFO - step: 40  loss:  7.4543  grad_norm:  3.6042  memory: 44.84GiB(47.19%)  tps: 5,582  tflops: 323.25  mfu: 32.69%
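
To compare the per-rank numbers above more easily, the metric lines can be parsed and averaged per step. The following is a minimal sketch and not part of this PR: the helper names `parse_metrics` and `mean_by_step` are hypothetical, and the regular expression assumes the exact log layout shown above.

```python
import re
from collections import defaultdict

# Matches one metric line in the layout printed above (assumed stable).
LINE_RE = re.compile(
    r"\[rank(?P<rank>\d+)\].*?step:\s*(?P<step>\d+)\s+"
    r"loss:\s*(?P<loss>[\d.]+)\s+"
    r"grad_norm:\s*(?P<grad_norm>[\d.]+)\s+"
    r"memory:\s*(?P<mem_gib>[\d.]+)GiB\((?P<mem_pct>[\d.]+)%\)\s+"
    r"tps:\s*(?P<tps>[\d,]+)\s+"
    r"tflops:\s*(?P<tflops>[\d.]+)\s+"
    r"mfu:\s*(?P<mfu>[\d.]+)%"
)


def parse_metrics(log_text):
    """Return one dict per metric line; lines that do not match are skipped."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        rows.append(
            {
                "rank": int(m["rank"]),
                "step": int(m["step"]),
                "loss": float(m["loss"]),
                "grad_norm": float(m["grad_norm"]),
                "mem_gib": float(m["mem_gib"]),
                "mem_pct": float(m["mem_pct"]),
                # tps is printed with thousands separators, e.g. "5,615".
                "tps": int(m["tps"].replace(",", "")),
                "tflops": float(m["tflops"]),
                "mfu": float(m["mfu"]),
            }
        )
    return rows


def mean_by_step(rows, key):
    """Average one metric across ranks for each step, ordered by step."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["step"]].append(row[key])
    return {step: sum(vals) / len(vals) for step, vals in sorted(buckets.items())}
```

For example, saving the output above to a file and calling `mean_by_step(parse_metrics(open("train.log").read()), "mfu")` would give the mean MFU per logged step; the file name `train.log` is only a placeholder.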

[ghstack-poisoned]

ghstack-source-id: 2e7013d Pull Request resolved: #1884


[llama3-8B] add flex_attention model flavor

68ff8c6


XilunWu requested review from fegin, tianyu-l, wconstab and wwwjn as code owners October 15, 2025 17:46

This was referenced Oct 15, 2025

[RFC] Lift freqs_cis as an input of models #1882

Draft

[RFC][WIP][CP] Enable FlexAttention CP for llama3 #1883

Draft

XilunWu added a commit that referenced this pull request Oct 15, 2025

[llama3-8B] add flex_attention model flavor

e819d38

ghstack-source-id: 2e7013d Pull Request resolved: #1884

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 15, 2025

XilunWu marked this pull request as draft October 15, 2025 18:07

This was referenced Oct 16, 2025

[CP] test load-balance on llama3-8B #1897

Draft

[draft] print blockmask sprsity #1901

Draft
