add option to save profiling traces in inference roofline script #3196

vkuzo · 2025-10-17T00:44:36Z

Summary:

Saved traces make it convenient to analyze differences between the roofline estimate and observed performance.
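For readers unfamiliar with the comparison, here is a minimal sketch of a roofline bound for a single GEMM and the gap to an observed time. This is not the model used by the script in this PR; the `Device` class, the helper names, and the peak numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical peak numbers; substitute the real specs of your GPU."""
    peak_flops_per_s: float  # peak matmul throughput for the dtype in use
    peak_bytes_per_s: float  # peak memory bandwidth


def gemm_roofline_s(m, k, n, bytes_per_elem_in, bytes_per_elem_out, dev):
    """Lower bound (seconds) for an (m, k) x (k, n) GEMM under a simple roofline."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n) * bytes_per_elem_in + m * n * bytes_per_elem_out
    compute_s = flops / dev.peak_flops_per_s
    memory_s = bytes_moved / dev.peak_bytes_per_s
    # The kernel cannot finish faster than the slower of the two limits.
    return max(compute_s, memory_s)


def roofline_gap(observed_s, roofline_s):
    """How many times slower the observed run is than the roofline bound."""
    return observed_s / roofline_s


# Made-up device and shapes, for illustration only.
dev = Device(peak_flops_per_s=1.0e15, peak_bytes_per_s=4.0e12)
bound = gemm_roofline_s(4096, 4096, 4096, 1, 2, dev)
gap = roofline_gap(observed_s=3 * bound, roofline_s=bound)
```

A gap well above 1 is the signal to open the saved trace and see which kernels account for the extra time.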

tl;dr of findings:

mxfp8

1. need to pre-swizzle weights
2. torch.compile gives us two kernels; will repurpose the manual training kernel for this, which will need pre-swizzling added. Longer term, can check whether the fbgemm_gpu one is faster.

mxfp4

1. need to pre-swizzle weights
2. need a faster gemm (can use fbgemm_gpu)
3. need a fused activation quant kernel (can use fbgemm_gpu)

nvfp4

1. need to speed up the existing triton activation quant kernel; it currently doesn't autotune anything, so there are probably some easy wins here. Longer term, can also benchmark vs fbgemm_gpu.
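Findings like the ones above can be read off a saved trace by tallying GPU kernel time per kernel name. A minimal sketch follows, assuming the traces are Chrome trace JSON of the kind `torch.profiler` exports, with GPU kernels recorded as complete events (`"ph": "X"`) in category `"kernel"`. The helper name `kernel_time_by_name` and the synthetic events are hypothetical; the output path and format of this PR's script are not shown here.

```python
import json
import os
import tempfile
from collections import defaultdict


def kernel_time_by_name(trace_path):
    """Sum GPU kernel durations (microseconds) per kernel name, largest first."""
    with open(trace_path) as f:
        trace = json.load(f)
    # Chrome traces are either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        # "X" marks a complete event with a duration; "kernel" is the
        # category assumed here for GPU kernels.
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += ev.get("dur", 0.0)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Synthetic trace with made-up kernel names, standing in for a real saved trace.
fake_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "dur": 120.0},
        {"ph": "X", "cat": "kernel", "name": "quant_kernel", "dur": 30.0},
        {"ph": "X", "cat": "kernel", "name": "quant_kernel", "dur": 25.0},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 200.0},
    ]
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(fake_trace, tmp)
summary = kernel_time_by_name(tmp.name)
os.remove(tmp.name)
```

In a real trace, an activation quant step that shows up as two kernels rather than one fused kernel would appear as two separate entries in this summary.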

Test Plan:

CUDA_VISIBLE_DEVICES=5 python benchmarks/float8/float8_inference_roofline.py ~/local/tmp/20251016_inference_nvfp4.csv --recipe_name nvfp4 --save_profile_traces True

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

vkuzo · 2025-10-17T00:44:37Z

Stack from ghstack (oldest at bottom):

Summary:

convenient to analyze differences between roofline and observed

tl;dr; of findings:

mxfp8

1. need to pre-swizzle weights
2. torch.compile gives us two kernels, will repurpose the manual training kernel for this, will need to add pre-swizzling. Longer term, can see if fbgemm_gpu one is faster.

mxfp4

1. need a faster gemm (can use fbgemm_gpu)
2. need a fused activation quant kernel (can use fbgemm_gpu)

nvfp4

1. need to speed up existing triton activation quant kernel, currently it doesn't autotune anything so probably some easy wins here. Longer term can also benchmark vs fbgemm_gpu

Test Plan:

```bash
CUDA_VISIBLE_DEVICES=5 python benchmarks/float8/float8_inference_roofline.py ~/local/tmp/20251016_inference_nvfp4.csv --recipe_name nvfp4 --save_profile_traces True
```

Reviewers:
Subscribers:
Tasks:
Tags:

ghstack-source-id: a942de7
ghstack-comment-id: 3413384438
Pull-Request: #3196

pytorch-bot · 2025-10-17T00:44:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3196

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 438b35e with merge base d1a7fbc:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh) (trunk failure)
test/dtypes/test_nf4.py::TestComm::test_comm

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

vkuzo added 4 commits October 16, 2025 07:41

Update

821bd2b

[ghstack-poisoned]

Update

5bd4e3b

[ghstack-poisoned]

Update

ea2d54f

[ghstack-poisoned]

Update

b88850f

[ghstack-poisoned]

meta-cla bot added the CLA Signed label Oct 17, 2025

This was referenced Oct 17, 2025

extend mxfp8 roofline with more recipes #3190

Merged

extend inference roofline with real benchmarks #3194

Merged

vkuzo added the topic: not user facing label Oct 17, 2025

vkuzo mentioned this pull request Oct 17, 2025

mxtensor: make scale shape match qdata #3198

Merged

vkuzo added 3 commits October 17, 2025 05:01

Update

87acf97

[ghstack-poisoned]

Update

e481a6b

[ghstack-poisoned]

Update

438b35e

[ghstack-poisoned]

vkuzo changed the base branch from gh/vkuzo/148/head to main October 17, 2025 15:06

vkuzo mentioned this pull request Oct 17, 2025

mxtensor: add pre-swizzle support #3200

Merged

danielvegamyhre approved these changes Oct 17, 2025

vkuzo merged commit b50e37a into main Oct 17, 2025
47 of 50 checks passed
