[mxfp8 moe training] add triton kernel for mxfp8 dequantization #3195
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3195
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 8e391f0 with merge base 41a0778.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    torch.bfloat16,
)
hp_t = triton_mxfp8_dequant_dim0(x_data, x_scales, torch.bfloat16, block_size)
torch.testing.assert_close(hp_t, hp_ref, rtol=0, atol=0)
lgtm, didn't look at the rest too closely
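For context, here is a minimal reference sketch of what the mxfp8 dequantization being tested above might compute: each contiguous block of `block_size` fp8 values shares one e8m0 scale, and dequantization upcasts the data and multiplies by the decoded scale. This is a hedged sketch, not the PR's Triton kernel; the function name, the uint8 exponent decoding, and the choice of blocking dimension are assumptions for illustration.

```python
import torch

# Hedged reference sketch of mxfp8 dequantization (not the PR's Triton kernel).
# Assumptions: scales are stored as uint8 biased power-of-two exponents (e8m0),
# one scale per contiguous block of `block_size` elements along the last dim.
def mxfp8_dequant_reference(x_data, x_scales, out_dtype, block_size=32):
    # Decode e8m0 scales: value is 2^(stored_exponent - 127).
    scales_fp = torch.exp2(x_scales.to(torch.float32) - 127.0)
    # Broadcast each scale over its block of `block_size` elements.
    scales_fp = scales_fp.repeat_interleave(block_size, dim=-1)
    # Upcast the fp8 data and apply the per-block scales.
    return (x_data.to(torch.float32) * scales_fp).to(out_dtype)
```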
Looks like there is some PTX for going to bf16.
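For anyone wanting to check this themselves, a hedged sketch of inspecting the generated PTX from Python; the kernel name, launch arguments, and grid are placeholders, and the `.asm` attribute is available on the compiled-kernel handle in recent Triton versions:

```python
# Hypothetical launch of the dequant kernel; names and grid are placeholders.
compiled = my_dequant_kernel[(num_blocks,)](x_data, x_scales, out, BLOCK_SIZE=128)
# Recent Triton returns a compiled-kernel handle whose .asm dict holds the PTX.
ptx = compiled.asm["ptx"]
# Print conversion instructions, e.g. cvt.* lines involving bf16.
print("\n".join(line for line in ptx.splitlines() if "cvt" in line))
```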
Stacked PRs:
[mxfp8 moe training] add triton kernel for mxfp8 dequantization
Summary
Test plan
Benchmarks
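A hypothetical sketch of how such a kernel could be timed with `triton.testing.do_bench`; the shapes, the uint8 scale layout, and having `triton_mxfp8_dequant_dim0` already in scope are assumptions rather than the PR's actual benchmark setup:

```python
import torch
import triton
# `triton_mxfp8_dequant_dim0` is the kernel added by this PR; its import path
# is not shown here, so it is assumed to already be in scope.

def bench_dequant(M=8192, K=8192, block_size=32):
    x_data = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    # Assumed scale layout: one uint8 e8m0 exponent per block of `block_size` values.
    x_scales = torch.randint(110, 140, (M, K // block_size), device="cuda", dtype=torch.uint8)
    ms = triton.testing.do_bench(
        lambda: triton_mxfp8_dequant_dim0(x_data, x_scales, torch.bfloat16, block_size)
    )
    # Rough memory-traffic estimate: fp8 in + uint8 scales in + bf16 out.
    gb = (x_data.numel() + x_scales.numel() + M * K * 2) / 1e9
    print(f"{ms:.3f} ms  (~{gb / (ms * 1e-3):.0f} GB/s)")
```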