[AMD] perf: enable FlyDSL w4a16 MoE for Kimi INT4 #1785

Merged
chunfangamd merged 2 commits into main from asalykov/flydsl-moe on Jun 16, 2026
@amd-asalykov commented Jun 15, 2026

Collaborator

Replace the default Triton w4a16 MoE kernel with the more performant FlyDSL implementation for Kimi INT4 on MI355X.


Note

Low Risk
Benchmark and serving-flag changes only for one AMD config; no auth or production app code, though nightly images and MoE backend swaps can affect benchmark stability until validated.

Overview
Updates the Kimi K2.5 INT4 MI355X vLLM benchmark to use FlyDSL for w4a16 MoE instead of the default Triton path, and pins a digest-suffixed ROCm vLLM nightly image.

The fixed-seq-len serve script now passes --moe-backend flydsl and disables fuse_allreduce_rms in compilation config. amd-master expands the sweep: TP8 concurrency runs to 128, adds TP4 rows for 1k1k and 8k1k, and bumps the container image from v0.21.0 to the new nightly. perf-changelog records the MoE backend, image, and sweep changes for kimik2.5-int4-mi355x-vllm.
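
The two serve-flag changes described above can be sketched as follows. This is a minimal illustration, not the script from the PR: it assumes the script launches `vllm serve`, the `build_serve_cmd` helper and the model path are hypothetical, and the command is only printed, so nothing here requires vLLM to be installed. The `--moe-backend flydsl` flag and the `--compilation-config` JSON are copied from the review diff.

```shell
#!/bin/sh
# Sketch only: prints the serve command instead of launching it.

# JSON passed to --compilation-config, as quoted in the review diff.
COMPILATION_CONFIG='{"pass_config": {"fuse_allreduce_rms": false}}'

# build_serve_cmd MODEL TP  (hypothetical helper)
# MODEL and TP are placeholders; the real script takes them from its own config.
build_serve_cmd() {
  printf "vllm serve %s --tensor-parallel-size %s --moe-backend flydsl --compilation-config '%s'\n" \
    "$1" "$2" "$COMPILATION_CONFIG"
}

# Hypothetical model path, TP8 as in the existing sweep rows.
build_serve_cmd "path/to/kimi-k2.5-int4" 8
```

Keeping the JSON in a single-quoted variable avoids escaping the inner double quotes when the flag is passed on the command line.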

Reviewed by Cursor Bugbot for commit cd40240. Bugbot is set up for automated code reviews on this repo.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first, before we merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If you re-run failed jobs, you are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from their respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 0bc2ad2.

--trust-remote-code \
--no-enable-prefix-caching \
--max-num-seqs 256 \
--moe-backend flydsl \


TP4 sweep missing RMSNorm guard

Medium Severity

This commit adds tp: 4 sweep rows for kimik2.5-int4-mi355x-vllm, but kimik2.5_int4_mi355x.sh still sets VLLM_ROCM_USE_AITER=1 without disabling AITER RMSNorm when TP is below 8. The matching MI355X Kimi script documents accuracy problems at lower TP unless VLLM_ROCM_USE_AITER_RMSNORM=0, so new TP4 runs may produce invalid throughput numbers.
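
The guard Bugbot describes could look roughly like the sketch below. It is an illustration of the idea, not the change that was merged: the `apply_aiter_env` helper and the way the tensor-parallel size reaches it are assumptions, while the two environment variable names and the TP8 threshold come from the comment above.

```shell
#!/bin/sh
# Sketch only: TP is assumed to hold the tensor-parallel size of the current sweep row.

# apply_aiter_env TP  (hypothetical helper)
apply_aiter_env() {
  export VLLM_ROCM_USE_AITER=1
  # Below TP8, turn off AITER RMSNorm, following the accuracy note the review cites.
  if [ "$1" -lt 8 ]; then
    export VLLM_ROCM_USE_AITER_RMSNORM=0
  fi
}

# Defaults to TP4, the size of the newly added sweep rows.
apply_aiter_env "${TP:-4}"
echo "VLLM_ROCM_USE_AITER_RMSNORM=${VLLM_ROCM_USE_AITER_RMSNORM:-unset}"
```

At TP8 the helper leaves `VLLM_ROCM_USE_AITER_RMSNORM` untouched, so any value already set in the environment still applies.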


@amd-asalykov enabled auto-merge (squash) June 15, 2026 19:51

@amd-asalykov

Collaborator Author

@seungrokj @chunfangamd @billishyahao @cquil11 could you please review/merge it?

@cquil11 commented Jun 15, 2026

Collaborator

/reuse-sweep-run

@salykova


@cquil11 thanks

Comment on lines +45 to +46
--moe-backend flydsl \
--compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \

Collaborator

can u update https://github.com/vllm-project/recipes with these changes


@functionstackx yes, once the next stable vLLM release is published, the recipes will be updated

Collaborator

@salykova can u stage a PR for it today that can be merged once the next release comes?

@salykova Jun 15, 2026

@functionstackx created a draft PR, vllm-project/recipes#552. Let's wait for the next stable vLLM release.

Collaborator

@haic0 could you please help with the recipe?

@functionstackx left a comment

Collaborator

lgtm once sweep passes

@billishyahao

Collaborator

There is a conflict

@billishyahao left a comment

Collaborator

LGTM

@functionstackx

Collaborator

plz remember to use the claude command /merge-prs once this pr validation finishes https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27572316951?pr=1785

@amd-asalykov disabled auto-merge June 16, 2026 07:39

@chunfangamd left a comment

Collaborator

LGTM

Note: I fixed the PR number during the merge. It should be OK, though this is not a rigorous process.

@chunfangamd

Collaborator

/merge-prs

@chunfangamd merged commit 6462ac0 into main Jun 16, 2026
26 checks passed
@chunfangamd deleted the asalykov/flydsl-moe branch June 16, 2026 07:43