[AMD] perf: enable FlyDSL w4a16 MoE for Kimi INT4 #1785
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first, before we merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 0bc2ad2.
--trust-remote-code \
--no-enable-prefix-caching \
--max-num-seqs 256 \
--moe-backend flydsl \
TP4 sweep missing RMSNorm guard
Medium Severity
This commit adds tp: 4 sweep rows for kimik2.5-int4-mi355x-vllm, but kimik2.5_int4_mi355x.sh still sets VLLM_ROCM_USE_AITER=1 without disabling AITER RMSNorm when TP is below 8. The matching MI355X Kimi script documents accuracy problems at lower TP unless VLLM_ROCM_USE_AITER_RMSNORM=0, so new TP4 runs may produce invalid throughput numbers.
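A minimal sketch of the kind of guard this comment asks for, not the fix actually applied in the PR. It assumes the serve script exposes the tensor-parallel size in a variable named `TP` (a hypothetical name) and uses an illustrative helper, `needs_rmsnorm_guard`; the threshold of 8 and the `VLLM_ROCM_USE_AITER_RMSNORM=0` setting are taken from the comment above.

```shell
# Hypothetical helper (not from the PR): succeeds when the tensor-parallel
# size is below 8, the case the Bugbot comment flags as needing the guard.
needs_rmsnorm_guard() {
  [ "$1" -lt 8 ]
}

# TP is an assumed variable name for the tensor-parallel size of the run.
TP="${TP:-8}"

export VLLM_ROCM_USE_AITER=1
if needs_rmsnorm_guard "$TP"; then
  # Disable only AITER RMSNorm; the other AITER kernels stay enabled.
  export VLLM_ROCM_USE_AITER_RMSNORM=0
fi
```

With a guard like this, a `TP=4` sweep row would run with AITER RMSNorm disabled, while TP8 rows keep the existing behavior.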
@seungrokj @chunfangamd @billishyahao @cquil11 could you please review/merge it?
/reuse-sweep-run
@cquil11 thanks
--moe-backend flydsl \
--compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \
can u update https://github.com/vllm-project/recipes with these changes
@functionstackx yes, once the next stable vLLM release is published, the recipes will be updated
@salykova can u stage a PR up for it today that can be merged once the next release comes?
@functionstackx created a draft PR: vllm-project/recipes#552. Let's wait for the next stable vLLM release.
functionstackx left a comment
lgtm once sweep passes
There is a conflict
plz remember to use the claude command /merge-prs once this pr validation finishes https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27572316951?pr=1785
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27572316951
chunfangamd left a comment
LGTM
Note: fixed the PR number during the merge. Should be OK, but not a serious process
/merge-prs


Replace the default Triton w4a16 MoE kernel with the more performant FlyDSL implementation for Kimi INT4 on MI355X.
Note
Low Risk
Benchmark and serving-flag changes only for one AMD config; no auth or production app code, though nightly images and MoE backend swaps can affect benchmark stability until validated.
Overview
Updates the Kimi K2.5 INT4 MI355X vLLM benchmark to use FlyDSL for w4a16 MoE instead of the default Triton path, and pins a digest-suffixed ROCm vLLM nightly image.
The fixed-seq-len serve script now passes --moe-backend flydsl and disables fuse_allreduce_rms in compilation config. amd-master expands the sweep: TP8 concurrency runs to 128, adds TP4 rows for 1k1k and 8k1k, and bumps the container image from v0.21.0 to the new nightly. perf-changelog records the MoE backend, image, and sweep changes for kimik2.5-int4-mi355x-vllm.
Reviewed by Cursor Bugbot for commit cd40240.
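For orientation, the serve flags visible in the diff hunks of this PR can be assembled roughly as below. This is a sketch, not the actual serve script: `MODEL` and `DRY_RUN` are hypothetical names introduced here, the model path is a placeholder because the excerpt does not show it, and any flags outside the quoted hunks are omitted. By default the snippet only prints the command.

```shell
# MODEL is a placeholder; the PR excerpt does not show the model path.
MODEL="${MODEL:-/path/to/model}"

# Collect the flags visible in the diff as positional parameters.
set -- \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --max-num-seqs 256 \
  --moe-backend flydsl \
  --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}'

# Flattened copy for inspection only; it drops the quoting around the JSON.
SERVE_CMD="vllm serve $MODEL $*"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: print the command instead of launching a server.
  printf '%s\n' "$SERVE_CMD"
else
  vllm serve "$MODEL" "$@"
fi
```

Setting `DRY_RUN=0` would launch the server. Per the overview above, the flydsl backend is paired with the pinned ROCm vLLM nightly image, so the flag may not be accepted by older vLLM builds.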