[AMD] perf: enable FlyDSL w4a16 MoE for Kimi INT4 by amd-asalykov · Pull Request #1785 · SemiAnalysisAI/InferenceX

amd-asalykov · 2026-06-15T19:48:58Z

Replace the default Triton w4a16 MoE kernel with the more performant FlyDSL implementation for Kimi INT4 on MI355X.

Note

Low Risk
Benchmark and serving-flag changes only for one AMD config; no auth or production app code, though nightly images and MoE backend swaps can affect benchmark stability until validated.

Overview
Updates the Kimi K2.5 INT4 MI355X vLLM benchmark to use FlyDSL for w4a16 MoE instead of the default Triton path, and pins a digest-suffixed ROCm vLLM nightly image.

The fixed-seq-len serve script now passes --moe-backend flydsl and disables fuse_allreduce_rms in compilation config. amd-master expands the sweep: TP8 concurrency runs to 128, adds TP4 rows for 1k1k and 8k1k, and bumps the container image from v0.21.0 to the new nightly. perf-changelog records the MoE backend, image, and sweep changes for kimik2.5-int4-mi355x-vllm.
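As a rough sketch of what the updated serve invocation might look like, the snippet below assembles the flags named in this PR and prints the command instead of running it. The model path and the tensor-parallel value are placeholders (assumptions, not taken from the PR), and the real script may pass further arguments that are not shown here.

```shell
# Placeholder values for illustration only; the PR does not show them.
MODEL="${MODEL:-/path/to/kimi-k2.5-int4}"
TP="${TP:-8}"

# Compilation config as it appears in the PR diff:
# turn off the allreduce + RMSNorm fusion pass.
COMPILATION_CONFIG='{"pass_config": {"fuse_allreduce_rms": false}}'

# Build the argument list in the positional parameters so the JSON
# string survives as a single argument.
set -- serve "$MODEL" \
  --tensor-parallel-size "$TP" \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --max-num-seqs 256 \
  --moe-backend flydsl \
  --compilation-config "$COMPILATION_CONFIG"

# Dry run: print the command rather than launching the server.
printf 'vllm'; printf ' %s' "$@"; printf '\n'
```

To launch for real, replace the final line with `vllm "$@"`. Keeping the JSON in a single-quoted variable avoids shell interpretation of the braces and inner double quotes.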

Reviewed by Cursor Bugbot for commit cd40240. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-15T19:49:07Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
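For authors who prefer the command line, re-running only the failed jobs can be sketched with the GitHub CLI. The helper name below is hypothetical, and the snippet only prints the command. The run ID is the one linked later in this PR, used purely as an example; actually running the command assumes `gh` is installed and authenticated.

```shell
# Hypothetical helper: print the gh command that re-runs only the failed
# jobs of a workflow run. $1 is the run ID, $2 is the owner/repo.
rerun_failed_cmd() {
  printf 'gh run rerun %s --repo %s --failed\n' "$1" "$2"
}

# Example using the run ID linked in this PR.
rerun_failed_cmd 27572316951 SemiAnalysisAI/InferenceX
```

Pipe the output to `sh` (or paste it into a terminal) once you have confirmed that the run ID and repository are the ones you intend.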

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 0bc2ad2.

cursor · 2026-06-15T19:49:58Z

 --trust-remote-code \
 --no-enable-prefix-caching \
 --max-num-seqs 256 \
+--moe-backend flydsl \


TP4 sweep missing RMSNorm guard

Medium Severity

This commit adds tp: 4 sweep rows for kimik2.5-int4-mi355x-vllm, but kimik2.5_int4_mi355x.sh still sets VLLM_ROCM_USE_AITER=1 without disabling AITER RMSNorm when TP is below 8. The matching MI355X Kimi script documents accuracy problems at lower TP unless VLLM_ROCM_USE_AITER_RMSNORM=0, so new TP4 runs may produce invalid throughput numbers.

Additional Locations (2)

.github/configs/amd-master.yaml#L754-L755

.github/configs/amd-master.yaml#L759-L760
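The guard Bugbot describes could look roughly like the following. This is a minimal sketch, not the contents of kimik2.5_int4_mi355x.sh: the function name and the `TP` default are hypothetical, and the threshold of 8 is taken from the review comment above rather than from the script itself.

```shell
# Hypothetical helper: disable AITER RMSNorm when tensor parallelism is
# below 8, as the review comment suggests. Leaves the variable untouched
# for TP >= 8.
apply_rmsnorm_guard() {
  tp="$1"
  if [ "$tp" -lt 8 ]; then
    export VLLM_ROCM_USE_AITER_RMSNORM=0
  fi
}

TP="${TP:-4}"   # placeholder default for illustration
export VLLM_ROCM_USE_AITER=1
apply_rmsnorm_guard "$TP"
echo "TP=$TP VLLM_ROCM_USE_AITER_RMSNORM=${VLLM_ROCM_USE_AITER_RMSNORM:-unset}"
```

Only setting the variable for low TP keeps the existing TP8 behavior unchanged, so previously recorded TP8 results would remain comparable.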

amd-asalykov · 2026-06-15T19:52:36Z

@seungrokj @chunfangamd @billishyahao @cquil11 could you please review/merge it?

cquil11 · 2026-06-15T19:53:48Z

@amd-asalykov run in progress: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27572316951

cquil11 · 2026-06-15T19:53:57Z

/reuse-sweep-run

salykova · 2026-06-15T19:54:25Z

@cquil11 thanks

functionstackx · 2026-06-15T19:57:34Z

+--moe-backend flydsl \
+--compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \


can u update https://github.com/vllm-project/recipes with these changes

@functionstackx yes, once the next stable vLLM release is published, the recipes will be updated

@salykova can u stage a PR up for it today that can be merged once the next release comes?

@functionstackx created a draft PR vllm-project/recipes#552 let's wait for the next stable vllm release

@haic0 could you please help with the recipe?

functionstackx

lgtm once sweep passes

billishyahao · 2026-06-16T01:16:01Z

There is a conflict

billishyahao

LGTM

functionstackx · 2026-06-16T01:22:10Z

plz remember to use the claude command /merge-prs once this pr validation finishes https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27572316951?pr=1785

github-actions · 2026-06-16T02:07:50Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27572316951
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27572316951

chunfangamd

LGTM

Note: fixed the PR number during the merge. Should be OK, but not a serious process

chunfangamd · 2026-06-16T07:41:40Z

/merge-prs

enable FlyDSL w4a16 MoE

0bc2ad2

amd-asalykov requested a review from a team June 15, 2026 19:48

amd-asalykov requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 15, 2026 19:48

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

cursor Bot reviewed Jun 15, 2026

amd-asalykov enabled auto-merge (squash) June 15, 2026 19:51

cquil11 added the full-sweep-fail-fast label Jun 15, 2026

functionstackx reviewed Jun 15, 2026

functionstackx approved these changes Jun 15, 2026

billishyahao approved these changes Jun 16, 2026

Merge branch 'main' into asalykov/flydsl-moe

cd40240

amd-asalykov disabled auto-merge June 16, 2026 07:39

chunfangamd approved these changes Jun 16, 2026

chunfangamd merged commit 6462ac0 into main Jun 16, 2026
26 checks passed

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 16, 2026

chunfangamd deleted the asalykov/flydsl-moe branch June 16, 2026 07:43
