Add MiniMax-M3 MXFP4 (AMD) variant #579
Add an `mxfp4` variant for `amd/MiniMax-M3-MXFP4` targeting AMD CDNA4 (MI350X/MI355X, gfx950), served through the AITER MoE backend. At ~0.5 bytes/param it is roughly half the VRAM of MXFP8 and fits a single 8x MI355X node from TP=4.

Validated single-node on 8x MI355X (gfx950), TP=4, vLLM 0.23.1 (rocm/vllm-dev ROCm image): the model serves and the minimax_m3 reasoning/tool parsers split reasoning from content correctly.

Flags mirror the existing AMD MXFP8 path (block-size 128, TRITON_ATTN MSA) plus the AITER MoE backend; the ATOM MiniMax-M3 recipe was used as a cross reference for the MXFP4 sharding/KV constraints.

The checkpoint ships no calibrated KV scales: `--kv-cache-dtype fp8` still serves but falls back to an uncalibrated scale of 1.0 (accuracy risk), so the variant keeps the KV cache at its default dtype. Documented in the guide.

Signed-off-by: andyluo7 <andy.luo@amd.com>
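The "~0.5 bytes/param, roughly half of MXFP8" figure can be sanity-checked with a little arithmetic. The sketch below assumes the standard OCP Microscaling layout (blocks of 32 elements sharing one 8-bit scale) and counts quantized weights only; the parameter count in the usage line is a placeholder, not the real MiniMax-M3 size, and layers kept in higher precision are ignored.

```python
# Rough per-parameter storage cost for MX formats.
# Assumption: OCP Microscaling layout, i.e. each block of 32 elements
# shares a single 8-bit (E8M0) scale. Only quantized weights are counted.
MX_BLOCK_SIZE = 32
MX_SCALE_BITS = 8


def mx_bytes_per_param(element_bits: int) -> float:
    """Bytes per element, including the amortized shared block scale."""
    bits = element_bits + MX_SCALE_BITS / MX_BLOCK_SIZE
    return bits / 8


def weight_gib(num_params: float, element_bits: int) -> float:
    """Approximate weight footprint in GiB for a given parameter count."""
    return num_params * mx_bytes_per_param(element_bits) / 2**30


if __name__ == "__main__":
    hypothetical_params = 200e9  # placeholder value, NOT the real model size
    for name, bits in (("MXFP4", 4), ("MXFP8", 8)):
        print(name, mx_bytes_per_param(bits), round(weight_gib(hypothetical_params, bits), 1))
```

Under these assumptions MXFP4 costs 0.53125 bytes per element against 1.03125 for MXFP8, a ratio of about 0.52, which is consistent with the "roughly half" claim.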
Code Review
This pull request adds a new AMD-quantized MXFP4 variant (amd/MiniMax-M3-MXFP4) for the MiniMax-M3 model, including its configuration, environment variables, and a detailed usage guide. Feedback is provided regarding a version mismatch in the documentation between the minimum required vLLM version and the validated version.
| Validated on 8x MI355X (gfx950), TP=4, with the `rocm/vllm-dev` ROCm image | ||
| (vLLM 0.23.1): the model serves and the `minimax_m3` reasoning/tool parsers |
There is a mismatch between the minimum required vLLM version (0.24.0 specified on line 20) and the validation version (0.23.1) mentioned here. To prevent user confusion, please update the validation reference to align with the minimum required version or clarify the version requirements.
Validated on 8x MI355X (gfx950), TP=4, with the `rocm/vllm-dev` ROCm image
(vLLM 0.24.0): the model serves and the `minimax_m3` reasoning/tool parsers
| This variant is AMD-only; it is not applicable to NVIDIA hardware (use the | ||
| **mxfp8** variant on Blackwell for native MX matrix cores). | ||
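The version mismatch flagged in the review (validated on 0.23.1, minimum declared as 0.24.0) is the kind of thing a small check can catch before review. This is only a sketch: the function names are hypothetical, and it handles plain dotted numeric versions only. Versions with suffixes such as `.post2` would need a real version parser.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain dotted numeric version like '0.24.0' into a tuple."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(validated: str, minimum: str) -> bool:
    """True if the validated version is at least the declared minimum."""
    return parse_version(validated) >= parse_version(minimum)


if __name__ == "__main__":
    # Values taken from the review comment: validation image vs. min_vllm_version.
    print(meets_minimum("0.23.1", "0.24.0"))
```

Comparing tuples of integers avoids the string-comparison trap where "0.9.0" would sort after "0.24.0".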
| Validated on 8x MI355X (gfx950), TP=4, with the `rocm/vllm-dev` ROCm image |
Should we say which image?
hongxiayang left a comment
thanks. some nit comments
| - "aiter" | ||
| extra_env: | ||
| VLLM_ROCM_USE_AITER: "1" | ||
| VLLM_ROCM_USE_AITER_MOE: "1" |
This defaults to True, so maybe it is not needed?
This PR may need some changes. First, the PR should be specific to MI355X only, but from verification it seems to claim that MI300/MI325 [Image 1] support MXFP4, and that NVIDIA supports the AMD MXFP4 checkpoint too [Image 2].
Second, since this is an upstream vLLM recipe: when I test this AMD recipe by following the instructions on this recipe branch [Image 3], it does not work and results in a crash.
Image 1: screenshot from this dev branch showing that it accidentally claims that it works on MI300/MI325
Image 2: screenshot from this dev branch showing that it accidentally claims that it works on H100/H200/B200 too
Image 3: screenshot from this dev branch showing the recipe and image I am following in this recipe PR; it crashes, likely because AITER is not enabled on the nightly upstream image (vllm-project/vllm#46419)

…, accuracy

The AITER MoE path for amd/MiniMax-M3-MXFP4 needs aiter >=0.1.16.post2 (vllm#46692) and the MoE enablement vllm#46419; until #46419 ships in a published vllm/vllm-openai-rocm image, a plain nightly will not bring up MXFP4 on --moe-backend aiter. Add the emulation backend command (TP=8, runs on current images) that AMD uses for accuracy measurement, and cite the model card's gsm8k recovery (94.19 vs 95.30 bf16, 98.84%).

Signed-off-by: andyluo7 <andy.luo@amd.com>
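The quoted gsm8k recovery figure follows directly from the two scores cited in the commit message. A minimal check, with a hypothetical helper name:

```python
def recovery_pct(quantized_score: float, baseline_score: float) -> float:
    """Quantized score expressed as a percentage of the baseline score."""
    return 100.0 * quantized_score / baseline_score


if __name__ == "__main__":
    # Scores as cited in the commit message: MXFP4 vs. bf16 on gsm8k.
    print(round(recovery_pct(94.19, 95.30), 2))
```

Rounded to two decimals, this reproduces the 98.84% recovery cited from the model card.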
…dant env

Address PR vllm-project#579 review:
- functionstackx: MXFP4 is a CDNA4-only checkpoint. Add a variant-level `requires_arch: gfx950` gate (new `arch` field on AMD taxonomy GPUs) so the hardware pills no longer claim MI300X/MI325X (gfx942) or NVIDIA support. The gate is variant-scoped, so gpt-oss MXFP4 (which runs on NVIDIA) is unaffected.
- functionstackx: a plain upstream nightly crashes (AITER MXFP4 MoE not yet published, vllm#46419). Pin the working ROCm dev image on the variant and call out the crash in the guide.
- hongxiayang: drop the redundant VLLM_ROCM_USE_AITER_MOE=1 (defaults on under the VLLM_ROCM_USE_AITER umbrella) and name the validated image.
- gemini: reconcile the 0.23.1 validation image with the 0.24.0 min_vllm_version.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Signed-off-by: andyluo7 <andy.luo@amd.com>
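To illustrate how a variant-scoped `requires_arch` gate could filter the hardware pills, here is a small sketch. Only the `requires_arch` and `arch` field names and the gfx942/gfx950 assignments come from the commit message; the classes, the `supported_gpus` helper, and the "no gate means no filtering" behavior are assumptions for illustration and not the recipes repo's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    vendor: str
    arch: Optional[str] = None  # per the commit, only AMD GPUs carry an arch


@dataclass(frozen=True)
class Variant:
    name: str
    requires_arch: Optional[str] = None  # None means the variant is ungated


def supported_gpus(variant: Variant, gpus: list[Gpu]) -> list[Gpu]:
    """Hypothetical gate: keep only GPUs whose arch matches the variant's."""
    if variant.requires_arch is None:
        return list(gpus)  # simplification: other gates are ignored here
    return [gpu for gpu in gpus if gpu.arch == variant.requires_arch]


GPUS = [
    Gpu("MI300X", "amd", "gfx942"),
    Gpu("MI325X", "amd", "gfx942"),
    Gpu("MI350X", "amd", "gfx950"),
    Gpu("MI355X", "amd", "gfx950"),
    Gpu("H100", "nvidia"),
    Gpu("B200", "nvidia"),
]

if __name__ == "__main__":
    gated = Variant("mxfp4", requires_arch="gfx950")
    print([gpu.name for gpu in supported_gpus(gated, GPUS)])
```

Because the gate lives on the variant rather than on the precision, a different model's MXFP4 variant without `requires_arch` would pass through unfiltered in this sketch, matching the commit's note about gpt-oss.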
Thanks all, pushed 92fe4c6 addressing the review.

Thanks for the detailed update, @andyluo7. The addition of the |
Considering that this is the upstream recipes repo, I am not sure whether the vLLM upstream maintainers would accept a non-upstream ROCm Docker image, but of course I can't speak on their behalf. I will let them jump in. Would the preferred path be to wait until there is an accessible upstream https://hub.docker.com/r/vllm/ Docker image, so that this upstream recipes repo can accurately track the upstream images?

Thanks for fixing this!

@andyluo7 Thanks! Can you resolve the conflict? Then we can merge.

Since #580 has been merged, we can wait to use the upstream Docker images until the critical MXFP4 PRs are merged and a nightly image is available after that.
| mxfp4: | ||
| model_id: "amd/MiniMax-M3-MXFP4" | ||
| precision: mxfp4 | ||
| # AMD MXFP4 is a CDNA4-only checkpoint: the AITER MXFP4 MoE kernels are |
Remove all of the comments; none of them are needed. The YAML attributes already reflect them, and the aiter version upgrade has already happened, so there is no need to state it. Moreover, the pages are rendered into HTML, so none of these comments will be seen by users.
| > which has **not** landed in a published `vllm/vllm-openai-rocm` nightly. On a | ||
| > plain nightly, `--moe-backend aiter` fails to bring up MXFP4 (AITER MXFP4 MoE | ||
| > kernel missing). Use the ROCm dev image that carries the path — | ||
| > `rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625` — or build from source. |
I think we have landed all of the optimizations. Please validate with the team. After validation, we can point the Docker image to vllm/vllm-openai-rocm:nightly.
Thanks @functionstackx, yes. The commands in the YAML file must work with the upstream Docker image.

@hongxiayang @andyluo7 please consolidate the content in the |

@tjtanaa @hongxiayang, we will wait for the upstream Docker images to update and ensure the content is consistent with #580.

Summary
Adds an `mxfp4` variant to the MiniMax-M3 recipe for `amd/MiniMax-M3-MXFP4`, targeting AMD CDNA4 (MI350X/MI355X, gfx950) served through the AITER MoE backend. At ~0.5 bytes/param it is roughly half the VRAM of the existing MXFP8 variant and fits a single 8×MI355X node from TP=4.

The variant reuses the recipe's existing AMD path (block-size 128, TRITON_ATTN MSA attention, `minimax_m3` parsers, CUDA-graph env) and adds only `--moe-backend aiter` + the AITER MoE env vars. No NVIDIA changes; the variant is AMD-only (`mxfp4` is ungated, so it does not force Blackwell and does not disable the AMD hardware pill).

Validation
Validated single-node on 8×MI355X (gfx950), TP=4, vLLM `0.23.1` (`rocm/vllm-dev` ROCm image):
- `vllm serve amd/MiniMax-M3-MXFP4 --tensor-parallel-size 4 --block-size 128 --moe-backend aiter --attention-backend TRITON_ATTN --language-model-only --no-enable-prefix-caching --tool-call-parser minimax_m3 --reasoning-parser minimax_m3 --enable-auto-tool-choice` reaches `Application startup complete.`
- `quantization=quark`, `moe_backend='aiter'`, `kv_cache_dtype=auto`.
- `minimax_m3` reasoning parser splits `reasoning` from `content` correctly.

KV cache note (corrected after testing)
`amd/MiniMax-M3-MXFP4` ships no calibrated KV scales. I verified that `--kv-cache-dtype fp8` does still start and serve on vLLM: it falls back to an uncalibrated KV scale of 1.0 and logs `Using uncalibrated q_scale 1.0 ... This may cause accuracy issues` (it does not hard-fail). The variant therefore keeps the KV cache at its default dtype, and the guide documents the fp8 behavior so users can opt in only after validating accuracy.

The MXFP4 sharding/KV constraints were cross-referenced against the ROCm/ATOM MiniMax-M3 recipe.

Test plan
- `node scripts/build-recipes-api.mjs` passes (✓ JSON API: 142 models, 8 strategies)
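
For anyone scripting the validated launch, the command from the PR summary can be assembled as an argument list rather than a hand-edited string. This is a sketch only: `build_serve_argv` is a hypothetical helper, the flags are copied from the PR description, and `VLLM_ROCM_USE_AITER=1` is the env var named in the review. Nothing here confirms that the command works on any particular image.

```python
import shlex


def build_serve_argv(model: str, tensor_parallel_size: int) -> list[str]:
    """Assemble the vllm serve argv using the flags quoted in the PR summary."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--block-size", "128",
        "--moe-backend", "aiter",
        "--attention-backend", "TRITON_ATTN",
        "--language-model-only",
        "--no-enable-prefix-caching",
        "--tool-call-parser", "minimax_m3",
        "--reasoning-parser", "minimax_m3",
        "--enable-auto-tool-choice",
        # --kv-cache-dtype is deliberately omitted: the checkpoint ships no
        # calibrated KV scales, so the KV cache stays at its default dtype.
    ]


# Env var from the reviewed YAML; the separate *_MOE var was dropped as redundant.
SERVE_ENV = {"VLLM_ROCM_USE_AITER": "1"}

if __name__ == "__main__":
    argv = build_serve_argv("amd/MiniMax-M3-MXFP4", tensor_parallel_size=4)
    env_prefix = " ".join(f"{key}={value}" for key, value in SERVE_ENV.items())
    print(env_prefix, shlex.join(argv))
```

Keeping the flags in a list makes it harder to lose one when switching TP size, and leaving `--kv-cache-dtype` out of the helper mirrors the variant's decision to opt out of fp8 KV by default.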