[AMD] MiniMax-M3: enable AITER + AMD runtime knobs in the ROCm hardware override #556
JohnQinAMD wants to merge 1 commit
MiniMax-M3 is the only MiniMax recipe that doesn't enable AITER on AMD; its siblings (M2/M2.1/M2.5/M2.7) all set VLLM_ROCM_USE_AITER=1. Enable it here so the hot decode GEMMs and fused MoE run on AITER (master toggle; the per-component flags default True behind it). Keep MHA off AITER (VLLM_ROCM_USE_AITER_MHA=0) so MSA sparse attention stays on TRITON_ATTN, because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also add AMD-recommended, numerically-inert runtime knobs to the AMD override: TORCH_BLAS_PREFER_HIPBLASLT=1 and GPU_MAX_HW_QUEUES=2.

NCCL_MIN_NCHANNELS=112 is documented in the guide as a gfx942-only RCCL tuning (the gfx942 default is ~32-64) rather than set for all AMD: gfx950 already defaults to 112 channels for an 8-GPU node, and setting it explicitly bypasses RCCL's adaptive channel-tuning model (RCCL 2.26.6 / ROCm 7.0).

Measured +5.6% to +10.8% total tok/s/gpu on 8xMI300X (MXFP8, 1k1k random, concurrency 4 to 256); GSM8K exact-match holds at ~0.95.

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: JohnQinAMD <yanyuan.qin@amd.com>
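For readers applying the gfx942-only NCCL_MIN_NCHANNELS guidance by hand, the following is a minimal sketch of one way to gate it on the detected architecture. It is not part of this PR: the helper names `min_nchannels_for_arch` and `detect_gfx_arch` are hypothetical, and the sketch assumes `rocminfo` is on PATH and reports the agent name in the form `gfx942`.

```shell
#!/usr/bin/env bash
# Sketch: export NCCL_MIN_NCHANNELS only on gfx942; leave RCCL's defaults elsewhere.

# Hypothetical helper: map a gfx arch string to the NCCL_MIN_NCHANNELS value
# to export, or print nothing to leave the variable unset.
min_nchannels_for_arch() {
  case "$1" in
    gfx942*) echo 112 ;;  # gfx942: raise the channel floor, per the PR description
    *)       echo ""  ;;  # gfx950 and others: keep RCCL's adaptive channel tuning
  esac
}

# Assumption: rocminfo is installed and lists the GPU agent name as "gfx...".
# Prints the first match, or nothing if rocminfo is unavailable.
detect_gfx_arch() {
  if command -v rocminfo >/dev/null 2>&1; then
    rocminfo 2>/dev/null | grep -o -m1 'gfx[0-9a-f]*'
  fi
}

arch="$(detect_gfx_arch || true)"
channels="$(min_nchannels_for_arch "$arch")"
if [ -n "$channels" ]; then
  export NCCL_MIN_NCHANNELS="$channels"
fi
```

Sourcing this before launching the server leaves NCCL_MIN_NCHANNELS unset on gfx950, so RCCL's own channel selection stays in effect there.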
Code Review
This pull request updates the hardware overrides and guide documentation for MiniMax-M3 on AMD ROCm, enabling AITER kernels and adding recommended runtime knobs such as TORCH_BLAS_PREFER_HIPBLASLT and GPU_MAX_HW_QUEUES. Feedback suggests correcting the comment for GPU_MAX_HW_QUEUES to refer to hardware queues rather than HIP streams to avoid technical inaccuracy.
export VLLM_ROCM_USE_AITER=1 # AITER kernels: hot decode GEMMs + fused MoE
export VLLM_ROCM_USE_AITER_MHA=0 # keep MSA attention on TRITON_ATTN (MXFP8 lacks calibrated ROCm FP8 attn scales)
export TORCH_BLAS_PREFER_HIPBLASLT=1
export GPU_MAX_HW_QUEUES=2 # cap HIP streams below the default of 4
The environment variable GPU_MAX_HW_QUEUES limits the maximum number of hardware queues (HSA/AQL queues) allocated per process on the GPU, rather than capping HIP streams (which are software-level constructs multiplexed onto these hardware queues). Updating the comment to refer to hardware queues avoids technical confusion.
export GPU_MAX_HW_QUEUES=2 # cap hardware queues below the default of 4
@hongxiayang Can you help review this?