[AMD] MiniMax-M3: enable AITER + AMD runtime knobs in the ROCm hardware override #556

Open
JohnQinAMD wants to merge 1 commit into vllm-project:main from JohnQinAMD:minimaxm3-amd-aiter-env


@JohnQinAMD

MiniMax-M3 is the only MiniMax recipe that doesn't enable AITER on AMD; its siblings (M2/M2.1/M2.5/M2.7) all set VLLM_ROCM_USE_AITER=1. Enable it here so the hot decode GEMMs and fused MoE run on AITER. VLLM_ROCM_USE_AITER is the master toggle; the per-component flags default to True behind it. Keep MHA off AITER (VLLM_ROCM_USE_AITER_MHA=0) so MSA sparse attention stays on TRITON_ATTN, because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also add AMD-recommended, numerically-inert runtime knobs to the AMD override: TORCH_BLAS_PREFER_HIPBLASLT=1 and GPU_MAX_HW_QUEUES=2.

NCCL_MIN_NCHANNELS=112 is documented in the guide as a gfx942-only RCCL tuning (the gfx942 default is ~32-64) rather than set for all AMD: gfx950 already defaults to 112 channels for an 8-GPU node, and setting it explicitly bypasses RCCL's adaptive channel-tuning model (RCCL 2.26.6 / ROCm 7.0).
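
A minimal sketch of how the gfx942-only tuning could be applied in a launch script. The function name `apply_rccl_channel_floor` is hypothetical and not part of this PR, and detecting the architecture by grepping `rocminfo` output is an assumption about the host setup:

```shell
# Sketch only: set the RCCL channel floor on gfx942 and leave other
# architectures alone so RCCL's adaptive channel tuning stays in effect.
# apply_rccl_channel_floor is a hypothetical helper, not part of the recipe.
apply_rccl_channel_floor() {
  case "$1" in
    gfx942)
      # gfx942 defaults to fewer channels, so raise the floor explicitly.
      export NCCL_MIN_NCHANNELS=112
      ;;
    *)
      # e.g. gfx950: do not touch NCCL_MIN_NCHANNELS.
      ;;
  esac
}

# Assumed detection: use the first gfx target that rocminfo reports.
if command -v rocminfo >/dev/null 2>&1; then
  apply_rccl_channel_floor "$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -n1)"
fi
```

Passing the architecture as an argument keeps the decision separate from detection, so the helper can be called directly (for example `apply_rccl_channel_floor gfx942`) on hosts where `rocminfo` is unavailable.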

Measured a 5.6% to 10.8% gain in total tok/s/GPU on 8x MI300X (MXFP8, random 1k-input/1k-output, concurrency 4 to 256); GSM8K exact-match holds at ~0.95.
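
For reference, a relative throughput gain of this kind is computed as (tuned - baseline) / baseline * 100. The helper below is a hypothetical sketch, and the inputs are arbitrary placeholders rather than measurements from this PR:

```shell
# Relative throughput uplift in percent: (tuned - baseline) / baseline * 100.
# uplift_pct is a hypothetical helper; arguments are tok/s/GPU values.
uplift_pct() {
  awk -v base="$1" -v tuned="$2" \
    'BEGIN { printf "%.1f\n", (tuned - base) / base * 100 }'
}

# Placeholder inputs for illustration only, not benchmark data.
uplift_pct 100 110.8
```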

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: JohnQinAMD <yanyuan.qin@amd.com>
@vercel

vercel Bot commented Jun 16, 2026

Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
vllm-recipes | Ready | Preview, Comment | Jun 16, 2026 4:16pm

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the hardware overrides and guide documentation for MiniMax-M3 on AMD ROCm, enabling AITER kernels and adding recommended runtime knobs such as TORCH_BLAS_PREFER_HIPBLASLT and GPU_MAX_HW_QUEUES. Feedback suggests correcting the comment for GPU_MAX_HW_QUEUES to refer to hardware queues rather than HIP streams to avoid technical inaccuracy.

export VLLM_ROCM_USE_AITER=1 # AITER kernels: hot decode GEMMs + fused MoE
export VLLM_ROCM_USE_AITER_MHA=0 # keep MSA attention on TRITON_ATTN (MXFP8 lacks calibrated ROCm FP8 attn scales)
export TORCH_BLAS_PREFER_HIPBLASLT=1
export GPU_MAX_HW_QUEUES=2 # cap HIP streams below the default of 4

medium

The environment variable GPU_MAX_HW_QUEUES limits the maximum number of hardware queues (HSA/AQL queues) allocated per process on the GPU, rather than capping HIP streams (which are software-level constructs multiplexed onto these hardware queues). Updating the comment to refer to hardware queues avoids technical confusion.

  export GPU_MAX_HW_QUEUES=2              # cap hardware queues below the default of 4

@esmeetu

esmeetu commented Jun 24, 2026

Member

@hongxiayang Can you help review this?
