Add GLM-5.2 MXFP4 recipe support#583
Co-authored-by: Cursor <cursoragent@cursor.com>
Code Review
This pull request introduces support for the AMD Quark MXFP4 quantized variant of the GLM-5.2 model, optimized for AMD MI350 and MI355X GPUs. It adds the mxfp4 variant configuration to the model YAML and updates the documentation with prerequisites, environment variables, and run commands for utilizing ROCm AITER kernels. The feedback suggests updating the mxfp4 variant metadata to include --trust-remote-code in extra_args and the required environment variables in extra_env to ensure consistency with the guide and support automated deployment tools.
mxfp4:
  model_id: "amd/GLM-5.2-MXFP4"
  precision: mxfp4
  vram_minimum_gb: 446
  description: "AMD Quark MXFP4 checkpoint — MoE weights quantized for MI350 / MI355X"
  extra_args:
    - "--quantization"
    - "quark"
To ensure consistency with the launch instructions in the guide and to support automated deployment tools that parse this recipe, the mxfp4 variant metadata should include --trust-remote-code in extra_args and the required environment variables (VLLM_ROCM_USE_AITER_FP8BMM: "0" and VLLM_ROCM_USE_AITER_FP4BMM: "0") in extra_env.
mxfp4:
  model_id: "amd/GLM-5.2-MXFP4"
  precision: mxfp4
  vram_minimum_gb: 446
  description: "AMD Quark MXFP4 checkpoint — MoE weights quantized for MI350 / MI355X"
  extra_args:
    - "--quantization"
    - "quark"
    - "--trust-remote-code"
  extra_env:
    VLLM_ROCM_USE_AITER_FP8BMM: "0"
    VLLM_ROCM_USE_AITER_FP4BMM: "0"
Setting them to 0 keeps serving on the known-good kernels and avoids accidental dispatch into experimental/unsupported FP8/FP4 BMM implementations.
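As a rough illustration of why the metadata matters to automated tooling, here is a minimal sketch of how a deployment tool might turn the variant's `extra_args` and `extra_env` into a launch command. The `MXFP4_VARIANT` dictionary and the `render_launch_command` helper are hypothetical and not part of this repository; the recipe's real schema and generator may differ.

```python
import shlex

# Hypothetical in-memory copy of the suggested mxfp4 variant metadata.
MXFP4_VARIANT = {
    "model_id": "amd/GLM-5.2-MXFP4",
    "extra_args": ["--quantization", "quark", "--trust-remote-code"],
    "extra_env": {
        "VLLM_ROCM_USE_AITER_FP8BMM": "0",
        "VLLM_ROCM_USE_AITER_FP4BMM": "0",
    },
}


def render_launch_command(variant: dict) -> str:
    """Render env assignments followed by a `vllm serve` invocation."""
    # Environment variables become NAME=value prefixes on the command line.
    env_prefix = [
        f"{name}={shlex.quote(value)}"
        for name, value in variant.get("extra_env", {}).items()
    ]
    # The model ID is the positional argument; extra_args are appended as-is.
    argv = ["vllm", "serve", variant["model_id"], *variant.get("extra_args", [])]
    return " ".join(env_prefix + [shlex.quote(arg) for arg in argv])


command = render_launch_command(MXFP4_VARIANT)
print(command)
```

If the environment variables lived only in the guide, a tool like this would emit a command without them, which is the inconsistency the review comment points at.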
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
Related to Gemini's comment: are these needed?
Will remove them once verified; they're an extra condition to stay on the safe side.
Follow the MiniMax-M3 MXFP4 recipe pattern by allowing the GLM-5.2 MXFP4 variant only on MI355X hardware and keeping the generated command on the validated TP8 Quark path.
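The gating described in this commit can be pictured with a small sketch. This is only an assumption about the shape of such a rule: the `ALLOWED_HARDWARE` and `VALIDATED_TP` tables and the `resolve_tensor_parallel` function are hypothetical names, and the actual recipe generator may express the constraint differently.

```python
from typing import Optional

# Hypothetical rule tables: which hardware a variant may run on, and the
# tensor-parallel size it is pinned to. Variants not listed are unrestricted.
ALLOWED_HARDWARE = {"mxfp4": {"mi355x"}}
VALIDATED_TP = {"mxfp4": 8}


def resolve_tensor_parallel(
    variant: str, hardware: str, requested_tp: Optional[int] = None
) -> int:
    """Return the TP size to generate, rejecting unsupported hardware."""
    allowed = ALLOWED_HARDWARE.get(variant)
    if allowed is not None and hardware.lower() not in allowed:
        raise ValueError(f"variant {variant!r} is not offered on {hardware!r}")
    # A pinned variant ignores the requested TP and stays on the validated path.
    pinned = VALIDATED_TP.get(variant)
    if pinned is not None:
        return pinned
    return requested_tp or 1
```

Under these assumptions, `resolve_tensor_parallel("mxfp4", "MI355X", 4)` returns 8, while any other hardware name for `mxfp4` raises `ValueError` instead of producing a command.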
Link the related vLLM Quark MTP loading fix from the GLM-5.2 recipe guide so users know which vLLM change is required before enabling MXFP4 speculative decoding.
Summary
--quantization quark.
--trust-remote-code override.
exclude entries for the unquantized MTP layer.
PR Link for mxfp4 MTP support
Motivation
GLM-5.2 has an AMD Quark MXFP4 checkpoint that significantly reduces HBM footprint compared with the FP8 and BF16 variants. This recipe update makes that checkpoint selectable in the recipe UI while preventing unsupported hardware combinations from being generated.
The MTP note is tied to the corresponding vLLM Quark loading fix: vLLM PR #46757.
Test Plan
models/MiniMaxAI/MiniMax-M3.yaml.