Add GLM-5.2 MXFP4 recipe support #583

Open
xiao-llm wants to merge 3 commits into vllm-project:main from xiao-llm:main

xiao-llm commented Jun 26, 2026

Summary

  • Add GLM-5.2 MXFP4 as a recipe variant alongside the existing FP8 and BF16 options.
  • Configure the MXFP4 variant for the AMD Quark checkpoint with --quantization quark.
  • Restrict the MXFP4 variant to supported MI355X hardware, following the MiniMax-M3 MXFP4 recipe pattern.
  • Set the MXFP4 generated command to the validated TP8 path and include the MI355X-specific --trust-remote-code override.
  • Add ROCm/AITER serving guidance for GLM-5.2 MXFP4, including recommended MI355X launch flags and KV-cache settings.
  • Document MTP usage caveats for MXFP4, including the need for a vLLM build that honors Quark exclude entries for the unquantized MTP layer.
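
The gating and command generation described in the bullets above could be sketched roughly as follows. This is a minimal illustration, not the recipe site's actual generator: `VARIANTS`, `supported_hardware`, `hardware_overrides`, `is_supported`, and `build_command` are hypothetical names. Only the model ID, `--quantization quark`, `--trust-remote-code`, the TP8 setting, and the MI355X restriction come from this PR.

```python
# Hypothetical sketch of hardware gating for a recipe variant.
# The data layout and all function/field names are assumptions for illustration.

VARIANTS = {
    "mxfp4": {
        "model_id": "amd/GLM-5.2-MXFP4",
        "extra_args": ["--quantization", "quark"],
        "supported_hardware": {"MI355X"},      # assumed field name
        "tensor_parallel_size": 8,              # the validated TP8 path
        "hardware_overrides": {                 # assumed field name
            "MI355X": ["--trust-remote-code"],
        },
    },
}


def is_supported(variant_name: str, hardware: str) -> bool:
    """Return True when the variant may be generated for this hardware."""
    return hardware in VARIANTS[variant_name]["supported_hardware"]


def build_command(variant_name: str, hardware: str) -> list:
    """Build the argv for a gated variant, refusing unsupported hardware."""
    if not is_supported(variant_name, hardware):
        raise ValueError(f"{variant_name} is not available on {hardware}")
    variant = VARIANTS[variant_name]
    cmd = ["vllm", "serve", variant["model_id"]]
    cmd += ["--tensor-parallel-size", str(variant["tensor_parallel_size"])]
    cmd += variant["extra_args"]
    cmd += variant["hardware_overrides"].get(hardware, [])
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("mxfp4", "MI355X")))
```

Raising on an unsupported combination, rather than silently falling back to another variant, matches the stated goal of preventing unsupported hardware combinations from being generated.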

PR Link for mxfp4 MTP support

Motivation

GLM-5.2 has an AMD Quark MXFP4 checkpoint that significantly reduces HBM footprint compared with the FP8 and BF16 variants. This recipe update makes that checkpoint selectable in the recipe UI while preventing unsupported hardware combinations from being generated.
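
For a rough sense of where the saving comes from, the per-weight storage cost of each format can be computed directly. The sketch below assumes the OCP Microscaling layout for MXFP4 (4-bit elements in blocks of 32 that share one 8-bit scale) and ignores the comparatively small scale overhead of FP8. It covers quantized tensors only; since the checkpoint description says it is the MoE weights that are quantized, the whole-model saving will be smaller than the per-tensor ratio.

```python
# Per-weight storage cost by format. The MXFP4 block size (32) and scale
# width (8 bits) are assumptions taken from the OCP Microscaling layout,
# not from this PR.

def bits_per_weight(element_bits, block_size=None, scale_bits=0):
    """Average bits stored per weight, including any shared block scale."""
    if block_size is None:
        return float(element_bits)
    return element_bits + scale_bits / block_size


FORMATS = {
    "bf16": bits_per_weight(16),
    "fp8": bits_per_weight(8),  # per-tensor/per-channel scales ignored
    "mxfp4": bits_per_weight(4, block_size=32, scale_bits=8),
}

if __name__ == "__main__":
    for name, bits in FORMATS.items():
        share = bits / FORMATS["bf16"]
        print(f"{name}: {bits:.2f} bits/weight ({share:.1%} of bf16)")
```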

The MTP note is tied to the corresponding vLLM Quark loading fix: vLLM PR #46757.

Test Plan

  • Verified the GLM-5.2 recipe YAML parses successfully.
  • Compared the MXFP4 hardware gating pattern with models/MiniMaxAI/MiniMax-M3.yaml.
  • Confirmed the generated diff updates the GLM-5.2 recipe metadata, variant config, and guide text only.

Co-authored-by: Cursor <cursoragent@cursor.com>

vercel Bot commented Jun 26, 2026

Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vllm-recipes | Ready | Preview, Comment | Jun 26, 2026 6:30pm |

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for the AMD Quark MXFP4 quantized variant of the GLM-5.2 model, optimized for AMD MI350 and MI355X GPUs. It adds the mxfp4 variant configuration to the model YAML and updates the documentation with prerequisites, environment variables, and run commands for utilizing ROCm AITER kernels. The feedback suggests updating the mxfp4 variant metadata to include --trust-remote-code in extra_args and the required environment variables in extra_env to ensure consistency with the guide and support automated deployment tools.

Comment on lines +64 to +71
  mxfp4:
    model_id: "amd/GLM-5.2-MXFP4"
    precision: mxfp4
    vram_minimum_gb: 446
    description: "AMD Quark MXFP4 checkpoint — MoE weights quantized for MI350 / MI355X"
    extra_args:
      - "--quantization"
      - "quark"

Contributor

medium

To ensure consistency with the launch instructions in the guide and to support automated deployment tools that parse this recipe, the mxfp4 variant metadata should include --trust-remote-code in extra_args and the required environment variables (VLLM_ROCM_USE_AITER_FP8BMM: "0" and VLLM_ROCM_USE_AITER_FP4BMM: "0") in extra_env.

  mxfp4:
    model_id: "amd/GLM-5.2-MXFP4"
    precision: mxfp4
    vram_minimum_gb: 446
    description: "AMD Quark MXFP4 checkpoint — MoE weights quantized for MI350 / MI355X"
    extra_args:
      - "--quantization"
      - "quark"
      - "--trust-remote-code"
    extra_env:
      VLLM_ROCM_USE_AITER_FP8BMM: "0"
      VLLM_ROCM_USE_AITER_FP4BMM: "0"
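
A deployment tool consuming this metadata might combine `extra_env` and `extra_args` along these lines. The sketch is an assumption about how such a tool could work, not a description of any existing one: `render_launch_line` is a hypothetical helper, and the `variant` dictionary simply mirrors the suggested YAML above.

```python
import shlex

# Mirrors the suggested mxfp4 variant metadata; the dict layout is illustrative.
variant = {
    "model_id": "amd/GLM-5.2-MXFP4",
    "extra_args": ["--quantization", "quark", "--trust-remote-code"],
    "extra_env": {
        "VLLM_ROCM_USE_AITER_FP8BMM": "0",
        "VLLM_ROCM_USE_AITER_FP4BMM": "0",
    },
}


def render_launch_line(variant: dict) -> str:
    """Render a single shell line: env assignments followed by the command."""
    env_prefix = [
        f"{name}={shlex.quote(value)}"
        for name, value in variant.get("extra_env", {}).items()
    ]
    argv = ["vllm", "serve", variant["model_id"], *variant.get("extra_args", [])]
    return " ".join(env_prefix + [shlex.join(argv)])


if __name__ == "__main__":
    print(render_launch_line(variant))
```

Keeping the environment variables in the recipe metadata rather than only in the guide text means a generated command and a hand-copied one start from the same settings.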

Author

Setting both variables to 0 keeps serving on the known-good kernels and avoids accidental dispatch into experimental or unsupported FP8/FP4 BMM implementations.

Comment thread models/zai-org/GLM-5.2.yaml Outdated
Comment on lines +219 to +220
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

Related to Gemini's comment: are these needed?

Author

Will remove them once verified; it's an extra condition to keep things safe.

Follow the MiniMax-M3 MXFP4 recipe pattern by allowing the GLM-5.2 MXFP4 variant only on MI355X hardware and keeping the generated command on the validated TP8 Quark path.
Link the related vLLM Quark MTP loading fix from the GLM-5.2 recipe guide so users know which vLLM change is required before enabling MXFP4 speculative decoding.
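
To illustrate why honoring exclude entries matters for MTP: a quantized checkpoint's config can list layers that were left unquantized, and the loader has to skip the quantized path for them. The following is a sketch of that idea only. The glob-style matching, the contents of `exclude`, and the layer names are all hypothetical; none of them are taken from the Quark config format or from the vLLM fix.

```python
from fnmatch import fnmatchcase

# Hypothetical exclude list; real checkpoints define their own entries.
exclude = ["lm_head", "mtp.*"]


def is_excluded(layer_name: str, exclude_patterns: list) -> bool:
    """True when the layer matches any exclude pattern (glob-style, assumed)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in exclude_patterns)


def pick_weight_path(layer_name: str, exclude_patterns: list) -> str:
    """Choose which loading path a layer takes in this toy model."""
    return "unquantized" if is_excluded(layer_name, exclude_patterns) else "mxfp4"


if __name__ == "__main__":
    for name in ("mtp.layers.0.mlp.gate_proj",
                 "model.layers.3.mlp.experts.0.up_proj"):
        print(name, "->", pick_weight_path(name, exclude))
```

If a loader ignored the exclude list, an unquantized layer would be interpreted as MXFP4 data, which is the kind of mismatch the linked vLLM change is meant to avoid before speculative decoding is enabled.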

2 participants