
Conversation

@1092626063 (Contributor) commented Nov 18, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

# Detect the local IP on the 141.* network and the NIC that owns it,
# then point the HCCL/GLOO/TP socket interfaces at that NIC.
export HCCL_IF_IP=$(ifconfig | grep -B1 141 | grep 'inet ' | awk '{print $2}')
export GLOO_SOCKET_IFNAME=$(ifconfig -a | grep -B1 $HCCL_IF_IP | grep -v 'inet' | sed 's/://g' | awk '{print $1}')
export TP_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME
export HCCL_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_DIR="./vllm-profiling"
export HCCL_OP_EXPANSION_MODE=AIV
# export VLLM_ASCEND_TRACE_RECOMPILES=1

# export TNG_LOG_LEVEL=0
# export ASCEND_GLOBAL_LOG_LEVEL=1


export VLLM_VERSION=0.0.0

#export HCCL_INTRA_PCIE_ENABLE=1
#export HCCL_INTRA_ROCE_ENABLE=0
# allgatherEP
#export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
#export VLLM_ASCEND_ENABLE_FLASHCOMM=1

export VLLM_TORCH_PROFILER_WITH_STACK=0

# Clear stale caches/logs and stop any leftover serving processes.
rm -rf /root/atc_data
rm -rf /root/ascend/log/debug
rm -rf /root/.cache/vllm/torch_compile_cache/*
rm -rf ./.torchair_cache/
rm -rf ./dynamo_*
pkill -9 python
pkill -9 VLLM

# /mnt/share/glm4.6_w8a8_with_float_mtp
# /mnt/share/GLM-4.6-w8a8

vllm serve /mnt/share/glm4.6_w8a8_with_float_mtp/ \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name dsv3 \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --quantization ascend \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --additional-config '{"ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":true}}' \
  --speculative-config '{"num_speculative_tokens": 1, "model":"/mnt/share/glm4.6_w8a8_with_float_mtp/", "method":"glm4_moe_mtp"}' \
  --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16], "cudagraph_mode": "FULL_DECODE_ONLY"}'
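
The running server can be smoke-tested before benchmarking. The request below is a minimal sketch and not part of the original test script: it assumes the serve command above is already up, reuses the host/port and served model name (dsv3) from the bench command below, and the prompt and max_tokens values are purely illustrative.

# Illustrative sanity check; host/port/model name taken from the bench command below.
curl -s http://141.61.81.162:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dsv3", "prompt": "Hello", "max_tokens": 8}'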

Prefix-cache performance with a 70% shared prefix (22400-token prefix / 9600-token suffix):

export VLLM_VERSION=0.11.0
cd vllm
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model dsv3 \
  --tokenizer /mnt/share/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 141.61.81.162 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, so that reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for glm4.5 with multi-token prediction (MTP). The changes primarily involve configuration updates and model loading adjustments. I've identified a critical issue in mtp_proposer.py where the model loading is hardcoded, which would break functionality for other MTP models. I've also noted a high-severity maintainability concern in patch_config.py due to significant code duplication and have recommended a refactoring to a more data-driven approach.

Comment on lines 91 to 94 (mtp_proposer.py):

     else:
-        self.model = DeepSeekMTP(
+        from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
+        self.model = Glm4MoeMTP(
             vllm_config=self.vllm_config).to(target_device)

critical

This change hardcodes Glm4MoeMTP as the model for the MtpProposer. However, MtpProposer is now used for multiple methods (deepseek_mtp, qwen3_next_mtp, glm4_moe_mtp). This will cause incorrect model loading for methods other than glm4_moe_mtp.

The model should be selected dynamically based on the speculative decoding method. You can get the method from self.vllm_config.speculative_config.method.

            else:
                method = self.vllm_config.speculative_config.method
                if method == "deepseek_mtp":
                    self.model = DeepSeekMTP(
                        vllm_config=self.vllm_config).to(target_device)
                elif method == "glm4_moe_mtp":
                    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
                    self.model = Glm4MoeMTP(
                        vllm_config=self.vllm_config).to(target_device)
                else:
                    raise NotImplementedError(
                        f"MTP method '{method}' is not supported in eager mode."
                    )

Comment on lines +138 to +146 (patch_config.py):

        elif (self.draft_model_config.hf_config.model_type ==
              "glm4_moe_mtp"):
            self.method = "glm4_moe_mtp"
            if self.num_speculative_tokens > 1:
                logger.warning(
                    "All GLM4 MTP models only have " \
                    "one layer. Might need some code changes " \
                    "to support multiple layers."
                )

high

This elif block is almost identical to several others for different MTP models (e.g., deepseek_mtp, ernie_mtp, qwen3_next_mtp). This introduces significant code duplication, making the code harder to read and maintain.

Consider refactoring this if/elif chain into a data-driven approach. You could use a dictionary to map model types to their corresponding method and warning message components. This would eliminate the repeated logic and make it easier to add or modify support for MTP models in the future.

For example:

MTP_CONFIGS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek"),
    "ernie_mtp": ("ernie_mtp", "Ernie"),
    "glm4_moe_mtp": ("glm4_moe_mtp", "GLM4"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next"),
    "longcat_flash_mtp": ("longcat_flash_mtp", "LongCat"),
}

model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_CONFIGS:
    method, model_name = MTP_CONFIGS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} MTP models only have one layer. "
            "Might need some code changes to support multiple layers."
        )
# ... then handle other non-MTP cases

While a full refactoring is a larger change, applying this pattern would greatly improve code quality.

@wangxiyuan added the hold-on label ("The PR should be hold-on but no need to release") on Dec 2, 2025
@wangxiyuan (Collaborator) commented:

won't merge for dev branch.

