
Conversation

@1092626063 (Contributor) commented Nov 18, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

# Detect the local IP on the 141.* network and the NIC that owns it,
# then point the HCCL/GLOO/TP socket interfaces at that NIC.
export HCCL_IF_IP=$(ifconfig | grep -B1 141 | grep 'inet ' | awk '{print $2}')
export GLOO_SOCKET_IFNAME=$(ifconfig -a | grep -B1 $HCCL_IF_IP | grep -v 'inet' | sed 's/://g' | awk '{print $1}')
export TP_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME
export HCCL_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_DIR="./vllm-profiling"
export HCCL_OP_EXPANSION_MODE=AIV
# export VLLM_ASCEND_TRACE_RECOMPILES=1

# export TNG_LOG_LEVEL=0
# export ASCEND_GLOBAL_LOG_LEVEL=1


export VLLM_VERSION=0.0.0

#export HCCL_INTRA_PCIE_ENABLE=1
#export HCCL_INTRA_ROCE_ENABLE=0
# allgatherEP
#export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
#export VLLM_ASCEND_ENABLE_FLASHCOMM=1

export VLLM_TORCH_PROFILER_WITH_STACK=0

# Clear stale caches/logs and stop any leftover serving processes.
rm -rf /root/atc_data
rm -rf /root/ascend/log/debug
rm -rf /root/.cache/vllm/torch_compile_cache/*
rm -rf ./.torchair_cache/
rm -rf ./dynamo_*
pkill -9 python
pkill -9 VLLM

# /mnt/share/glm4.6_w8a8_with_float_mtp
# /mnt/share/GLM-4.6-w8a8

vllm serve /mnt/share/glm4.6_w8a8_with_float_mtp/ \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name dsv3 \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --quantization ascend \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --additional-config '{"ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":true}}' \
  --speculative-config '{"num_speculative_tokens": 1, "model":"/mnt/share/glm4.6_w8a8_with_float_mtp/", "method":"glm4_moe_mtp"}' \
  --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16], "cudagraph_mode": "FULL_DECODE_ONLY"}'
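
The running server can be smoke-tested before benchmarking. The request below is a minimal sketch and not part of the original test script: it assumes the serve command above is already up, reuses the host/port and served model name (dsv3) from the bench command below, and the prompt and max_tokens values are purely illustrative.

# Illustrative sanity check; host/port/model name taken from the bench command below.
curl -s http://141.61.81.162:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dsv3", "prompt": "Hello", "max_tokens": 8}'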

Prefix-cache performance with a 70% shared prefix (22400-token prefix / 9600-token suffix):

export VLLM_VERSION=0.11.0
cd vllm
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model dsv3 \
  --tokenizer /mnt/share/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 141.61.81.162 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, so that reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for glm4.5 with multi-token prediction (MTP). The changes primarily involve configuration updates and model loading adjustments. I've identified a critical issue in mtp_proposer.py where the model loading is hardcoded, which would break functionality for other MTP models. I've also noted a high-severity maintainability concern in patch_config.py due to significant code duplication and have recommended a refactoring to a more data-driven approach.

Comment on lines 91 to 94 (mtp_proposer.py):

     else:
-        self.model = DeepSeekMTP(
+        from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
+        self.model = Glm4MoeMTP(
             vllm_config=self.vllm_config).to(target_device)

critical

This change hardcodes Glm4MoeMTP as the model for the MtpProposer. However, MtpProposer is now used for multiple methods (deepseek_mtp, qwen3_next_mtp, glm4_moe_mtp). This will cause incorrect model loading for methods other than glm4_moe_mtp.

The model should be selected dynamically based on the speculative decoding method. You can get the method from self.vllm_config.speculative_config.method.

            else:
                method = self.vllm_config.speculative_config.method
                if method == "deepseek_mtp":
                    self.model = DeepSeekMTP(
                        vllm_config=self.vllm_config).to(target_device)
                elif method == "glm4_moe_mtp":
                    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
                    self.model = Glm4MoeMTP(
                        vllm_config=self.vllm_config).to(target_device)
                else:
                    raise NotImplementedError(
                        f"MTP method '{method}' is not supported in eager mode."
                    )

Comment on lines +138 to +146 (patch_config.py):

        elif (self.draft_model_config.hf_config.model_type ==
              "glm4_moe_mtp"):
            self.method = "glm4_moe_mtp"
            if self.num_speculative_tokens > 1:
                logger.warning(
                    "All GLM4 MTP models only have " \
                    "one layer. Might need some code changes " \
                    "to support multiple layers."
                )

high

This elif block is almost identical to several others for different MTP models (e.g., deepseek_mtp, ernie_mtp, qwen3_next_mtp). This introduces significant code duplication, making the code harder to read and maintain.

Consider refactoring this if/elif chain into a data-driven approach. You could use a dictionary to map model types to their corresponding method and warning message components. This would eliminate the repeated logic and make it easier to add or modify support for MTP models in the future.

For example:

MTP_CONFIGS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek"),
    "ernie_mtp": ("ernie_mtp", "Ernie"),
    "glm4_moe_mtp": ("glm4_moe_mtp", "GLM4"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next"),
    "longcat_flash_mtp": ("longcat_flash_mtp", "LongCat"),
}

model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_CONFIGS:
    method, model_name = MTP_CONFIGS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} MTP models only have one layer. "
            "Might need some code changes to support multiple layers."
        )
# ... then handle other non-MTP cases

While a full refactoring is a larger change, applying this pattern would greatly improve code quality.

@wangxiyuan added the hold-on label ("The PR should be hold-on but no need to release") on Dec 2, 2025
@wangxiyuan (Collaborator) commented:

won't merge for dev branch.

