[Bugfix] Resolve MTP > 1 issue when lm head tp > 1 #4254
Conversation
Code Review
This pull request aims to fix an issue with speculative decoding (MTP) when tensor parallelism is used on the language model head. The core of the fix is to ensure the dummy run correctly simulates the multiple compute_logits calls that occur in a real run. While the fix is correctly applied for MtpProposer, it seems to be incomplete for EagleProposer, which could lead to the same issue in that scenario. Additionally, a refactoring in model_runner_v1.py appears to have introduced an AttributeError by calling a non-existent method on the drafter object. I've provided critical comments and suggestions for both issues.
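To make the mechanics concrete, here is a minimal, self-contained sketch of the shape of the fix; ToyDraftModel, ToyDrafter, dummy_run, and all argument names are invented for illustration and are not the actual vLLM Ascend classes:

```python
import torch

class ToyDraftModel(torch.nn.Module):
    """Toy stand-in for the draft model that the MTP drafter wraps."""

    def __init__(self, hidden: int = 16, vocab: int = 32):
        super().__init__()
        self.lm_head = torch.nn.Linear(hidden, vocab)

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # With lm-head tensor parallelism this call would end in a collective op,
        # so every rank must reach it the same number of times.
        return self.lm_head(hidden_states)

class ToyDrafter:
    def __init__(self) -> None:
        # compute_logits lives on .model, not on the drafter itself
        self.model = ToyDraftModel()

def dummy_run(drafter: ToyDrafter, hidden_states: torch.Tensor,
              num_speculative_tokens: int) -> None:
    # The gist of the fix: one compute_logits call per speculative token,
    # mirroring the call count of a real MTP > 1 decode step instead of a single call.
    for _ in range(num_speculative_tokens):
        drafter.model.compute_logits(hidden_states)

dummy_run(ToyDrafter(), torch.randn(4, 16), num_speculative_tokens=3)
```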
        hidden_states[dummy_indices])

    def dummy_drafter_compute_logits(hidden_states):
        return self.drafter.compute_logits(
The dummy_drafter_compute_logits function calls self.drafter.compute_logits, but the compute_logits method is on the model attribute of the drafter object, not on the drafter itself. This will result in an AttributeError. The call should be self.drafter.model.compute_logits.
-        return self.drafter.compute_logits(
+        return self.drafter.model.compute_logits(
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
whx-sjtu left a comment
We first fixed the hanging issue when running MTP=1 with lm head TP in PR #3915. This PR refactors that fix to run dummy_compute_logits in the drafter's dummy_run and further fixes the MTP > 1 scenario. LGTM.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
I've tested it on DeepSeek and it proves to be useful, please make CI happy.
Force-pushed from 28752d2 to 84253da
Force-pushed from 7b9f38b to 577c77b
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Previously, the dummy run executed compute_logits only once, regardless of num_speculative_tokens. This caused execute_model to hang on compute_logits when lm head tensor parallelism exceeded 1. The fix ensures compute_logits executes the correct number of times during the dummy run, matching num_speculative_tokens. Signed-off-by: Jade Zheng <[email protected]>
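As an aside, the hang mechanism can be illustrated without Ascend hardware. The sketch below is not the PR's code: it models compute_logits under lm-head tensor parallelism as ending in an all_gather over vocab shards (gloo backend, toy shapes, all names invented for this example):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

VOCAB, HIDDEN, WORLD = 32, 16, 2

def compute_logits_tp(lm_head_shard: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
    local = hidden @ lm_head_shard.T                       # [tokens, VOCAB // WORLD]
    shards = [torch.empty_like(local) for _ in range(WORLD)]
    dist.all_gather(shards, local)                         # collective: every rank must join
    return torch.cat(shards, dim=-1)                       # [tokens, VOCAB]

def worker(rank: int, num_speculative_tokens: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group("gloo", rank=rank, world_size=WORLD)
    lm_head_shard = torch.randn(VOCAB // WORLD, HIDDEN)
    hidden = torch.randn(4, HIDDEN)
    # The gist of the fix: the dummy run loops exactly num_speculative_tokens
    # times, so its collective call count matches a real MTP > 1 decode step.
    for _ in range(num_speculative_tokens):
        compute_logits_tp(lm_head_shard, hidden)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(3,), nprocs=WORLD)
```

Every rank loops the same number of times here, so the run completes; if the counts diverged, the lagging ranks would sit in all_gather indefinitely, which is the hang this PR removes for MTP > 1.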
Force-pushed from 577c77b to 0642908
        # sequence length to 1 to minimize their overheads in attention.
        exceeds_max_model_len_cpu = exceeds_max_model_len.to(
-           attn_metadata_i.seq_lens.device, non_blocking=True)
+           attn_metadata_i.seq_lens.device, non_blocking=False)
@jianzs I noticed you explained the reason why we disable non-blocking here. But IMO, the stream will keep the right order between the data copy and the following operations on the same stream. I don't get why this causes an accuracy issue; is this a bug in torch-npu?
In an offline discussion, @jianzs mentioned that non-blocking H2D copies are fine and that the accuracy issue occurs with D2H copies. I think it might be a bug in torch-npu. Thus I'm fine with this change as a workaround, and we'll report it to torch-npu so it can eventually be fixed there.
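For reference, a small plain-PyTorch sketch of the pitfall being discussed; it uses CUDA naming, and the assumption (not verified here) is that torch-npu mirrors these stream semantics:

```python
import torch

def to_cpu_for_host_read(t: torch.Tensor) -> torch.Tensor:
    """Copy a device tensor to the host so it is safe to read immediately."""
    if not t.is_cuda:                       # already on the host; nothing to sync
        return t
    cpu = t.to("cpu", non_blocking=True)    # D2H copy is only enqueued on the stream
    torch.cuda.current_stream(t.device).synchronize()  # wait before the host reads it
    return cpu
```

Reading `cpu` without the synchronize may observe stale or partially written values, which matches the D2H accuracy issue described above; the PR sidesteps it by passing non_blocking=False so the copy itself is synchronous.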
What this PR does / why we need it?
Previously, the dummy run executed compute_logits only once, regardless of num_speculative_tokens. This caused execute_model to hang on compute_logits when lm head tensor parallelism exceeded 1. The fix ensures compute_logits executes the correct number of times during the dummy run, matching num_speculative_tokens.
I set the non_blocking argument to False when moving exceeds_max_model_len to the CPU. From what I understand, using non_blocking=True and immediately accessing the tensor on the CPU can cause accuracy problems. However, this issue doesn't happen when transferring data to a device. ref: https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/18

Does this PR introduce any user-facing change?
No.
How was this patch tested?