Conversation

@linzebing linzebing commented Oct 18, 2025

Purpose

This PR handles the case where the draft model and the target model may use different quantization configs.

Approaches that don't work here:

• `dataclasses.replace` doesn't work with the pydantic dataclass
• `copy.deepcopy` fails with a new error: `TypeError: BasevLLMParameter.__new__() takes 2 positional arguments but 3 were given`

Test Plan

Tested by loading a Llama 4 Scout target model with an EAGLE draft model that uses a different quantization config.
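For reference, a minimal repro sketch using the offline API; the model name, draft path, and speculative settings below are illustrative assumptions, not the exact command used:

```python
from vllm import LLM, SamplingParams

# Hypothetical setup: a bf16 Llama 4 Scout target with an fp8-quantized
# EAGLE draft model, i.e. target and draft use different quant configs.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    speculative_config={
        "model": "path/to/fp8-eagle-draft",  # hypothetical draft checkpoint
        "method": "eagle",
        "num_speculative_tokens": 3,
    },
)

out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```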

Test Result

Before:

layers.0.self_attn.qkv_proj.weight_scale is not loaded!

After:
The model loads successfully, and the gsm8k eval result is as expected.



@mergify mergify bot added the llama (Related to Llama models) and speculative-decoding labels Oct 18, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue where the draft model and the target model use different quantization configurations. The change temporarily modifies vllm_config.quant_config while the draft model layers are constructed and restores it afterward; the fix is implemented in llama4_eagle.py. I have identified a critical issue regarding potential race conditions due to the modification of a shared configuration object.
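For readers following the thread, the core of the fix is an in-place swap of the shared quant_config around draft-layer construction, restored in a finally block. A minimal sketch of that pattern (the context-manager helper and its names are illustrative; the PR inlines the try/finally directly, as the excerpt below shows):

```python
from contextlib import contextmanager

@contextmanager
def override_quant_config(vllm_config, draft_quant_config):
    """Temporarily point the shared config at the draft model's
    quantization while its layers are built (illustrative helper)."""
    original = vllm_config.quant_config
    vllm_config.quant_config = draft_quant_config
    try:
        yield vllm_config
    finally:
        # Always restore, even if layer construction raises, so layers
        # built afterward see the target model's original quant_config.
        vllm_config.quant_config = original
```

Because the config object is mutated in place rather than copied (both `dataclasses.replace` and `copy.deepcopy` fail here, per the description above), the restore in `finally` is what keeps the target model unaffected; as noted, this is only safe if nothing else reads `vllm_config.quant_config` concurrently during model construction.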

llama4_eagle.py (excerpt under review):

```python
            )
        finally:
            # Restore original quant_config
            vllm_config.quant_config = original_quant_config
```
Collaborator


when do we see the exception?

Contributor Author


When the target model and the draft model have different quantization, e.g. the target model is bf16 while the draft model uses fp8 quantization.

