Conversation

@linzebing linzebing commented Oct 18, 2025

Purpose

This PR handles the case where the draft model and the target model may use different quantization configs.

Approaches that don't work here:

• `dataclasses.replace` doesn't work with the pydantic dataclass
• `copy.deepcopy` fails with a new error: `TypeError: BasevLLMParameter.__new__() takes 2 positional arguments but 3 were given`

Test Plan

Tested by loading a Llama 4 Scout target model with an EAGLE draft model that uses a different quantization config.
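For reference, a minimal repro sketch using the offline API; the model name, draft path, and speculative settings below are illustrative assumptions, not the exact command used:

```python
from vllm import LLM, SamplingParams

# Hypothetical setup: a bf16 Llama 4 Scout target with an fp8-quantized
# EAGLE draft model, i.e. target and draft use different quant configs.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    speculative_config={
        "model": "path/to/fp8-eagle-draft",  # hypothetical draft checkpoint
        "method": "eagle",
        "num_speculative_tokens": 3,
    },
)

out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```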

Test Result

Before:

layers.0.self_attn.qkv_proj.weight_scale is not loaded!

After:
The model loads successfully, and the gsm8k eval result is as expected.



@mergify mergify bot added the llama (Related to Llama models) and speculative-decoding labels Oct 18, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue where the draft model and the target model use different quantization configurations. The change temporarily modifies vllm_config.quant_config while the draft model layers are constructed and restores it afterward; the fix is implemented in llama4_eagle.py. I have identified a critical issue regarding potential race conditions due to the modification of a shared configuration object.
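For readers following the thread, the core of the fix is an in-place swap of the shared quant_config around draft-layer construction, restored in a finally block. A minimal sketch of that pattern (the context-manager helper and its names are illustrative; the PR inlines the try/finally directly, as the excerpt below shows):

```python
from contextlib import contextmanager

@contextmanager
def override_quant_config(vllm_config, draft_quant_config):
    """Temporarily point the shared config at the draft model's
    quantization while its layers are built (illustrative helper)."""
    original = vllm_config.quant_config
    vllm_config.quant_config = draft_quant_config
    try:
        yield vllm_config
    finally:
        # Always restore, even if layer construction raises, so layers
        # built afterward see the target model's original quant_config.
        vllm_config.quant_config = original
```

Because the config object is mutated in place rather than copied (both `dataclasses.replace` and `copy.deepcopy` fail here, per the description above), the restore in `finally` is what keeps the target model unaffected; as noted, this is only safe if nothing else reads `vllm_config.quant_config` concurrently during model construction.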

llama4_eagle.py (excerpt under review):

```python
            )
        finally:
            # Restore original quant_config
            vllm_config.quant_config = original_quant_config
```
Collaborator


when do we see the exception?

Contributor Author


When the target model and the draft model have different quantization, e.g. the target model is bf16 while the draft model uses fp8 quantization.

