-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
[NIXL] use Host buffer to support TP_ratio > 1 for XPU #27140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Chendi Xue <[email protected]>
@jikunshang @zhenwei-intel , may you take a look of this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
# check and update parallel config | ||
parallel_config = vllm_config.parallel_config | ||
parallel_config.worker_cls = "vllm.v1.worker.xpu_worker.XPUWorker" | ||
vllm_config.kv_transfer_config.enable_permute_local_kv = True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard kv_transfer_config before forcing permute flag
The new line unconditionally sets vllm_config.kv_transfer_config.enable_permute_local_kv = True
. On XPU, check_and_update_config
runs for every job, including those that are not using KV transfer at all. Because VllmConfig.kv_transfer_config
defaults to None
, this call will raise AttributeError: 'NoneType' object has no attribute 'enable_permute_local_kv'
at startup, preventing any XPU deployment without explicit KV transfer configuration from launching. A simple null check or default KVTransferConfig
initialization is required before accessing this attribute.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for tensor parallelism ratios greater than 1 on XPU devices by leveraging a host buffer for KV cache transfers. The key changes involve adding logic to handle different KV cache layouts (NHD on device, HND on host) and performing permutations during data copies. While the overall approach is sound, there's a critical issue in vllm/platforms/xpu.py
where enable_permute_local_kv
is unconditionally set to True
, which could lead to incorrect behavior for non-XPU platforms or scenarios where this feature is not desired. This should be made conditional to the XPU platform.
# check and update parallel config | ||
parallel_config = vllm_config.parallel_config | ||
parallel_config.worker_cls = "vllm.v1.worker.xpu_worker.XPUWorker" | ||
vllm_config.kv_transfer_config.enable_permute_local_kv = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unconditionally setting enable_permute_local_kv = True
in check_and_update_config
could have unintended side effects for platforms other than XPU or when this specific permutation logic is not required. This setting should be applied only when the XPU platform is active and the conditions for this permutation are met. It's better to handle this within the XPU-specific logic rather than applying it globally to kv_transfer_config
.
) | ||
# Since NHD will not support Decode/Prefill TP_ratio > 1, | ||
# we can leverage host_buffer for permute | ||
self.host_buffer_kv_cache_layout = "HND" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set it to NHD
when homogeneous TP, tp ratio=1?
Nice work. |
Purpose
support XPU Decode/Prefill TP_ratio > 1 scenario with host_buffer
solution:
KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu","**enable_permute_local_kv**":"True"'}
If
enable_permute_local_kv
is enabled, initialize host_buffer with HND kv_layout and always for permute copy for device_to_host data copy and host_to_device data copy.Since memory copy is inevitable, adding additional permute will not introduce performance overhead.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.