
xuechendi
Contributor

@xuechendi xuechendi commented Oct 18, 2025

Purpose

Support the XPU Decode/Prefill TP_ratio > 1 scenario by using host_buffer.

Solution:

KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu","enable_permute_local_kv":"True"}'
If enable_permute_local_kv is enabled, the host_buffer is initialized with the HND kv_layout, and every device-to-host and host-to-device data copy is always performed as a permuting copy.

Since a device/host memory copy is required anyway, folding the permutation into that copy is not expected to add performance overhead.
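A minimal sketch of the permuting copy described above, assuming NHD means a per-block shape of [block_size][num_heads][head_dim] and HND means [num_heads][block_size][head_dim]. Plain nested lists stand in for the device tensor and the host buffer, and the function names are hypothetical, not the connector's actual API.

```python
from typing import List

# Hypothetical stand-in for one KV block: nested lists indexed [dim0][dim1][head_dim].
Block = List[List[List[float]]]


def nhd_to_hnd(block: Block) -> Block:
    """Permute NHD ([tokens][heads][head_dim]) into HND ([heads][tokens][head_dim])."""
    num_tokens = len(block)
    num_heads = len(block[0])
    return [[list(block[t][h]) for t in range(num_tokens)] for h in range(num_heads)]


def hnd_to_nhd(block: Block) -> Block:
    """Inverse permutation: HND back to NHD."""
    num_heads = len(block)
    num_tokens = len(block[0])
    return [[list(block[h][t]) for h in range(num_heads)] for t in range(num_tokens)]


def device_to_host(device_block: Block) -> Block:
    # Sketch of a permuting D2H copy: the copy happens anyway, and the data
    # is written into the host buffer in HND order instead of NHD order.
    return nhd_to_hnd(device_block)


def host_to_device(host_block: Block) -> Block:
    # Sketch of the permuting H2D copy back into the NHD device layout.
    return hnd_to_nhd(host_block)
```

In the real connector the permutation would presumably be done by the tensor copy itself rather than by Python loops; the sketch only shows that the D2H and H2D permutations are inverses, so the device layout is unchanged after a round trip.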

  • General code changes:
  1. Introduce self.host_buffer_kv_cache_layout.
  2. Remove the XPU tp_ratio assert.

Test Plan

Test Result


@xuechendi
Contributor Author

@jikunshang @zhenwei-intel, could you take a look at this PR?

@mergify mergify bot added the kv-connector label Oct 18, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 160 to 164
# check and update parallel config
parallel_config = vllm_config.parallel_config
parallel_config.worker_cls = "vllm.v1.worker.xpu_worker.XPUWorker"
vllm_config.kv_transfer_config.enable_permute_local_kv = True

P1: Guard kv_transfer_config before forcing permute flag

The new line unconditionally sets vllm_config.kv_transfer_config.enable_permute_local_kv = True. On XPU, check_and_update_config runs for every job, including those that are not using KV transfer at all. Because VllmConfig.kv_transfer_config defaults to None, this call will raise AttributeError: 'NoneType' object has no attribute 'enable_permute_local_kv' at startup, preventing any XPU deployment without explicit KV transfer configuration from launching. A simple null check or default KVTransferConfig initialization is required before accessing this attribute.
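A minimal sketch of the null check the review suggests. `KVTransferConfig` and `VllmConfig` here are stand-in dataclasses carrying only the fields used, not vLLM's real classes, and `force_permute_flag` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVTransferConfig:
    """Stand-in with only the field this sketch touches (not vLLM's class)."""
    enable_permute_local_kv: bool = False


@dataclass
class VllmConfig:
    """Stand-in: kv_transfer_config defaults to None, as the review notes."""
    kv_transfer_config: Optional[KVTransferConfig] = None


def force_permute_flag(vllm_config: VllmConfig) -> None:
    # Only set the flag when a KV transfer config exists; jobs without KV
    # transfer keep kv_transfer_config = None and no AttributeError is raised.
    kv_cfg = vllm_config.kv_transfer_config
    if kv_cfg is not None:
        kv_cfg.enable_permute_local_kv = True
```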

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for tensor parallelism ratios greater than 1 on XPU devices by leveraging a host buffer for KV cache transfers. The key changes involve adding logic to handle different KV cache layouts (NHD on device, HND on host) and performing permutations during data copies. While the overall approach is sound, there's a critical issue in vllm/platforms/xpu.py where enable_permute_local_kv is unconditionally set to True, which could lead to incorrect behavior for non-XPU platforms or scenarios where this feature is not desired. This should be made conditional to the XPU platform.

# check and update parallel config
parallel_config = vllm_config.parallel_config
parallel_config.worker_cls = "vllm.v1.worker.xpu_worker.XPUWorker"
vllm_config.kv_transfer_config.enable_permute_local_kv = True
Contributor

critical

Unconditionally setting enable_permute_local_kv = True in check_and_update_config could have unintended side effects for platforms other than XPU or when this specific permutation logic is not required. This setting should be applied only when the XPU platform is active and the conditions for this permutation are met. It's better to handle this within the XPU-specific logic rather than applying it globally to kv_transfer_config.
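One way to express the narrower condition, sketched under the assumption that "the conditions for this permutation" means a KV transfer config exists and KV data is staged through a CPU host buffer (`kv_buffer_device == "cpu"`, as in the PR's example config). The dataclass is a stand-in and `should_permute_local_kv` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVTransferConfig:
    """Stand-in holding only the fields this sketch reads (not vLLM's class)."""
    kv_buffer_device: str
    enable_permute_local_kv: bool = False


def should_permute_local_kv(kv_cfg: Optional[KVTransferConfig]) -> bool:
    # Assumed policy: permute only when KV transfer is configured and the
    # KV blocks go through a host (CPU) buffer.
    return kv_cfg is not None and kv_cfg.kv_buffer_device == "cpu"
```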

)
# Since NHD will not support Decode/Prefill TP_ratio > 1,
# we can leverage host_buffer for permute
self.host_buffer_kv_cache_layout = "HND"
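A sketch of why an HND host buffer can help when TP_ratio > 1, under two assumptions: a rank reading a remote block needs only `num_heads // tp_ratio` of its heads, and the layouts are [block_size][num_heads][head_dim] for NHD and [num_heads][block_size][head_dim] for HND. The helpers are hypothetical. With HND, one rank's heads occupy a single contiguous range of the block; with NHD they are interleaved with the other ranks' heads at every token.

```python
from typing import Iterable, List


def head_offsets(layout: str, block_size: int, num_heads: int,
                 head_dim: int, heads: Iterable[int]) -> List[int]:
    """Flat element offsets that the given heads occupy inside one KV block."""
    offsets = []
    for h in heads:
        for t in range(block_size):
            for d in range(head_dim):
                if layout == "NHD":    # [block_size][num_heads][head_dim]
                    offsets.append((t * num_heads + h) * head_dim + d)
                elif layout == "HND":  # [num_heads][block_size][head_dim]
                    offsets.append((h * block_size + t) * head_dim + d)
                else:
                    raise ValueError(f"unknown layout {layout!r}")
    return sorted(offsets)


def is_contiguous(offsets: List[int]) -> bool:
    """True if the offsets form one unbroken run."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


def heads_for_rank(num_heads: int, tp_ratio: int, rank: int) -> range:
    # Hypothetical split: the remote block's heads are divided evenly among tp_ratio ranks.
    per_rank = num_heads // tp_ratio
    return range(rank * per_rank, (rank + 1) * per_rank)
```

For rank 1 with tp_ratio 2 on a block of 4 tokens, 4 heads and head_dim 2, the HND offsets are the single run 16..31, while the NHD offsets split into four runs, one per token.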
Contributor

Should we set it to NHD for homogeneous TP (tp_ratio = 1)?
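What the reviewer's suggestion might look like, as a hypothetical helper (not code from the PR): keep the device layout (NHD) for the host buffer when tp_ratio == 1, where each rank presumably reads whole blocks and no permute would be needed, and use HND otherwise.

```python
def host_buffer_layout(tp_ratio: int) -> str:
    """Pick the host-buffer KV layout (hypothetical helper).

    tp_ratio == 1: keep NHD, matching the device layout, so no permute is needed.
    tp_ratio > 1:  use HND so each rank's heads are contiguous in a block.
    """
    if tp_ratio < 1:
        raise ValueError("tp_ratio must be >= 1")
    return "NHD" if tp_ratio == 1 else "HND"
```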

@zhenwei-intel
Contributor

Nice work.
LGTM. Just one comment:)
