[Feat][Doc] Add a load_balance_dp_proxy in examples and external dp doc. #4265
base: main
Conversation
Signed-off-by: whx-sjtu <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This PR introduces a load-balancing proxy for external data parallelism and adds corresponding documentation. The implementation of the proxy server has a critical issue with request body handling that will cause it to fail, and a high-severity issue where the load balancing metric is based on request body size instead of prompt token count, which could lead to suboptimal balancing. The documentation and examples also contain some incorrect filenames that should be fixed. I've provided suggestions to address these points.
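For context on the body-handling issue: an async request stream can only be consumed once, so a proxy must read the body a single time and reuse those bytes for both parsing and forwarding. Below is a minimal sketch of that pattern, assuming a FastAPI-based proxy (the framework, route, and handler name are assumptions, not confirmed by this PR):

import json

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/v1/completions")
async def proxy_completions(raw_request: Request):
    # Read the body exactly once; the underlying stream cannot be re-read,
    # so parsing and downstream forwarding must share these bytes.
    req_body = await raw_request.body()
    req_data = json.loads(req_body)
    # ... select a backend instance and forward req_body to it ...
    return req_data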
request_length = len(req_body)
instance_info = await _select_instance(api, req_data,
                                       request_length)
The load balancing logic currently uses request_length = len(req_body), which is the byte size of the entire JSON request body. This is not an accurate measure of the load, which is primarily determined by the number of tokens in the prompt. A request with a short prompt but large metadata could be incorrectly assessed as larger than a request with a long prompt.
To improve accuracy, you should estimate the number of tokens from the prompt text. You can access the prompt from req_data. A simple heuristic is to use the length of the prompt string.
prompt_text = ""
if api == "/completions":
prompt = req_data.get("prompt")
if isinstance(prompt, str):
prompt_text = prompt
elif isinstance(prompt, list):
prompt_text = "".join(map(str, prompt))
elif api == "/chat/completions":
messages = req_data.get("messages", [])
prompt_text = "\n".join(m.get("content", "") for m in messages if m.get("content"))
# Using prompt length as a better heuristic for load.
request_length = len(prompt_text)
instance_info = await _select_instance(api, req_data,
request_length)9cbb1f5 to
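_select_instance itself is not shown in this thread. Here is a hypothetical sketch of the least-loaded selection it might perform, charging each request's length to the chosen instance (the instance registry, field names, and locking scheme are illustrative assumptions, not code from this PR):

import asyncio

# Hypothetical registry of DP instances; in the real proxy this would be
# built from the configured endpoints.
instances = [
    {"url": "http://127.0.0.1:8001", "in_flight_length": 0},
    {"url": "http://127.0.0.1:8002", "in_flight_length": 0},
]
_lock = asyncio.Lock()

async def _select_instance(api, req_data, request_length):
    # Pick the instance with the smallest total in-flight request length
    # and charge this request to it before forwarding; the caller would
    # subtract request_length again once the response completes.
    async with _lock:
        instance = min(instances, key=lambda inst: inst["in_flight_length"])
        instance["in_flight_length"] += request_length
        return instance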
Force-pushed from 9cbb1f5 to 7463937
Signed-off-by: whx-sjtu <[email protected]>
Force-pushed from 7463937 to ebb71f7
| "kv_port": "20001", | ||
| "engine_id": "0", | ||
| "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector" | ||
| '{"kv_connector": "MooncakeLayerwiseConnector", |
Please help me review the Mooncake-related modifications to run_dp_template.sh @liziyu179
What this PR does / why we need it?
This PR adds a load-balance DP proxy server that can be used in the external DP scenario without Disaggregated Prefill enabled. It also adds documentation for external DP and the load-balance DP proxy server.
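As an illustration of the user-facing flow, a client sends OpenAI-compatible requests to the proxy, which forwards them to one of the DP instances. The endpoint, port, and model name below are illustrative assumptions, not values from this PR:

import requests

# Assumed proxy address and model name; adjust to your deployment.
resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "Hello, world",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())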
Does this PR introduce any user-facing change?
No
How was this patch tested?
See the new doc.