@gnovack (Contributor) commented Oct 17, 2025

Purpose

This PR adds early-exit logic to the moe_lora_align_sum_kernel and _fused_moe_lora_kernel kernels. This handles the case where LoRA adapters are active but do not include weights for MoE layers (e.g., attention-only adapters).
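For illustration, here is a minimal sketch of the early-exit idea on the host side. This is not the actual kernel signature; the argument names and the _apply path are placeholders for how the guard might sit in front of the kernel launches.

import torch
from typing import Optional

def fused_moe_lora_with_early_exit(
    base_moe_output: torch.Tensor,
    moe_lora_a_stacked: Optional[torch.Tensor],
    moe_lora_b_stacked: Optional[torch.Tensor],
) -> torch.Tensor:
    # Early exit: if the active adapters carry no MoE LoRA weights
    # (e.g. attention-only adapters), skip the alignment and fused
    # kernel launches entirely and return the base MoE output as-is.
    if (moe_lora_a_stacked is None or moe_lora_b_stacked is None
            or moe_lora_a_stacked.numel() == 0):
        return base_moe_output
    # Otherwise run moe_lora_align_sum_kernel and _fused_moe_lora_kernel
    # as before (omitted in this sketch).
    raise NotImplementedError("full fused MoE LoRA path omitted")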

Test Plan

  • Benchmark LoRA serving before and after this change, using a LoRA adapter that targets attention projections only.

Test Result

Serve Command

vllm serve openai/gpt-oss-120b \
	--served-model-name openai/gpt-oss-120b \
	--trust-remote-code --enable-lora --max-loras 1 \
	-tp=1 --lora-modules lora1=LevinZheng/gpt-oss-20b-lora-adapter \
	--max-lora-rank 64 \
	--max-num-seqs 8 --gpu-memory-utilization 0.95 \
	--no-enable-prefix-caching

Benchmark Command

vllm bench serve \
  --backend vllm \
  --model openai/gpt-oss-120b \
  --lora-modules lora1 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-concurrency 8 \
  --num-prompts 80

Benchmark Results (before)

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  45.78     
Total input tokens:                      19708     
Total generated tokens:                  16469     
Request throughput (req/s):              1.75      
Output token throughput (tok/s):         359.70    
Peak output token throughput (tok/s):    440.00    
Peak concurrent requests:                14.00     
Total Token throughput (tok/s):          790.15    
---------------Time to First Token----------------
Mean TTFT (ms):                          180.72    
Median TTFT (ms):                        70.88     
P99 TTFT (ms):                           1351.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.27     
Median TPOT (ms):                        19.37     
P99 TPOT (ms):                           21.75     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.35     
Median ITL (ms):                         18.53     
P99 ITL (ms):                            60.23     
==================================================

Benchmark Results (after)

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  30.81     
Total input tokens:                      19708     
Total generated tokens:                  16469     
Request throughput (req/s):              2.60      
Output token throughput (tok/s):         534.55    
Peak output token throughput (tok/s):    631.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          1174.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          58.61     
Median TTFT (ms):                        53.34     
P99 TTFT (ms):                           121.21    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.27     
Median TPOT (ms):                        13.40     
P99 TPOT (ms):                           14.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.34     
Median ITL (ms):                         12.70     
P99 ITL (ms):                            47.56     
==================================================
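
Relative to the baseline, the early exit shortens the benchmark run from 45.78 s to 30.81 s, raises total token throughput from 790.15 to 1174.23 tok/s, and cuts mean TPOT from 19.27 ms to 13.27 ms (roughly a 31% per-token latency improvement).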


@mergify bot commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gnovack.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot commented Oct 22, 2025

Documentation preview: https://vllm--27131.org.readthedocs.build/en/27131/

@gnovack gnovack marked this pull request as ready for review October 22, 2025 19:51
@gnovack gnovack requested a review from jeejeelee as a code owner October 22, 2025 19:51
@chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.


@jeejeelee (Collaborator) commented:
Thank you. Can we add related tests to verify that the LoRA and non-LoRA outputs conform to expectations? We can place these tests in https://github.com/vllm-project/vllm/blob/main/tests/lora/test_olmoe_tp.py
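
For concreteness, a test along these lines might look like the sketch below; the model path, adapter path, and expected strings are placeholders, and a real test would reuse the fixtures and reference outputs in tests/lora/test_olmoe_tp.py.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

MODEL_PATH = "..."          # placeholder: MoE base model used in the test suite
LORA_PATH = "..."           # placeholder: attention-only LoRA adapter
EXPECTED_BASE_TEXT = "..."  # placeholder: known-good greedy output, no adapter
EXPECTED_LORA_TEXT = "..."  # placeholder: known-good greedy output, with adapter

def test_attention_only_lora_outputs():
    llm = LLM(model=MODEL_PATH, enable_lora=True, max_loras=1, max_lora_rank=64)
    params = SamplingParams(temperature=0, max_tokens=32)
    prompts = ["What is the capital of France?"]

    # Greedy decoding without any adapter.
    base = llm.generate(prompts, params)

    # Greedy decoding with an attention-only adapter: the MoE LoRA kernels
    # should early-exit, so only the attention projections contribute.
    lora = llm.generate(prompts, params,
                        lora_request=LoRARequest("lora1", 1, LORA_PATH))

    assert base[0].outputs[0].text == EXPECTED_BASE_TEXT
    assert lora[0].outputs[0].text == EXPECTED_LORA_TEXT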
