@gnovack (Contributor) commented Oct 17, 2025

Purpose

This PR adds early-exit logic to the moe_lora_align_sum_kernel and _fused_moe_lora_kernel kernels. This handles the case where LoRA adapters are active but do not include weights for MoE layers (e.g., attention-only adapters).
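For illustration, here is a minimal sketch of the early-exit idea on the host side. This is not the actual kernel signature; the argument names and the _apply path are placeholders for how the guard might sit in front of the kernel launches.

import torch
from typing import Optional

def fused_moe_lora_with_early_exit(
    base_moe_output: torch.Tensor,
    moe_lora_a_stacked: Optional[torch.Tensor],
    moe_lora_b_stacked: Optional[torch.Tensor],
) -> torch.Tensor:
    # Early exit: if the active adapters carry no MoE LoRA weights
    # (e.g. attention-only adapters), skip the alignment and fused
    # kernel launches entirely and return the base MoE output as-is.
    if (moe_lora_a_stacked is None or moe_lora_b_stacked is None
            or moe_lora_a_stacked.numel() == 0):
        return base_moe_output
    # Otherwise run moe_lora_align_sum_kernel and _fused_moe_lora_kernel
    # as before (omitted in this sketch).
    raise NotImplementedError("full fused MoE LoRA path omitted")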

Test Plan

  • Benchmark LoRA serving before and after this change, using a LoRA adapter that targets attention projections only.

Test Result

Serve Command

vllm serve openai/gpt-oss-120b \
	--served-model-name openai/gpt-oss-120b \
	--trust-remote-code --enable-lora --max-loras 1 \
	-tp=1 --lora-modules lora1=LevinZheng/gpt-oss-20b-lora-adapter \
	--max-lora-rank 64 \
	--max-num-seqs 8 --gpu-memory-utilization 0.95 \
	--no-enable-prefix-caching

Benchmark Command

vllm bench serve \
  --backend vllm \
  --model openai/gpt-oss-120b \
  --lora-modules lora1 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-concurrency 8 \
  --num-prompts 80

Benchmark Results (before)

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  45.78     
Total input tokens:                      19708     
Total generated tokens:                  16469     
Request throughput (req/s):              1.75      
Output token throughput (tok/s):         359.70    
Peak output token throughput (tok/s):    440.00    
Peak concurrent requests:                14.00     
Total Token throughput (tok/s):          790.15    
---------------Time to First Token----------------
Mean TTFT (ms):                          180.72    
Median TTFT (ms):                        70.88     
P99 TTFT (ms):                           1351.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.27     
Median TPOT (ms):                        19.37     
P99 TPOT (ms):                           21.75     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.35     
Median ITL (ms):                         18.53     
P99 ITL (ms):                            60.23     
==================================================

Benchmark Results (after)

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  30.81     
Total input tokens:                      19708     
Total generated tokens:                  16469     
Request throughput (req/s):              2.60      
Output token throughput (tok/s):         534.55    
Peak output token throughput (tok/s):    631.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          1174.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          58.61     
Median TTFT (ms):                        53.34     
P99 TTFT (ms):                           121.21    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.27     
Median TPOT (ms):                        13.40     
P99 TPOT (ms):                           14.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.34     
Median ITL (ms):                         12.70     
P99 ITL (ms):                            47.56     
==================================================
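
Relative to the baseline, the early exit shortens the benchmark run from 45.78 s to 30.81 s, raises total token throughput from 790.15 to 1174.23 tok/s, and cuts mean TPOT from 19.27 ms to 13.27 ms (roughly a 31% per-token latency improvement).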


@mergify bot commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gnovack.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot commented Oct 22, 2025

Documentation preview: https://vllm--27131.org.readthedocs.build/en/27131/

@gnovack gnovack marked this pull request as ready for review October 22, 2025 19:51
@gnovack gnovack requested a review from jeejeelee as a code owner October 22, 2025 19:51
@chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.


@jeejeelee (Collaborator) commented:
Thank you. Can we add related tests to verify that the LoRA and non-LoRA outputs conform to expectations? We can place these tests in https://github.com/vllm-project/vllm/blob/main/tests/lora/test_olmoe_tp.py
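
For concreteness, a test along these lines might look like the sketch below; the model path, adapter path, and expected strings are placeholders, and a real test would reuse the fixtures and reference outputs in tests/lora/test_olmoe_tp.py.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

MODEL_PATH = "..."          # placeholder: MoE base model used in the test suite
LORA_PATH = "..."           # placeholder: attention-only LoRA adapter
EXPECTED_BASE_TEXT = "..."  # placeholder: known-good greedy output, no adapter
EXPECTED_LORA_TEXT = "..."  # placeholder: known-good greedy output, with adapter

def test_attention_only_lora_outputs():
    llm = LLM(model=MODEL_PATH, enable_lora=True, max_loras=1, max_lora_rank=64)
    params = SamplingParams(temperature=0, max_tokens=32)
    prompts = ["What is the capital of France?"]

    # Greedy decoding without any adapter.
    base = llm.generate(prompts, params)

    # Greedy decoding with an attention-only adapter: the MoE LoRA kernels
    # should early-exit, so only the attention projections contribute.
    lora = llm.generate(prompts, params,
                        lora_request=LoRARequest("lora1", 1, LORA_PATH))

    assert base[0].outputs[0].text == EXPECTED_BASE_TEXT
    assert lora[0].outputs[0].text == EXPECTED_LORA_TEXT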
