[Feature][Kernel] FusedMoE LoRA #21229
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces an experimental extension for FusedMoE to support parallel inference with multiple LoRA models. The changes are extensive, touching CUDA kernels, Triton kernels, and various Python layers for model execution and LoRA management. The core idea is to enable LoRA adapters for Mixture-of-Experts layers, which is a significant feature enhancement.
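For orientation, the effect being added can be summarized as follows: for every token, each expert selected by the router applies its frozen weight plus a low-rank delta chosen by that token's adapter. Below is a dense PyTorch sketch of that idea, intended purely as an illustrative reference rather than the fused Triton/CUDA path in this PR; all tensor names and shapes are placeholders.

```python
import torch

def expert_with_lora(x, w_expert, lora_a, lora_b, scaling=1.0):
    # x: (hidden,)  w_expert: (out_dim, hidden)
    # lora_a: (rank, hidden)  lora_b: (out_dim, rank)
    base = x @ w_expert.t()                   # frozen expert projection
    delta = (x @ lora_a.t()) @ lora_b.t()     # low-rank adapter update
    return base + scaling * delta

def moe_with_lora_ref(x, w_experts, lora_a, lora_b,
                      topk_ids, topk_weights, lora_ids):
    # x: (T, H), w_experts: (E, D, H), lora_a: (L, E, r, H), lora_b: (L, E, D, r)
    # topk_ids / topk_weights: (T, K), lora_ids: (T,) adapter index per token
    out = torch.zeros(x.shape[0], w_experts.shape[1], dtype=x.dtype)
    for t in range(x.shape[0]):
        a = lora_ids[t]
        for k in range(topk_ids.shape[1]):
            e = topk_ids[t, k]
            y = expert_with_lora(x[t], w_experts[e], lora_a[a, e], lora_b[a, e])
            out[t] += topk_weights[t, k] * y
    return out
```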
My review focuses on the correctness and robustness of the new kernels and their integration. I've identified a few critical and high-severity issues that should be addressed:
- A race condition in the new CUDA kernel (moe_lora_align_sum_kernels.cu) could lead to incorrect behavior or memory corruption.
- Incorrect shared memory calculation in the same CUDA kernel could lead to resource exhaustion and launch failures.
- Hardcoded values for the number of experts and LoRA rank in the Triton kernels and Python wrappers limit the general applicability of this feature.
These issues are important to fix to ensure the stability and correctness of this new feature. The PR is a work-in-progress, and addressing these points will significantly improve its quality and readiness for merging.
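As a concrete reference for the alignment step under review: assuming moe_lora_align_sum follows the same contract as vLLM's existing moe_align_block_size, but groups the flattened (token, top-k) entries per (adapter, expert) pair, a CPU sketch that the fixed kernel could be validated against might look like the following. The function name, arguments, and output layout are assumptions for illustration, not the kernel's actual interface.

```python
import torch

def moe_lora_align_block_size_ref(
    topk_ids: torch.Tensor,        # (num_tokens, topk) expert index per token
    token_lora_ids: torch.Tensor,  # (num_tokens,) adapter index per token
    block_size: int,
    num_experts: int,
    max_loras: int,
):
    """CPU reference (assumed semantics): group flattened (token, top-k)
    entries by (adapter, expert) and pad each group to a multiple of
    block_size, mirroring vLLM's moe_align_block_size."""
    num_tokens, topk = topk_ids.shape
    flat_experts = topk_ids.flatten()                      # (num_tokens * topk,)
    flat_loras = token_lora_ids.repeat_interleave(topk)    # adapter id per entry
    total = flat_experts.numel()                           # sentinel for padding

    sorted_ids, block_expert_ids, block_lora_ids = [], [], []
    for lora in range(max_loras):
        for expert in range(num_experts):
            idx = torch.nonzero(
                (flat_experts == expert) & (flat_loras == lora), as_tuple=False
            ).flatten()
            if idx.numel() == 0:
                continue
            # Pad this group so a fused kernel can tile it in whole blocks.
            padded = -(-idx.numel() // block_size) * block_size
            pad = torch.full((padded - idx.numel(),), total, dtype=idx.dtype)
            sorted_ids.append(torch.cat([idx, pad]))
            block_expert_ids += [expert] * (padded // block_size)
            block_lora_ids += [lora] * (padded // block_size)

    sorted_token_ids = (torch.cat(sorted_ids) if sorted_ids
                        else torch.empty(0, dtype=torch.long))
    return (sorted_token_ids,
            torch.tensor(block_expert_ids),
            torch.tensor(block_lora_ids),
            sorted_token_ids.numel())
```

Note that num_experts, max_loras, and block_size are plain runtime parameters here; keeping them as arguments rather than compile-time constants in the Triton kernels and Python wrappers is what the hardcoding comment above is asking for.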
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wuchen <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Chen Wu <[email protected]>
enable passing activation func so act_wrapper in lora will be called …
LGTM
Thank you to everyone who contributed to this PR
@Yikun @xuechendi @vanbasten23 This PR might affect LoRA in your backend.
@jeejeelee Oh, thanks for the info, we will check from our backend.
Signed-off-by: wuchen <[email protected]>
Signed-off-by: banjuede <[email protected]>
Signed-off-by: Chen Wu <[email protected]>
Signed-off-by: Danielle Robinson <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: bk-201 <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>
Signed-off-by: 0xrushi <[email protected]>
Co-authored-by: wuchen <[email protected]>
Co-authored-by: Nathan Van Gheem <[email protected]>
Co-authored-by: banjuede <[email protected]>
Co-authored-by: Danielle Robinson <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: bk-201 <[email protected]>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
FIX [Feature]: Add LoRA support for gpt-oss model #23610
Purpose
This PR is an experimental extension to FusedMoE, aiming to support parallel inference with multiple LoRA models.
✅ Completed:
❌ Work in progress:
Any suggestions are welcome!
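For context on how this feature is exercised from the user side, here is a minimal sketch of multi-LoRA inference through vLLM's offline API. The model name, adapter paths, and rank values are placeholders, and the MoE-specific handling added by this PR is assumed to be transparent at this level.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder MoE model and adapter paths; the point is that requests carrying
# different LoRA adapters are served by the same engine, which is what the
# FusedMoE LoRA path has to support for MoE models.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1",
          enable_lora=True, max_loras=2, max_lora_rank=16)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

out_a = llm.generate(["Prompt for adapter A"], sampling,
                     lora_request=LoRARequest("adapter_a", 1, "/path/to/adapter_a"))
out_b = llm.generate(["Prompt for adapter B"], sampling,
                     lora_request=LoRARequest("adapter_b", 2, "/path/to/adapter_b"))
```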
Test Plan
Test Result
(Optional) Documentation Update