feat(qwen3): support request-level LoRA adapters#193
Code Review
This pull request introduces support for dynamic LoRA adapter unloading and listing across the engine, scheduler, and vLLM frontend, as well as request grouping by LoRA adapter during execution. It also adds startup LoRA module loading via CLI arguments and includes a new stress test script. Feedback on the changes highlights a performance bottleneck in Qwen3Executor::list_lora_adapters, where querying the primary worker thread on every scheduler iteration introduces blocking channel communication overhead; caching the loaded adapter names locally is recommended to resolve this.
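The caching recommendation could look roughly like the sketch below. It assumes a hypothetical `AdapterNameCache` kept on the executor side and updated only after the worker confirms a load or unload; none of these names come from the PR.

```rust
use std::collections::BTreeSet;

/// Hypothetical executor-side cache of loaded adapter names (not a type from this PR).
#[derive(Default)]
struct AdapterNameCache {
    names: BTreeSet<String>,
}

impl AdapterNameCache {
    /// Record a name after the worker confirms a successful load.
    fn on_loaded(&mut self, name: &str) {
        self.names.insert(name.to_string());
    }

    /// Drop a name after the worker confirms a successful unload.
    fn on_unloaded(&mut self, name: &str) {
        self.names.remove(name);
    }

    /// Answer list queries locally, without a channel round trip to the worker.
    fn list(&self) -> Vec<String> {
        self.names.iter().cloned().collect()
    }
}
```

With a cache like this, the scheduler loop reads a local set on every iteration, and the worker thread is contacted only when the adapter set actually changes.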
xiaguan
left a comment
Could we move the LoRA-related tools under tools/lora?
Others LGTM.
f5137e3 to
0969363
Addressed the review comments in
Validation rerun:
Summary
Part of #173.
This PR extends the Qwen3 LoRA path from single active adapter serving to request-level adapter selection.
In LoRA mode, one Qwen3-4B server process can now keep multiple adapters resident, expose them as OpenAI model names, and route each generation request to either the base model or a selected adapter. The scheduler carries adapter identity through prefill and decode, and mixed base/LoRA work is split into adapter-homogeneous execution groups for correctness.
This is still a correctness-first implementation. It does not add Punica-style fused multi-LoRA kernels or CUDA Graph support for LoRA mode.
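As a rough illustration of request-level selection, the sketch below resolves an OpenAI-style `model` name to either the base model or a resident adapter. The `Route` enum and `resolve_model` function are hypothetical names chosen for this example; the PR's actual types may differ.

```rust
use std::collections::BTreeSet;

/// Outcome of resolving a request's `model` name (hypothetical type).
#[derive(Debug, PartialEq)]
enum Route {
    /// Run the base model, i.e. no adapter is active.
    Base,
    /// Run with the named resident adapter.
    Adapter(String),
}

/// Resolve `model` against the base model name and the resident adapter names.
/// Unknown names return `None`, which a server could map to a model-not-found error.
fn resolve_model(model: &str, base: &str, adapters: &BTreeSet<String>) -> Option<Route> {
    if model == base {
        Some(Route::Base)
    } else if adapters.contains(model) {
        Some(Route::Adapter(model.to_string()))
    } else {
        None
    }
}
```

Checking the base name first means an adapter can never shadow the base model in this sketch; whether the real server enforces that ordering is an assumption.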
What Changed
- `GenerateRequest.lora_adapter: Option<String>`
- `None`
- `None` clears LoRA and runs the base model
- `Some(adapter)` activates a loaded adapter
- `model`
- `POST /v1/load_lora_adapter`
- `POST /v1/unload_lora_adapter`
- `--lora-modules`
- `/v1/models`
- `load_inplace=true` replaces an already loaded adapter name
- `tools/lora/qwen3_lora_live_stress.py` for live multi-adapter API stress coverage.
- `swap_remove` indices.
API Behavior
The LoRA API surface is intentionally close to vLLM:
- `--enable-lora`.
- `--lora-modules name=path`.
- `model` field to the adapter name.
- `/v1/models` to see the base served model plus loaded adapter names.
- Loads of an already loaded adapter name without `load_inplace=true` are rejected.
- `load_inplace=true` replaces the existing adapter entry after the new adapter is constructed successfully.
- `is_3d_lora_weight=true` is accepted in the schema for compatibility but remains unsupported in this PR.
Correctness Boundary
This PR supports multiple resident adapters and request-level routing, but it does not implement fused multi-LoRA execution.
When a scheduler execution plan contains multiple adapter identities, the current implementation groups work by adapter and runs those groups sequentially:
That is the required correctness behavior for the current Qwen3 executor, because the model reads one active adapter at a time.
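The grouping step can be sketched as follows. This is a minimal illustration, assuming a hypothetical `PlannedRequest` type and a first-occurrence group order; the scheduler's real data structures and ordering are not shown in this description.

```rust
/// Hypothetical stand-in for a scheduled request; only the adapter identity matters here.
#[derive(Debug, Clone, PartialEq)]
struct PlannedRequest {
    id: u64,
    lora_adapter: Option<String>,
}

/// Split a mixed plan into adapter-homogeneous groups.
/// Groups appear in order of first occurrence, and requests keep their
/// relative order inside each group.
fn group_by_adapter(plan: &[PlannedRequest]) -> Vec<(Option<String>, Vec<PlannedRequest>)> {
    let mut groups: Vec<(Option<String>, Vec<PlannedRequest>)> = Vec::new();
    for req in plan {
        // Look for an existing group with the same adapter identity.
        if let Some(pos) = groups.iter().position(|(a, _)| *a == req.lora_adapter) {
            groups[pos].1.push(req.clone());
        } else {
            groups.push((req.lora_adapter.clone(), vec![req.clone()]));
        }
    }
    groups
}
```

An executor would then iterate over the returned groups, activate each group's adapter (or clear LoRA for `None`), and run that group's requests before moving to the next one.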
Follow-up performance work should add a vLLM/Punica-style token-to-LoRA mapping and fused LoRA kernels so one forward pass can handle multiple adapters more efficiently.
Frontend Note
The Rust `vllm_server` path validates OpenAI `model` names before dispatching to PegaInfer's engine bridge. To keep adapter-as-model semantics without forking `vllm_server`, LoRA mode keeps using the same-port proxy workaround:
- `vllm_xargs`;
- `/v1/models` is overlaid with dynamically loaded adapter names;
This should become simpler if vLLM Rust exposes a native route/model-alias extension point.
Scope Boundary
This PR does not add:
- `--max-loras`
Validation
Local checks:
PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-engine --lib lora -- --nocapture
PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-vllm-frontend --lib lora -- --nocapture
PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib lora -- --nocapture
PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib -- --nocapture
cargo fmt --all --check && PEGAINFER_CUDA_SM=80 cargo check --workspace --all-targets
python -m py_compile tools/lora/qwen3_lora_live_parity.py
python -m py_compile tools/lora/qwen3_lora_live_stress.py
git diff --check
Scheduler grouping coverage:
- `mixed_lora_prefill_requests_run_in_adapter_groups` constructs one execution plan containing base, `adapter-a`, and `adapter-b`.
- `None`, `adapter-a`, and `adapter-b`, proving a mixed plan is split into adapter-homogeneous groups before execution.
Live Qwen3-4B parity smoke:
- `parity`
- `/v1/load_lora_adapter` succeeded
- `match: true`
- `first_token_mismatch: null`
Live multi-LoRA routing smoke:
- `model` name.
- `scale-0_05`, `scale-0_1`, and `scale-0_25` matched HF/PEFT for the checked fixture.
- `scale-0_05` and `scale-0_1` produced distinct matching references, proving the same live server selected different resident adapters by request model name.
Live API stress:
python tools/lora/qwen3_lora_live_stress.py \
  --model-path "$QWEN3_4B_MODEL_PATH" \
  --startup-timeout-s 600 \
  --concurrency 12 \
  --rounds 3 \
  --max-tokens 8
Result:
- `ok: true`
- `stress-a`, `stress-b`, `stress-c`
- `/v1/models` after load: `qwen3-base`, `stress-a`, `stress-b`, `stress-c`
- `stress-b` load without `load_inplace`: HTTP 400
- `load_inplace=true` replacement for `stress-b`: success, followed by successful completion
- `stress-c`: success
- `/v1/models` no longer listed `stress-c`
- `stress-c`: HTTP 404 model-not-found
TP2 live API stress:
baaa283
python tools/lora/qwen3_lora_live_stress.py \
  --model-path "$QWEN3_4B_MODEL_PATH" \
  --startup-timeout-s 900 \
  --concurrency 4 \
  --rounds 2 \
  --max-tokens 8 \
  --tp-size 2
Result:
- `ok: true`
- `stress-a`, `stress-b`, `stress-c`
- `/v1/models` after load: `qwen3-base`, `stress-a`, `stress-b`, `stress-c`
- `stress-b` load without `load_inplace`: HTTP 400, reported by both TP ranks
- `load_inplace=true` replacement for `stress-b`: success, followed by successful completion
- `stress-c`: success on TP ranks
- `/v1/models` no longer listed `stress-c`
- `stress-c`: HTTP 404 model-not-found
Notes