feat(qwen3): support request-level LoRA adapters #193

Merged: xiaguan merged 8 commits into xiaguan:main from NolanHo:feat/qwen3-lora-pr3-request-lora on May 29, 2026

@NolanHo NolanHo commented May 29, 2026

Summary

Part of #173.

This PR extends the Qwen3 LoRA path from single active adapter serving to request-level adapter selection.

In LoRA mode, one Qwen3-4B server process can now keep multiple adapters resident, expose them as OpenAI model names, and route each generation request to either the base model or a selected adapter. The scheduler carries adapter identity through prefill and decode, and mixed base/LoRA work is split into adapter-homogeneous execution groups for correctness.

This is still a correctness-first implementation. It does not add Punica-style fused multi-LoRA kernels or CUDA Graph support for LoRA mode.

What Changed

  • Added request-level LoRA selection to the engine contract:
    • GenerateRequest.lora_adapter: Option<String>
    • base model requests use None
    • adapter requests use the loaded adapter name
  • Changed Qwen3 LoRA state from a single active adapter to a named adapter registry.
  • Added a Qwen3 executor-side adapter-name cache so scheduler admission checks do not block on rank command-channel round trips.
  • Added explicit Qwen3 adapter activation before execution:
    • None clears LoRA and runs the base model
    • Some(adapter) activates a loaded adapter
  • Threaded adapter identity through Qwen3 scheduler pending and active request state.
  • Split Qwen3 prefill, decode, and unified execution plans by adapter identity.
  • Added scheduler admission rejection for unknown adapter names in direct/internal requests.
  • Extended vLLM-compatible LoRA serving semantics:
    • adapter names can be used as the OpenAI model
    • POST /v1/load_lora_adapter
    • POST /v1/unload_lora_adapter
    • startup --lora-modules
    • dynamic adapter names are listed by /v1/models
    • load_inplace=true replaces an already loaded adapter name
  • Preserved tensor-parallel LoRA loading from the previous PR by forwarding adapter load/activation/unload commands to all Qwen3 TP ranks.
  • Added tools/lora/qwen3_lora_live_stress.py for live multi-adapter API stress coverage.
  • Fixed a scheduler retirement bug found by the stress test where several completed requests in one decode step could produce stale swap_remove indices.
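
The retirement bug in the last bullet is an instance of a general `Vec::swap_remove` hazard: each call moves the last element into the freed slot, so indices collected beforehand can go stale. The sketch below reproduces that failure shape and shows one common remedy, walking indices from the back. `ActiveRequest`, `retire_stale`, and `retire_descending` are hypothetical names for illustration; this is not the PR's scheduler code, and the PR may have fixed the bug differently.

```rust
/// Hypothetical active-request record; the real scheduler state is richer.
#[derive(Debug, Clone, PartialEq)]
struct ActiveRequest {
    id: u64,
    finished: bool,
}

/// Failure shape: indices are collected first, then removed in ascending order.
/// Each `swap_remove` moves the last element into the hole, so a later index
/// may point at a different request or past the end of the vector.
fn retire_stale(active: &mut Vec<ActiveRequest>) -> Vec<u64> {
    let done: Vec<usize> = active
        .iter()
        .enumerate()
        .filter(|(_, r)| r.finished)
        .map(|(i, _)| i)
        .collect();
    let mut retired = Vec::new();
    for i in done {
        // The bounds guard avoids a panic but does not make the index valid.
        if i < active.len() {
            retired.push(active.swap_remove(i).id);
        }
    }
    retired
}

/// One possible fix: walk indices from the back. The element swapped into
/// slot `i` always comes from a higher index that was already inspected.
fn retire_descending(active: &mut Vec<ActiveRequest>) -> Vec<u64> {
    let mut retired = Vec::new();
    for i in (0..active.len()).rev() {
        if active[i].finished {
            retired.push(active.swap_remove(i).id);
        }
    }
    retired
}
```

With four active requests where 1, 3, and 4 finish in the same step, `retire_stale` retires only 1 and 3 and leaves the finished request 4 in the vector, while `retire_descending` retires all three.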

API Behavior

The LoRA API surface is intentionally close to vLLM:

  • Start LoRA mode with --enable-lora.
  • Optionally preload adapters with --lora-modules name=path.
  • Dynamically load an adapter with POST /v1/load_lora_adapter.
  • Dynamically unload an adapter with POST /v1/unload_lora_adapter.
  • Request an adapter by setting the OpenAI model field to the adapter name.
  • Query /v1/models to see the base served model plus loaded adapter names.
  • Duplicate loads without load_inplace=true are rejected.
  • load_inplace=true replaces the existing adapter entry after the new adapter is constructed successfully.
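
The load and unload rules listed above can be sketched as a small named registry. This is a minimal illustration under stated assumptions, not the PR's implementation: `AdapterRegistry`, `Adapter`, `LoadError`, and `build_adapter` are hypothetical names, and the empty-path failure is only a stand-in for a real construction error.

```rust
use std::collections::BTreeMap;

/// Hypothetical placeholder for loaded adapter weights.
#[derive(Debug, Clone, PartialEq)]
struct Adapter {
    path: String,
}

#[derive(Debug, PartialEq)]
enum LoadError {
    AlreadyLoaded(String),
    BuildFailed(String),
}

/// Stand-in for adapter construction; here an empty path simulates a failure.
fn build_adapter(path: &str) -> Result<Adapter, LoadError> {
    if path.is_empty() {
        Err(LoadError::BuildFailed("empty path".to_string()))
    } else {
        Ok(Adapter { path: path.to_string() })
    }
}

#[derive(Default)]
struct AdapterRegistry {
    adapters: BTreeMap<String, Adapter>,
}

impl AdapterRegistry {
    /// Duplicate names are rejected unless `load_inplace` is set.
    fn load(&mut self, name: &str, path: &str, load_inplace: bool) -> Result<(), LoadError> {
        if self.adapters.contains_key(name) && !load_inplace {
            return Err(LoadError::AlreadyLoaded(name.to_string()));
        }
        // Build first: the old entry is replaced only after construction succeeds.
        let adapter = build_adapter(path)?;
        self.adapters.insert(name.to_string(), adapter);
        Ok(())
    }

    /// Returns true if an adapter with this name was resident.
    fn unload(&mut self, name: &str) -> bool {
        self.adapters.remove(name).is_some()
    }

    /// Loaded adapter names in sorted order, e.g. for a model listing.
    fn names(&self) -> Vec<String> {
        self.adapters.keys().cloned().collect()
    }
}
```

Building the new adapter before touching the map means a failed in-place load leaves the previously loaded adapter intact.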

is_3d_lora_weight=true is accepted in the schema for compatibility but remains unsupported in this PR.

Correctness Boundary

This PR supports multiple resident adapters and request-level routing, but it does not implement fused multi-LoRA execution.

When a scheduler execution plan contains multiple adapter identities, the current implementation groups work by adapter and runs those groups sequentially:

base group      -> activate no adapter -> execute
adapter-a group -> activate adapter-a  -> execute
adapter-b group -> activate adapter-b  -> execute

That is the required correctness behavior for the current Qwen3 executor, because the model reads one active adapter at a time.
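
The grouping step can be sketched as follows. This is an illustration under assumptions rather than the PR's scheduler: only `lora_adapter: Option<String>` comes from the description above, while the `id` field, the `Executor` trait, `group_by_adapter`, `run_plan`, and `RecordingExecutor` are hypothetical. The group order (base first, then adapters by name) falls out of `BTreeMap` ordering and is also an assumption.

```rust
use std::collections::BTreeMap;

/// Only `lora_adapter` mirrors the PR description; `id` is a hypothetical stand-in.
#[derive(Debug, Clone)]
struct GenerateRequest {
    id: u64,
    lora_adapter: Option<String>,
}

/// Split a mixed plan into adapter-homogeneous groups.
/// `None` sorts before `Some(_)`, so the base group comes first.
fn group_by_adapter(plan: &[GenerateRequest]) -> Vec<(Option<String>, Vec<u64>)> {
    let mut groups: BTreeMap<Option<String>, Vec<u64>> = BTreeMap::new();
    for req in plan {
        groups.entry(req.lora_adapter.clone()).or_default().push(req.id);
    }
    groups.into_iter().collect()
}

/// Hypothetical executor interface: one active adapter at a time.
trait Executor {
    fn activate(&mut self, adapter: Option<&str>);
    fn execute(&mut self, request_ids: &[u64]);
}

/// Activate each group's adapter, then execute that group, sequentially.
fn run_plan<E: Executor>(exec: &mut E, plan: &[GenerateRequest]) {
    for (adapter, ids) in group_by_adapter(plan) {
        exec.activate(adapter.as_deref());
        exec.execute(&ids);
    }
}

/// Fake executor that records which adapter was active for each batch.
#[derive(Default)]
struct RecordingExecutor {
    active: Option<String>,
    log: Vec<(Option<String>, Vec<u64>)>,
}

impl Executor for RecordingExecutor {
    fn activate(&mut self, adapter: Option<&str>) {
        self.active = adapter.map(String::from);
    }
    fn execute(&mut self, request_ids: &[u64]) {
        self.log.push((self.active.clone(), request_ids.to_vec()));
    }
}
```

Running a mixed plan through `run_plan` with a `RecordingExecutor` yields one log entry per adapter identity, and no entry mixes requests from different adapters.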

Follow-up performance work should add a vLLM/Punica-style token-to-LoRA mapping and fused LoRA kernels so one forward pass can handle multiple adapters more efficiently.

Frontend Note

The Rust vllm_server path validates OpenAI model names before dispatching to PegaInfer's engine bridge. To keep adapter-as-model semantics without forking vllm_server, LoRA mode keeps using the same-port proxy workaround:

  • known adapter model names are rewritten to the base served model for vLLM validation;
  • the selected adapter name is forwarded through vllm_xargs;
  • /v1/models is overlaid with dynamically loaded adapter names;
  • non-LoRA mode still uses the normal vLLM server path.

This should become simpler if vLLM Rust exposes a native route/model-alias extension point.
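
The rewrite and overlay steps of the workaround can be sketched as two pure functions. This is a simplified model, assuming the proxy sees the requested model name, the base served model name, and the list of loaded adapters; `Routed`, `route_model`, and `overlay_models` are hypothetical names, and the real proxy forwards the adapter through vllm_xargs rather than a struct field.

```rust
/// Hypothetical result of the proxy's model-name rewrite.
#[derive(Debug, PartialEq)]
struct Routed {
    /// Model name passed on for validation.
    model: String,
    /// Adapter selected by the request, forwarded out of band.
    lora_adapter: Option<String>,
}

/// Known adapter names are rewritten to the base model; anything else passes
/// through unchanged so downstream validation can accept or reject it.
fn route_model(requested: &str, base_model: &str, loaded: &[String]) -> Routed {
    if loaded.iter().any(|n| n.as_str() == requested) {
        Routed {
            model: base_model.to_string(),
            lora_adapter: Some(requested.to_string()),
        }
    } else {
        Routed {
            model: requested.to_string(),
            lora_adapter: None,
        }
    }
}

/// Model listing: the base served model followed by loaded adapter names.
fn overlay_models(base_model: &str, loaded: &[String]) -> Vec<String> {
    let mut models = vec![base_model.to_string()];
    models.extend(loaded.iter().cloned());
    models
}
```

In this sketch a name that is neither the base model nor a loaded adapter is forwarded untouched, which leaves the model-not-found decision to the existing validation path.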

Scope Boundary

This PR does not add:

  • Punica-style fused multi-LoRA kernels
  • CUDA Graph support in LoRA mode
  • per-adapter CUDA Graph cache keys
  • exact vLLM error JSON compatibility for every error path
  • non-Qwen3 LoRA support
  • production memory-management policy such as --max-loras

Validation

Local checks:

  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-engine --lib lora -- --nocapture
  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-vllm-frontend --lib lora -- --nocapture
  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib lora -- --nocapture
  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib -- --nocapture
  • cargo fmt --all --check && PEGAINFER_CUDA_SM=80 cargo check --workspace --all-targets
  • python -m py_compile tools/lora/qwen3_lora_live_parity.py
  • python -m py_compile tools/lora/qwen3_lora_live_stress.py
  • git diff --check

Scheduler grouping coverage:

  • mixed_lora_prefill_requests_run_in_adapter_groups constructs one execution plan containing base, adapter-a, and adapter-b.
  • The fake executor observes grouped execution under None, adapter-a, and adapter-b, proving a mixed plan is split into adapter-homogeneous groups before execution.

Live Qwen3-4B parity smoke:

  • Model: Qwen3-4B local checkpoint
  • Request model: loaded adapter name parity
  • Max tokens: 8
  • Result:
    • /v1/load_lora_adapter succeeded
    • HF/PEFT token IDs matched PegaInfer token IDs exactly
    • match: true
    • first_token_mismatch: null

Live multi-LoRA routing smoke:

  • Loaded five adapters into one server process.
  • Queried each adapter by OpenAI model name.
  • scale-0_05, scale-0_1, and scale-0_25 matched HF/PEFT for the checked fixture.
  • scale-0_05 and scale-0_1 produced distinct outputs, each matching its own HF/PEFT reference, which shows that the same live server selected different resident adapters by request model name.

Live API stress:

  • Hardware: single GPU test node
  • Model: Qwen3-4B local checkpoint
  • Command shape:
python tools/lora/qwen3_lora_live_stress.py \
  --model-path "$QWEN3_4B_MODEL_PATH" \
  --startup-timeout-s 600 \
  --concurrency 12 \
  --rounds 3 \
  --max-tokens 8

Result:

  • ok: true
  • loaded adapters: stress-a, stress-b, stress-c
  • /v1/models after load: qwen3-base, stress-a, stress-b, stress-c
  • 36 concurrent completion requests across base and adapter model names
  • duplicate stress-b load without load_inplace: HTTP 400
  • load_inplace=true replacement for stress-b: success, followed by successful completion
  • unload stress-c: success
  • /v1/models no longer listed stress-c
  • completion against unloaded stress-c: HTTP 404 model-not-found

TP2 live API stress:

  • Hardware: two GPUs on one test node
  • Model: Qwen3-4B local checkpoint
  • Branch head: baaa283
  • Command shape:
python tools/lora/qwen3_lora_live_stress.py \
  --model-path "$QWEN3_4B_MODEL_PATH" \
  --startup-timeout-s 900 \
  --concurrency 4 \
  --rounds 2 \
  --max-tokens 8 \
  --tp-size 2

Result:

  • ok: true
  • loaded adapters: stress-a, stress-b, stress-c
  • /v1/models after load: qwen3-base, stress-a, stress-b, stress-c
  • 8 concurrent completion requests across base and adapter model names
  • duplicate stress-b load without load_inplace: HTTP 400, reported by both TP ranks
  • load_inplace=true replacement for stress-b: success, followed by successful completion
  • unload stress-c: success on both TP ranks
  • /v1/models no longer listed stress-c
  • completion against unloaded stress-c: HTTP 404 model-not-found

Notes

  • Very large synthetic adapter scales can diverge from HF/PEFT over longer outputs; those cases are not used as acceptance evidence for this PR.
  • The live stress test proves concurrent API behavior and multiple adapter residency. Same-plan grouping is covered by scheduler unit instrumentation rather than inferred from HTTP timing.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for dynamic LoRA adapter unloading and listing across the engine, scheduler, and vLLM frontend, as well as request grouping by LoRA adapter during execution. It also adds startup LoRA module loading via CLI arguments and includes a new stress test script. Feedback on the changes highlights a performance bottleneck in Qwen3Executor::list_lora_adapters, where querying the primary worker thread on every scheduler iteration introduces blocking channel communication overhead; caching the loaded adapter names locally is recommended to resolve this.

Comment thread pegainfer-qwen3-4b/src/executor.rs

@xiaguan xiaguan left a comment

Could we move the LoRA-related tools under tools/lora?

Otherwise LGTM.

@NolanHo NolanHo force-pushed the feat/qwen3-lora-pr3-request-lora branch from f5137e3 to 0969363 Compare May 29, 2026 06:24
NolanHo commented May 29, 2026

Addressed the review comments in 1806c9f:

  • Qwen3Executor now keeps a local sorted adapter-name cache for scheduler admission/listing, so list_lora_adapters() no longer performs a blocking rank command-channel round trip in the scheduler loop.
  • LoRA live tools have been moved under tools/lora/.
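
A local sorted name cache of the kind described in the first bullet could look like the sketch below. It assumes the cache is updated whenever a load or unload succeeds; `AdapterNameCache` and its methods are hypothetical names, not the actual `Qwen3Executor` fields.

```rust
/// Hypothetical executor-side cache of loaded adapter names, kept sorted.
#[derive(Default)]
struct AdapterNameCache {
    names: Vec<String>,
}

impl AdapterNameCache {
    /// Record a successful load; inserting at the search position keeps order.
    fn on_loaded(&mut self, name: &str) {
        if let Err(pos) = self.names.binary_search_by(|n| n.as_str().cmp(name)) {
            self.names.insert(pos, name.to_string());
        }
    }

    /// Record a successful unload.
    fn on_unloaded(&mut self, name: &str) {
        if let Ok(pos) = self.names.binary_search_by(|n| n.as_str().cmp(name)) {
            self.names.remove(pos);
        }
    }

    /// Admission check: base requests always pass, adapter requests must name
    /// a cached adapter. No channel round trip is involved.
    fn admits(&self, adapter: Option<&str>) -> bool {
        match adapter {
            None => true,
            Some(name) => self
                .names
                .binary_search_by(|n| n.as_str().cmp(name))
                .is_ok(),
        }
    }

    /// Sorted listing served directly from the cache.
    fn list(&self) -> &[String] {
        &self.names
    }
}
```

Because both the lookup and the listing read local state, the scheduler loop never waits on a worker thread for this information; the cost is that the cache must be kept in step with the actual load and unload results.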

Validation rerun:

  • cargo fmt --all --check
  • python -m py_compile tools/lora/qwen3_lora_live_parity.py tools/lora/qwen3_lora_live_stress.py
  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib lora -- --nocapture
  • PEGAINFER_CUDA_SM=80 cargo check --workspace --all-targets

@xiaguan xiaguan merged commit 8cc2f65 into xiaguan:main May 29, 2026
1 check passed
@NolanHo NolanHo deleted the feat/qwen3-lora-pr3-request-lora branch May 29, 2026 07:20