feat(qwen3): support tensor-parallel LoRA adapter loading #190

Merged
xiaguan merged 1 commit into xiaguan:main from NolanHo:feat/qwen3-lora-pr2-tp-clean
May 28, 2026
Contributor

@NolanHo NolanHo commented May 28, 2026

Summary

Part of #173.

This PR extends the Qwen3 LoRA MVP from PR1 to tensor-parallel execution while keeping the same correctness-first scope:

  • single active adapter per engine
  • dynamic /v1/load_lora_adapter
  • adapter loading only when the scheduler is idle
  • CUDA Graph disabled in LoRA mode
  • no per-request or mixed-adapter batching yet

The branch is stacked on the PR1 LoRA control/API work until PR1 lands.

What Changed

  • Added rank-local LoRA adapter sharding for Qwen3 TP:
    • q_proj, k_proj, v_proj, gate_proj, up_proj: replicate LoRA A, row-shard LoRA B.
    • o_proj, down_proj: column-shard LoRA A, replicate LoRA B.
  • Updated Qwen3 executor LoRA loading to install the same adapter name on all TP ranks and aggregate per-rank load errors.
  • Removed the Qwen3 server-side --enable-lora && --tp-size != 1 rejection.
  • Added --tp-size to tools/qwen3_lora_live_parity.py so the live HF/PEFT parity smoke can exercise TP.
  • Added unit coverage for TP sharding shape/range behavior.
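The sharding rule above can be sketched in a few lines. This is an illustration, not the PR's Rust implementation: it assumes the PEFT layout (LoRA A is `[r][in_features]`, LoRA B is `[out_features][r]`), uses plain Python lists in place of tensors, and the helper names `shard_range`, `shard_lora`, and `matvec` are hypothetical. It also checks that concatenating (column-parallel) or summing (row-parallel) the per-rank results reproduces the unsharded LoRA delta.

```python
# Illustrative sketch only: plain Python lists stand in for tensors.
# Assumed layout (PEFT convention): A is [r][in_features], B is [out_features][r].

COLUMN_PARALLEL = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj"}
ROW_PARALLEL = {"o_proj", "down_proj"}


def shard_range(dim, rank, tp_size):
    """Contiguous, equal-sized slice of `dim` owned by `rank` (hypothetical helper)."""
    if dim % tp_size != 0:
        raise ValueError(f"dimension {dim} is not divisible by tp_size {tp_size}")
    step = dim // tp_size
    return rank * step, (rank + 1) * step


def shard_lora(module, a, b, rank, tp_size):
    """Return the (A, B) pair that a single TP rank would keep."""
    if module in COLUMN_PARALLEL:
        # Base weight is split along out_features: keep all of A, slice rows of B.
        lo, hi = shard_range(len(b), rank, tp_size)
        return a, b[lo:hi]
    if module in ROW_PARALLEL:
        # Base weight is split along in_features: slice columns of A, keep all of B.
        lo, hi = shard_range(len(a[0]), rank, tp_size)
        return [row[lo:hi] for row in a], b
    raise ValueError(f"unsupported LoRA target module: {module}")


def matvec(m, x):
    """Matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


# Tiny example: r = 2, in_features = 4, out_features = 4, tp_size = 2.
A = [[1, 2, 3, 4], [0, 1, 0, 1]]
B = [[1, 0], [0, 1], [2, 1], [1, 3]]
x = [1, 1, 2, 2]
full = matvec(B, matvec(A, x))  # unsharded LoRA delta B(Ax)

# Column-parallel: each rank produces a slice of the output; concatenate them.
col = []
for rank in range(2):
    a_r, b_r = shard_lora("q_proj", A, B, rank, 2)
    col += matvec(b_r, matvec(a_r, x))

# Row-parallel: each rank sees a slice of the input; sum the partial outputs,
# mirroring the reduction that normally follows a row-parallel layer.
row = [0] * len(B)
for rank in range(2):
    a_r, b_r = shard_lora("o_proj", A, B, rank, 2)
    lo, hi = shard_range(len(x), rank, 2)
    partial = matvec(b_r, matvec(a_r, x[lo:hi]))
    row = [s + p for s, p in zip(row, partial)]

assert col == full and row == full
```

In this sketch the replicated matrix is always the one that touches the unsplit dimension, so no extra communication is needed beyond what the base layer would already perform. Whether the actual executor applies B before or after the reduction is not stated in this PR.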

Scope Boundary

This PR does not add:

  • multiple active adapters
  • per-request adapter selection
  • mixed base + LoRA batching
  • /v1/unload_lora_adapter
  • CUDA Graph cache keys per adapter
  • optimized grouped LoRA kernels

Those remain follow-up work for the later staged LoRA PRs in #173.

Validation

Local checks:

  • cargo fmt --check
  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib lora -- --nocapture
  • PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib scheduler -- --nocapture
  • PEGAINFER_CUDA_SM=80 cargo check -p pegainfer-server
  • python -m py_compile tools/qwen3_lora_live_parity.py
  • git diff --check

Worker5 TP2 live parity:

  • Branch commit: ac91cff
  • Model: Qwen3-4B
  • Command shape:
python tools/qwen3_lora_live_parity.py \
  --model-path Qwen3-4B \
  --port 18112 \
  --startup-timeout-s 360 \
  --max-tokens 8 \
  --disable-peft-adapter-autocast \
  --tp-size 2

Result:

  • Server started with --enable-lora --tp-size 2.
  • /v1/load_lora_adapter returned Success: LoRA adapter 'parity' added successfully.
  • HF/PEFT and PegaInfer both generated the same text: " about a young girl named Lila who".
  • Token IDs matched exactly: [911, 264, 3908, 3743, 6941, 444, 10524, 879].
  • Result summary had "match": true and "first_token_mismatch": null.
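For readers unfamiliar with this kind of check, a token-level parity comparison might look like the sketch below. The field names `match` and `first_token_mismatch` come from the result summary above; everything else is an assumption, including the function name `compare_token_ids` and the choice to report the mismatch as a zero-based index. It is not the code in tools/qwen3_lora_live_parity.py.

```python
import json


def compare_token_ids(reference_ids, candidate_ids):
    """Compare two token ID sequences (hypothetical helper, not the tool's code)."""
    first_mismatch = None
    for i, (ref, got) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != got:
            first_mismatch = i
            break
    # Equal prefix but different lengths: the mismatch starts where the
    # shorter sequence ends.
    if first_mismatch is None and len(reference_ids) != len(candidate_ids):
        first_mismatch = min(len(reference_ids), len(candidate_ids))
    return {"match": first_mismatch is None, "first_token_mismatch": first_mismatch}


# Token IDs reported in the validation result above.
ids = [911, 264, 3908, 3743, 6941, 444, 10524, 879]
summary = compare_token_ids(ids, list(ids))
print(json.dumps(summary))
```

Comparing token IDs rather than decoded text avoids false matches where different token sequences decode to the same string.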

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables tensor-parallel loading of Qwen3 LoRA adapters, removing the previous single-GPU limitation. It introduces adapter sharding logic (shard_for_tensor_parallel) for row-parallel and column-parallel projections, updates the executor to shard and distribute adapters to all workers, and adds corresponding unit and integration tests. Feedback points out a potential desynchronization issue where a sharding failure on a later rank could leave earlier ranks in an inconsistent state. It is recommended to pre-shard the adapters for all ranks before sending any load commands to the workers.
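The recommended ordering can be sketched as a two-phase load. This is a minimal illustration under stated assumptions, not the code in pegainfer-qwen3-4b/src/executor.rs: `shard_for_rank` and `send_load` are hypothetical callables standing in for the sharding step and the per-worker load command.

```python
def load_adapter_on_all_ranks(adapter, tp_size, shard_for_rank, send_load):
    """Two-phase load: shard for every rank first, then dispatch.

    `shard_for_rank(adapter, rank, tp_size)` and `send_load(rank, shard)` are
    hypothetical callables; they are assumptions made for this sketch.
    """
    # Phase 1: any sharding error is raised here, before a worker is touched.
    shards = [shard_for_rank(adapter, rank, tp_size) for rank in range(tp_size)]

    # Phase 2: dispatch to every rank and collect per-rank failures.
    errors = []
    for rank, shard in enumerate(shards):
        try:
            send_load(rank, shard)
        except Exception as exc:
            errors.append(f"rank {rank}: {exc}")
    if errors:
        raise RuntimeError("LoRA load failed on " + "; ".join(errors))
```

With this ordering, a sharding failure for a later rank cannot leave earlier ranks holding a new adapter. A worker-side failure in phase 2 can still leave ranks out of sync, so the aggregated error would need to be handled by the caller.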

Comment thread pegainfer-qwen3-4b/src/executor.rs Outdated
@NolanHo NolanHo force-pushed the feat/qwen3-lora-pr2-tp-clean branch from ac91cff to 6432037 Compare May 28, 2026 11:55
Owner

@xiaguan xiaguan left a comment

LGTM

@xiaguan xiaguan merged commit d08851b into xiaguan:main May 28, 2026
1 check passed
2 participants