feat(qwen3): support tensor-parallel LoRA adapter loading #190
Merged
Code Review
This pull request enables tensor-parallel loading of Qwen3 LoRA adapters, removing the previous single-GPU limitation. It introduces adapter sharding logic (shard_for_tensor_parallel) for row-parallel and column-parallel projections, updates the executor to shard and distribute adapters to all workers, and adds corresponding unit and integration tests. Feedback points out a potential desynchronization issue where a sharding failure on a later rank could leave earlier ranks in an inconsistent state. It is recommended to pre-shard the adapters for all ranks before sending any load commands to the workers.
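The recommendation above can be sketched as a two-phase load. This is only an illustration in Python, not the executor code from this PR: `load_adapter_on_all_ranks`, `shard_fn`, and `send_fn` are hypothetical names standing in for the sharding step and the per-worker load command.

```python
def load_adapter_on_all_ranks(adapter, tp_size, shard_fn, send_fn):
    """Shard for every rank first, then dispatch (hypothetical helper).

    shard_fn(adapter, rank, tp_size) returns the shard for one rank or raises.
    send_fn(rank, shard) delivers a load command to one worker.
    Both callables are assumptions made for this sketch.
    """
    # Phase 1: build every shard up front. If any rank fails to shard,
    # the exception propagates before a single worker has been contacted.
    shards = [shard_fn(adapter, rank, tp_size) for rank in range(tp_size)]

    # Phase 2: all shards exist, so it is now safe to send load commands.
    for rank, shard in enumerate(shards):
        send_fn(rank, shard)
    return len(shards)
```

With this ordering, a sharding error on a later rank cannot leave earlier ranks holding an adapter that the remaining ranks never received. It does not cover failures in the send step itself, which would need separate handling.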
ac91cff to 6432037
Summary
Part of #173.
This PR extends the Qwen3 LoRA MVP from PR1 to tensor-parallel execution while keeping the same correctness-first scope:
- `/v1/load_lora_adapter`

The branch is stacked on the PR1 LoRA control/API work until PR1 lands.
What Changed
- `q_proj`, `k_proj`, `v_proj`, `gate_proj`, `up_proj`: replicate LoRA `A`, row-shard LoRA `B`.
- `o_proj`, `down_proj`: column-shard LoRA `A`, replicate LoRA `B`.
- Removed the `--enable-lora && --tp-size != 1` rejection.
- Added `--tp-size` to `tools/qwen3_lora_live_parity.py` so the live HF/PEFT parity smoke can exercise TP.

Scope Boundary
This PR does not add:
- `/v1/unload_lora_adapter`

Those remain follow-up work for the later staged LoRA PRs in #173.
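For reference, the per-projection sharding rules listed under What Changed can be sketched in a few lines of Python. This is not the `shard_for_tensor_parallel` implementation from this PR: `shard_lora` and `matvec` are hypothetical helpers, matrices are plain nested lists, LoRA `A` is assumed to have shape `[r][in]` and LoRA `B` shape `[out][r]`, and dimensions are assumed to divide evenly by the TP size.

```python
# Module names are taken from the PR description; everything else is illustrative.
COLUMN_PARALLEL = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj"}
ROW_PARALLEL = {"o_proj", "down_proj"}


def shard_lora(module, lora_a, lora_b, rank, tp_size):
    """Return the (A, B) pair one rank would hold (hypothetical helper)."""
    if module in COLUMN_PARALLEL:
        # Output dim is split across ranks: keep A whole, slice rows of B.
        rows = len(lora_b)
        if rows % tp_size != 0:
            raise ValueError(f"{module}: {rows} rows not divisible by {tp_size}")
        step = rows // tp_size
        return lora_a, lora_b[rank * step:(rank + 1) * step]
    if module in ROW_PARALLEL:
        # Input dim is split across ranks: slice columns of A, keep B whole.
        cols = len(lora_a[0])
        if cols % tp_size != 0:
            raise ValueError(f"{module}: {cols} cols not divisible by {tp_size}")
        step = cols // tp_size
        return [row[rank * step:(rank + 1) * step] for row in lora_a], lora_b
    raise ValueError(f"unsupported LoRA target: {module}")


def matvec(matrix, vector):
    """Plain matrix-vector product, used only to check the sharding."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]
```

Under these assumptions, concatenating the per-rank outputs `B_i (A x)` for a column-parallel module reproduces `B (A x)`, and summing the per-rank partials `B (A_i x_i)` for a row-parallel module (the role an all-reduce would play) gives the same result.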
Validation
Local checks:
- `cargo fmt --check`
- `PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib lora -- --nocapture`
- `PEGAINFER_CUDA_SM=80 cargo test -p pegainfer-qwen3-4b --lib scheduler -- --nocapture`
- `PEGAINFER_CUDA_SM=80 cargo check -p pegainfer-server`
- `python -m py_compile tools/qwen3_lora_live_parity.py`
- `git diff --check`

Worker5 TP2 live parity:

- `ac91cff`
- `Qwen3-4B`

Result:

- `--enable-lora --tp-size 2`.
- `/v1/load_lora_adapter` returned `Success: LoRA adapter 'parity' added successfully.`
- `about a young girl named Lila who`.
- `[911, 264, 3908, 3743, 6941, 444, 10524, 879]`.
- `"match": true` and `"first_token_mismatch": null`.
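A token-level parity report with the two fields shown above could be produced along these lines. This is a sketch, not the logic in `tools/qwen3_lora_live_parity.py`: `compare_token_ids` is a hypothetical helper, and reporting the mismatch as an index is an assumption; only the field names `match` and `first_token_mismatch` come from the result above.

```python
import json


def compare_token_ids(reference, candidate):
    """Compare two token-id sequences position by position (hypothetical helper)."""
    for index, (ref, got) in enumerate(zip(reference, candidate)):
        if ref != got:
            # Assumption: the first differing position is reported as an index.
            return {"match": False, "first_token_mismatch": index}
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return {"match": False,
                "first_token_mismatch": min(len(reference), len(candidate))}
    return {"match": True, "first_token_mismatch": None}


def report_json(reference, candidate):
    """Serialize the comparison so None becomes JSON null."""
    return json.dumps(compare_token_ids(reference, candidate))
```

Passing the same token-id list as both arguments serializes to `{"match": true, "first_token_mismatch": null}`.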