[Feat] Add support for MLP tensor splitting in the DeepSeek DenseFFN layer #4257
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces tensor parallelism for the DenseFFN layers in DeepSeek models to reduce memory overhead on decode nodes. The approach of creating a dedicated tensor parallel group for these specific layers is sound. However, I've identified a few critical issues in the implementation. The configuration validation messages are contradictory, which could lead to user confusion. More importantly, the logic for selecting the new parallel group is too broad and would incorrectly apply it to MoE layers as well, which needs to be corrected. Please see the detailed comments for suggestions on how to fix these issues.
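For reference, a minimal sketch of the kind of layer check that would keep the new group away from MoE layers — it assumes the standard DeepSeek HF config field `first_k_dense_replace`; the helper name below is hypothetical, not the PR's actual code:

```python
# Sketch only (not the PR's implementation): decide whether a decoder layer
# uses a dense FFN, based on the DeepSeek HF config field `first_k_dense_replace`.
def uses_dense_ffn(layer_idx: int, hf_config) -> bool:
    """Layers with index < first_k_dense_replace are dense; the rest are MoE."""
    return layer_idx < getattr(hf_config, "first_k_dense_replace", 0)
```

The dedicated DenseFFN tensor-parallel group would then only be selected when this check returns True, so MoE layers keep their existing parallel group.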
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: yuweivvv <[email protected]> Signed-off-by: 子潜 <[email protected]>
6b4a150 to 3804c0f
Co-authored-by: zzhx1 <[email protected]> Signed-off-by: 子潜 <[email protected]>
3fe5cca to 32b3e92
What this PR does / why we need it?
Motivation
In the DeepSeek-R1 architecture, the first 3 blocks utilize Dense Feed-Forward Network (DenseFFN) layers, while the subsequent 58 blocks are MoE layers. The Gate_Up_Proj and Down_Proj matrices in these DenseFFN layers have large dimensions ([7168, 36864] and [18432, 7168] respectively).
This results in significant memory overhead:
FP16/BF16: ~1.48 GB per layer × 3 layers = 4.44 GB
W8A8 Quantization: ~1.11 GB total
During the decoding phase, a pure Data Parallel (DP) strategy requires every device to store the complete weights for these non-MoE layers. This redundancy causes excessive memory usage on each card.
Proposed Changes
This PR implements partial Tensor Parallelism (TP) splitting for the memory-intensive DenseFFN weights. Instead of replicating the full weights across all devices (DP), we split these specific tensors across the available GPUs.
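As a rough illustration of what the split amounts to, here is a sketch under the usual Megatron-style assumption of a column-parallel gate_up projection and a row-parallel down projection, using the dimensions quoted above (not necessarily the exact code path in this PR):

```python
# Sketch: per-rank shard shapes for the DeepSeek-R1 dense MLP weights when
# they are split over a dedicated DenseFFN tensor-parallel group.
HIDDEN = 7168                    # hidden size
INTERMEDIATE = 18432             # dense MLP intermediate size
GATE_UP_COLS = 2 * INTERMEDIATE  # fused gate + up projection -> 36864


def denseffn_shard_shapes(denseffn_tp: int):
    """Column-parallel gate_up_proj (split output dim), row-parallel down_proj
    (split input dim); partial outputs are recombined with an all-reduce."""
    assert GATE_UP_COLS % denseffn_tp == 0 and INTERMEDIATE % denseffn_tp == 0
    gate_up_shard = (HIDDEN, GATE_UP_COLS // denseffn_tp)
    down_shard = (INTERMEDIATE // denseffn_tp, HIDDEN)
    return gate_up_shard, down_shard


if __name__ == "__main__":
    print(denseffn_shard_shapes(1))  # (7168, 36864), (18432, 7168): full replica
    print(denseffn_shard_shapes(8))  # (7168, 4608),  (2304, 7168): 1/8 per rank
```

Each rank then holds 1/denseffn_tp of the DenseFFN weights instead of a full replica, at the cost of one all-reduce per DenseFFN forward pass — the trade-off evaluated in the Impact Analysis below.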
Impact Analysis
Memory Efficiency: significantly reduces the memory footprint of weights on individual cards.

Performance: The performance gain from parallelizing the large matrix multiplications outweighs the introduced communication overhead.
Conclusion: We achieve lower VRAM usage without compromising TPOT (Time Per Output Token).
Does this PR introduce any user-facing change?
This PR introduces a new parameter in additional_config.
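For illustration, a hypothetical offline-inference sketch of setting this parameter (assuming additional_config is forwarded through the LLM entrypoint as in upstream vLLM; the model id and value are placeholders, and the equivalent CLI flag is shown after the block):

```python
from vllm import LLM, SamplingParams

# Sketch: pass the new option through additional_config (offline inference).
# Whether LLM(...) accepts additional_config directly depends on the vLLM version;
# the denseffn_tensor_parallel_size key is the parameter introduced by this PR.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # placeholder model id
    additional_config={"denseffn_tensor_parallel_size": 8},
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```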
--additional_config='{"denseffn_tensor_parallel_size": 8}'

How was this patch tested?