
Conversation


@yuweivvv commented on Nov 18, 2025

What this PR does / why we need it?

Motivation

In the DeepSeek-R1 architecture, the first 3 blocks utilize Dense Feed-Forward Network (DenseFFN) layers, while the subsequent 58 blocks are MoE layers. The Gate_Up_Proj and Down_Proj matrices in these DenseFFN layers have large dimensions ([7168, 36864] and [18432, 7168] respectively).

This results in significant memory overhead:

  • FP16/BF16: ~0.74 GiB per layer × 3 layers ≈ 2.2 GiB
  • W8A8 quantization: ~1.11 GiB total

During the decoding phase, using a pure Data Parallel (DP) strategy requires each GPU to store the complete model weights for these non-MoE layers. This redundancy causes excessive VRAM usage on each device.
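
For reference, the figures above follow directly from the weight shapes; here is a quick sanity-check sketch (shapes are the ones quoted in this description, byte counts assume 2 bytes per BF16 weight and 1 byte per W8A8 weight):

```python
# Quick check of the DenseFFN weight footprint quoted above.
# Shapes come from the PR description; this is not vllm-ascend code.
GATE_UP_SHAPE = (7168, 36864)   # fused gate_proj + up_proj
DOWN_SHAPE = (18432, 7168)      # down_proj
DENSE_LAYERS = 3                # first_k_dense_replace in DeepSeek-R1

params_per_layer = (GATE_UP_SHAPE[0] * GATE_UP_SHAPE[1]
                    + DOWN_SHAPE[0] * DOWN_SHAPE[1])          # ~396M weights

bf16_per_layer_gib = params_per_layer * 2 / 2**30             # 2 bytes per weight
w8a8_total_gib = params_per_layer * DENSE_LAYERS / 2**30      # 1 byte per weight

print(f"BF16 per layer: {bf16_per_layer_gib:.2f} GiB")            # ~0.74 GiB
print(f"W8A8, {DENSE_LAYERS} layers: {w8a8_total_gib:.2f} GiB")   # ~1.11 GiB
```

Under pure DP, every decode rank pays this full cost; after the split proposed below, each rank holds roughly 1/denseffn_tensor_parallel_size of it.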

Proposed Changes

This PR implements partial Tensor Parallelism (TP) splitting for the memory-intensive DenseFFN weights. Instead of replicating the full weights across all devices (DP), we split these specific tensors across the available GPUs.
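
As a rough illustration of what splitting these tensors typically looks like, here is a minimal Megatron-style sketch in plain PyTorch: the fused gate_up projection is sharded column-wise, the down projection row-wise, and a single all-reduce recombines the partial outputs. Class and argument names are illustrative and are not taken from this PR's code.

```python
# Minimal sketch of column/row-parallel sharding for a dense FFN.
# Assumes torch.distributed is already initialized when tp_size > 1.
import torch
import torch.distributed as dist
import torch.nn.functional as F


class ShardedDenseFFN(torch.nn.Module):
    def __init__(self, hidden: int, intermediate: int, tp_size: int, tp_group=None):
        super().__init__()
        assert intermediate % tp_size == 0
        shard = intermediate // tp_size
        self.tp_group = tp_group
        # Column-parallel: each rank holds 1/tp_size of the intermediate dim
        # for both the gate and the up projection (fused layout).
        self.gate_up = torch.nn.Linear(hidden, 2 * shard, bias=False)
        # Row-parallel: each rank consumes only its own intermediate shard.
        self.down = torch.nn.Linear(shard, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        y = self.down(F.silu(gate) * up)             # partial sum on each rank
        if dist.is_initialized():
            dist.all_reduce(y, group=self.tp_group)  # combine partial outputs
        return y
```

The all-reduce at the end is the communication overhead referred to in the Impact Analysis below.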

Impact Analysis

  • Memory efficiency: significantly reduces the per-device memory footprint of these weights.
  • Performance: the gain from parallelizing the large matrix multiplications outweighs the added communication overhead.
  • Conclusion: lower VRAM usage without compromising TPOT (Time Per Output Token).

Does this PR introduce any user-facing change?

This PR introduces a new parameter in additional_config.


--additional_config={"denseffn_tensor_parallel_size": 8}
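
A hypothetical offline-inference invocation with this option might look like the following; the additional_config key comes from this PR, the model name and sampling settings are placeholders, and passing additional_config through the Python LLM entry point is assumed to behave the same way as the CLI flag:

```python
# Hypothetical usage sketch: enable the DenseFFN TP split via additional_config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",                          # placeholder model path
    additional_config={"denseffn_tensor_parallel_size": 8},   # option added by this PR
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```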

How was this patch tested?

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces tensor parallelism for the DenseFFN layers in DeepSeek models to reduce memory overhead on decode nodes. The approach of creating a dedicated tensor parallel group for these specific layers is sound. However, I've identified a few critical issues in the implementation. The configuration validation messages are contradictory, which could lead to user confusion. More importantly, the logic for selecting the new parallel group is too broad and would incorrectly apply it to MoE layers as well, which needs to be corrected. Please see the detailed comments for suggestions on how to fix these issues.
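
To make the second point concrete, the guard being requested presumably keys off the layer index relative to DeepSeek's first_k_dense_replace config field (3 for DeepSeek-R1); the helper below is a sketch of that check, not the actual change in this PR:

```python
# Sketch: restrict the DenseFFN TP group to the leading dense (non-MoE) layers.
def use_denseffn_tp_group(layer_idx: int, hf_config) -> bool:
    """True only for the first `first_k_dense_replace` layers."""
    first_k_dense = getattr(hf_config, "first_k_dense_replace", 0)
    return layer_idx < first_k_dense
```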

子潜 and others added 3 commits November 18, 2025 20:08
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yuweivvv <[email protected]>
Signed-off-by: 子潜 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yuweivvv <[email protected]>
Signed-off-by: 子潜 <[email protected]>
@yuweivvv force-pushed the denseffn_tp branch 3 times, most recently from 6b4a150 to 3804c0f on November 19, 2025 at 06:48
Co-authored-by: zzhx1 <[email protected]>
Signed-off-by: 子潜 <[email protected]>
