
Conversation


@yuweivvv commented on Nov 18, 2025

What this PR does / why we need it?

Motivation

In the DeepSeek-R1 architecture, the first 3 blocks utilize Dense Feed-Forward Network (DenseFFN) layers, while the subsequent 58 blocks are MoE layers. The Gate_Up_Proj and Down_Proj matrices in these DenseFFN layers have large dimensions ([7168, 36864] and [18432, 7168] respectively).

This results in significant memory overhead:

  • FP16/BF16: ~0.74 GiB per layer × 3 layers ≈ 2.2 GiB
  • W8A8 quantization: ~1.11 GiB total

During the decoding phase, using a pure Data Parallel (DP) strategy requires each GPU to store the complete model weights for these non-MoE layers. This redundancy causes excessive VRAM usage on each device.
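
For reference, the figures above follow directly from the weight shapes; here is a quick sanity-check sketch (shapes are the ones quoted in this description, byte counts assume 2 bytes per BF16 weight and 1 byte per W8A8 weight):

```python
# Quick check of the DenseFFN weight footprint quoted above.
# Shapes come from the PR description; this is not vllm-ascend code.
GATE_UP_SHAPE = (7168, 36864)   # fused gate_proj + up_proj
DOWN_SHAPE = (18432, 7168)      # down_proj
DENSE_LAYERS = 3                # first_k_dense_replace in DeepSeek-R1

params_per_layer = (GATE_UP_SHAPE[0] * GATE_UP_SHAPE[1]
                    + DOWN_SHAPE[0] * DOWN_SHAPE[1])          # ~396M weights

bf16_per_layer_gib = params_per_layer * 2 / 2**30             # 2 bytes per weight
w8a8_total_gib = params_per_layer * DENSE_LAYERS / 2**30      # 1 byte per weight

print(f"BF16 per layer: {bf16_per_layer_gib:.2f} GiB")            # ~0.74 GiB
print(f"W8A8, {DENSE_LAYERS} layers: {w8a8_total_gib:.2f} GiB")   # ~1.11 GiB
```

Under pure DP, every decode rank pays this full cost; after the split proposed below, each rank holds roughly 1/denseffn_tensor_parallel_size of it.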

Proposed Changes

This PR implements partial Tensor Parallelism (TP) splitting for the memory-intensive DenseFFN weights. Instead of replicating the full weights across all devices (DP), we split these specific tensors across the available GPUs.
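
As a rough illustration of what splitting these tensors typically looks like, here is a minimal Megatron-style sketch in plain PyTorch: the fused gate_up projection is sharded column-wise, the down projection row-wise, and a single all-reduce recombines the partial outputs. Class and argument names are illustrative and are not taken from this PR's code.

```python
# Minimal sketch of column/row-parallel sharding for a dense FFN.
# Assumes torch.distributed is already initialized when tp_size > 1.
import torch
import torch.distributed as dist
import torch.nn.functional as F


class ShardedDenseFFN(torch.nn.Module):
    def __init__(self, hidden: int, intermediate: int, tp_size: int, tp_group=None):
        super().__init__()
        assert intermediate % tp_size == 0
        shard = intermediate // tp_size
        self.tp_group = tp_group
        # Column-parallel: each rank holds 1/tp_size of the intermediate dim
        # for both the gate and the up projection (fused layout).
        self.gate_up = torch.nn.Linear(hidden, 2 * shard, bias=False)
        # Row-parallel: each rank consumes only its own intermediate shard.
        self.down = torch.nn.Linear(shard, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        y = self.down(F.silu(gate) * up)             # partial sum on each rank
        if dist.is_initialized():
            dist.all_reduce(y, group=self.tp_group)  # combine partial outputs
        return y
```

The all-reduce at the end is the communication overhead referred to in the Impact Analysis below.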

Impact Analysis

  • Memory efficiency: significantly reduces the per-device memory footprint of these weights.
  • Performance: the gain from parallelizing the large matrix multiplications outweighs the added communication overhead.
  • Conclusion: lower VRAM usage without compromising TPOT (Time Per Output Token).

Does this PR introduce any user-facing change?

This PR introduces a new parameter in additional_config.


--additional_config={"denseffn_tensor_parallel_size": 8}
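
A hypothetical offline-inference invocation with this option might look like the following; the additional_config key comes from this PR, the model name and sampling settings are placeholders, and passing additional_config through the Python LLM entry point is assumed to behave the same way as the CLI flag:

```python
# Hypothetical usage sketch: enable the DenseFFN TP split via additional_config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",                          # placeholder model path
    additional_config={"denseffn_tensor_parallel_size": 8},   # option added by this PR
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```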

How was this patch tested?

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces tensor parallelism for the DenseFFN layers in DeepSeek models to reduce memory overhead on decode nodes. The approach of creating a dedicated tensor parallel group for these specific layers is sound. However, I've identified a few critical issues in the implementation. The configuration validation messages are contradictory, which could lead to user confusion. More importantly, the logic for selecting the new parallel group is too broad and would incorrectly apply it to MoE layers as well, which needs to be corrected. Please see the detailed comments for suggestions on how to fix these issues.
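
To make the second point concrete, the guard being requested presumably keys off the layer index relative to DeepSeek's first_k_dense_replace config field (3 for DeepSeek-R1); the helper below is a sketch of that check, not the actual change in this PR:

```python
# Sketch: restrict the DenseFFN TP group to the leading dense (non-MoE) layers.
def use_denseffn_tp_group(layer_idx: int, hf_config) -> bool:
    """True only for the first `first_k_dense_replace` layers."""
    first_k_dense = getattr(hf_config, "first_k_dense_replace", 0)
    return layer_idx < first_k_dense
```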

子潜 and others added 3 commits November 18, 2025 20:08
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yuweivvv <[email protected]>
Signed-off-by: 子潜 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yuweivvv <[email protected]>
Signed-off-by: 子潜 <[email protected]>
@yuweivvv force-pushed the denseffn_tp branch 3 times, most recently from 6b4a150 to 3804c0f on November 19, 2025 at 06:48
Co-authored-by: zzhx1 <[email protected]>
Signed-off-by: 子潜 <[email protected]>
