[https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound #9300
Conversation
Signed-off-by: Lanyu Liao <[email protected]>
📝 Walkthrough: Modified the LM head tensor-parallel size selection in tensorrt_llm/_torch/utils.py, setting lm_head_tp_size_upper_bound correctly and adding an LM_HEAD_TP_SIZE environment variable.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/utils.py (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/utils.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/utils.py
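A minimal numeric sketch of the per-TP-rank reasoning captured in the attn_qkv learning above (all names and values are illustrative assumptions, not the actual lora_manager.py code):

```python
# Sketch only: head counts are per-TP-rank, so multiplying by tp_size
# recovers the full, unsplit concatenated QKV dimension.
tp_size = 4
num_heads = 8        # query heads per TP rank (already divided by tp_size)
num_kv_heads = 2     # key/value heads per TP rank
head_dim = 128

part_sizes = [
    num_heads * head_dim * tp_size,      # Q: 4096
    num_kv_heads * head_dim * tp_size,   # K: 1024
    num_kv_heads * head_dim * tp_size,   # V: 1024
]
assert sum(part_sizes) % tp_size == 0    # TP divisibility, as the helper validates
```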
🧬 Code graph analysis (1)
tensorrt_llm/_torch/utils.py (2)
tensorrt_llm/llmapi/llm_args.py (2): world_size (459-460), world_size (469-473)
tensorrt_llm/_torch/distributed/communicator.py (2): world_size (43-44), tp_size (63-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/utils.py (3)
2-2: LGTM! The `os` import is necessary for reading the `LM_HEAD_TP_SIZE` environment variable.
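As a rough illustration of how such an environment variable might be read (a sketch only; the variable name `LM_HEAD_TP_SIZE` comes from this PR, but the default and surrounding logic here are assumptions):

```python
import os

# Sketch: fall back to a default upper bound when LM_HEAD_TP_SIZE is unset.
DEFAULT_UPPER_BOUND = 8
lm_head_tp_size_upper_bound = int(
    os.environ.get("LM_HEAD_TP_SIZE", DEFAULT_UPPER_BOUND))
```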
328-328: LGTM! The explicit error message improves debuggability by showing both `mapping.tp_size` and `lm_head_tp_size` when the assertion fails.
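For reference, an assertion of this shape (a hedged sketch, not the exact utils.py line) surfaces both values on failure:

```python
def check_divisible(tp_size: int, lm_head_tp_size: int) -> None:
    # Including both values in the message makes a failing run self-explanatory.
    assert tp_size % lm_head_tp_size == 0, (
        f"mapping.tp_size ({tp_size}) must be divisible by "
        f"lm_head_tp_size ({lm_head_tp_size})")

check_divisible(8, 4)    # passes
# check_divisible(8, 3)  # would raise AssertionError showing both values
```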
321-322: Potential issue: a non-power-of-2 upper bound is valid under the current constraints. The concern is valid but nuanced. `nearest_in_buckets` can return non-power-of-2 values when `lm_head_tp_size_upper_bound` is non-power-of-2 (e.g., 6). However, line 328's assertion only requires `mapping.tp_size % lm_head_tp_size == 0` (divisibility), not a power of 2. The current implementation therefore accepts non-power-of-2 TP sizes as long as they divide `mapping.tp_size`. If non-power-of-2 values cause performance issues downstream or conflict with kernel assumptions, add an explicit check: `lm_head_tp_size = last_positive_power_of_2(lm_head_tp_size)` after line 327, or constrain the upper bound to `last_positive_power_of_2(lm_head_tp_size_upper_bound)` on line 322.
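A small standalone sketch of the distinction (the helper name is borrowed from the suggestion above; the implementation below is an assumption, not the library's code):

```python
def last_positive_power_of_2(x: int) -> int:
    """Largest power of two less than or equal to x (x >= 1)."""
    p = 1
    while p * 2 <= x:
        p *= 2
    return p

tp_size = 12
lm_head_tp_size = 6                      # divides tp_size but is not a power of two
assert tp_size % lm_head_tp_size == 0    # the current divisibility check accepts this
print(last_positive_power_of_2(6))       # 4 -- what the suggested extra clamp would pick
```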
/bot run --disable-fail-fast
PR_Github #25037 [ run ] triggered by Bot. Commit:
Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
PR_Github #25037 [ run ] completed with state
Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
PR_Github #25056 [ run ] triggered by Bot. Commit:
PR_Github #25056 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25116 [ run ] triggered by Bot. Commit:
PR_Github #25116 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25140 [ run ] triggered by Bot. Commit:
PR_Github #25140 [ run ] completed with state
Signed-off-by: Lanyu Liao <[email protected]>
/bot run --skip-test
PR_Github #25152 [ run ] triggered by Bot. Commit:
/bot skip --comment "pipeline has passed"
PR_Github #25172 [ skip ] triggered by Bot. Commit:
PR_Github #25152 [ run ] completed with state
PR_Github #25172 [ skip ] completed with state
The lm_head_tp_size should be less than the world_size, and a new environment variable is added to allow users to configure it.
Summary by CodeRabbit
- New environment variable (LM_HEAD_TP_SIZE) for fine-tuning distributed tensor parallelism settings.
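One possible way a user might apply the new variable from Python before constructing the model (purely a usage sketch; the value shown and the point at which it must be set are assumptions):

```python
import os

# Cap the LM head tensor-parallel size via the environment variable added in
# this PR; "4" is an arbitrary example value.
os.environ["LM_HEAD_TP_SIZE"] = "4"
```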