
Conversation

@lancelly (Collaborator) commented Nov 19, 2025

The lm_head_tp_size should not exceed the world_size, and a new environment variable (LM_HEAD_TP_SIZE) is added to allow users to configure it.

Summary by CodeRabbit

  • Improvements
    • Added environment variable configuration option (LM_HEAD_TP_SIZE) for fine-tuning distributed tensor parallelism settings.
    • Enhanced automatic computation of tensor parallelism parameters with improved hardware topology awareness for better resource allocation on multi-GPU setups.

@lancelly requested a review from a team as a code owner November 19, 2025 09:45
@coderabbitai (Contributor bot) commented Nov 19, 2025

📝 Walkthrough

Modified the create_lm_head_tp_mapping utility function to compute the LM head tensor parallelism size with upper-bound awareness. The function now reads an optional environment variable for configuration and falls back to a nearest-bucket calculation if it is not set.

Changes

  • Cohort: LM Head Tensor Parallelism Configuration
    File(s): tensorrt_llm/_torch/utils.py
    Summary: Added os import. Modified create_lm_head_tp_mapping to introduce lm_head_tp_size_upper_bound (the minimum of world_size and gpus_per_node). Changed the lm_head_tp_size determination to read the optional LM_HEAD_TP_SIZE environment variable, defaulting to a nearest_in_buckets computation. Added an explicit error message to the existing divisibility assertion. Included a TODO comment for a potential GB200 platform optimization.
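To make the selection logic described above concrete, here is a minimal, self-contained sketch (not the actual implementation): _Mapping, _nearest_in_buckets, _select_lm_head_tp_size, and the bucket values are hypothetical stand-ins; only the LM_HEAD_TP_SIZE variable, the min(world_size, gpus_per_node) upper bound, and the divisibility assertion come from the change summary above.

```python
import os
from dataclasses import dataclass


@dataclass
class _Mapping:
    """Stand-in for TensorRT-LLM's Mapping; only the fields used below."""
    world_size: int
    gpus_per_node: int
    tp_size: int


def _nearest_in_buckets(value: int, buckets: list[int]) -> int:
    # Placeholder for the real nearest_in_buckets helper in utils.py:
    # pick the bucket closest to `value` (the real bucket choice may differ).
    return min(buckets, key=lambda b: abs(b - value))


def _select_lm_head_tp_size(mapping: _Mapping, token_count: int) -> int:
    # Cap the LM head TP size by both world_size and gpus_per_node.
    upper_bound = min(mapping.world_size, mapping.gpus_per_node)
    env_value = os.environ.get("LM_HEAD_TP_SIZE")
    if env_value is not None:
        lm_head_tp_size = int(env_value)  # explicit user override
    else:
        # Fall back to a nearest-bucket calculation capped by the upper bound.
        lm_head_tp_size = _nearest_in_buckets(token_count, [1, upper_bound])
    assert mapping.tp_size % lm_head_tp_size == 0, (
        f"tp_size ({mapping.tp_size}) must be divisible by "
        f"lm_head_tp_size ({lm_head_tp_size})")
    return lm_head_tp_size


if __name__ == "__main__":
    mapping = _Mapping(world_size=8, gpus_per_node=4, tp_size=8)
    print(_select_lm_head_tp_size(mapping, token_count=3))  # prints 4
```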

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Review environment variable handling logic (LM_HEAD_TP_SIZE)
  • Verify upper-bound computation correctness and interaction with nearest_in_buckets
  • Confirm assertion error message clarity and divisibility constraint validity

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning: The pull request description is incomplete; it provides only a brief one-liner without the detailed explanation, test coverage details, or checklist verification required by the template. Resolution: expand the description to include (1) a detailed explanation of the issue and solution, (2) test coverage information, and (3) completion of the PR checklist items.
  • Docstring coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title check ✅ Passed: The pull request title clearly summarizes the main change (setting the correct lm_head_tp_size_upper_bound), which aligns with the core modification in the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


@coderabbitai (Contributor bot) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46dd988 and f23f216.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/utils.py (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/utils.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/utils.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/utils.py (2)
tensorrt_llm/llmapi/llm_args.py (2)
  • world_size (459-460)
  • world_size (469-473)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • world_size (43-44)
  • tp_size (63-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/utils.py (3)

2-2: LGTM!

The os import is necessary for reading the LM_HEAD_TP_SIZE environment variable.


328-328: LGTM!

The explicit error message improves debuggability by showing both mapping.tp_size and lm_head_tp_size when the assertion fails.


321-322: Potential issue: a non-power-of-2 upper bound is allowed by the current constraints.

The concern is valid but nuanced. nearest_in_buckets can return non-power-of-2 values when lm_head_tp_size_upper_bound is non-power-of-2 (e.g., 6). However, line 328's assertion only requires mapping.tp_size % lm_head_tp_size == 0 (divisibility), not power-of-2.

The current implementation accepts non-power-of-2 TP sizes as long as they divide mapping.tp_size. If non-power-of-2 values cause performance issues downstream or conflict with kernel assumptions, add an explicit check: lm_head_tp_size = last_positive_power_of_2(lm_head_tp_size) after line 327, or constrain the upper bound to last_positive_power_of_2(lm_head_tp_size_upper_bound) on line 322.
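If a power-of-2 constraint turns out to be needed, a minimal sketch of the two options above follows; last_positive_power_of_2 is the helper named in this comment, but its implementation is assumed here to be a floor-to-power-of-two function.

```python
def _last_positive_power_of_2(x: int) -> int:
    # Assumed behavior of the last_positive_power_of_2 helper referenced above:
    # the largest power of two that is <= x (for x >= 1).
    p = 1
    while p * 2 <= x:
        p *= 2
    return p


# Option 1: round the computed lm_head_tp_size down to a power of two, e.g. 6 -> 4.
lm_head_tp_size = _last_positive_power_of_2(6)

# Option 2: round the upper bound down before the bucket calculation, e.g. 6 -> 4.
lm_head_tp_size_upper_bound = _last_positive_power_of_2(6)

assert lm_head_tp_size == 4 and lm_head_tp_size_upper_bound == 4
```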

@lancelly (Collaborator, Author) commented:

/bot run --disable-fail-fast

@lancelly requested a review from kaiyux November 19, 2025 09:59
@tensorrt-cicd (Collaborator) commented:

PR_Github #25037 [ run ] triggered by Bot. Commit: f23f216

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #25037 [ run ] completed with state SUCCESS. Commit: f23f216
/LLM/main/L0_MergeRequest_PR pipeline #18919 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@lancelly (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #25056 [ run ] triggered by Bot. Commit: 73f94e9

@tensorrt-cicd (Collaborator) commented:

PR_Github #25056 [ run ] completed with state SUCCESS. Commit: 73f94e9
/LLM/main/L0_MergeRequest_PR pipeline #18937 completed with status: 'FAILURE'

@lancelly (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #25116 [ run ] triggered by Bot. Commit: 73f94e9

@tensorrt-cicd (Collaborator) commented:

PR_Github #25116 [ run ] completed with state SUCCESS. Commit: 73f94e9
/LLM/main/L0_MergeRequest_PR pipeline #18987 completed with status: 'FAILURE'

@lancelly (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #25140 [ run ] triggered by Bot. Commit: 73f94e9

@tensorrt-cicd (Collaborator) commented:

PR_Github #25140 [ run ] completed with state SUCCESS. Commit: 73f94e9
/LLM/main/L0_MergeRequest_PR pipeline #19006 completed with status: 'SUCCESS'

@lancelly (Collaborator, Author) commented:

/bot run --skip-test

@tensorrt-cicd (Collaborator) commented:

PR_Github #25152 [ run ] triggered by Bot. Commit: ce076d6

@kaiyux enabled auto-merge (squash) November 20, 2025 06:39
@kaiyux (Member) commented Nov 20, 2025

/bot skip --comment "pipeline has passed"

@tensorrt-cicd (Collaborator) commented:

PR_Github #25172 [ skip ] triggered by Bot. Commit: ce076d6

@tensorrt-cicd (Collaborator) commented:

PR_Github #25152 [ run ] completed with state ABORTED. Commit: ce076d6
LLM/main/L0_MergeRequest_PR #19016 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd (Collaborator) commented:

PR_Github #25172 [ skip ] completed with state SUCCESS. Commit: ce076d6
Skipping testing for commit ce076d6

@kaiyux merged commit 04ad9f9 into NVIDIA:main Nov 20, 2025
5 checks passed
@lancelly deleted the fix/5667687 branch November 20, 2025 08:55