[#9198][feat] Refactor dist ops in AutoDeploy #9301
base: main
Conversation
/bot run
📝 Walkthrough

Refactoring renames distributed and fusion operations to explicitly support both PyTorch and TRT-LLM backends. Public operations are split into backend-specific variants with distinct function names and registrations. Backend selection logic is centralized, enabling pattern-based transformations to select appropriate backend ops.

Changes

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ShardingUtils as sharding_utils<br/>(_get_dist_ops)
    participant DistOps as auto_deploy.dist
    participant TRTLLMDistOps as trtllm.dist
    User->>ShardingUtils: Need dist ops
    ShardingUtils->>TRTLLMDistOps: Check is_trtllm_op_available()
    alt TRT-LLM available
        TRTLLMDistOps-->>ShardingUtils: True
        ShardingUtils-->>User: (trtllm_dist_all_gather,<br/>trtllm_dist_all_reduce)
    else TRT-LLM unavailable
        TRTLLMDistOps-->>ShardingUtils: False
        ShardingUtils->>DistOps: Use PyTorch ops
        DistOps-->>ShardingUtils: (torch_dist_all_gather,<br/>torch_dist_all_reduce)
        ShardingUtils-->>User: PyTorch dist ops
    end
    Note over ShardingUtils,TRTLLMDistOps: Backend selection centralized<br/>in _get_dist_ops
```
```mermaid
sequenceDiagram
    participant Matcher as Pattern Matcher<br/>(collectives.py)
    participant Factory as _make_allreduce_...<br/>_pattern factories
    participant GraphOps as Graph Operations
    Matcher->>Factory: Generate patterns for Torch<br/>(residual_first + x_first)
    Factory-->>Matcher: torch_patterns[]
    Matcher->>Factory: Generate patterns for TRT-LLM<br/>(residual_first + x_first)
    Factory-->>Matcher: trtllm_patterns[]
    Matcher->>GraphOps: Register all 4 patterns
    GraphOps->>GraphOps: Match graph against patterns
    alt Pattern matches x_first Torch
        GraphOps-->>Matcher: Use torch replacement
    else Pattern matches residual_first TRT-LLM
        GraphOps-->>Matcher: Use trtllm replacement
    end
    Note over Factory: Factory generates<br/>backend-specific patterns<br/>from templates
```
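The factory enumeration in this diagram can be sketched as follows. The function and variable names here are illustrative only; the real factory returns a traceable pattern callable rather than a tag string.

```python
# Hedged sketch of the factory approach: one template parameterized by
# backend and addition order yields the four registered pattern variants.
from itertools import product


def _make_allreduce_residual_rmsnorm_pattern(backend: str, add_order: str) -> str:
    # The real factory returns a traceable pattern function; a tag suffices
    # to show how the variants are enumerated.
    return f"{backend}:{add_order}"


def build_patterns() -> list[str]:
    backends = ("torch", "trtllm")
    orders = ("residual_first", "x_first")
    return [
        _make_allreduce_residual_rmsnorm_pattern(b, o) for b, o in product(backends, orders)
    ]
```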
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py (1)
1-5: Multi-backend allreduce+residual+RMSNorm patterns and registrations look solid; consider a small guard.

The pattern factory + replacement helpers correctly abstract the backend choice and addition order, and the transform cleanly registers four variants sharing `dummy_args`, `op_ignore_types`, and `scalar_workaround`. This is a nice cleanup and should keep future backend additions local.

One low-impact suggestion: in `_make_allreduce_residual_rmsnorm_pattern`, consider asserting the `add_order` value (e.g., `{"residual_first", "x_first"}`) so any accidental typos fail fast during tracing rather than silently taking the `else` branch.

Also applies to: 25-41, 43-71, 74-89, 91-115, 125-129, 149-193
tensorrt_llm/_torch/auto_deploy/utils/sharding_utils.py (1)
32-51: Backend-aware `_get_dist_ops` wiring looks correct; only minor polish possible.

The new `_get_dist_ops()` helper and its use sites are consistent:

- Returning the `.default` overloads is appropriate for `gm.graph.call_function`.
- `_shard_parameter_node` correctly maps `dim=0 → all_gather(..., dim=-1)` and `dim=1 → all_reduce(...)`, matching the intended COLUMN/ROW semantics.
- BMM and EP/MoE sharding now automatically pick TRT-LLM vs PyTorch dist ops based on `is_trtllm_op_available()`, which aligns with the PR goals.

Two optional cleanups you might consider:

- Cache `_get_dist_ops()` at module scope or the first time it's called if you expect many sharding transforms per graph; currently each call re-imports and re-checks availability.
- Update the comment in `BMMShardingInfo.validate` that refers specifically to `torch_dist_all_gather` so it describes the limitation in backend-agnostic terms (since we now dispatch via `_get_dist_ops`).

Neither affects correctness; the current implementation is fine as-is.
Also applies to: 509-515, 949-953, 1038-1041, 1109-1112
tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py (1)
30-53: Fused linear+all-reduce backend split is consistent; consider de-duplicating the legacy alias.

The new `torch_fused_linear_all_reduce` and `trtllm_fused_linear_all_reduce` ops have clear, symmetric semantics and keep mutation confined to a locally created `output` tensor, which matches the `mutates_args=()` contract. The fake implementations are also appropriate for meta/tracing use.

To reduce duplication and keep the legacy alias in sync, you could implement `trtllm_dist_fused_linear_all_reduce` as a thin wrapper:

```diff
 @torch.library.custom_op(
     "auto_deploy::trtllm_dist_fused_linear_all_reduce", mutates_args=(), device_types="cuda"
 )
 def trtllm_dist_fused_linear_all_reduce(
     input: torch.Tensor, weight: torch.Tensor, bias: Optional[torch.Tensor]
 ) -> torch.Tensor:
-    """Legacy name for trtllm_fused_linear_all_reduce.
-
-    Kept for backward compatibility with existing code.
-    This is an alias that directly implements the same logic.
-    """
-    output = torch.ops.aten.linear(input, weight, bias)
-    return trtllm_dist.trtllm_allreduce(output, op=dist.ReduceOp.SUM)
+    """Legacy name for trtllm_fused_linear_all_reduce.
+
+    Kept for backward compatibility with existing code.
+    """
+    return trtllm_fused_linear_all_reduce(input, weight, bias)
```

This ensures any future change to the TRT-LLM fused implementation automatically applies to the alias.
Also applies to: 55-72, 74-96
tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py (1)
314-325: Legacy alias behavior is consistent; optional deduplication with torch fused op

The updated `torch_quant_fused_fp8_linear_all_reduce` docstring now matches the implementation: it's purely a legacy alias that always takes the torch backend path, which aligns with the new design and avoids any hidden TRT-LLM optimization behind the old name. The fake implementation staying as a pure FP8 linear op is consistent with the new fakes above.

If you want to reduce duplication, you could delegate the legacy alias to the torch-specific fused op (e.g., via `torch.ops.auto_deploy.torch_fused_fp8_linear_all_reduce`) instead of inlining the same linear+all-reduce sequence again, but that's purely stylistic and not required.

Also applies to: 333-343
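The delegation pattern being suggested reduces to a one-line forwarder. Both function bodies below are hypothetical placeholders, not the real FP8 kernels; the point is only that the legacy name can never drift from the backend-specific op.

```python
# Toy illustration of delegating a legacy alias to the new backend-specific
# op so the two stay in sync. Bodies are placeholders, not real kernels.

def torch_fused_fp8_linear_all_reduce(inp, weight, bias=None):
    # Placeholder for the real fused FP8 linear + all-reduce path.
    return ("fp8_linear_all_reduce", inp, weight, bias)


def torch_quant_fused_fp8_linear_all_reduce(inp, weight, bias=None):
    """Legacy alias: forwards to the torch-specific fused op."""
    return torch_fused_fp8_linear_all_reduce(inp, weight, bias)
```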
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1)
75-76: Stub `trtllm_allreduce` matches signature but triggers Ruff ARG001/TRY003

The ImportError stub keeps the same signature as the real `trtllm_allreduce`, which is important for call-site compatibility. Ruff correctly flags the unused arguments and the explicit message, but changing this is optional.

If you want to satisfy Ruff without affecting behavior, you could do something like:

```diff
-    def trtllm_allreduce(tensor, op, all_reduce_params=None):
-        raise ImportError("TRT-LLM is not available.")
+    def trtllm_allreduce(tensor, op, all_reduce_params=None):
+        _ = tensor, op, all_reduce_params  # silence ARG001
+        raise ImportError("TRT-LLM is not available.")
```

or add an appropriate `# noqa` in line with your lint policy.

Please confirm what the project-preferred way of handling Ruff ARG001/TRY003 is in stubs like this (e.g., …).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- `tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py` (4 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py` (3 hunks)
- `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/utils/sharding_utils.py` (5 hunks)
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py:94-100
Timestamp: 2025-08-27T16:22:10.695Z
Learning: When there are inconsistent operator detection methods (like custom_op() vs target_op()), removing one method and standardizing on the other is often cleaner than supporting both methods simultaneously.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py`
- `tensorrt_llm/_torch/auto_deploy/utils/sharding_utils.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py`
- `tensorrt_llm/_torch/auto_deploy/utils/sharding_utils.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py`
- `tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py`
- `tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
- `tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py`
- `tensorrt_llm/_torch/auto_deploy/utils/sharding_utils.py`
- `tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py`
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py`
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
🧬 Code graph analysis (7)

tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
- tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py (4)
  - `torch_dist_all_gather` (20-29)
  - `torch_dist_all_reduce` (38-48)
  - `trtllm_dist_all_gather` (64-71)
  - `trtllm_dist_all_reduce` (82-87)

tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py (2)
- tensorrt_llm/_torch/auto_deploy/distributed/common.py (1)
  - `all_reduce` (44-48)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (2)
  - `trtllm_allreduce` (27-41)
  - `trtllm_allreduce` (75-76)

tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py (4)
- tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py (2)
  - `torch_dist_all_reduce` (38-48)
  - `trtllm_dist_all_reduce` (82-87)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (2)
  - `torch_fused_allreduce_residual_rmsnorm` (86-110)
  - `trtllm_fused_allreduce_residual_rmsnorm` (47-61)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)
  - `register_ad_pattern` (99-182)
- tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py (1)
  - `dummy_args` (153-158)

tensorrt_llm/_torch/auto_deploy/utils/sharding_utils.py (2)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1)
  - `is_trtllm_op_available` (120-122)
- tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py (4)
  - `trtllm_dist_all_gather` (64-71)
  - `trtllm_dist_all_reduce` (82-87)
  - `torch_dist_all_gather` (20-29)
  - `torch_dist_all_reduce` (38-48)

tensorrt_llm/_torch/auto_deploy/custom_ops/linear.py (2)
- tensorrt_llm/_torch/auto_deploy/distributed/common.py (1)
  - `all_reduce` (44-48)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (2)
  - `trtllm_allreduce` (27-41)
  - `trtllm_allreduce` (75-76)

tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py (2)
- tensorrt_llm/_torch/auto_deploy/distributed/common.py (3)
  - `get_world_size` (76-77)
  - `all_gather` (37-41)
  - `all_reduce` (44-48)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (4)
  - `trtllm_allgather` (22-25)
  - `trtllm_allgather` (72-73)
  - `trtllm_allreduce` (27-41)
  - `trtllm_allreduce` (75-76)

tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1)
- tensorrt_llm/_torch/auto_deploy/distributed/common.py (1)
  - `all_reduce` (44-48)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py
21-21: Unused function argument: sizes
(ARG001)
33-33: Unused function argument: sizes
(ARG001)
75-75: Unused function argument: sizes
(ARG001)
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
75-75: Unused function argument: tensor
(ARG001)
75-75: Unused function argument: op
(ARG001)
75-75: Unused function argument: all_reduce_params
(ARG001)
76-76: Avoid specifying long messages outside the exception class
(TRY003)
115-115: Unused function argument: residual
(ARG001)
115-115: Unused function argument: norm_weight
(ARG001)
115-115: Unused function argument: eps
(ARG001)
🔇 Additional comments (5)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
308-316: `is_dist_op` correctly expanded to cover TRT-LLM dist ops.

The updated docstring and `dist_ops` set are consistent with the new torch vs TRT-LLM split, and using the OpOverloadPackets here keeps matching robust across `.default` overloads. No further changes needed.

tensorrt_llm/_torch/auto_deploy/custom_ops/dist.py (1)
1-5: Clean up unused `sizes` parameters in all_gather variants to satisfy Ruff ARG001 warnings.

Verification confirms the analysis is correct:

- `sizes` is unused in `torch_dist_all_gather` (line 20), `torch_dist_all_gather_fake` (line 33), and `trtllm_dist_all_gather_fake` (line 75)
- `sizes` is correctly used in `trtllm_dist_all_gather` (line 64) where it's passed through to `trtllm_dist.trtllm_allgather()`

The suggested approach of adding `del sizes` is valid for maintaining API compatibility while silencing Ruff's unused argument warnings. Apply the diff to the three functions identified in the original review comment.

⛔ Skipped due to learnings
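The `del sizes` idiom endorsed here can be sketched on a standalone fake op. The identity body below is a hypothetical stand-in; only the signature-parity trick is the point.

```python
# Illustration of the `del sizes` idiom: the fake op keeps the real op's
# signature (so tracing sees matching schemas) while making the unused
# argument explicit for Ruff ARG001.

def torch_dist_all_gather_fake(tensor, dim=0, sizes=None):
    del sizes  # unused in the fake; kept only for signature parity
    return tensor  # single-rank fake: all_gather is identity
```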
Learnt from: tshmilnvidia
Repo: NVIDIA/TensorRT-LLM PR: 5488
File: cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/transferAgent.cpp:507-517
Timestamp: 2025-08-25T08:48:39.694Z
Learning: When reviewing potential compilation errors related to function parameter types, always check for type aliases using `using` declarations that might make seemingly incompatible types equivalent. RegisterDescs = MemoryDescs in TensorRT-LLM's transfer agent code.

tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (2)
43-68: Renamed TRT-LLM fused residual RMSNorm op wiring looks correct

The new `dist::trtllm_fused_allreduce_residual_rmsnorm` op is still driven through `trtllm_allreduce` with `AllReduceParams(fusion_op=AllReduceFusionOp.RESIDUAL_RMS_NORM, ...)`, so the fused behavior should be identical to the old name while making the backend explicit. The fake implementation returning two `empty_like(tensor)` tensors is sufficient for meta/device tracing and keeps the signature aligned.

No issues from a correctness or API-surface perspective.

120-122: Availability helper correctly ties TRT-LLM ops to both import success and MPI

`is_trtllm_op_available()` now explicitly documents and enforces that TRT-LLM ops are only considered available when the imports succeeded (`TRTLLM_OP_AVAILABLE`) and the runtime is under MPI (`is_ompi()`). That aligns well with the intent to reserve TRT-LLM fused paths for MPI mode and use the torch fallback otherwise.

You may want to confirm that the higher-level AutoDeploy transforms are exclusively checking `is_trtllm_op_available()` (and not just `TRTLLM_OP_AVAILABLE`) when deciding between the TRT-LLM vs torch fused ops.

tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py (1)
237-305: Backend-specific fused FP8 linear+all-reduce ops are correctly structured

The separation into `torch_fused_fp8_linear_all_reduce` (PyTorch dist) and `trtllm_fused_fp8_linear_all_reduce` (TRT-LLM allreduce) is sound. Both consistently reuse `torch_quant_fp8_linear`, keeping shapes/dtypes aligned across backends. The fake registrations correctly delegate only to the FP8 linear op, avoiding dist/TRT-LLM calls during tracing, which is correct.

Two minor polish suggestions (not blockers):

- For consistency: the torch path calls `dist.all_reduce()` directly on the imported module, while the TRT-LLM path goes through a helper. Consider standardizing on a wrapper import for uniformity across files.
- The module imports `trtllm_dist` unconditionally at the top level, so if TRT-LLM is unavailable, quant.py fails to load entirely (early failure is safer than late call-time errors). No additional guards needed; the current design prevents accidental calls when unavailable.
Signed-off-by: Eran Geva <[email protected]>
Force-pushed 36f7a59 to 81bb39d
/bot run

PR_Github #25054 [ run ] triggered by Bot. Commit:

Signed-off-by: Eran Geva <[email protected]>

/bot run

PR_Github #25061 [ run ] triggered by Bot. Commit:

PR_Github #25054 [ run ] completed with state

Signed-off-by: Eran Geva <[email protected]>

PR_Github #25061 [ run ] completed with state

/bot run

PR_Github #25159 [ run ] triggered by Bot. Commit:

PR_Github #25159 [ run ] completed with state

/bot run

PR_Github #25194 [ run ] triggered by Bot. Commit:

PR_Github #25194 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.