[FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass #9358
base: main
Conversation
Force-pushed from bf5c6a7 to 43141a6: …ackend (Signed-off-by: Wanli Jiang <[email protected]>)
Force-pushed from 43141a6 to ff0b584: …oE cutlass (Signed-off-by: Wanli Jiang <[email protected]>)
Force-pushed from ff0b584 to b9389bc
📝 Walkthrough

This pull request introduces comprehensive support for gated and non-gated activation types in MoE modules, dynamically adjusting intermediate size expansion and weight quantization shapes accordingly. Changes span C++ quantization logic, Python MoE backends, Nemotron H model integration with auxiliary CUDA streams for parallel execution, and weight mapping for experts. New utility functions for activation type checking and tensor operations were added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python as Python Layer
    participant MoE as MoE Module
    participant Quantization as Quantization Method
    participant Utils as Utils
    Python->>MoE: create_moe(activation_type=Swiglu/Relu2)
    MoE->>Utils: is_gated_activation(activation_type)
    Utils-->>MoE: boolean (gated)
    MoE->>Quantization: Instantiate with activation_type
    Quantization->>Quantization: Compute intermediate_size_expand_ratio (2 if gated, 1 if not)
    Quantization->>Quantization: get_weights_shapes() computes w3_w1_weight_shape<br/>= inter_size * expand_ratio
    Note over Quantization: For Relu2: expand_ratio=1<br/>For Swiglu/Gelu: expand_ratio=2
    Quantization-->>MoE: Shape info for weight allocation
    MoE-->>Python: MoE module ready
```

```mermaid
sequenceDiagram
    participant Python as Nemotron H Forward
    participant Layer as NemotronHLayer
    participant MoE as NemotronHMOE
    participant Stream as CUDA Streams
    Python->>Layer: forward(x, layer_type='E', aux_stream_dict)
    alt layer_type == 'E'
        Layer->>Stream: Get MoE streams (routing, balancer, etc.)
        Layer->>MoE: Schedule MoE computation on aux streams
        Layer->>Layer: Schedule shared MLP on main/aux streams
        MoE->>Stream: maybe_execute_in_parallel(moe_out, shared_out)
        Stream-->>Layer: Synchronized outputs
        Layer->>Layer: Sum routed + shared results
    else layer_type != 'E'
        Layer->>Layer: Standard FFN path
    end
    Layer-->>Python: Output tensor
```
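For orientation, here is a minimal standalone sketch of the gated vs non-gated shape logic that the walkthrough and the first diagram describe (plain Python; the enum members and the 2×/1× ratio mirror the PR summary, while the helper names and numbers are illustrative only):

```python
from enum import IntEnum


class ActivationType(IntEnum):
    # Subset of activation types relevant to this sketch.
    Swiglu = 0
    Geglu = 1
    Relu2 = 2


def is_gated_activation(activation_type: ActivationType) -> bool:
    # Gated activations (Swiglu/Geglu) need both a gate and an up projection.
    return activation_type in (ActivationType.Swiglu, ActivationType.Geglu)


def w3_w1_rows(intermediate_size_per_partition: int,
               activation_type: ActivationType) -> int:
    # Gated: the fused w3/w1 weight holds 2 * intermediate_size rows.
    # Non-gated (e.g. Relu2): only 1 * intermediate_size rows.
    expand_ratio = 2 if is_gated_activation(activation_type) else 1
    return intermediate_size_per_partition * expand_ratio


print(w3_w1_rows(1024, ActivationType.Swiglu))  # 2048
print(w3_w1_rows(1024, ActivationType.Relu2))   # 1024
```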
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (6)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
540-541: Align `runMoeMinLantency` fc1/fc2 shape checks with gated vs non-gated activations.

`runMoeMinLantency` still enforces `fc1_expert_weights.sizes()[1] == fc2_expert_weights.sizes()[2] * mInnerDimMultiplier * 2` (Line 540), while you now thread `base_activation_type` through and use `isGatedActivation` in `runMoe` and in `getQuantParams`. If min-latency mode is ever used with non-gated activations, this hard-coded `* 2` will reject valid shapes.

Consider mirroring the gated/ungated check from `runMoe` here (using `isGatedActivation(base_activation_type)` and an `expand_ratio`) so both paths accept the same configurations, or explicitly document/guard that the min-latency path only supports gated activations.

Also applies to: 556-559, 614-618
286-316: Duplicate fc1/fc2 bias validation block in `runMoe`.

The validation of `fc1_expert_biases`/`fc2_expert_biases` (dims and expert counts) appears twice in a row with identical logic (Lines 286–300 and 302–316). This is pre-existing, but you might want to collapse it into a single block to avoid redundant checks and future divergence.

tensorrt_llm/_torch/utils.py (1)
384-392: Harden `split` against invalid `tp_size` / index values

Current implementation assumes a valid configuration; a bad `tp_size` or `idx` would fail with less clear errors (ZeroDivisionError or IndexError).

Consider tightening it slightly:

```diff
-def split(x: torch.Tensor,
-          tp_size: int,
-          idx: int,
-          dim: int = 0) -> torch.Tensor:
-    assert x.shape[dim] % tp_size == 0
+def split(x: torch.Tensor,
+          tp_size: int,
+          idx: int,
+          dim: int = 0) -> torch.Tensor:
+    assert tp_size > 0, "tp_size must be > 0"
+    assert 0 <= idx < tp_size, f"idx must be in [0, {tp_size}), got {idx}"
+    assert x.shape[dim] % tp_size == 0, (
+        f"Dimension {dim} (size={x.shape[dim]}) must be divisible by tp_size={tp_size}"
+    )
     split_size = x.shape[dim] // tp_size
     if tp_size == 1:
         return x
     return torch.split(x, split_size, dim=dim)[idx]
```

This keeps the helper cheap while surfacing configuration bugs with clearer diagnostics.
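For context, a quick standalone usage sketch of the `split` helper being discussed (assuming the behavior shown in the diff; not the exact implementation from `tensorrt_llm/_torch/utils.py`):

```python
import torch


def split(x: torch.Tensor, tp_size: int, idx: int, dim: int = 0) -> torch.Tensor:
    # Same observable behavior as the helper under review: return this rank's shard.
    assert x.shape[dim] % tp_size == 0
    if tp_size == 1:
        return x
    return torch.split(x, x.shape[dim] // tp_size, dim=dim)[idx]


weight = torch.arange(8).reshape(4, 2)
print(split(weight, tp_size=2, idx=0))  # rows 0-1
print(split(weight, tp_size=2, idx=1))  # rows 2-3
```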
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
12-13: Activation-type metadata is correct; just ensure consumers agree on “expanded” vs “base” intermediate size

`is_gated_activation(activation_type)` and the derived `intermediate_size_expand_ratio = 2 if gated else 1` are consistent with the Swiglu/Geglu gating semantics and will help MoE backends derive shapes.

The one thing to keep in mind is that `self.intermediate_size` and `self.intermediate_size_per_partition` are still computed from the raw `intermediate_size` argument. Downstream logic (e.g., quantization `get_weights_shapes`, expert weight layouts) must consistently treat that argument as either:

- already expanded (2× for gated), and then use `intermediate_size_expand_ratio` only where necessary, or
- base size, and multiply by `intermediate_size_expand_ratio` whenever actual projection dimensions are needed.

It's worth double-checking those callers to make sure there's no mixed interpretation.
Also applies to: 167-172
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)
37-52: MoE expert remap logic looks sound; consider clarifying the NVFP4 vs FP8 shape check
- The `_scale` guard for `mixer.in_proj`/`mixer.out_proj` correctly prevents those scale tensors from going through the `"A"`/`"D"` processing, which would be wrong.
- The new `mixer.experts.` handling:
  - Cleanly no-ops for `moe_backend == 'VANILLA'`.
  - Maps `up_proj` → w3/w1 by splitting along axis 0, which matches the expected gated layout.
  - Treats `input_scale` and `weight_scale_2` as shared scalars (duplicated for w1/w3), and `weight_scale` as:
    - NVFP4: vector/array split in half to w3/w1.
    - FP8: scalar duplicated to both.
  - Maps `down_proj` → w2 and raises on any unexpected MoE weight key, which is a good fail-fast.

For readability (and to avoid relying on the truthiness of `Tensor.shape`), it would be clearer to write the NVFP4 vs FP8 branch as:

```diff
-                elif "weight_scale" in key:
-                    # NVFP4 case.
-                    if weights[name].shape:
+                elif "weight_scale" in key:
+                    # NVFP4 (per-channel) vs FP8 (scalar) scale handling.
+                    if weights[name].ndim > 0:
                         new_weights[w3_key] = weights[
                             name][:weights[name].shape[0] // 2]
                         new_weights[w1_key] = weights[name][
                             weights[name].shape[0] // 2:]
                     # FP8 case.
                     else:
                         new_weights[w3_key] = weights[name]
                         new_weights[w1_key] = weights[name]
```

Functionally equivalent, but it makes the intent (tensor vs scalar) more explicit.
Also applies to: 98-130
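To make the NVFP4-vs-FP8 distinction concrete, a small self-contained demo of the `ndim`-based branch suggested above (the function and tensor names here are illustrative, not the mapper's actual code):

```python
import torch


def split_weight_scale(weight_scale: torch.Tensor):
    """Return (w3_scale, w1_scale) from a fused up_proj weight_scale."""
    if weight_scale.ndim > 0:
        # NVFP4-style per-channel scales: split the fused tensor in half.
        half = weight_scale.shape[0] // 2
        return weight_scale[:half], weight_scale[half:]
    # FP8-style scalar scale: duplicate it for both w3 and w1.
    return weight_scale, weight_scale


nvfp4_scale = torch.arange(8, dtype=torch.float32)  # per-channel vector
fp8_scale = torch.tensor(0.5)                       # 0-dim scalar tensor
print(split_weight_scale(nvfp4_scale))  # (first half, second half)
print(split_weight_scale(fp8_scale))    # (tensor(0.5000), tensor(0.5000))
```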
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
1863-1911: Well-implemented alignment logic for Cutlass NVFP4 requirements.

The override correctly applies row alignment (128) and validates column alignment (4). The shape calculations properly account for alignment padding while maintaining consistency between weights and their scales.
Minor note: The static analysis hint (TRY003) suggests defining the error message in the exception class, but the current format provides clear debugging information.
Consider extracting the alignment ceiling calculation as a helper to improve readability:

```diff
+    @staticmethod
+    def _align_up(value: int, alignment: int) -> int:
+        return (value + alignment - 1) // alignment * alignment
+
     def get_weights_shapes(self, module: torch.nn.Module, weight_vec_size: int,
                            block_scales_vec_size: int):
         """Override the base method to get aligned weights shapes for Cutlass nvfp4 alignment."""
         intermediate_size_expand = module.intermediate_size_per_partition * module.intermediate_size_expand_ratio
-        intermediate_size_expand_aligned = (
-            intermediate_size_expand + self.NVFP4_ROW_ALIGNMENT -
-            1) // self.NVFP4_ROW_ALIGNMENT * self.NVFP4_ROW_ALIGNMENT
+        intermediate_size_expand_aligned = self._align_up(
+            intermediate_size_expand, self.NVFP4_ROW_ALIGNMENT)
```
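For intuition, a standalone check of the ceiling-alignment arithmetic such a helper would encapsulate (plain Python, no TensorRT-LLM imports; the 128 alignment value is taken from the review comments above):

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# NVFP4 Cutlass row alignment is 128 in this PR's shape logic.
assert align_up(120, 128) == 128   # intermediate_size=120 pads up to 128
assert align_up(128, 128) == 128   # already aligned, unchanged
assert align_up(129, 128) == 256   # one element over rolls to the next multiple
```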
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- cpp/tensorrt_llm/thop/moeOp.cpp (6 hunks)
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (3 hunks)
- tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_nemotron_h.py (7 hunks)
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (11 hunks)
- tensorrt_llm/_torch/utils.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (7 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
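As a hedged illustration only (not code from this PR), a small snippet that follows several of the conventions listed above:

```python
G_REGISTRY = {}
"""dict: Example module-level global, upper snake_case with a ``G`` prefix."""

MY_CONSTANT = 128
"""int: Example constant in upper snake_case."""


class SomeClass:
    """Example class using Google-style docstrings.

    Attributes:
        hidden_size (int): Width of the input features.
    """

    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size
        """int: Externally visible member, initialized in the constructor."""

    def my_awesome_function(self, k_99th_percentile: float) -> float:
        """Scale a percentile value by the hidden size.

        Args:
            k_99th_percentile: Local name starting with a number, hence the ``k`` prefix.

        Returns:
            The scaled value.
        """
        return k_99th_percentile * self.hidden_size
```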
Files:
- tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- tensorrt_llm/_torch/modules/fused_moe/interface.py
- tensorrt_llm/_torch/utils.py
- tests/unittest/_torch/modules/test_fused_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- tensorrt_llm/_torch/modules/fused_moe/interface.py
- tensorrt_llm/_torch/utils.py
- tests/unittest/_torch/modules/test_fused_moe.py
- cpp/tensorrt_llm/thop/moeOp.cpp
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`If` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
🧠 Learnings (13)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/integration/test_lists/test-db/l0_h100.yml
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tests/unittest/_torch/modules/test_fused_moe.py
- cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- cpp/tensorrt_llm/thop/moeOp.cpp
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
🧬 Code graph analysis (8)
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1)
- tensorrt_llm/_torch/utils.py (1): split (384-392)

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (3)
- tensorrt_llm/_torch/utils.py (2): split (384-392), shape (140-141)
- tensorrt_llm/_torch/models/modeling_utils.py (1): config (522-523)
- tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1): config (156-159)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
- tensorrt_llm/_torch/utils.py (2): ActivationType (38-47), Fp4QuantizedTensor (134-141)
- tests/unittest/_torch/helpers.py (1): ceil_div (12-13)
- tensorrt_llm/quantization/utils/fp8_utils.py (1): ceil_div (10-21)

tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
- tensorrt_llm/_torch/utils.py (3): get_model_extra_attrs (88-89), is_gated_activation (52-55), is_torch_compiling (63-65)

tests/unittest/_torch/modules/test_fused_moe.py (1)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (2): get_weights_shapes (1567-1601), get_weights_shapes (1866-1911)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (2)
- tensorrt_llm/_torch/utils.py (3): ActivationType (38-47), is_gated_activation (52-55), relu2 (395-396)
- tensorrt_llm/_torch/modules/gated_mlp.py (1): GatedMLP (19-182)

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
- tensorrt_llm/_torch/utils.py (1): ActivationType (38-47)

tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (2)
- tests/unittest/utils/util.py (1): skip_gpu_memory_less_than (200-206)
- tests/scripts/perf-sanity/run_benchmark_serve.py (1): llm_models_root (174-175)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
130-130: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
55-57: Avoid specifying long messages outside the exception class
(TRY003)
59-61: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/models/modeling_nemotron_h.py
221-221: Unused method argument: kwargs
(ARG002)
422-422: Avoid specifying long messages outside the exception class
(TRY003)
tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
198-198: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/quantization.py
1875-1877: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (34)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
438-439: Base activation type correctly propagated into quant params.

Passing `base_activation_type` into `getQuantParams` keeps the quantization logic consistent with the activation-dependent checks earlier in `runMoe` (including gated vs non-gated handling) and with `getWorkspaceInfo`. This looks structurally sound and matches the new signature.
864-867: Activation-aware `expand_ratio` in MXFP4/NVFP4 quant params looks consistent; verify `isGatedActivation` coverage.

Using `base_activation_type` to compute `expand_ratio` and applying it only to the fc1 weight-block N-dim checks in MXFP4 (`W4A8_MXFP4_FP8`, `W4A8_MXFP4_MXFP8`) and NVFP4 branches aligns the scale tensor shapes with the gated vs non-gated inter-size logic. The updated error messages that mention `inter_size * expand_ratio` also clearly describe the expected layout. This all looks consistent with the higher-level activation handling.

Please double-check that:

- `isGatedActivation` returns `true` for all gated variants you use here (including `ActivationType::SwigluBias`), and
- the producers of `fc1_weight_block`/`fc2_weight_block` already generate N-dims scaled by the same `expand_ratio` convention for both gated and non-gated activations.

If either assumption doesn't hold, the new checks could become overly strict for some activation types.
Also applies to: 912-959, 981-1004, 1014-1077
tensorrt_llm/_torch/utils.py (1)
50-55: Gated-activation and `relu2` helpers look consistent with ActivationType semantics

`is_gated_activation` covering Swiglu/SwigluBias/Geglu and `relu2` implemented as `square(relu(x))` match the expected behavior and line up with how these activations are typically handled in MoE kernels. No issues from a correctness standpoint.

Also applies to: 395-396
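For reference, a standalone sketch of the ReLU² activation described here (plain PyTorch; the real helper lives in `tensorrt_llm/_torch/utils.py` and may differ in details):

```python
import torch


def relu2(x: torch.Tensor) -> torch.Tensor:
    # ReLU squared: zero for negative inputs, x**2 for positive inputs.
    return torch.square(torch.relu(x))


x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu2(x))  # tensor([0.0000, 0.0000, 0.0000, 0.2500, 4.0000])
```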
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1)
10-10: Consolidating `split` to shared utils is appropriate

Switching to `tensorrt_llm._torch.utils.split` keeps the Qwen3 Next mapper aligned with Nemotron and other users and avoids duplicated splitting logic. Given all call sites split along dim 0, the helper's default matches existing behavior.

tests/integration/test_lists/test-db/l0_a10.yml (1)
22-22: Good addition of targeted NVFP4 Cutlass fused MoE test to A10 pre-merge list

Including `test_nvfp4_cutlass_get_weights_shapes` here ensures the new alignment logic for NVFP4 Cutlass MoE weights is exercised in the standard PyTorch pre-merge pipeline on A10.

tests/integration/test_lists/test-db/l0_h100.yml (1)
36-37: Nemotron H correctness parametrizations are well-integrated into H100 PyTorch suite

Adding the two `test_nemotron_h_correctness[...]` variants under the Nemotron modeling block is consistent with how other models are parameterized here and gives coverage for the new Nemotron H MoE path across the intended `mamba_ssm_cache_dtype` options.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
14-16: Constructor-level activation_type wiring into MoE base is consistent

Adding `activation_type: ActivationType = ActivationType.Swiglu` to `CutlassFusedMoE.__init__` and passing it through as `activation_type=activation_type` to the `MoE` base nicely aligns this backend with the new activation metadata in the interface. This keeps the Python/C++ ActivationType enums in sync without changing external callers (which can rely on the default).

Also applies to: 77-95
545-579: Passing activation_type into the fused_moe kernel matches the new enum plumbing

Including `activation_type=self.activation_type` in the `torch.ops.trtllm.fused_moe` call ensures the Cutlass kernel can distinguish gated vs non-gated activations and adjust its internal logic (e.g., projection layout, Swiglu vs non-gated paths) accordingly. With `self.activation_type` already converted to the underlying IntEnum value in `MoE`, this call site looks correct.

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
9-9: Factory now cleanly threads activation_type to MoE backends that need it

The `create_moe` changes are coherent:

- Public API: `activation_type: ActivationType = ActivationType.Swiglu` is added with a sensible default, so existing callers remain valid.
- `CutlassFusedMoE` and `VanillaMoE` receive `activation_type=activation_type`, aligning them with the new MoE interface fields and the fused Cutlass kernel changes.
- Other backends continue to use the base-class default ActivationType, which is fine as long as they don't depend on non-default activations yet.
This is a good spot to centralize activation selection per model/backend.
Also applies to: 77-78, 121-138, 156-168
tests/unittest/_torch/modules/test_fused_moe.py (2)
39-40: LGTM: Import addition for new test.

The import of `NVFP4CutlassFusedMoEMethod` is correctly placed and follows the existing import pattern in this file.
2711-2818: Well-structured unit test for NVFP4 weight shape alignment.

The test comprehensively validates the `get_weights_shapes` method by:

- Testing error handling for invalid `hidden_size` alignment.
- Verifying shape calculations for weights and scales match expected aligned dimensions.
- Testing both bias and no-bias scenarios.
- Including edge cases like `intermediate_size=120` (not divisible by 128).

The use of a `MockModule` is appropriate for isolating the method under test from the full module implementation.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (3)
13-15: LGTM: New imports for activation type support.

The imports follow the project's namespace convention, importing from subpackages rather than directly importing individual classes.

36-61: Clean introduction of activation type support with proper validation.

The new parameters maintain backward compatibility via defaults. The validation logic appropriately restricts non-gated activations to `Relu2` and ensures `pack_weights=False`, preventing unsupported configurations at construction time rather than at runtime.
128-147: LGTM: Activation-aware expert creation.

The branching logic correctly handles:

- `ActivationType.Relu2` → `MLP` (non-gated, single projection)
- Other (gated) activations → `GatedMLP` (gate + up projections)

Both paths properly propagate `layer_idx` for layer-aware functionality.
17-39: LGTM: Import additions for NemotronH MoE support.The new imports are well-organized and necessary for the new NemotronHMOE class and multi-stream execution support.
112-216: Well-structured NemotronHMOE class initialization.The implementation correctly:
- Handles both list and scalar
moe_intermediate_sizeconfigurations.- Creates shared experts conditionally based on
n_shared_experts.- Uses
DeepseekV3Gatefor MoE routing (consistent with codebase patterns).- Supports latent MoE projections when
moe_latent_sizeis configured.- Sets up CUDA events for multi-stream synchronization.
The local import of
DeepseekV3Gate(line 124) to avoid circular dependency is an acceptable pattern.
217-261: LGTM: Multi-stream parallel execution for MoE.

The forward method efficiently parallelizes routed and shared expert computations using `maybe_execute_in_parallel`. The pattern of returning `0` from `_compute_shared_output` when no shared experts exist enables clean addition with the routed output.

Based on learnings, the gate returns `topk_indices` and `topk_weights` in the correct shape for the MoE call without needing reshaping.
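The overlap pattern described here can be illustrated with a generic two-stream sketch built on plain `torch.cuda` primitives (this is not the `maybe_execute_in_parallel` helper itself, just the underlying idea, with illustrative names):

```python
import torch


def run_in_parallel(fn_main, fn_aux, aux_stream: torch.cuda.Stream):
    """Run fn_main on the current stream and fn_aux on aux_stream, then sync."""
    start = torch.cuda.Event()
    done_aux = torch.cuda.Event()
    start.record()                      # mark the point both branches depend on
    with torch.cuda.stream(aux_stream):
        aux_stream.wait_event(start)    # aux work starts after shared inputs are ready
        out_aux = fn_aux()
        done_aux.record(aux_stream)
    out_main = fn_main()                # routed-expert work stays on the main stream
    torch.cuda.current_stream().wait_event(done_aux)  # join before combining outputs
    return out_main, out_aux


if torch.cuda.is_available():
    x = torch.randn(8, 64, device="cuda")
    w_routed = torch.randn(64, 64, device="cuda")
    w_shared = torch.randn(64, 64, device="cuda")
    routed, shared = run_in_parallel(lambda: x @ w_routed,
                                     lambda: x @ w_shared,
                                     torch.cuda.Stream())
    y = routed + shared  # sum routed + shared results, as in the layer forward
```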
264-310: LGTM: Layer type "E" support for MoE layers.

The NemotronHLayer extension cleanly adds support for MoE layers (`layer_type == "E"`) while passing the required `aux_stream_dict` for multi-stream execution.

335-361: LGTM: Auxiliary stream initialization and propagation.

The model correctly initializes three CUDA streams for MoE operations and propagates them to all layers. Creating streams once at model initialization and sharing them across layers is the correct pattern for multi-stream execution.

416-423: LGTM: Config attribute compatibility handling.

The code gracefully handles different config attribute names (`rms_norm_eps` vs `layer_norm_epsilon`) that may appear in different model configurations, normalizing to `rms_norm_eps` for downstream consistency.

tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (5)
4-4: LGTM: Import addition for similarity comparison.

The `similar` import enables fuzzy string matching for model outputs where exact matches aren't expected.

34-56: LGTM: Parametrized model folder support.

The function signature change makes `model_folder` explicit, improving test clarity and enabling multi-model testing. The increased `max_num_tokens` default (8192) accommodates larger model context requirements.

59-66: LGTM: Multi-model parametrization with memory guards.

The parametrization correctly:

- Tests both 8B and 30B NemotronH variants.
- Uses `skip_gpu_memory_less_than` to skip tests on GPUs with insufficient memory.
- Memory estimates (2 × model_size + 1GB) are reasonable for FP16/BF16 inference.

83-198: Well-structured per-model reference data with appropriate tolerances.

The test correctly handles model-specific variations:

- 8B model: tighter tolerance (atol=0.2), exact completion matching.
- 30B model: relaxed tolerance (atol=0.4), similarity-based completion matching.

The comments documenting reference sources (commit hashes, hardware) provide valuable provenance for the reference values.

314-467: LGTM: Explicit model folder in remaining tests.

The `test_nemotron_h_cuda_graph_overlap_scheduler` and `test_nemotron_h_chunked_prefill` tests are correctly updated to explicitly specify `model_folder="Nemotron-H-8B-Base-8K"`, maintaining their existing behavior.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (9)
221-223: LGTM! The default bias shape calculation correctly incorporates `intermediate_size_expand_ratio` to handle both gated and non-gated activation types.

426-429: LGTM! Consistent use of `intermediate_size_expand_ratio` for the fused gate/up projection weight shape.

493-508: LGTM! The `split_length` calculation correctly adapts to the `intermediate_size_expand_ratio` for splitting the fused w3/w1 weights during requantization.

516-519: LGTM! Consistent application of `intermediate_size_expand_ratio` for FP8 QDQ method.

1567-1601: Good refactoring to centralize shape computation.

The new `get_weights_shapes` method cleanly encapsulates shape calculations and enables subclass overrides for alignment requirements. The return tuple structure is well-organized.

1611-1654: LGTM! The refactored `create_weights` correctly sets `scaling_vector_size` before calling `get_weights_shapes`, ensuring subclass overrides have access to this value. The use of keyword arguments when calling the parent's `create_weights` improves clarity.

1937-1950: LGTM! The padding logic correctly handles dimension mismatches by zero-padding to the aligned destination shape. The `contiguous()` call ensures proper memory layout after padding.

1972-1981: LGTM! Consistent padding implementation for w2 weight scales, matching the w3_w1 approach.

1993-2045: LGTM! The weight loading overrides are well-implemented with:

- Clear docstrings explaining the Cutlass alignment purpose
- Consistent padding logic matching the weight scale loading
- Appropriate use of `non_blocking=True` for GPU efficiency
- Helpful inline comment documenting the pad order
```python
def get_weights_shapes(self, module: torch.nn.Module, weight_vec_size: int,
                       block_scales_vec_size: int):
    # Divide by 16 because we use int64 to pack 16 fp4 values
```
Maybe noob question: is weight_vec_size expected to be 16? Does it have to be 16? If so, it may be better to hardcode it, if not, maybe the comment can be left out?
EDIT: nevermind, I see you just moved it from the original code.
```python
if module.bias:
    w3_w1_bias_shape = (module.expert_size_per_partition,
                        module.intermediate_size_per_partition *
```
Nit: seems like the following value could be assigned to a variable as it is used in multiple places
module.intermediate_size_per_partition * module.intermediate_size_expand_ratio
```python
dst_row, dst_col = dst_w3_w1_weight_scale.shape
_row, _col = cast_w31_weight_scale.shape
if _row != dst_row or _col != dst_col:
    cast_w31_weight_scale = torch.nn.functional.pad(
```
Nit: leave a comment for why we are doing these few lines? Similar below.
```python
    device=device)
cast_w2_weight_shard = w2_weight_shard.view(dst_w2_weight.dtype)

dst_row, dst_col = dst_w2_weight.shape
```
I may be missing some subtle differences, but at first glance it seems like these few lines below are a common pattern between the different load_ methods. Is there any way they could be made a helper function?
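One possible shape for such a shared helper (a hypothetical sketch, not code from the PR) that factors out the pad-to-destination pattern used by the different load_ methods:

```python
import torch


def pad_to_dst_shape(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Zero-pad a 2-D src tensor on the right/bottom so it matches dst's shape."""
    dst_row, dst_col = dst.shape
    src_row, src_col = src.shape
    if src_row == dst_row and src_col == dst_col:
        return src
    # F.pad takes pads for the last dimension first: (left, right, top, bottom).
    return torch.nn.functional.pad(
        src, (0, dst_col - src_col, 0, dst_row - src_row)).contiguous()


dst = torch.empty(128, 64)
src = torch.ones(120, 64)
print(pad_to_dst_shape(src, dst).shape)  # torch.Size([128, 64])
```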
```python
block_scales_vec_size = 4  # 4 fp8 values packed into int32

# Check if hidden_size is divisible by NVFP4_COL_ALIGNMENT
if hidden_size % NVFP4_COL_ALIGNMENT != 0:
```
I would split out positive and negative tests into their own test functions. If this fails, none of the below get executed.
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.