[FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass #9358
base: main
Conversation
Force-pushed from bf5c6a7 to 43141a6: …ackend (Signed-off-by: Wanli Jiang <[email protected]>)
Force-pushed from 43141a6 to ff0b584: …oE cutlass (Signed-off-by: Wanli Jiang <[email protected]>)
Force-pushed from ff0b584 to b9389bc
📝 Walkthrough

This pull request introduces comprehensive support for gated and non-gated activation types in MoE modules, dynamically adjusting intermediate size expansion and weight quantization shapes accordingly. Changes span C++ quantization logic, Python MoE backends, Nemotron H model integration with auxiliary CUDA streams for parallel execution, and weight mapping for experts. New utility functions for activation type checking and tensor operations were added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python as Python Layer
    participant MoE as MoE Module
    participant Quantization as Quantization Method
    participant Utils as Utils
    Python->>MoE: create_moe(activation_type=Swiglu/Relu2)
    MoE->>Utils: is_gated_activation(activation_type)
    Utils-->>MoE: boolean (gated)
    MoE->>Quantization: Instantiate with activation_type
    Quantization->>Quantization: Compute intermediate_size_expand_ratio (2 if gated, 1 if not)
    Quantization->>Quantization: get_weights_shapes() computes w3_w1_weight_shape<br/>= inter_size * expand_ratio
    Note over Quantization: For Relu2: expand_ratio=1<br/>For Swiglu/Gelu: expand_ratio=2
    Quantization-->>MoE: Shape info for weight allocation
    MoE-->>Python: MoE module ready
```

```mermaid
sequenceDiagram
    participant Python as Nemotron H Forward
    participant Layer as NemotronHLayer
    participant MoE as NemotronHMOE
    participant Stream as CUDA Streams
    Python->>Layer: forward(x, layer_type='E', aux_stream_dict)
    alt layer_type == 'E'
        Layer->>Stream: Get MoE streams (routing, balancer, etc.)
        Layer->>MoE: Schedule MoE computation on aux streams
        Layer->>Layer: Schedule shared MLP on main/aux streams
        MoE->>Stream: maybe_execute_in_parallel(moe_out, shared_out)
        Stream-->>Layer: Synchronized outputs
        Layer->>Layer: Sum routed + shared results
    else layer_type != 'E'
        Layer->>Layer: Standard FFN path
    end
    Layer-->>Python: Output tensor
```
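For orientation, here is a minimal standalone sketch of the gated vs non-gated shape logic that the walkthrough and the first diagram describe (plain Python; the enum members and the 2×/1× ratio mirror the PR summary, while the helper names and numbers are illustrative only):

```python
from enum import IntEnum


class ActivationType(IntEnum):
    # Subset of activation types relevant to this sketch.
    Swiglu = 0
    Geglu = 1
    Relu2 = 2


def is_gated_activation(activation_type: ActivationType) -> bool:
    # Gated activations (Swiglu/Geglu) need both a gate and an up projection.
    return activation_type in (ActivationType.Swiglu, ActivationType.Geglu)


def w3_w1_rows(intermediate_size_per_partition: int,
               activation_type: ActivationType) -> int:
    # Gated: the fused w3/w1 weight holds 2 * intermediate_size rows.
    # Non-gated (e.g. Relu2): only 1 * intermediate_size rows.
    expand_ratio = 2 if is_gated_activation(activation_type) else 1
    return intermediate_size_per_partition * expand_ratio


print(w3_w1_rows(1024, ActivationType.Swiglu))  # 2048
print(w3_w1_rows(1024, ActivationType.Relu2))   # 1024
```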
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (6)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
540-541: Align `runMoeMinLantency` fc1/fc2 shape checks with gated vs non-gated activations.

`runMoeMinLantency` still enforces `fc1_expert_weights.sizes()[1] == fc2_expert_weights.sizes()[2] * mInnerDimMultiplier * 2` (Line 540), while you now thread `base_activation_type` through and use `isGatedActivation` in `runMoe` and in `getQuantParams`. If min-latency mode is ever used with non-gated activations, this hard-coded `* 2` will reject valid shapes.

Consider mirroring the gated/ungated check from `runMoe` here (using `isGatedActivation(base_activation_type)` and an `expand_ratio`) so both paths accept the same configurations, or explicitly document/guard that the min-latency path only supports gated activations.

Also applies to: 556-559, 614-618
286-316: Duplicate fc1/fc2 bias validation block in `runMoe`.

The validation of `fc1_expert_biases`/`fc2_expert_biases` (dims and expert counts) appears twice in a row with identical logic (Lines 286–300 and 302–316). This is pre-existing, but you might want to collapse it into a single block to avoid redundant checks and future divergence.

tensorrt_llm/_torch/utils.py (1)
384-392: Harden `split` against invalid `tp_size` / index values

Current implementation assumes a valid configuration; a bad `tp_size` or `idx` would fail with less clear errors (ZeroDivisionError or IndexError).

Consider tightening it slightly:

```diff
-def split(x: torch.Tensor,
-          tp_size: int,
-          idx: int,
-          dim: int = 0) -> torch.Tensor:
-    assert x.shape[dim] % tp_size == 0
+def split(x: torch.Tensor,
+          tp_size: int,
+          idx: int,
+          dim: int = 0) -> torch.Tensor:
+    assert tp_size > 0, "tp_size must be > 0"
+    assert 0 <= idx < tp_size, f"idx must be in [0, {tp_size}), got {idx}"
+    assert x.shape[dim] % tp_size == 0, (
+        f"Dimension {dim} (size={x.shape[dim]}) must be divisible by tp_size={tp_size}"
+    )
     split_size = x.shape[dim] // tp_size
     if tp_size == 1:
         return x
     return torch.split(x, split_size, dim=dim)[idx]
```

This keeps the helper cheap while surfacing configuration bugs with clearer diagnostics.
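For context, a quick standalone usage sketch of the `split` helper being discussed (assuming the behavior shown in the diff; not the exact implementation from `tensorrt_llm/_torch/utils.py`):

```python
import torch


def split(x: torch.Tensor, tp_size: int, idx: int, dim: int = 0) -> torch.Tensor:
    # Same observable behavior as the helper under review: return this rank's shard.
    assert x.shape[dim] % tp_size == 0
    if tp_size == 1:
        return x
    return torch.split(x, x.shape[dim] // tp_size, dim=dim)[idx]


weight = torch.arange(8).reshape(4, 2)
print(split(weight, tp_size=2, idx=0))  # rows 0-1
print(split(weight, tp_size=2, idx=1))  # rows 2-3
```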
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
12-13: Activation-type metadata is correct; just ensure consumers agree on “expanded” vs “base” intermediate size

`is_gated_activation(activation_type)` and the derived `intermediate_size_expand_ratio = 2 if gated else 1` are consistent with the Swiglu/Geglu gating semantics and will help MoE backends derive shapes.

The one thing to keep in mind is that `self.intermediate_size` and `self.intermediate_size_per_partition` are still computed from the raw `intermediate_size` argument. Downstream logic (e.g., quantization `get_weights_shapes`, expert weight layouts) must consistently treat that argument as either:

- already expanded (2× for gated), and then use `intermediate_size_expand_ratio` only where necessary, or
- base size, and multiply by `intermediate_size_expand_ratio` whenever actual projection dimensions are needed.

It's worth double-checking those callers to make sure there's no mixed interpretation.
Also applies to: 167-172
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)
37-52: MoE expert remap logic looks sound; consider clarifying the NVFP4 vs FP8 shape check
- The `_scale` guard for `mixer.in_proj`/`mixer.out_proj` correctly prevents those scale tensors from going through the `"A"`/`"D"` processing, which would be wrong.
- The new `mixer.experts.` handling:
  - Cleanly no-ops for `moe_backend == 'VANILLA'`.
  - Maps `up_proj` → w3/w1 by splitting along axis 0, which matches the expected gated layout.
  - Treats `input_scale` and `weight_scale_2` as shared scalars (duplicated for w1/w3), and `weight_scale` as:
    - NVFP4: vector/array split in half to w3/w1.
    - FP8: scalar duplicated to both.
  - Maps `down_proj` → w2 and raises on any unexpected MoE weight key, which is a good fail-fast.

For readability (and to avoid relying on the truthiness of `Tensor.shape`), it would be clearer to write the NVFP4 vs FP8 branch as:

```diff
-                elif "weight_scale" in key:
-                    # NVFP4 case.
-                    if weights[name].shape:
+                elif "weight_scale" in key:
+                    # NVFP4 (per-channel) vs FP8 (scalar) scale handling.
+                    if weights[name].ndim > 0:
                         new_weights[w3_key] = weights[
                             name][:weights[name].shape[0] // 2]
                         new_weights[w1_key] = weights[name][
                             weights[name].shape[0] // 2:]
                     # FP8 case.
                     else:
                         new_weights[w3_key] = weights[name]
                         new_weights[w1_key] = weights[name]
```

Functionally equivalent, but it makes the intent (tensor vs scalar) more explicit.
Also applies to: 98-130
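To make the NVFP4-vs-FP8 distinction concrete, a small self-contained demo of the `ndim`-based branch suggested above (the function and tensor names here are illustrative, not the mapper's actual code):

```python
import torch


def split_weight_scale(weight_scale: torch.Tensor):
    """Return (w3_scale, w1_scale) from a fused up_proj weight_scale."""
    if weight_scale.ndim > 0:
        # NVFP4-style per-channel scales: split the fused tensor in half.
        half = weight_scale.shape[0] // 2
        return weight_scale[:half], weight_scale[half:]
    # FP8-style scalar scale: duplicate it for both w3 and w1.
    return weight_scale, weight_scale


nvfp4_scale = torch.arange(8, dtype=torch.float32)  # per-channel vector
fp8_scale = torch.tensor(0.5)                       # 0-dim scalar tensor
print(split_weight_scale(nvfp4_scale))  # (first half, second half)
print(split_weight_scale(fp8_scale))    # (tensor(0.5000), tensor(0.5000))
```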
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
1863-1911: Well-implemented alignment logic for Cutlass NVFP4 requirements.

The override correctly applies row alignment (128) and validates column alignment (4). The shape calculations properly account for alignment padding while maintaining consistency between weights and their scales.
Minor note: The static analysis hint (TRY003) suggests defining the error message in the exception class, but the current format provides clear debugging information.
Consider extracting the alignment ceiling calculation as a helper to improve readability:

```diff
+    @staticmethod
+    def _align_up(value: int, alignment: int) -> int:
+        return (value + alignment - 1) // alignment * alignment
+
     def get_weights_shapes(self, module: torch.nn.Module, weight_vec_size: int,
                            block_scales_vec_size: int):
         """Override the base method to get aligned weights shapes for Cutlass nvfp4 alignment."""
         intermediate_size_expand = module.intermediate_size_per_partition * module.intermediate_size_expand_ratio
-        intermediate_size_expand_aligned = (
-            intermediate_size_expand + self.NVFP4_ROW_ALIGNMENT -
-            1) // self.NVFP4_ROW_ALIGNMENT * self.NVFP4_ROW_ALIGNMENT
+        intermediate_size_expand_aligned = self._align_up(
+            intermediate_size_expand, self.NVFP4_ROW_ALIGNMENT)
```
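For intuition, a standalone check of the ceiling-alignment arithmetic such a helper would encapsulate (plain Python, no TensorRT-LLM imports; the 128 alignment value is taken from the review comments above):

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# NVFP4 Cutlass row alignment is 128 in this PR's shape logic.
assert align_up(120, 128) == 128   # intermediate_size=120 pads up to 128
assert align_up(128, 128) == 128   # already aligned, unchanged
assert align_up(129, 128) == 256   # one element over rolls to the next multiple
```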
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- cpp/tensorrt_llm/thop/moeOp.cpp (6 hunks)
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (3 hunks)
- tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_nemotron_h.py (7 hunks)
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (11 hunks)
- tensorrt_llm/_torch/utils.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (7 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
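As a hedged illustration only (not code from this PR), a small snippet that follows several of the conventions listed above:

```python
G_REGISTRY = {}
"""dict: Example module-level global, upper snake_case with a ``G`` prefix."""

MY_CONSTANT = 128
"""int: Example constant in upper snake_case."""


class SomeClass:
    """Example class using Google-style docstrings.

    Attributes:
        hidden_size (int): Width of the input features.
    """

    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size
        """int: Externally visible member, initialized in the constructor."""

    def my_awesome_function(self, k_99th_percentile: float) -> float:
        """Scale a percentile value by the hidden size.

        Args:
            k_99th_percentile: Local name starting with a number, hence the ``k`` prefix.

        Returns:
            The scaled value.
        """
        return k_99th_percentile * self.hidden_size
```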
Files:
- tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- tensorrt_llm/_torch/modules/fused_moe/interface.py
- tensorrt_llm/_torch/utils.py
- tests/unittest/_torch/modules/test_fused_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- tensorrt_llm/_torch/modules/fused_moe/interface.py
- tensorrt_llm/_torch/utils.py
- tests/unittest/_torch/modules/test_fused_moe.py
- cpp/tensorrt_llm/thop/moeOp.cpp
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`If` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
🧠 Learnings (13)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/integration/test_lists/test-db/l0_h100.yml
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tests/unittest/_torch/modules/test_fused_moe.py
- cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- cpp/tensorrt_llm/thop/moeOp.cpp
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
🧬 Code graph analysis (8)
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1)
- tensorrt_llm/_torch/utils.py (1): split (384-392)

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (3)
- tensorrt_llm/_torch/utils.py (2): split (384-392), shape (140-141)
- tensorrt_llm/_torch/models/modeling_utils.py (1): config (522-523)
- tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1): config (156-159)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
- tensorrt_llm/_torch/utils.py (2): ActivationType (38-47), Fp4QuantizedTensor (134-141)
- tests/unittest/_torch/helpers.py (1): ceil_div (12-13)
- tensorrt_llm/quantization/utils/fp8_utils.py (1): ceil_div (10-21)

tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
- tensorrt_llm/_torch/utils.py (3): get_model_extra_attrs (88-89), is_gated_activation (52-55), is_torch_compiling (63-65)

tests/unittest/_torch/modules/test_fused_moe.py (1)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (2): get_weights_shapes (1567-1601), get_weights_shapes (1866-1911)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (2)
- tensorrt_llm/_torch/utils.py (3): ActivationType (38-47), is_gated_activation (52-55), relu2 (395-396)
- tensorrt_llm/_torch/modules/gated_mlp.py (1): GatedMLP (19-182)

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
- tensorrt_llm/_torch/utils.py (1): ActivationType (38-47)

tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (2)
- tests/unittest/utils/util.py (1): skip_gpu_memory_less_than (200-206)
- tests/scripts/perf-sanity/run_benchmark_serve.py (1): llm_models_root (174-175)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
130-130: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
55-57: Avoid specifying long messages outside the exception class
(TRY003)
59-61: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/models/modeling_nemotron_h.py
221-221: Unused method argument: kwargs
(ARG002)
422-422: Avoid specifying long messages outside the exception class
(TRY003)
tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
198-198: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/quantization.py
1875-1877: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (34)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
438-439: Base activation type correctly propagated into quant params.

Passing `base_activation_type` into `getQuantParams` keeps the quantization logic consistent with the activation-dependent checks earlier in `runMoe` (including gated vs non-gated handling) and with `getWorkspaceInfo`. This looks structurally sound and matches the new signature.
864-867: Activation-aware `expand_ratio` in MXFP4/NVFP4 quant params looks consistent; verify `isGatedActivation` coverage.

Using `base_activation_type` to compute `expand_ratio` and applying it only to the fc1 weight-block N-dim checks in MXFP4 (`W4A8_MXFP4_FP8`, `W4A8_MXFP4_MXFP8`) and NVFP4 branches aligns the scale tensor shapes with the gated vs non-gated inter-size logic. The updated error messages that mention `inter_size * expand_ratio` also clearly describe the expected layout. This all looks consistent with the higher-level activation handling.

Please double-check that:

- `isGatedActivation` returns `true` for all gated variants you use here (including `ActivationType::SwigluBias`), and
- the producers of `fc1_weight_block`/`fc2_weight_block` already generate N-dims scaled by the same `expand_ratio` convention for both gated and non-gated activations.

If either assumption doesn't hold, the new checks could become overly strict for some activation types.
Also applies to: 912-959, 981-1004, 1014-1077
tensorrt_llm/_torch/utils.py (1)
50-55: Gated-activation and `relu2` helpers look consistent with ActivationType semantics

`is_gated_activation` covering Swiglu/SwigluBias/Geglu and `relu2` implemented as `square(relu(x))` match the expected behavior and line up with how these activations are typically handled in MoE kernels. No issues from a correctness standpoint.

Also applies to: 395-396
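For reference, a standalone sketch of the ReLU² activation described here (plain PyTorch; the real helper lives in `tensorrt_llm/_torch/utils.py` and may differ in details):

```python
import torch


def relu2(x: torch.Tensor) -> torch.Tensor:
    # ReLU squared: zero for negative inputs, x**2 for positive inputs.
    return torch.square(torch.relu(x))


x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu2(x))  # tensor([0.0000, 0.0000, 0.0000, 0.2500, 4.0000])
```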
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1)
10-10: Consolidating `split` to shared utils is appropriate

Switching to `tensorrt_llm._torch.utils.split` keeps the Qwen3 Next mapper aligned with Nemotron and other users and avoids duplicated splitting logic. Given all call sites split along dim 0, the helper's default matches existing behavior.

tests/integration/test_lists/test-db/l0_a10.yml (1)
22-22: Good addition of targeted NVFP4 Cutlass fused MoE test to A10 pre-merge list

Including `test_nvfp4_cutlass_get_weights_shapes` here ensures the new alignment logic for NVFP4 Cutlass MoE weights is exercised in the standard PyTorch pre-merge pipeline on A10.

tests/integration/test_lists/test-db/l0_h100.yml (1)
36-37: Nemotron H correctness parametrizations are well-integrated into H100 PyTorch suite

Adding the two `test_nemotron_h_correctness[...]` variants under the Nemotron modeling block is consistent with how other models are parameterized here and gives coverage for the new Nemotron H MoE path across the intended `mamba_ssm_cache_dtype` options.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
14-16: Constructor-level activation_type wiring into MoE base is consistent

Adding `activation_type: ActivationType = ActivationType.Swiglu` to `CutlassFusedMoE.__init__` and passing it through as `activation_type=activation_type` to the `MoE` base nicely aligns this backend with the new activation metadata in the interface. This keeps the Python/C++ ActivationType enums in sync without changing external callers (which can rely on the default).

Also applies to: 77-95
545-579: Passing activation_type into the fused_moe kernel matches the new enum plumbing

Including `activation_type=self.activation_type` in the `torch.ops.trtllm.fused_moe` call ensures the Cutlass kernel can distinguish gated vs non-gated activations and adjust its internal logic (e.g., projection layout, Swiglu vs non-gated paths) accordingly. With `self.activation_type` already converted to the underlying IntEnum value in `MoE`, this call site looks correct.

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
9-9: Factory now cleanly threads activation_type to MoE backends that need it

The `create_moe` changes are coherent:

- Public API: `activation_type: ActivationType = ActivationType.Swiglu` is added with a sensible default, so existing callers remain valid.
- `CutlassFusedMoE` and `VanillaMoE` receive `activation_type=activation_type`, aligning them with the new MoE interface fields and the fused Cutlass kernel changes.
- Other backends continue to use the base-class default ActivationType, which is fine as long as they don't depend on non-default activations yet.
This is a good spot to centralize activation selection per model/backend.
Also applies to: 77-78, 121-138, 156-168
tests/unittest/_torch/modules/test_fused_moe.py (2)
39-40: LGTM: Import addition for new test.

The import of `NVFP4CutlassFusedMoEMethod` is correctly placed and follows the existing import pattern in this file.
2711-2818: Well-structured unit test for NVFP4 weight shape alignment.

The test comprehensively validates the `get_weights_shapes` method by:

- Testing error handling for invalid `hidden_size` alignment.
- Verifying shape calculations for weights and scales match expected aligned dimensions.
- Testing both bias and no-bias scenarios.
- Including edge cases like `intermediate_size=120` (not divisible by 128).

The use of a `MockModule` is appropriate for isolating the method under test from the full module implementation.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (3)
13-15: LGTM: New imports for activation type support.

The imports follow the project's namespace convention, importing from subpackages rather than directly importing individual classes.

36-61: Clean introduction of activation type support with proper validation.

The new parameters maintain backward compatibility via defaults. The validation logic appropriately restricts non-gated activations to `Relu2` and ensures `pack_weights=False`, preventing unsupported configurations at construction time rather than at runtime.
128-147: LGTM: Activation-aware expert creation.

The branching logic correctly handles:

- `ActivationType.Relu2` → `MLP` (non-gated, single projection)
- Other (gated) activations → `GatedMLP` (gate + up projections)

Both paths properly propagate `layer_idx` for layer-aware functionality.
17-39: LGTM: Import additions for NemotronH MoE support.The new imports are well-organized and necessary for the new NemotronHMOE class and multi-stream execution support.
112-216: Well-structured NemotronHMOE class initialization.The implementation correctly:
- Handles both list and scalar
moe_intermediate_sizeconfigurations.- Creates shared experts conditionally based on
n_shared_experts.- Uses
DeepseekV3Gatefor MoE routing (consistent with codebase patterns).- Supports latent MoE projections when
moe_latent_sizeis configured.- Sets up CUDA events for multi-stream synchronization.
The local import of
DeepseekV3Gate(line 124) to avoid circular dependency is an acceptable pattern.
217-261: LGTM: Multi-stream parallel execution for MoE.

The forward method efficiently parallelizes routed and shared expert computations using `maybe_execute_in_parallel`. The pattern of returning `0` from `_compute_shared_output` when no shared experts exist enables clean addition with the routed output.

Based on learnings, the gate returns `topk_indices` and `topk_weights` in the correct shape for the MoE call without needing reshaping.
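The overlap pattern described here can be illustrated with a generic two-stream sketch built on plain `torch.cuda` primitives (this is not the `maybe_execute_in_parallel` helper itself, just the underlying idea, with illustrative names):

```python
import torch


def run_in_parallel(fn_main, fn_aux, aux_stream: torch.cuda.Stream):
    """Run fn_main on the current stream and fn_aux on aux_stream, then sync."""
    start = torch.cuda.Event()
    done_aux = torch.cuda.Event()
    start.record()                      # mark the point both branches depend on
    with torch.cuda.stream(aux_stream):
        aux_stream.wait_event(start)    # aux work starts after shared inputs are ready
        out_aux = fn_aux()
        done_aux.record(aux_stream)
    out_main = fn_main()                # routed-expert work stays on the main stream
    torch.cuda.current_stream().wait_event(done_aux)  # join before combining outputs
    return out_main, out_aux


if torch.cuda.is_available():
    x = torch.randn(8, 64, device="cuda")
    w_routed = torch.randn(64, 64, device="cuda")
    w_shared = torch.randn(64, 64, device="cuda")
    routed, shared = run_in_parallel(lambda: x @ w_routed,
                                     lambda: x @ w_shared,
                                     torch.cuda.Stream())
    y = routed + shared  # sum routed + shared results, as in the layer forward
```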
264-310: LGTM: Layer type "E" support for MoE layers.

The NemotronHLayer extension cleanly adds support for MoE layers (`layer_type == "E"`) while passing the required `aux_stream_dict` for multi-stream execution.

335-361: LGTM: Auxiliary stream initialization and propagation.

The model correctly initializes three CUDA streams for MoE operations and propagates them to all layers. Creating streams once at model initialization and sharing them across layers is the correct pattern for multi-stream execution.

416-423: LGTM: Config attribute compatibility handling.

The code gracefully handles different config attribute names (`rms_norm_eps` vs `layer_norm_epsilon`) that may appear in different model configurations, normalizing to `rms_norm_eps` for downstream consistency.

tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (5)
4-4: LGTM: Import addition for similarity comparison.

The `similar` import enables fuzzy string matching for model outputs where exact matches aren't expected.

34-56: LGTM: Parametrized model folder support.

The function signature change makes `model_folder` explicit, improving test clarity and enabling multi-model testing. The increased `max_num_tokens` default (8192) accommodates larger model context requirements.

59-66: LGTM: Multi-model parametrization with memory guards.

The parametrization correctly:

- Tests both 8B and 30B NemotronH variants.
- Uses `skip_gpu_memory_less_than` to skip tests on GPUs with insufficient memory.
- Memory estimates (2 × model_size + 1GB) are reasonable for FP16/BF16 inference.

83-198: Well-structured per-model reference data with appropriate tolerances.

The test correctly handles model-specific variations:

- 8B model: tighter tolerance (atol=0.2), exact completion matching.
- 30B model: relaxed tolerance (atol=0.4), similarity-based completion matching.

The comments documenting reference sources (commit hashes, hardware) provide valuable provenance for the reference values.

314-467: LGTM: Explicit model folder in remaining tests.

The `test_nemotron_h_cuda_graph_overlap_scheduler` and `test_nemotron_h_chunked_prefill` tests are correctly updated to explicitly specify `model_folder="Nemotron-H-8B-Base-8K"`, maintaining their existing behavior.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (9)
221-223: LGTM! The default bias shape calculation correctly incorporates `intermediate_size_expand_ratio` to handle both gated and non-gated activation types.

426-429: LGTM! Consistent use of `intermediate_size_expand_ratio` for the fused gate/up projection weight shape.

493-508: LGTM! The `split_length` calculation correctly adapts to the `intermediate_size_expand_ratio` for splitting the fused w3/w1 weights during requantization.

516-519: LGTM! Consistent application of `intermediate_size_expand_ratio` for FP8 QDQ method.

1567-1601: Good refactoring to centralize shape computation.

The new `get_weights_shapes` method cleanly encapsulates shape calculations and enables subclass overrides for alignment requirements. The return tuple structure is well-organized.

1611-1654: LGTM! The refactored `create_weights` correctly sets `scaling_vector_size` before calling `get_weights_shapes`, ensuring subclass overrides have access to this value. The use of keyword arguments when calling the parent's `create_weights` improves clarity.

1937-1950: LGTM! The padding logic correctly handles dimension mismatches by zero-padding to the aligned destination shape. The `contiguous()` call ensures proper memory layout after padding.

1972-1981: LGTM! Consistent padding implementation for w2 weight scales, matching the w3_w1 approach.

1993-2045: LGTM! The weight loading overrides are well-implemented with:

- Clear docstrings explaining the Cutlass alignment purpose
- Consistent padding logic matching the weight scale loading
- Appropriate use of `non_blocking=True` for GPU efficiency
- Helpful inline comment documenting the pad order
```python
def get_weights_shapes(self, module: torch.nn.Module, weight_vec_size: int,
                       block_scales_vec_size: int):
    # Divide by 16 because we use int64 to pack 16 fp4 values
```
Maybe noob question: is weight_vec_size expected to be 16? Does it have to be 16? If so, it may be better to hardcode it, if not, maybe the comment can be left out?
EDIT: nevermind, I see you just moved it from the original code.
```python
if module.bias:
    w3_w1_bias_shape = (module.expert_size_per_partition,
                        module.intermediate_size_per_partition *
```
Nit: seems like the following value could be assigned to a variable as it is used in multiple places
module.intermediate_size_per_partition * module.intermediate_size_expand_ratio
```python
dst_row, dst_col = dst_w3_w1_weight_scale.shape
_row, _col = cast_w31_weight_scale.shape
if _row != dst_row or _col != dst_col:
    cast_w31_weight_scale = torch.nn.functional.pad(
```
Nit: leave a comment for why we are doing these few lines? Similar below.
```python
    device=device)
cast_w2_weight_shard = w2_weight_shard.view(dst_w2_weight.dtype)

dst_row, dst_col = dst_w2_weight.shape
```
I may be missing some subtle differences, but at first glance it seems like these few lines below are a common pattern between the different load_ methods. Is there any way they could be made a helper function?
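One possible shape for such a shared helper (a hypothetical sketch, not code from the PR) that factors out the pad-to-destination pattern used by the different load_ methods:

```python
import torch


def pad_to_dst_shape(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Zero-pad a 2-D src tensor on the right/bottom so it matches dst's shape."""
    dst_row, dst_col = dst.shape
    src_row, src_col = src.shape
    if src_row == dst_row and src_col == dst_col:
        return src
    # F.pad takes pads for the last dimension first: (left, right, top, bottom).
    return torch.nn.functional.pad(
        src, (0, dst_col - src_col, 0, dst_row - src_row)).contiguous()


dst = torch.empty(128, 64)
src = torch.ones(120, 64)
print(pad_to_dst_shape(src, dst).shape)  # torch.Size([128, 64])
```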
```python
block_scales_vec_size = 4  # 4 fp8 values packed into int32

# Check if hidden_size is divisible by NVFP4_COL_ALIGNMENT
if hidden_size % NVFP4_COL_ALIGNMENT != 0:
```
I would split out positive and negative tests into their own test functions. If this fails, none of the below get executed.
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.