@Wanli-Jiang Wanli-Jiang commented Nov 21, 2025


Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable activation type support for Mixture of Experts (MoE) operations with handling for gated vs non-gated activations
    • Enhanced quantization methods for MoE weights (NVFP4, FP8 variants)
    • Introduced multi-stream parallel execution for MoE routing and shared paths
    • Extended Nemotron-H model architecture with integrated MoE layers
  • Tests

    • Added validation tests for MoE weight shape alignment with quantization
    • Added model correctness tests for Nemotron-H variants with MoE support


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nvfp4-moe-alignment branch from bf5c6a7 to 43141a6 on November 21, 2025 08:34
@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nvfp4-moe-alignment branch from 43141a6 to ff0b584 on November 25, 2025 07:54
@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nvfp4-moe-alignment branch from ff0b584 to b9389bc on November 25, 2025 07:56
@Wanli-Jiang Wanli-Jiang marked this pull request as ready for review November 25, 2025 07:56
@Wanli-Jiang Wanli-Jiang requested review from a team as code owners November 25, 2025 07:56
coderabbitai bot commented Nov 25, 2025

📝 Walkthrough

This pull request introduces comprehensive support for gated and non-gated activation types in MoE modules, dynamically adjusting intermediate size expansion and weight quantization shapes accordingly. Changes span C++ quantization logic, Python MoE backends, Nemotron H model integration with auxiliary CUDA streams for parallel execution, and weight mapping for experts. New utility functions for activation type checking and tensor operations were added.

Changes

Cohort / File(s) Summary
C++ Quantization Backend
cpp/tensorrt_llm/thop/moeOp.cpp
Added base_activation_type parameter to getQuantParams method; introduced expand_ratio (2 for gated, 1 for non-gated); replaced hard-coded multipliers with computed expand_ratio across multiple quantization paths (FP8, MXFP4, NVFP4); updated validation checks and error messages for fc1_weight_block alignment.
Python Utilities
tensorrt_llm/_torch/utils.py
Added three new public functions: is_gated_activation(ActivationType) -> bool to detect gated activations; split(x, tp_size, idx, dim) -> Tensor for tensor partitioning; relu2(x) -> Tensor for squared ReLU computation. Added import of torch.nn.functional as F.
MoE Core Interface & Configuration
tensorrt_llm/_torch/modules/fused_moe/interface.py
Added is_gated_activation import; introduced internal attributes is_gated_activation and intermediate_size_expand_ratio (computed from activation type) to track gating state and sizing requirements.
MoE Backend Initialization
tensorrt_llm/_torch/modules/fused_moe/create_moe.py
Added activation_type: ActivationType = ActivationType.Swiglu parameter; propagated activation_type to CutlassFusedMoE and VanillaMoE constructors; added layer_idx=layer_idx argument to VanillaMoE.
MoE Cutlass Implementation
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
Added activation_type: ActivationType = ActivationType.Swiglu parameter to CutlassFusedMoE.__init__; propagated to base class via super().__init__(..., activation_type=activation_type). Added ActivationType import from utils.
MoE Vanilla Implementation
tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
Added layer_idx: Optional[int] = None and activation_type: ActivationType = ActivationType.Swiglu parameters; implemented conditional expert creation: instantiate MLP with relu2 activation for Relu2 type, else GatedMLP; both receive layer_idx. Added imports for ActivationType, is_gated_activation, relu2, MLP.
MoE Quantization Methods
tensorrt_llm/_torch/modules/fused_moe/quantization.py
Added get_weights_shapes() methods to NVFP4FusedMoEMethod, NVFP4CutlassFusedMoEMethod, and MXFP4WeightFusedMoEMethod returning shape dictionaries (w3_w1_weight_shape, w2_weight_shape, bias shapes, scale shapes); updated create_weights paths across FP8, NVFP4, Cutlass, and MXFP4 variants to use intermediate_size_expand_ratio for w3_w1 shapes instead of hard-coded multipliers; enhanced weight loading and padding logic for aligned quantization.
Nemotron H Model Core
tensorrt_llm/_torch/models/modeling_nemotron_h.py
Added NemotronHMOE class for MoE support; extended NemotronHLayer with aux_stream_dict parameter for multi-stream parallel MoE/MLP execution using maybe_execute_in_parallel; updated layer_type support to include "E" for MoE; added latent MoE projection support (fc1_latent_proj, fc2_latent_proj); ensured RMS norm epsilon consistency via config; refactored imports to use shared utilities.
Weight Mappers
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
Updated import paths: changed split source to tensorrt_llm._torch.utils.split; added MoE weight remapping for mixer.experts (VANILLA passthrough, other backends map up_proj to w1/w3 with scale handling, down_proj to w2); narrowed scale remapping condition to mixer.in_proj|mixer.out_proj with _scale suffix.
Test Configuration
tests/integration/test_lists/test-db/l0_a10.yml
Added new PyTorch test: unittest/_torch/modules/test_fused_moe.py::test_nvfp4_cutlass_get_weights_shapes.
Test Configuration
tests/integration/test_lists/test-db/l0_h100.yml
Added two parameterized Nemotron H correctness test variants for Nemotron-Nano-3-30B-A3.5B-dev-1024 with different mamba_ssm_cache_dtype values.
Nemotron H Test Suite
tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
Added model_folder parameter to create_nemotron_h_llm() and test_nemotron_h_correctness(); updated test to parameterize over multiple model folders with per-model expected completions and reference logprobs; raised default max_num_tokens to 8192; updated all test invocations to pass model_folder explicitly.
MoE Quantization Tests
tests/unittest/_torch/modules/test_fused_moe.py
Added import of NVFP4CutlassFusedMoEMethod; implemented test_nvfp4_cutlass_get_weights_shapes() test validating shape alignment logic for NVFP4 Cutlass with assertions on row/column alignment constraints and intermediate size expansion.

Sequence Diagram(s)

sequenceDiagram
    participant Python as Python Layer
    participant MoE as MoE Module
    participant Quantization as Quantization Method
    participant Utils as Utils

    Python->>MoE: create_moe(activation_type=Swiglu/Relu2)
    MoE->>Utils: is_gated_activation(activation_type)
    Utils-->>MoE: boolean (gated)
    MoE->>Quantization: Instantiate with activation_type
    Quantization->>Quantization: Compute intermediate_size_expand_ratio (2 if gated, 1 if not)
    Quantization->>Quantization: get_weights_shapes() computes w3_w1_weight_shape<br/>= inter_size * expand_ratio
    Note over Quantization: For Relu2: expand_ratio=1<br/>For Swiglu/Gelu: expand_ratio=2
    Quantization-->>MoE: Shape info for weight allocation
    MoE-->>Python: MoE module ready
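
To make the diagram concrete, here is a minimal sketch of how the gating check drives the expand ratio and the fused w3/w1 row count (the enum values and helper bodies below are simplified stand-ins for illustration, not the actual tensorrt_llm implementation):

from enum import IntEnum

class ActivationType(IntEnum):  # simplified stand-in for tensorrt_llm._torch.utils.ActivationType
    Swiglu = 0
    SwigluBias = 1
    Geglu = 2
    Relu2 = 3

def is_gated_activation(act: ActivationType) -> bool:
    # Gated activations pair a gate projection with the up projection.
    return act in (ActivationType.Swiglu, ActivationType.SwigluBias, ActivationType.Geglu)

def w3_w1_rows(intermediate_size_per_partition: int, act: ActivationType) -> int:
    expand_ratio = 2 if is_gated_activation(act) else 1
    return intermediate_size_per_partition * expand_ratio

assert w3_w1_rows(1024, ActivationType.Swiglu) == 2048  # gated: fused gate + up projections
assert w3_w1_rows(1024, ActivationType.Relu2) == 1024   # non-gated: single up projection
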
sequenceDiagram
    participant Python as Nemotron H Forward
    participant Layer as NemotronHLayer
    participant MoE as NemotronHMOE
    participant Stream as CUDA Streams

    Python->>Layer: forward(x, layer_type='E', aux_stream_dict)
    alt layer_type == 'E'
        Layer->>Stream: Get MoE streams (routing, balancer, etc.)
        Layer->>MoE: Schedule MoE computation on aux streams
        Layer->>Layer: Schedule shared MLP on main/aux streams
        MoE->>Stream: maybe_execute_in_parallel(moe_out, shared_out)
        Stream-->>Layer: Synchronized outputs
        Layer->>Layer: Sum routed + shared results
    else layer_type != 'E'
        Layer->>Layer: Standard FFN path
    end
    Layer-->>Python: Output tensor
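
The routed/shared overlap in the second diagram boils down to the standard PyTorch side-stream pattern. Below is a self-contained sketch of that pattern using raw torch.cuda streams and events; it is illustrative only and does not reproduce the repository's maybe_execute_in_parallel helper, whose exact signature is not shown here:

import torch

def run_routed_and_shared(x, routed_fn, shared_fn, aux_stream):
    # Run the shared-expert path on aux_stream while the routed experts run on the main stream.
    start = torch.cuda.Event()
    done = torch.cuda.Event()
    start.record()                    # main stream has finished producing x
    with torch.cuda.stream(aux_stream):
        aux_stream.wait_event(start)  # shared path may read x only after it is ready
        shared_out = shared_fn(x)
        done.record()                 # recorded on aux_stream
    routed_out = routed_fn(x)         # enqueued on the main stream, overlaps with shared_fn
    torch.cuda.current_stream().wait_event(done)
    return routed_out + shared_out
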

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • Activation-type-driven expand_ratio logic (moeOp.cpp, quantization.py, interface.py): Critical flow where expand_ratio affects weight shapes, block validation, and error messages across multiple quantization backends (FP8, MXFP4, NVFP4, Cutlass). Verify consistency across all branches and that hard-coded multipliers were correctly replaced.
  • NVFP4 weight-shape computation and alignment (quantization.py): Multiple added/modified get_weights_shapes() methods with row/column alignment constraints. Review correctness of split_length calculations, padding logic, and constraint enforcement (especially for Cutlass variant).
  • NemotronHMOE integration and multi-stream execution (modeling_nemotron_h.py): New class with auxiliary stream handling via maybe_execute_in_parallel and EventType. Verify stream lifecycle, synchronization points, and correctness of parallel MoE/shared-MLP execution and result summation.
  • Expert instantiation conditionals (fused_moe_vanilla.py): Branch logic selecting MLP vs. GatedMLP based on activation_type; verify constraint enforcement (pack_weights=False for non-gated) and correctness of relu2 usage.
  • Parameter threading through call chain (create_moe.py, fused_moe_cutlass.py): Verify activation_type is correctly propagated through constructors and base-class initializers across all backends.

Possibly related PRs

Suggested reviewers

  • limin2021
  • zongfeijing
  • chzblych
  • kaiyux
  • djns99

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description consists entirely of the repository template with boilerplate text. The required sections (Description, Test Coverage) are present but empty, providing no actual explanation of the changes, rationale, or test coverage. Resolution: fill in the Description section explaining the issue and solution, and the Test Coverage section listing relevant test cases that validate the changes.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 13.64%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title clearly specifies the feature (NVFP4 MoE cutlass support) and the main change (weight and weight_scale padding), directly corresponding to the changes in quantization.py and related MoE files.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
cpp/tensorrt_llm/thop/moeOp.cpp (2)

540-541: Align runMoeMinLantency fc1/fc2 shape checks with gated vs non‑gated activations.

runMoeMinLantency still enforces fc1_expert_weights.sizes()[1] == fc2_expert_weights.sizes()[2] * mInnerDimMultiplier * 2 (Line 540), while you now thread base_activation_type through and use isGatedActivation in runMoe and in getQuantParams. If min‑latency mode is ever used with non‑gated activations, this hard-coded * 2 will reject valid shapes.

Consider mirroring the gated/ungated check from runMoe here (using isGatedActivation(base_activation_type) and an expand_ratio) so both paths accept the same configurations, or explicitly document/guard that the min‑latency path only supports gated activations.

Also applies to: 556-559, 614-618


286-316: Duplicate fc1/fc2 bias validation block in runMoe.

The validation of fc1_expert_biases / fc2_expert_biases (dims and expert counts) appears twice in a row with identical logic (Lines 286–300 and 302–316). This is pre‑existing, but you might want to collapse it into a single block to avoid redundant checks and future divergence.

tensorrt_llm/_torch/utils.py (1)

384-392: Harden split against invalid tp_size / index values

Current implementation assumes a valid configuration; a bad tp_size or idx would fail with less clear errors (ZeroDivisionError or IndexError).

Consider tightening it slightly:

-def split(x: torch.Tensor,
-          tp_size: int,
-          idx: int,
-          dim: int = 0) -> torch.Tensor:
-    assert x.shape[dim] % tp_size == 0
+def split(x: torch.Tensor,
+          tp_size: int,
+          idx: int,
+          dim: int = 0) -> torch.Tensor:
+    assert tp_size > 0, "tp_size must be > 0"
+    assert 0 <= idx < tp_size, f"idx must be in [0, {tp_size}), got {idx}"
+    assert x.shape[dim] % tp_size == 0, (
+        f"Dimension {dim} (size={x.shape[dim]}) must be divisible by tp_size={tp_size}"
+    )
     split_size = x.shape[dim] // tp_size
     if tp_size == 1:
         return x
     return torch.split(x, split_size, dim=dim)[idx]

This keeps the helper cheap while surfacing configuration bugs with clearer diagnostics.
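
For context, a quick usage example of the helper in its current form (assuming a working tensorrt_llm install; the tensor values are arbitrary):

import torch
from tensorrt_llm._torch import utils

w = torch.arange(8).reshape(4, 2)         # a small 4x2 "weight"
shard = utils.split(w, tp_size=2, idx=1)  # rows 2..3 go to TP rank 1
assert shard.shape == (2, 2)
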

tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

12-13: Activation-type metadata is correct; just ensure consumers agree on “expanded” vs “base” intermediate size

is_gated_activation(activation_type) and the derived intermediate_size_expand_ratio = 2 if gated else 1 are consistent with the Swiglu/Geglu gating semantics and will help MoE backends derive shapes.

The one thing to keep in mind is that self.intermediate_size and self.intermediate_size_per_partition are still computed from the raw intermediate_size argument. Downstream logic (e.g., quantization get_weights_shapes, expert weight layouts) must consistently treat that argument as either:

  • already expanded (2× for gated), and then use intermediate_size_expand_ratio only where necessary, or
  • base size, and multiply by intermediate_size_expand_ratio whenever actual projection dimensions are needed.

It’s worth double-checking those callers to make sure there’s no mixed interpretation.

Also applies to: 167-172
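
As a concrete illustration of the two conventions (the numbers are hypothetical, not taken from the code):

intermediate_size = 1024  # value handed to the MoE module
expand_ratio = 2          # gated activation (Swiglu/Geglu)

# Interpretation 1: the argument is the base size; projection dims must be expanded.
fc1_rows_if_base = intermediate_size * expand_ratio  # 2048

# Interpretation 2: the argument is already expanded; use it as-is.
fc1_rows_if_expanded = intermediate_size  # 1024
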

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)

37-52: MoE expert remap logic looks sound; consider clarifying the NVFP4 vs FP8 shape check

  • The _scale guard for mixer.in_proj/mixer.out_proj correctly prevents those scale tensors from going through the "A"/"D" processing, which would be wrong.
  • The new mixer.experts. handling:
    • Cleanly no-ops for moe_backend == 'VANILLA'.
    • Maps up_proj → w3/w1 by splitting along axis 0, which matches the expected gated layout.
    • Treats input_scale and weight_scale_2 as shared scalars (duplicated for w1/w3), and weight_scale as:
      • NVFP4: vector/array split in half to w3/w1.
      • FP8: scalar duplicated to both.
    • Maps down_proj → w2 and raises on any unexpected MoE weight key, which is a good fail-fast.

For readability (and to avoid relying on the truthiness of Tensor.shape), it would be clearer to write the NVFP4 vs FP8 branch as:

-                        elif "weight_scale" in key:
-                            # NVFP4 case.
-                            if weights[name].shape:
+                        elif "weight_scale" in key:
+                            # NVFP4 (per-channel) vs FP8 (scalar) scale handling.
+                            if weights[name].ndim > 0:
                                 new_weights[w3_key] = weights[
                                     name][:weights[name].shape[0] // 2]
                                 new_weights[w1_key] = weights[name][
                                     weights[name].shape[0] // 2:]
                             # FP8 case.
                             else:
                                 new_weights[w3_key] = weights[name]
                                 new_weights[w1_key] = weights[name]

Functionally equivalent, but it makes the intent (tensor vs scalar) more explicit.

Also applies to: 98-130

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

1863-1911: Well-implemented alignment logic for Cutlass NVFP4 requirements.

The override correctly applies row alignment (128) and validates column alignment (4). The shape calculations properly account for alignment padding while maintaining consistency between weights and their scales.

Minor note: The static analysis hint (TRY003) suggests defining the error message in the exception class, but the current format provides clear debugging information.

Consider extracting the alignment ceiling calculation as a helper to improve readability:

+    @staticmethod
+    def _align_up(value: int, alignment: int) -> int:
+        return (value + alignment - 1) // alignment * alignment
+
     def get_weights_shapes(self, module: torch.nn.Module, weight_vec_size: int,
                            block_scales_vec_size: int):
         """Override the base method to get aligned weights shapes for Cutlass nvfp4 alignment."""
         intermediate_size_expand = module.intermediate_size_per_partition * module.intermediate_size_expand_ratio
-        intermediate_size_expand_aligned = (
-            intermediate_size_expand + self.NVFP4_ROW_ALIGNMENT -
-            1) // self.NVFP4_ROW_ALIGNMENT * self.NVFP4_ROW_ALIGNMENT
+        intermediate_size_expand_aligned = self._align_up(
+            intermediate_size_expand, self.NVFP4_ROW_ALIGNMENT)
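
A worked example of the rounding, using the intermediate_size=120 case exercised in the new unit test (the expand ratio of 2 assumes a gated activation):

NVFP4_ROW_ALIGNMENT = 128

def align_up(value: int, alignment: int) -> int:
    return (value + alignment - 1) // alignment * alignment

assert align_up(120 * 2, NVFP4_ROW_ALIGNMENT) == 256  # 240 rounds up to 256, i.e. 16 zero rows of padding per expert
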
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6af01dc and b9389bc.

📒 Files selected for processing (14)
  • cpp/tensorrt_llm/thop/moeOp.cpp (6 hunks)
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (3 hunks)
  • tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py (4 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (4 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (11 hunks)
  • tensorrt_llm/_torch/utils.py (3 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
  • tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (7 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/utils.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/utils.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
🧠 Learnings (13)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/unittest/_torch/modeling/test_modeling_nemotron_h.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.

Applied to files:

  • tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
🧬 Code graph analysis (8)
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1)
tensorrt_llm/_torch/utils.py (1)
  • split (384-392)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (3)
tensorrt_llm/_torch/utils.py (2)
  • split (384-392)
  • shape (140-141)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • config (522-523)
tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1)
  • config (156-159)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
tensorrt_llm/_torch/utils.py (2)
  • ActivationType (38-47)
  • Fp4QuantizedTensor (134-141)
tests/unittest/_torch/helpers.py (1)
  • ceil_div (12-13)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • ceil_div (10-21)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
tensorrt_llm/_torch/utils.py (3)
  • get_model_extra_attrs (88-89)
  • is_gated_activation (52-55)
  • is_torch_compiling (63-65)
tests/unittest/_torch/modules/test_fused_moe.py (1)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • get_weights_shapes (1567-1601)
  • get_weights_shapes (1866-1911)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (2)
tensorrt_llm/_torch/utils.py (3)
  • ActivationType (38-47)
  • is_gated_activation (52-55)
  • relu2 (395-396)
tensorrt_llm/_torch/modules/gated_mlp.py (1)
  • GatedMLP (19-182)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
tensorrt_llm/_torch/utils.py (1)
  • ActivationType (38-47)
tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (2)
tests/unittest/utils/util.py (1)
  • skip_gpu_memory_less_than (200-206)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
  • llm_models_root (174-175)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py

130-130: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py

55-57: Avoid specifying long messages outside the exception class

(TRY003)


59-61: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/models/modeling_nemotron_h.py

221-221: Unused method argument: kwargs

(ARG002)


422-422: Avoid specifying long messages outside the exception class

(TRY003)

tests/unittest/_torch/modeling/test_modeling_nemotron_h.py

198-198: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/modules/fused_moe/quantization.py

1875-1877: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (34)
cpp/tensorrt_llm/thop/moeOp.cpp (2)

438-439: Base activation type correctly propagated into quant params.

Passing base_activation_type into getQuantParams keeps the quantization logic consistent with the activation-dependent checks earlier in runMoe (including gated vs non-gated handling) and with getWorkspaceInfo. This looks structurally sound and matches the new signature.


864-867: Activation‑aware expand_ratio in MXFP4/NVFP4 quant params looks consistent; verify isGatedActivation coverage.

Using base_activation_type to compute expand_ratio and applying it only to the fc1 weight‑block N‑dim checks in MXFP4 (W4A8_MXFP4_FP8, W4A8_MXFP4_MXFP8) and NVFP4 branches aligns the scale tensor shapes with the gated vs non‑gated inter‑size logic. The updated error messages that mention inter_size * expand_ratio also clearly describe the expected layout. This all looks consistent with the higher‑level activation handling.

Please double‑check that:

  • isGatedActivation returns true for all gated variants you use here (including ActivationType::SwigluBias), and
  • the producers of fc1_weight_block / fc2_weight_block already generate N‑dims scaled by the same expand_ratio convention for both gated and non‑gated activations.

If either assumption doesn’t hold, the new checks could become overly strict for some activation types.

Also applies to: 912-959, 981-1004, 1014-1077

tensorrt_llm/_torch/utils.py (1)

50-55: Gated-activation and relu2 helpers look consistent with ActivationType semantics

is_gated_activation covering Swiglu/SwigluBias/Geglu and relu2 implemented as square(relu(x)) match the expected behavior and line up with how these activations are typically handled in MoE kernels. No issues from a correctness standpoint.

Also applies to: 395-396
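
For readers unfamiliar with the activation, a minimal standalone equivalent of what relu2 computes (a sketch, not the repository code):

import torch
import torch.nn.functional as F

def relu2(x: torch.Tensor) -> torch.Tensor:
    # Squared ReLU: relu(x) ** 2, used by non-gated MoE experts such as Relu2.
    return torch.square(F.relu(x))

x = torch.tensor([-1.0, 0.5, 2.0])
assert torch.allclose(relu2(x), torch.tensor([0.0, 0.25, 4.0]))
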

tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py (1)

10-10: Consolidating split to shared utils is appropriate

Switching to tensorrt_llm._torch.utils.split keeps the Qwen3 Next mapper aligned with Nemotron and other users and avoids duplicated splitting logic. Given all call sites split along dim 0, the helper’s default matches existing behavior.

tests/integration/test_lists/test-db/l0_a10.yml (1)

22-22: Good addition of targeted NVFP4 Cutlass fused MoE test to A10 pre-merge list

Including test_nvfp4_cutlass_get_weights_shapes here ensures the new alignment logic for NVFP4 Cutlass MoE weights is exercised in the standard PyTorch pre-merge pipeline on A10.

tests/integration/test_lists/test-db/l0_h100.yml (1)

36-37: Nemotron H correctness parametrizations are well-integrated into H100 PyTorch suite

Adding the two test_nemotron_h_correctness[...] variants under the Nemotron modeling block is consistent with how other models are parameterized here and gives coverage for the new Nemotron H MoE path across the intended mamba_ssm_cache_dtype options.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

14-16: Constructor-level activation_type wiring into MoE base is consistent

Adding activation_type: ActivationType = ActivationType.Swiglu to CutlassFusedMoE.__init__ and passing it through as activation_type=activation_type to the MoE base nicely aligns this backend with the new activation metadata in the interface. This keeps the Python/C++ ActivationType enums in sync without changing external callers (which can rely on the default).

Also applies to: 77-95


545-579: Passing activation_type into the fused_moe kernel matches the new enum plumbing

Including activation_type=self.activation_type in the torch.ops.trtllm.fused_moe call ensures the Cutlass kernel can distinguish gated vs non-gated activations and adjust its internal logic (e.g., projection layout, Swiglu vs non-gated paths) accordingly. With self.activation_type already converted to the underlying IntEnum value in MoE, this call site looks correct.

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)

9-9: Factory now cleanly threads activation_type to MoE backends that need it

The create_moe changes are coherent:

  • Public API: activation_type: ActivationType = ActivationType.Swiglu is added with a sensible default, so existing callers remain valid.
  • CutlassFusedMoE and VanillaMoE receive activation_type=activation_type, aligning them with the new MoE interface fields and the fused Cutlass kernel changes.
  • Other backends continue to use the base-class default ActivationType, which is fine as long as they don’t depend on non-default activations yet.

This is a good spot to centralize activation selection per model/backend.

Also applies to: 77-78, 121-138, 156-168

tests/unittest/_torch/modules/test_fused_moe.py (2)

39-40: LGTM: Import addition for new test.

The import of NVFP4CutlassFusedMoEMethod is correctly placed and follows the existing import pattern in this file.


2711-2818: Well-structured unit test for NVFP4 weight shape alignment.

The test comprehensively validates the get_weights_shapes method by:

  1. Testing error handling for invalid hidden_size alignment.
  2. Verifying shape calculations for weights and scales match expected aligned dimensions.
  3. Testing both bias and no-bias scenarios.
  4. Including edge cases like intermediate_size=120 (not divisible by 128).

The use of a MockModule is appropriate for isolating the method under test from the full module implementation.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py (3)

13-15: LGTM: New imports for activation type support.

The imports follow the project's namespace convention, importing from subpackages rather than directly importing individual classes.


36-61: Clean introduction of activation type support with proper validation.

The new parameters maintain backward compatibility via defaults. The validation logic appropriately restricts non-gated activations to Relu2 and ensures pack_weights=False, preventing unsupported configurations at construction time rather than at runtime.


128-147: LGTM: Activation-aware expert creation.

The branching logic correctly handles:

  • ActivationType.Relu2 → MLP (non-gated, single projection)
  • Other (gated) activations → GatedMLP (gate + up projections)

Both paths properly propagate layer_idx for layer-aware functionality.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (6)

17-39: LGTM: Import additions for NemotronH MoE support.

The new imports are well-organized and necessary for the new NemotronHMOE class and multi-stream execution support.


112-216: Well-structured NemotronHMOE class initialization.

The implementation correctly:

  • Handles both list and scalar moe_intermediate_size configurations.
  • Creates shared experts conditionally based on n_shared_experts.
  • Uses DeepseekV3Gate for MoE routing (consistent with codebase patterns).
  • Supports latent MoE projections when moe_latent_size is configured.
  • Sets up CUDA events for multi-stream synchronization.

The local import of DeepseekV3Gate (line 124) to avoid circular dependency is an acceptable pattern.


217-261: LGTM: Multi-stream parallel execution for MoE.

The forward method efficiently parallelizes routed and shared expert computations using maybe_execute_in_parallel. The pattern of returning 0 from _compute_shared_output when no shared experts exist enables clean addition with the routed output.

Based on learnings, the gate returns topk_indices and topk_weights in the correct shape for the MoE call without needing reshaping.


264-310: LGTM: Layer type "E" support for MoE layers.

The NemotronHLayer extension cleanly adds support for MoE layers (layer_type == "E") while passing the required aux_stream_dict for multi-stream execution.


335-361: LGTM: Auxiliary stream initialization and propagation.

The model correctly initializes three CUDA streams for MoE operations and propagates them to all layers. Creating streams once at model initialization and sharing them across layers is the correct pattern for multi-stream execution.


416-423: LGTM: Config attribute compatibility handling.

The code gracefully handles different config attribute names (rms_norm_eps vs layer_norm_epsilon) that may appear in different model configurations, normalizing to rms_norm_eps for downstream consistency.

tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (5)

4-4: LGTM: Import addition for similarity comparison.

The similar import enables fuzzy string matching for model outputs where exact matches aren't expected.


34-56: LGTM: Parametrized model folder support.

The function signature change makes model_folder explicit, improving test clarity and enabling multi-model testing. The increased max_num_tokens default (8192) accommodates larger model context requirements.


59-66: LGTM: Multi-model parametrization with memory guards.

The parametrization correctly:

  • Tests both 8B and 30B NemotronH variants.
  • Uses skip_gpu_memory_less_than to skip tests on GPUs with insufficient memory.
  • Memory estimates (2 × model_size + 1GB) are reasonable for FP16/BF16 inference.

83-198: Well-structured per-model reference data with appropriate tolerances.

The test correctly handles model-specific variations:

  • 8B model: tighter tolerance (atol=0.2), exact completion matching.
  • 30B model: relaxed tolerance (atol=0.4), similarity-based completion matching.

The comments documenting reference sources (commit hashes, hardware) provide valuable provenance for the reference values.


314-467: LGTM: Explicit model folder in remaining tests.

The test_nemotron_h_cuda_graph_overlap_scheduler and test_nemotron_h_chunked_prefill tests are correctly updated to explicitly specify model_folder="Nemotron-H-8B-Base-8K", maintaining their existing behavior.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (9)

221-223: LGTM!

The default bias shape calculation correctly incorporates intermediate_size_expand_ratio to handle both gated and non-gated activation types.


426-429: LGTM!

Consistent use of intermediate_size_expand_ratio for the fused gate/up projection weight shape.


493-508: LGTM!

The split_length calculation correctly adapts to the intermediate_size_expand_ratio for splitting the fused w3/w1 weights during requantization.


516-519: LGTM!

Consistent application of intermediate_size_expand_ratio for FP8 QDQ method.


1567-1601: Good refactoring to centralize shape computation.

The new get_weights_shapes method cleanly encapsulates shape calculations and enables subclass overrides for alignment requirements. The return tuple structure is well-organized.


1611-1654: LGTM!

The refactored create_weights correctly sets scaling_vector_size before calling get_weights_shapes, ensuring subclass overrides have access to this value. The use of keyword arguments when calling the parent's create_weights improves clarity.


1937-1950: LGTM!

The padding logic correctly handles dimension mismatches by zero-padding to the aligned destination shape. The contiguous() call ensures proper memory layout after padding.
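
Since the pad order is easy to get wrong, a tiny self-contained reminder of how torch.nn.functional.pad treats a 2D tensor (the shapes below are illustrative, not the actual scale shapes):

import torch
import torch.nn.functional as F

src = torch.ones(240, 32)         # unaligned shard
dst_rows, dst_cols = 256, 32      # aligned destination shape

# F.pad pads the LAST dimension first: (col_left, col_right, row_top, row_bottom).
padded = F.pad(src, (0, dst_cols - src.shape[1], 0, dst_rows - src.shape[0]))
assert padded.shape == (dst_rows, dst_cols)
padded = padded.contiguous()      # clean memory layout before copying into the parameter
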


1972-1981: LGTM!

Consistent padding implementation for w2 weight scales, matching the w3_w1 approach.


1993-2045: LGTM!

The weight loading overrides are well-implemented with:

  • Clear docstrings explaining the Cutlass alignment purpose
  • Consistent padding logic matching the weight scale loading
  • Appropriate use of non_blocking=True for GPU efficiency
  • Helpful inline comment documenting the pad order

@Wanli-Jiang Wanli-Jiang requested review from omera-nv and removed request for 2ez4bz, leslie-fang25, tomeras91 and yuxianq November 25, 2025 08:21

def get_weights_shapes(self, module: torch.nn.Module, weight_vec_size: int,
                       block_scales_vec_size: int):
    # Divide by 16 because we use int64 to pack 16 fp4 values

Maybe noob question: is weight_vec_size expected to be 16? Does it have to be 16? If so, it may be better to hardcode it, if not, maybe the comment can be left out?

EDIT: nevermind, I see you just moved it from the original code.


if module.bias:
    w3_w1_bias_shape = (module.expert_size_per_partition,
                        module.intermediate_size_per_partition *

Nit: seems like the following value could be assigned to a variable as it is used in multiple places

module.intermediate_size_per_partition * module.intermediate_size_expand_ratio

dst_row, dst_col = dst_w3_w1_weight_scale.shape
_row, _col = cast_w31_weight_scale.shape
if _row != dst_row or _col != dst_col:
    cast_w31_weight_scale = torch.nn.functional.pad(

Nit: leave a comment explaining why we are doing these few lines? Similar below.

device=device)
cast_w2_weight_shard = w2_weight_shard.view(dst_w2_weight.dtype)

dst_row, dst_col = dst_w2_weight.shape

I may be missing some subtle differences, but at first glance it seems like these few lines below are a common pattern between the different load_ methods. Is there any way they could be made a helper function?
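
Illustrative only (the name and exact handling below are assumptions, not the PR's code), the shared pattern could be factored out roughly like this:

import torch
import torch.nn.functional as F

def copy_with_zero_pad(dst: torch.Tensor, src: torch.Tensor) -> None:
    # Zero-pad a 2D source shard up to dst's shape, then copy it into dst.
    dst_row, dst_col = dst.shape
    src_row, src_col = src.shape
    if (src_row, src_col) != (dst_row, dst_col):
        # Pad order: last dim first, i.e. (col_left, col_right, row_top, row_bottom).
        src = F.pad(src, (0, dst_col - src_col, 0, dst_row - src_row)).contiguous()
    dst.copy_(src, non_blocking=True)

The real load_ methods may also need the dtype reinterpretation (.view(dst.dtype)) and device placement that the current code does inline.
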

block_scales_vec_size = 4 # 4 fp8 values packed into int32

# Check if hidden_size is divisible by NVFP4_COL_ALIGNMENT
if hidden_size % NVFP4_COL_ALIGNMENT != 0:

I would split out the positive and negative tests into their own test functions. If this check fails, none of the assertions below it get executed.
