
Conversation

@syuoni
Collaborator

@syuoni syuoni commented Nov 19, 2025

[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion)

Description

This PR integrates two grouped GEMM fusions for the CuteDSL MoE backend:

  • Grouped GEMM1 + SwiGLU + NVFP4 DyQuant
  • Grouped GEMM2 + Finalize (Unpermute)

The fusions support NVFP4 on B200/GB200.
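For intuition, the math the two fusions compute can be written as an unfused, pure-PyTorch reference. This is only a sketch: the function name, the weight layouts, the gate/linear half ordering, and the omission of the NVFP4 quantization and scale-factor handling are assumptions for illustration, not the kernel API.

import torch
import torch.nn.functional as F

def moe_experts_reference(x, w3_w1, w2, expert_ids, routing_scales):
    """x: (num_tokens, hidden); w3_w1: (E, 2*inter, hidden); w2: (E, hidden, inter);
    expert_ids / routing_scales: (num_tokens, top_k). NVFP4 quantization is elided."""
    num_tokens, top_k = expert_ids.shape
    out = torch.zeros_like(x, dtype=torch.float32)
    for t in range(num_tokens):
        for k in range(top_k):
            e = expert_ids[t, k]
            # Grouped GEMM1 + SwiGLU (the gate/linear half order is an assumption).
            h = x[t].float() @ w3_w1[e].float().T          # (2*inter,)
            gate, linear = h.chunk(2)
            act = F.silu(gate) * linear                    # (inter,)
            # Grouped GEMM2 + Finalize (Unpermute): scale by the routing weight and
            # accumulate into the token's row of the final output.
            out[t] += routing_scales[t, k].float() * (act @ w2[e].float().T)
    return out.to(x.dtype)

The configuration and command below exercise the new backend end to end: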

cat > extra_llm_api_options.yaml <<EOF
enable_attention_dp: true
cuda_graph_config:
  max_batch_size: 128
  enable_padding: true
moe_config:
  backend: CUTEDSL
  max_num_tokens: 8192
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF

trtllm-eval --model nvidia/DeepSeek-R1-FP4 \
    --tp_size 8 \
    --ep_size 8 \
    --max_num_tokens 6144 \
    --max_seq_len 6144 \
    --kv_cache_free_gpu_memory_fraction 0.8 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    gsm8k

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced Blackwell (SM 100) architecture support with optimized grouped GEMM operations featuring fusion variants
    • Streamlined fused MoE pipeline with integrated finalize operations reducing intermediate steps
  • Performance Improvements

    • Improved kernel selection and caching mechanisms for better runtime efficiency
    • Optimized data movement and memory access patterns for grouped operations
  • Tests

    • Expanded test coverage for new kernel variants and Blackwell-specific code paths

@syuoni syuoni changed the title [None][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) Nov 19, 2025
@syuoni syuoni self-assigned this Nov 19, 2025
@syuoni syuoni marked this pull request as ready for review November 19, 2025 10:05
@syuoni syuoni requested review from a team as code owners November 19, 2025 10:05
@syuoni
Collaborator Author

syuoni commented Nov 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25038 [ run ] triggered by Bot. Commit: e47b434

@coderabbitai
Contributor

coderabbitai bot commented Nov 19, 2025

📝 Walkthrough

Walkthrough

Comprehensive refactoring of Blackwell GEMM kernel infrastructure. Renames runner and kernel classes from "Persistent" to "Contiguous" variants, introduces persistent tile scheduling with finalize fusion support, implements per-class kernel caching, refactors fused MoE paths to use new single-stage grouped GEMM with finalize operations, and adds post-load weight transformation hooks.

Changes

  • Core custom ops refactoring (tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py): Renamed CuteDSLNVFP4BlackwellLinear to CuteDSLNVFP4BlackwellRunner; introduced class-level kernel_class and kernel_cache attributes; updated get_valid_tactics() to return (mma_tiler_mn, cluster_shape_mn) tuples; replaced per-instance kernel caching with per-class caching keyed by shape/tiler parameters (see the caching sketch after this list); updated forward paths with new argument ordering and return value handling; renamed grouped/finalize/swiglu runner classes from "Persistent" to "Contiguous" variants.
  • Kernel class renames & method updates (tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py): Renamed class from Sm100BlockScaledPersistentGroupedGemmKernel to Sm100BlockScaledContiguousGroupedGemmKernel; converted can_implement() from staticmethod to classmethod with cls parameter; updated all internal validation calls to use cls-based references; changed wrapper parameter l type from cutlass.Constexpr to int; removed mark_layout_dynamic() call; imported is_power_of_2 from .utils.
  • New finalize fusion kernel (tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py): Added new kernel class Sm100BlockScaledContiguousGroupedGemmFinalizeFusionKernel with persistent tile scheduling, TMA-based memory access, and an elaborate epilogue finalize path; introduced three atomic add helper functions (vectorized_atomic_add_bf16x8, vectorized_atomic_add_fp32x2, atomic_add_func); implements complete GEMM orchestration with scale-factor routing and finalized accumulation.
  • Dense GEMM kernel updates (tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py): Switched to relative imports for pipeline classes; added is_power_of_2 import from .utils; converted can_implement() to classmethod with cls parameter; renamed __call__() to wrapper(); removed Sm100BlockScaledPersistentDenseGemmKernelWrapper class; updated internal validation calls to use cls references.
  • Utility additions (tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py): Added module-level utility function is_power_of_2(x: int) -> bool for power-of-two validation using bitwise operations.
  • Fused MoE operator refactoring (tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py): Replaced the two-step grouped GEMM path with a single grouped_gemm_swiglu_blackwell() call; replaced intermediate quantization and the second GEMM with a single grouped_gemm_finalize_blackwell() call; added global_sf, tile_idx_to_mn_limit, permuted_idx_to_expanded_idx, and token_final_scales parameters; streamlined the multi-stage pipeline to a single finalize operation.
  • Quantization infrastructure (tensorrt_llm/_torch/modules/fused_moe/quantization.py): Added interleave_linear_and_gate(x, group_size=64, dim=-1) utility function; imported swizzle_sf and unswizzle_sf from utils; introduced post_load_weights(module) methods in FusedMoEMethodBase, NVFP4FusedMoEMethod, and MXFP4TRTLLMGenFusedMoEMethod for post-load weight interleaving and scale factor transformations.
  • Test expansion (tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py): Updated imports to include interleave_linear_and_gate, swizzle_sf, unswizzle_sf; replaced the hard-coded quantization size with an sf_vec_size parameter; added new test functions test_nvfp4_grouped_gemm_finalize_blackwell() and test_nvfp4_grouped_gemm_swiglu_blackwell() with extended parameterization; introduced interleave/swizzle transformations and SM-100 gating; added validation guards for tile counts and token distributions.
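The per-class kernel caching called out in the first entry can be pictured with a short Python sketch; the class name, method name, and exact cache key below are illustrative assumptions rather than the actual runner API:

class RunnerCacheSketch:
    # Shared across all instances of this runner class, so a kernel compiled for a
    # given tiler/cluster configuration is reused instead of recompiled per instance.
    kernel_cache = {}

    def _get_or_compile(self, sf_vec_size, mma_tiler_mn, cluster_shape_mn, compile_fn):
        # Key only on compile-time shape/tiler parameters; runtime token counts do not
        # fragment the cache.
        key = (sf_vec_size, tuple(mma_tiler_mn), tuple(cluster_shape_mn))
        kernel = type(self).kernel_cache.get(key)
        if kernel is None:
            kernel = compile_fn()  # stands in for the cute.compile(...) call in the real code
            type(self).kernel_cache[key] = kernel
        return kernel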

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant fused_moe_cute_dsl
    participant cute_dsl_custom_ops
    participant GroupedGemmSwigluRunner
    participant GroupedGemmFinalizeRunner
    participant Kernel

    User->>fused_moe_cute_dsl: forward(x)
    
    rect rgb(200, 220, 255)
        Note over fused_moe_cute_dsl: Stage 1: Swiglu Fusion
        fused_moe_cute_dsl->>cute_dsl_custom_ops: grouped_gemm_swiglu_blackwell()
        cute_dsl_custom_ops->>GroupedGemmSwigluRunner: forward(inputs, tactic)
        GroupedGemmSwigluRunner->>Kernel: Launch with swiglu epilogue
        Kernel-->>GroupedGemmSwigluRunner: (x, x_sf)
        GroupedGemmSwigluRunner-->>cute_dsl_custom_ops: (x, x_sf)
        cute_dsl_custom_ops-->>fused_moe_cute_dsl: (x, x_sf)
    end
    
    rect rgb(220, 255, 220)
        Note over fused_moe_cute_dsl: Stage 2: Finalize Fusion
        fused_moe_cute_dsl->>cute_dsl_custom_ops: grouped_gemm_finalize_blackwell()
        cute_dsl_custom_ops->>GroupedGemmFinalizeRunner: forward(inputs, tactic)
        GroupedGemmFinalizeRunner->>Kernel: Launch with finalize epilogue<br/>(atomic writes, routing)
        Kernel-->>GroupedGemmFinalizeRunner: x_final
        GroupedGemmFinalizeRunner-->>cute_dsl_custom_ops: x_final
        cute_dsl_custom_ops-->>fused_moe_cute_dsl: x_final
    end
    
    fused_moe_cute_dsl-->>User: x_final

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • cute_dsl_custom_ops.py: Class-level kernel caching semantics and forward argument ordering changes; ensure tactic tuning flows correctly with new (mma_tiler_mn, cluster_shape_mn) tuple format
  • blockscaled_contiguous_grouped_gemm_finalize_fusion.py: New 600+ line kernel implementation with complex persistent tile scheduling, TMA setup, and elaborate epilogue logic including atomic operations and per-expert routing
  • fused_moe_cute_dsl.py: Pipeline restructuring from multi-stage to two-stage finalize path; verify correctness of token_final_scales and permuted_idx_to_expanded_idx propagation through both stages
  • quantization.py: Post-load weight transformation hooks; verify interleaving and swizzle/unswizzle operations preserve numerical correctness across all MoE variants
  • Test validation: New test functions exercise novel code paths; ensure SM-100 gating is properly applied and tile/token distribution guards are sufficient

Possibly related PRs

Suggested reviewers

  • yuxianq

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 30.95%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main changes: integrating CuteDSL NVFP4 grouped GEMM with SwiGLU and Finalize Fusion support for B200/GB200.
  • Description check: ✅ Passed. The description explains the core feature addition (two grouped GEMM fusions for CuteDSL MoE) and includes a practical usage example with configuration, but lacks explicit test coverage details.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

1831-1858: CUTEDSL NVFP4 FC1 post-load interleaving matches the test-side layout; consider making scaling_vector_size explicit

The CUTEDSL-specific post_load_weights:

  • Reinterprets w3_w1_weight as float4_e2m1x2 and applies interleave_linear_and_gate(..., group_size=64, dim=1), which matches how the SwiGLU fusion tests build b_interleaved.
  • Unswizzles, interleaves, then re-swizzles fc1_weight_block via unswizzle_sf → interleave_linear_and_gate → swizzle_sf, keeping weight/scales layouts in sync.

Functionally this aligns the Python-side weight layout with what the new grouped GEMM + SwiGLU kernel expects.

To future-proof against possible non‑16 scaling_vector_size for CUTEDSL, you might want to pass module.scaling_vector_size explicitly to unswizzle_sf/swizzle_sf instead of relying on their default, e.g.:

- w3_w1_weight_scale_unswizzled = unswizzle_sf(
-     w3_w1_weight_scale, m, n).view(-1, m,
-                                    n // module.scaling_vector_size)
+ w3_w1_weight_scale_unswizzled = unswizzle_sf(
+     w3_w1_weight_scale, m, n, module.scaling_vector_size).view(
+         -1, m, n // module.scaling_vector_size)
...
- w3_w1_weight_scale_interleaved = swizzle_sf(
-     w3_w1_weight_scale_unswizzled_interleaved, m,
-     n).view(-1, m, n // module.scaling_vector_size)
+ w3_w1_weight_scale_interleaved = swizzle_sf(
+     w3_w1_weight_scale_unswizzled_interleaved, m, n,
+     module.scaling_vector_size).view(-1, m,
+                                      n // module.scaling_vector_size)

Not urgent for the current B200/GB200 path, but it would avoid subtle bugs if the vector size ever changes for CUTEDSL.

tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (1)

343-425: NVFP4 grouped GEMM tests look sound; tighten minor nits and double-check finalize reference dimensions

The three new NVFP4 Blackwell tests collectively exercise:

  • Plain grouped GEMM vs cute_dsl_nvfp4_grouped_gemm_ref under realistic routing distributions, ensuring only the first num_valid_permuted_tokens are compared.
  • Finalize fusion vs a two-step grouped_gemm_ref + moe_unpermute baseline using moe_sort’s tile mappings and token_final_scales.
  • SwiGLU fusion vs grouped_gemm_ref → swiglu_ref → fp4_quantize, including weight/scales interleaving via interleave_linear_and_gate and swizzle_sf/unswizzle_sf.

A couple of small cleanups and one thing to sanity-check:

  1. Unused total_num_padded_tokens from moe_sort (RUF059)
    In test_nvfp4_grouped_gemm_finalize_blackwell, total_num_padded_tokens is unpacked but never used. To keep linters quiet and signal intent, consider:

    -    (
    -        tile_idx_to_group_idx,
    -        tile_idx_to_mn_limit,
    -        expanded_idx_to_permuted_idx,
    -        permuted_idx_to_expanded_idx,
    -        total_num_padded_tokens,
    -        num_non_exiting_tiles,
    -    ) = torch.ops.trtllm.moe_sort(
    +    (
    +        tile_idx_to_group_idx,
    +        tile_idx_to_mn_limit,
    +        expanded_idx_to_permuted_idx,
    +        permuted_idx_to_expanded_idx,
    +        _total_num_padded_tokens,
    +        num_non_exiting_tiles,
    +    ) = torch.ops.trtllm.moe_sort(
  2. Unused loop index i (B007)
    In the expert-loop tile filling (both in test_nvfp4_grouped_gemm_blackwell and test_nvfp4_grouped_gemm_swiglu_blackwell), i isn’t used inside the body. Renaming to _ makes that explicit and silences the linter:

    -    for i in range(num_tiles_per_expert[expert_idx].item()):
    +    for _ in range(num_tiles_per_expert[expert_idx].item()):
            tile_idx_to_group_idx[tile_idx] = expert_idx
            tile_idx += 1
  3. Dimension semantics in the finalize test
    In test_nvfp4_grouped_gemm_finalize_blackwell, the reference path uses:

    • b shaped (num_local_experts, interm_size, hidden_size) and
    • cute_dsl_nvfp4_grouped_gemm_ref returning a tensor of shape (max_num_permuted_tokens, interm_size),
      which is then passed to moe_unpermute, yielding (num_tokens, interm_size).

    You then compare this c_ref against c from cute_dsl_nvfp4_grouped_gemm_finalize_blackwell elementwise. If the finalize kernel is intended to perform GEMM2 (down-projection) plus unpermute, its natural output shape would be (num_tokens, hidden_size) rather than (num_tokens, interm_size). The current shapes imply the kernel might instead be finalizing a GEMM whose output dimension is interm_size.

    Please double-check that:

    • The intended finalize fusion really operates in the (max_num_permuted_tokens, interm_size) → (num_tokens, interm_size) space, and
    • The order of dimensions in b matches the kernel’s W2 layout (vs. its transpose).

    If the expectation is actually (num_tokens, hidden_size), then the reference b layout or which dimension you treat as n in grouped_gemm_ref may need to be adjusted accordingly.

Overall, the tests provide good coverage for the new CuteDSL NVFP4 Blackwell kernels; the above are minor polish and a request to confirm the finalize test’s dimensional intent.

Also applies to: 433-524, 532-643

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (4)

345-347: Synthetic token distribution helper is logically sound but can overshoot total tokens.

The GroupedGemmInputsHelper machinery (tuning buckets, shape inference, padding with a large pad_val, and the pre-hooks for grouped and finalize-fusion runners) is a good way to keep autotuning traffic realistic without depending on actual MoE routing. However, generate_num_tokens_per_expert() uses curr_num_tokens = int(balance) + 1, which can make sum(num_tokens_per_expert) > num_tokens * top_k. That in turn inflates tile_idx_to_mn_limit and permuted_idx_to_expanded_idx beyond what a perfectly balanced routing would need.

Even though this is only used for tuning and not for real routing (where moe_sort provides exact mappings), it’s safer and more controllable to cap the per-expert allocation so that the sum never exceeds num_tokens * top_k and all derived indices remain within the natural expanded range. For example, you could clamp curr_num_tokens and track a remaining budget:

-                curr_num_tokens = int(balance) + 1
+                curr_num_tokens = max(1, min(int(balance), num_tokens * self.top_k))

and then stop allocating once the global budget is exhausted.

Also applies to: 360-367, 378-421, 423-441, 442-507
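
A hedged sketch of the budget-capped allocation suggested above; the function name and rounding policy are hypothetical:

def generate_capped_num_tokens_per_expert(num_tokens: int, top_k: int, num_experts: int) -> list:
    # Never hand out more expanded tokens than routing can actually produce.
    budget = num_tokens * top_k
    balance = budget / num_experts
    counts = []
    for _ in range(num_experts):
        curr = min(max(1, round(balance)), budget) if budget > 0 else 0
        counts.append(curr)
        budget -= curr
    return counts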


509-575: Grouped contiguous GEMM runner: API, constraints, and kernel invocation align with kernel expectations.

Sm100BlockScaledContiguousGroupedGemmRunner cleanly wraps the contiguous grouped kernel: get_valid_tactics feeds kernel_class.can_implement with the same dtype/layout/tiling parameters the kernel documents, get_tuning_config uses GroupedGemmInputsHelper to derive buckets and enforce scale/tiling constraints on args 2 and 5, and forward checks all tensor shapes, dtypes, and tile-related invariants before constructing CUTLASS pointers. The cute.compile/call of gemm.wrapper matches that kernel’s wrapper signature (including passing tile_size/scaling_vector_size as Constexprs and leaving them out of the runtime call), and the class-level kernel_cache keyed on (scaling_vector_size, tile_size, mma_tiler_mn, cluster_shape_mn) is appropriate for sharing compiled kernels across runner instances.

One small style tweak to consider: kernel_cache is a mutable class attribute and could be annotated as ClassVar[dict] to satisfy Ruff’s RUF012 and make the intent explicit.

Also applies to: 585-595, 623-628, 668-699, 702-717, 719-777


1095-1108: SwiGLU-fusion runner and op: shape checks and kernel invocation look well thought out.

Sm100BlockScaledContiguousGroupedGemmSwigluFusionRunner mirrors the grouped runner while adding SwiGLU-specific constraints: it validates n against the intermediate size and scaling vector, enforces a stricter n % (scaling_vector_size * 4 * 2) == 0, and allocates c/c_sf with shapes matching the fused kernel’s expectations (m × intermediate/2 output and m*intermediate / scaling_vector_size scales). The get_valid_tactics and get_tuning_config re-use the grouped helper correctly, and the CUTLASS pointer wiring plus cute.compile/call of gemm.wrapper include the extra c_sf and global_sf parameters in the right order. The paired custom op with its fake implementation returns tensors of matching shapes and dtypes, so tracing/FX should also see consistent types.

As with other runners, annotating kernel_cache and tuning_config_cache as ClassVar would quiet RUF012 but is not functionally required.

Also applies to: 1116-1179, 1180-1220, 1221-1255, 1256-1322, 1324-1391


35-38: Class-level caches could be annotated as ClassVar to satisfy static analysis.

Several runners declare mutable class attributes (kernel_cache, tuning_config_cache) as bare attributes:

  • CuteDSLNVFP4BlackwellRunner.kernel_cache
  • Sm100BlockScaledContiguousGroupedGemmRunner.kernel_cache / tuning_config_cache
  • Sm100BlockScaledContiguousGroupedGemmFinalizeFusionRunner.kernel_cache / tuning_config_cache
  • Sm100BlockScaledContiguousGroupedGemmSwigluFusionRunner.kernel_cache / tuning_config_cache
  • Sm100BlockScaledFusedMoERunner.tuning_config_cache

Ruff’s RUF012 warning applies here, but functionally this pattern is fine. If you’d like to silence the linter and document intent, you can annotate them as ClassVar[dict] (or a more precise mapping type) from typing.

Also applies to: 511-513, 783-784, 1098-1100, 1420-1421
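
A minimal example of the suggested annotation (the class name here is illustrative):

from typing import ClassVar

class SomeRunner:
    # ClassVar documents that these caches are shared class-level state and
    # silences Ruff's RUF012 warning about mutable class attributes.
    kernel_cache: ClassVar[dict] = {}
    tuning_config_cache: ClassVar[dict] = {}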

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7905d6c and e47b434.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (18 hunks)
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (5 hunks)
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1 hunks)
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (4 hunks)
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (3 hunks)
  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (7 hunks)
🧰 Additional context used
🧠 Learnings (15)
📓 Common learnings
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (2)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1)
  • is_power_of_2 (191-192)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (5)
  • can_implement (2149-2231)
  • is_valid_dtypes_and_scale_factor_vec_size (1941-1992)
  • is_valid_layouts (1995-2025)
  • is_valid_mma_tiler_and_cluster_shape (2028-2093)
  • is_valid_tensor_alignment (2096-2146)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (5)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
  • interleave_linear_and_gate (155-166)
tensorrt_llm/_torch/utils.py (2)
  • swizzle_sf (140-155)
  • unswizzle_sf (158-173)
cpp/tensorrt_llm/thop/fp4Quantize.cpp (2)
  • fp4_quantize (41-155)
  • fp4_quantize (41-42)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (11)
  • cute_dsl_nvfp4_grouped_gemm_blackwell (722-755)
  • cute_dsl_nvfp4_grouped_gemm_finalize_blackwell (1027-1065)
  • GroupedGemmInputsHelper (335-507)
  • get_max_num_tiles (348-354)
  • get_max_num_permuted_tokens (356-357)
  • _ (319-333)
  • _ (759-777)
  • _ (1069-1093)
  • _ (1365-1391)
  • _ (1582-1602)
  • cute_dsl_nvfp4_grouped_gemm_swiglu_blackwell (1328-1361)
cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (4)
  • moe_sort (112-125)
  • moe_sort (112-114)
  • moe_unpermute (204-254)
  • moe_unpermute (204-205)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (2)
  • cute_dsl_nvfp4_grouped_gemm_swiglu_blackwell (1328-1361)
  • cute_dsl_nvfp4_grouped_gemm_finalize_blackwell (1027-1065)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (7)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (3)
  • Sm100BlockScaledContiguousGroupedGemmKernel (58-2282)
  • can_implement (2141-2223)
  • wrapper (2226-2282)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (3)
  • Sm100BlockScaledContiguousGroupedGemmFinalizeFusionKernel (215-2313)
  • can_implement (2149-2231)
  • wrapper (2234-2313)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py (3)
  • Sm100BlockScaledContiguousGroupedGemmSwigluFusionKernel (180-2682)
  • can_implement (2525-2607)
  • wrapper (2610-2682)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (3)
  • Sm100BlockScaledPersistentDenseGemmKernel (65-1963)
  • can_implement (1820-1877)
  • wrapper (1881-1963)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (2)
  • make_ptr (142-188)
  • dtype (118-119)
tensorrt_llm/_torch/autotuner.py (8)
  • TunableRunner (153-209)
  • choose_one (623-778)
  • get_valid_tactics (156-174)
  • OptimizationProfile (127-142)
  • TuningConfig (53-101)
  • forward (180-206)
  • AutoTuner (514-1193)
  • get (545-548)
tensorrt_llm/_torch/utils.py (2)
  • get_last_power_of_2_num_tokens_buckets (266-273)
  • last_positive_power_of_2 (243-248)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (2)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/custom_pipeline.py (2)
  • PipelineTmaUmma (73-269)
  • PipelineUmmaAsync (273-376)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1)
  • is_power_of_2 (191-192)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
tensorrt_llm/_torch/utils.py (2)
  • swizzle_sf (140-155)
  • unswizzle_sf (158-173)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (2)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (3)
  • is_power_of_2 (191-192)
  • dtype (118-119)
  • size_in_bytes (95-96)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (15)
  • _setup_attributes (185-331)
  • _compute_stages (1748-1867)
  • _compute_grid (1870-1902)
  • SharedStorage (523-562)
  • kernel (647-1598)
  • mainloop_s2t_copy_and_partition (603-643)
  • epilog_tmem_copy_and_partition (1600-1661)
  • _get_tma_atom_kind (1905-1930)
  • is_valid_dtypes_and_scale_factor_vec_size (1933-1984)
  • is_valid_layouts (1987-2017)
  • is_valid_mma_tiler_and_cluster_shape (2020-2085)
  • is_valid_tensor_alignment (2088-2138)
  • check_contigous_16B_alignment (2126-2130)
  • can_implement (2141-2223)
  • wrapper (2226-2282)
🪛 Ruff (0.14.5)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py

452-452: Unpacked variable total_num_padded_tokens is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


565-565: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

37-37: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


393-393: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


414-414: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


428-428: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


782-782: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


783-783: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


805-807: Avoid specifying long messages outside the exception class

(TRY003)


812-812: Unused method argument: profile

(ARG002)


813-813: Unused method argument: kwargs

(ARG002)


817-817: Ambiguous variable name: l

(E741)


885-885: Ambiguous variable name: l

(E741)


1072-1072: Unused function argument: input_scale

(ARG001)


1073-1073: Unused function argument: weight_scale

(ARG001)


1074-1074: Unused function argument: alpha

(ARG001)


1075-1075: Unused function argument: tile_idx_to_group_idx

(ARG001)


1076-1076: Unused function argument: tile_idx_to_mn_limit

(ARG001)


1077-1077: Unused function argument: permuted_idx_to_expanded_idx

(ARG001)


1078-1078: Unused function argument: num_non_exiting_tiles

(ARG001)


1080-1080: Unused function argument: num_experts

(ARG001)


1081-1081: Unused function argument: top_k

(ARG001)


1082-1082: Unused function argument: num_local_experts

(ARG001)


1083-1083: Unused function argument: local_expert_offset

(ARG001)


1084-1084: Unused function argument: tile_size

(ARG001)


1086-1086: Unused function argument: scaling_vector_size

(ARG001)


1098-1098: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


1099-1099: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


1117-1119: Avoid specifying long messages outside the exception class

(TRY003)


1124-1124: Unused method argument: profile

(ARG002)


1125-1125: Unused method argument: kwargs

(ARG002)


1129-1129: Ambiguous variable name: l

(E741)


1195-1195: Ambiguous variable name: l

(E741)


1369-1369: Unused function argument: weight_scale

(ARG001)


1370-1370: Unused function argument: alpha

(ARG001)


1371-1371: Unused function argument: tile_idx_to_group_idx

(ARG001)


1372-1372: Unused function argument: num_non_exiting_tiles

(ARG001)


1373-1373: Unused function argument: global_sf

(ARG001)


1374-1374: Unused function argument: num_experts

(ARG001)


1375-1375: Unused function argument: top_k

(ARG001)


1376-1376: Unused function argument: num_local_experts

(ARG001)


1377-1377: Unused function argument: local_expert_offset

(ARG001)


1378-1378: Unused function argument: tile_size

(ARG001)


1420-1420: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


1441-1441: Unused method argument: inputs

(ARG002)


1442-1442: Unused method argument: profile

(ARG002)


1443-1443: Unused method argument: kwargs

(ARG002)

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py

154-154: Unused function argument: loc

(ARG001)


154-154: Unused function argument: ip

(ARG001)


171-171: Unused function argument: loc

(ARG001)


171-171: Unused function argument: ip

(ARG001)


563-563: Avoid specifying long messages outside the exception class

(TRY003)


568-568: Avoid specifying long messages outside the exception class

(TRY003)


838-838: Unused method argument: c_smem_layout_staged

(ARG002)


841-841: Unused method argument: epilogue_op

(ARG002)


864-864: Unpacked variable bidy is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


864-864: Unpacked variable bidz is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1220-1220: Loop control variable k_tile not used within loop body

Rename unused k_tile to _k_tile

(B007)


1427-1427: Loop control variable k_tile not used within loop body

Rename unused k_tile to _k_tile

(B007)


1778-1778: Unused static method argument: epi_tile

(ARG004)


1779-1779: Unused static method argument: out_dtype

(ARG004)


1780-1780: Unused static method argument: gemm_output_layout

(ARG004)


1938-1938: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1)

191-192: is_power_of_2 implementation is correct and idiomatic

The bitwise check with x > 0 and x & (x - 1) is the standard, efficient way to test for positive powers of two. No changes needed.
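
For reference, the idiom described here amounts to the following (a sketch matching the description, not necessarily the verbatim utils.py code):

def is_power_of_2(x: int) -> bool:
    # True only for positive integers with exactly one bit set.
    return x > 0 and (x & (x - 1)) == 0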

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1)

391-423: CuteDSL NVFP4 SwiGLU + finalize wiring looks consistent with the new kernels

The new sequence

  • moe_sort → moe_permute
  • cute_dsl_nvfp4_grouped_gemm_swiglu_blackwell (returning x, x_sf)
  • cute_dsl_nvfp4_grouped_gemm_finalize_blackwell (consuming x, x_sf, tile mappings, and token_final_scales)

is consistent with the custom-op signatures and the new tests: dtypes/views (float4_e2m1fn_x2 / uint8) and shapes for tile_idx_to_group_idx, tile_idx_to_mn_limit, permuted_idx_to_expanded_idx, and token_final_scales all line up.

One thing to double-check offline: using self.fc2_input_scale as global_sf for the SwiGLU fused GEMM gives numerically equivalent results to the previous explicit quantize + second GEMM path across your supported configs.
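
When checking that propagation, a hedged pure-PyTorch reference of the finalize (unpermute) accumulation may help; the expanded-index encoding, the padding sentinel, and the tensor shapes are assumptions, not the kernel contract:

import torch

def finalize_reference(gemm_out, permuted_idx_to_expanded_idx, token_final_scales, num_tokens, top_k):
    # gemm_out rows are assumed to be in permuted (expert-contiguous, tile-padded) order;
    # token_final_scales is assumed to have shape (num_tokens, top_k).
    hidden = gemm_out.shape[-1]
    out = torch.zeros(num_tokens, hidden, dtype=torch.float32, device=gemm_out.device)
    for p, e in enumerate(permuted_idx_to_expanded_idx.tolist()):
        if e < 0:  # assumed sentinel marking padding rows
            continue
        t, k = divmod(e, top_k)  # assumed encoding: expanded index = token * top_k + slot
        out[t] += token_final_scales[t, k].float() * gemm_out[p].float()
    return out.to(gemm_out.dtype)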

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

155-166: interleave_linear_and_gate correctly performs gate/linear interleaving

The reshape + transpose pattern along dim implements the desired [gate[group], linear[group]] interleaving in fixed group_size blocks and preserves the original shape. The modulo-normalized dim and assertion on group_size * 2 keep misuse guarded. This looks good.
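
A rough re-implementation of that pattern, for intuition only; the function name is hypothetical and which half corresponds to gate vs. linear is an assumption:

import torch

def interleave_linear_and_gate_sketch(x: torch.Tensor, group_size: int = 64, dim: int = -1) -> torch.Tensor:
    dim = dim % x.dim()
    n = x.shape[dim]
    assert n % (group_size * 2) == 0
    # View the [half0 | half1] dimension as (2, n_groups, group_size), then swap the half
    # and group axes so fixed-size groups from the two halves alternate in the output.
    shape = tuple(x.shape[:dim]) + (2, n // (2 * group_size), group_size) + tuple(x.shape[dim + 1:])
    y = x.reshape(shape).transpose(dim, dim + 1)
    return y.reshape(x.shape)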

tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (2)

6-7: Test imports align with new MoE quantization utilities

Bringing in interleave_linear_and_gate plus swizzle_sf/unswizzle_sf from the fused MoE utilities matches how the kernels expect weights and scales to be laid out and keeps the tests close to the production code paths.


292-302: fp4_quantize now consistently uses sf_vec_size in the SwiGLU NVFP4 quantize test

Passing sf_vec_size into torch.ops.trtllm.fp4_quantize keeps the test’s reference path in sync with the kernel under test (moe_swiglu_nvfp4_quantize) and avoids hard-coding the scaling-vector size.

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (2)

55-96: Importing is_power_of_2 and renaming the kernel class are consistent with the new contiguous path.

Using the shared .utils.is_power_of_2 matches the finalize-fusion helper and avoids duplicating the bit-twiddling logic, and the Sm100BlockScaledContiguousGroupedGemmKernel name + docstring/example updates align with the new contiguous/grouped GEMM naming across the PR.


2140-2219: can_implement classmethod and wrapper signature line up with runner usage.

The switch of can_implement to a @classmethod and routing through cls.is_valid_* keeps validation logic reusable for subclasses, and the parameter list (including m_aligned) matches the calls from Sm100BlockScaledContiguousGroupedGemmRunner.get_valid_tactics. The wrapper’s l: int change also matches how l is passed at runtime from the runner.

Also applies to: 2239-2245

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (1)

61-62: Custom pipeline imports, classmethod can_implement, and wrapper/cvt_sf helper integrate cleanly.

Switching to .custom_pipeline.PipelineTmaUmma/PipelineUmmaAsync and to the shared .utils.is_power_of_2 keeps this kernel aligned with the newer Blackwell infrastructure. The @classmethod can_implement signature matches both run() and CuteDSLNVFP4BlackwellRunner.get_valid_tactics, and the wrapper’s argument ordering is consistent with the cute.compile/call pattern (with l, max_active_clusters, and swap_ab as Constexprs). The new cvt_sf_MKL_to_M32x4xrm_K4xrk_L JIT helper is a straightforward MKL→MMA layout convert and is correctly wired into create_scale_factor_tensor().

Also applies to: 1819-1877, 1881-1900, 1966-1984

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (3)

25-38: NVFP4 dense runner wiring and kernel caching look consistent.

The import of Sm100BlockScaledPersistentDenseGemmKernel and its exposure as CuteDSLNVFP4BlackwellRunner.kernel_class, the SM100 guard, tactic generation (can_implement with fixed ab/sf/c dtypes and batch_size=1), and the kernel_cache keyed by (sf_vec_size, mma_tiler_mn, cluster_shape_mn, swap_ab) all line up with the refactored dense kernel. The cute.compile/compiled_gemm calls match the wrapper signature (with l, max_active_clusters, and swap_ab treated as Constexprs), and the custom op correctly drives the autotuner with runner.__class__.tuning_config and then dispatches via the chosen tactic.

Also applies to: 60-64, 65-139, 171-188, 201-215, 216-270, 272-286, 292-317


779-847: Finalize-fusion grouped runner and custom op are consistent with the finalize kernel contract.

Sm100BlockScaledContiguousGroupedGemmFinalizeFusionRunner follows the same pattern as the grouped runner: it gates on SM100 in __init__, uses kernel_class.can_implement with out_dtype=cutlass.BFloat16, and its get_tuning_config indexes (ConstraintSpec on args 2, 5–7, 9) match the finalize custom op’s argument ordering. In forward, the assertions on a/b/a_sf/b_sf/alpha, tile mappings, permuted_idx_to_expanded_idx length, and token_final_scales shape ensure we don’t feed invalid shapes to the kernel; the zero-initialized c matches the “scatter-add into final output” semantics of the finalize epilogue. The pointer construction and cute.compile/call of gemm.wrapper are in the correct order for the kernel’s wrapper, and the kernel_cache/tuning_config_cache usage parallels the grouped runner.

Given the earlier learning to avoid repeating finalize-fusion safety assertions, the current level of validation here feels sufficient.

Also applies to: 851-869, 870-922, 923-953, 957-1001


1393-1417: Fused MoE helper and runner are wired correctly for the new grouped/fusion ops.

FusedMoEInputsHelper.inputs_pre_hook intentionally randomizes token_selected_experts/token_final_scales for tuning, while preserving shapes and dtypes, and Sm100BlockScaledFusedMoERunner uses this plus the existing moe_sort/moe_permute/moe_unpermute ops to drive two grouped GEMMs and an intermediate SwiGLU quantize in sequence. The runner’s get_valid_tactics returning a single tile size (128) simplifies tuning for now, and get_tuning_config’s DynamicTensorSpec and constraint specs on inputs 1–3 line up with the fused custom op’s argument list. Overall, the fused MoE path correctly reuses the new grouped GEMM custom ops while keeping the public interface unchanged.

Also applies to: 1419-1465

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1)

1646-1681: Dismiss the barrier placement concern — this pattern is standard across related kernels.

After examining the barrier placement across related GEMM kernels (base blockscaled_contiguous_grouped_gemm.py and blockscaled_contiguous_grouped_gemm_swiglu_fusion.py), the epilog_sync_barrier.arrive_and_wait() call at line 1681 inside the subtile loop is intentional and consistent with established patterns, not an anomaly.

All three kernels follow the same epilogue structure:

  • Inside subtile loop: barrier after per-subtile operations (shared memory stores, atomic adds) to synchronize warps before proceeding to the next subtile
  • Inside loop but separate: additional barriers after pipeline operations
  • Outside loop: barrier after TMEM cleanup

The per-subtile synchronization ensures correctness by guaranteeing that atomic operations (which span multiple warps) complete before advancing to the next subtile. This is neither a performance bug nor redundant with the outer barrier, which serves a different synchronization point.

@tensorrt-cicd
Collaborator

PR_Github #25038 [ run ] completed with state SUCCESS. Commit: e47b434
/LLM/main/L0_MergeRequest_PR pipeline #18920 completed with status: 'FAILURE'

@syuoni
Collaborator Author

syuoni commented Nov 20, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25107 [ run ] triggered by Bot. Commit: c34bbca

@tensorrt-cicd
Collaborator

PR_Github #25107 [ run ] completed with state SUCCESS. Commit: c34bbca
/LLM/main/L0_MergeRequest_PR pipeline #18980 completed with status: 'FAILURE'

tile_tokens_dim=tile_size,
)
x = torch.ops.trtllm.cute_dsl_nvfp4_grouped_gemm_blackwell(
x = torch.ops.trtllm.cute_dsl_nvfp4_grouped_gemm_finalize_blackwell(
Collaborator


We should respect moe_disable_finalize_fusion to ensure we can handle cases where determinism is required

Collaborator Author


Thanks for the reminder, let me update.

Collaborator Author


Updated, please take a look. Thanks

Signed-off-by: Enwei Zhu <[email protected]>
@syuoni syuoni force-pushed the cutedsl-moe-fusion branch from c34bbca to 2b17db2 Compare November 20, 2025 08:23