MXFP4 x BF16 CUTLASS MoE backend perf and profiling improvement on Hopper #8721

StudyingShao · 2025-10-28T09:27:52Z

Summary by CodeRabbit

Optimization
- Enhanced mixed-precision conversion support for Hopper GPUs.
- Improved GEMM configuration and tactic selection for SM90 architecture.
- Optimized weight-only quantization with FP4 interleaving for Hopper mixed GEMM operations.
Refactor
- Streamlined Mixture of Experts weight preparation pipeline for 2D weight shards.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Jiang Shao <[email protected]>

… Hopper MXFP4 x BF16 Grouped GEMM. Signed-off-by: Jiang Shao <[email protected]>

Signed-off-by: Jiang Shao <[email protected]>

coderabbitai · 2025-10-28T09:34:34Z

📝 Walkthrough

The changes add FP4 interleaving support for Hopper mixed GEMM operations, introduce device-level conversion utilities for FP4/BF16 format handling with FP8 scaling overrides, refactor GEMM heuristic configuration selection to enforce COOPERATIVE scheduling and restrict certain tile sizes, and integrate FP4 interleaving into MoE weight preparation pipelines via PyTorch bindings.

Changes

Cohort / File(s)	Change Summary
Device Conversion Utilities `cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp`, `cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp`	Added `psx_cvt_lut_prmt_fp4x8_to_bf16x8_interleaved` device function for interleaved FP4-to-BF16 conversion with updated call sites. Introduced `scale_convertor` template method that overrides FP8 conversion behavior when `ElementA` is `cutlass::float_e2m1_t`, replacing three direct `static_cast<float>` conversions.
GEMM Heuristic Configuration `cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp`	Refactored `get_candidate_configs_sm90` weight-only grouped GEMM path: removed PINGPONG mainloop schedule option, enforced COOPERATIVE-only scheduling, added constraints to skip unsupported cooperative configurations and restrict CtaShape128x128x128B tile, replaced loop-based candidate generation with explicit construction of four configurations.
FP4 Interleaving Kernel & Header `cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu`, `cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.h`	Introduced new CUDA kernel `interleave_fp4_for_Hopper_mixed_gemm_kernel` that rearranges FP4 values by swapping nibbles between row pairs for column-major layout, paired with host launcher `interleave_fp4_for_Hopper_mixed_gemm`, which uses a fixed launch configuration of 1024 blocks of 1024 threads (`<<<1024, 1024>>>`). Provided corresponding header declarations.
PyTorch Weight Quantization Binding `cpp/tensorrt_llm/thop/weightOnlyQuantOp.cpp`	Added `fp4_interleave_for_Hopper_mixed_gemm` PyTorch function with input validation (uint8, CUDA, contiguous, n/k divisibility) and registered corresponding Torch operator.
MoE Weight Preparation Integration `tensorrt_llm/_torch/modules/fused_moe/quantization.py`	Integrated conditional FP4 interleaving in W4A8 Cutlass/FP8 fused MoE paths: applied `fp4_interleave_for_Hopper_mixed_gemm` to concatenated `w31_weight_shard` and `w2_weight_shard` after padding in 2D cases.
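
To make the FP4-to-BF16 conversion in the summary above concrete, here is a minimal host-side sketch of decoding packed E2M1 (FP4) values with a lookup table. It is not the PR's `psx_cvt_lut_prmt_fp4x8_to_bf16x8_interleaved` device function, which works on eight values at once. The names `decode_e2m1` and `decode_fp4x2` are hypothetical, and the nibble order (low nibble first) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// E2M1 magnitudes indexed by the low three bits of the code (bit 3 is the sign).
constexpr std::array<float, 8> kE2M1Magnitude = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode one 4-bit E2M1 code (0..15) to float.
inline float decode_e2m1(std::uint8_t code)
{
    assert(code < 16);
    float const magnitude = kE2M1Magnitude[code & 0x7u];
    return (code & 0x8u) ? -magnitude : magnitude;
}

// Unpack one byte holding two FP4 values.
// Assumption: the low nibble is the first element and the high nibble the second.
inline std::array<float, 2> decode_fp4x2(std::uint8_t packed)
{
    return {decode_e2m1(static_cast<std::uint8_t>(packed & 0x0Fu)),
        decode_e2m1(static_cast<std::uint8_t>(packed >> 4))};
}
```

A device implementation would typically emit BF16 bit patterns directly instead of going through `float`; the table-driven idea is the same.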

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

CUDA kernel logic (moe_gemm_mixed_utils.cu): Verify nibble-swapping algorithm correctness, indexing calculations for row/column block mapping, and memory access patterns for column-major input.
GEMM heuristic refactor (cutlass_heuristic.cpp): Confirm that removal of PINGPONG schedules and tile restrictions do not unexpectedly limit valid tactic space; validate early-exit conditions and cluster shape enumeration.
Scale conversion override (gemm collective header): Ensure FP8 fast-path conversion behavior and interaction with surrounding MMA calculations are correct.
Integration consistency (quantization.py): Verify conditional 2D logic, tensor shape assumptions (n/k divisibility), and alignment with kernel expectations.
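
One way to check the nibble-swapping logic is to compare the kernel output against a host reference. The sketch below follows the index arithmetic and masks shown in the review diffs for `moe_gemm_mixed_utils.cu`. The function name `interleave_fp4_reference` is hypothetical, and the iteration order (16-row tiles, pairing row `r` with row `r + 8`) is an assumption inferred from the `+ 8` offset, not confirmed kernel behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// rows and cols count FP4 elements; each byte packs two, so a row spans cols / 2 bytes.
// Assumption: rows are handled in 16-row tiles, pairing row r with row r + 8.
inline std::vector<std::uint8_t> interleave_fp4_reference(
    std::vector<std::uint8_t> const& weight, int rows, int cols)
{
    assert(rows % 16 == 0 && cols % 2 == 0);
    std::size_t const row_bytes = static_cast<std::size_t>(cols) / 2;
    assert(weight.size() == static_cast<std::size_t>(rows) * row_bytes);

    std::vector<std::uint8_t> out(weight.size());
    for (int tile = 0; tile < rows; tile += 16)
    {
        for (int r = tile; r < tile + 8; ++r)
        {
            for (std::size_t c = 0; c < row_bytes; ++c)
            {
                std::size_t const index_a = static_cast<std::size_t>(r) * row_bytes + c;
                std::size_t const index_b = static_cast<std::size_t>(r + 8) * row_bytes + c;
                std::uint8_t const a = weight[index_a];
                std::uint8_t const b = weight[index_b];
                // a keeps its low nibble and takes b's low nibble as its high nibble.
                out[index_a] = static_cast<std::uint8_t>((a & 0x0Fu) | ((b & 0x0Fu) << 4));
                // b keeps its high nibble and takes a's high nibble as its low nibble.
                out[index_b] = static_cast<std::uint8_t>((b & 0xF0u) | ((a & 0xF0u) >> 4));
            }
        }
    }
    return out;
}
```

A test could copy the device result back to the host and compare it byte for byte with this reference on small shapes such as 16 x 2 and 32 x 64.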

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The PR description is largely incomplete and consists entirely of empty template sections. The "Description" section that should explain the issue and solution is blank, the "Test Coverage" section listing relevant tests is empty, and the PR Checklist items are unchecked with no actual content provided. No actual implementation details, testing information, or justification for the changes appears anywhere in the description body. This fails to meet the repository's requirement for a meaningful description that explains what changes were made and why.	Fill out the PR description with concrete details: provide a brief explanation of the FP4 interleaving optimization and its performance benefits, list the specific test cases or benchmarks that validate the changes (especially for Hopper GPU scenarios), and complete the PR Checklist by reviewing each item against the CODING GUIDELINES. Use the AI summary from the raw_summary as a starting point if needed.
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "MXFP4 x BF16 CUTLASS MoE backend perf and profiling improvement on Hopper" directly aligns with the changes in the pull request. The changeset introduces FP4 interleaving utilities, optimizes CUTLASS kernel scheduling, and modifies MoE quantization paths specifically for Hopper GPU performance. The title accurately captures the domain (MXFP4 x BF16, CUTLASS, MoE, Hopper) and the primary intent of performance optimization. While the title is somewhat high-level and doesn't explicitly mention "interleaving," it provides sufficient clarity for a developer scanning the history to understand the main improvement.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
1393-1402: Fix typo in error message

The message labels the second shape as w3_weight_shard but prints w1_weight_shard.shape twice. Print w3_weight_shard.shape for the second value.
-                f"Invalid shape of w1_weight_shard {w1_weight_shard.shape} and w3_weight_shard {w1_weight_shard.shape}"
+                f"Invalid shape of w1_weight_shard {w1_weight_shard.shape} and w3_weight_shard {w3_weight_shard.shape}"

🧹 Nitpick comments (10)

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

319-330: Consider refactoring repetitive config construction.

The four candidate configurations differ only in their cluster shape parameter. This repetition could be reduced by iterating over the cluster shapes.

Apply this diff to refactor:

-            CutlassGemmConfig candidate(
-                tile_config, mainloop_schedule, epilogue_schedule, ClusterShape::ClusterShape_1x1x1);
-            candidate_configs.push_back(candidate);
-            candidate = CutlassGemmConfig(
-                tile_config, mainloop_schedule, epilogue_schedule, ClusterShape::ClusterShape_2x1x1);
-            candidate_configs.push_back(candidate);
-            candidate = CutlassGemmConfig(
-                tile_config, mainloop_schedule, epilogue_schedule, ClusterShape::ClusterShape_1x2x1);
-            candidate_configs.push_back(candidate);
-            candidate = CutlassGemmConfig(
-                tile_config, mainloop_schedule, epilogue_schedule, ClusterShape::ClusterShape_2x2x1);
-            candidate_configs.push_back(candidate);
+            ClusterShape const cluster_shapes[] = {ClusterShape::ClusterShape_1x1x1,
+                ClusterShape::ClusterShape_2x1x1, ClusterShape::ClusterShape_1x2x1,
+                ClusterShape::ClusterShape_2x2x1};
+            for (auto const cluster_shape : cluster_shapes)
+            {
+                CutlassGemmConfig candidate(tile_config, mainloop_schedule, epilogue_schedule, cluster_shape);
+                candidate_configs.push_back(candidate);
+            }

cpp/tensorrt_llm/thop/weightOnlyQuantOp.cpp (2)

402-425: Propagate CUDA stream and add post-kernel error checks

The kernel is launched from the .cu side on the default stream; propagate the current stream and check errors to avoid silent failures and stream hazards.

Apply these coordinated changes:

In this file (pass stream):
-    interleave_fp4_for_Hopper_mixed_gemm(weight_ptr, weight_interleaved_ptr, n, k);
+    interleave_fp4_for_Hopper_mixed_gemm(
+        weight_ptr, weight_interleaved_ptr, n, k, at::cuda::getCurrentCUDAStream());
In moe_gemm_mixed_utils.h (new signature):
-void interleave_fp4_for_Hopper_mixed_gemm(uint8_t* weight, uint8_t* weight_interleaved, int const rows, int const cols);
+void interleave_fp4_for_Hopper_mixed_gemm(
+    uint8_t* weight, uint8_t* weight_interleaved, int const rows, int const cols, cudaStream_t stream);
In moe_gemm_mixed_utils.cu (use stream + error check):
- interleave_fp4_for_Hopper_mixed_gemm_kernel<<<1024, 1024>>>(weight, weight_interleaved, rows, cols);
+ interleave_fp4_for_Hopper_mixed_gemm_kernel<<<grid, block, 0, stream>>>(weight, weight_interleaved, rows, cols);
+ CUDA_CHECK(cudaGetLastError());
Also compute grid/block from rows/cols instead of fixed 1024x1024.

Please confirm the kernel expects column-major buffers. If it does, clarify or adjust the input layout to avoid interleave mismatches with PyTorch’s row-major tensors.

462-463: Prefer modern Torch operator registration

torch::RegisterOperators is legacy. Consider switching to TORCH_LIBRARY_FRAGMENT(trtllm, m) for forward‑compatibility.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp (1)

102-129: Initialize bf16x8_raw for parity with the non‑interleaved path

Not functionally required, but zero‑init matches the style used above and avoids any tool false‑positives.
-    __nv_bf16x8_storage_t bf16x8_raw;
+    __nv_bf16x8_storage_t bf16x8_raw = {0, 0};

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (1)

903-921: Use CUDA bit-cast helper to avoid strict‑aliasing UB

Replace pointer reinterpret/deref with __uint_as_float (or std::bit_cast if available) to keep the conversion well-defined.

-            uint32_t temp = 0;
-            temp = (temp | *reinterpret_cast<uint8_t*>(&scale_ue8m0)) << 23;
-            return *reinterpret_cast<float*>(&temp);
+            uint32_t temp = static_cast<uint32_t>(*reinterpret_cast<uint8_t*>(&scale_ue8m0)) << 23;
+#if defined(__CUDA_ARCH__)
+            return __uint_as_float(temp);
+#else
+            float out;
+            memcpy(&out, &temp, sizeof(out));
+            return out;
+#endif

Note: include <cstring> (for memcpy) if the fallback path is used on host. As per coding guidelines
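
As a host-only illustration of the bit manipulation in this suggestion, the sketch below shifts an 8-bit exponent into the exponent field of an IEEE-754 binary32 value and copies the bits with `std::memcpy`. The function name `ue8m0_bits_to_float` is hypothetical; on device the diff's `__uint_as_float` would play the role of the `memcpy`.

```cpp
#include <cstdint>
#include <cstring>

// Place the 8 exponent bits into bits 23..30 of a binary32 value. Sign and mantissa
// stay zero, so for exponent values 1..254 the result is 2^(exponent_bits - 127).
inline float ue8m0_bits_to_float(std::uint8_t exponent_bits)
{
    std::uint32_t const bits = static_cast<std::uint32_t>(exponent_bits) << 23;
    float out;
    std::memcpy(&out, &bits, sizeof(out)); // well-defined bit copy, no aliasing violation
    return out;
}
```

For example, an exponent byte of 127 maps to 1.0f and 128 maps to 2.0f.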

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu (5)

22-24: Add const/restrict to improve safety and optimization.

weight is read‑only; both pointers can be __restrict__ to aid codegen. As per coding guidelines.

-__global__ void interleave_fp4_for_Hopper_mixed_gemm_kernel(
-    uint8_t* weight, uint8_t* weight_interleaved, int const rows, int const cols)
+__global__ void interleave_fp4_for_Hopper_mixed_gemm_kernel(
+    uint8_t const* __restrict__ weight,
+    uint8_t* __restrict__ weight_interleaved,
+    int const rows, int const cols)

31-33: Use wider index type for large tensors.

row_id * (cols/2) + col_id can overflow 32‑bit for very large shapes; prefer size_t. As per coding guidelines.

-            int index_a = row_id * cols / 2 + col_id;
-            int index_b = (row_id + 8) * cols / 2 + col_id;
+            size_t const stride = static_cast<size_t>(cols) / 2; // bytes per row (two FP4 values per byte)
+            size_t const index_a = static_cast<size_t>(row_id) * stride + static_cast<size_t>(col_id);
+            size_t const index_b = static_cast<size_t>(row_id + 8) * stride + static_cast<size_t>(col_id);
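
The widened index computation can be exercised on the host. The helper below is a minimal sketch with a hypothetical name (`packed_fp4_index`), and it assumes a 64-bit `size_t`. It promotes each operand before multiplying so the product is never formed in 32-bit arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of packed column col_byte in row `row`, where each row holds cols FP4
// values (cols / 2 bytes). Operands are widened before the multiplication.
inline std::size_t packed_fp4_index(int row, int cols, int col_byte)
{
    assert(row >= 0 && cols >= 0 && cols % 2 == 0 && col_byte >= 0);
    std::size_t const row_bytes = static_cast<std::size_t>(cols) / 2;
    return static_cast<std::size_t>(row) * row_bytes + static_cast<std::size_t>(col_byte);
}
```

With `row = 70000` and `cols = 65536`, the offset is 70000 * 32768 + col_byte, which already exceeds the largest 32-bit signed value (2147483647).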

37-42: Replace magic numbers with named constexprs.

Use named masks/shifts and tile sizes to improve clarity and maintainability. As per coding guidelines.

+// File-scope (anonymous namespace) or before the kernel:
+namespace {
+constexpr int  kHALF_TILE = 8;
+constexpr int  kTILE_ROWS = 16;
+constexpr int  kNIBBLE_SHIFT = 4;
+constexpr uint8_t kHI_NIBBLE = 0xF0u;
+constexpr uint8_t kLO_NIBBLE = 0x0Fu;
+} // anonymous
@@
-            uint8_t fp4_temp_a = (fp4x2_a & 0xF0U) >> 4;
-            uint8_t fp4_temp_b = (fp4x2_b & 0x0FU) << 4;
+            uint8_t const hi_a = static_cast<uint8_t>((fp4x2_a & kHI_NIBBLE) >> kNIBBLE_SHIFT);
+            uint8_t const lo_b = static_cast<uint8_t>((fp4x2_b & kLO_NIBBLE) << kNIBBLE_SHIFT);
@@
-            fp4x2_a = (fp4x2_a & 0x0FU) | fp4_temp_b;
-            fp4x2_b = (fp4x2_b & 0xF0U) | fp4_temp_a;
+            fp4x2_a = static_cast<uint8_t>((fp4x2_a & kLO_NIBBLE) | lo_b);
+            fp4x2_b = static_cast<uint8_t>((fp4x2_b & kHI_NIBBLE) | hi_a);

49-53: Avoid hardcoded launch dims; add stream and error check.

Compute grid/block from problem size, accept a stream (default 0), and check launch status. As per coding guidelines.

-void interleave_fp4_for_Hopper_mixed_gemm(uint8_t* weight, uint8_t* weight_interleaved, int const rows, int const cols)
+void interleave_fp4_for_Hopper_mixed_gemm(uint8_t const* weight,
+    uint8_t* weight_interleaved, int const rows, int const cols, cudaStream_t stream /* = 0 */)
 {
-    // column-major input
-    interleave_fp4_for_Hopper_mixed_gemm_kernel<<<1024, 1024>>>(weight, weight_interleaved, rows, cols);
+    // Preconditions match kernel’s expectations; consider host-side validation too.
+    if ((rows & 0xF) != 0 || (cols & 0x1) != 0)
+    {
+        // Option: return or assert/log based on project conventions.
+        return;
+    }
+
+    int const rowPairs = rows / 2;
+    int const colBytes = cols / 2;
+
+    // Heuristic: cap block to colBytes; prefer multiples of warpSize.
+    int const block_x = colBytes >= 256 ? 256 : (colBytes >= 128 ? 128 : 64);
+    dim3 block(block_x);
+    // Each block handles one rowPair in grid-stride; cap grid to device limit as needed.
+    dim3 grid(std::min(rowPairs, 65535));
+
+    interleave_fp4_for_Hopper_mixed_gemm_kernel<<<grid, block, 0, stream>>>(
+        weight, weight_interleaved, rows, cols);
+    // Error check macro per project convention could replace this.
+    auto const err = cudaPeekAtLastError();
+    if (err != cudaSuccess)
+    {
+        // Option: propagate via return code/logging per project standards.
+    }
 }
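
The launch-dimension heuristic in the diff above can be pulled into a host-only helper so it can be unit tested without a GPU. This is a sketch under the same assumptions as the suggestion (rows a multiple of 16, even cols, and a kernel that walks row pairs in a grid-stride loop). `LaunchDims` and `pick_launch_dims` are hypothetical names, not part of the PR.

```cpp
#include <algorithm>
#include <cassert>

struct LaunchDims
{
    int grid_x;
    int block_x;
};

// Mirrors the heuristic from the review suggestion: the block size follows the number
// of packed bytes per row, and the grid covers row pairs up to a cap of 65535 blocks.
inline LaunchDims pick_launch_dims(int rows, int cols)
{
    assert(rows > 0 && rows % 16 == 0 && cols > 0 && cols % 2 == 0);
    int const row_pairs = rows / 2;
    int const col_bytes = cols / 2;
    int const block_x = col_bytes >= 256 ? 256 : (col_bytes >= 128 ? 128 : 64);
    int const grid_x = std::min(row_pairs, 65535);
    return {grid_x, block_x};
}
```

The capped grid only covers every row pair if the kernel loops with a grid stride; that property would need to be confirmed against the actual kernel.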

17-18: Optional: include <cstdint> if not guaranteed by the header.

If moe_gemm_mixed_utils.h doesn’t include <cstdint>, add it here to define uint8_t. As per coding guidelines.

 #include "moe_gemm_mixed_utils.h"
+// #include <cstdint> // if not included transitively

📜 Review details

Configuration used: Path: .coderabbit.yaml

📥 Commits

Reviewing files that changed from the base of the PR and between b37a8a9 and 5f14619.

📒 Files selected for processing (7)

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp (2 hunks)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (4 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.h (1 hunks)
cpp/tensorrt_llm/thop/weightOnlyQuantOp.cpp (3 hunks)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}