[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) #9288
base: main
Conversation
/bot run --disable-fail-fast
PR_Github #25038 [ run ] triggered by Bot. Commit:
📝 Walkthrough
Comprehensive refactoring of Blackwell GEMM kernel infrastructure: renames runner and kernel classes from "Persistent" to "Contiguous" variants, introduces persistent tile scheduling with finalize fusion support, implements per-class kernel caching, refactors fused MoE paths to use the new single-stage grouped GEMM with finalize operations, and adds post-load weight transformation hooks.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant fused_moe_cute_dsl
participant cute_dsl_custom_ops
participant GroupedGemmSwigluRunner
participant GroupedGemmFinalizeRunner
participant Kernel
User->>fused_moe_cute_dsl: forward(x)
rect rgb(200, 220, 255)
Note over fused_moe_cute_dsl: Stage 1: Swiglu Fusion
fused_moe_cute_dsl->>cute_dsl_custom_ops: grouped_gemm_swiglu_blackwell()
cute_dsl_custom_ops->>GroupedGemmSwigluRunner: forward(inputs, tactic)
GroupedGemmSwigluRunner->>Kernel: Launch with swiglu epilogue
Kernel-->>GroupedGemmSwigluRunner: (x, x_sf)
GroupedGemmSwigluRunner-->>cute_dsl_custom_ops: (x, x_sf)
cute_dsl_custom_ops-->>fused_moe_cute_dsl: (x, x_sf)
end
rect rgb(220, 255, 220)
Note over fused_moe_cute_dsl: Stage 2: Finalize Fusion
fused_moe_cute_dsl->>cute_dsl_custom_ops: grouped_gemm_finalize_blackwell()
cute_dsl_custom_ops->>GroupedGemmFinalizeRunner: forward(inputs, tactic)
GroupedGemmFinalizeRunner->>Kernel: Launch with finalize epilogue<br/>(atomic writes, routing)
Kernel-->>GroupedGemmFinalizeRunner: x_final
GroupedGemmFinalizeRunner-->>cute_dsl_custom_ops: x_final
cute_dsl_custom_ops-->>fused_moe_cute_dsl: x_final
end
fused_moe_cute_dsl-->>User: x_final
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Areas requiring extra attention:
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
1831-1858: CUTEDSL NVFP4 FC1 post-load interleaving matches the test-side layout; consider making `scaling_vector_size` explicit

The CUTEDSL-specific `post_load_weights`:
- Reinterprets `w3_w1_weight` as `float4_e2m1x2` and applies `interleave_linear_and_gate(..., group_size=64, dim=1)`, which matches how the SwiGLU fusion tests build `b_interleaved`.
- Unswizzles, interleaves, then re-swizzles `fc1_weight_block` via `unswizzle_sf` → `interleave_linear_and_gate` → `swizzle_sf`, keeping weight/scales layouts in sync.

Functionally this aligns the Python-side weight layout with what the new grouped GEMM + SwiGLU kernel expects.

To future-proof against a possible non‑16 `scaling_vector_size` for CUTEDSL, you might want to pass `module.scaling_vector_size` explicitly to `unswizzle_sf`/`swizzle_sf` instead of relying on their default, e.g.:

- w3_w1_weight_scale_unswizzled = unswizzle_sf(
-     w3_w1_weight_scale, m, n).view(-1, m,
-     n // module.scaling_vector_size)
+ w3_w1_weight_scale_unswizzled = unswizzle_sf(
+     w3_w1_weight_scale, m, n, module.scaling_vector_size).view(
+     -1, m, n // module.scaling_vector_size)
...
- w3_w1_weight_scale_interleaved = swizzle_sf(
-     w3_w1_weight_scale_unswizzled_interleaved, m,
-     n).view(-1, m, n // module.scaling_vector_size)
+ w3_w1_weight_scale_interleaved = swizzle_sf(
+     w3_w1_weight_scale_unswizzled_interleaved, m, n,
+     module.scaling_vector_size).view(-1, m,
+     n // module.scaling_vector_size)

Not urgent for the current B200/GB200 path, but it would avoid subtle bugs if the vector size ever changes for CUTEDSL.
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (1)
343-425: NVFP4 grouped GEMM tests look sound; tighten minor nits and double-check finalize reference dimensions

The three new NVFP4 Blackwell tests collectively exercise:
- Plain grouped GEMM vs `cute_dsl_nvfp4_grouped_gemm_ref` under realistic routing distributions, ensuring only the first `num_valid_permuted_tokens` are compared.
- Finalize fusion vs a two-step `grouped_gemm_ref` + `moe_unpermute` baseline using `moe_sort`'s tile mappings and `token_final_scales`.
- SwiGLU fusion vs `grouped_gemm_ref` → `swiglu_ref` → `fp4_quantize`, including weight/scales interleaving via `interleave_linear_and_gate` and `swizzle_sf`/`unswizzle_sf`.

A couple of small cleanups and one thing to sanity-check:

Unused `total_num_padded_tokens` from `moe_sort` (RUF059)

In `test_nvfp4_grouped_gemm_finalize_blackwell`, `total_num_padded_tokens` is unpacked but never used. To keep linters quiet and signal intent, consider:

- (
-     tile_idx_to_group_idx,
-     tile_idx_to_mn_limit,
-     expanded_idx_to_permuted_idx,
-     permuted_idx_to_expanded_idx,
-     total_num_padded_tokens,
-     num_non_exiting_tiles,
- ) = torch.ops.trtllm.moe_sort(
+ (
+     tile_idx_to_group_idx,
+     tile_idx_to_mn_limit,
+     expanded_idx_to_permuted_idx,
+     permuted_idx_to_expanded_idx,
+     _total_num_padded_tokens,
+     num_non_exiting_tiles,
+ ) = torch.ops.trtllm.moe_sort(

Unused loop index `i` (B007)

In the expert-loop tile filling (both in `test_nvfp4_grouped_gemm_blackwell` and `test_nvfp4_grouped_gemm_swiglu_blackwell`), `i` isn't used inside the body. Renaming it to `_` makes that explicit and silences the linter:

- for i in range(num_tiles_per_expert[expert_idx].item()):
+ for _ in range(num_tiles_per_expert[expert_idx].item()):
      tile_idx_to_group_idx[tile_idx] = expert_idx
      tile_idx += 1

Dimension semantics in the finalize test

In `test_nvfp4_grouped_gemm_finalize_blackwell`, the reference path uses:
- `b` shaped `(num_local_experts, interm_size, hidden_size)`, and
- `cute_dsl_nvfp4_grouped_gemm_ref` returning a tensor of shape `(max_num_permuted_tokens, interm_size)`,

which is then passed to `moe_unpermute`, yielding `(num_tokens, interm_size)`.

You then compare this `c_ref` against `c` from `cute_dsl_nvfp4_grouped_gemm_finalize_blackwell` elementwise. If the finalize kernel is intended to perform GEMM2 (down-projection) plus unpermute, its natural output shape would be `(num_tokens, hidden_size)` rather than `(num_tokens, interm_size)`. The current shapes imply the kernel might instead be finalizing a GEMM whose output dimension is `interm_size`.

Please double-check that:
- The intended finalize fusion really operates in the `(max_num_permuted_tokens, interm_size)` → `(num_tokens, interm_size)` space, and
- The order of dimensions in `b` matches the kernel's W2 layout (vs. its transpose).

If the expectation is actually `(num_tokens, hidden_size)`, then the reference `b` layout or which dimension you treat as `n` in `grouped_gemm_ref` may need to be adjusted accordingly.

Overall, the tests provide good coverage for the new CuteDSL NVFP4 Blackwell kernels; the above are minor polish and a request to confirm the finalize test's dimensional intent.

Also applies to: 433-524, 532-643
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (4)
345-347: Synthetic token distribution helper is logically sound but can overshoot total tokens.

The `GroupedGemmInputsHelper` machinery (tuning buckets, shape inference, padding with a large `pad_val`, and the pre-hooks for grouped and finalize-fusion runners) is a good way to keep autotuning traffic realistic without depending on actual MoE routing. However, `generate_num_tokens_per_expert()` uses `curr_num_tokens = int(balance) + 1`, which can make `sum(num_tokens_per_expert) > num_tokens * top_k`. That in turn inflates `tile_idx_to_mn_limit` and `permuted_idx_to_expanded_idx` beyond what a perfectly balanced routing would need.

Even though this is only used for tuning and not for real routing (where `moe_sort` provides exact mappings), it's safer and more controllable to cap the per-expert allocation so that the sum never exceeds `num_tokens * top_k` and all derived indices remain within the natural expanded range. For example, you could clamp `curr_num_tokens` and track a remaining budget (a fuller sketch follows below):

- curr_num_tokens = int(balance) + 1
+ curr_num_tokens = max(1, min(int(balance), num_tokens * self.top_k))

and then stop allocating once the global budget is exhausted.

Also applies to: 360-367, 378-421, 423-441, 442-507
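A fuller sketch of the budget-tracked variant described above; the function name and signature are illustrative, not the exact shape of `generate_num_tokens_per_expert()`:

```python
def generate_num_tokens_per_expert_capped(num_experts: int, num_tokens: int,
                                          top_k: int, balance: float) -> list[int]:
    budget = num_tokens * top_k                # total expanded-token budget
    counts: list[int] = []
    for _ in range(num_experts):
        curr = min(int(balance) + 1, budget)   # never allocate past the remaining budget
        counts.append(curr)
        budget -= curr                         # later experts get 0 once exhausted
    return counts
```

This keeps `sum(counts) <= num_tokens * top_k` by construction, so the derived `tile_idx_to_mn_limit` and `permuted_idx_to_expanded_idx` stay within the natural expanded range.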
509-575: Grouped contiguous GEMM runner: API, constraints, and kernel invocation align with kernel expectations.

`Sm100BlockScaledContiguousGroupedGemmRunner` cleanly wraps the contiguous grouped kernel: `get_valid_tactics` feeds `kernel_class.can_implement` with the same dtype/layout/tiling parameters the kernel documents, `get_tuning_config` uses `GroupedGemmInputsHelper` to derive buckets and enforce scale/tiling constraints on args 2 and 5, and `forward` checks all tensor shapes, dtypes, and tile-related invariants before constructing CUTLASS pointers. The `cute.compile`/call of `gemm.wrapper` matches that kernel's wrapper signature (including passing `tile_size`/`scaling_vector_size` as Constexprs and leaving them out of the runtime call), and the class-level `kernel_cache` keyed on `(scaling_vector_size, tile_size, mma_tiler_mn, cluster_shape_mn)` is appropriate for sharing compiled kernels across runner instances.

One small style tweak to consider: `kernel_cache` is a mutable class attribute and could be annotated as `ClassVar[dict]` to satisfy Ruff's RUF012 and make the intent explicit.

Also applies to: 585-595, 623-628, 668-699, 702-717, 719-777
1095-1108: SwiGLU-fusion runner and op: shape checks and kernel invocation look well thought out.

`Sm100BlockScaledContiguousGroupedGemmSwigluFusionRunner` mirrors the grouped runner while adding SwiGLU-specific constraints: it validates `n` against the intermediate size and scaling vector, enforces a stricter `n % (scaling_vector_size * 4 * 2) == 0`, and allocates `c`/`c_sf` with shapes matching the fused kernel's expectations (`m × intermediate/2` output and `m * intermediate / scaling_vector_size` scales). The `get_valid_tactics` and `get_tuning_config` re-use the grouped helper correctly, and the CUTLASS pointer wiring plus `cute.compile`/call of `gemm.wrapper` include the extra `c_sf` and `global_sf` parameters in the right order. The paired custom op with its fake implementation returns tensors of matching shapes and dtypes, so tracing/FX should also see consistent types.

As with other runners, annotating `kernel_cache` and `tuning_config_cache` as `ClassVar` would quiet RUF012 but is not functionally required.

Also applies to: 1116-1179, 1180-1220, 1221-1255, 1256-1322, 1324-1391
35-38: Class-level caches could be annotated as `ClassVar` to satisfy static analysis.

Several runners declare mutable class attributes (`kernel_cache`, `tuning_config_cache`) as bare attributes:
- `CuteDSLNVFP4BlackwellRunner.kernel_cache`
- `Sm100BlockScaledContiguousGroupedGemmRunner.kernel_cache` / `tuning_config_cache`
- `Sm100BlockScaledContiguousGroupedGemmFinalizeFusionRunner.kernel_cache` / `tuning_config_cache`
- `Sm100BlockScaledContiguousGroupedGemmSwigluFusionRunner.kernel_cache` / `tuning_config_cache`
- `Sm100BlockScaledFusedMoERunner.tuning_config_cache`

Ruff's RUF012 warning applies here, but functionally this pattern is fine. If you'd like to silence the linter and document intent, you can annotate them as `ClassVar[dict]` (or a more precise mapping type) from `typing`; a minimal sketch follows below.

Also applies to: 511-513, 783-784, 1098-1100, 1420-1421
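A minimal sketch of the suggested annotation, combined with the config-keyed cache pattern these runners use; the class name and key layout are illustrative rather than the exact production code:

```python
from typing import Any, Callable, ClassVar

class GroupedGemmRunnerSketch:
    # Shared across all runner instances on purpose; ClassVar documents that
    # intent and satisfies Ruff's RUF012.
    kernel_cache: ClassVar[dict[tuple, Any]] = {}
    tuning_config_cache: ClassVar[dict[tuple, Any]] = {}

    def _get_or_compile(self, key: tuple, compile_fn: Callable[[], Any]) -> Any:
        # e.g. key = (scaling_vector_size, tile_size, mma_tiler_mn, cluster_shape_mn)
        if key not in type(self).kernel_cache:
            type(self).kernel_cache[key] = compile_fn()
        return type(self).kernel_cache[key]
```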
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (18 hunks)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (5 hunks)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1 hunks)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (4 hunks)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3 hunks)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (7 hunks)
🧰 Additional context used
🧠 Learnings (15)
📓 Common learnings
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.
Applied to files:
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/quantization.py
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (2)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1)
  - is_power_of_2 (191-192)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (5)
  - can_implement (2149-2231)
  - is_valid_dtypes_and_scale_factor_vec_size (1941-1992)
  - is_valid_layouts (1995-2025)
  - is_valid_mma_tiler_and_cluster_shape (2028-2093)
  - is_valid_tensor_alignment (2096-2146)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (5)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
  - interleave_linear_and_gate (155-166)
- tensorrt_llm/_torch/utils.py (2)
  - swizzle_sf (140-155)
  - unswizzle_sf (158-173)
- cpp/tensorrt_llm/thop/fp4Quantize.cpp (2)
  - fp4_quantize (41-155)
  - fp4_quantize (41-42)
- tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (11)
  - cute_dsl_nvfp4_grouped_gemm_blackwell (722-755)
  - cute_dsl_nvfp4_grouped_gemm_finalize_blackwell (1027-1065)
  - GroupedGemmInputsHelper (335-507)
  - get_max_num_tiles (348-354)
  - get_max_num_permuted_tokens (356-357)
  - _ (319-333)
  - _ (759-777)
  - _ (1069-1093)
  - _ (1365-1391)
  - _ (1582-1602)
  - cute_dsl_nvfp4_grouped_gemm_swiglu_blackwell (1328-1361)
- cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (4)
  - moe_sort (112-125)
  - moe_sort (112-114)
  - moe_unpermute (204-254)
  - moe_unpermute (204-205)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1)
- tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (2)
  - cute_dsl_nvfp4_grouped_gemm_swiglu_blackwell (1328-1361)
  - cute_dsl_nvfp4_grouped_gemm_finalize_blackwell (1027-1065)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (7)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (3)
  - Sm100BlockScaledContiguousGroupedGemmKernel (58-2282)
  - can_implement (2141-2223)
  - wrapper (2226-2282)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (3)
  - Sm100BlockScaledContiguousGroupedGemmFinalizeFusionKernel (215-2313)
  - can_implement (2149-2231)
  - wrapper (2234-2313)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py (3)
  - Sm100BlockScaledContiguousGroupedGemmSwigluFusionKernel (180-2682)
  - can_implement (2525-2607)
  - wrapper (2610-2682)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (3)
  - Sm100BlockScaledPersistentDenseGemmKernel (65-1963)
  - can_implement (1820-1877)
  - wrapper (1881-1963)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (2)
  - make_ptr (142-188)
  - dtype (118-119)
- tensorrt_llm/_torch/autotuner.py (8)
  - TunableRunner (153-209)
  - choose_one (623-778)
  - get_valid_tactics (156-174)
  - OptimizationProfile (127-142)
  - TuningConfig (53-101)
  - forward (180-206)
  - AutoTuner (514-1193)
  - get (545-548)
- tensorrt_llm/_torch/utils.py (2)
  - get_last_power_of_2_num_tokens_buckets (266-273)
  - last_positive_power_of_2 (243-248)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (2)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/custom_pipeline.py (2)
  - PipelineTmaUmma (73-269)
  - PipelineUmmaAsync (273-376)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1)
  - is_power_of_2 (191-192)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
- tensorrt_llm/_torch/utils.py (2)
  - swizzle_sf (140-155)
  - unswizzle_sf (158-173)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (2)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (3)
  - is_power_of_2 (191-192)
  - dtype (118-119)
  - size_in_bytes (95-96)
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (15)
  - _setup_attributes (185-331)
  - _compute_stages (1748-1867)
  - _compute_grid (1870-1902)
  - SharedStorage (523-562)
  - kernel (647-1598)
  - mainloop_s2t_copy_and_partition (603-643)
  - epilog_tmem_copy_and_partition (1600-1661)
  - _get_tma_atom_kind (1905-1930)
  - is_valid_dtypes_and_scale_factor_vec_size (1933-1984)
  - is_valid_layouts (1987-2017)
  - is_valid_mma_tiler_and_cluster_shape (2020-2085)
  - is_valid_tensor_alignment (2088-2138)
  - check_contigous_16B_alignment (2126-2130)
  - can_implement (2141-2223)
  - wrapper (2226-2282)
🪛 Ruff (0.14.5)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
452-452: Unpacked variable total_num_padded_tokens is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
565-565: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
37-37: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
393-393: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
414-414: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
428-428: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
782-782: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
783-783: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
805-807: Avoid specifying long messages outside the exception class
(TRY003)
812-812: Unused method argument: profile
(ARG002)
813-813: Unused method argument: kwargs
(ARG002)
817-817: Ambiguous variable name: l
(E741)
885-885: Ambiguous variable name: l
(E741)
1072-1072: Unused function argument: input_scale
(ARG001)
1073-1073: Unused function argument: weight_scale
(ARG001)
1074-1074: Unused function argument: alpha
(ARG001)
1075-1075: Unused function argument: tile_idx_to_group_idx
(ARG001)
1076-1076: Unused function argument: tile_idx_to_mn_limit
(ARG001)
1077-1077: Unused function argument: permuted_idx_to_expanded_idx
(ARG001)
1078-1078: Unused function argument: num_non_exiting_tiles
(ARG001)
1080-1080: Unused function argument: num_experts
(ARG001)
1081-1081: Unused function argument: top_k
(ARG001)
1082-1082: Unused function argument: num_local_experts
(ARG001)
1083-1083: Unused function argument: local_expert_offset
(ARG001)
1084-1084: Unused function argument: tile_size
(ARG001)
1086-1086: Unused function argument: scaling_vector_size
(ARG001)
1098-1098: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
1099-1099: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
1117-1119: Avoid specifying long messages outside the exception class
(TRY003)
1124-1124: Unused method argument: profile
(ARG002)
1125-1125: Unused method argument: kwargs
(ARG002)
1129-1129: Ambiguous variable name: l
(E741)
1195-1195: Ambiguous variable name: l
(E741)
1369-1369: Unused function argument: weight_scale
(ARG001)
1370-1370: Unused function argument: alpha
(ARG001)
1371-1371: Unused function argument: tile_idx_to_group_idx
(ARG001)
1372-1372: Unused function argument: num_non_exiting_tiles
(ARG001)
1373-1373: Unused function argument: global_sf
(ARG001)
1374-1374: Unused function argument: num_experts
(ARG001)
1375-1375: Unused function argument: top_k
(ARG001)
1376-1376: Unused function argument: num_local_experts
(ARG001)
1377-1377: Unused function argument: local_expert_offset
(ARG001)
1378-1378: Unused function argument: tile_size
(ARG001)
1420-1420: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
1441-1441: Unused method argument: inputs
(ARG002)
1442-1442: Unused method argument: profile
(ARG002)
1443-1443: Unused method argument: kwargs
(ARG002)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
154-154: Unused function argument: loc
(ARG001)
154-154: Unused function argument: ip
(ARG001)
171-171: Unused function argument: loc
(ARG001)
171-171: Unused function argument: ip
(ARG001)
563-563: Avoid specifying long messages outside the exception class
(TRY003)
568-568: Avoid specifying long messages outside the exception class
(TRY003)
838-838: Unused method argument: c_smem_layout_staged
(ARG002)
841-841: Unused method argument: epilogue_op
(ARG002)
864-864: Unpacked variable bidy is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
864-864: Unpacked variable bidz is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1220-1220: Loop control variable k_tile not used within loop body
Rename unused k_tile to _k_tile
(B007)
1427-1427: Loop control variable k_tile not used within loop body
Rename unused k_tile to _k_tile
(B007)
1778-1778: Unused static method argument: epi_tile
(ARG004)
1779-1779: Unused static method argument: out_dtype
(ARG004)
1780-1780: Unused static method argument: gemm_output_layout
(ARG004)
1938-1938: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py (1)
191-192: `is_power_of_2` implementation is correct and idiomatic

The bitwise check with `x > 0` and `x & (x - 1)` is the standard, efficient way to test for positive powers of two. No changes needed; the one-liner is restated below for reference.
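The check as described, restated for readers skimming the diff (the canonical copy lives in `blackwell/utils.py`):

```python
def is_power_of_2(x: int) -> bool:
    # True only for positive powers of two: exactly one bit set.
    return x > 0 and (x & (x - 1)) == 0
```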
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1)
391-423: CuteDSL NVFP4 SwiGLU + finalize wiring looks consistent with the new kernels

The new sequence
- `moe_sort` → `moe_permute`
- `cute_dsl_nvfp4_grouped_gemm_swiglu_blackwell` (returning `x, x_sf`)
- `cute_dsl_nvfp4_grouped_gemm_finalize_blackwell` (consuming `x, x_sf`, tile mappings, and `token_final_scales`)

is consistent with the custom-op signatures and the new tests: dtypes/views (`float4_e2m1fn_x2`/`uint8`) and shapes for `tile_idx_to_group_idx`, `tile_idx_to_mn_limit`, `permuted_idx_to_expanded_idx`, and `token_final_scales` all line up.
self.fc2_input_scaleasglobal_sffor the SwiGLU fused GEMM gives numerically equivalent results to the previous explicit quantize + second GEMM path across your supported configs.tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
155-166: `interleave_linear_and_gate` correctly performs gate/linear interleaving

The reshape + transpose pattern along `dim` implements the desired `[gate[group], linear[group]]` interleaving in fixed `group_size` blocks and preserves the original shape. The modulo-normalized `dim` and the assertion on `group_size * 2` keep misuse guarded. This looks good; a sketch of the pattern follows below.
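A minimal sketch of the reshape + transpose pattern described above, assuming the gate and linear halves are concatenated along `dim`; the real `interleave_linear_and_gate` in `quantization.py` may differ in argument handling and validation:

```python
import torch

def interleave_linear_and_gate_sketch(w: torch.Tensor, group_size: int = 64,
                                      dim: int = 1) -> torch.Tensor:
    dim = dim % w.dim()                      # modulo-normalized dim
    n = w.shape[dim]
    assert n % (group_size * 2) == 0         # gate half + linear half, whole groups
    shape = list(w.shape)
    # View dim as (2 halves, num_groups, group_size), swap the first two axes so
    # each output block becomes [gate[group], linear[group]], then flatten back.
    grouped = shape[:dim] + [2, n // (2 * group_size), group_size] + shape[dim + 1:]
    return w.reshape(grouped).transpose(dim, dim + 1).reshape(shape)
```

For the NVFP4 path this is applied to the packed weight view, with the corresponding scale tensor interleaved by the same helper between `unswizzle_sf` and `swizzle_sf`, as noted in the quantization.py comment above.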
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (2)
6-7: Test imports align with new MoE quantization utilities

Bringing in `interleave_linear_and_gate` plus `swizzle_sf`/`unswizzle_sf` from the fused MoE utilities matches how the kernels expect weights and scales to be laid out and keeps the tests close to the production code paths.
292-302: `fp4_quantize` now consistently uses `sf_vec_size` in the SwiGLU NVFP4 quantize test

Passing `sf_vec_size` into `torch.ops.trtllm.fp4_quantize` keeps the test's reference path in sync with the kernel under test (`moe_swiglu_nvfp4_quantize`) and avoids hard-coding the scaling-vector size.

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py (2)
55-96: Importing `is_power_of_2` and renaming the kernel class are consistent with the new contiguous path.

Using the shared `.utils.is_power_of_2` matches the finalize-fusion helper and avoids duplicating the bit-twiddling logic, and the `Sm100BlockScaledContiguousGroupedGemmKernel` name + docstring/example updates align with the new contiguous/grouped GEMM naming across the PR.

2140-2219: `can_implement` classmethod and wrapper signature line up with runner usage.

The switch of `can_implement` to a `@classmethod` and routing through `cls.is_valid_*` keeps validation logic reusable for subclasses, and the parameter list (including `m_aligned`) matches the calls from `Sm100BlockScaledContiguousGroupedGemmRunner.get_valid_tactics`. The wrapper's `l: int` change also matches how `l` is passed at runtime from the runner.

Also applies to: 2239-2245
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (1)
61-62: Custom pipeline imports, classmethod `can_implement`, and wrapper/cvt_sf helper integrate cleanly.

Switching to `.custom_pipeline.PipelineTmaUmma`/`PipelineUmmaAsync` and to the shared `.utils.is_power_of_2` keeps this kernel aligned with the newer Blackwell infrastructure. The `@classmethod can_implement` signature matches both `run()` and `CuteDSLNVFP4BlackwellRunner.get_valid_tactics`, and the `wrapper`'s argument ordering is consistent with the `cute.compile`/call pattern (with `l`, `max_active_clusters`, and `swap_ab` as Constexprs). The new `cvt_sf_MKL_to_M32x4xrm_K4xrk_L` JIT helper is a straightforward MKL→MMA layout convert and is correctly wired into `create_scale_factor_tensor()`.

Also applies to: 1819-1877, 1881-1900, 1966-1984
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (3)
25-38: NVFP4 dense runner wiring and kernel caching look consistent.

The import of `Sm100BlockScaledPersistentDenseGemmKernel` and its exposure as `CuteDSLNVFP4BlackwellRunner.kernel_class`, the SM100 guard, tactic generation (`can_implement` with fixed `ab`/`sf`/`c` dtypes and `batch_size=1`), and the `kernel_cache` keyed by `(sf_vec_size, mma_tiler_mn, cluster_shape_mn, swap_ab)` all line up with the refactored dense kernel. The `cute.compile`/`compiled_gemm` calls match the `wrapper` signature (with `l`, `max_active_clusters`, and `swap_ab` treated as Constexprs), and the custom op correctly drives the autotuner with `runner.__class__.tuning_config` and then dispatches via the chosen tactic.

Also applies to: 60-64, 65-139, 171-188, 201-215, 216-270, 272-286, 292-317
779-847: Finalize-fusion grouped runner and custom op are consistent with the finalize kernel contract.
`Sm100BlockScaledContiguousGroupedGemmFinalizeFusionRunner` follows the same pattern as the grouped runner: it gates on SM100 in `__init__`, uses `kernel_class.can_implement` with `out_dtype=cutlass.BFloat16`, and its `get_tuning_config` indexes (`ConstraintSpec` on args 2, 5–7, 9) match the finalize custom op's argument ordering. In `forward`, the assertions on `a`/`b`/`a_sf`/`b_sf`/`alpha`, tile mappings, `permuted_idx_to_expanded_idx` length, and `token_final_scales` shape ensure we don't feed invalid shapes to the kernel; the zero-initialized `c` matches the "scatter-add into final output" semantics of the finalize epilogue. The pointer construction and `cute.compile`/call of `gemm.wrapper` are in the correct order for the kernel's wrapper, and the `kernel_cache`/`tuning_config_cache` usage parallels the grouped runner.

Given the earlier learning to avoid repeating finalize-fusion safety assertions, the current level of validation here feels sufficient.
Also applies to: 851-869, 870-922, 923-953, 957-1001
1393-1417: Fused MoE helper and runner are wired correctly for the new grouped/fusion ops.
`FusedMoEInputsHelper.inputs_pre_hook` intentionally randomizes `token_selected_experts`/`token_final_scales` for tuning, while preserving shapes and dtypes, and `Sm100BlockScaledFusedMoERunner` uses this plus the existing `moe_sort`/`moe_permute`/`moe_unpermute` ops to drive two grouped GEMMs and an intermediate SwiGLU quantize in sequence. The runner's `get_valid_tactics` returning a single tile size (128) simplifies tuning for now, and `get_tuning_config`'s `DynamicTensorSpec` and constraint specs on inputs 1–3 line up with the fused custom op's argument list. Overall, the fused MoE path correctly reuses the new grouped GEMM custom ops while keeping the public interface unchanged.

Also applies to: 1419-1465
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1)
1646-1681: Dismiss the barrier placement concern — this pattern is standard across related kernels.

After examining the barrier placement across related GEMM kernels (base `blockscaled_contiguous_grouped_gemm.py` and `blockscaled_contiguous_grouped_gemm_swiglu_fusion.py`), the `epilog_sync_barrier.arrive_and_wait()` call at line 1681 inside the subtile loop is intentional and consistent with established patterns, not an anomaly.

All three kernels follow the same epilogue structure:
- Inside subtile loop: barrier after per-subtile operations (shared memory stores, atomic adds) to synchronize warps before proceeding to the next subtile
- Inside loop but separate: additional barriers after pipeline operations
- Outside loop: barrier after TMEM cleanup
The per-subtile synchronization ensures correctness by guaranteeing that atomic operations (which span multiple warps) complete before advancing to the next subtile. This is neither a performance bug nor redundant with the outer barrier, which serves a different synchronization point.
PR_Github #25038 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25107 [ run ] triggered by Bot. Commit:
PR_Github #25107 [ run ] completed with state
      tile_tokens_dim=tile_size,
  )
- x = torch.ops.trtllm.cute_dsl_nvfp4_grouped_gemm_blackwell(
+ x = torch.ops.trtllm.cute_dsl_nvfp4_grouped_gemm_finalize_blackwell(
We should respect `moe_disable_finalize_fusion` to ensure we can handle cases where determinism is required.
Thanks for the reminder, let me update.
Updated, please take a look. Thanks
Signed-off-by: Enwei Zhu <[email protected]>
c34bbca to 2b17db2
[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion)
Description
This PR integrates two grouped GEMM fusions for the CuteDSL MoE backend:
- SwiGLU fusion for the first grouped GEMM
- Finalize fusion for the second grouped GEMM

It supports NVFP4 on B200/GB200.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill

Kill all running builds associated with the pull request.
skip
skip --comment COMMENT

Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
New Features
Performance Improvements
Tests