
Conversation

@nvchenghaoz (Collaborator) commented Oct 26, 2025

Summary by CodeRabbit

  • Tests
    • Extended mixture-of-experts testing to cover additional routing and token distribution scenarios.
  • Performance
    • Optimized tensor operations in Mamba model inference through streamlined data handling.

Signed-off-by: nvchenghaoz <[email protected]>
@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22521 [ run ] triggered by Bot. Commit: c5bbc31

@coderabbitai bot (Contributor) commented Oct 26, 2025

📝 Walkthrough

The changes refactor the Mamba Triton backend to replace index-based tensor selections with direct slicing operations for both prefill and decode stages, removing the prefill_idx construct. Additionally, test cases for MoE Triton kernels are extended with early_exit parameterization to cover both balanced and imbalanced routing scenarios.

Changes

Cohort / File(s) — Summary

  • Mamba Backend Optimization — tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
    Replaced index-based selections with direct tensor slicing for prefill and decode data construction. Removed prefill_idx usage and simplified the derivation of hs_prefill, B_prefill, C_prefill, dt_prefill, x_decode, B_decode, C_decode, and dt_decode. Updated output assignment for both prefill and decode stages to use slice assignment instead of index_copy_ (see the sketch below).
  • MoE Test Parameterization — tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_moe.py
    Added early_exit parameterization (False, True) to test_triton_moe_matches_torch_moe_mlp_relu2 and test_triton_quant_fp8_moe_matches_torch_quant_fp8_moe. Extended the tests to conditionally vary the M parameter and apply either random top-k or imbalanced routing (75% token concentration on the first two experts) depending on the early_exit flag.
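
To make the refactor concrete, here is a minimal sketch of the pattern involved, assuming prefill tokens are packed contiguously at the front of the flattened token tensor. The tensor names and shapes are hypothetical stand-ins, not the backend's actual code:

    import torch

    # Hypothetical shapes: hs_flat is [total_tokens, H, D]; the first
    # total_prefill_tokens rows belong to prefill requests, the rest to decode.
    total_tokens, H, D = 10, 4, 8
    total_prefill_tokens = 6
    hs_flat = torch.randn(total_tokens, H, D)
    y = torch.empty_like(hs_flat)

    # Before: gather and scatter through an explicit index tensor.
    prefill_idx = torch.arange(total_prefill_tokens)
    hs_prefill = hs_flat.index_select(0, prefill_idx).unsqueeze(0)  # [1, S_p, H, D]
    y.index_copy_(0, prefill_idx, hs_prefill.squeeze(0))

    # After: a plain slice returns a view (no copy), and slice assignment
    # replaces index_copy_; decode tokens are simply the remaining rows.
    hs_prefill = hs_flat[:total_prefill_tokens].unsqueeze(0)  # [1, S_p, H, D]
    y[:total_prefill_tokens] = hs_prefill.squeeze(0)
    x_decode = hs_flat[total_prefill_tokens:]

Because the prefill/decode split is a contiguous partition, the index tensor carried no information beyond a boundary offset, which is what makes the slice-based form both simpler and cheaper.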

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py: Requires careful verification that the direct slicing semantics correctly replace index-based operations across prefill and decode paths without altering tensor shapes or data ordering.
  • test file: Verify that the early_exit branching produces the intended token distributions and routing imbalance in both test cases.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description Check — ⚠️ Warning. The pull request description is incomplete and does not follow the required template. The author provided only "@coderabbitai summary," which is a command invoking CodeRabbit's AI rather than an actual description. The required template specifies sections for Description (explaining the issue and solution), Test Coverage (listing relevant tests), and a PR Checklist confirming various requirements; none of these sections has been filled out with substantive information about what was changed and why.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)

  • Title Check — ✅ Passed. The pull request title "[None][feat] Autodeploy: Update the ssm to use slice" directly relates to the main changes in the pull request. The primary modification refactors the Mamba Triton backend to replace index-based selections with direct slicing operations for both prefill and decode paths, which is accurately captured by the phrase "use slice." The title follows the required format, with "[None]" indicating no ticket reference and "[feat]" denoting a feature change. It is concise and specific enough for a developer scanning history to understand the core refactoring, and it does not include extraneous information.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (1)

137-137: Consider computing total_prefill_tokens once to avoid redundancy.

The value is recomputed here because it was previously computed inside the if num_prefill > 0 block (line 82) and isn't in scope. Consider computing total_prefill_tokens before both conditional blocks to avoid the redundant calculation.

Apply this diff to eliminate redundant computation:

     # Prefill: concatenate tokens at the front and run combined scan
+    total_prefill_tokens = 0 if num_prefill == 0 else int(seq_len[:num_prefill].sum().item())
+
     if num_prefill > 0:
         seq_len_prefill = seq_len[:num_prefill].to(torch.int32)
-        total_prefill_tokens = int(seq_len_prefill.sum().item())

         hs_prefill = hs_flat[:total_prefill_tokens].unsqueeze(0)  # [1, S_p, H, D]

And remove the recomputation at line 137:

     # Decode: batch single-token updates via selective_state_update
     if num_decode > 0:
-        total_prefill_tokens = 0 if num_prefill == 0 else int(seq_len[:num_prefill].sum().item())
         slot_idx_decode = slot_idx[num_prefill:].to(torch.long)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6d20f6 and c5bbc31.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (3 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_moe.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_moe.py (2)

78-125: LGTM! Good test coverage enhancement.

The parameterization of early_exit to test both balanced and imbalanced routing scenarios is well-designed. The imbalanced routing (concentrating 75% of tokens on first 2 experts) will help validate the MoE kernel's behavior under skewed load distribution.
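
As a rough illustration of how such an imbalanced routing setup can be constructed (a sketch with hypothetical names and values, not the test's actual code):

    import torch

    # Hypothetical test parameters; the real test's values may differ.
    M, num_experts, top_k = 64, 8, 2
    early_exit = True

    logits = torch.randn(M, num_experts)
    if early_exit:
        # Imbalanced routing: bias the logits so the first two experts
        # dominate for ~75% of the tokens; the rest stay random.
        cutoff = int(0.75 * M)
        logits[:cutoff, :2] += 10.0

    # Standard top-k routing: pick the k highest-probability experts
    # per token and renormalize their weights.
    probs = torch.softmax(logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

Concentrating tokens on a couple of experts leaves the others nearly empty, which is the kind of skewed per-expert load an early-exit branch in the kernel must handle correctly.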


237-348: LGTM! Consistent test parameterization for FP8 quantized MoE.

The parameterization follows the same sound pattern as the BF16 test, appropriately adjusted for larger token counts in the FP8 test. The routing logic correctly implements both balanced and imbalanced scenarios.

@tensorrt-cicd (Collaborator)

PR_Github #22521 [ run ] completed with state SUCCESS. Commit: c5bbc31
/LLM/main/L0_MergeRequest_PR pipeline #16977 completed with status: 'FAILURE'

@nvchenghaoz (Collaborator, Author)

/bot run

2 similar comments
@nvchenghaoz (Collaborator, Author)

/bot run

@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22568 [ run ] triggered by Bot. Commit: c5bbc31

@tensorrt-cicd (Collaborator)

PR_Github #22568 [ run ] completed with state SUCCESS. Commit: c5bbc31
/LLM/main/L0_MergeRequest_PR pipeline #17012 completed with status: 'SUCCESS'


Labels: none yet. Projects: Backlog. 3 participants.