
Conversation

@yizhang-nv
Member

@yizhang-nv yizhang-nv commented Oct 28, 2025

This PR disables torch.compile for the cat and copy operations inside the attention custom op while the piecewise CUDA graph is running. Iterations that run under piecewise CUDA graph mode are very likely host-bound, so disabling compilation there improves TTFT.

Summary by CodeRabbit

  • New Features
    • Added support for piecewise execution mode with improved runner lifecycle management.
    • Implemented conditional torch.compile support that enables or disables compilation based on the current execution mode, improving performance across different inference scenarios.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yizhang-nv yizhang-nv requested review from a team as code owners October 28, 2025 03:00
@yizhang-nv yizhang-nv requested review from hyukn, kaiyux, liji-nv, yuantailing and yuxianq and removed request for hyukn and yuxianq October 28, 2025 03:00
@coderabbitai
Contributor

coderabbitai bot commented Oct 28, 2025

📝 Walkthrough

Walkthrough

This pull request introduces a piecewise runner orchestration system with conditional torch.compile execution. It adds runner lifecycle tracking to PiecewiseInterpreter and PiecewiseRunner, implements first/last runner signaling via set_piecewise_running, and conditionally bypasses torch.compile during piecewise execution in attention modules.

Changes

Cohort / File(s) / Summary

  • Global state management / tensorrt_llm/_torch/utils.py: Added a global flag is_piecewise_running_flag with getter/setter functions set_piecewise_running(enable: bool) and is_piecewise_running() -> bool to manage piecewise execution state.
  • Runner orchestration / tensorrt_llm/_torch/compilation/piecewise_optimizer.py: Enhanced PiecewiseInterpreter with piecewise_runner_num and piecewise_runner_idx tracking; updated PiecewiseRunner with is_first_runner and is_last_runner state; added first/last runner signaling via set_piecewise_running() calls; imported set_piecewise_running from utils; propagated the runner flags through the constructor and call flow.
  • Conditional compilation / tensorrt_llm/_torch/modules/attention.py: Introduced a maybe_compile() wrapper that conditionally applies torch.compile based on is_piecewise_running(); replaced direct compiled helper usages with conditional variants (compiled_copy_ → maybe_compiled_copy_, compiled_cat → maybe_compiled_cat) across multiple forward methods; added the is_piecewise_running import. A simplified sketch of this pattern follows below.
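
The sketch below shows the general shape of this pattern: a module-level flag with a getter/setter (as attributed to utils.py above) and a maybe_compile wrapper that falls back to the eager function while piecewise execution is active. It is a simplified illustration, not the exact code from the PR; the real helpers live in tensorrt_llm/_torch/utils.py and tensorrt_llm/_torch/modules/attention.py and may differ in details.

import torch

# Module-level flag plus accessors, mirroring the utils.py additions described above.
is_piecewise_running_flag = False


def set_piecewise_running(enable: bool) -> None:
    global is_piecewise_running_flag
    is_piecewise_running_flag = enable


def is_piecewise_running() -> bool:
    return is_piecewise_running_flag


def maybe_compile(func):
    # torch.compile is lazy, so creating the compiled variant up front costs
    # nothing until it is actually called.
    compiled_func = torch.compile(func)

    def wrapper(*args, **kwargs):
        if is_piecewise_running():
            # Piecewise CUDA-graph iterations are likely host-bound: run the
            # plain eager function to avoid compiled-dispatch overhead.
            return func(*args, **kwargs)
        return compiled_func(*args, **kwargs)

    return wrapper


@maybe_compile
def maybe_compiled_copy_(dst, src):
    dst.copy_(src)


@maybe_compile
def maybe_compiled_cat(tensors, dim=0):
    return torch.cat(tensors, dim=dim)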

Sequence Diagram(s)

sequenceDiagram
    participant Optimizer as piecewise_optimizer
    participant Interpreter as PiecewiseInterpreter
    participant Runner as PiecewiseRunner
    participant Utils as utils

    Optimizer->>Interpreter: __init__(piecewise_runner_num)
    Interpreter->>Interpreter: Initialize piecewise_runner_idx = 0

    loop For each module
        Interpreter->>Runner: __init__(..., is_first_runner, is_last_runner)
        Runner->>Runner: Store is_first_runner, is_last_runner
        
        alt is_first_runner or is_last_runner
            Runner->>Utils: set_piecewise_running(True)
        end
        
        Interpreter->>Interpreter: piecewise_runner_idx += 1
    end

    Note over Interpreter,Runner: Runner lifecycle complete
sequenceDiagram
    participant Module as attention.forward
    participant Compile as maybe_compile wrapper
    participant Utils as is_piecewise_running()
    participant Torch as torch.compile

    Module->>Compile: maybe_compiled_copy_(dst, src)
    Compile->>Utils: is_piecewise_running()?
    
    alt Piecewise running (True)
        Compile-->>Module: Return original function
        Module->>Module: Execute uncompiled
    else Normal execution (False)
        Compile->>Torch: Apply torch.compile
        Torch-->>Compile: Return compiled variant
        Compile-->>Module: Return compiled function
        Module->>Module: Execute compiled
    end
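
As a companion to the diagram, the hypothetical runner below shows one way the first and last runners could bracket the piecewise region by toggling the flag. The real PiecewiseRunner in the PR takes many more constructor arguments and may place the set_piecewise_running calls differently; only the signaling idea is shown.

from tensorrt_llm._torch.utils import set_piecewise_running  # helper added by this PR


class PiecewiseRunnerSketch:
    """Hypothetical, stripped-down runner; only the flag signaling is illustrated."""

    def __init__(self, default_callable, is_first_runner: bool, is_last_runner: bool):
        self.default_callable = default_callable
        self.is_first_runner = is_first_runner
        self.is_last_runner = is_last_runner

    def __call__(self, *args):
        if self.is_first_runner:
            set_piecewise_running(True)  # entering the piecewise region
        try:
            return self.default_callable(*args)
        finally:
            if self.is_last_runner:
                set_piecewise_running(False)  # leaving the piecewise region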

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • piecewise_optimizer.py: Verify runner indexing logic, state initialization order, and correct threading of is_first_runner/is_last_runner flags through constructors.
  • attention.py: Confirm all compiled_cat and compiled_copy_ call sites are properly replaced; validate that conditional compilation doesn't affect functional correctness.
  • utils.py: Confirm thread-safety of global flag if applicable; validate getter/setter interface semantics.
  • Cross-file integration: Verify that set_piecewise_running() calls align with runner lifecycle expectations and that is_piecewise_running() checks in attention modules correctly bypass compilation during piecewise execution.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check (⚠️ Warning): The PR description is largely incomplete and does not follow the required template. The author's brief explanation of the change ("disables torch.compile for cat and copy inside attention custom op when piecewise cuda graph is running") sits outside the template sections rather than in the "Description" section, the "Test Coverage" section is empty, and the PR Checklist is mostly unchecked with only the final checkbox marked. Resolution: move the explanation into the "Description" section, list the tests that validate the conditional torch.compile logic under "Test Coverage", and complete the checklist items covering coding guidelines, test coverage, and documentation.

✅ Passed checks (1 passed)

  • Title Check (✅ Passed): The PR title "[https://nvbugs/5550409][fix] Disable torch compile in piecewise attention part to Avoid host overhead" matches the main functional change: the attention.py modifications conditionally disable torch.compile when piecewise execution is active, and the supporting changes to the piecewise optimizer and utilities are necessary enablers rather than deviations. The title conveys what a teammate needs to understand the core change.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/utils.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025) to comply with repo policy.
Please prepend the standard NVIDIA Apache-2.0 copyright header with year 2025.

As per coding guidelines.

tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2)

1-1: Add NVIDIA Apache-2.0 header (2025).
Required by project policy.

As per coding guidelines.


2-2: Python 3.8+ typing: replace 3.10+ syntax and fix tuple/optional types.
The project targets Python 3.8+. Current hints (int | torch.SymInt, list[int], tuple[…], tuple[GraphModule, int]) won’t parse; also runtime_num_tokens_idx is a pair or None.

Apply:

-from typing import Callable, List, Optional, Sequence, Union
+from typing import Callable, List, Optional, Sequence, Tuple, Union

 class PiecewiseInterpreter(Interpreter):
     def __init__(
         self,
         module: GraphModule,
         enable_inductor: bool,
-        compile_time_num_tokens: Union[int | torch.SymInt],
-        capture_num_tokens: list[int],
-        exclude_modules_id: list[int],
+        compile_time_num_tokens: Union[int, torch.SymInt],
+        capture_num_tokens: List[int],
+        exclude_modules_id: List[int],
         piecewise_runner_num: int,
-        graph_pool_handle: tuple[int, int],
+        graph_pool_handle: Tuple[int, int],
         ...
     ):

 class PiecewiseRunner(object):
     def __init__(
         self,
         graph: GraphModule,
         name: str,
-        compile_time_num_tokens: Union[int | torch.SymInt],
-        runtime_num_tokens_idx: tuple[int],
+        compile_time_num_tokens: Union[int, torch.SymInt],
+        runtime_num_tokens_idx: Optional[Tuple[int, int]],
         capture_num_tokens: List[int],
         graph_pool_handle,
         default_callable: Callable,
         enable_inductor: bool,
         is_first_runner: bool,
         is_last_runner: bool,
     ):

 def piecewise_optimizer(
     gm: GraphModule,
     example_inputs: List[torch.Tensor],
     enable_inductor: bool,
-    input_num_tokens: Union[int | torch.SymInt],
+    input_num_tokens: Union[int, torch.SymInt],
     capture_num_tokens: Sequence[int],
-    graph_pool_handle: tuple[int, int],
+    graph_pool_handle: Tuple[int, int],
     max_num_streams: int = 1,
-) -> tuple[GraphModule, int]:
+) -> Tuple[GraphModule, int]:

Optional cleanups: prefer “is not None” over “!= None”.

Also applies to: 24-36, 123-135, 235-244, 243-244

tensorrt_llm/_torch/modules/attention.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).
Please add the standard header at file top.

As per coding guidelines.

🧹 Nitpick comments (2)
tensorrt_llm/_torch/utils.py (1)

15-15: Naming: consider G_IS_PIECEWISE_RUNNING for globals per style.
If you keep any process-global flags, prefer G_UPPER_SNAKE_CASE.

As per coding guidelines.

tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)

136-167: None checks and small nits.
Use “is not None” for None checks; rename runtime_num_of_token -> num_tokens_at_runtime for clarity.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a02f5f and ca3f47b.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py (8 hunks)
  • tensorrt_llm/_torch/modules/attention.py (6 hunks)
  • tensorrt_llm/_torch/utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/modules/attention.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)
tensorrt_llm/_torch/utils.py (3)
  • get_piecewise_cuda_graph_flag (278-280)
  • make_weak_ref (91-106)
  • set_piecewise_running (44-46)
tensorrt_llm/_torch/modules/attention.py (1)
tensorrt_llm/_torch/utils.py (2)
  • is_piecewise_running (49-51)
  • is_torch_compiling (39-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

1231-1235: Verification confirms all old compiled_* references have been successfully migrated to maybe_compiled_* wrappers.

The ripgrep search found only the new wrapper function definitions (line 87: maybe_compiled_copy_, line 92: maybe_compiled_cat) and their usages at lines 1232, 1328, 1423, and 1481—all correctly using the maybe_compiled_* prefix. No legacy compiled_copy_ or compiled_cat calls remain.

@yizhang-nv yizhang-nv force-pushed the disable-compile-in-piecewise branch from 17bcf22 to 05fff03 Compare October 28, 2025 06:09
@yizhang-nv
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #22736 [ run ] triggered by Bot. Commit: 05fff03

@tensorrt-cicd
Collaborator

PR_Github #22736 [ run ] completed with state SUCCESS. Commit: 05fff03
/LLM/main/L0_MergeRequest_PR pipeline #17144 completed with status: 'SUCCESS'

Collaborator

@QiJune QiJune left a comment


LGTM

@yizhang-nv yizhang-nv changed the title [https://nvbugs/5550409][fix]Disable torch compile in piecewise attention part to Avoid host overhead [https://nvbugs/5550409][fix] Disable torch compile in piecewise attention part to Avoid host overhead Oct 29, 2025
Member

@yuantailing yuantailing left a comment


Approved according to the following discussion:

Q: Are there any pitfalls with piecewise_runner_idx? For example, could it happen that a larger idx gets called first? Or could the number of call_module calls not equal len(set(node_to_graph_id.values())) - len(exclude_modules_id)?

A: This won't happen; the piecewise runner indices are guaranteed to be assigned in order. len(set(node_to_graph_id.values())) gives the total number of subgraphs the graph was split into, and len(exclude_modules_id) gives the number of subgraphs that run in eager mode.
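
A small runnable illustration of that counting argument is shown below; the node-to-graph mapping and exclusion list are made-up example values, and the first/last flag derivation is a hypothetical reconstruction rather than the PR's exact code.

# Example data only; not values from the PR.
node_to_graph_id = {"node_a": 0, "node_b": 0, "node_c": 1, "node_d": 2}
exclude_modules_id = [2]  # subgraphs that stay in eager mode

# Total split subgraphs minus eager-mode subgraphs gives the runner count.
piecewise_runner_num = len(set(node_to_graph_id.values())) - len(exclude_modules_id)

piecewise_runner_idx = 0
for graph_id in sorted(set(node_to_graph_id.values())):
    if graph_id in exclude_modules_id:
        continue  # no PiecewiseRunner is created for eager subgraphs
    is_first_runner = piecewise_runner_idx == 0
    is_last_runner = piecewise_runner_idx == piecewise_runner_num - 1
    print(f"graph {graph_id}: first={is_first_runner}, last={is_last_runner}")
    piecewise_runner_idx += 1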

@yizhang-nv yizhang-nv enabled auto-merge (squash) October 29, 2025 10:12
@yizhang-nv yizhang-nv merged commit a69bd2a into NVIDIA:main Oct 29, 2025
6 of 13 checks passed
