
Conversation

@yuxianq (Collaborator) commented Nov 21, 2025

Summary by CodeRabbit

Release Notes

  • Refactor
    • Optimized distributed pipeline-parallel communication backend for improved performance.
    • Enhanced execution request synchronization ordering in multi-device pipelines.
    • Improved profiling instrumentation for distributed execution tracing and diagnostics.
    • Refined internal API organization for distributed communication module.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; without careful validation it can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; without careful validation it can break the top of tree.

@yuxianq yuxianq requested review from QiJune, jiaganc and kaiyux November 21, 2025 01:54
@yuxianq yuxianq requested review from a team as code owners November 21, 2025 01:54
@yuxianq yuxianq requested a review from nv-yilinf November 21, 2025 01:54
@yuxianq (Collaborator, Author) commented Nov 21, 2025

/bot run --disable-fail-fast

@coderabbitai (bot) commented Nov 21, 2025

📝 Walkthrough

These changes refactor the distributed pipeline-parallel communication layer by replacing the PPComm class with PPCommNCCL, removing PPComm from the public API, reordering synchronization in request broadcasting, and standardizing NVTX profiling instrumentation with context managers instead of manual push/pop calls.

Changes

Cohort / File(s) — Summary

  • Pipeline-parallel communication refactor — tensorrt_llm/_torch/distributed/__init__.py, tensorrt_llm/_torch/distributed/communicator.py: Removed PPComm from public exports; replaced the PPComm implementation with PPCommNCCL; updated the init_pp_comm factory to instantiate PPCommNCCL under MPI-enabled paths.
  • Request broadcast synchronization — tensorrt_llm/_torch/pyexecutor/executor_request_queue.py: Relocated the wait on send_requests_handler from inside the send block to before the send path, preserving synchronization semantics with revised ordering.
  • NVTX instrumentation standardization — tensorrt_llm/_torch/pyexecutor/py_executor.py: Replaced manual nvtx.range_push/pop pairs with nvtx_range context managers for sample-state handling; added an nvtx_range decorator to wait_on_pp_send_handles; removed the decorator from _fetch_and_activate_new_requests.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant init_pp_comm
    participant PPCommNCCL
    
    Caller->>init_pp_comm: init_pp_comm(mapping)
    alt MPI enabled
        init_pp_comm->>PPCommNCCL: create PPCommNCCL(mapping)
        PPCommNCCL-->>init_pp_comm: instance
    else MPI not enabled
        init_pp_comm-->>Caller: None
    end
    init_pp_comm-->>Caller: _pp_comm = PPCommNCCL instance
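The factory flow in the diagram above can be sketched as below. Names mirror the diagram rather than the exact TensorRT-LLM signatures, and the `mpi_enabled` flag is a simplification of the real MPI-path check.

```python
# Hypothetical sketch of the init_pp_comm factory described in the diagram.
_pp_comm = None

class PPCommNCCL:
    """Placeholder for the NCCL-backed pipeline-parallel communicator."""
    def __init__(self, mapping):
        self.mapping = mapping

def init_pp_comm(mapping, mpi_enabled=True):
    """Create the PP communicator only on MPI-enabled paths; otherwise None."""
    global _pp_comm
    _pp_comm = PPCommNCCL(mapping) if mpi_enabled else None
    return _pp_comm

comm = init_pp_comm({"pp_rank": 0}, mpi_enabled=True)
assert isinstance(comm, PPCommNCCL)
assert init_pp_comm({"pp_rank": 0}, mpi_enabled=False) is None
```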
sequenceDiagram
    participant Code
    participant send_requests_handler
    participant send_to_next_pp
    
    Note over Code: Old flow
    Code->>send_to_next_pp: enter send block
    send_to_next_pp->>send_requests_handler: wait() inside send
    send_requests_handler-->>send_to_next_pp: ready
    send_to_next_pp->>send_to_next_pp: isend_object()
    
    Note over Code: New flow
    Code->>send_requests_handler: wait() before send block
    send_requests_handler-->>Code: ready
    Code->>send_to_next_pp: enter send block
    send_to_next_pp->>send_to_next_pp: isend_object()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • PPCommNCCL implementation: Verify that PPCommNCCL properly replaces PPComm's interface and behavior for pipeline-parallel communication
  • Synchronization ordering change: Confirm that moving the wait operation before the send block does not introduce race conditions or deadlocks with the isend_object call
  • Public API removal: Ensure PPComm removal from __all__ doesn't break downstream dependencies or imports
  • NVTX instrumentation shifts: Verify that context manager replacements maintain identical profiling scope boundaries as the original push/pop pairs

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning — The PR description is essentially the template with no substantive content filled in; required sections like 'Description' and 'Test Coverage' are empty. Resolution: fill in the Description section explaining what changes were made and why, and the Test Coverage section listing the relevant tests that validate the changes.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)

  • Title check ✅ Passed — The title accurately summarizes the main change: reducing nested NVTX ranges across multiple files in the profiling instrumentation.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1379cfa and d2fd44b.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/__init__.py (1 hunks)
  • tensorrt_llm/_torch/distributed/communicator.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • tensorrt_llm/_torch/distributed/communicator.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.

Applied to files:

  • tensorrt_llm/_torch/distributed/communicator.py
  • tensorrt_llm/_torch/distributed/__init__.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • tensorrt_llm/_torch/distributed/communicator.py
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • tensorrt_llm/_torch/distributed/communicator.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • tensorrt_llm/_torch/distributed/communicator.py
  • tensorrt_llm/_torch/distributed/__init__.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/__init__.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/distributed/__init__.py (1)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • Distributed (33-116)
  • MPIDist (339-412)
  • TorchDist (432-766)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)
tensorrt_llm/_utils.py (1)
  • nvtx_range (879-898)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • wait (424-429)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
tensorrt_llm/_utils.py (1)
  • nvtx_range (879-898)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • prev_pp_rank (95-96)
tensorrt_llm/mapping.py (1)
  • prev_pp_rank (261-265)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/distributed/communicator.py (2)

816-822: LGTM! Clean switch to NCCL-backed implementation.

The initialization logic correctly chooses between PPCommTorch (for Ray/TorchDist orchestrator) and PPCommNCCL (for MPI-enabled paths with NCCL backend). This aligns with the PR objective to replace the PP communication backend with NCCL.


769-787: NcclCommunicatorOp is properly defined and available.

The class is defined in cpp/tensorrt_llm/thop/ncclCommunicatorOp.h and registered with PyTorch via torch::jit::class_. The interface matches the expected pattern: constructor takes int64_t worldSize and int64_t rank, with send(th::Tensor, int64_t) and recv(th::Tensor&, int64_t) methods. This matches the usage pattern already established in tensorrt_llm/runtime/generation.py and tensorrt_llm/runtime/enc_dec_model_runner.py.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

610-616: LGTM! Correct synchronization ordering.

Moving the wait on send_requests_handler to before the new isend_object call ensures that the previous asynchronous send completes before reusing the handle. This prevents potential race conditions and aligns with the PR's goal of refining inter-PP communication synchronization.

The added NVTX range provides better profiling visibility for this synchronization point.
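The reordering this comment approves can be modeled with a small sketch. `FakePPComm` and `FakeHandle` are stand-ins, not the real classes; the point is only that a handle from the previous asynchronous send must complete before its slot is reused by the next `isend_object`.

```python
class FakeHandle:
    """Stand-in for an async-send handle with a blocking wait()."""
    def __init__(self, log, tag):
        self.log, self.tag = log, tag
    def wait(self):
        self.log.append(f"wait:{self.tag}")

class FakePPComm:
    """Stand-in communicator that logs the order of operations."""
    def __init__(self):
        self.log = []
        self._n = 0
    def isend_object(self, obj):
        self._n += 1
        self.log.append(f"isend:{self._n}")
        return FakeHandle(self.log, self._n)

comm = FakePPComm()
handle = None
for requests in (["req1"], ["req2"]):
    # New flow: wait on the previous send *before* entering the send block,
    # so the handle is never overwritten while a transfer is still in flight.
    if handle is not None:
        handle.wait()
    handle = comm.isend_object(requests)

assert comm.log == ["isend:1", "wait:1", "isend:2"]
```

The resulting order — each `isend` followed by its `wait` before the next `isend` — is what rules out reusing a handle for an in-flight send.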

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

923-927: LGTM! Improved NVTX profiling instrumentation.

The use of nvtx_range context manager for wrapping the sample state receive operation provides cleaner profiling instrumentation compared to manual push/pop calls. This change aligns with the PR objective to standardize NVTX profiling patterns.


1002-1006: LGTM! Clear profiling for PP send handle synchronization.

The @nvtx_range decorator addition provides profiling visibility for the wait operation on pipeline-parallel send handles. This helps identify synchronization bottlenecks in multi-rank communication patterns.

tensorrt_llm/_torch/distributed/__init__.py (1)

3-3: Original review comment is incorrect.

PPComm was never part of the public API—it's not defined as a class in the codebase (only PPCommNCCL and PPCommTorch exist) and was never included in the __all__ list. No breaking change occurred. The import statement correctly exports only Distributed, MPIDist, and TorchDist, which are all present in __all__. No remaining references to PPComm exist in the codebase.

Likely an incorrect or invalid review comment.


Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd (Collaborator)

PR_Github #25272 [ run ] triggered by Bot. Commit: d2fd44b

@tensorrt-cicd (Collaborator)

PR_Github #25272 [ run ] completed with state FAILURE. Commit: d2fd44b
/LLM/main/L0_MergeRequest_PR pipeline #19117 completed with status: 'FAILURE'

@yuxianq (Collaborator, Author) commented Nov 21, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #25313 [ run ] triggered by Bot. Commit: d2fd44b

@tensorrt-cicd (Collaborator)

PR_Github #25313 [ run ] completed with state SUCCESS. Commit: d2fd44b
/LLM/main/L0_MergeRequest_PR pipeline #19149 completed with status: 'FAILURE'

@yuxianq (Collaborator, Author) commented Nov 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #25473 [ run ] triggered by Bot. Commit: 6cd6526

@tensorrt-cicd (Collaborator)

PR_Github #25473 [ run ] completed with state SUCCESS. Commit: 6cd6526
/LLM/main/L0_MergeRequest_PR pipeline #19287 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@QiJune (Collaborator) left a comment


LGTM

@yuxianq yuxianq merged commit 8a02950 into NVIDIA:main Nov 25, 2025
5 checks passed