
Conversation

@SimengLiu-nv SimengLiu-nv (Collaborator) commented Oct 28, 2025

Summary by CodeRabbit

  • Bug Fixes
    • Fixed device placement inconsistencies in FP8 quantized linear layers to ensure tensor operations complete correctly across different hardware configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
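
For example, two illustrative invocations built only from the options documented above (stage and GPU names are copied from the examples and are placeholders, not a recommendation):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast

/bot run --gpu-type "H100_PCIe" --post-merge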

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@SimengLiu-nv SimengLiu-nv requested a review from a team as a code owner October 28, 2025 01:57
@coderabbitai coderabbitai bot (Contributor) commented Oct 28, 2025

📝 Walkthrough

The FP8QDQLinearMethod class in tensorrt_llm/_torch/modules/linear.py was updated to normalize device placement between fused weight tensors and their corresponding weight scales during fusion operations. The two affected methods now explicitly track the original device, conditionally move tensors to a consistent device for computation, and restore the original device placement afterward.

Changes

Cohort / File(s): Device normalization for tensor operations (tensorrt_llm/_torch/modules/linear.py)
Summary: Modified load_weights_fused_qkv_linear and load_weights_fused_gate_up_linear to track the original device of weight_scale, temporarily move it to match fused_weight.device when needed for the division operation, and restore the original device afterward. Ensures consistent device placement during tensor operations.
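
For orientation, a minimal sketch of that pattern, assuming a module object that exposes a weight_scale Parameter (names follow the summary above; the actual code in linear.py may differ in detail):

import torch
from torch.nn import Parameter

def fuse_and_quantize(module, fused_weight):
    # Sketch of the device-normalization pattern described in the walkthrough.
    original_device = module.weight_scale.device
    # The division requires weight_scale and fused_weight to be on the same device.
    if original_device != fused_weight.device:
        module.weight_scale = Parameter(
            module.weight_scale.data.to(fused_weight.device))
    fused_weight = (fused_weight / module.weight_scale).to(torch.float8_e4m3fn)
    # Restore the original placement so later consumers see the scale where it was.
    if module.weight_scale.device != original_device:
        module.weight_scale = Parameter(
            module.weight_scale.data.to(original_device))
    return fused_weight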

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Device restoration paths: Verify that device restoration occurs correctly in all control flow branches, including potential error scenarios
  • Tensor operation consistency: Confirm that the temporary device movement and restoration does not introduce unintended side effects or device mismatches
  • Pattern completeness: Ensure both affected methods implement the device handling identically and determine if similar patterns are needed elsewhere in the codebase

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check ⚠️ Warning: The PR description is largely incomplete and does not meet the template requirements. While the template structure is present, the substantive sections are empty: the Description section contains only a comment placeholder with no explanation of the issue or solution, and the Test Coverage section lacks any listed tests. The PR Checklist items are provided but not checked off, and the PR body gives no implementation details, rationale, or evidence of test coverage despite these being required sections. This falls short of the "mostly complete" threshold needed for a pass.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check ✅ Passed: The PR title follows the required format specified in the template: [https://nvbugs/5599086][fix] Fix FP8 Linear module for spark. It includes a valid NVBugs ID, the correct type tag [fix], and a concise summary. The title clearly relates to the changeset, which fixes device placement issues in FP8QDQLinearMethod for fused linear operations in the TensorRT-LLM PyTorch modules, and is specific without being vague or overly broad.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/linear.py (2)

452-462: Simplify device handling logic and avoid multiple Parameter reassignments.

The current implementation has a redundant conditional check and creates new Parameter objects twice during device transfers. This can be simplified:

  1. The second check if original_device != module.weight_scale.device: (line 460) will only be true if the first condition at line 455 was true, making it redundant.
  2. Creating new Parameter objects repeatedly can affect object identity and may have unintended side effects (see the short illustration after the suggested diff below).

Consider this cleaner approach:

-        original_device = module.weight_scale.device
-        # module.weight_scale and fused_weight must be on the same device for division operation
-        # Reset the device of module.weight_scale to the original device if changed
+        # Ensure module.weight_scale and fused_weight are on the same device for division
         if module.weight_scale.device != fused_weight.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(fused_weight.device))
-        fused_weight = (fused_weight / module.weight_scale).to(
-            torch.float8_e4m3fn)
-        if original_device != module.weight_scale.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(original_device))
+            weight_scale_for_div = module.weight_scale.to(fused_weight.device)
+        else:
+            weight_scale_for_div = module.weight_scale
+        fused_weight = (fused_weight / weight_scale_for_div).to(torch.float8_e4m3fn)

This approach:

  • Eliminates the redundant second conditional check
  • Avoids mutating the Parameter object, preserving object identity
  • Achieves the same result with clearer logic
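
A tiny, generic illustration of the object-identity point (nn.Linear is used here purely as a stand-in, not the actual module):

from torch import nn

layer = nn.Linear(4, 4)
ref = layer.weight                        # e.g., a reference held elsewhere (optimizer, hook)
layer.weight = nn.Parameter(layer.weight.data.clone())
print(ref is layer.weight)                # False: the original Parameter object is orphaned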

498-508: Simplify device handling logic and avoid multiple Parameter reassignments.

The implementation here has the same issues as in load_weights_fused_qkv_linear: redundant conditional check and multiple Parameter object creations.

Apply the same simplification as suggested for the QKV method:

-        original_device = module.weight_scale.device
-        # module.weight_scale and fused_weight must be on the same device for division operation
-        # Reset the device of module.weight_scale to the original device if changed
         if module.weight_scale.device != fused_weight.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(fused_weight.device))
-        fused_weight = (fused_weight / module.weight_scale).to(
-            torch.float8_e4m3fn)
-        if original_device != module.weight_scale.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(original_device))
+            weight_scale_for_div = module.weight_scale.to(fused_weight.device)
+        else:
+            weight_scale_for_div = module.weight_scale
+        fused_weight = (fused_weight / weight_scale_for_div).to(torch.float8_e4m3fn)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5265a0 and 2311624.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/linear.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/modules/linear.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

452-462: Device transfer fix is correct and necessary.

Verification confirms the fix is required for correctness:

  1. Root cause of device mismatch: On integrated GPUs, load_weight_shard skips device transfers (lines 73-80) and weights stay on CPU, while weight_scale loaded from checkpoint ends up on the default GPU device.

  2. Fix is necessary: The division operation at line 458/504 requires both fused_weight and module.weight_scale on the same device. The temporary device swap enables this without conflicting with the integrated GPU optimization.

  3. Symmetric implementation: Both load_weights_fused_qkv_linear (lines 452-462) and load_weights_fused_gate_up_linear (lines 498-508) have identical fixes, which is correct.

The fix properly handles the device mismatch that results from the memory optimization on integrated GPUs and causes no performance concern.
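
As a side note, the failure mode can be reproduced standalone with hypothetical tensors (illustrative only, not the loader code; assumes a CUDA device and a torch build with float8_e4m3fn):

import torch

# Fused weight left on CPU (integrated-GPU path), scale loaded onto the default GPU.
fused_weight = torch.randn(16, 16)                      # cpu
weight_scale = torch.full((1,), 0.05, device="cuda")    # cuda

try:
    _ = fused_weight / weight_scale
except RuntimeError as err:
    print(f"device mismatch: {err}")

# Doing the division on a single device, then casting to FP8, avoids the error.
quantized = (fused_weight / weight_scale.to(fused_weight.device)).to(torch.float8_e4m3fn)
print(quantized.dtype, quantized.device)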

@SimengLiu-nv (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22711 [ run ] triggered by Bot. Commit: 2311624

@tensorrt-cicd (Collaborator)

PR_Github #22711 [ run ] completed with state SUCCESS. Commit: 2311624
/LLM/main/L0_MergeRequest_PR pipeline #17124 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@farazkh80 farazkh80 (Collaborator) left a comment

Thanks for finding the root cause. LGTM, except for that one comment.

@SimengLiu-nv (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22796 [ run ] triggered by Bot. Commit: 422ae7c

@tensorrt-cicd (Collaborator)

PR_Github #22796 [ run ] completed with state SUCCESS. Commit: 422ae7c
/LLM/main/L0_MergeRequest_PR pipeline #17193 completed with status: 'FAILURE'

@SimengLiu-nv (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22906 [ run ] triggered by Bot. Commit: 422ae7c

@tensorrt-cicd (Collaborator)

PR_Github #22906 [ run ] completed with state SUCCESS. Commit: 422ae7c
/LLM/main/L0_MergeRequest_PR pipeline #17276 completed with status: 'SUCCESS'

@SimengLiu-nv (Collaborator Author)

/bot reuse-pipeline

@SimengLiu-nv SimengLiu-nv enabled auto-merge (squash) October 29, 2025 16:36
@tensorrt-cicd (Collaborator)

PR_Github #22915 [ reuse-pipeline ] triggered by Bot. Commit: b31d0ad

@tensorrt-cicd (Collaborator)

PR_Github #22915 [ reuse-pipeline ] completed with state SUCCESS. Commit: b31d0ad
Reusing PR_Github #22906 for commit b31d0ad

@SimengLiu-nv SimengLiu-nv merged commit 834a780 into NVIDIA:main Oct 29, 2025
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025