
Conversation

@SimengLiu-nv SimengLiu-nv (Collaborator) commented Oct 28, 2025

Summary by CodeRabbit

  • Bug Fixes
    • Fixed device placement inconsistencies in FP8 quantized linear layers to ensure tensor operations complete correctly across different hardware configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
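
For example, two illustrative invocations built only from the options documented above (stage and GPU names are copied from the examples and are placeholders, not a recommendation):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast

/bot run --gpu-type "H100_PCIe" --post-merge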

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@SimengLiu-nv SimengLiu-nv requested a review from a team as a code owner October 28, 2025 01:57
@coderabbitai coderabbitai bot (Contributor) commented Oct 28, 2025

📝 Walkthrough

The FP8QDQLinearMethod class in tensorrt_llm/_torch/modules/linear.py was updated to normalize device placement between fused weight tensors and their corresponding weight scales during fusion operations. The two affected methods now explicitly track the original device, conditionally move tensors to a consistent device for computation, and restore the original device placement afterward.

Changes

Cohort / File(s): Device normalization for tensor operations (tensorrt_llm/_torch/modules/linear.py)
Summary: Modified load_weights_fused_qkv_linear and load_weights_fused_gate_up_linear to track the original device of weight_scale, temporarily move it to match fused_weight.device when needed for the division operation, and restore the original device afterward. Ensures consistent device placement during tensor operations.
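
For orientation, a minimal sketch of that pattern, assuming a module object that exposes a weight_scale Parameter (names follow the summary above; the actual code in linear.py may differ in detail):

import torch
from torch.nn import Parameter

def fuse_and_quantize(module, fused_weight):
    # Sketch of the device-normalization pattern described in the walkthrough.
    original_device = module.weight_scale.device
    # The division requires weight_scale and fused_weight to be on the same device.
    if original_device != fused_weight.device:
        module.weight_scale = Parameter(
            module.weight_scale.data.to(fused_weight.device))
    fused_weight = (fused_weight / module.weight_scale).to(torch.float8_e4m3fn)
    # Restore the original placement so later consumers see the scale where it was.
    if module.weight_scale.device != original_device:
        module.weight_scale = Parameter(
            module.weight_scale.data.to(original_device))
    return fused_weight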

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Device restoration paths: Verify that device restoration occurs correctly in all control flow branches, including potential error scenarios
  • Tensor operation consistency: Confirm that the temporary device movement and restoration does not introduce unintended side effects or device mismatches
  • Pattern completeness: Ensure both affected methods implement the device handling identically and determine if similar patterns are needed elsewhere in the codebase

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check ⚠️ Warning: The PR description is largely incomplete and does not meet the template requirements. While the template structure is present, the substantive sections are empty: the Description section contains only a comment placeholder with no explanation of the issue or solution, and the Test Coverage section lacks any listed tests. The PR Checklist items are provided but not checked off, and the PR body gives no implementation details, rationale, or evidence of test coverage despite these being required sections. This falls short of the "mostly complete" threshold needed for a pass.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check ✅ Passed: The PR title follows the required format specified in the template: [https://nvbugs/5599086][fix] Fix FP8 Linear module for spark. It includes a valid NVBugs ID, the correct type tag [fix], and a concise summary. The title clearly relates to the changeset, which fixes device placement issues in FP8QDQLinearMethod for fused linear operations in the TensorRT-LLM PyTorch modules, and is specific without being vague or overly broad.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/linear.py (2)

452-462: Simplify device handling logic and avoid multiple Parameter reassignments.

The current implementation has a redundant conditional check and creates new Parameter objects twice during device transfers. This can be simplified:

  1. The second check if original_device != module.weight_scale.device: (line 460) will only be true if the first condition at line 455 was true, making it redundant.
  2. Creating new Parameter objects repeatedly can affect object identity and may have unintended side effects (see the short illustration after the suggested diff below).

Consider this cleaner approach:

-        original_device = module.weight_scale.device
-        # module.weight_scale and fused_weight must be on the same device for division operation
-        # Reset the device of module.weight_scale to the original device if changed
+        # Ensure module.weight_scale and fused_weight are on the same device for division
         if module.weight_scale.device != fused_weight.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(fused_weight.device))
-        fused_weight = (fused_weight / module.weight_scale).to(
-            torch.float8_e4m3fn)
-        if original_device != module.weight_scale.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(original_device))
+            weight_scale_for_div = module.weight_scale.to(fused_weight.device)
+        else:
+            weight_scale_for_div = module.weight_scale
+        fused_weight = (fused_weight / weight_scale_for_div).to(torch.float8_e4m3fn)

This approach:

  • Eliminates the redundant second conditional check
  • Avoids mutating the Parameter object, preserving object identity
  • Achieves the same result with clearer logic
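
A tiny, generic illustration of the object-identity point (nn.Linear is used here purely as a stand-in, not the actual module):

from torch import nn

layer = nn.Linear(4, 4)
ref = layer.weight                        # e.g., a reference held elsewhere (optimizer, hook)
layer.weight = nn.Parameter(layer.weight.data.clone())
print(ref is layer.weight)                # False: the original Parameter object is orphaned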

498-508: Simplify device handling logic and avoid multiple Parameter reassignments.

The implementation here has the same issues as in load_weights_fused_qkv_linear: redundant conditional check and multiple Parameter object creations.

Apply the same simplification as suggested for the QKV method:

-        original_device = module.weight_scale.device
-        # module.weight_scale and fused_weight must be on the same device for division operation
-        # Reset the device of module.weight_scale to the original device if changed
         if module.weight_scale.device != fused_weight.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(fused_weight.device))
-        fused_weight = (fused_weight / module.weight_scale).to(
-            torch.float8_e4m3fn)
-        if original_device != module.weight_scale.device:
-            module.weight_scale = Parameter(
-                module.weight_scale.data.to(original_device))
+            weight_scale_for_div = module.weight_scale.to(fused_weight.device)
+        else:
+            weight_scale_for_div = module.weight_scale
+        fused_weight = (fused_weight / weight_scale_for_div).to(torch.float8_e4m3fn)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5265a0 and 2311624.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/linear.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/modules/linear.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

452-462: Device transfer fix is correct and necessary.

Verification confirms the fix is required for correctness:

  1. Root cause of device mismatch: On integrated GPUs, load_weight_shard skips device transfers (lines 73-80) and weights stay on CPU, while weight_scale loaded from checkpoint ends up on the default GPU device.

  2. Fix is necessary: The division operation at line 458/504 requires both fused_weight and module.weight_scale on the same device. The temporary device swap enables this without conflicting with the integrated GPU optimization.

  3. Symmetric implementation: Both load_weights_fused_qkv_linear (lines 452-462) and load_weights_fused_gate_up_linear (lines 498-508) have identical fixes, which is correct.

The fix properly handles the device mismatch that results from the memory optimization on integrated GPUs and causes no performance concern.
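
As a side note, the failure mode can be reproduced standalone with hypothetical tensors (illustrative only, not the loader code; assumes a CUDA device and a torch build with float8_e4m3fn):

import torch

# Fused weight left on CPU (integrated-GPU path), scale loaded onto the default GPU.
fused_weight = torch.randn(16, 16)                      # cpu
weight_scale = torch.full((1,), 0.05, device="cuda")    # cuda

try:
    _ = fused_weight / weight_scale
except RuntimeError as err:
    print(f"device mismatch: {err}")

# Doing the division on a single device, then casting to FP8, avoids the error.
quantized = (fused_weight / weight_scale.to(fused_weight.device)).to(torch.float8_e4m3fn)
print(quantized.dtype, quantized.device)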

@SimengLiu-nv (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22711 [ run ] triggered by Bot. Commit: 2311624

@tensorrt-cicd (Collaborator)

PR_Github #22711 [ run ] completed with state SUCCESS. Commit: 2311624
/LLM/main/L0_MergeRequest_PR pipeline #17124 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@farazkh80 farazkh80 (Collaborator) left a comment

Thanks for finding the root cause. LGTM, except for that one comment.

@SimengLiu-nv (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22796 [ run ] triggered by Bot. Commit: 422ae7c

@tensorrt-cicd (Collaborator)

PR_Github #22796 [ run ] completed with state SUCCESS. Commit: 422ae7c
/LLM/main/L0_MergeRequest_PR pipeline #17193 completed with status: 'FAILURE'

@SimengLiu-nv (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #22906 [ run ] triggered by Bot. Commit: 422ae7c

@tensorrt-cicd (Collaborator)

PR_Github #22906 [ run ] completed with state SUCCESS. Commit: 422ae7c
/LLM/main/L0_MergeRequest_PR pipeline #17276 completed with status: 'SUCCESS'

@SimengLiu-nv (Collaborator Author)

/bot reuse-pipeline

@SimengLiu-nv SimengLiu-nv enabled auto-merge (squash) October 29, 2025 16:36
@tensorrt-cicd (Collaborator)

PR_Github #22915 [ reuse-pipeline ] triggered by Bot. Commit: b31d0ad

@tensorrt-cicd (Collaborator)

PR_Github #22915 [ reuse-pipeline ] completed with state SUCCESS. Commit: b31d0ad
Reusing PR_Github #22906 for commit b31d0ad

@SimengLiu-nv SimengLiu-nv merged commit 834a780 into NVIDIA:main Oct 29, 2025
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025