Overhaul backend function execution for improved performance and flexibility #489

jhalakpatel · 2025-02-05T21:27:00Z

This PR replaces the DPS-style (destination-passing style) calling convention with a non-DPS approach, so call sites no longer have to preallocate output buffers. As a result, we can skip computing output shapes and allocating output buffers ahead of time, which lays the groundwork for supporting data-dependent shapes, where network outputs can have dynamic dimensions.
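To make the difference between the two calling conventions concrete, here is a minimal pure-Python sketch. It is not the nvtripy or MLIR-TRT API; the function names (`double_dps`, `double_non_dps`, `nonzero_indices`) are hypothetical and plain lists stand in for device buffers.

```python
from typing import List


def double_dps(inp: List[float], out: List[float]) -> None:
    """DPS: the caller preallocates `out`, so it must know the output shape up front."""
    if len(out) != len(inp):
        raise ValueError("caller allocated an output buffer of the wrong size")
    for i, value in enumerate(inp):
        out[i] = 2.0 * value


def double_non_dps(inp: List[float]) -> List[float]:
    """Non-DPS: the callee allocates and returns the output."""
    return [2.0 * value for value in inp]


def nonzero_indices(inp: List[float]) -> List[int]:
    """Data-dependent shape: the output length is only known after looking at the data."""
    return [i for i, value in enumerate(inp) if value != 0.0]


data = [0.0, 1.5, 0.0, -2.0]

# DPS call site: compute the output shape, allocate, then call.
out = [0.0] * len(data)
double_dps(data, out)

# Non-DPS call site: no shape computation or allocation needed.
result = double_non_dps(data)

# With DPS the caller could not size this buffer exactly in advance.
indices = nonzero_indices(data)
```

The last function shows why the non-DPS convention matters for data-dependent shapes: a DPS caller would have to allocate a worst-case buffer, whereas a non-DPS callee can return one of exactly the right size.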

The underlying compiler stack has been enhanced to avoid allocating oversized buffers and eliminate an extra device-to-device copy operation from TensorRT-allocated memory to MLIR-TRT managed memory.

Additionally, we've improved the copy operation to support copying to host memory. This enhancement removes the need to track output device allocations for device-to-host copies. Previously, copy outputs were restricted to device allocations; now they can be allocated on both device and host.

Tests have been updated to align with the new calling convention, ensuring compatibility and correctness.

Other changes:
- Fix type constraints tests
- Address review comments

pranavm-nvidia · 2025-02-05T23:19:08Z

tripy/nvtripy/backend/api/executable.py

@@ -46,21 +46,19 @@ class Executable:
    """

    # The constructor is intentionally undocumented because it is not meant to be called by users.
-    # TODO(#155): output_devices is not needed after they can be queried from executable


Does this PR also resolve #155?

pranavm-nvidia · 2025-02-11T17:55:19Z

tripy/tests/integration/test_dequantize.py

@@ -29,6 +29,7 @@ class TestDequantize:
    @pytest.mark.parametrize(
        "dtype", [tp.float32, tp.float16, pytest.param(tp.bfloat16, marks=skip_if_older_than_sm80)]
    )
+    @pytest.mark.skip("StableHLO QDQ broken")


Can we xfail these instead of skip? That way we will remember to turn them on again when we switch to TRT dialect.
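A minimal sketch of what the suggestion could look like, assuming pytest's standard `xfail` marker. The test name and body are placeholders, not the real dequantize test, and `strict=True` is an optional choice that goes beyond what the comment asks for.

```python
import pytest


# xfail still runs the test, unlike skip. With strict=True an unexpected pass
# is reported as a failure, which flags that the marker can be removed.
@pytest.mark.xfail(reason="StableHLO QDQ broken", strict=True)
def test_dequantize_placeholder():
    # Placeholder standing in for the real test body, which currently fails.
    raise AssertionError("QDQ path is broken")
```

Without `strict=True`, an unexpected pass is reported as XPASS but does not fail the run, so it is easier to miss.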

jhalakpatel mentioned this pull request Feb 5, 2025

Overhaul backend function execution for improved performance and flexibility #270

Closed

pranavm-nvidia reviewed Feb 5, 2025

jhalakpatel force-pushed the jhalakp-tripy-update-non-dps branch from 14e9d9d to de4ec3b Compare February 6, 2025 01:37

jhalakpatel added 2 commits February 10, 2025 17:39

Fix broken QDQ tests due to stablehlo bug

b0e9c48

jhalakpatel force-pushed the jhalakp-tripy-update-non-dps branch from de4ec3b to b0e9c48 Compare February 11, 2025 02:39

pranavm-nvidia reviewed Feb 11, 2025

pranavm-nvidia and others added 2 commits February 13, 2025 09:48

Fixes debug options, uses Executable in eager mode

dcb2a65

Merge pull request #2 from NVIDIA/pranavm-fix-debug

4477cea
