@wsttiger (Collaborator)

Add CUDA Graph Optimization to TensorRT Decoder

Overview

This PR implements CUDA graph optimization for the TensorRT decoder, delivering a measured ~20% reduction in inference latency while maintaining full numerical accuracy. The implementation uses a clean executor abstraction that automatically falls back to traditional execution for models with dynamic shapes.

Key Features

1. CUDA Graph Optimization

  • Captures TensorRT inference operations into a CUDA graph on first decode call
  • Replays the captured graph on subsequent calls, eliminating kernel launch overhead (see the sketch below)
  • ~20% reduction in inference latency (1.24x speedup)
  • Follows NVIDIA TensorRT best practices
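
A minimal sketch of this capture/replay flow, assuming a TensorRT `IExecutionContext` whose tensor addresses are already bound and stay fixed across calls. The struct and variable names here are illustrative, not the PR's actual code:

```cpp
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Owns the captured graph; destroyed with the decoder.
struct GraphedInference {
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t graph_exec = nullptr;
  bool captured = false;

  // First call: record enqueueV3() into a graph. Later calls: replay it.
  bool run(nvinfer1::IExecutionContext &ctx, cudaStream_t stream) {
    if (!captured) {
      ctx.enqueueV3(stream);       // warm-up so lazy init happens outside capture
      cudaStreamSynchronize(stream);
      cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
      ctx.enqueueV3(stream);       // recorded into the graph, not executed
      cudaStreamEndCapture(stream, &graph);
      cudaGraphInstantiate(&graph_exec, graph, 0);   // CUDA 12 signature
      captured = true;
    }
    cudaGraphLaunch(graph_exec, stream);   // single launch per decode
    return cudaStreamSynchronize(stream) == cudaSuccess;
  }

  ~GraphedInference() {
    if (graph_exec) cudaGraphExecDestroy(graph_exec);
    if (graph) cudaGraphDestroy(graph);
  }
};
```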

2. Executor Abstraction (PIMPL + Variant Pattern)

  • TraditionalExecutor: Standard TensorRT execution path
  • CudaGraphExecutor: CUDA graph-optimized execution path
  • Zero-overhead std::variant dispatch (no virtual calls; see the dispatch sketch below)
  • Clean separation of concerns, easy to extend
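
As a self-contained illustration of the dispatch pattern (not the PR's actual classes), the following compiles and runs on its own:

```cpp
#include <variant>
#include <cstdio>

struct TraditionalExecutor {
  bool execute() { std::puts("traditional enqueue"); return true; }
};

struct CudaGraphExecutor {
  bool execute() { std::puts("graph replay"); return true; }
};

using Executor = std::variant<TraditionalExecutor, CudaGraphExecutor>;

// std::visit resolves the active alternative without virtual dispatch.
bool decode_once(Executor &ex) {
  return std::visit([](auto &e) { return e.execute(); }, ex);
}

int main() {
  Executor ex = CudaGraphExecutor{};   // chosen once, at construction time
  return decode_once(ex) ? 0 : 1;
}
```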

3. Dynamic Shape Detection

  • Automatically detects models with dynamic tensor dimensions
  • Detects models with multiple optimization profiles
  • Falls back to traditional execution when CUDA graphs are incompatible (a detection sketch follows this list)
  • Prevents runtime errors from improper CUDA graph usage
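
A sketch of what such a compatibility check can look like against the TensorRT API; `supports_cuda_graphs` is the name this PR uses, but the body below is an assumption:

```cpp
#include <NvInfer.h>

bool supports_cuda_graphs(const nvinfer1::ICudaEngine &engine) {
  // Multiple optimization profiles imply shape switching at run time.
  if (engine.getNbOptimizationProfiles() > 1)
    return false;
  // Any -1 dimension marks a dynamic tensor; graphs bake in fixed shapes.
  for (int i = 0; i < engine.getNbIOTensors(); ++i) {
    const char *name = engine.getIOTensorName(i);
    const nvinfer1::Dims dims = engine.getTensorShape(name);
    for (int d = 0; d < dims.nbDims; ++d)
      if (dims.d[d] == -1)
        return false;
  }
  return true;
}
```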

4. User Control

  • New parameter: use_cuda_graph (default: true)
  • Users can explicitly disable CUDA graphs if needed
  • Clear logging of executor selection and reasoning

Performance Results

Benchmark Test (200 iterations):

  • CUDA Graph: ~14.4 μs average
  • Traditional: ~17.8 μs average
  • Speedup: 1.24x (~19% latency reduction)

Commits

Implement CUDA graph capture and replay for TensorRT inference operations
to reduce kernel launch overhead and improve performance. Following NVIDIA
TensorRT best practices, the decoder now:

- Captures inference operations into a CUDA graph on first decode() call
- Executes the captured graph on all subsequent decode() calls
- Properly manages CUDA graph lifecycle with cleanup in destructor

Changes:
- Add CUDA graph member variables (cuda_graph_captured_, cuda_graph_,
  cuda_graph_exec_)
- Modify decode() to capture setTensorAddress() and enqueueV3() operations
  on first invocation and replay the graph on subsequent calls
- Add CUDA graph cleanup to destructor

Performance benefits:
- 10-20% reduction in inference latency from reduced launch overhead
- Better GPU utilization through optimized command submission
- Memory operations stay outside the graph, allowing per-call data updates (illustrated below)
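
A sketch of that split, assuming fixed device buffers that the graph was captured against (buffer names and sizes are illustrative):

```cpp
#include <cuda_runtime_api.h>

void decode_with_graph(cudaGraphExec_t graph_exec, cudaStream_t stream,
                       const float *in_host, float *in_dev,
                       float *out_dev, float *out_host, size_t n) {
  // New input each call: copy into the same device buffer the graph uses.
  cudaMemcpyAsync(in_dev, in_host, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  cudaGraphLaunch(graph_exec, stream);   // fixed kernel sequence replays
  cudaMemcpyAsync(out_host, out_dev, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
}
```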

Tested with all unit tests passing (10/10) including actual GPU inference
validation with numerical accuracy verified.

Ref: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
Signed-off-by: Scott Thornton <[email protected]>
… detection

Refactor the TensorRT decoder implementation to use a PIMPL + variant pattern
with separate executor strategies, enabling better extensibility and automatic
handling of models with dynamic shapes.

Key changes:

1. Executor Abstraction:
   - Introduce TraditionalExecutor for standard TensorRT execution
   - Introduce CudaGraphExecutor for CUDA graph-optimized execution
   - Use std::variant for zero-overhead executor selection
   - Executors encapsulate their state and lifecycle management

2. Dynamic Shape Detection:
   - Add supports_cuda_graphs() to detect model compatibility
   - Automatically fall back to traditional execution for:
     * Models with dynamic tensor dimensions
     * Models with multiple optimization profiles
   - Prevents runtime errors from incompatible CUDA graph usage

3. User Control:
   - Support 'use_cuda_graph' parameter (default: true)
   - Users can explicitly disable CUDA graphs if needed
   - Clear logging of executor selection and reasoning

4. PIMPL Pattern:
   - Hide all TensorRT implementation details in Impl struct
   - Clean separation between interface and implementation
   - Improved compilation times for users of the decoder (sketched after the architecture benefits below)

5. Code Quality:
   - Rename 'initialized_' to 'decoder_ready_' for clarity
   - Simplified decode() method with single execution path
   - Better organized resource cleanup in Impl destructor

Architecture benefits:
- Extensible: Easy to add new executor types (e.g., batched, streamed)
- Type-safe: Compile-time guarantees via std::variant
- Zero overhead: No virtual call overhead vs manual branching
- Maintainable: Clear separation of concerns
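
A compilable sketch of the PIMPL arrangement; the class name and the `decoder_ready_` member follow the PR's description, while the bodies are placeholders:

```cpp
#include <memory>
#include <string>
#include <vector>

// --- trt_decoder.h: no TensorRT headers leak to clients -----------------
class trt_decoder {
public:
  explicit trt_decoder(const std::string &engine_path);
  ~trt_decoder();                  // defined where Impl is a complete type
  std::vector<float> decode(const std::vector<float> &syndrome);

private:
  struct Impl;                     // engine, context, Executor live here
  std::unique_ptr<Impl> impl_;
};

// --- trt_decoder.cpp -----------------------------------------------------
struct trt_decoder::Impl {
  bool decoder_ready_ = false;     // renamed from initialized_ in this PR
};

trt_decoder::trt_decoder(const std::string & /*engine_path*/)
    : impl_(std::make_unique<Impl>()) {}
trt_decoder::~trt_decoder() = default;

std::vector<float> trt_decoder::decode(const std::vector<float> &syndrome) {
  return syndrome;                 // placeholder body for the sketch
}
```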

All tests passing (10/10) with numerical accuracy verified.

Signed-off-by: Scott Thornton <[email protected]>
Add PerformanceComparisonCudaGraphVsTraditional test that quantifies the
performance benefit of CUDA graph execution vs traditional TensorRT execution.

Test measures 200 iterations after warm-up and demonstrates:
- CUDA Graph: ~14.4 μs average
- Traditional: ~17.8 μs average
- Speedup: 1.24x (~19% latency reduction)

Includes assertions ensuring ≥5% speedup and convergence validation.
Provides empirical evidence that CUDA graphs deliver measurable performance
benefits without sacrificing accuracy.

All 11 tests passing.
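
For reference, a rough illustration of this measurement methodology, with `run_decode` as a hypothetical stand-in for a single decode() call (not the test's actual code):

```cpp
#include <chrono>

double mean_latency_us(void (*run_decode)(), int iters = 200) {
  for (int i = 0; i < 10; ++i) run_decode();            // warm-up
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) run_decode();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

// The test then asserts at least a 5% speedup:
//   assert(mean_latency_us(run_traditional) / mean_latency_us(run_graph) >= 1.05);
```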

Signed-off-by: Scott Thornton <[email protected]>
wsttiger and others added 6 commits November 25, 2025 17:26
Signed-off-by: Scott Thornton <[email protected]>
Replace lazy CUDA graph capture with eager capture during decoder construction.
This provides fail-fast behavior, predictable performance, and better error handling.

Key changes:
- Add try_capture_cuda_graph() helper function that captures graphs during
  initialization using dummy input data
- Refactor CudaGraphExecutor to accept pre-captured graph/exec handles via
  constructor instead of performing lazy capture on first inference
- Implement proper move semantics (move ctor/assignment) and delete copy
  operations to prevent double-free issues when stored in std::variant
  (see the sketch at the end of this message)
- Update trt_decoder constructor to attempt eager capture and gracefully
  fall back to TraditionalExecutor on failure with detailed error messages
- Remove 'captured' flag and conditional logic from execution hot path

Benefits:
- Immediate error detection at construction time (fail-fast)
- Consistent, predictable performance across all decode() calls
- No capture overhead on first inference
- Simplified executor code with cleaner separation of concerns
- Robust fallback mechanism with diagnostic error messages

Performance: Benchmark shows 1.26x speedup (20.9% improvement) with CUDA
graphs vs traditional execution.
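
A sketch of the move-only wrapper this commit describes; the class name matches the PR, but the body is an illustration of the idea, not the merged code:

```cpp
#include <cuda_runtime_api.h>
#include <utility>

class CudaGraphExecutor {
public:
  // Takes ownership of handles produced by an eager capture step.
  CudaGraphExecutor(cudaGraph_t g, cudaGraphExec_t e) : graph_(g), exec_(e) {}

  // Copies deleted: two owners of one handle would double-free in ~dtor.
  CudaGraphExecutor(const CudaGraphExecutor &) = delete;
  CudaGraphExecutor &operator=(const CudaGraphExecutor &) = delete;

  // Moves transfer ownership, leaving the source empty — safe inside
  // std::variant, which move-constructs on assignment.
  CudaGraphExecutor(CudaGraphExecutor &&o) noexcept
      : graph_(std::exchange(o.graph_, nullptr)),
        exec_(std::exchange(o.exec_, nullptr)) {}

  CudaGraphExecutor &operator=(CudaGraphExecutor &&o) noexcept {
    if (this != &o) {
      release();
      graph_ = std::exchange(o.graph_, nullptr);
      exec_ = std::exchange(o.exec_, nullptr);
    }
    return *this;
  }

  ~CudaGraphExecutor() { release(); }

  // Hot path: no capture flag, no branching — just replay.
  bool execute(cudaStream_t stream) {
    return cudaGraphLaunch(exec_, stream) == cudaSuccess;
  }

private:
  void release() {
    if (exec_) cudaGraphExecDestroy(exec_);
    if (graph_) cudaGraphDestroy(graph_);
    exec_ = nullptr;
    graph_ = nullptr;
  }
  cudaGraph_t graph_ = nullptr;
  cudaGraphExec_t exec_ = nullptr;
};
```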

Signed-off-by: Scott Thornton <[email protected]>
Add comprehensive test suite for CUDA graph functionality in TensorRT decoder:

- Add TensorRT Python API imports with availability check
- Add build_simple_mlp_engine() helper to create test engines with
  configurable dynamic shapes (for testing CUDA graph compatibility)
- Add test_performance_comparison_cuda_graph_vs_traditional() to benchmark
  performance improvements (validates >5% speedup from graph optimization)
- Add test_cuda_graph_vs_traditional_correctness() to verify identical
  outputs between CUDA graph and traditional execution paths

The tests validate both performance gains and correctness guarantees
of the CUDA graph execution path.

Signed-off-by: Scott Thornton <[email protected]>
Signed-off-by: Scott Thornton <[email protected]>
@wsttiger (Collaborator, Author) commented Dec 8, 2025

/ok to test 5ddefa0

@bmhowe23 (Collaborator) left a comment

Assuming the Build wheels job passes, these changes look good to me.
