@wsttiger (Collaborator)

Add CUDA Graph Optimization to TensorRT Decoder

Overview

This PR implements CUDA graph optimization for the TensorRT decoder, delivering a measured ~20% reduction in inference latency while maintaining full numerical accuracy. The implementation uses a clean executor abstraction that automatically falls back to traditional execution for models with dynamic shapes.

Key Features

1. CUDA Graph Optimization

  • Captures TensorRT inference operations into a CUDA graph on first decode call
  • Replays the captured graph on subsequent calls, eliminating kernel launch overhead (see the sketch below)
  • ~20% reduction in inference latency (1.24x speedup)
  • Follows NVIDIA TensorRT best practices
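
A minimal sketch of this capture/replay flow, assuming a TensorRT `IExecutionContext` whose tensor addresses are already bound and stay fixed across calls. The struct and variable names here are illustrative, not the PR's actual code:

```cpp
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Owns the captured graph; destroyed with the decoder.
struct GraphedInference {
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t graph_exec = nullptr;
  bool captured = false;

  // First call: record enqueueV3() into a graph. Later calls: replay it.
  bool run(nvinfer1::IExecutionContext &ctx, cudaStream_t stream) {
    if (!captured) {
      ctx.enqueueV3(stream);       // warm-up so lazy init happens outside capture
      cudaStreamSynchronize(stream);
      cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
      ctx.enqueueV3(stream);       // recorded into the graph, not executed
      cudaStreamEndCapture(stream, &graph);
      cudaGraphInstantiate(&graph_exec, graph, 0);   // CUDA 12 signature
      captured = true;
    }
    cudaGraphLaunch(graph_exec, stream);   // single launch per decode
    return cudaStreamSynchronize(stream) == cudaSuccess;
  }

  ~GraphedInference() {
    if (graph_exec) cudaGraphExecDestroy(graph_exec);
    if (graph) cudaGraphDestroy(graph);
  }
};
```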

2. Executor Abstraction (PIMPL + Variant Pattern)

  • TraditionalExecutor: Standard TensorRT execution path
  • CudaGraphExecutor: CUDA graph-optimized execution path
  • Zero-overhead std::variant dispatch (no virtual calls; see the dispatch sketch below)
  • Clean separation of concerns, easy to extend
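
As a self-contained illustration of the dispatch pattern (not the PR's actual classes), the following compiles and runs on its own:

```cpp
#include <variant>
#include <cstdio>

struct TraditionalExecutor {
  bool execute() { std::puts("traditional enqueue"); return true; }
};

struct CudaGraphExecutor {
  bool execute() { std::puts("graph replay"); return true; }
};

using Executor = std::variant<TraditionalExecutor, CudaGraphExecutor>;

// std::visit resolves the active alternative without virtual dispatch.
bool decode_once(Executor &ex) {
  return std::visit([](auto &e) { return e.execute(); }, ex);
}

int main() {
  Executor ex = CudaGraphExecutor{};   // chosen once, at construction time
  return decode_once(ex) ? 0 : 1;
}
```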

3. Dynamic Shape Detection

  • Automatically detects models with dynamic tensor dimensions
  • Detects models with multiple optimization profiles
  • Falls back to traditional execution when CUDA graphs are incompatible (a detection sketch follows this list)
  • Prevents runtime errors from improper CUDA graph usage
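
A sketch of what such a compatibility check can look like against the TensorRT API; `supports_cuda_graphs` is the name this PR uses, but the body below is an assumption:

```cpp
#include <NvInfer.h>

bool supports_cuda_graphs(const nvinfer1::ICudaEngine &engine) {
  // Multiple optimization profiles imply shape switching at run time.
  if (engine.getNbOptimizationProfiles() > 1)
    return false;
  // Any -1 dimension marks a dynamic tensor; graphs bake in fixed shapes.
  for (int i = 0; i < engine.getNbIOTensors(); ++i) {
    const char *name = engine.getIOTensorName(i);
    const nvinfer1::Dims dims = engine.getTensorShape(name);
    for (int d = 0; d < dims.nbDims; ++d)
      if (dims.d[d] == -1)
        return false;
  }
  return true;
}
```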

4. User Control

  • New parameter: use_cuda_graph (default: true)
  • Users can explicitly disable CUDA graphs if needed
  • Clear logging of executor selection and reasoning

Performance Results

Benchmark Test (200 iterations):

  • CUDA Graph: ~14.4 μs average
  • Traditional: ~17.8 μs average
  • Speedup: 1.24x (~19% latency reduction)

Commits

Implement CUDA graph capture and replay for TensorRT inference operations
to reduce kernel launch overhead and improve performance. Following NVIDIA
TensorRT best practices, the decoder now:

- Captures inference operations into a CUDA graph on first decode() call
- Executes the captured graph on all subsequent decode() calls
- Properly manages CUDA graph lifecycle with cleanup in destructor

Changes:
- Add CUDA graph member variables (cuda_graph_captured_, cuda_graph_,
  cuda_graph_exec_)
- Modify decode() to capture setTensorAddress() and enqueueV3() operations
  on first invocation and replay the graph on subsequent calls
- Add CUDA graph cleanup to destructor

Performance benefits:
- 10-20% reduction in inference latency from reduced launch overhead
- Better GPU utilization through optimized command submission
- Memory operations stay outside the graph, allowing per-call data updates (illustrated below)
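
A sketch of that split, assuming fixed device buffers that the graph was captured against (buffer names and sizes are illustrative):

```cpp
#include <cuda_runtime_api.h>

void decode_with_graph(cudaGraphExec_t graph_exec, cudaStream_t stream,
                       const float *in_host, float *in_dev,
                       float *out_dev, float *out_host, size_t n) {
  // New input each call: copy into the same device buffer the graph uses.
  cudaMemcpyAsync(in_dev, in_host, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  cudaGraphLaunch(graph_exec, stream);   // fixed kernel sequence replays
  cudaMemcpyAsync(out_host, out_dev, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
}
```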

Tested with all unit tests passing (10/10) including actual GPU inference
validation with numerical accuracy verified.

Ref: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
Signed-off-by: Scott Thornton <[email protected]>
… detection

Refactor the TensorRT decoder implementation to use a PIMPL + variant pattern
with separate executor strategies, enabling better extensibility and automatic
handling of models with dynamic shapes.

Key changes:

1. Executor Abstraction:
   - Introduce TraditionalExecutor for standard TensorRT execution
   - Introduce CudaGraphExecutor for CUDA graph-optimized execution
   - Use std::variant for zero-overhead executor selection
   - Executors encapsulate their state and lifecycle management

2. Dynamic Shape Detection:
   - Add supports_cuda_graphs() to detect model compatibility
   - Automatically fall back to traditional execution for:
     * Models with dynamic tensor dimensions
     * Models with multiple optimization profiles
   - Prevents runtime errors from incompatible CUDA graph usage

3. User Control:
   - Support 'use_cuda_graph' parameter (default: true)
   - Users can explicitly disable CUDA graphs if needed
   - Clear logging of executor selection and reasoning

4. PIMPL Pattern:
   - Hide all TensorRT implementation details in Impl struct
   - Clean separation between interface and implementation
   - Improved compilation times for users of the decoder (sketched after the architecture benefits below)

5. Code Quality:
   - Rename 'initialized_' to 'decoder_ready_' for clarity
   - Simplified decode() method with single execution path
   - Better organized resource cleanup in Impl destructor

Architecture benefits:
- Extensible: Easy to add new executor types (e.g., batched, streamed)
- Type-safe: Compile-time guarantees via std::variant
- Zero overhead: No virtual call overhead vs manual branching
- Maintainable: Clear separation of concerns
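
A compilable sketch of the PIMPL arrangement; the class name and the `decoder_ready_` member follow the PR's description, while the bodies are placeholders:

```cpp
#include <memory>
#include <string>
#include <vector>

// --- trt_decoder.h: no TensorRT headers leak to clients -----------------
class trt_decoder {
public:
  explicit trt_decoder(const std::string &engine_path);
  ~trt_decoder();                  // defined where Impl is a complete type
  std::vector<float> decode(const std::vector<float> &syndrome);

private:
  struct Impl;                     // engine, context, Executor live here
  std::unique_ptr<Impl> impl_;
};

// --- trt_decoder.cpp -----------------------------------------------------
struct trt_decoder::Impl {
  bool decoder_ready_ = false;     // renamed from initialized_ in this PR
};

trt_decoder::trt_decoder(const std::string & /*engine_path*/)
    : impl_(std::make_unique<Impl>()) {}
trt_decoder::~trt_decoder() = default;

std::vector<float> trt_decoder::decode(const std::vector<float> &syndrome) {
  return syndrome;                 // placeholder body for the sketch
}
```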

All tests passing (10/10) with numerical accuracy verified.

Signed-off-by: Scott Thornton <[email protected]>
Add PerformanceComparisonCudaGraphVsTraditional test that quantifies the
performance benefit of CUDA graph execution vs traditional TensorRT execution.

Test measures 200 iterations after warm-up and demonstrates:
- CUDA Graph: ~14.4 μs average
- Traditional: ~17.8 μs average
- Speedup: 1.24x (~19% latency reduction)

Includes assertions ensuring ≥5% speedup and convergence validation.
Provides empirical evidence that CUDA graphs deliver measurable performance
benefits without sacrificing accuracy.

All 11 tests passing.
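
For reference, a rough illustration of this measurement methodology, with `run_decode` as a hypothetical stand-in for a single decode() call (not the test's actual code):

```cpp
#include <chrono>

double mean_latency_us(void (*run_decode)(), int iters = 200) {
  for (int i = 0; i < 10; ++i) run_decode();            // warm-up
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) run_decode();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

// The test then asserts at least a 5% speedup:
//   assert(mean_latency_us(run_traditional) / mean_latency_us(run_graph) >= 1.05);
```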

Signed-off-by: Scott Thornton <[email protected]>
wsttiger and others added 6 commits November 25, 2025 17:26
Signed-off-by: Scott Thornton <[email protected]>
Replace lazy CUDA graph capture with eager capture during decoder construction.
This provides fail-fast behavior, predictable performance, and better error handling.

Key changes:
- Add try_capture_cuda_graph() helper function that captures graphs during
  initialization using dummy input data
- Refactor CudaGraphExecutor to accept pre-captured graph/exec handles via
  constructor instead of performing lazy capture on first inference
- Implement proper move semantics (move ctor/assignment) and delete copy
  operations to prevent double-free issues when stored in std::variant
  (see the sketch at the end of this message)
- Update trt_decoder constructor to attempt eager capture and gracefully
  fall back to TraditionalExecutor on failure with detailed error messages
- Remove 'captured' flag and conditional logic from execution hot path

Benefits:
- Immediate error detection at construction time (fail-fast)
- Consistent, predictable performance across all decode() calls
- No capture overhead on first inference
- Simplified executor code with cleaner separation of concerns
- Robust fallback mechanism with diagnostic error messages

Performance: Benchmark shows 1.26x speedup (20.9% improvement) with CUDA
graphs vs traditional execution.
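
A sketch of the move-only wrapper this commit describes; the class name matches the PR, but the body is an illustration of the idea, not the merged code:

```cpp
#include <cuda_runtime_api.h>
#include <utility>

class CudaGraphExecutor {
public:
  // Takes ownership of handles produced by an eager capture step.
  CudaGraphExecutor(cudaGraph_t g, cudaGraphExec_t e) : graph_(g), exec_(e) {}

  // Copies deleted: two owners of one handle would double-free in ~dtor.
  CudaGraphExecutor(const CudaGraphExecutor &) = delete;
  CudaGraphExecutor &operator=(const CudaGraphExecutor &) = delete;

  // Moves transfer ownership, leaving the source empty — safe inside
  // std::variant, which move-constructs on assignment.
  CudaGraphExecutor(CudaGraphExecutor &&o) noexcept
      : graph_(std::exchange(o.graph_, nullptr)),
        exec_(std::exchange(o.exec_, nullptr)) {}

  CudaGraphExecutor &operator=(CudaGraphExecutor &&o) noexcept {
    if (this != &o) {
      release();
      graph_ = std::exchange(o.graph_, nullptr);
      exec_ = std::exchange(o.exec_, nullptr);
    }
    return *this;
  }

  ~CudaGraphExecutor() { release(); }

  // Hot path: no capture flag, no branching — just replay.
  bool execute(cudaStream_t stream) {
    return cudaGraphLaunch(exec_, stream) == cudaSuccess;
  }

private:
  void release() {
    if (exec_) cudaGraphExecDestroy(exec_);
    if (graph_) cudaGraphDestroy(graph_);
    exec_ = nullptr;
    graph_ = nullptr;
  }
  cudaGraph_t graph_ = nullptr;
  cudaGraphExec_t exec_ = nullptr;
};
```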

Signed-off-by: Scott Thornton <[email protected]>
Add comprehensive test suite for CUDA graph functionality in TensorRT decoder:

- Add TensorRT Python API imports with availability check
- Add build_simple_mlp_engine() helper to create test engines with
  configurable dynamic shapes (for testing CUDA graph compatibility)
- Add test_performance_comparison_cuda_graph_vs_traditional() to benchmark
  performance improvements (validates >5% speedup from graph optimization)
- Add test_cuda_graph_vs_traditional_correctness() to verify identical
  outputs between CUDA graph and traditional execution paths

The tests validate both performance gains and correctness guarantees
of the CUDA graph execution path.

Signed-off-by: Scott Thornton <[email protected]>
Signed-off-by: Scott Thornton <[email protected]>
@wsttiger (Collaborator, Author) commented Dec 8, 2025

/ok to test 5ddefa0

@bmhowe23 (Collaborator) left a comment

Assuming the Build wheels job passes, these changes look good to me.
