Add trt cudagraph #369
Open
wsttiger wants to merge 12 commits into NVIDIA:main from wsttiger:add_trt_cudagraph
Conversation
Implement CUDA graph capture and replay for TensorRT inference operations to reduce kernel launch overhead and improve performance. Following NVIDIA TensorRT best practices, the decoder now:
- Captures inference operations into a CUDA graph on the first decode() call
- Executes the captured graph on all subsequent decode() calls
- Properly manages the CUDA graph lifecycle, with cleanup in the destructor
Changes:
- Add CUDA graph member variables (cuda_graph_captured_, cuda_graph_, cuda_graph_exec_)
- Modify decode() to capture the setTensorAddress() and enqueueV3() operations on the first invocation and replay the graph on subsequent calls (a sketch follows this commit message)
- Add CUDA graph cleanup to the destructor
Performance benefits:
- 10-20% reduction in inference latency from reduced launch overhead
- Better GPU utilization through optimized command submission
- Memory operations stay outside the graph, allowing per-call data updates
Tested with all unit tests passing (10/10), including actual GPU inference validation with numerical accuracy verified.
Ref: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
Signed-off-by: Scott Thornton <[email protected]>
Signed-off-by: Scott Thornton <[email protected]>
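A minimal sketch of the capture-then-replay flow this commit describes. The member names (cuda_graph_captured_, cuda_graph_, cuda_graph_exec_) follow the commit text; the surrounding struct, the "input"/"output" tensor names, and the fixed device buffers are illustrative assumptions, error checking is omitted, and the three-argument cudaGraphInstantiate assumes CUDA 12:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

struct GraphDecoder {  // hypothetical wrapper, not the PR's actual class
  nvinfer1::IExecutionContext* context = nullptr;  // created elsewhere
  cudaStream_t stream = nullptr;
  void* input_dev = nullptr;   // fixed device buffers the graph is bound to
  void* output_dev = nullptr;
  size_t in_bytes = 0, out_bytes = 0;
  bool cuda_graph_captured_ = false;   // member names per the commit text
  cudaGraph_t cuda_graph_ = nullptr;
  cudaGraphExec_t cuda_graph_exec_ = nullptr;

  void decode(const void* host_in, void* host_out) {
    // Data movement stays outside the graph: each call stages fresh input
    // into the same device buffer the graph was captured with.
    cudaMemcpyAsync(input_dev, host_in, in_bytes, cudaMemcpyHostToDevice, stream);

    if (!cuda_graph_captured_) {
      context->setTensorAddress("input", input_dev);   // hypothetical names
      context->setTensorAddress("output", output_dev);
      context->enqueueV3(stream);          // warm-up run outside capture
      cudaStreamSynchronize(stream);

      // First decode(): record the enqueue into a graph and instantiate it.
      cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
      context->enqueueV3(stream);
      cudaStreamEndCapture(stream, &cuda_graph_);
      cudaGraphInstantiate(&cuda_graph_exec_, cuda_graph_, 0);
      cuda_graph_captured_ = true;
    }

    // Every call (including the first) replays the captured work.
    cudaGraphLaunch(cuda_graph_exec_, stream);
    cudaMemcpyAsync(host_out, output_dev, out_bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
  }

  ~GraphDecoder() {  // cleanup added to the destructor, per the commit
    if (cuda_graph_exec_) cudaGraphExecDestroy(cuda_graph_exec_);
    if (cuda_graph_) cudaGraphDestroy(cuda_graph_);
  }
};
```

Keeping the memcpys outside the capture is what allows the per-call data updates the commit mentions: a replay reuses the pointers recorded at capture time, so new inputs must be staged into those same buffers.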
… detection
Refactor the TensorRT decoder implementation to use a PIMPL + variant pattern
with separate executor strategies, enabling better extensibility and automatic
handling of models with dynamic shapes.
Key changes:
1. Executor Abstraction:
- Introduce TraditionalExecutor for standard TensorRT execution
- Introduce CudaGraphExecutor for CUDA graph-optimized execution
- Use std::variant for zero-overhead executor selection
- Executors encapsulate their state and lifecycle management
2. Dynamic Shape Detection:
- Add supports_cuda_graphs() to detect model compatibility
- Automatically fall back to traditional execution for:
* Models with dynamic tensor dimensions
* Models with multiple optimization profiles
- Prevents runtime errors from incompatible CUDA graph usage (see the sketch after this commit message)
3. User Control:
- Support 'use_cuda_graph' parameter (default: true)
- Users can explicitly disable CUDA graphs if needed
- Clear logging of executor selection and reasoning
4. PIMPL Pattern:
- Hide all TensorRT implementation details in Impl struct
- Clean separation between interface and implementation
- Improved compilation times for users of the decoder
5. Code Quality:
- Rename 'initialized_' to 'decoder_ready_' for clarity
- Simplified decode() method with single execution path
- Better organized resource cleanup in Impl destructor
Architecture benefits:
- Extensible: Easy to add new executor types (e.g., batched, streamed)
- Type-safe: Compile-time guarantees via std::variant
- Zero overhead: no virtual-call cost compared with manual branching
- Maintainable: Clear separation of concerns
All tests passing (10/10) with numerical accuracy verified.
Signed-off-by: Scott Thornton <[email protected]>
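A sketch of how items 1 and 2 above could fit together. The engine queries (getNbOptimizationProfiles, getNbIOTensors, getIOTensorName, getTensorShape) are real TensorRT APIs; the executor types are stubs with a hypothetical execute() method:

```cpp
#include <NvInfer.h>
#include <variant>

// Executor types mirroring the commit's naming; bodies are stubs here.
struct TraditionalExecutor {
  void execute() { /* setTensorAddress + enqueueV3 on every call */ }
};
struct CudaGraphExecutor {
  void execute() { /* cudaGraphLaunch on the captured exec handle */ }
};

// Dynamic-shape detection as described in item 2: a model is compatible
// only if it has a single optimization profile and no wildcard dimensions.
bool supports_cuda_graphs(const nvinfer1::ICudaEngine& engine) {
  if (engine.getNbOptimizationProfiles() > 1) return false;
  for (int i = 0; i < engine.getNbIOTensors(); ++i) {
    nvinfer1::Dims dims = engine.getTensorShape(engine.getIOTensorName(i));
    for (int d = 0; d < dims.nbDims; ++d)
      if (dims.d[d] == -1) return false;  // dynamic dimension
  }
  return true;
}

// The variant holds exactly one executor; std::visit dispatches without
// virtual functions, matching the commit's "zero-overhead selection" claim.
using Executor = std::variant<TraditionalExecutor, CudaGraphExecutor>;

void run(Executor& executor) {
  std::visit([](auto& e) { e.execute(); }, executor);
}
```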
Add a PerformanceComparisonCudaGraphVsTraditional test that quantifies the performance benefit of CUDA graph execution over traditional TensorRT execution. The test measures 200 iterations after warm-up (a timing-loop sketch follows) and demonstrates:
- CUDA graph: ~14.4 μs average
- Traditional: ~17.8 μs average
- Speedup: 1.24x (24% higher throughput; ~20% lower latency)
It includes assertions ensuring a ≥5% speedup, plus convergence validation, providing empirical evidence that CUDA graphs deliver measurable performance benefits without sacrificing accuracy. All 11 tests passing.
Signed-off-by: Scott Thornton <[email protected]>
Signed-off-by: Scott Thornton <[email protected]>
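A sketch of the measurement loop the test description implies. The warm-up count of 20 and the callable names are assumptions; the commit specifies only 200 timed iterations after warm-up:

```cpp
#include <chrono>

// Generic timing helper; DecodeFn stands in for either execution path.
template <typename DecodeFn>
double avg_latency_us(DecodeFn&& decode, int warmup = 20, int iters = 200) {
  for (int i = 0; i < warmup; ++i) decode();  // exclude one-time costs
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) decode();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

// Usage mirroring the test's assertion of at least a 5% speedup:
//   double graph_us = avg_latency_us(graph_decode);
//   double trad_us  = avg_latency_us(traditional_decode);
//   assert(trad_us / graph_us >= 1.05);
```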
Replace lazy CUDA graph capture with eager capture during decoder construction. This provides fail-fast behavior, predictable performance, and better error handling.
Key changes:
- Add a try_capture_cuda_graph() helper that captures graphs during initialization using dummy input data
- Refactor CudaGraphExecutor to accept pre-captured graph/exec handles via its constructor instead of performing lazy capture on first inference
- Implement proper move semantics (move constructor/assignment) and delete copy operations to prevent double-free issues when stored in std::variant (sketched below)
- Update the trt_decoder constructor to attempt eager capture and gracefully fall back to TraditionalExecutor on failure, with detailed error messages
- Remove the 'captured' flag and conditional logic from the execution hot path
Benefits:
- Immediate error detection at construction time (fail-fast)
- Consistent, predictable performance across all decode() calls
- No capture overhead on the first inference
- Simplified executor code with cleaner separation of concerns
- Robust fallback mechanism with diagnostic error messages
Performance: benchmarking shows a 1.26x speedup (20.9% improvement) with CUDA graphs vs traditional execution.
Signed-off-by: Scott Thornton <[email protected]>
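The move-semantics point is the subtle one: a variant alternative that owns raw CUDA handles must not be copyable, or reassignment of the variant could destroy the same handle twice. A minimal sketch under that constraint, not the PR's actual class:

```cpp
#include <cuda_runtime_api.h>
#include <utility>

// Move-only owner of pre-captured graph handles, as the commit describes.
class CudaGraphExecutor {
public:
  CudaGraphExecutor(cudaGraph_t graph, cudaGraphExec_t exec)
      : graph_(graph), exec_(exec) {}

  // Copying deleted: two copies would both try to destroy the same handles.
  CudaGraphExecutor(const CudaGraphExecutor&) = delete;
  CudaGraphExecutor& operator=(const CudaGraphExecutor&) = delete;

  // Moves transfer ownership and leave the source empty.
  CudaGraphExecutor(CudaGraphExecutor&& other) noexcept
      : graph_(std::exchange(other.graph_, nullptr)),
        exec_(std::exchange(other.exec_, nullptr)) {}

  CudaGraphExecutor& operator=(CudaGraphExecutor&& other) noexcept {
    if (this != &other) {
      release();
      graph_ = std::exchange(other.graph_, nullptr);
      exec_ = std::exchange(other.exec_, nullptr);
    }
    return *this;
  }

  ~CudaGraphExecutor() { release(); }

  // Hot path: no 'captured' flag, just replay the instantiated graph.
  void execute(cudaStream_t stream) { cudaGraphLaunch(exec_, stream); }

private:
  void release() {
    if (exec_) cudaGraphExecDestroy(exec_);
    if (graph_) cudaGraphDestroy(graph_);
    exec_ = nullptr;
    graph_ = nullptr;
  }
  cudaGraph_t graph_ = nullptr;
  cudaGraphExec_t exec_ = nullptr;
};
```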
Add a comprehensive test suite for CUDA graph functionality in the TensorRT decoder:
- Add TensorRT Python API imports with an availability check
- Add a build_simple_mlp_engine() helper to create test engines with configurable dynamic shapes, for testing CUDA graph compatibility (a C++ equivalent of the dynamic-shape setup is sketched below)
- Add test_performance_comparison_cuda_graph_vs_traditional() to benchmark performance improvements (validates >5% speedup from graph optimization)
- Add test_cuda_graph_vs_traditional_correctness() to verify identical outputs between the CUDA graph and traditional execution paths
The tests validate both the performance gains and the correctness guarantees of the CUDA graph execution path.
Signed-off-by: Scott Thornton <[email protected]>
Signed-off-by: Scott Thornton <[email protected]>
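The commit's build_simple_mlp_engine() helper uses the TensorRT Python API; for consistency with the other sketches, here is the C++ equivalent of giving an engine a dynamic batch dimension, which is exactly the kind of engine supports_cuda_graphs() would route to TraditionalExecutor. The function name, input name, and shape ranges are illustrative:

```cpp
#include <NvInfer.h>

// Attach an optimization profile with a variable batch dimension; the
// corresponding network input would be declared with -1 in that dimension,
// producing the dynamic-shape engine the compatibility check rejects.
void add_dynamic_batch_profile(nvinfer1::IBuilder& builder,
                               nvinfer1::IBuilderConfig& config,
                               const char* input_name, int feature_dim) {
  auto* profile = builder.createOptimizationProfile();
  profile->setDimensions(input_name, nvinfer1::OptProfileSelector::kMIN,
                         nvinfer1::Dims2{1, feature_dim});
  profile->setDimensions(input_name, nvinfer1::OptProfileSelector::kOPT,
                         nvinfer1::Dims2{8, feature_dim});
  profile->setDimensions(input_name, nvinfer1::OptProfileSelector::kMAX,
                         nvinfer1::Dims2{32, feature_dim});
  config.addOptimizationProfile(profile);
}
```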
wsttiger (Collaborator, Author) commented:
/ok to test 5ddefa0
bmhowe23 (Collaborator) reviewed on Dec 9, 2025 and left a comment:
Assuming the Build wheels job passes, these changes look good to me.
Add CUDA Graph Optimization to TensorRT Decoder
Overview
This PR implements CUDA graph optimization for the TensorRT decoder, providing a measurable 20% performance improvement while maintaining full numerical accuracy. The implementation uses a clean executor abstraction pattern that automatically handles models with dynamic shapes.
Key Features
1. CUDA Graph Optimization
2. Executor Abstraction (PIMPL + Variant Pattern)
- std::variant dispatch (no virtual calls)
3. Dynamic Shape Detection
4. User Control
- use_cuda_graph (default: true)
Performance Results
Benchmark Test (200 iterations):