Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
The NVIDIA Programmatic Dependent Launch (PDL) mechanism, available on Hopper, Blackwell, and future GPUs, allows a dependent secondary kernel to launch before the primary kernel it depends on, in the same CUDA stream, has finished executing.
Enabling PDL brings benefits even without attempting to overlap any kernel logic, since it allows, for example, the scheduling of blocks in the secondary kernel to start before the primary kernel has completely finished. Further benefits may be available from more in-depth refactoring to enhance overlap.
#15480 introduces PDL to llama.cpp via:
1. Adjusting each freshly captured CUDA graph to set the appropriate PDL launch properties on each node/edge.
2. Systematically adding cudaGridDependencySynchronize() at the entry point and cudaTriggerProgrammaticLaunchCompletion() at the exit point of all kernels, to guarantee appropriate synchronization.
For item 2, these sync points are currently placed as conservatively as possible, with no attempt to overlap any kernel logic, so this change doesn't alter the ordering of (application-level) instructions in any way. It does open up possibilities for further optimisation through refactoring, e.g. weights could be loaded before cudaGridDependencySynchronize() since they don't depend on any previous kernel. Note that cudaTriggerProgrammaticLaunchCompletion() is not strictly required in the current implementation, since the equivalent instruction is implicitly added at the end of all kernels anyway. I've included it to open up scope for overlapping work at the end of kernels whose result the subsequent kernel does not depend on.
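To make the pattern concrete, here is a minimal standalone sketch using plain stream launches rather than CUDA graphs. It is not code from #15480: the kernel names, the `pdl_*` helpers, and `run_pdl_demo` are hypothetical, and the compute-capability guards are an assumption about how one might keep the code building for older architectures. The first kernel uses the conservative placement described above; the second shows the kind of refactoring that becomes possible, loading data that does not depend on the previous kernel before the sync point.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helpers (not from the PR): compiled to no-ops below sm_90.
__device__ __forceinline__ void pdl_wait_for_primary() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaGridDependencySynchronize();            // wait for the primary grid's results
#endif
}
__device__ __forceinline__ void pdl_release_secondary() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaTriggerProgrammaticLaunchCompletion();  // allow the dependent grid to launch
#endif
}

// Primary kernel, conservative placement: sync first, trigger last.
__global__ void scale_kernel(float * x, float s, int n) {
    pdl_wait_for_primary();
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
    pdl_release_secondary();
}

// Secondary kernel: `bias` is not written by the primary kernel, so it can be
// loaded before the sync point; `x` is read only after it.
__global__ void add_kernel(const float * x, const float * bias, float * y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float b = (i < n) ? bias[i] : 0.0f;   // independent work
    pdl_wait_for_primary();
    if (i < n) y[i] = x[i] + b;                 // dependent work
    pdl_release_secondary();
}

bool run_pdl_demo() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hx(n, 1.0f), hb(n, 0.5f), hy(n, 0.0f);
    float *x = nullptr, *b = nullptr, *y = nullptr;
    if (cudaMalloc(&x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&b, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&y, bytes) != cudaSuccess) return false;
    cudaMemcpy(x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), bytes, cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int dev = 0;
    cudaGetDevice(&dev);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // Mark the secondary launch as programmatically dependent on the
    // preceding kernel in the same stream (only requested on sm_90+).
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = dim3((n + 255) / 256);
    cfg.blockDim         = dim3(256);
    cfg.dynamicSmemBytes = 0;
    cfg.stream           = stream;
    cfg.attrs            = attr;
    cfg.numAttrs         = prop.major >= 9 ? 1 : 0;

    scale_kernel<<<cfg.gridDim, cfg.blockDim, 0, stream>>>(x, 2.0f, n);
    cudaError_t err = cudaLaunchKernelEx(&cfg, add_kernel,
                                         (const float *) x, (const float *) b, y, n);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    cudaMemcpy(hy.data(), y, bytes, cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (hy[i] == 2.5f);  // 1.0 * 2.0 + 0.5

    cudaStreamDestroy(stream);
    cudaFree(x); cudaFree(b); cudaFree(y);
    return ok;
}

int main() {
    const bool ok = run_pdl_demo();
    printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Because `cudaGridDependencySynchronize()` only returns once the primary grid has completed and its memory writes are visible, the result is the same with or without the launch attribute; PDL only changes how early the secondary grid can be scheduled. Building with something like `nvcc -arch=sm_90` is assumed in order to actually exercise the PDL path.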
Motivation
Here are the benefits I observe from #15480:
As above, this may open up possibilities for further gains through refactoring kernels to achieve more overlapping.
Note that the PR currently disables PDL in the presence of any library (e.g. BLAS) kernels, since further work is required to ensure appropriate synchronization for these (see comments in code).