NVIDIA Programmatic Dependent Launch for Llama.cpp #15479

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

The NVIDIA Programmatic Dependent Launch (PDL) mechanism, available on Hopper, Blackwell, and future GPUs, allows a dependent secondary kernel to launch before the primary kernel it depends on, in the same CUDA stream, has finished executing.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization

Enabling PDL can help even without attempting to overlap any kernel logic, since it allows, for example, the blocks of the secondary kernel to be scheduled before the primary kernel has completely finished. Further benefits may be available from more in-depth refactoring to increase overlap.
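To illustrate the mechanism outside of llama.cpp, here is a minimal stream-based sketch. It is not taken from the PR: the kernel names `produce` and `consume` and the helper `run_pdl_demo` are hypothetical, and it assumes CUDA 12 or newer compiled for `sm_90` or later. On older GPUs it falls back to an ordinary launch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Primary kernel (hypothetical): writes x[i] = i.
__global__ void produce(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = static_cast<float>(i);
}

// Secondary kernel (hypothetical): with PDL it may be scheduled before
// `produce` has finished, so it must wait before reading x.
__global__ void consume(const float *x, float *y, int n) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaGridDependencySynchronize();  // wait for the primary kernel's results
#endif
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}

// Returns true if both kernels ran and produced y[i] == 2 * i.
bool run_pdl_demo() {
    const int n = 1 << 16;
    int dev = 0;
    cudaDeviceProp prop{};
    if (cudaGetDevice(&dev) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;

    float *x = nullptr, *y = nullptr;
    cudaStream_t stream;
    if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    const dim3 block(256), grid((n + 255) / 256);
    produce<<<grid, block, 0, stream>>>(x, n);

    // Mark the secondary launch as programmatically dependent on the
    // preceding kernel in the same stream (only where the GPU supports it).
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    const bool use_pdl = prop.major >= 9;
    cudaLaunchConfig_t cfg{};
    cfg.gridDim = grid;
    cfg.blockDim = block;
    cfg.dynamicSmemBytes = 0;
    cfg.stream = stream;
    cfg.attrs = use_pdl ? &attr : nullptr;
    cfg.numAttrs = use_pdl ? 1 : 0;

    const float *cx = x;
    bool ok = cudaLaunchKernelEx(&cfg, consume, cx, y, n) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;

    std::vector<float> host(n);
    ok = ok && cudaMemcpy(host.data(), y, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 2.0f * i);

    cudaStreamDestroy(stream);
    cudaFree(x);
    cudaFree(y);
    return ok;
}
```

Calling `run_pdl_demo()` from a `main` built with something like `nvcc -arch=sm_90` should return `true` with or without PDL, since the attribute only changes when the secondary kernel may start, not what it computes.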

https://github.com/ggml-org/llama.cpp/pull/15480 introduces PDL to llama.cpp via:
1. Adjusting each freshly captured CUDA graph to set the appropriate PDL launch properties on each node/edge.
2. Systematically adding cudaGridDependencySynchronize()/cudaTriggerProgrammaticLaunchCompletion() to the entry/exit points of all kernels, to guarantee appropriate synchronization.

For 2., these sync points are currently placed as conservatively as possible, with no attempt to overlap any kernel logic, so this change doesn't alter the ordering of (application-level) instructions in any way. It does, however, open up possibilities for further optimisation through refactoring: for example, weights could be loaded before cudaGridDependencySynchronize(), since they don't depend on any previous kernel. Note that cudaTriggerProgrammaticLaunchCompletion() is not strictly required in the current implementation, since the appropriate instruction is added at the end of all kernels anyway. I've included it to leave scope for overlapping work at the end of a kernel whose result the subsequent kernel does not depend on.
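The difference between the conservative placement and the kind of refactoring described above can be sketched as follows. This is an illustration under assumptions, not code from the PR: the wrappers `pdl_wait`/`pdl_trigger`, the kernels, and `run_overlap_sketch` are all hypothetical names, and the example assumes CUDA 12 or newer with an ordinary-launch fallback on GPUs older than `sm_90`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical wrappers: the PDL device calls need compute capability 9.0+.
__device__ __forceinline__ void pdl_wait() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaGridDependencySynchronize();
#endif
}
__device__ __forceinline__ void pdl_trigger() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Previous kernel in the stream: produces the activations x.
__global__ void produce_activations(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 0.5f * i;
    pdl_trigger();  // conservative: signalled only after all work is done
}

// Conservative placement: wait first, trigger last, nothing overlaps.
__global__ void scale_conservative(const float *w, const float *x, float *y, int n) {
    pdl_wait();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = w[i] * x[i];
    pdl_trigger();
}

// Possible refactor: w is not written by any previous kernel, so it can be
// loaded before waiting; only the read of x has to come after pdl_wait().
__global__ void scale_overlapped(const float *w, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float wi = (i < n) ? w[i] : 0.0f;
    pdl_wait();
    if (i < n) y[i] = wi * x[i];
    pdl_trigger();
}

// Launches producer -> conservative -> overlapped and checks y == w * x.
bool run_overlap_sketch() {
    const int n = 1 << 12;
    const size_t bytes = n * sizeof(float);
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    float *w, *x, *y1, *y2;
    cudaStream_t s;
    if (cudaMalloc(&w, bytes) != cudaSuccess || cudaMalloc(&x, bytes) != cudaSuccess ||
        cudaMalloc(&y1, bytes) != cudaSuccess || cudaMalloc(&y2, bytes) != cudaSuccess ||
        cudaStreamCreate(&s) != cudaSuccess) return false;

    std::vector<float> host(n, 2.0f);  // weights: every w[i] == 2
    cudaMemcpy(w, host.data(), bytes, cudaMemcpyHostToDevice);

    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    const bool use_pdl = prop.major >= 9;
    cudaLaunchConfig_t cfg{};
    cfg.gridDim = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream = s;
    cfg.attrs = use_pdl ? &attr : nullptr;
    cfg.numAttrs = use_pdl ? 1 : 0;

    produce_activations<<<cfg.gridDim, cfg.blockDim, 0, s>>>(x, n);
    const float *cw = w, *cx = x;
    bool ok = cudaLaunchKernelEx(&cfg, scale_conservative, cw, cx, y1, n) == cudaSuccess;
    ok = ok && cudaLaunchKernelEx(&cfg, scale_overlapped, cw, cx, y2, n) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(s) == cudaSuccess;

    std::vector<float> h1(n), h2(n);
    ok = ok && cudaMemcpy(h1.data(), y1, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(h2.data(), y2, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h1[i] == float(i)) && (h2[i] == float(i));

    cudaStreamDestroy(s);
    cudaFree(w); cudaFree(x); cudaFree(y1); cudaFree(y2);
    return ok;
}
```

Both consumer kernels compute the same result; the overlapped variant only moves the weight load ahead of the wait, which is safe here because nothing earlier in the stream writes `w`.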

### Motivation

Here are the benefits I observe from https://github.com/ggml-org/llama.cpp/pull/15480:
<img width="846" height="568" alt="Image" src="https://github.com/user-attachments/assets/1bb7fa2f-025b-46ea-bb31-a4cadae74f89" />

As above, this may open up possibilities for further gains through refactoring kernels to achieve more overlapping.
Note that the PR currently disables PDL in the presence of any library (e.g. BLAS) kernels, since further work is required to ensure appropriate synchronization for these (see comments in code).


