NVIDIA Programmatic Dependent Launch for Llama.cpp #15479

@agray3

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

The NVIDIA Programmatic Dependent Launch (PDL) mechanism, available on Hopper, Blackwell, and future GPUs, allows a dependent secondary kernel to launch before the primary kernel it depends on, in the same CUDA stream, has finished executing.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization

Benefits can be achieved through enabling PDL, even without attempting to overlap any kernel logic, since it allows, for example, the blocks of the secondary kernel to start being scheduled before the primary kernel has completely finished. Further benefits may be available from more in-depth refactoring to enhance overlap.

#15480 introduces PDL to llama.cpp via:

  1. Adjusting each freshly captured CUDA graph to set the appropriate PDL launch properties on each node/edge.
  2. Systematically adding cudaGridDependencySynchronize()/cudaTriggerProgrammaticLaunchCompletion() to the entry/exit points of all kernels, to guarantee appropriate synchronization.
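As a rough illustration of the kernel-side pattern in step 2, here is a minimal sketch that uses a plain stream launch rather than the CUDA graph path the PR takes. The kernels `primary_kernel` and `secondary_kernel`, the `CUDA_CHECK` macro, and the helper `run_pdl_example` are hypothetical names invented for this example and are not taken from #15480. The `__CUDA_ARCH__` guards are an assumption so that the file still compiles for pre-Hopper targets, where the device-side calls are unavailable.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking helper for this sketch.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical primary kernel: doubles x in place.
__global__ void primary_kernel(float * x, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = 2.0f * x[i];
    }
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Exit point: signal that dependent grids may begin launching.
    cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Hypothetical secondary kernel: reads the primary kernel's output.
__global__ void secondary_kernel(const float * x, float * y, int n) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Entry point: wait for the primary grid to complete and its writes
    // to become visible before touching any data.
    cudaGridDependencySynchronize();
#endif
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = x[i] + 1.0f;
    }
}

// Runs both kernels back to back in one stream and returns y[0].
float run_pdl_example(int n) {
    std::vector<float> host(n, 1.0f);
    float * x = nullptr;
    float * y = nullptr;
    CUDA_CHECK(cudaMalloc(&x, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&y, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(x, host.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    const dim3 block(256);
    const dim3 grid((n + 255) / 256);

    primary_kernel<<<grid, block, 0, stream>>>(x, n);
    CUDA_CHECK(cudaGetLastError());

    // Opt the secondary launch into programmatic stream serialization.
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = 0;
    cfg.stream           = stream;
    cfg.attrs            = attr;
    cfg.numAttrs         = 1;

    CUDA_CHECK(cudaLaunchKernelEx(&cfg, secondary_kernel, x, y, n));
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaMemcpy(host.data(), y, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(x));
    CUDA_CHECK(cudaFree(y));
    return host[0];
}
```

Calling `run_pdl_example(1 << 20)` from a `main` and printing the result is enough to exercise it. Because the wait is the first statement of the secondary kernel, the order of application-level work is unchanged. Any gain would come only from the secondary grid being scheduled earlier.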

For 2., these sync points are currently placed as conservatively as possible, with no attempt to overlap any kernel logic, so this change doesn't alter the ordering of (application-level) instructions in any way. It does, however, open up possibilities for further optimisation through refactoring, e.g. weights could be loaded before cudaGridDependencySynchronize(), since they don't depend on any previous kernel. Note that cudaTriggerProgrammaticLaunchCompletion() is not strictly required in the current implementation, since the appropriate instruction is added at the end of all kernels anyway. I've included it to leave scope for further overlap at the end of kernels, for work whose result the subsequent kernel does not depend on.
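The weight-loading idea could look roughly like the sketch below. It is not code from the PR: `produce_activations`, `scale_by_weights`, `CUDA_CHECK`, and `run_prefetch_example` are hypothetical names, and the element-wise kernel is a deliberately trivial stand-in for a real llama.cpp kernel. It assumes the weights `w` are not written by any preceding kernel in the stream, which is the condition that makes loading them before the wait safe.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking helper for this sketch.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical producer: writes the activations used by the next kernel.
__global__ void produce_activations(float * x, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = 2.0f * x[i];
    }
}

// Hypothetical consumer: y = w * x, with the weight load hoisted above the wait.
__global__ void scale_by_weights(const float * w, const float * x, float * y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Assumption: no preceding kernel writes w, so this load does not
    // need to wait for the producer grid.
    const float wi = (i < n) ? w[i] : 0.0f;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Everything below reads the producer's output, so wait here.
    cudaGridDependencySynchronize();
#endif
    if (i < n) {
        y[i] = wi * x[i];
    }
}

// Runs producer and consumer in one stream and returns y[0].
float run_prefetch_example(int n) {
    std::vector<float> host_x(n, 1.0f);
    std::vector<float> host_w(n, 0.5f);
    float * w = nullptr;
    float * x = nullptr;
    float * y = nullptr;
    CUDA_CHECK(cudaMalloc(&w, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&x, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&y, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(w, host_w.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(x, host_x.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    const dim3 block(256);
    const dim3 grid((n + 255) / 256);

    produce_activations<<<grid, block, 0, stream>>>(x, n);
    CUDA_CHECK(cudaGetLastError());

    // Allow the consumer to launch before the producer has fully finished.
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = grid;
    cfg.blockDim = block;
    cfg.stream   = stream;
    cfg.attrs    = attr;
    cfg.numAttrs = 1;

    CUDA_CHECK(cudaLaunchKernelEx(&cfg, scale_by_weights, w, x, y, n));
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaMemcpy(host_x.data(), y, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(w));
    CUDA_CHECK(cudaFree(x));
    CUDA_CHECK(cudaFree(y));
    return host_x[0];
}
```

The only change relative to the conservative placement is that the read of `w` sits above `cudaGridDependencySynchronize()`. Whether this yields a measurable gain for a given kernel would need to be profiled.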

Motivation

Here are the benefits I observe from #15480:
[Image: observed benefits from #15480]

As above, this may open up possibilities for further gains through refactoring kernels to achieve more overlapping.
Note that the PR currently disables PDL in the presence of any library (e.g. BLAS) kernels, since further work is required to ensure appropriate synchronization for these (see comments in code).
