
Statically link to CUDA Runtime #517

Merged: 15 commits merged into NVIDIA:main on Mar 25, 2025

Conversation

vzhurba01 (Collaborator)

Closes #100 (RFC: Statically link to cudart).

This version of statically linking to the CUDA Runtime has no user-facing breaking runtime changes and can therefore be merged early to let it soak before our next release. This change leaves the graphics APIs as-is, because those types are redefined and would cause a type conflict if we were to extern the definitions as described in #488.

This change also couples in a fix for callback functions: each callback API should acquire the GIL when processing the callback, but the driver APIs were missing this.
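
For readers unfamiliar with the change, here is a hypothetical sketch of what switching a Cython extension from the shared to the static CUDA Runtime looks like at build time. This is an illustrative setup.py fragment, not the actual build code; the module/source paths and the CUDA_HOME lookup are assumptions (on Linux the static runtime additionally needs rt, pthread, and dl):

import os
from setuptools import Extension

CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")

ext = Extension(
    "cuda.bindings.cyruntime",
    sources=["cuda/bindings/cyruntime.pyx"],
    include_dirs=[os.path.join(CUDA_HOME, "include")],
    # Link libcudart_static.a rather than the shared libcudart.so, so the
    # built extension no longer depends on a runtime libcudart at load time.
    libraries=["cudart_static", "rt", "pthread", "dl"],
    library_dirs=[os.path.join(CUDA_HOME, "lib64")],
)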

@vzhurba01 vzhurba01 added P0 High priority - Must do! RFC Plans and announcements cuda.bindings Everything related to the cuda.bindings module labels Mar 13, 2025
@vzhurba01 vzhurba01 self-assigned this Mar 13, 2025

copy-pr-bot bot commented Mar 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vzhurba01 (Collaborator, Author)

/ok to test

2 similar comments

Comment on lines 1670 to 1671
with gil:
cbData.callback(cbData.userData)
Collaborator

Why do we need the GIL here? Given that these are low-level bindings, it's a bit surprising to me that we'd need to acquire the GIL, since a user could relatively easily create a callback function that doesn't require it.

@vzhurba01 (Collaborator, Author) Mar 14, 2025

We want to support Python callables, and this is a shared path between Python users and Cython users (since the Python binding layer calls into the Cython binding layer).

I found a thread all the way back from 2021 with @shwina's investigation into the consequences of not acquiring the GIL. My rough understanding is that the main thread can hold the GIL while the callback thread blocks trying to acquire it, causing a deadlock. By acquiring the GIL for the callback, the interpreter should regularly switch between the two threads and avoid the deadlock.
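
To illustrate the pattern under discussion, a minimal sketch with hypothetical names (not the actual binding code): a nogil C callback that re-acquires the GIL only for the duration of the Python call.

cdef struct cbData_st:
    void* callback   # reference to the Python callable, packaged by the launcher
    void* userData   # reference to its argument

cdef void hostCallbackWrapper(void* data) noexcept nogil:
    # Runs on a driver-owned thread that never holds the GIL.
    cdef cbData_st* cb = <cbData_st*>data
    # Acquire the GIL only while invoking the Python callable, so the
    # interpreter keeps switching between threads instead of deadlocking.
    with gil:
        (<object>cb.callback)(<object>cb.userData)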

Member

Based on my current understanding, we already do not handle the callback lifetime or GIL release/acquire in either the Python or Cython layer, so I suggest we move the callback-related changes to a separate PR to make them easier to review. (The static-linking part seems almost ready to merge to me.)

Another thought is that we should handle this in the Python layer and keep the Cython layer (this file) as lean/thin as it is today. I can imagine wanting to define a pure C callback in Cython that does not require the GIL and pass it to the Cython binding; then we should not need the GIL-holding wrapper at all.
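
For example, a sketch of that pure-C use case (hypothetical callback; _cuLaunchHostFunc as used elsewhere in this PR):

cdef int counter = 0

cdef void my_c_callback(void* userData) noexcept nogil:
    # Touches only C data, so it can run on the driver's callback
    # thread without ever acquiring the GIL.
    (<int*>userData)[0] += 1

# Passed straight through the thin Cython binding, e.g.:
# err = cydriver._cuLaunchHostFunc(stream, <CUhostFn>my_c_callback, <void*>&counter)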

Collaborator Author

> we already do not handle the callback lifetime or GIL release/acquire in either Python or Cython layer

We already handle it in the Cython layer for the Runtime: https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/cuda/bindings/_lib/cyruntime/utils.pyx.in#L885-L915

I can keep the Runtime changes to avoid regressing those, but remove the Driver ones.

@leofang (Member) Mar 24, 2025

Ah! I see, I overlooked the streamAddCallbackCommon function when checking the cudart code yesterday, thanks Vlad!

Sounds like a good idea to

  • move the lifetime-related changes to another PR
  • ensure both driver/runtime Python layers have the lifetime management for callbacks
  • ensure both driver/runtime Cython layers have no lifetime management for callbacks (as if the CUDA C APIs are called from Cython, which was the intent AFAIK)

Collaborator Author

Created #531 to track this.

@vzhurba01 vzhurba01 force-pushed the 100-static-runtime branch from 154644c to 3024f18 on March 20, 2025 at 22:09
@vzhurba01 (Collaborator, Author)

/ok to test

4 similar comments

@leofang leofang added the enhancement Any code-related improvements label Mar 23, 2025
Comment on lines 2 to +5
# at least with setuptools 75.0.0 this folder was added erroneously
# to the payload, causing file copying to the build environment failed
exclude cuda/bindings cuda?bindings
exclude cuda/bindings/_bindings cuda?bindings?_bindings
Member

Note: I guess after #493 is merged this bit can be removed.


ctypedef cuHostCallbackData_st cuHostCallbackData

@cython.show_performance_hints(False)
cdef void cuHostCallbackWrapper(void *data) nogil:
Member

Once we move this wrapper to the Python layer, we can declare it with gil in the function signature; then we don't need the with gil block in the function body. The GIL is acquired at function call time, and Cython has a bit more information for its GIL analysis:
https://cython.readthedocs.io/en/latest/src/userguide/nogil.html#releasing-and-reacquiring-the-gil
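
Roughly the difference, as an illustrative sketch (hypothetical wrapper names, not the PR's code):

# GIL acquired inside the body (as in this PR):
cdef void wrapper_block(void* data) noexcept nogil:
    with gil:
        (<object>data)()

# GIL acquired in the signature: taken at call time, and the whole
# function body is covered by Cython's GIL analysis:
cdef void wrapper_signature(void* data) noexcept with gil:
    (<object>data)()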

cbData.userData = userData
err = cydriver._cuLaunchHostFunc(hStream, <CUhostFn>cuHostCallbackWrapper, <void *>cbData)
if err != CUDA_SUCCESS:
    free(cbData)
Member

I am a bit worried about a double free -- do we know for sure that when the error code is not CUDA_SUCCESS, the callback is guaranteed not to have executed (otherwise the buffer would already have been freed inside the callback)? Could it be possible that the callback executes and we then get an error code here?

Collaborator Author

I see that the docs mainly talk about what happens when there's a context failure before the callback is reached (i.e. "Note that, in contrast to cuStreamAddCallback, the function will not be called in the event of an error in the CUDA context.").

Since that doesn't answer the question directly, I took a look at the driver source. While it does appear to do all the needed checks before the callback gets enqueued, I'm not confident enough in my understanding of the internal structures to be certain. What I can be certain about is that there are plenty of error checks before the enqueue is even considered: checks involving the user args, the context, GPU state, and memory allocations. So when an error does get returned, my impression from reading through the source is that it is much more likely to have occurred before the enqueue happened.

Overall, if the API fails but the callback still executes as if the call had succeeded, then I think that should be treated as a driver bug.
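
In other words, the ownership convention being relied on is roughly the following (a sketch, assuming, as the snippet above implies, that the wrapper frees the buffer after invoking the user callback):

cbData = <cuHostCallbackData*>malloc(sizeof(cuHostCallbackData))
cbData.callback = callback
cbData.userData = userData
err = cydriver._cuLaunchHostFunc(hStream, <CUhostFn>cuHostCallbackWrapper, <void*>cbData)
if err != CUDA_SUCCESS:
    # The launch was rejected before anything was enqueued, so the
    # callback never runs and never frees cbData -- free it here.
    free(cbData)
# On success, ownership transfers to the wrapper, which frees cbData
# after the user callback returns. "Failed but still enqueued" would
# double-free, which is why that case would be a driver bug.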

@vzhurba01 (Collaborator, Author)

/ok to test

1 similar comment

@leofang (Member) commented Mar 24, 2025

/ok to test

@leofang (Member) commented Mar 25, 2025

Thanks, Vlad! Since the CI is green (after a few retries) and all major comments are resolved, let's merge and address any potential issues in a separate PR.

@leofang leofang merged commit 1c6f3bc into NVIDIA:main Mar 25, 2025
74 checks passed
