
Statically link to CUDA Runtime #517

Merged: 15 commits merged into NVIDIA:main on Mar 25, 2025

Conversation

vzhurba01 (Collaborator)

Closes #100 (RFC: Statically link to cudart).

This version of statically linking to the CUDA Runtime has no user-facing breaking runtime changes and can therefore be merged early to let it soak before our next release. This change leaves the graphics APIs as-is, because those types are redefined and would cause a type conflict if we were to extern the definitions as described in #488.

This change also couples in a fix for callback functions: each callback API should acquire the GIL when processing the callback, but the driver APIs were missing this.
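
For readers unfamiliar with the change, here is a hypothetical sketch of what switching a Cython extension from the shared to the static CUDA Runtime looks like at build time. This is an illustrative setup.py fragment, not the actual build code; the module/source paths and the CUDA_HOME lookup are assumptions (on Linux the static runtime additionally needs rt, pthread, and dl):

import os
from setuptools import Extension

CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")

ext = Extension(
    "cuda.bindings.cyruntime",
    sources=["cuda/bindings/cyruntime.pyx"],
    include_dirs=[os.path.join(CUDA_HOME, "include")],
    # Link libcudart_static.a rather than the shared libcudart.so, so the
    # built extension no longer depends on a runtime libcudart at load time.
    libraries=["cudart_static", "rt", "pthread", "dl"],
    library_dirs=[os.path.join(CUDA_HOME, "lib64")],
)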

@vzhurba01 vzhurba01 added P0 High priority - Must do! RFC Plans and announcements cuda.bindings Everything related to the cuda.bindings module labels Mar 13, 2025
@vzhurba01 vzhurba01 self-assigned this Mar 13, 2025

copy-pr-bot bot commented Mar 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vzhurba01 (Collaborator, Author)

/ok to test

2 similar comments

Comment on lines 1670 to 1671
with gil:
cbData.callback(cbData.userData)
Collaborator

Why do we need the GIL here? Given that these are low-level bindings, it's a bit surprising to me that we'd need to acquire the GIL, since a user could relatively easily create a callback function that doesn't require it.

@vzhurba01 (Collaborator, Author) Mar 14, 2025

We want to support Python callables, and this is a shared path between Python users and Cython users (since the Python binding layer calls into the Cython binding layer).

I found a thread all the way back from 2021 with @shwina's investigation into the consequences of not acquiring the GIL. My rough understanding is that the main thread can hold the GIL while the callback thread blocks trying to acquire it, causing a deadlock. By acquiring the GIL for the callback, the interpreter should regularly switch between the two threads and avoid the deadlock.
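
To illustrate the pattern under discussion, a minimal sketch with hypothetical names (not the actual binding code): a nogil C callback that re-acquires the GIL only for the duration of the Python call.

cdef struct cbData_st:
    void* callback   # reference to the Python callable, packaged by the launcher
    void* userData   # reference to its argument

cdef void hostCallbackWrapper(void* data) noexcept nogil:
    # Runs on a driver-owned thread that never holds the GIL.
    cdef cbData_st* cb = <cbData_st*>data
    # Acquire the GIL only while invoking the Python callable, so the
    # interpreter keeps switching between threads instead of deadlocking.
    with gil:
        (<object>cb.callback)(<object>cb.userData)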

Member

Based on my current understanding, we already do not handle the callback lifetime or GIL release/acquire in either the Python or Cython layer, so I suggest we move the callback-related changes to a separate PR to make them easier to review. (The static-linking part seems almost ready to merge to me.)

Another thought is that we should handle this in the Python layer and keep the Cython layer (this file) as lean/thin as it is today. I can imagine wanting to define a pure C callback in Cython that does not require the GIL and pass it to the Cython binding; then we should not need the GIL-holding wrapper at all.
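
For example, a sketch of that pure-C use case (hypothetical callback; _cuLaunchHostFunc as used elsewhere in this PR):

cdef int counter = 0

cdef void my_c_callback(void* userData) noexcept nogil:
    # Touches only C data, so it can run on the driver's callback
    # thread without ever acquiring the GIL.
    (<int*>userData)[0] += 1

# Passed straight through the thin Cython binding, e.g.:
# err = cydriver._cuLaunchHostFunc(stream, <CUhostFn>my_c_callback, <void*>&counter)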

Collaborator Author

> we already do not handle the callback lifetime or GIL release/acquire in either Python or Cython layer

We already handle it in the Cython layer for the Runtime: https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/cuda/bindings/_lib/cyruntime/utils.pyx.in#L885-L915

I can keep the Runtime changes to avoid regressing those, but remove the Driver ones.

@leofang (Member) Mar 24, 2025

Ah! I see, I overlooked the streamAddCallbackCommon function when checking the cudart code yesterday, thanks Vlad!

Sounds like a good idea to

  • move the lifetime-related changes to another PR
  • ensure both driver/runtime Python layers have the lifetime management for callbacks
  • ensure both driver/runtime Cython layers have no lifetime management for callbacks (as if the CUDA C APIs are called from Cython, which was the intent AFAIK)

Collaborator Author

Created #531 to track this.

@vzhurba01 vzhurba01 force-pushed the 100-static-runtime branch from 154644c to 3024f18 on March 20, 2025 at 22:09
@vzhurba01 (Collaborator, Author)

/ok to test

4 similar comments

@leofang leofang added the enhancement Any code-related improvements label Mar 23, 2025
Comment on lines 2 to +5
# at least with setuptools 75.0.0 this folder was added erroneously
# to the payload, causing file copying to the build environment failed
exclude cuda/bindings cuda?bindings
exclude cuda/bindings/_bindings cuda?bindings?_bindings
Member

Note: I guess after #493 is merged this bit can be removed.


ctypedef cuHostCallbackData_st cuHostCallbackData

@cython.show_performance_hints(False)
cdef void cuHostCallbackWrapper(void *data) nogil:
Member

Once we move this wrapper to the Python layer, we can declare it with gil in the function signature; then we don't need the with gil block in the function body. The GIL is acquired at function call time, and Cython has a bit more information for its GIL analysis:
https://cython.readthedocs.io/en/latest/src/userguide/nogil.html#releasing-and-reacquiring-the-gil
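
Roughly the difference, as an illustrative sketch (hypothetical wrapper names, not the PR's code):

# GIL acquired inside the body (as in this PR):
cdef void wrapper_block(void* data) noexcept nogil:
    with gil:
        (<object>data)()

# GIL acquired in the signature: taken at call time, and the whole
# function body is covered by Cython's GIL analysis:
cdef void wrapper_signature(void* data) noexcept with gil:
    (<object>data)()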

cbData.userData = userData
err = cydriver._cuLaunchHostFunc(hStream, <CUhostFn>cuHostCallbackWrapper, <void *>cbData)
if err != CUDA_SUCCESS:
    free(cbData)
Member

I am a bit worried about a double free -- do we know for sure that when the error code is not CUDA_SUCCESS, the callback is guaranteed not to have executed (otherwise the buffer would already have been freed inside the callback)? Could it be possible that the callback executes and we then get an error code here?

Collaborator Author

I see that the docs mainly talk about what happens when there's a context failure before the callback is reached (i.e. "Note that, in contrast to cuStreamAddCallback, the function will not be called in the event of an error in the CUDA context.").

Since that doesn't answer the question directly, I took a look at the driver source. While it does appear to do all the needed checks before the callback gets enqueued, I'm not confident enough in my understanding of the internal structures to be certain. What I can be certain about is that there are plenty of error checks before the enqueue is even considered: checks involving the user args, the context, GPU state, and memory allocations. So when an error does get returned, my impression from reading through the source is that it is much more likely to have occurred before the enqueue happened.

Overall, if the API fails but the callback still executes as if the call had succeeded, then I think that should be treated as a driver bug.
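
In other words, the ownership convention being relied on is roughly the following (a sketch, assuming, as the snippet above implies, that the wrapper frees the buffer after invoking the user callback):

cbData = <cuHostCallbackData*>malloc(sizeof(cuHostCallbackData))
cbData.callback = callback
cbData.userData = userData
err = cydriver._cuLaunchHostFunc(hStream, <CUhostFn>cuHostCallbackWrapper, <void*>cbData)
if err != CUDA_SUCCESS:
    # The launch was rejected before anything was enqueued, so the
    # callback never runs and never frees cbData -- free it here.
    free(cbData)
# On success, ownership transfers to the wrapper, which frees cbData
# after the user callback returns. "Failed but still enqueued" would
# double-free, which is why that case would be a driver bug.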

@vzhurba01 (Collaborator, Author)

/ok to test

1 similar comment

@leofang (Member) commented Mar 24, 2025

/ok to test

@leofang (Member) commented Mar 25, 2025

Thanks, Vlad! Since the CI is green (after a few retries) and all major comments are resolved, let's merge and address any potential issues in a separate PR.

@leofang leofang merged commit 1c6f3bc into NVIDIA:main Mar 25, 2025
74 checks passed
