Statically link to CUDA Runtime #517
Conversation
```cython
with gil:
    cbData.callback(cbData.userData)
```
Why do we need the GIL here? Given that these are low-level bindings, it's a bit surprising to me that we'd need to acquire the GIL, since a user could relatively easily create a callback function that doesn't require it.
We want to support Python callables, and this is a shared path between Python users and Cython users (since the Python binding layer calls into the Cython binding layer).
I found a thread all the way back from 2021 with @shwina's investigation into the consequences of not acquiring the GIL. My rough understanding is that it's possible for the main thread to hold the GIL while the callback thread can't acquire it, so it deadlocks. By acquiring the GIL for the callback, the interpreter should regularly switch between the two threads and avoid the deadlock.
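As an illustration of that shared path, here is a toy Python/ctypes stand-in (hypothetical names; not the cuda-python code path): the Python callback body touches Python objects, so whatever thunk invokes it must hold the GIL first. The ctypes trampoline performs that acquisition automatically, playing the role of the `with gil:` block in the Cython wrapper.

```python
import ctypes
import threading

# A C-compatible host-function signature: void (*)(void *userData).
CALLBACK = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

results = []

@CALLBACK
def host_fn(user_data):
    # Appending to a Python list requires the GIL; the ctypes trampoline
    # acquires it before entering this body (like `with gil:` in Cython).
    results.append(user_data)

# Simulate the driver invoking the callback from a worker thread.
t = threading.Thread(target=host_fn, args=(42,))
t.start()
t.join()
print(results)  # [42]
```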
Based on my current understanding, we already do not handle the callback lifetime or GIL release/acquire in either Python or Cython layer, so I suggest we move callback-related changes to a separate PR to make it easier to review. (The static linking part seems almost ready to merge to me.)
Another thought is that we should handle this in the Python layer, and keep the Cython layer (this file) as lean/thin as it is today. I can imagine I want to define a pure C callback in Cython that does not require GIL, and pass it to the Cython binding. Then we should not need the GIL-holding wrapper at all.
> we already do not handle the callback lifetime or GIL release/acquire in either Python or Cython layer
We already handle it in Cython layer for Runtime: https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/cuda/bindings/_lib/cyruntime/utils.pyx.in#L885-L915
I can keep the Runtime changes to avoid regressing those, but remove the Driver ones.
Ah! I see, I overlooked the `streamAddCallbackCommon` function when checking the cudart code yesterday, thanks Vlad!
Sounds like a good idea to
- move the lifetime-related changes to another PR
- ensure both driver/runtime Python layers have the lifetime management for callbacks
- ensure both driver/runtime Cython layers have no lifetime management for callbacks (as if the CUDA C APIs are called from Cython, which was the intent AFAIK)
Created #531 to track this.
```
# at least with setuptools 75.0.0 this folder was added erroneously
# to the payload, causing file copying to the build environment to fail
exclude cuda/bindings cuda?bindings
exclude cuda/bindings/_bindings cuda?bindings?_bindings
```
Note: I guess after #493 is merged this bit can be removed.
```cython
ctypedef cuHostCallbackData_st cuHostCallbackData

@cython.show_performance_hints(False)
cdef void cuHostCallbackWrapper(void *data) nogil:
```
Once we move this wrapper to the Python layer, we can declare it `with gil` in the function signature; then we don't need the `with gil` block in the function body. The GIL is acquired at function call time, and Cython has a bit more info for its GIL analysis.
https://cython.readthedocs.io/en/latest/src/userguide/nogil.html#releasing-and-reacquiring-the-gil
```cython
cbData.userData = userData
err = cydriver._cuLaunchHostFunc(hStream, <CUhostFn>cuHostCallbackWrapper, <void *>cbData)
if err != CUDA_SUCCESS:
    free(cbData)
```
I am a bit worried about a double free. Do we know for sure that when the error code is not `CUDA_SUCCESS`, the callback is guaranteed not to have executed (otherwise the buffer would already have been freed inside the callback)? Could it be possible that the callback executes and we then get an error code here?
I see that the docs mainly talk about what happens in situations where there's a context failure before the callback is reached (i.e. "Note that, in contrast to cuStreamAddCallback, the function will not be called in the event of an error in the CUDA context.").
Since this doesn't answer the question directly, I took a look at the driver source. While it does appear to do all the needed checks before the callback gets enqueued, I'm not confident enough in my understanding of the internal structures to be certain. What I can be certain about is that there are plenty of error checks before the enqueue is even considered: they cover the user arguments, the context, GPU state, and memory allocations. So when an error does get returned, my impression after looking through it is that it most likely occurred before the enqueue happened.
Overall, if the API fails but still executes as if it passed... then I think it should be treated as a driver bug.
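To make the ownership rule concrete, here is a toy Python model (hypothetical names; the real logic lives in the driver and the Cython layer) of the invariant both sides are relying on: the payload is released exactly once, by the caller on the error path or by the wrapper on the success path.

```python
freed = []

def callback_wrapper(payload):
    # Success path: the callback runs, then the wrapper releases the payload.
    payload["callback"](payload["user_data"])
    freed.append(payload["id"])

def launch_host_func(payload, enqueue_ok):
    # Toy stand-in for the enqueue-or-fail behavior discussed above.
    if not enqueue_ok:
        freed.append(payload["id"])  # error path: the caller releases
        return "CUDA_ERROR"
    callback_wrapper(payload)        # the driver would run this asynchronously
    return "CUDA_SUCCESS"

seen = []
launch_host_func({"id": 1, "callback": seen.append, "user_data": "a"}, enqueue_ok=True)
launch_host_func({"id": 2, "callback": seen.append, "user_data": "b"}, enqueue_ok=False)
print(freed)  # [1, 2]: each payload released exactly once
print(seen)   # ['a']: the failed launch never ran the callback
```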
Thanks, Vlad! Since the CI is green (after a few retries) and all major comments are resolved, let's merge and address any potential issues in a separate PR.
|
Close #100
This version of statically linking to the CUDA Runtime has no user-facing breaking runtime changes and can therefore be merged early to let it soak before our next release. This change leaves the graphics APIs as is, because those types are redefined and would cause a type conflict if we were to extern the definitions as described in #488.
This change also couples in a fix for callback functions: each callback API should acquire the GIL when processing the callback, but the driver APIs were missing this.