feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer #1283
Conversation
args_viewable_as_strided_memory

:template: dataclass.rst
- dataclass.rst does not render methods.
- class.rst omits cythonized properties.

cyclass places the attributes section just after the main class docstring; this way, we can document the actual attributes at the end of the main docstring, and they are followed by the docstrings of all the properties.
leofang left a comment:
Checking in EOD progress. I haven't reviewed layout/memoryview.
Also, I assume you're working on migrating the tests?
    driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
    driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
)
return driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
TODO: cimport this from cydriver
🙃 I actually had it this way initially, but seeing that all the cydriver imports were gone from buffer, I went along with the Python API. I can dig up the previous variant.
Yeah, sorry. I think it's not "gone" gone; most likely @Andy-Jost found that we don't need many driver API calls in this file after the refactoring (#1205). But pointer attribute checks are in the hot path, so we should cythonize them.
In fact, I am trying to catch up with what @fbusato is doing in C++ (NVIDIA/cccl#6733), which is an equivalent check (but for C++ mdspan instead of Python SMV).
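For reference, a minimal Cython sketch of such a cydriver-based query (the exact cydriver spellings and error handling are assumptions, not the PR's actual code):

```cython
# Hedged sketch: query pointer attributes through the cimported C API,
# avoiding Python-call overhead on the hot path. Spellings are assumptions.
from cuda.bindings cimport cydriver

cdef int _query_ptr_attrs(cydriver.CUdeviceptr ptr,
                          cydriver.CUmemorytype* mem_type,
                          int* device_ordinal) except?-1:
    cdef cydriver.CUpointer_attribute attrs[2]
    cdef void *data[2]
    attrs[0] = cydriver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE
    attrs[1] = cydriver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
    data[0] = <void*>mem_type
    data[1] = <void*>device_ordinal
    cdef cydriver.CUresult err = cydriver.cuPointerGetAttributes(2, attrs, data, ptr)
    if err != 0:  # CUDA_SUCCESS
        raise RuntimeError("cuPointerGetAttributes failed: %d" % <int>err)
    return 0
```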
Thanks for the reference! Looking at the logic in cccl, I adjusted the managed memory "discovery". I am not sure we need to go into as much detail as retrieving the particular memory pool and checking whether its readability flag is set; I did not add this, but can adjust if needed.

In any case, I moved back to the cydriver API and added tests covering host/device/managed/pinned memory from CUDA malloc, plus pinned memory from CUDA host register.

For pinned memory, I am not 100% sure what happens on devices for which CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is false, i.e. when one registers host memory with cudaHostRegister and passes the original pointer, while the pointer to access it on the device is different. Would it still be memory_type = 0 or memory_type = host, and what would be the desired is_device_accessible value?
The dlpack fix moved out of this PR is here: #1291
Let's kick off CI

/ok to test 3a904e7
A single test case failed - testing pointer attributes for host memory "manually pinned" with cuMemHostRegister. It failed on a pre-condition assert that the memory is not device accessible before it is registered, and it failed in the second of the two cases testing this. The test did not clean up properly - it was missing an unregister call. I am guessing we ended up with the same pointer in the second case. Commit 38ddb36 should fix that.
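For illustration, a hedged sketch of the register/unregister pairing the fix introduces (this is not the PR's actual test code):

```python
# Hedged sketch: pair cudaHostRegister with cudaHostUnregister so a later
# test case cannot observe a pointer still pinned by a previous one.
import numpy as np
from cuda.bindings import runtime

a = np.empty(5, dtype=np.int32)
(err,) = runtime.cudaHostRegister(a.ctypes.data, a.nbytes, 0)
assert err == runtime.cudaError_t.cudaSuccess
try:
    ...  # assertions on the pointer attributes while the memory is pinned
finally:
    (err,) = runtime.cudaHostUnregister(a.ctypes.data)
    assert err == runtime.cudaError_t.cudaSuccess
```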
@stiepan would you mind resolving the conflicts?
# TODO(ktokarski): In some cases, the registered memory requires
# using different ptr for device and host, we could check
# cuMemHostGetDevicePointer and
# CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM
# to double check the device accessibility.
Do you happen to know what cases these are? This used to be the case with non-unified addressing, but I don't think any of the platforms CUDA supports are non-unified these days.
I did not find a comprehensive list, but digging a bit I learnt of one notable exception for modern GPUs: running on WSL. Indeed, trying to access a cudaHostRegister-ed ptr on WSL fails (if the memory is allocated with CUDA from the start, using the same pointer is fine).
```python
import cuda.core.experimental as ccx
from cuda.bindings import runtime
from cuda.bindings import driver
import cupy as cp
import numpy as np

d = ccx.Device()
d.set_current()

def query_memory_attrs(ptr):
    attrs = (
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_IS_MANAGED,
    )
    ret, attr = driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
    assert ret == 0
    return attr

a_np = np.empty(5, dtype=np.int32)
cpu_ptr = a_np.ctypes.data
ret, = runtime.cudaHostRegister(cpu_ptr, 20, 0)
assert ret == 0
assert query_memory_attrs(cpu_ptr)[0] == driver.CUmemorytype.CU_MEMORYTYPE_HOST
ret, attr = runtime.cudaPointerGetAttributes(cpu_ptr)
assert ret == 0
print(attr.devicePointer == cpu_ptr)

# On WSL, accessing cpu_ptr instead of attr.devicePointer fails
um = cp.cuda.UnownedMemory(cpu_ptr, 20, a_np, 0)
mem = cp.cuda.MemoryPointer(um, 0)
a_cp = cp.ndarray(shape=(5,), dtype=cp.int32, memptr=mem)
a_cp[:] = 1
print(a_np)
print(a_cp)
```
At the same time, the driver's cuPointerGetAttributes still reports that pointer as CU_MEMORYTYPE_HOST.
So, what should be the meaning of is_device_accessible and is_host_accessible in this case?

- Should we check the device attribute and, if the attribute is 0, follow up by retrieving host_ptr and device_ptr and set is_host_accessible = (host_ptr == ptr), is_device_accessible = (device_ptr == ptr)?
- Or expect the user to pass the correct pointer in the correct context, i.e. if the buffer is to be consumed on the GPU, the user is expected to pass the device ptr?
- Or (not a fan) have buffer.device_ptr, buffer.host_ptr attributes?
Let's table this discussion for now. I'll create an issue to track this. I think the strided layout itself is already big enough that we want to keep the scope limited.
If we checked memory_type == cydriver.CUmemorytype.CU_MEMORYTYPE_HOST and CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM, I would assume the first check would return True, and the second would return True if an allocation made with cudaMallocHost can use the same ptr for device and host - so it would still return True for is_device_accessible?
On WSL, CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is False.

If a ptr comes from cudaMallocHost or was passed to cudaHostRegister, memory_type == cydriver.CUmemorytype.CU_MEMORYTYPE_HOST is True in both cases.

For cudaMallocHost, the ptr is truly device and host accessible; only the cudaHostRegister-ed one is troublesome - even though its memory type is CU_MEMORYTYPE_HOST, it cannot be used to access the memory from the device. So my point was that if we were to say is_device_accessible is False whenever CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is False, we would break otherwise valid cudaMallocHost usages.
```python
import cuda.core.experimental as ccx
from cuda.bindings import runtime
from cuda.bindings import driver
import cupy as cp
import numpy as np
import ctypes

def query_memory_attrs(ptr):
    attrs = (
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_IS_MANAGED,
    )
    ret, attr = driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
    assert ret == 0
    return attr

def as_numpy(ptr, shape, dtype):
    size = np.prod(shape) * dtype.itemsize
    return np.ndarray(
        shape=shape,
        dtype=dtype,
        buffer=memoryview((ctypes.c_char * size).from_address(ptr)),
    )

def as_cupy(ptr, shape, dtype):
    size = np.prod(shape) * dtype.itemsize
    um = cp.cuda.UnownedMemory(ptr, size, owner=None, device_id=0)
    mem = cp.cuda.MemoryPointer(um, 0)
    return cp.ndarray(shape=shape, dtype=dtype, memptr=mem)

d = ccx.Device()
d.set_current()

# On WSL this is 0
print(driver.cuDeviceGetAttribute(
    driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM, 0))

shape = (5,)
dtype = np.dtype(np.int32)
size = np.prod(shape) * dtype.itemsize

# But this works
l = ccx.LegacyPinnedMemoryResource()
alloc_mem = l.allocate(np.prod(shape) * dtype.itemsize)
alloc_ptr = int(alloc_mem.handle)
# The pinned ptr's CU_POINTER_ATTRIBUTE_MEMORY_TYPE is, as expected, 1 (aka CU_MEMORYTYPE_HOST)
assert query_memory_attrs(alloc_ptr)[0] == driver.CUmemorytype.CU_MEMORYTYPE_HOST
a_np = as_numpy(alloc_ptr, shape, dtype)
a_cp = as_cupy(alloc_ptr, shape, dtype)
a_np[:] = 1
print(a_np)
print(a_cp)

# The problem is when we register the memory
a_np = np.empty(shape, dtype=dtype)
cpu_ptr = a_np.ctypes.data
ret, = runtime.cudaHostRegister(cpu_ptr, size, 0)
assert ret == 0
assert query_memory_attrs(cpu_ptr)[0] == driver.CUmemorytype.CU_MEMORYTYPE_HOST
reg_np = as_numpy(cpu_ptr, shape, dtype)
reg_cp = as_cupy(cpu_ptr, shape, dtype)
reg_np[:] = 2
print(reg_np)
# Here we end up with invalid access
print(reg_cp)
```
Just to confirm my understanding is correct: on WSL, CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is False for both cudaMallocHost memory as well as cudaHostRegister memory, but the ptr returned from cudaMallocHost is in fact usable in device code, while the ptr used for cudaHostRegister is not?
Alternatively, we could query the CU_POINTER_ATTRIBUTE_DEVICE_POINTER and CU_POINTER_ATTRIBUTE_HOST_POINTER attributes. On my local WSL setup this yields:

- Same ptr for pinned host memory
- Same ptr for managed memory
- Different ptrs for device memory (0 for the CU_POINTER_ATTRIBUTE_HOST_POINTER attribute, as expected)
- Different ptrs for registered host memory (neither is 0)

Our logic could be that we return is_device_accessible == True only when the ptr is equal to the ptr returned from CU_POINTER_ATTRIBUTE_DEVICE_POINTER, and is_host_accessible == True when the ptr is equal to the ptr returned from CU_POINTER_ATTRIBUTE_HOST_POINTER.

That being said, querying these attributes is expensive, and I am not sure we want to pay this penalty...
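A hedged sketch of that rule (the helper name is illustrative, not the PR's code):

```python
# Hedged sketch of the proposed accessibility rule.
from cuda.bindings import driver

def ptr_accessibility(ptr: int):
    attrs = (
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_POINTER,
        driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_HOST_POINTER,
    )
    err, (device_ptr, host_ptr) = driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
    assert err == driver.CUresult.CUDA_SUCCESS
    # Accessible on a given side only if the driver maps the very same address there.
    return int(device_ptr) == ptr, int(host_ptr) == ptr
```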
> Just to confirm my understanding is correct, on WSL CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is False

That's right.

> for both cudaMallocHost memory as well as cudaHostRegister memory

CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is a device attribute, not a memory ptr attribute.

> the ptr returned from cudaMallocHost is in fact usable in device code while the ptr used for cudaHostRegister is not usable in device code

That's right. And using the memory type is not enough to distinguish the two.

> Alternatively, we could query the CU_POINTER_ATTRIBUTE_DEVICE_POINTER and CU_POINTER_ATTRIBUTE_HOST_POINTER attributes.

Yeah, I've been thinking about a similar approach. According to cuMemHostGetDevicePointer, there is still a catch, though: in some cases, device_ptr != host_ptr even though the memory can be accessed through the host pointer from the device. 🥲 If I read the docs right (and assuming that's the only edge case), we'd need to bundle it with the CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM check, so that CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM or ptr == device_ptr would be accurate enough.
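A hedged sketch of that bundled check (helper name illustrative, not the PR's code):

```python
# Hedged sketch: the pointer is treated as device accessible if the device
# can use host pointers for registered memory, or if the driver maps it to
# the same device-side address.
from cuda.bindings import driver

def registered_ptr_device_accessible(ptr: int, device: int) -> bool:
    err, can_use_host_ptr = driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM,
        device,
    )
    assert err == driver.CUresult.CUDA_SUCCESS
    if can_use_host_ptr:
        # Covers the documented edge case where device_ptr != host_ptr even
        # though the host pointer is still usable from the device.
        return True
    err, device_ptr = driver.cuMemHostGetDevicePointer(ptr, 0)
    return err == driver.CUresult.CUDA_SUCCESS and int(device_ptr) == ptr
```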
/ok to test 639ee5f
…when collecting cases in test_linker, test_program, and test_utils. Could it be unrelated?

Yes. It's likely a known glitch. Pinged you in an internal gha-runner thread. We can consider the CI green. @cpcloud and I discussed - we are still reviewing the PR, but we'd like to get it merged tomorrow at the latest.
self._alloc_stream = None

cdef Buffer_init_mem_attrs(Buffer self):
Nit: `cdef void Buffer_init_mem_attrs(Buffer self):`
Hmm, I recall some weird issues with a void return type when it comes to exception propagation in Cython. Won't this require an except* clause?
Yeah, it'd have to be

```cython
cdef int Buffer_init_mem_attrs(Buffer self) except?-1:
    ...
    return 0
```

if we want to do this and gain a bit of perf. I am fine with the status quo.
/ok to test 66fc6e8
Many thanks to @stiepan for migrating the strided layout to cuda-core, and to everyone for helping review!!! 🔥
Description
- Introduces StridedLayout:
  - `StridedLayout(shape, strides, itemsize)`, `StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)`
  - `StridedLayout.dense(shape, itemsize, stride_order)`
  - `StridedLayout.dense_like` and `self.to_dense`

  From Python, StridedLayout is immutable; stride-manipulation methods return a new instance. In Cython, layout-manipulation methods can be run in place, to avoid temporary objects in a sequence of operations.
Please take a look at the StridedLayout docs for more details and examples.
- Enables wrapping an external allocation into Buffer (`Buffer.from_handle(ptr, owner=obj)`). The owner and memory resource cannot be specified together. The owner reference is kept until the Buffer is closed. Without a memory resource, Buffer now queries the driver for the host/device accessibility and device_id of the ptr.
- StridedMemoryView now uses StridedLayout to represent the shape/strides.
- Adds `StridedMemoryView.from_buffer(buffer, layout, optional dtype)` to create an SMV from a Buffer and a StridedLayout. For example, to implement an empty_like() method for a numpy array, but allocated on a device, one could do something like the sketch below.
- ~~The StridedMemoryView can now be exported via dlpack.~~ (delayed for later)
- ~~StridedMemoryView.copy_from and StridedMemoryView.copy_to allow copying data between views.~~ (in a follow-up PR)
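A hedged sketch of the empty_like() idea mentioned above (the StridedLayout import path and the exact dense()/from_buffer() signatures follow the description, but may differ from the final API):

```python
# Hedged sketch, not the final API: assumes StridedLayout is importable from
# cuda.core.experimental.utils and that dense()/from_buffer() have the
# signatures sketched in the PR description.
import numpy as np
import cuda.core.experimental as ccx
from cuda.core.experimental.utils import StridedLayout, StridedMemoryView

def empty_like(a: np.ndarray) -> StridedMemoryView:
    dev = ccx.Device()
    dev.set_current()
    # Describe a dense (C-contiguous) layout with the array's shape/itemsize.
    layout = StridedLayout.dense(a.shape, a.itemsize)
    # Allocate device memory of the same total size and wrap it in a view.
    buf = dev.memory_resource.allocate(a.nbytes)
    return StridedMemoryView.from_buffer(buf, layout, dtype=a.dtype)
```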