Add Buffer.fill() method for cuMemsetAsync support (#1314) #1318
base: main
Implements Buffer.fill(value, width, *, stream) method that wraps cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter (1, 2, or 4 bytes).
- Add fill() method to Buffer class in _buffer.pyx
- Support width=1 (byte), width=2 (16-bit), width=4 (32-bit)
- Validate width, value range, and buffer size divisibility
- Add comprehensive tests in test_memory.py
- Tests cover all widths, error cases, and verification
Part of issue NVIDIA#1314: CUDA Graph phase 3 - memcpy nodes
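The width-based dispatch and the three validation steps listed above can be sketched on the host side as follows. This is only an illustration: `plan_fill` and `_MEMSET_BY_WIDTH` are hypothetical names, not code from `_buffer.pyx`, the driver functions are represented by their names rather than called, and the unsigned value range is an assumption.

```python
# Hypothetical host-side sketch of the dispatch logic; it does not call the
# CUDA driver and is not the implementation from _buffer.pyx.
_MEMSET_BY_WIDTH = {
    1: "cuMemsetD8Async",
    2: "cuMemsetD16Async",
    4: "cuMemsetD32Async",
}


def plan_fill(size_bytes, value, width):
    """Return (driver function name, element count N) for a fill request."""
    # 1. The width must be one of the supported element sizes.
    if width not in _MEMSET_BY_WIDTH:
        raise ValueError(f"width must be 1, 2, or 4, got {width}")

    # 2. The value must fit in `width` bytes (assumed unsigned here).
    max_value = (1 << (8 * width)) - 1
    if not 0 <= value <= max_value:
        raise ValueError(
            f"value ({value}) is outside the range [0, {max_value}] "
            f"for width={width}"
        )

    # 3. The buffer must hold a whole number of elements.
    if size_bytes % width != 0:
        raise ValueError(
            f"buffer size ({size_bytes}) is not divisible by width ({width})"
        )

    return _MEMSET_BY_WIDTH[width], size_bytes // width


print(plan_fill(1024, 0x42, 1))
print(plan_fill(1024, 0x1234, 2))
print(plan_fill(1024, 0xDEADBEEF, 4))
```

Checking the width first means an unsupported width is reported before the range and divisibility checks, which both depend on it.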
Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill() in graph capture mode. The test verifies graph capture for Buffer operations including copy_from, copy_to, fill, and kernel launch operations. Part of issue NVIDIA#1314
- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Improves performance by eliminating Python overhead

- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Remove unused raise_if_driver_error import
- Improves performance by eliminating Python overhead
/ok to test 7d9747d
cdef cydriver.CUstream s = s_stream._handle

# Validate width
if width not in (1, 2, 4):
I'd put the validation code closer to the top of the function so we avoid any setup work in the error case where the user passes an unsupported size to the function.
OK. I fiddled with it to simplify the logic. To be honest, I don't see a big improvement here, since most of the preceding statements just declare stack variables.
    raise ValueError(f"width must be 1, 2, or 4, got {width}")

# Validate value fits in width
if width == 1:
You could hoist this logic into a validation function to remove the magic numbers in the code.
def _validate_value_against_bitwidth(bitwidth, value, signed=False):
    max_bits = bitwidth
    if signed:
        min_value = -(1 << (max_bits - 1))
        max_value = (1 << (max_bits - 1)) - 1
    else:
        min_value = 0
        max_value = (1 << max_bits) - 1
    if not min_value <= value <= max_value:
        raise ValueError(
            f"value ({value}) is outside the representable range for {bitwidth}-bit integers "
            f"[{min_value}, {max_value}]"
        )
Ok. I converted it to Cython, too.
buffer1.fill(0x42, width=1, stream=stream)
device.sync()

if check:
Why are we parametrizing these value checks in a test suite? The memory sizes don't strike me as so large that these operations would be slow.
The values are only checked when the memory allocation is pinned. This follows the existing pattern.
rparolin left a comment
Looks good generally, just a couple comments.
- Add _validate_value_against_bitwidth helper function
- Move helper function to end of file as cdef function
- Use 64-bit platform integers (int64_t/uint64_t) instead of Python ints
- Add assertion that bitwidth < 64
- Remove magic numbers from fill() method
- Update tests to match new error message format
/ok to test 8e2cddf
Summary
This PR implements the Buffer.fill(value, width, *, stream) method that wraps CUDA's cuMemsetAsync functions, supporting cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter.
Part of issue #1314: CUDA Graph phase 3 - memcpy nodes
Changes
Added Buffer.fill() method in cuda/core/experimental/_memory/_buffer.pyx:
- … (cuMemsetD8Async)
- … (cuMemsetD16Async)
- … (cuMemsetD32Async)
- … copy_to/copy_from methods
Added comprehensive tests in tests/test_memory.py:
Implementation Details
The method automatically selects the appropriate CUDA driver API function based on the width parameter:
- width=1: Uses cuMemsetD8Async with N = buffer_size (bytes)
- width=2: Uses cuMemsetD16Async with N = buffer_size // 2 (16-bit elements)
- width=4: Uses cuMemsetD32Async with N = buffer_size // 4 (32-bit elements)
Example Usage
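As a rough illustration of the call shape and the byte pattern each width produces, here is a host-only stand-in that runs without a GPU. `HostBuffer` is hypothetical and is not the cuda.core Buffer: it mirrors only the `fill(value, width, *, stream)` signature, ignores `stream`, and assumes little-endian element layout.

```python
class HostBuffer:
    """Hypothetical host-memory stand-in for Buffer; models only fill()."""

    def __init__(self, size):
        self.data = bytearray(size)

    def fill(self, value, width, *, stream=None):
        # `stream` is accepted only to mirror the call shape; it is unused.
        if width not in (1, 2, 4):
            raise ValueError(f"width must be 1, 2, or 4, got {width}")
        if len(self.data) % width != 0:
            raise ValueError("buffer size must be divisible by width")
        # Assumption: elements are stored little-endian. int.to_bytes raises
        # OverflowError if the value does not fit in `width` bytes.
        pattern = value.to_bytes(width, "little")
        # Repeat the element N = size // width times, as the memset would.
        self.data[:] = pattern * (len(self.data) // width)


buf = HostBuffer(8)

buf.fill(0x42, width=1)
print(buf.data.hex())  # 4242424242424242

buf.fill(0x1234, width=2)
print(buf.data.hex())  # 3412341234123412

buf.fill(0xDEADBEEF, width=4)
print(buf.data.hex())  # efbeaddeefbeadde
```

With the method from this PR, the equivalent call is the one shown in the test snippet in the conversation: `buffer1.fill(0x42, width=1, stream=stream)` followed by a synchronization before reading the data back.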
Testing
All tests pass with comprehensive coverage of: