Add Buffer.fill() method for cuMemsetAsync support (#1314) #1318

Andy-Jost · 2025-12-04T23:54:35Z

Summary

This PR implements a Buffer.fill(value, width, *, stream) method that wraps CUDA's asynchronous memset driver functions. It calls cuMemsetD8Async, cuMemsetD16Async, or cuMemsetD32Async depending on the width parameter (1, 2, or 4 bytes).

Part of issue #1314: CUDA Graph phase 3 - memcpy nodes

Changes

Added Buffer.fill() method in cuda/core/experimental/_memory/_buffer.pyx:
- Supports width=1 (byte fill via cuMemsetD8Async)
- Supports width=2 (16-bit fill via cuMemsetD16Async)
- Supports width=4 (32-bit fill via cuMemsetD32Async)
- Validates width (must be 1, 2, or 4)
- Validates value range for each width
- Validates buffer size is divisible by width
- Follows same pattern as copy_to/copy_from methods

Added comprehensive tests in tests/test_memory.py:
- Tests all three widths (1, 2, 4 bytes)
- Tests error cases (invalid width, value out of range, size not divisible)
- Tests with different memory resource types
- Includes verification for host-accessible memory
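
The three validation rules listed for Buffer.fill() could look roughly like this in plain Python. This is only a sketch: the helper name validate_fill_args is hypothetical, the unsigned value range is an assumption, and the real checks live in Cython in _buffer.pyx.

```python
def validate_fill_args(value: int, width: int, size: int) -> None:
    """Raise ValueError if the arguments cannot be mapped onto a memset call.

    Hypothetical helper for illustration; not part of cuda.core.
    """
    # Only 8-, 16-, and 32-bit memset variants are wrapped.
    if width not in (1, 2, 4):
        raise ValueError(f"width must be 1, 2, or 4, got {width}")

    # Assumption: values are treated as unsigned integers of `width` bytes.
    max_value = (1 << (8 * width)) - 1
    if not 0 <= value <= max_value:
        raise ValueError(f"value ({value}) does not fit in {8 * width} bits")

    # The buffer must hold a whole number of `width`-byte elements.
    if size % width != 0:
        raise ValueError(f"buffer size ({size}) is not divisible by width ({width})")
```

With these rules, validate_fill_args(0x42, 1, 1024) passes, while a width of 3, a value of 256 with width=1, or a 1022-byte buffer with width=4 each raise ValueError.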

Implementation Details

The method automatically selects the appropriate CUDA driver API function based on the width parameter:

width=1: Uses cuMemsetD8Async with N = buffer_size (bytes)
width=2: Uses cuMemsetD16Async with N = buffer_size // 2 (16-bit elements)
width=4: Uses cuMemsetD32Async with N = buffer_size // 4 (32-bit elements)
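
The width-to-function mapping above can be sketched as a small lookup. The names MEMSET_BY_WIDTH and select_memset are hypothetical and only return the driver function's name and element count; the actual method calls the driver directly from Cython.

```python
# Maps the fill width in bytes to the name of the CUDA driver function used.
MEMSET_BY_WIDTH = {
    1: "cuMemsetD8Async",
    2: "cuMemsetD16Async",
    4: "cuMemsetD32Async",
}


def select_memset(buffer_size: int, width: int) -> tuple[str, int]:
    """Return (driver function name, element count N) for a fill.

    Hypothetical helper for illustration; assumes width and size were
    already validated.
    """
    name = MEMSET_BY_WIDTH[width]
    # N counts elements of `width` bytes, not bytes.
    return name, buffer_size // width
```

For a 1024-byte buffer, select_memset(1024, 2) gives ("cuMemsetD16Async", 512).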

Example Usage

buffer = mr.allocate(1024, stream=stream)
buffer.fill(0x42, width=1, stream=stream)  # Fill with byte value
buffer.fill(0x1234, width=2, stream=stream)  # Fill with 16-bit value
buffer.fill(0xDEADBEEF, width=4, stream=stream)  # Fill with 32-bit value
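
For host-accessible memory, the bytes expected after each call can be computed on the host. A minimal sketch, assuming multi-byte values are stored in little-endian order (as on typical CUDA platforms); expected_fill_bytes is a hypothetical helper, not part of cuda.core.

```python
def expected_fill_bytes(value: int, width: int, size: int) -> bytes:
    """Return the byte pattern a fill of `size` bytes is expected to produce.

    Assumption: little-endian storage of each `width`-byte element.
    """
    pattern = value.to_bytes(width, byteorder="little")
    return pattern * (size // width)


# The 16-bit value 0x1234 is stored low byte first, then repeated.
assert expected_fill_bytes(0x1234, 2, 4) == b"\x34\x12\x34\x12"
```

A test could compare this pattern against the contents of a pinned or unified buffer after synchronizing the stream.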

Testing

All tests pass with comprehensive coverage of:

Success cases for all widths
Error validation (width, value range, size divisibility)
Multiple memory resource types (device, unified, pinned)

Implements Buffer.fill(value, width, *, stream) method that wraps cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter (1, 2, or 4 bytes).

- Add fill() method to Buffer class in _buffer.pyx
- Support width=1 (byte), width=2 (16-bit), width=4 (32-bit)
- Validate width, value range, and buffer size divisibility
- Add comprehensive tests in test_memory.py
- Tests cover all widths, error cases, and verification

Part of issue NVIDIA#1314: CUDA Graph phase 3 - memcpy nodes

copy-pr-bot · 2025-12-04T23:54:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill() in graph capture mode. The test verifies graph capture for Buffer operations including copy_from, copy_to, fill, and kernel launch operations. Part of issue NVIDIA#1314

- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Improves performance by eliminating Python overhead

- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Remove unused raise_if_driver_error import
- Improves performance by eliminating Python overhead

Andy-Jost · 2025-12-05T00:23:37Z

/ok to test 7d9747d

github-actions · 2025-12-05T00:33:59Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-1318/
https://nvidia.github.io/cuda-python/pr-preview/pr-1318/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-1318/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-1318/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

rparolin · 2025-12-05T16:05:21Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+        cdef cydriver.CUstream s = s_stream._handle
+
+        # Validate width
+        if width not in (1, 2, 4):


I'd put the validation code closer to the top of the function so we avoid any setup work in the error case where the user passes an unsupported size to the function.

OK. I fiddled with it to simplify the logic. To be honest, I don't see a big improvement here, since most of the preceding statements just declare stack variables.

rparolin · 2025-12-05T16:15:53Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+            raise ValueError(f"width must be 1, 2, or 4, got {width}")
+
+        # Validate value fits in width
+        if width == 1:


You could hoist this logic into a validation function to remove the magic numbers in the code.

def _validate_value_against_bitwidth(bitwidth, value, signed=False):
    max_bits = bitwidth
    if signed:
        min_value = -(1 << (max_bits - 1))
        max_value = (1 << (max_bits - 1)) - 1
    else:
        min_value = 0
        max_value = (1 << max_bits) - 1
    if not min_value <= value <= max_value:
        raise ValueError(
            f"value ({value}) is outside the representable range for {bitwidth}-bit integers "
            f"[{min_value}, {max_value}]"
        )

Ok. I converted it to Cython, too.
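
To make the ranges concrete, the arithmetic in the suggested helper can be evaluated for the three widths that fill() supports. This is a small sketch: value_range is a hypothetical name, and the mapping bitwidth = 8 * width is an assumption about how fill() would call the helper.

```python
def value_range(bitwidth: int, signed: bool = False) -> tuple[int, int]:
    """Return (min, max) representable in `bitwidth` bits.

    Mirrors the range arithmetic of the suggested validation helper.
    """
    if signed:
        return -(1 << (bitwidth - 1)), (1 << (bitwidth - 1)) - 1
    return 0, (1 << bitwidth) - 1


# fill() takes a width in bytes; the helper works in bits (assumed 8 * width).
for width in (1, 2, 4):
    low, high = value_range(8 * width)
    print(f"width={width}: [{low}, {high:#x}]")
```

The unsigned upper bounds come out as 0xff, 0xffff, and 0xffffffff, which are the limits the original code presumably spelled out as literals.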

rparolin · 2025-12-05T16:19:29Z

cuda_core/tests/test_memory.py

+    buffer1.fill(0x42, width=1, stream=stream)
+    device.sync()
+
+    if check:


Why are we parametrizing these value checks in a test suite? The memory sizes don't strike me as so large that these operations would be slow.

The values are only checked when the memory allocation is pinned. This follows the existing pattern.

rparolin

Looks good generally, just a couple comments.

- Add _validate_value_against_bitwidth helper function
- Move helper function to end of file as cdef function
- Use 64-bit platform integers (int64_t/uint64_t) instead of Python ints
- Add assertion that bitwidth < 64
- Remove magic numbers from fill() method
- Update tests to match new error message format

Andy-Jost · 2025-12-05T18:38:55Z

/ok to test 8e2cddf

Andy-Jost added the cuda.core (Everything related to the cuda.core module) and feature (New feature or request) labels Dec 4, 2025

Andy-Jost requested review from cpcloud, ksimpson-work and leofang December 4, 2025 23:54

Andy-Jost self-assigned this Dec 4, 2025

Andy-Jost requested review from mdboom, rparolin and rwgk December 4, 2025 23:54

Andy-Jost added this to the cuda.core beta 10 milestone Dec 4, 2025

Andy-Jost removed request for cpcloud, ksimpson-work and mdboom December 4, 2025 23:55

Add graph capture tests for Buffer.fill()

8294553


Andy-Jost added the P0 (High priority - Must do!) label Dec 5, 2025

Andy-Jost added 2 commits December 4, 2025 16:16

rparolin reviewed Dec 5, 2025


rparolin requested changes Dec 5, 2025


Andy-Jost added 2 commits December 5, 2025 10:08

Simplified argument validation logic in Buffer.fill.

07c65d2

Andy-Jost requested a review from rparolin December 5, 2025 18:37

Merge branch 'main' into issue-1314

8e2cddf

Andy-Jost marked this pull request as ready for review December 5, 2025 18:38

Andy-Jost enabled auto-merge (squash) December 5, 2025 20:23
