Skip to content

Conversation

@Andy-Jost
Copy link
Contributor

Summary

This PR implements Buffer.fill(value, width, *, stream) method that wraps CUDA's cuMemsetAsync functions, supporting cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter.

Part of issue #1314: CUDA Graph phase 3 - memcpy nodes

Changes

  • Added Buffer.fill() method in cuda/core/experimental/_memory/_buffer.pyx:

    • Supports width=1 (byte fill via cuMemsetD8Async)
    • Supports width=2 (16-bit fill via cuMemsetD16Async)
    • Supports width=4 (32-bit fill via cuMemsetD32Async)
    • Validates width (must be 1, 2, or 4)
    • Validates value range for each width
    • Validates buffer size is divisible by width
    • Follows same pattern as copy_to/copy_from methods
  • Added comprehensive tests in tests/test_memory.py:

    • Tests all three widths (1, 2, 4 bytes)
    • Tests error cases (invalid width, value out of range, size not divisible)
    • Tests with different memory resource types
    • Includes verification for host-accessible memory

Implementation Details

The method automatically selects the appropriate CUDA driver API function based on the width parameter:

  • width=1: Uses cuMemsetD8Async with N = buffer_size (bytes)
  • width=2: Uses cuMemsetD16Async with N = buffer_size // 2 (16-bit elements)
  • width=4: Uses cuMemsetD32Async with N = buffer_size // 4 (32-bit elements)

Example Usage

buffer = mr.allocate(1024, stream=stream)
buffer.fill(0x42, width=1, stream=stream)  # Fill with byte value
buffer.fill(0x1234, width=2, stream=stream)  # Fill with 16-bit value
buffer.fill(0xDEADBEEF, width=4, stream=stream)  # Fill with 32-bit value

Testing

All tests pass with comprehensive coverage of:

  • Success cases for all widths
  • Error validation (width, value range, size divisibility)
  • Multiple memory resource types (device, unified, pinned)

Implements Buffer.fill(value, width, *, stream) method that wraps
cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on
the width parameter (1, 2, or 4 bytes).

- Add fill() method to Buffer class in _buffer.pyx
- Support width=1 (byte), width=2 (16-bit), width=4 (32-bit)
- Validate width, value range, and buffer size divisibility
- Add comprehensive tests in test_memory.py
- Tests cover all widths, error cases, and verification

Part of issue NVIDIA#1314: CUDA Graph phase 3 - memcpy nodes
@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Dec 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost Andy-Jost added cuda.core Everything related to the cuda.core module feature New feature or request labels Dec 4, 2025
@Andy-Jost Andy-Jost self-assigned this Dec 4, 2025
@Andy-Jost Andy-Jost requested review from mdboom, rparolin and rwgk December 4, 2025 23:54
@Andy-Jost Andy-Jost added this to the cuda.core beta 10 milestone Dec 4, 2025
Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill()
in graph capture mode. The test verifies graph capture for Buffer operations
including copy_from, copy_to, fill, and kernel launch operations.

Part of issue NVIDIA#1314
@Andy-Jost Andy-Jost added the P0 High priority - Must do! label Dec 5, 2025
- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Improves performance by eliminating Python overhead
- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Remove unused raise_if_driver_error import
- Improves performance by eliminating Python overhead
@Andy-Jost
Copy link
Contributor Author

/ok to test 7d9747d

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

cdef cydriver.CUstream s = s_stream._handle

# Validate width
if width not in (1, 2, 4):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put the validation code closer to the top of the function so we avoid any setup work in the error case where the user passes an unsupported size to the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I fiddled with it to simplify the logic. To be honest, I don't see a big improvement, here, since most of the preceding statements just declare stack variables.

raise ValueError(f"width must be 1, 2, or 4, got {width}")

# Validate value fits in width
if width == 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could hoist this logic into a validation function to remove the magic numbers in the code.

def _validate_value_against_bitwidth(bitwidth, value, signed=False):
    max_bits = bitwidth
    if signed:
        min_value = -(1 << (max_bits - 1))
        max_value = (1 << (max_bits - 1)) - 1
    else:
        min_value = 0
        max_value = (1 << max_bits) - 1

    if not min_value <= value <= max_value:
        raise ValueError(
            f"value ({value}) is outside the representable range for {bitwidth}-bit integers "
            f"[{min_value}, {max_value}]"
        )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I converted it to Cython, too.

buffer1.fill(0x42, width=1, stream=stream)
device.sync()

if check:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are parametrizing these value checks in a test suite? The memory sizes don't strike me as so large that these operations would be slow.

Copy link
Contributor Author

@Andy-Jost Andy-Jost Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are only checked when the memory allocation is pinned. This follows the existing pattern.

Copy link
Collaborator

@rparolin rparolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally, just a couple comments.

- Add _validate_value_against_bitwidth helper function
- Move helper function to end of file as cdef function
- Use 64-bit platform integers (int64_t/uint64_t) instead of Python ints
- Add assertion that bitwidth < 64
- Remove magic numbers from fill() method
- Update tests to match new error message format
@Andy-Jost Andy-Jost requested a review from rparolin December 5, 2025 18:37
@Andy-Jost Andy-Jost marked this pull request as ready for review December 5, 2025 18:38
@Andy-Jost
Copy link
Contributor Author

/ok to test 8e2cddf

@Andy-Jost Andy-Jost enabled auto-merge (squash) December 5, 2025 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module feature New feature or request P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants