[CUB] Refactor DeviceCopy to always take an environment#9416
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (3)
OverviewThis PR refactors the ChangesCore API Refactoring (
|
| Layer / File(s) | Summary |
|---|---|
DeviceCopy::Batched environment integration cub/cub/device/device_copy.cuh |
Temp-storage overload adds EnvT template parameter and const EnvT& env argument in place of cudaStream_t stream, dispatching via detail::dispatch_with_env_and_tuning; force-inlined overload tightens env from by-value to const reference; documentation updated. |
DeviceCopy::Copy mdspan environment integration cub/cub/device/device_copy.cuh |
Temp-storage overload replaces cudaStream_t stream with EnvT template and const EnvT& env parameter, retains mdspan assertions and early-return behavior, calls detail::copy_mdspan::copy(mdspan_in, mdspan_out, env); force-inlined overload tightens env to const reference; documentation updated. |
Mdspan copy dispatch const reference tightening cub/cub/device/dispatch/dispatch_copy_mdspan.cuh |
detail::copy_mdspan::copy function parameter changed from by-value to const EnvT& env to align with updated caller contract. |
Batched copy test expansion cub/test/catch2_test_device_copy_batched.cu |
Adds cuda/std/execution header; expands test with two sections: basic environment invoke and user-provided temp-storage query/allocation flow with CUDA error and synchronization checks, exercised across raw stream, cuda::stream, cuda::stream_ref, cuda::std::execution::env, and cuda::execution::gpu variants. |
Mdspan copy test expansion cub/test/catch2_test_device_copy_mdspan.cu |
Adds cuda/devices and cuda/std/execution headers; introduces comprehensive test covering 1D, 2D (layout_left), and 4D mdspan copies with 1-byte user-provided temp storage, CUDA error/sync validation, and device vector assertions across stream, cuda::stream, cuda::stream_ref, cuda::std::execution::env, and cuda::execution::gpu execution variants. |
Possibly related PRs
- NVIDIA/cccl#9318: Shared migration of dispatch helpers and public APIs (
DeviceFind,DeviceCopy) to const-reference environment parameters viadetail::dispatch_with_env_and_tuning.
Suggested reviewers
- gevtushenko
Comment @coderabbitai help to get the list of available commands and usage tips.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cub/cub/device/device_copy.cuh (1)
314-320:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winimportant:
@param[in] streamis stale here. Line 335 replaced the stream parameter withconst EnvT& env, so the generated API docs now describe a removed parameter and the new one in the same overload. As per coding guidelines, Doxygen@paramdescriptions must accurately reflect the current functionality.Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9b1775a1-0b9f-41dd-a733-7f7a930d89ff
📒 Files selected for processing (7)
cub/cub/detail/env_dispatch.cuhcub/cub/device/device_copy.cuhcub/cub/device/device_find.cuhcub/test/catch2_test_device_copy_batched.cucub/test/catch2_test_device_copy_mdspan.cucub/test/catch2_test_device_find_env.culibcudacxx/include/cuda/std/__pstl/cuda/find_if.h
61979f9 to
a3bb2f3
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cub/cub/device/device_copy.cuh (1)
336-354: 💤 Low valuesuggestion:
d_temp_storageandtemp_storage_bytesare unused whend_temp_storage != nullptr. Line 354 callsdetail::copy_mdspan::copy(mdspan_in, mdspan_out, env)without forwarding the storage parameters. The user-provided memory is effectively ignored.If mdspan copy truly requires no temp storage, consider documenting this in the Doxygen (e.g., "Note: This operation requires minimal temporary storage (1 byte) which is not actually used.") to avoid confusing users who allocate and provide storage expecting it to be utilized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8dd691a0-c710-49d1-9020-8315e4e19f81
📒 Files selected for processing (3)
cub/cub/device/device_copy.cuhcub/test/catch2_test_device_copy_batched.cucub/test/catch2_test_device_copy_mdspan.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- cub/test/catch2_test_device_copy_batched.cu
This comment has been minimized.
This comment has been minimized.
We want to be able to pass tunings even to the APIs that currently only take a stream. Refactor so that we can pass an arbitrary environment to those APIs that take user provided memory
152d45c to
881ee39
Compare
😬 CI Workflow Results🟥 Finished in 1h 56m: Pass: 38%/287 | Total: 2d 04h | Max: 42m 36s | Hits: 76%/184645See results here. |
| // Integer type large enough to hold any offset in [0, num_thread_blocks_launched), where a safe | ||
| // upper bound on num_thread_blocks_launched can be assumed to be given by | ||
| // IDIV_CEIL(num_ranges, 64) | ||
| using BlockOffsetT = uint32_t; |
There was a problem hiding this comment.
Q: Why do we need to remove this comment?
We want to be able to pass tunings even to the APIs that currently only take a stream.
Refactor so that we can pass an arbitrary environment to those APIs that take user provided memory