Skip to content

[CUB] Refactor DeviceCopy to always take an environment#9416

Open
miscco wants to merge 1 commit into
NVIDIA:mainfrom
miscco:device_copy_batched
Open

[CUB] Refactor DeviceCopy to always take an environment#9416
miscco wants to merge 1 commit into
NVIDIA:mainfrom
miscco:device_copy_batched

Conversation

@miscco

@miscco miscco commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

We want to be able to pass tunings even to the APIs that currently only take a stream.

Refactor so that we can pass an arbitrary environment to those APIs that take user provided memory

@miscco miscco requested review from a team as code owners June 12, 2026 08:14
@miscco miscco requested a review from ericniebler June 12, 2026 08:14
@miscco miscco requested a review from NaderAlAwar June 12, 2026 08:14
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 898a5e48-44e8-46e9-85dc-93aa0ea7281b

📥 Commits

Reviewing files that changed from the base of the PR and between 152d45c and 881ee39.

📒 Files selected for processing (4)
  • cub/cub/device/device_copy.cuh
  • cub/cub/device/dispatch/dispatch_copy_mdspan.cuh
  • cub/test/catch2_test_device_copy_batched.cu
  • cub/test/catch2_test_device_copy_mdspan.cu
🚧 Files skipped from review as they are similar to previous changes (3)
  • cub/test/catch2_test_device_copy_mdspan.cu
  • cub/test/catch2_test_device_copy_batched.cu
  • cub/cub/device/device_copy.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR refactors the cub::DeviceCopy API family to accept an environment parameter instead of explicitly taking a cudaStream_t. This enables environment-based configuration and tuning support for DeviceCopy APIs, particularly for those accepting user-provided memory.

Changes

Core API Refactoring (cub/cub/device/device_copy.cuh)

DeviceCopy::Batched (temp-storage overload):

  • Added template parameter EnvT (defaults to cuda::std::execution::env<>)
  • Replaced cudaStream_t stream = nullptr parameter with const EnvT& env = {}
  • Implementation routes through detail::dispatch_with_env_and_tuning, which resolves the stream/policy and passes them to detail::batch_memcpy::dispatch

DeviceCopy::Batched (force-inlined overload):

  • Changed env parameter from by-value to const EnvT&

DeviceCopy::Copy (mdspan temp-storage overload):

  • Added template parameter EnvT (defaults to cuda::std::execution::env<>)
  • Replaced cudaStream_t stream = nullptr parameter with const EnvT& env = {}
  • Preserves existing behavior: mdspan extents/data-handle assertions, early return for d_temp_storage == nullptr
  • Calls detail::copy_mdspan::copy(mdspan_in, mdspan_out, env)

DeviceCopy::Copy (mdspan force-inlined overload):

  • Changed env parameter from by-value to const EnvT&
  • Passes environment to underlying detail::copy_mdspan::copy

Signature Changes: +36/-32 lines

Dispatch Layer (cub/cub/device/dispatch/dispatch_copy_mdspan.cuh)

  • detail::copy_mdspan::copy(): Changed env parameter from by-value to const EnvT& env = {}
  • Signature Changes: +1/-1 lines

Test Coverage

cub/test/catch2_test_device_copy_batched.cu (+84/-9 lines):

  • Added #include <cuda/std/execution>
  • Introduced two test sections:
    • SECTION("With environment"): Tests batched copy with generated range descriptors
    • SECTION("With user provided memory and environment"): Two-phase test (temp-storage size query, then actual copy) with host-side verification
  • Tests multiple environment/policy variants: raw cudaStream_t, cuda::stream, cuda::stream_ref, cuda::std::execution::env, cuda::execution::gpu, and cuda::execution::gpu with explicit stream binding

cub/test/catch2_test_device_copy_mdspan.cu (+87/-0 lines):

  • Added #include <cuda/devices> and #include <cuda/std/execution>
  • Introduced comprehensive test for cub::DeviceCopy::Copy with mdspan views
  • Tests 1D, 2D (layout_left), and 4D tensor shapes
  • Validates across multiple execution environment types with proper error checking and device synchronization
  • Verifies correctness through device-to-host vector comparison

Impact

The refactoring enables:

  • Environment-based configuration and tuning for DeviceCopy APIs
  • Consistent API design across the cub library
  • Support for advanced execution policies beyond simple stream management
  • Backward compatibility through default template parameters

Walkthrough

The PR converts cub::DeviceCopy::Batched and mdspan cub::DeviceCopy::Copy from taking explicit cudaStream_t to accepting execution environments as const references (defaulting to cuda::std::execution::env<>). Implementations dispatch through environment-aware helpers and pass resolved streams/policies to underlying kernels. Tests expand to cover both internal-allocation and user-provided temporary storage flows across multiple execution/stream environment types.

Changes

DeviceCopy environment-based API standardization

Layer / File(s) Summary
DeviceCopy::Batched environment integration
cub/cub/device/device_copy.cuh
Temp-storage overload adds EnvT template parameter and const EnvT& env argument in place of cudaStream_t stream, dispatching via detail::dispatch_with_env_and_tuning; force-inlined overload tightens env from by-value to const reference; documentation updated.
DeviceCopy::Copy mdspan environment integration
cub/cub/device/device_copy.cuh
Temp-storage overload replaces cudaStream_t stream with EnvT template and const EnvT& env parameter, retains mdspan assertions and early-return behavior, calls detail::copy_mdspan::copy(mdspan_in, mdspan_out, env); force-inlined overload tightens env to const reference; documentation updated.
Mdspan copy dispatch const reference tightening
cub/cub/device/dispatch/dispatch_copy_mdspan.cuh
detail::copy_mdspan::copy function parameter changed from by-value to const EnvT& env to align with updated caller contract.
Batched copy test expansion
cub/test/catch2_test_device_copy_batched.cu
Adds cuda/std/execution header; expands test with two sections: basic environment invoke and user-provided temp-storage query/allocation flow with CUDA error and synchronization checks, exercised across raw stream, cuda::stream, cuda::stream_ref, cuda::std::execution::env, and cuda::execution::gpu variants.
Mdspan copy test expansion
cub/test/catch2_test_device_copy_mdspan.cu
Adds cuda/devices and cuda/std/execution headers; introduces comprehensive test covering 1D, 2D (layout_left), and 4D mdspan copies with 1-byte user-provided temp storage, CUDA error/sync validation, and device vector assertions across stream, cuda::stream, cuda::stream_ref, cuda::std::execution::env, and cuda::execution::gpu execution variants.

Possibly related PRs

  • NVIDIA/cccl#9318: Shared migration of dispatch helpers and public APIs (DeviceFind, DeviceCopy) to const-reference environment parameters via detail::dispatch_with_env_and_tuning.

Suggested reviewers

  • gevtushenko

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cub/cub/device/device_copy.cuh (1)

314-320: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

important: @param[in] stream is stale here. Line 335 replaced the stream parameter with const EnvT& env, so the generated API docs now describe a removed parameter and the new one in the same overload. As per coding guidelines, Doxygen @param descriptions must accurately reflect the current functionality.

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9b1775a1-0b9f-41dd-a733-7f7a930d89ff

📥 Commits

Reviewing files that changed from the base of the PR and between 1a093f9 and d3e18fc.

📒 Files selected for processing (7)
  • cub/cub/detail/env_dispatch.cuh
  • cub/cub/device/device_copy.cuh
  • cub/cub/device/device_find.cuh
  • cub/test/catch2_test_device_copy_batched.cu
  • cub/test/catch2_test_device_copy_mdspan.cu
  • cub/test/catch2_test_device_find_env.cu
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h

Comment thread cub/test/catch2_test_device_copy_batched.cu
Comment thread cub/test/catch2_test_device_copy_mdspan.cu
Comment thread cub/test/catch2_test_device_find_env.cu Outdated
@miscco miscco force-pushed the device_copy_batched branch 2 times, most recently from 61979f9 to a3bb2f3 Compare June 12, 2026 09:23

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cub/cub/device/device_copy.cuh (1)

336-354: 💤 Low value

suggestion: d_temp_storage and temp_storage_bytes are unused when d_temp_storage != nullptr. Line 354 calls detail::copy_mdspan::copy(mdspan_in, mdspan_out, env) without forwarding the storage parameters. The user-provided memory is effectively ignored.

If mdspan copy truly requires no temp storage, consider documenting this in the Doxygen (e.g., "Note: This operation requires minimal temporary storage (1 byte) which is not actually used.") to avoid confusing users who allocate and provide storage expecting it to be utilized.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8dd691a0-c710-49d1-9020-8315e4e19f81

📥 Commits

Reviewing files that changed from the base of the PR and between 61979f9 and a3bb2f3.

📒 Files selected for processing (3)
  • cub/cub/device/device_copy.cuh
  • cub/test/catch2_test_device_copy_batched.cu
  • cub/test/catch2_test_device_copy_mdspan.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/test/catch2_test_device_copy_batched.cu

@github-actions

This comment has been minimized.

We want to be able to pass tunings even to the APIs that currently only take a stream.

Refactor so that we can pass an arbitrary environment to those APIs that take user provided memory
@miscco miscco force-pushed the device_copy_batched branch from 152d45c to 881ee39 Compare June 12, 2026 11:07
@github-actions

Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 1h 56m: Pass: 38%/287 | Total: 2d 04h | Max: 42m 36s | Hits: 76%/184645

See results here.

Comment on lines -159 to -162
// Integer type large enough to hold any offset in [0, num_thread_blocks_launched), where a safe
// upper bound on num_thread_blocks_launched can be assumed to be given by
// IDIV_CEIL(num_ranges, 64)
using BlockOffsetT = uint32_t;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why do we need to remove this comment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants