AIRUNTIME-171 - cooperative groups scan by g-h-c · Pull Request #5914 · ROCm/rocm-systems

g-h-c · 2026-05-08T15:20:51Z

Motivation

Implement cooperative_groups::inclusive_scan() and exclusive_scan()
Documentation: #7254
HIPIFY related ticket: ROCm/HIPIFY#2448

Technical Details

This is an initial implementation of the scan operators for cooperative groups. Additional PRs will add optimizations for upcoming architectures.
This PR refactors the cooperative_groups::reduce tests so they can also be used to test scans, as they test very similar operations (a reduce is the same as an inclusive_scan, except that every lane receives the result the last lane would return).
The implementation is different from cooperative_groups::reduce(). The main difference is that cg::reduce() is based on the __reduce_*_sync() operators, whereas the scans do not have an equivalent warp intrinsic. Because of that, the macro GENERATE_SCAN_FUNC() is introduced, which generates equivalent warp-wide scans based on the OCKL intrinsics.
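
The relationship between reduce and the two scans can be illustrated with a small host-side reference model. This is only a sketch, not the device implementation from this PR: the function names (inclusiveScanRef, exclusiveScanRef, reduceRef) are hypothetical, each vector element stands for the value contributed by one lane, and the value returned to the first lane of the exclusive scan is passed in explicitly because it depends on the operator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model (hypothetical names, not the HIP device code).
// Element i of `lanes` is the value contributed by lane i of the group.
template <typename T, typename Op>
std::vector<T> inclusiveScanRef(const std::vector<T>& lanes, Op op) {
  std::vector<T> out(lanes.size());
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    // Lane i combines lanes 0..i; the running result is always the left operand,
    // which matters for non-commutative operators.
    out[i] = (i == 0) ? lanes[0] : op(out[i - 1], lanes[i]);
  }
  return out;
}

// Lane i receives the combination of lanes 0..i-1. The first lane has nothing
// to combine, so its value is supplied by the caller (an assumption of this sketch).
template <typename T, typename Op>
std::vector<T> exclusiveScanRef(const std::vector<T>& lanes, Op op, T firstLaneValue) {
  const std::vector<T> inclusive = inclusiveScanRef(lanes, op);
  std::vector<T> out(lanes.size());
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    out[i] = (i == 0) ? firstLaneValue : inclusive[i - 1];
  }
  return out;
}

// A reduce hands every lane the value the last lane of the inclusive scan holds.
// Precondition: `lanes` is not empty.
template <typename T, typename Op>
T reduceRef(const std::vector<T>& lanes, Op op) {
  return inclusiveScanRef(lanes, op).back();
}
```

With addition over the lane values 1, 2, 3, 4, the inclusive scan yields 1, 3, 6, 10, the exclusive scan (first lane value 0) yields 0, 1, 3, 6, and the reduce yields 10 in every lane.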

JIRA ID

Resolves AIRUNTIME-171

Test Plan

Essentially the same tests as cg::reduce(), but now checking the result per lane rather than a single value, as each lane in a scan usually produces a different result.
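
A per-lane check of this kind could look roughly like the following host-side sketch. It is not the helper from warp_common.hh: firstMismatchedLane is a hypothetical name, the operator is fixed to integer addition, and the group is described by a 64-bit mask of participating lanes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical checker: walks the participating lanes in order, recomputes the
// expected inclusive sum for each one, and returns the first lane whose actual
// result differs, or -1 if every participating lane matches.
// Precondition: actual.size() >= input.size().
int firstMismatchedLane(const std::vector<int>& input, const std::vector<int>& actual,
                        std::uint64_t mask) {
  int running = 0;
  const int lanes = static_cast<int>(input.size() < 64 ? input.size() : 64);
  for (int lane = 0; lane < lanes; ++lane) {
    if (((std::uint64_t{1} << lane) & mask) == 0) continue;  // lane is not in the group
    running += input[lane];                  // expected inclusive result for this lane
    if (actual[lane] != running) return lane;
  }
  return -1;
}
```

Lanes outside the mask are skipped entirely, so whatever they hold in the output buffer does not affect the verdict.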

Test Result

Tested on Navi48, all tests pass. Also see attached benchmark.
scan_benchmark_Navi48.txt

Tests on MI300X pass successfully both with and without address sanitizer.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Implements initial cooperative_groups::inclusive_scan() / exclusive_scan() support for HIP on AMD, and refactors/extends existing cooperative groups reduce tests and benchmarks to validate per-lane scan behavior across tiled and coalesced groups.

Changes:

Added new public hip_scan.h entry point and AMD device-side scan implementation backed by OCKL wfscan intrinsics with a fallback path.
Refactored shared warp/test utilities to compute and report expected per-lane aggregation results, and expanded cooperative groups tests to cover scan variants.
Extended performance benchmark harness to run cooperative-groups scans alongside existing reduce benchmarks.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

File	Description
projects/hip/include/hip/cooperative_groups/hip_scan.h	Adds public scan API header routing to AMD/NVIDIA backends.
projects/clr/hipamd/include/hip/amd_detail/amd_device_functions.h	Introduces `GENERATE_SCAN_FUNC` macro for OCKL `wfscan` wrappers.
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h	Implements AMD cooperative groups inclusive/exclusive scan.
projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp16.h	Adds f16 `wfscan` wrappers via `GENERATE_SCAN_FUNC`.
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h	Adds group traits/mask helper and shared bpermute helper used by scan.
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_reduce.h	Refactors reduce to reuse shared `groupMask`.
projects/hip-tests/catch/include/warp_common.hh	Refactors expected-value computation to support scans and improves mismatch diagnostics.
projects/hip-tests/catch/unit/warp/warp_reduce.cc	Updates reduce unit tests to use new per-lane expected computation helpers.
projects/hip-tests/catch/unit/rtc/rtc_coop.cc	Updates RTC reduce test expected computation plumbing for shared reduce/scan helpers.
projects/hip-tests/catch/unit/cooperativeGrps/cooperative_groups_common.hh	Includes new `hip_scan.h` for cooperative group tests.
projects/hip-tests/catch/unit/cooperativeGrps/thread_block_tile.cc	Refactors reduce tests into generic aggregation tests and adds scan test coverage.
projects/hip-tests/catch/performance/warpSync/warpSync.cc	Extends benchmarks to run cooperative-groups scans and refactors mask handling.
projects/hip-tests/catch/config/configs/unit/cooperativeGrps.yaml	Registers new scan tests in the unit test config.
projects/hip-tests/catch/hipTestMain/config/config_amd_linux	Adds an AMD Linux disabled-test entry affecting an existing reduce test.

Comments suppressed due to low confidence (6)

projects/hip-tests/catch/include/warp_common.hh:536

printMismatch uses (1ul << i) with a 64-bit mask. This has the same 32-bit unsigned long shift/UB problem for lane indices >= 32 on LLP64 platforms; use 1ull (or uint64_t{1}) for the bit test.

  for (int i = 0; i < getWarpSize(); i++) {
    if ((1ul << i) & mask) {
      if constexpr (std::is_same<T, __half>::value) {

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:146

maskIdx = __popcll(((1ul << laneId) - 1) & mask) should use a 64-bit literal (1ull) to avoid undefined behavior when laneId >= 32 on platforms where unsigned long is 32-bit. This affects correctness for wavefront size 64 and can also break compilation with aggressive UB sanitizers.

    mask = impl::groupMask(group);
    maskNumBits = __popcll(mask);
    maskIdx = __popcll(((1ul << laneId) - 1) & mask);

    if (laneId) {
      mask <<= 64 - laneId;
    }
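
The 64-bit form of the bit test and of the lane-rank computation discussed in the two comments above can be sketched on the host as follows. The helper names are hypothetical, and a portable loop stands in for __popcll so the example compiles without HIP.

```cpp
#include <cassert>
#include <cstdint>

// Portable population count (stand-in for __popcll in this host-side sketch).
int popcount64(std::uint64_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;  // clear the lowest set bit
    ++n;
  }
  return n;
}

// True if `lane` (0..63) participates in `mask`. The shifted constant is
// explicitly 64 bits wide, so lanes 32..63 are handled on every platform.
bool laneInMask(std::uint64_t mask, int lane) {
  return ((std::uint64_t{1} << lane) & mask) != 0;
}

// Number of participating lanes strictly below `laneId` (0..63), i.e. the
// rank of the lane within the group described by `mask`.
int rankInMask(std::uint64_t mask, int laneId) {
  return popcount64(((std::uint64_t{1} << laneId) - 1) & mask);
}
```

For a group occupying lanes 32 to 39, lane 35 has rank 3 because lanes 32, 33 and 34 precede it.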

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:166

The OCKL fast-path condition uses has_boolean_scan<Val, Op> but has_boolean_scan is defined as a single-type trait (has_boolean_scan<T, ...> with the second template arg intended for SFINAE). Passing Op as the second parameter forces the primary template and makes the trait always false, so boolean scans never take the intended intrinsic path. Use has_boolean_scan<Val>::value here (and in the similar coalesced-group branch).

      // not; if the block tile is actually the whole warp
      if (impl::tiledGroupSize<TyGroup>::value == warpSize) {
        if constexpr (impl::isArithmeticFunc<Val, TyFn >::value && impl::has_arithmetic_scan<Val>::value ||
                      impl::isBooleanFunc<Val, TyFn >::value && impl::has_boolean_scan<Val, Op>::value) {
          return impl::call_scan<Val, Op, Inclusive>(val);
        }
      }
    } else if constexpr (impl::isCoalescedGroup<TyGroup>::value) {
      // for the coalesced_group case we do need to check at runtime, adding a slight overhead on
      // this branch
      if (maskNumBits == warpSize) {
        if constexpr (impl::isArithmeticFunc<Val, TyFn >::value && impl::has_arithmetic_scan<Val>::value ||
                      impl::isBooleanFunc<Val, TyFn >::value && impl::has_boolean_scan<Val, Op>::value) {
          return impl::call_scan<Val, Op, Inclusive>(val);
        }

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:253

In the exclusive-scan return path, __builtin_clzll(mask) is evaluated after mask <<= 64 - laneId without first checking whether the shifted mask is nonzero. For the first participating lane when the first active bit is not lane 0 (e.g. a tile starting at lane 32), the shifted mask becomes 0 and __builtin_clzll(0) is undefined behavior. Guard the clzll call (e.g. compute a prev_mask and early-return the zero/identity value when it is 0).

      int nextBit = laneId;

      mask = impl::groupMask(group);

      if (laneId) {
        mask <<= 64 - laneId;
        nextBit -= __builtin_clzll(mask) + 1;
      } else {
        mask = 0ull;
      }
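
One way to guard the count-leading-zeros call is sketched below for the host. previousActiveLane is a hypothetical helper, and returning -1 for "no earlier lane" is an assumption of this sketch; the caller would then return the identity value, as the comment suggests.

```cpp
#include <cassert>
#include <cstdint>

// Returns the closest participating lane strictly below `laneId` (0..63),
// or -1 if the group has no such lane.
int previousActiveLane(std::uint64_t mask, int laneId) {
  if (laneId <= 0) return -1;
  // Keep only the lanes below laneId.
  const std::uint64_t below = mask & ((std::uint64_t{1} << laneId) - 1);
  if (below == 0) return -1;  // __builtin_clzll(0) is undefined, so never reach it
  // GCC/Clang builtin: the highest set bit is at index 63 - (leading zeros).
  return 63 - __builtin_clzll(below);
}
```

For a tile that starts at lane 32, lane 32 itself gets -1 instead of invoking the builtin on zero, which is the case the comment singles out.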

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:225

The scan implementation uses reinterpret_cast<Val*> on unsigned int storage (result/permuteResult) and also returns via *reinterpret_cast<Val*>(&result). This is undefined behavior due to strict-aliasing and may be misaligned for types with alignment > 4 (e.g. double, 64-bit integers, or user-defined structs with 8/16-byte alignment). Use __builtin_memcpy (or std::bit_cast when available) to load/store Val values instead of pointer casts in both the combine step and the final return.


      if (insideLanes) {
        if constexpr (!isPrimitiveType) {
          Val toReturn;
          toReturn = op(*reinterpret_cast<Val*>(result), *reinterpret_cast<Val*>(permuteResult));
          __builtin_memcpy(result, &toReturn, sizeof(Val));
        } else if constexpr (sizeof(Val) == 4 || sizeof(Val) == 2) {
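
The memcpy-based alternative suggested in this comment can be sketched as follows. The helpers (wordsFor, loadVal, storeVal, combineInPlace) are hypothetical and written for the host with std::memcpy; the sketch assumes Val is trivially copyable and default-constructible, and that each buffer holds at least wordsFor<Val>() words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Number of unsigned int words needed to hold one Val (size rounded up).
template <typename Val>
constexpr std::size_t wordsFor() {
  return (sizeof(Val) + sizeof(unsigned int) - 1) / sizeof(unsigned int);
}

// Copies the bytes out of the word buffer instead of casting the pointer, so
// the access does not depend on the buffer's alignment or declared type.
template <typename Val>
Val loadVal(const unsigned int* words) {
  static_assert(std::is_trivially_copyable<Val>::value, "Val must be trivially copyable");
  Val v;
  std::memcpy(&v, words, sizeof(Val));
  return v;
}

// Copies exactly sizeof(Val) bytes, never reading past the end of `v`.
template <typename Val>
void storeVal(unsigned int* words, const Val& v) {
  static_assert(std::is_trivially_copyable<Val>::value, "Val must be trivially copyable");
  std::memcpy(words, &v, sizeof(Val));
}

// Combine step: result = op(result, other), with both operands held in word buffers.
template <typename Val, typename Op>
void combineInPlace(unsigned int* result, const unsigned int* other, Op op) {
  const Val combined = op(loadVal<Val>(result), loadVal<Val>(other));
  storeVal(result, combined);
}
```

Compilers typically lower small fixed-size memcpy calls to plain register moves, so this form usually costs nothing compared with the pointer cast.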

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:126

amd_hip_cooperative_groups_scan.h uses cooperative_groups::impl::is_param_type_same, but that alias is only defined in amd_hip_cooperative_groups_reduce.h. Including hip_scan.h without including hip_reduce.h first will fail to compile. Define is_param_type_same in a shared header (or locally in this header) instead of relying on include order.

    // the number of backward permutes will be: size(Val) / 4 rounded up
    static constexpr int kNumOfPermutes = (sizeof(Val) <= 4)?
                                          1 :
                                          (sizeof(Val) + sizeof(unsigned int) - 1) / sizeof(unsigned int);
    static_assert(cooperative_groups::impl::is_param_type_same<Val, decltype(op(val, val))>::value, "Operator input and output types differ");
    static_assert(__hip_internal::is_trivially_copyable<Val>::value, "val must be trivially copyable");
    static_assert(sizeof(Val) <= 32, "scan only operate on values of size up to 32 bytes");

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

chrispaquot

Please fix the CoPilot comments. @b-sumner and @yxsamliu should also review this.

g-h-c · 2026-06-01T17:27:24Z

@chrispaquot Comments resolved, but I am still getting some failures on hiprtc that I need to investigate

github-actions Bot added project: clr project: hip project: hip-tests labels May 8, 2026

systems-assistant Bot added the organization: ROCm label May 8, 2026

g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch 2 times, most recently from cac71bf to 4202990 Compare May 14, 2026 14:48

g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from 0c233a2 to 5315492 Compare May 19, 2026 12:01

g-h-c marked this pull request as ready for review May 19, 2026 14:54

g-h-c requested review from a team as code owners May 19, 2026 14:54

Copilot AI review requested due to automatic review settings May 19, 2026 14:54

g-h-c requested review from a team as code owners May 19, 2026 14:54

Copilot started reviewing on behalf of g-h-c May 19, 2026 14:56

Copilot AI reviewed May 19, 2026

g-h-c requested a review from Copilot May 20, 2026 14:00

Copilot started reviewing on behalf of g-h-c May 20, 2026 14:02

g-h-c mentioned this pull request May 20, 2026

[HIPIFY][ROCM-1254] Add cooperative_groups reduce and scans support ROCm/HIPIFY#2448

Open

1 task

Copilot AI reviewed May 20, 2026

g-h-c requested a review from Copilot May 20, 2026 15:35

Copilot started reviewing on behalf of g-h-c May 20, 2026 15:35

Copilot AI reviewed May 20, 2026

g-h-c requested a review from a team as a code owner May 21, 2026 10:40

chrispaquot reviewed May 21, 2026

g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from 92f145f to acad615 Compare June 1, 2026 11:59

Copilot started work on behalf of g-h-c June 1, 2026 16:13

Copilot finished work on behalf of g-h-c June 1, 2026 16:25

g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from 27c661e to d9201f3 Compare June 12, 2026 14:00

g-h-c requested a review from b-sumner June 15, 2026 11:35

g-h-c added 26 commits June 26, 2026 15:07

AIRUNTIME-171 - execute Unit_Thread_Block_Tile_Exclusive_Scan_Basic for 'half' and cooperative_groups::less too, to make sure the first active lane returns the identity in that case

eaf2c9d

AIRUNTIME-171 - fix up previous commit, std::numeric_limits only needs to be used for cooperative_groups::less

47921ba

AIRUNTIME-171 - fix value for NumericLimits<float>::lowest()

cc332cd

AIRUNTIME-171 - fix Unit_Rtc_CoopReduce being too verbose

cfa954b

AIRUNTIME-171 - make sure the operands and the return value are aligned at the right alignment in cooperative_groups operations

e8516db

AIRUNTIME-171 - prevent memcpy() in cooperative_groups::scan() to read past variables of type Val

3142b93

AIRUNTIME-171 - fix memcpy() reading past Val also for reduce operations

12e99e6

AIRUNTIME-171 - make cooperative_groups::exclusive_scan() results for the first active lane to mimic what ockl_wfscan does

4d5bea1

Fix expected values for cg::exlusive_scan() for the first active lane

5dd6707

AIRUNTIME-171 - cosmetic changes

ea17dc2

AIRUNTIME-171 - rename GENERATE_SCAN_FUNC as HIP_IMPL_GENERATE_SCAN_FUNC to emphasize that is a private AMD implementation detail

cf2f101

AIRUNTIME-171 - fix typo in cooperativeGrps.yaml

7a62b59

AIRUNTIME-171 - fix Max::operator() contained an unnecessary extra loop

e4461bd

AIRUNTIME-171 - fix whitespace errors

ae7d88f

AIRUNTIME-171 - add missing nvidia_hip_cooperative_groups_scan.h

a533ae6

AIRUNTIME-171 - add cg::inclusive_scan/exclusive_scan() overloads that do not take the operator

041bc79

AIRUNTIME-171 - fix Unit_Thread_Block_Tile_Inclusive_Scan_Basic failures after introducing AggregationType::InclusiveScanDefault and ExclusiveScanDefault

fc4cadf

AIRUNTIME-171 - fix that cooperative_groups/scan.h would not compile unless cooperative_groups/reduce.h was included earlier

ec5857e

AIRUNTIME-171 - add test Unit_Thread_Block_Coalesced_Scan_Partition

78e0ee3

AIRUNTIME-171 - fix that scan should use coalesced_info.member_mask and not __activemask()

4761780

AIRUNTIME-171 - add a test for partitioned thread_block_tiles too

0217805

AIRUNTIME-171 - fix order of parameters when applying scan operations. Add test Unit_Thread_Block_Tile_Scan_Custom_Op which tests this behaviour using a non-commutative operator

c224faa

AIRUNTIME-171 - return {} as identity for custom types as the result of the first active lane in exclusive_scan, to be more similar to CUDA

56fcc84

AIRUNTIME-171 - change cooperative_group::impl::buildMask() to use internal::workgroup::thread_rank() as cooperative_groups operations were generating a warp mask that was dependent on threadIdx.x when that would not work on 2D or 3D blocks

1aabf99

SWDEV-515087 - replace __lane_id() with a manual calculation; otherwise the 'Build clr + hip-tests (NVIDIA)' CI run would fail to compile

7baca86

ROCM-26483 - make sure cooperative_groups::reduce() will now work with 2D or 3D tiles

bc62602

g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from c5d8a9e to bc62602 Compare June 26, 2026 14:07

g-h-c merged commit 5b49d41 into develop Jun 26, 2026
35 checks passed

g-h-c deleted the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch June 26, 2026 21:17

g-h-c mentioned this pull request Jun 29, 2026

ROCM-26483 - fix cooperative_groups::reduce() would not work with 2D or 3D tiles #7588

Closed

1 task

8 participants
