AIRUNTIME-171 - cooperative groups scan #5914

Merged
g-h-c merged 100 commits into develop from users/g-h-c/AIRUNTIME-171_cooperative_groups_scan
Jun 26, 2026

@g-h-c g-h-c commented May 8, 2026

scan_benchmark_Navi48.txt

Motivation

Implement cooperative_groups::inclusive_scan() and exclusive_scan()
Documentation: #7254
HIPIFY related ticket: ROCm/HIPIFY#2448

Technical Details

  • This is an initial implementation of the scan operators for cooperative groups. Additional PRs will add optimizations for upcoming architectures.
  • This PR refactors the cooperative_groups::reduce tests so they can also be used to test scans, as they test very similar operations (a reduce is the same as an inclusive_scan, except that every lane returns the result that the last lane would return).
  • The implementation differs from cooperative_groups::reduce(). The main difference is that cg::reduce() is based on the __reduce_*_sync() operators, whereas the scans have no equivalent warp intrinsic. Because of that, the macro GENERATE_SCAN_FUNC() is introduced, which generates equivalent warp-wide scans based on the OCKL intrinsics.
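
The relationship between reduce and the two scans described above can be sketched on the host. This is a reference model in plain C++, not the device code from this PR. The names referenceInclusiveScan, referenceExclusiveScan and referenceReduce are hypothetical, and the value the first lane receives in the exclusive scan is passed in explicitly because the sketch makes no assumption about what the real API returns there.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model (hypothetical helper names, not the PR's code).
// `lanes` holds one input value per participating lane, in lane order.

// Inclusive scan: lane i receives op(lanes[0], ..., lanes[i]),
// combined left to right so non-commutative operators keep lane order.
template <typename T, typename Op>
std::vector<T> referenceInclusiveScan(const std::vector<T>& lanes, Op op) {
  std::vector<T> out(lanes.size());
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    out[i] = (i == 0) ? lanes[0] : op(out[i - 1], lanes[i]);
  }
  return out;
}

// Exclusive scan: lane i receives the inclusive result of lane i - 1.
// `first` is what the first lane gets; passing it in is an assumption of
// this sketch, not a statement about the real API.
template <typename T, typename Op>
std::vector<T> referenceExclusiveScan(const std::vector<T>& lanes, Op op, T first) {
  const std::vector<T> inclusive = referenceInclusiveScan(lanes, op);
  std::vector<T> out(lanes.size());
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    out[i] = (i == 0) ? first : inclusive[i - 1];
  }
  return out;
}

// Reduce: every lane receives what the last lane of the inclusive scan holds.
template <typename T, typename Op>
T referenceReduce(const std::vector<T>& lanes, Op op) {
  assert(!lanes.empty());
  return referenceInclusiveScan(lanes, op).back();
}
```

A per-lane test can compare each lane's device result against the matching element of these vectors, while a reduce test only needs the single value from referenceReduce.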

JIRA ID

Resolves AIRUNTIME-171

Test Plan

Essentially the same tests as for cg::reduce(), but now checking the result per lane as opposed to checking a single value, since each lane in a scan produces a different result most of the time.

Test Result

Tested on Navi48, all tests pass. Also see attached benchmark.
scan_benchmark_Navi48.txt

Tests on MI300X pass successfully both with and without address sanitizer.

Submission Checklist

@g-h-c g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch 2 times, most recently from cac71bf to 4202990 Compare May 14, 2026 14:48
@g-h-c g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from 0c233a2 to 5315492 Compare May 19, 2026 12:01
@g-h-c g-h-c marked this pull request as ready for review May 19, 2026 14:54
@g-h-c g-h-c requested review from a team as code owners May 19, 2026 14:54
Copilot AI review requested due to automatic review settings May 19, 2026 14:54

Copilot AI left a comment

Pull request overview

Implements initial cooperative_groups::inclusive_scan() / exclusive_scan() support for HIP on AMD, and refactors/extends existing cooperative groups reduce tests and benchmarks to validate per-lane scan behavior across tiled and coalesced groups.

Changes:

  • Added new public hip_scan.h entry point and AMD device-side scan implementation backed by OCKL wfscan intrinsics with a fallback path.
  • Refactored shared warp/test utilities to compute and report expected per-lane aggregation results, and expanded cooperative groups tests to cover scan variants.
  • Extended performance benchmark harness to run cooperative-groups scans alongside existing reduce benchmarks.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| projects/hip/include/hip/cooperative_groups/hip_scan.h | Adds public scan API header routing to AMD/NVIDIA backends. |
| projects/clr/hipamd/include/hip/amd_detail/amd_device_functions.h | Introduces GENERATE_SCAN_FUNC macro for OCKL wfscan wrappers. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h | Implements AMD cooperative groups inclusive/exclusive scan. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp16.h | Adds f16 wfscan wrappers via GENERATE_SCAN_FUNC. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h | Adds group traits/mask helper and shared bpermute helper used by scan. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_reduce.h | Refactors reduce to reuse shared groupMask. |
| projects/hip-tests/catch/include/warp_common.hh | Refactors expected-value computation to support scans and improves mismatch diagnostics. |
| projects/hip-tests/catch/unit/warp/warp_reduce.cc | Updates reduce unit tests to use new per-lane expected computation helpers. |
| projects/hip-tests/catch/unit/rtc/rtc_coop.cc | Updates RTC reduce test expected computation plumbing for shared reduce/scan helpers. |
| projects/hip-tests/catch/unit/cooperativeGrps/cooperative_groups_common.hh | Includes new hip_scan.h for cooperative group tests. |
| projects/hip-tests/catch/unit/cooperativeGrps/thread_block_tile.cc | Refactors reduce tests into generic aggregation tests and adds scan test coverage. |
| projects/hip-tests/catch/performance/warpSync/warpSync.cc | Extends benchmarks to run cooperative-groups scans and refactors mask handling. |
| projects/hip-tests/catch/config/configs/unit/cooperativeGrps.yaml | Registers new scan tests in the unit test config. |
| projects/hip-tests/catch/hipTestMain/config/config_amd_linux | Adds an AMD Linux disabled-test entry affecting an existing reduce test. |

Comments suppressed due to low confidence (6)

projects/hip-tests/catch/include/warp_common.hh:536

  • printMismatch uses (1ul << i) with a 64-bit mask. This has the same 32-bit unsigned long shift/UB problem for lane indices >= 32 on LLP64 platforms; use 1ull (or uint64_t{1}) for the bit test.
  for (int i = 0; i < getWarpSize(); i++) {
    if ((1ul << i) & mask) {
      if constexpr (std::is_same<T, __half>::value) {
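
The 64-bit literal fix suggested above can be illustrated with a small host-side helper. This is a minimal sketch, not the code in warp_common.hh; the name laneIsActive is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True if bit `lane` (0..63) is set in a 64-bit wavefront mask.
// std::uint64_t{1} keeps the shift 64 bits wide even on platforms where
// unsigned long is only 32 bits, so lanes 32..63 are tested correctly.
inline bool laneIsActive(std::uint64_t mask, unsigned lane) {
  assert(lane < 64);  // shifting by 64 or more would itself be undefined
  return ((std::uint64_t{1} << lane) & mask) != 0;
}
```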

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:146

  • maskIdx = __popcll(((1ul << laneId) - 1) & mask) should use a 64-bit literal (1ull) to avoid undefined behavior when laneId >= 32 on platforms where unsigned long is 32-bit. This affects correctness for wavefront size 64 and can also break compilation with aggressive UB sanitizers.
    mask = impl::groupMask(group);
    maskNumBits = __popcll(mask);
    maskIdx = __popcll(((1ul << laneId) - 1) & mask);

    if (laneId) {
      mask <<= 64 - laneId;
    }
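
The maskIdx computation counts how many group members sit below the current lane, which gives the lane's rank inside the group. A host-side sketch of that idea, assuming a 64-bit mask and using std::bitset in place of the device intrinsic __popcll (rankInGroup is a hypothetical name):

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Rank of `laneId` inside the group described by `mask`: the number of
// set bits strictly below bit `laneId`.
inline unsigned rankInGroup(std::uint64_t mask, unsigned laneId) {
  assert(laneId < 64);
  // 64-bit literal so the shift is well defined for laneId >= 32.
  const std::uint64_t lowerLanes = (std::uint64_t{1} << laneId) - 1;
  return static_cast<unsigned>(std::bitset<64>(lowerLanes & mask).count());
}
```

For a tile occupying lanes 32 to 47, lane 32 has rank 0 and lane 40 has rank 8.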

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:166

  • The OCKL fast-path condition uses has_boolean_scan<Val, Op> but has_boolean_scan is defined as a single-type trait (has_boolean_scan<T, ...> with the second template arg intended for SFINAE). Passing Op as the second parameter forces the primary template and makes the trait always false, so boolean scans never take the intended intrinsic path. Use has_boolean_scan<Val>::value here (and in the similar coalesced-group branch).
      // not; if the block tile is actually the whole warp
      if (impl::tiledGroupSize<TyGroup>::value == warpSize) {
        if constexpr (impl::isArithmeticFunc<Val, TyFn >::value && impl::has_arithmetic_scan<Val>::value ||
                      impl::isBooleanFunc<Val, TyFn >::value && impl::has_boolean_scan<Val, Op>::value) {
          return impl::call_scan<Val, Op, Inclusive>(val);
        }
      }
    } else if constexpr (impl::isCoalescedGroup<TyGroup>::value) {
      // for the coalesced_group case we do need to check at runtime, adding a slight overhead on
      // this branch
      if (maskNumBits == warpSize) {
        if constexpr (impl::isArithmeticFunc<Val, TyFn >::value && impl::has_arithmetic_scan<Val>::value ||
                      impl::isBooleanFunc<Val, TyFn >::value && impl::has_boolean_scan<Val, Op>::value) {
          return impl::call_scan<Val, Op, Inclusive>(val);
        }
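
The trait problem described in this comment can be reproduced in isolation. The following sketch uses a hypothetical trait hasFastScan and hypothetical types, and only assumes that has_boolean_scan follows the usual "defaulted second parameter for SFINAE" pattern that the comment describes; it is not the definition from the PR.

```cpp
#include <type_traits>

// Hypothetical types standing in for "has an intrinsic scan" / "has none".
struct WithFastScan {
  static int fastScan(int v) { return v; }
};
struct WithoutFastScan {};
struct SomeOp {};

// Primary template: the second parameter exists only to host the SFINAE check.
template <typename T, typename = void>
struct hasFastScan : std::false_type {};

// Partial specialization: matches only when the second argument is `void`
// and the expression T::fastScan(0) is well-formed.
template <typename T>
struct hasFastScan<T, decltype(void(T::fastScan(0)))> : std::true_type {};

static_assert(hasFastScan<WithFastScan>::value, "detected as expected");
static_assert(!hasFastScan<WithoutFastScan>::value, "no fastScan member");
// Supplying a second argument that is not `void` can never match the
// specialization, so the primary template (false) is always selected.
static_assert(!hasFastScan<WithFastScan, SomeOp>::value, "specialization bypassed");
```

The last assertion shows the failure mode: the type does support the fast path, yet the trait reports false because the extra template argument bypasses the specialization.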

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:253

  • In the exclusive-scan return path, __builtin_clzll(mask) is evaluated after mask <<= 64 - laneId without first checking whether the shifted mask is nonzero. For the first participating lane when the first active bit is not lane 0 (e.g. a tile starting at lane 32), the shifted mask becomes 0 and __builtin_clzll(0) is undefined behavior. Guard the clzll call (e.g. compute a prev_mask and early-return the zero/identity value when it is 0).
      int nextBit = laneId;

      mask = impl::groupMask(group);

      if (laneId) {
        mask <<= 64 - laneId;
        nextBit -= __builtin_clzll(mask) + 1;
      } else {
        mask = 0ull;
      }
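
One way to apply the suggested guard is to isolate the bits below the current lane first and return early when none are set. The sketch below is host-side C++ with a hypothetical helper name (previousActiveLane) and a portable loop instead of the builtin; on GCC and Clang, 63 - __builtin_clzll(below) gives the same result whenever below is nonzero.

```cpp
#include <cassert>
#include <cstdint>

// Index of the closest active lane below `laneId` in `mask`,
// or -1 when `laneId` is the first active lane (no predecessor).
inline int previousActiveLane(std::uint64_t mask, unsigned laneId) {
  assert(laneId < 64);
  const std::uint64_t below = mask & ((std::uint64_t{1} << laneId) - 1);
  if (below == 0) {
    return -1;  // guard: a count-leading-zeros call on 0 would be undefined
  }
  int highest = 63;
  while (((below >> highest) & 1u) == 0) {
    --highest;  // terminates because `below` has at least one set bit
  }
  return highest;
}
```

With a tile that starts at lane 32, lane 32 gets -1, and the caller can return the identity (or whatever the first lane should produce) without touching the shifted mask.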

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:225

  • The scan implementation uses reinterpret_cast<Val*> on unsigned int storage (result/permuteResult) and also returns via *reinterpret_cast<Val*>(&result). This is undefined behavior due to strict-aliasing and may be misaligned for types with alignment > 4 (e.g. double, 64-bit integers, or user-defined structs with 8/16-byte alignment). Use __builtin_memcpy (or std::bit_cast when available) to load/store Val values instead of pointer casts in both the combine step and the final return.

      if (insideLanes) {
        if constexpr (!isPrimitiveType) {
          Val toReturn;
          toReturn = op(*reinterpret_cast<Val*>(result), *reinterpret_cast<Val*>(permuteResult));
          __builtin_memcpy(result, &toReturn, sizeof(Val));
        } else if constexpr (sizeof(Val) == 4 || sizeof(Val) == 2) {
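
The memcpy-based alternative suggested in this comment might look like the following. It is a host-side sketch with hypothetical helper names (loadFromWords, storeToWords, combineInPlace), assuming the value is staged in an array of unsigned int words that is at least sizeof(Val) bytes long and that Val is default-constructible.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Read a Val out of word storage without forming a Val* to it, which
// avoids both the strict-aliasing and the alignment problem.
template <typename Val>
Val loadFromWords(const unsigned int* words) {
  static_assert(std::is_trivially_copyable<Val>::value, "Val must be trivially copyable");
  Val out;
  std::memcpy(&out, words, sizeof(Val));
  return out;
}

// Write a Val back into word storage, again byte-wise.
template <typename Val>
void storeToWords(unsigned int* words, const Val& value) {
  static_assert(std::is_trivially_copyable<Val>::value, "Val must be trivially copyable");
  std::memcpy(words, &value, sizeof(Val));
}

// Combine step: result = op(result, permuteResult), with both operands
// loaded by copy instead of through reinterpret_cast.
template <typename Val, typename Op>
void combineInPlace(unsigned int* result, const unsigned int* permuteResult, Op op) {
  storeToWords(result, op(loadFromWords<Val>(result), loadFromWords<Val>(permuteResult)));
}
```

Compilers typically lower these fixed-size copies to plain register moves, so the safer form does not have to cost anything at run time.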

projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:126

  • amd_hip_cooperative_groups_scan.h uses cooperative_groups::impl::is_param_type_same, but that alias is only defined in amd_hip_cooperative_groups_reduce.h. Including hip_scan.h without including hip_reduce.h first will fail to compile. Define is_param_type_same in a shared header (or locally in this header) instead of relying on include order.
    // the number of backward permutes will be: size(Val) / 4 rounded up
    static constexpr int kNumOfPermutes = (sizeof(Val) <= 4)?
                                          1 :
                                          (sizeof(Val) + sizeof(unsigned int) - 1) / sizeof(unsigned int);
    static_assert(cooperative_groups::impl::is_param_type_same<Val, decltype(op(val, val))>::value, "Operator input and output types differ");
    static_assert(__hip_internal::is_trivially_copyable<Val>::value, "val must be trivially copyable");
    static_assert(sizeof(Val) <= 32, "scan only operate on values of size up to 32 bytes");
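
The kNumOfPermutes expression in this excerpt is a rounded-up division of the value size by the word size. A standalone sketch of the same arithmetic, with a hypothetical function name and assuming a 4-byte unsigned int:

```cpp
#include <cstddef>

static_assert(sizeof(unsigned int) == 4, "this sketch assumes 32-bit words");

// Number of 32-bit words needed to carry a value of `bytes` bytes:
// one word for anything up to 4 bytes, otherwise bytes / 4 rounded up.
constexpr std::size_t wordsNeeded(std::size_t bytes) {
  return bytes <= 4 ? 1 : (bytes + sizeof(unsigned int) - 1) / sizeof(unsigned int);
}

static_assert(wordsNeeded(2) == 1, "half-sized values still need one word");
static_assert(wordsNeeded(8) == 2, "double or 64-bit integer");
static_assert(wordsNeeded(32) == 8, "largest size accepted by the static_assert above");
```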

Comment thread projects/hip-tests/catch/include/warp_common.hh Outdated
Comment thread projects/hip-tests/catch/unit/cooperativeGrps/thread_block_tile.cc Outdated
Comment thread projects/hip-tests/catch/unit/cooperativeGrps/thread_block_tile.cc Outdated
Comment thread projects/hip-tests/catch/performance/warpSync/warpSync.cc Outdated
Comment thread projects/hip-tests/catch/performance/warpSync/warpSync.cc Outdated
Comment thread projects/hip-tests/catch/hipTestMain/config/config_amd_linux Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Comment thread projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h Outdated
Comment thread projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h Outdated
Comment thread projects/hip-tests/catch/unit/cooperativeGrps/thread_block_tile.cc Outdated
Comment thread projects/hip-tests/catch/unit/rtc/rtc_coop.cc Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Comment thread projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h Outdated
Comment thread projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h Outdated
Comment thread projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h Outdated
Comment thread projects/hip-tests/catch/performance/warpSync/warpSync.cc Outdated
Comment thread projects/hip-tests/catch/unit/rtc/rtc_coop.cc Outdated
@g-h-c g-h-c requested a review from a team as a code owner May 21, 2026 10:40

@chrispaquot chrispaquot left a comment

Please fix the Copilot comments. @b-sumner and @yxsamliu should also review this.

@g-h-c g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from 92f145f to acad615 Compare June 1, 2026 11:59
Copilot finished work on behalf of g-h-c June 1, 2026 16:25
@g-h-c

g-h-c commented Jun 1, 2026

@chrispaquot Comments resolved, but I am still getting some failures on hiprtc that I need to investigate

@g-h-c g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from 27c661e to d9201f3 Compare June 12, 2026 14:00
@g-h-c g-h-c requested a review from b-sumner June 15, 2026 11:35
g-h-c added 26 commits June 26, 2026 15:07
…or 'half' and cooperative_groups::less too, to make sure the first active lane returns the identity in that case
…ed at the right alignment in cooperative_groups operations
… the first active lane to mimic what ockl_wfscan does
…UNC to emphasize that is a private AMD implementation detail
…res after introducing AggregationType::InclusiveScanDefault and ExclusiveScanDefault
…unless cooperative_groups/reduce.h was included earlier
…. Add test Unit_Thread_Block_Tile_Scan_Custom_Op which tests this behaviour using a non-commutative operator
…of the first active lane in exclusive_scan, to be more similar to CUDA
…ternal::workgroup::thread_rank() as cooperative_groups operations were generating a warp mask that was dependent on threadIdx.x when that would not work on 2D or 3D blocks
…se the 'Build clr + hip-tests (NVIDIA)' CI run would fail to compile
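
One of the commits above mentions that a warp mask derived from threadIdx.x alone does not work for 2D or 3D blocks. The usual remedy is to linearize the thread index first. The sketch below is plain host C++ with hypothetical names (Dim3, linearThreadRank, laneFromRank), not the internal function referenced in the commit message.

```cpp
// Hypothetical stand-in for the device-side dim3 / threadIdx / blockDim.
struct Dim3 {
  unsigned x, y, z;
};

// Linear rank of a thread inside its block: x varies fastest, then y, then z.
constexpr unsigned linearThreadRank(Dim3 idx, Dim3 dim) {
  return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Lane within the warp/wavefront that the linear rank falls into.
constexpr unsigned laneFromRank(unsigned rank, unsigned warpSize) {
  return rank % warpSize;
}

// In a 16x4 block, thread (3, 1, 0) has rank 19 and therefore lane 19
// for a warp size of 32, whereas threadIdx.x alone would suggest lane 3.
static_assert(linearThreadRank(Dim3{3, 1, 0}, Dim3{16, 4, 1}) == 19, "rank");
static_assert(laneFromRank(19, 32) == 19, "lane");
```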
@g-h-c g-h-c force-pushed the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch from c5d8a9e to bc62602 Compare June 26, 2026 14:07
@g-h-c g-h-c merged commit 5b49d41 into develop Jun 26, 2026
35 checks passed
@g-h-c g-h-c deleted the users/g-h-c/AIRUNTIME-171_cooperative_groups_scan branch June 26, 2026 21:17
8 participants