AIRUNTIME-171 - cooperative groups scan#5914
Conversation
cac71bf to
4202990
Compare
0c233a2 to
5315492
Compare
There was a problem hiding this comment.
Pull request overview
Implements initial cooperative_groups::inclusive_scan() / exclusive_scan() support for HIP on AMD, and refactors/extends existing cooperative groups reduce tests and benchmarks to validate per-lane scan behavior across tiled and coalesced groups.
Changes:
- Added new public
hip_scan.hentry point and AMD device-side scan implementation backed by OCKLwfscanintrinsics with a fallback path. - Refactored shared warp/test utilities to compute and report expected per-lane aggregation results, and expanded cooperative groups tests to cover scan variants.
- Extended performance benchmark harness to run cooperative-groups scans alongside existing reduce benchmarks.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/hip/include/hip/cooperative_groups/hip_scan.h | Adds public scan API header routing to AMD/NVIDIA backends. |
| projects/clr/hipamd/include/hip/amd_detail/amd_device_functions.h | Introduces GENERATE_SCAN_FUNC macro for OCKL wfscan wrappers. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h | Implements AMD cooperative groups inclusive/exclusive scan. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp16.h | Adds f16 wfscan wrappers via GENERATE_SCAN_FUNC. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h | Adds group traits/mask helper and shared bpermute helper used by scan. |
| projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_reduce.h | Refactors reduce to reuse shared groupMask. |
| projects/hip-tests/catch/include/warp_common.hh | Refactors expected-value computation to support scans and improves mismatch diagnostics. |
| projects/hip-tests/catch/unit/warp/warp_reduce.cc | Updates reduce unit tests to use new per-lane expected computation helpers. |
| projects/hip-tests/catch/unit/rtc/rtc_coop.cc | Updates RTC reduce test expected computation plumbing for shared reduce/scan helpers. |
| projects/hip-tests/catch/unit/cooperativeGrps/cooperative_groups_common.hh | Includes new hip_scan.h for cooperative group tests. |
| projects/hip-tests/catch/unit/cooperativeGrps/thread_block_tile.cc | Refactors reduce tests into generic aggregation tests and adds scan test coverage. |
| projects/hip-tests/catch/performance/warpSync/warpSync.cc | Extends benchmarks to run cooperative-groups scans and refactors mask handling. |
| projects/hip-tests/catch/config/configs/unit/cooperativeGrps.yaml | Registers new scan tests in the unit test config. |
| projects/hip-tests/catch/hipTestMain/config/config_amd_linux | Adds an AMD Linux disabled-test entry affecting an existing reduce test. |
Comments suppressed due to low confidence (6)
projects/hip-tests/catch/include/warp_common.hh:536
printMismatchuses(1ul << i)with a 64-bitmask. This has the same 32-bitunsigned longshift/UB problem for lane indices >= 32 on LLP64 platforms; use1ull(oruint64_t{1}) for the bit test.
for (int i = 0; i < getWarpSize(); i++) {
if ((1ul << i) & mask) {
if constexpr (std::is_same<T, __half>::value) {
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:146
maskIdx = __popcll(((1ul << laneId) - 1) & mask)should use a 64-bit literal (1ull) to avoid undefined behavior whenlaneId >= 32on platforms whereunsigned longis 32-bit. This affects correctness for wavefront size 64 and can also break compilation with aggressive UB sanitizers.
mask = impl::groupMask(group);
maskNumBits = __popcll(mask);
maskIdx = __popcll(((1ul << laneId) - 1) & mask);
if (laneId) {
mask <<= 64 - laneId;
}
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:166
- The OCKL fast-path condition uses
has_boolean_scan<Val, Op>buthas_boolean_scanis defined as a single-type trait (has_boolean_scan<T, ...>with the second template arg intended for SFINAE). PassingOpas the second parameter forces the primary template and makes the trait alwaysfalse, so boolean scans never take the intended intrinsic path. Usehas_boolean_scan<Val>::valuehere (and in the similar coalesced-group branch).
// not; if the block tile is actually the whole warp
if (impl::tiledGroupSize<TyGroup>::value == warpSize) {
if constexpr (impl::isArithmeticFunc<Val, TyFn >::value && impl::has_arithmetic_scan<Val>::value ||
impl::isBooleanFunc<Val, TyFn >::value && impl::has_boolean_scan<Val, Op>::value) {
return impl::call_scan<Val, Op, Inclusive>(val);
}
}
} else if constexpr (impl::isCoalescedGroup<TyGroup>::value) {
// for the coalesced_group case we do need to check at runtime, adding a slight overhead on
// this branch
if (maskNumBits == warpSize) {
if constexpr (impl::isArithmeticFunc<Val, TyFn >::value && impl::has_arithmetic_scan<Val>::value ||
impl::isBooleanFunc<Val, TyFn >::value && impl::has_boolean_scan<Val, Op>::value) {
return impl::call_scan<Val, Op, Inclusive>(val);
}
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:253
- In the exclusive-scan return path,
__builtin_clzll(mask)is evaluated aftermask <<= 64 - laneIdwithout first checking whether the shifted mask is nonzero. For the first participating lane when the first active bit is not lane 0 (e.g. a tile starting at lane 32), the shifted mask becomes 0 and__builtin_clzll(0)is undefined behavior. Guard theclzllcall (e.g. compute aprev_maskand early-return the zero/identity value when it is 0).
int nextBit = laneId;
mask = impl::groupMask(group);
if (laneId) {
mask <<= 64 - laneId;
nextBit -= __builtin_clzll(mask) + 1;
} else {
mask = 0ull;
}
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:225
- The scan implementation uses
reinterpret_cast<Val*>onunsigned intstorage (result/permuteResult) and also returns via*reinterpret_cast<Val*>(&result). This is undefined behavior due to strict-aliasing and may be misaligned for types with alignment > 4 (e.g.double, 64-bit integers, or user-defined structs with 8/16-byte alignment). Use__builtin_memcpy(orstd::bit_castwhen available) to load/storeValvalues instead of pointer casts in both the combine step and the final return.
if (insideLanes) {
if constexpr (!isPrimitiveType) {
Val toReturn;
toReturn = op(*reinterpret_cast<Val*>(result), *reinterpret_cast<Val*>(permuteResult));
__builtin_memcpy(result, &toReturn, sizeof(Val));
} else if constexpr (sizeof(Val) == 4 || sizeof(Val) == 2) {
projects/clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups_scan.h:126
amd_hip_cooperative_groups_scan.husescooperative_groups::impl::is_param_type_same, but that alias is only defined inamd_hip_cooperative_groups_reduce.h. Includinghip_scan.hwithout includinghip_reduce.hfirst will fail to compile. Defineis_param_type_samein a shared header (or locally in this header) instead of relying on include order.
// the number of backward permutes will be: size(Val) / 4 rounded up
static constexpr int kNumOfPermutes = (sizeof(Val) <= 4)?
1 :
(sizeof(Val) + sizeof(unsigned int) - 1) / sizeof(unsigned int);
static_assert(cooperative_groups::impl::is_param_type_same<Val, decltype(op(val, val))>::value, "Operator input and output types differ");
static_assert(__hip_internal::is_trivially_copyable<Val>::value, "val must be trivially copyable");
static_assert(sizeof(Val) <= 32, "scan only operate on values of size up to 32 bytes");
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
92f145f to
acad615
Compare
|
@chrispaquot Comments resolved, but I am still getting some failures on hiprtc that I need to investigate |
27c661e to
d9201f3
Compare
…or 'half' and cooperative_groups::less too, to make sure the first active lane returns the identity in that case
…s to be used for cooperative_groups::less
…ed at the right alignment in cooperative_groups operations
…d past variables of type Val
… the first active lane to mimic what ockl_wfscan does
…UNC to emphasize that is a private AMD implementation detail
…t do not take the operator
…res after introducing AggregationType::InclusiveScanDefault and ExclusiveScanDefault
…unless cooperative_groups/reduce.h was included earlier
…nd not __activemask()
…. Add test Unit_Thread_Block_Tile_Scan_Custom_Op which tests this behaviour using a non-commutative operator
…of the first active lane in exclusive_scan, to be more similar to CUDA
…ternal::workgroup::thread_rank() as cooperative_groups operations were generating a warp mask that was dependent on threadIdx.x when that would not work on 2D or 3D blocks
…se the 'Build clr + hip-tests (NVIDIA)' CI run would fail to compile
c5d8a9e to
bc62602
Compare
scan_benchmark_Navi48.txt
Motivation
Implement
cooperative_groups::inclusive_scan()andexclusive_scan()Documentation: #7254
HIPIFY related ticket: ROCm/HIPIFY#2448
Technical Details
cooperative_groups::reducetests so they can be used to test scans too, as the test very similar operations (a reduce is the same as ainclusive_scan, but it returns the result the last lane would return in all lanes)cooperative_groups::reduce(). The main difference is thatcg::reduce()is based on the _reduce*sync() operators where the scans do not have an equivalent warp intrinsic. Because of that, the macro GENERATE_SCAN_FUNC() is introduced, which generates equivalent warp-wide scans based on the OCKL intrinsicsJIRA ID
Resolves AIRUNTIME-171
Test Plan
Essentially the same tests as cg::reduce(), but now checking the result per lane as opposed as to checking a single value, as each lane in scan would produce different results most of the time.
Test Result
Tested on Navi48, all tests pass. Also see attached benchmark.
scan_benchmark_Navi48.txt
Tests on MI300X pass successfully both with and without address sanitizer.
Submission Checklist