Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
100 commits
Select commit Hold shift + click to select a range
249c30a
AIRUNTIME-171 - add initial implemention of cg::inclusive_scan. Only …
g-h-c Mar 30, 2026
198af3c
AIRUNTIME-171 - use __ockl_wfscan_add_i32() when __OPTIMIZE__ is defined
g-h-c Apr 16, 2026
7108afb
AIRUNTIME-171 - implement inclusive_scan for any thread block tile si…
g-h-c Apr 22, 2026
bc6b59f
AIRUNTIME-171 - make sure the OCKL intrinsic exist before using it
g-h-c Apr 24, 2026
c2f24ed
AIRUNTIME-171 - try to avoid the cost of the conditional check to see…
g-h-c Apr 24, 2026
46430e2
Fix assertion being triggered in __hip_check_mask() for some cooopera…
g-h-c Apr 28, 2026
9713078
AIRUNTIME-171 - make calculateExpected() also calculate expected resu…
g-h-c Apr 28, 2026
a21813e
AIRUNTIME-171 - add macro GENERATE_SCAN_FUNC() to generate the overlo…
g-h-c Apr 28, 2026
8a89634
AIRUNTIME-171 - refactor reduce tests so they can be reused for inclu…
g-h-c May 1, 2026
c1287e5
ROCM-1254 - remove Unit_Thread_Block_Tile_Reduce_Non_Participating_Th…
g-h-c Apr 15, 2026
8fefb10
Remove unused partialSum kernel (dead code with UB pattern)
Copilot Apr 16, 2026
cb177dd
AIRUNTIME-171 - make sure the expected result in a reduce is the resu…
g-h-c May 5, 2026
2cdf83c
AIRUNTIME-171 - fix that the reduction tree implementation in calcula…
g-h-c May 6, 2026
7bd6699
AIRUNTIME-171 - fix that floating point aggregations were not expecti…
g-h-c May 6, 2026
6c50a68
AIRUNTIME-171 - iterate the 'modulo' variable until nextPowerOf2(last…
g-h-c May 7, 2026
2eb0f0a
AIRUNTIME-171 - fix we were calling compareFloatingPoint() even for l…
g-h-c May 7, 2026
063bf5b
AIRUNTIME-171 - add support for cooperative_groups::less in cg::inclu…
g-h-c May 7, 2026
a809ed4
AIRUNTIME-171 - add Unit_Thread_Block_Tile_Scan_Random_arithmetic
g-h-c May 7, 2026
1c2ec54
AIRUNTIME-171 - extend cg::inclusive_scan() to support all types, not…
g-h-c May 8, 2026
c8c1e02
AIRUNTIME-171 - simplify code that invokes ockl scan intrinsics. Do n…
g-h-c May 11, 2026
44ab009
AIRUNTIME-171 - simplify mask generation logic
g-h-c May 11, 2026
bca8cea
AIRUNTIME-171 - fix up RTC tests compilation
g-h-c May 11, 2026
15621ce
AIRUNTIME-171 - add cg::exclusive_scan and some of the associated tests
g-h-c May 13, 2026
47af512
AIRUNTIME-171 - refactor duplicated code that calculate the group mask
g-h-c May 13, 2026
0240546
AIRUNTIME-171 - fix up previous commits
g-h-c May 13, 2026
c69a7f4
AIRUNTIME-171 - fix some scope Catch2 INFO() macro would have no effe…
g-h-c May 13, 2026
10aae09
AIRUNTIME-171 - fix up calculation of the expected value for cg::bit_xor
g-h-c May 13, 2026
6270127
AIRUNTIME-171 - implement cg::exclusive_scan() via a backward permute…
g-h-c May 13, 2026
08424e2
AIRUNTIME-171 - add more cg::exclusive_scan() tests
g-h-c May 13, 2026
f98e7f4
AIRUNTIME-171 - fix up one compareFloatingPoint() missing the Op temp…
g-h-c May 13, 2026
92da2b2
AIRUNTIME-171 - add bPermute() implementing backward permutes that ca…
g-h-c May 14, 2026
b7d08b4
AIRUNTIME-171 - start using bPermute. Remove function isPrimitiveType()
g-h-c May 14, 2026
b5d1bc0
AIRUNTIME-171 - use ockl functions for __half cooperative_group scans…
g-h-c May 14, 2026
f5e42d4
AIRUNTIME-171 - fix up add missing tests in cooperativeGrps.yaml. Fix…
g-h-c May 15, 2026
4756539
AIRUNTIME-171 - add Unit_Thread_Block_Tile_Scan_Trivially_Copyable_Pa…
g-h-c May 15, 2026
81d6d2c
AIRUNTIME-171 - add inclusive/exclusive_scan benchmarks
g-h-c May 15, 2026
c3a8807
AIRUNTIME-171 - add more information when the output of the reduce be…
g-h-c May 18, 2026
4b927e6
AIRUNTIME-171 - remove ExcludeFirst template parameter as all threads…
g-h-c May 18, 2026
fd1a566
AIRUNTIME-171 - remove mask generation duplicated code in warpSync.cc
g-h-c May 18, 2026
dddb19f
AIRUNTIME-171 - add Unit_Thread_Block_Tile_Scan_All_Parameter_Sizes
g-h-c May 19, 2026
b528231
AIRUNTIME-171 - remove outdated comments
g-h-c May 19, 2026
4c189e9
AIRUNTIME-171 - fix compilation error: std::memset() is not defined i…
g-h-c May 19, 2026
3ce865a
AIRUNTIME-171 - fix duplicated variable
g-h-c May 19, 2026
cdcb38c
AIRUNTIME-171 - fix integer constant not having the right suffix for …
g-h-c May 19, 2026
0af4069
AIRUNTIME-171 - fix use of undeclared variable
g-h-c May 19, 2026
c8434cc
AIRUNTIME-171 - refactor rtc cooperative groups tests so they can be …
g-h-c May 20, 2026
f33fdf7
AIRUNTIME-171 - add Unit_Rtc_CoopScan
g-h-c May 20, 2026
b6b6e6b
AIRUNTIME-171 - use 1ull instead of 1ul when dealing with 64-bit masks
g-h-c May 20, 2026
07a1916
AIRUNTIME-171 - 'unsigned long warpMask' should have been 'unsigned l…
g-h-c May 20, 2026
8276563
AIRUNTIME-171 - fix another use of 1ul when it should have 1ull. Also…
g-h-c May 20, 2026
e63e449
AIRUNTIME-171 - fix Unit_Thread_Block_Coalesced_Scan_boolean using ar…
g-h-c May 20, 2026
8c0be61
AIRUNTIME-171 - fix GENERATE_SCAN_FUNC() for fp16 when not in coopera…
g-h-c May 20, 2026
ee60c7b
AIRUNTIME-171 - fix potential deadlock in applyFunctor() and applySca…
g-h-c May 20, 2026
89c96c5
AIRUNTIME-171 - add missing semicolon
g-h-c May 20, 2026
a912a38
AIRUNTIM-171 - Fix another use of 1ul and not 1ull on mask calculations
g-h-c May 20, 2026
faf2464
AIRUNTIME-171 - fix Unit_Rtc_CoopScan failing because calculateExpect…
g-h-c May 20, 2026
1670709
AIRUNTIME-171 - fix OCKL intrinsics not being called for scan
g-h-c May 20, 2026
389e2e4
AIRUNTIME-171 - fix OCKL boolean scan intrinsics not being called
g-h-c May 20, 2026
e41de8e
Add hip_scan.h to hiprtc/CMakeLists.txt
g-h-c May 21, 2026
d812d73
AIRUNTIME-171 - fix that in cg::exclusive_scan the identity needs to …
g-h-c May 21, 2026
1b75779
AIRUNTIME-171 - implement __hip_internal::numeric_limits<T> using con…
g-h-c May 26, 2026
a2bc885
AIRUNTIME-171 - rename __hip_internal::numeric_limits<T>::max() to Nu…
g-h-c May 26, 2026
cd88bf8
AIRUNTIME-171 - fix invocation of opToString should have been: opToSt…
g-h-c May 26, 2026
eaada34
AIRUNTIME-171 - fix struct NumericLimits<float>::lowest() should retu…
g-h-c May 27, 2026
c396a2b
AIRUNTIME-171 - fix rocsparse 'error: constexpr function never produc…
g-h-c May 27, 2026
fc40fc5
AIRUNTIME-171 - rename opToString() in rtc_reduce.cc to reduceOpToStr…
g-h-c May 27, 2026
a5d5206
AIRUNTIME-171 - fix that Unit_Rtc_CoopScan for cg::plus<float> used e…
g-h-c May 27, 2026
14f3c41
AIRUNTIME-171 - fix that NumericLimits<__half>::maximum() and lowest(…
g-h-c May 27, 2026
1c90b98
AIRUNTIME-171 - fix 'expected' variable assigned but never used
g-h-c May 27, 2026
3037743
AIRUNTIME-171 - remove halfWaveSize which is actually shadowed by ano…
g-h-c May 27, 2026
53b89c2
AIRUNTIME-171 - fix undefined behaviour that could happen if __builti…
g-h-c May 27, 2026
60623df
AIRUNTIME-171 - try to implement NumericLimits<__half>::maximum() and…
g-h-c May 28, 2026
70d936b
AIRUNTIME-171 - avoid 'error: change of the active member of a union …
g-h-c May 28, 2026
56e3170
AIRUNTIME-171 - fix expected value for __half in exclusive_scans tests
g-h-c May 29, 2026
eaf2c9d
AIRUNTIME-171 - execute Unit_Thread_Block_Tile_Exclusive_Scan_Basic f…
g-h-c Jun 1, 2026
47921ba
AIRUNTIME-171 - fix up previous commit, std::numeric_limits only need…
g-h-c Jun 1, 2026
cc332cd
AIRUNTIME-171 - fix value for NumericLimits<float>::lowest()
g-h-c Jun 1, 2026
cfa954b
AIRUNTIME-171 - fix Unit_Rtc_CoopReduce being too verbose
g-h-c Jun 1, 2026
e8516db
AIRUNTIME-171 - make sure the operands and the return value are align…
g-h-c Jun 1, 2026
3142b93
AIRUNTIME-171 - prevent memcpy() in cooperative_groups::scan() to rea…
g-h-c Jun 11, 2026
12e99e6
AIRUNTIME-171 - fix memcpy() reading past Val also for reduce operations
g-h-c Jun 11, 2026
4d5bea1
AIRUNTIME-171 - make cooperative_groups::exclusive_scan() results for…
g-h-c Jun 11, 2026
5dd6707
Fix expected values for cg::exlusive_scan() for the first active lane
g-h-c Jun 12, 2026
ea17dc2
AIRUNTIME-171 - cosmetic changes
g-h-c Jun 13, 2026
cf2f101
AIRUNTIME-171 - rename GENERATE_SCAN_FUNC as HIP_IMPL_GENERATE_SCAN_F…
g-h-c Jun 16, 2026
7a62b59
AIRUNTIME-171 - fix typo in cooperativeGrps.yaml
g-h-c Jun 16, 2026
e4461bd
AIRUNTIME-171 - fix Max::operator() contained an unnecessary extra loop
g-h-c Jun 16, 2026
ae7d88f
AIRUNTIME-171 - fix whitespace errors
g-h-c Jun 16, 2026
a533ae6
AIRUNTIME-171 - add missing nvidia_hip_cooperative_groups_scan.h
g-h-c Jun 16, 2026
041bc79
AIRUNTIME-171 - add cg::inclusive_scan/exclusive_scan() overloads tha…
g-h-c Jun 16, 2026
fc4cadf
AIRUNTIME-171 - fix Unit_Thread_Block_Tile_Inclusive_Scan_Basic failu…
g-h-c Jun 16, 2026
ec5857e
AIRUNTIME-171 - fix that cooperative_groups/scan.h would not compile …
g-h-c Jun 17, 2026
78e0ee3
AIRUNTIME-171 - add test Unit_Thread_Block_Coalesced_Scan_Partition
g-h-c Jun 17, 2026
4761780
AIRUNTIME-171 - fix that scan should use coalesced_info.member_mask a…
g-h-c Jun 17, 2026
0217805
AIRUNTIME-171 - add a test for partitioned thread_block_tiles too
g-h-c Jun 17, 2026
c224faa
AIRUNTIME-171 - fix order of parameters when applying scan operations…
g-h-c Jun 18, 2026
56fcc84
AIRUNTIME-171 - return {} as identity for custom types as the result …
g-h-c Jun 18, 2026
1aabf99
AIRUNTIME-171 - change cooperative_group::impl::buildMask() to use in…
g-h-c Jun 19, 2026
7baca86
SWDEV-515087 - replace __lane_id() with a manual calculation; otherwi…
g-h-c Jun 25, 2026
bc62602
ROCM-26483 - make sure cooperative_groups::reduce() will now work wit…
g-h-c Jun 25, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -894,4 +894,13 @@ static inline __device__ void* memset(void* ptr, int val, size_t size) {
}
#endif // !__OPENMP_AMDGCN__

#define HIP_IMPL_GENERATE_SCAN_FUNC(OP, TYPE_ALIAS, TYPE) \
extern "C" __device__ __attribute__((const)) TYPE __ockl_wfscan_ ## OP ## _ ## TYPE_ALIAS(TYPE, bool);\
\
template <bool Inclusive>\
__device__ __forceinline__ TYPE scan_ ## OP(TYPE val)\
{\
return __ockl_wfscan_ ## OP ## _ ## TYPE_ALIAS(val, Inclusive);\
}

#endif
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,11 @@
#endif

namespace cooperative_groups {
class thread_group;
namespace impl {
template <typename TyGroup>
__CG_QUALIFIER__ unsigned long long groupMask(const TyGroup&);
}

/** \brief The base type of all cooperative group types.
*
Expand All @@ -32,12 +37,13 @@ namespace cooperative_groups {
* on Microsoft Windows.
*/
class thread_group {
template <typename TyGroup>
friend __CG_QUALIFIER__ unsigned long long cooperative_groups::impl::groupMask(const TyGroup&);
protected:
__hip_uint32_t _type; //! Type of the thread_group.
__hip_uint32_t _num_threads; //! Total number of threads in the thread_group.
__hip_uint64_t _mask; //! Lanemask for coalesced and tiled partitioned group types,
//! LSB represents lane 0, and MSB represents lane 63

//! Construct a thread group, and set thread group type and other essential
//! thread group properties. This generic thread group is directly constructed
//! only when the group is supposed to contain only the calling thread
Expand Down Expand Up @@ -387,6 +393,7 @@ class coalesced_group : public thread_group {
friend __CG_QUALIFIER__ coalesced_group tiled_partition(const coalesced_group& parent,
unsigned int tile_size);
friend __CG_QUALIFIER__ coalesced_group binary_partition(const coalesced_group& cgrp, bool pred);

template <unsigned int fsize, class fparent> friend __CG_QUALIFIER__ coalesced_group
binary_partition(const thread_block_tile<fsize, fparent>& tgrp, bool pred);

Expand Down Expand Up @@ -1044,6 +1051,36 @@ template <unsigned int size, class ParentCGTy> class thread_block_tile_internal
__CG_QUALIFIER__ thread_block_tile_internal(const thread_block& g)
: thread_block_tile_type<size, ParentCGTy>() {}
};

// becomes to std::true_type if the group is tiled and has a size known at compile time
template <class TyGroup>
struct isTiledGroup : __hip_internal::false_type {
};

template <unsigned int N, class ParentCGTy>
struct isTiledGroup<cooperative_groups::thread_block_tile<N, ParentCGTy>>
: __hip_internal::integral_constant<bool,
(N == 1 || N == 2 || N == 4 || N == 8 ||
N == 16 || N == 32 || N == 64)> {
};

// returns the size of tile_group provided it is known at compile time
template <class TyGroup>
struct tiledGroupSize : __hip_internal::integral_constant<int, 0> {

};
template <unsigned int N, class ParentCGTy>
struct tiledGroupSize<cooperative_groups::thread_block_tile<N, ParentCGTy>>
: __hip_internal::integral_constant<int, N> {
};

template <class TyGroup>
struct isCoalescedGroup : __hip_internal::false_type {
};

template <>
struct isCoalescedGroup<cooperative_groups::coalesced_group> : __hip_internal::true_type {
};
} // namespace impl

/** \brief Group type - thread_block_tile
Expand Down Expand Up @@ -1287,6 +1324,7 @@ __CG_QUALIFIER__ coalesced_group binary_partition(const thread_block_tile<size,
}
}


template <class T>
struct plus {
__CG_QUALIFIER__ T operator()(T lhs, T rhs) const
Expand Down Expand Up @@ -1334,6 +1372,153 @@ struct bit_or {
return lhs | rhs;
}
};

namespace impl {
// when instantiated with two parameter types, allows to know if there are the same, regardless
// of const/volatile qualifiers
template <typename T, typename U>
using is_param_type_same = __hip_internal::is_same<typename __hip_internal::remove_cvref<T>,
typename __hip_internal::remove_cvref<U>>;

template <class T, class Op>
struct isArithmeticFunc : __hip_internal::false_type {
};

template <class T>
struct isArithmeticFunc<T, cooperative_groups::plus<T>> : __hip_internal::true_type {
};

template <class T>
struct isArithmeticFunc<T, cooperative_groups::less<T>> : __hip_internal::true_type {
};

template <class T>
struct isArithmeticFunc<T, cooperative_groups::greater<T>> : __hip_internal::true_type {
};

template <class T, class Op>
struct isBooleanFunc : __hip_internal::false_type {
};

template <class T>
struct isBooleanFunc<T, cooperative_groups::bit_and<T>> : __hip_internal::true_type {
};

template <class T>
struct isBooleanFunc<T, cooperative_groups::bit_or<T>> : __hip_internal::true_type {
};

template <class T>
struct isBooleanFunc<T, cooperative_groups::bit_xor<T>> : __hip_internal::true_type {
};

// this is the value to return in exclusive_scan, for lane 0
template <class T, class Op>
struct CGIdentity {
__CG_QUALIFIER__ T operator()()
{
T result = {};
return result;
}
};

template <class T>
struct CGIdentity<T, cooperative_groups::bit_and<T>> {
__CG_QUALIFIER__ T operator()()
{
T result {};
return ~result;
}
};

template <class T>
struct CGIdentity<T, cooperative_groups::less<T>> {
__CG_QUALIFIER__ T operator()()
{
// CUDA would return 0 in this case. But in our case we mimic what __ockl_wfscan_*
Comment thread
g-h-c marked this conversation as resolved.
// would do
return __hip_internal::NumericLimits<T>::maximum();
}
};

template <class T>
struct CGIdentity<T, cooperative_groups::greater<T>> {
__CG_QUALIFIER__ T operator()()
{
return __hip_internal::NumericLimits<T>::minimum();
}
};

// calculates the necessary warp mask for cooperative groups that support reduce(), or
// inclusive/exlcusive_scan()
template <typename TyGroup>
__CG_QUALIFIER__ unsigned long long groupMask(const TyGroup& group)
{
unsigned long long mask = ~0ull;

if constexpr (impl::isCoalescedGroup<TyGroup>::value) {
Comment thread
g-h-c marked this conversation as resolved.
mask = group.coalesced_info.member_mask;
} else {
// we cannot simply just use the __activemask() here, because more than one tile could have active
// threads at a time; we need to mask away the threads that not part of this tile first
mask >>= (64 - group.num_threads());
mask <<= (((internal::workgroup::thread_rank() % warpSize) / group.num_threads()) * group.num_threads());
}

return mask;
}

// backward permute implementation for cooperative group operations, i.e.
// for up to 32 bytes (__hip_ds_bpermute() can only do 4 bytes at a time,
// this function calls it (or the floating point version) multiple times to
// implement it for bigger sizes
template <bool isPrimitiveType, class T, size_t NumPermutes, typename __hip_internal::enable_if<NumPermutes == 0, int>::type = 0>
__CG_QUALIFIER__ void bPermute(T&, T, int from)
{
}

// trivial case: the type fits within the permute size
template <bool IsPrimitiveType, class T, size_t NumPermutes, typename __hip_internal::enable_if<IsPrimitiveType && NumPermutes == 1, int>::type = 0>
__CG_QUALIFIER__ void bPermute(T& permuteResult, T result, int from)
{
auto backwardPermute = [](int index, T arg) {
if constexpr (__hip_internal::is_floating_point<T>::value &&
sizeof(T) <= 4) {
return __hip_ds_bpermutef(index, arg);
} else {
return __hip_ds_bpermute(index, arg);
}
};

if constexpr (sizeof(T) == 2) {
union {
int i;
T f;
} tmp;

Comment thread
g-h-c marked this conversation as resolved.
tmp.f = result;
tmp.i = __hip_ds_bpermute(from << 2, tmp.i);
permuteResult = tmp.f;
} else if constexpr (sizeof(T) == 4) {
auto bPermuteResult = backwardPermute(from << 2, result);
__builtin_memcpy(&permuteResult, &bPermuteResult, sizeof(result));
} else {
static_assert(__hip_internal::is_void<T>::value, "Unexpected type");
}
}

// Overload when we need multiple ds_permute, because one is not enough
template <bool IsPrimitiveType, class T, size_t NumPermutes>
__CG_QUALIFIER__ void bPermute(T permuteResult[NumPermutes], T result[NumPermutes], int from)
{
// ds_bpermute only deals with 32-bit sizes, so for other sizes
// we need to call the permute multiple times
for (int i = 0; i < NumPermutes; i++) {
permuteResult[i] = __hip_ds_bpermute(from << 2, result[i]);
}
}

} // namespace impl
#endif

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,6 @@

namespace cooperative_groups {
namespace impl {
template <typename T, typename U>
using is_param_type_same = __hip_internal::is_same<typename __hip_internal::remove_cvref<T>,
typename __hip_internal::remove_cvref<U>>;

template <typename T, typename = void>
struct has_add : __hip_internal::false_type {
};
Expand Down Expand Up @@ -76,25 +72,6 @@ namespace impl {
__hip_internal::void_t<decltype(__reduce_xor_sync<unsigned long long>(0ull, T {}))>
> : __hip_internal::true_type {};

// we can call reduce() only the block tiles that have a compile-time size
template <class TyGroup>
struct isTiledGroup : __hip_internal::false_type {
};

template <unsigned int N, class ParentCGTy>
struct isTiledGroup<cooperative_groups::thread_block_tile<N, ParentCGTy>>
: __hip_internal::integral_constant<bool,
(N == 1 || N == 2 || N == 4 || N == 8 ||
N == 16 || N == 32 || N == 64)> {
};

template <class TyGroup>
struct isCoalescedGroup : __hip_internal::false_type {
};

template <>
struct isCoalescedGroup<cooperative_groups::coalesced_group> : __hip_internal::true_type {
};
}

/** \ingroup CooperativeGAPI
Expand Down Expand Up @@ -126,15 +103,7 @@ __CG_QUALIFIER__ auto reduce(const TyGroup& group, TyVal&& val, TyFn&& op) -> de

unsigned long long mask = ~0ull;

// we cannot simply just use the __activemask() here, because more than one tile could have active
// threads at a time; we need to mask away the threads that not part of this tile first
if constexpr (!__hip_internal::is_same<TyGroup, cooperative_groups::coalesced_group>::value) {
mask >>= (64 - group.num_threads());
mask <<= (((threadIdx.x % warpSize) / group.num_threads()) * group.num_threads());
}

// for coalesced_groups, the mask is simply the activemask
mask &= __activemask();
mask = impl::groupMask(group);

if constexpr (__hip_internal::is_same<Op, cooperative_groups::plus<Val>>::value &&
impl::has_add<Val>::value) {
Expand Down
Loading
Loading