Description
The problem
When running on Tuolumne (MI300A) with a total problem size of 512M (128M per GPU), the Kokkos variant of the Basic_PI_ATOMIC kernel performs significantly slower than both the RAJA and Base implementations.
Specifically:
- RAJA / Base variants: around 2 seconds total runtime
- Kokkos variant: around 60 seconds total runtime
The Potential Cause
After looking into the Kokkos implementation, it appears the kernel performs redundant data movement and synchronization on every pass through the repetition loop.
The slowdown seems to stem from how the kernel manages its Kokkos::View and host–device data movement: in the current implementation, the pi view is recreated and synchronized during every repetition of the timing loop:
```cpp
for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  *pi = m_pi_init;
  pi_view = getViewFromPointer(pi, 1);

  Kokkos::parallel_for(
      "PI_ATOMIC-Kokkos Kokkos_Lambda",
      Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(ibegin, iend),
      KOKKOS_LAMBDA(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        Kokkos::atomic_add(&pi_view(0), dx / (1.0 + x * x));
      });

  moveDataToHostFromKokkosView(pi, pi_view, 1);
  *pi *= 4.0;
}
```
Each iteration appears to trigger a device allocation for the new Kokkos::View, along with what I believe is a blocking host–device copy via moveDataToHostFromKokkosView(). The result is a full synchronization and data transfer per repetition.
By contrast, the RAJA implementation keeps persistent device memory and uses asynchronous kernel launches and transfers:
```cpp
for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  RAJAPERF_HIP_REDUCER_INITIALIZE(&m_pi_init, pi, hpi, 1, 1);

  RAJA::forall< RAJA::hip_exec<block_size, true /*async*/> >(res,
      RAJA::RangeSegment(ibegin, iend),
      [=] __device__(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        RAJA::atomicAdd<RAJA::hip_atomic>(pi, dx / (1.0 + x * x));
      });

  RAJAPERF_HIP_REDUCER_COPY_BACK(pi, hpi, 1, 1);
  m_pi_final = hpi[0] * 4.0;
}
```
The Implications
The Kokkos implementation performs repeated data movement and synchronization inside the repetition loop, while the RAJA implementation reuses device buffers and operates asynchronously.
Assuming I have interpreted things correctly (and please correct me if I haven't), I have a few questions:
- Is this data movement pattern expected behavior for this kernel?
- Would restructuring the implementation to persist the Kokkos::View across repetitions, or to use asynchronous copies, improve performance?
- I am comparing the performance of this kernel across RAJA and Kokkos and no longer believe it is a fair comparison. Am I correct in that assertion?
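To make the second question concrete, here is a rough sketch of what I have in mind (untested, and assuming getViewFromPointer / moveDataToHostFromKokkosView behave as described above): hoist the view creation out of the repetition loop so the device allocation happens once, and reset the accumulator on the device with Kokkos::deep_copy instead of rebuilding the view each repetition.

```cpp
// Sketch only, not a tested patch. Allocate the device view once,
// before the timed repetition loop.
auto pi_view = getViewFromPointer(pi, 1);

for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  // Reset the accumulator on the device rather than recreating the view.
  Kokkos::deep_copy(pi_view, m_pi_init);

  Kokkos::parallel_for(
      "PI_ATOMIC-Kokkos Kokkos_Lambda",
      Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(ibegin, iend),
      KOKKOS_LAMBDA(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        Kokkos::atomic_add(&pi_view(0), dx / (1.0 + x * x));
      });

  // Copy back only the single scalar result.
  moveDataToHostFromKokkosView(pi, pi_view, 1);
  *pi *= 4.0;
}
```

This would bring the per-repetition work closer to the RAJA variant (one kernel launch plus one scalar copy-back), though the copy-back here is still synchronous; an execution-space-instance overload of deep_copy could make it asynchronous if that matters for the comparison.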