Description
The problem
When running on Tuolumne (MI300A) with a total problem size of 512M (128M per GPU), the Kokkos variant of the Basic_PI_ATOMIC kernel performs significantly slower than both the RAJA and Base implementations.
Specifically:
- RAJA / Base variants: around 2 seconds total runtime
- Kokkos variant: around 60 seconds total runtime
The Potential Cause
After looking into the Kokkos implementation, it appears the kernel performs redundant data movement and synchronization on every pass through the repetition loop.
The slowdown seems to stem from how the kernel manages its Kokkos::View and host–device data movement: in the current implementation, the pi view is recreated and synchronized during every repetition of the timing loop:
```cpp
for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  *pi = m_pi_init;
  pi_view = getViewFromPointer(pi, 1);

  Kokkos::parallel_for(
      "PI_ATOMIC-Kokkos Kokkos_Lambda",
      Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(ibegin, iend),
      KOKKOS_LAMBDA(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        Kokkos::atomic_add(&pi_view(0), dx / (1.0 + x * x));
      });

  moveDataToHostFromKokkosView(pi, pi_view, 1);
  *pi *= 4.0;
}
```
Each iteration appears to trigger a device allocation for the new Kokkos::View, along with what I believe is a blocking host–device copy via moveDataToHostFromKokkosView(). The result is a full synchronization and data transfer per repetition.
By contrast, the RAJA implementation keeps persistent device memory and uses asynchronous kernel launches and transfers:
```cpp
for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  RAJAPERF_HIP_REDUCER_INITIALIZE(&m_pi_init, pi, hpi, 1, 1);

  RAJA::forall< RAJA::hip_exec<block_size, true /*async*/> >(res,
      RAJA::RangeSegment(ibegin, iend),
      [=] __device__(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        RAJA::atomicAdd<RAJA::hip_atomic>(pi, dx / (1.0 + x * x));
      });

  RAJAPERF_HIP_REDUCER_COPY_BACK(pi, hpi, 1, 1);
  m_pi_final = hpi[0] * 4.0;
}
```
The Implications
The Kokkos implementation performs repeated data movement and synchronization inside the repetition loop, while the RAJA implementation reuses device buffers and operates asynchronously.
Assuming I have interpreted things correctly (and please correct me if I haven't), I have a few questions:
- Is this data movement pattern expected behavior for this kernel?
- Would restructuring the implementation to persist the Kokkos::View across repetitions, or to use asynchronous copies, improve performance?
- I am comparing the performance of this kernel across RAJA and Kokkos and no longer believe it is a fair comparison. Am I correct in that assertion?
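To make the second question concrete, here is a rough sketch of what I have in mind (untested, and assuming getViewFromPointer / moveDataToHostFromKokkosView behave as described above): hoist the view creation out of the repetition loop so the device allocation happens once, and reset the accumulator on the device with Kokkos::deep_copy instead of rebuilding the view each repetition.

```cpp
// Sketch only, not a tested patch. Allocate the device view once,
// before the timed repetition loop.
auto pi_view = getViewFromPointer(pi, 1);

for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  // Reset the accumulator on the device rather than recreating the view.
  Kokkos::deep_copy(pi_view, m_pi_init);

  Kokkos::parallel_for(
      "PI_ATOMIC-Kokkos Kokkos_Lambda",
      Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(ibegin, iend),
      KOKKOS_LAMBDA(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        Kokkos::atomic_add(&pi_view(0), dx / (1.0 + x * x));
      });

  // Copy back only the single scalar result.
  moveDataToHostFromKokkosView(pi, pi_view, 1);
  *pi *= 4.0;
}
```

This would bring the per-repetition work closer to the RAJA variant (one kernel launch plus one scalar copy-back), though the copy-back here is still synchronous; an execution-space-instance overload of deep_copy could make it asynchronous if that matters for the comparison.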