
Potential Inconsistencies between RAJA and Kokkos HIP implementations #568

@Yejashi


The problem

When running on Tuolumne (MI300A) with a total problem size of 512M (128M per GPU), the Kokkos variant of the Basic_PI_ATOMIC kernel runs significantly slower than both the RAJA and Base variants.

Specifically:

  • RAJA / Base variants: around 2 seconds total runtime
  • Kokkos variant: around 60 seconds total runtime

The Potential Cause

After looking into the Kokkos implementation, it appears the kernel may be performing redundant data movement and synchronization on every pass through the repetition loop.

The slowdown appears to stem from how the Kokkos kernel manages its Kokkos::View and host–device data movement.

In the current implementation, the pi view is recreated and synchronized during every repetition of the timing loop:

for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  *pi = m_pi_init;
  pi_view = getViewFromPointer(pi, 1);

  Kokkos::parallel_for(
      "PI_ATOMIC-Kokkos Kokkos_Lambda",
      Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(ibegin, iend),
      KOKKOS_LAMBDA(Index_type i) {
        double x = (double(i) + 0.5) * dx;
        Kokkos::atomic_add(&pi_view(0), dx / (1.0 + x * x));
      });

  moveDataToHostFromKokkosView(pi, pi_view, 1);
  *pi *= 4.0;
}

Each iteration seems to perform a device allocation for the new Kokkos::View, along with what I think is a blocking host–device copy via moveDataToHostFromKokkosView(). This results in a full synchronization and data transfer per repetition.

By contrast, the RAJA implementation seems to keep persistent device memory and use asynchronous kernel execution and transfers:

for (RepIndex_type irep = 0; irep < run_reps; irep++) {
  RAJAPERF_HIP_REDUCER_INITIALIZE(&m_pi_init, pi, hpi, 1, 1);

  RAJA::forall< RAJA::hip_exec<block_size, true /*async*/> >(res,
    RAJA::RangeSegment(ibegin, iend),
    [=] __device__(Index_type i) {
      double x = (double(i) + 0.5) * dx;
      RAJA::atomicAdd<RAJA::hip_atomic>(pi, dx / (1.0 + x * x));
    });

  RAJAPERF_HIP_REDUCER_COPY_BACK(pi, hpi, 1, 1);
  m_pi_final = hpi[0] * 4.0;
}

The Implications

The Kokkos implementation performs repeated data movement and synchronization inside the repetition loop, while the RAJA implementation reuses device buffers and performs asynchronous operations.

Assuming I have interpreted things correctly, and please do correct me if I haven't, I have a couple of questions:

  • Is this data movement pattern expected behavior for this kernel?
  • Would restructuring the implementation to persist the Kokkos::View across repetitions or use asynchronous copies improve performance?
  • I am comparing the performance of the kernel across RAJA and Kokkos and no longer believe it's a fair comparison. Am I correct in my assertion?
