Add ACC directives to Fortran #27

Merged 6 commits on Feb 27, 2025


mnlevy1981
Contributor

This only updates swm_fortran_kernels.F90 (and adds a mechanism for building with nvfortran instead of gfortran).

This is probably the worst way to control what compiler is used for building
the model, but on the other hand it's a simple way to control what compiler is
used for building the model
Also added -acc=gpu flag to Makefile (though this might be the default? it
didn't change performance, while -noacc slowed things down)
@mnlevy1981
Contributor Author

Timing improvements using a compute node on derecho (with single gpu, where applicable):

  1. Building swm_fortran with nvfortran (this executable does not use acc at all)

     cycle number 4000 total computer time 2.552587 time per cycle 0.000638
     time and megaflops for loop 100 0.842869 7464.336043
     time and megaflops for loop 200 0.951119 7166.028746
     time and megaflops for loop 300 0.735940 5343.047880
    
  2. swm_fortran_driver (with ACC)

     cycle number 4000 total computer time 0.612509 time per cycle 0.000153
     time and megaflops for loop 100 0.422081 14905.802665
     time and megaflops for loop 200 0.086040 79215.535142
     time and megaflops for loop 300 0.081181 48437.204598
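
The speedups implied by the two runs above can be worked out directly from the reported times. The short program below is only an illustrative sketch: the program name and variable names are hypothetical and not part of this PR, and the timings are copied from the output above. Dividing each nvfortran-only time by the matching ACC time gives about 4.2x overall, and about 2x, 11x, and 9x for loops 100, 200, and 300.

```fortran
program SpeedupSketch
  implicit none
  ! Labels for the four timings reported by each run.
  character(len=8), parameter :: label(4) = &
    [character(len=8) :: "total", "loop 100", "loop 200", "loop 300"]
  ! Times in seconds, copied from run 1 (nvfortran, no ACC).
  real, parameter :: t_noacc(4) = [2.552587, 0.842869, 0.951119, 0.735940]
  ! Times in seconds, copied from run 2 (swm_fortran_driver with ACC).
  real, parameter :: t_acc(4)   = [0.612509, 0.422081, 0.086040, 0.081181]
  integer :: k

  ! Speedup = time without ACC / time with ACC.
  do k = 1, 4
    print '(a8, ": ", f6.2, "x")', label(k), t_noacc(k) / t_acc(k)
  end do
end program SpeedupSketch
```

Loop 100 gains the least of the three, which is consistent with it still carrying most of the remaining time in the ACC run (0.42 of 0.61 seconds).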
    

When using nvfortran, it's helpful to have both executables for comparison
mnlevy1981 marked this pull request as draft on February 26, 2025 18:20
@mnlevy1981
Contributor Author

I'll mark this ready to review once I've done a little more optimization with regard to acc directives in the driver (single copyin / copyout, more ACC directives around periodicity, etc.).

@mnlevy1981
Contributor Author

I added ACC directives around initialization, and that had a noticeable benefit. I also tried simply moving the ACC copyin and copyout directives from the kernels to the driver, and that degraded performance. Specifically:

diff --git a/swm_fortran/swm_fortran_driver.F90 b/swm_fortran/swm_fortran_driver.F90
index f159393..740f279 100644
--- a/swm_fortran/swm_fortran_driver.F90
+++ b/swm_fortran/swm_fortran_driver.F90
@@ -160,7 +160,9 @@ Program SWM_Fortran_Driver
   do ncycle=1,ITMAX

     call cpu_time(c1)
+    !$acc enter data copyin(p,u,v,fsdx,fsdy,cu,cv,h,z)
     call UpdateIntermediateVariablesKernel(fsdx,fsdy,p,u,v,cu,cv,h,z)
+    !$acc exit data copyout(cu,cv,z,h)
     call cpu_time(c2)
     t100 = t100 + (c2 - c1)

@@ -190,7 +192,9 @@ Program SWM_Fortran_Driver
     tdtsdy = tdt / dy

     call cpu_time(c1)
+    !$acc enter data copyin(tdtsdx,tdtsdy,tdts8,cu,cv,z,h,pold,uold,vold,pnew,unew,vnew)
     call UpdateNewVariablesKernel(tdtsdx,tdtsdy,tdts8,pold,uold,vold,cu,cv,h,z,pnew,unew,vnew)
+    !$acc exit data copyout(unew,vnew,pnew)
     call cpu_time(c2)
     t200 = t200 + (c2-c1)

diff --git a/swm_fortran/swm_fortran_kernels.F90 b/swm_fortran/swm_fortran_kernels.F90
index cff1d69..45ee999 100644
--- a/swm_fortran/swm_fortran_kernels.F90
+++ b/swm_fortran/swm_fortran_kernels.F90
@@ -12,7 +12,6 @@ subroutine UpdateIntermediateVariablesKernel(fsdx,fsdy,p,u,v,cu,cv,h,z)

     integer :: i,j

-    !$acc enter data copyin(p,u,v,fsdx,fsdy,cu,cv,h,z)
     !$acc parallel loop collapse(2) present(p,u,v,fsdx,fsdy)
     do j=1,size(cu,2)-1
       do i=1,size(cu,1)-1
@@ -24,7 +23,6 @@ subroutine UpdateIntermediateVariablesKernel(fsdx,fsdy,p,u,v,cu,cv,h,z)
                                   v(i,j+1) * v(i,j+1) + v(i,j) * v(i,j))
       end do
     end do
-    !$acc exit data copyout(cu,cv,z,h)

   end subroutine UpdateIntermediateVariablesKernel

@@ -36,7 +34,6 @@ subroutine UpdateNewVariablesKernel(tdtsdx,tdtsdy,tdts8,pold,uold,vold,cu,cv,h,z

     integer :: i,j

-    !$acc enter data copyin(tdtsdx,tdtsdy,tdts8,cu,cv,z,h,pold,uold,vold,pnew,unew,vnew)
     !$acc parallel loop collapse(2) present(tdtsdx,tdtsdy,tdts8,cu,cv,z,h,pold,uold,vold)
     do j=1,size(unew,2)-1
       do i=1,size(unew,1)-1
@@ -49,7 +46,6 @@ subroutine UpdateNewVariablesKernel(tdtsdx,tdtsdy,tdts8,pold,uold,vold,cu,cv,h,z
         pnew(i,j) = pold(i,j) - tdtsdx * (cu(i+1,j) - cu(i,j)) - tdtsdy * (cv(i,j+1) - cv(i,j))
       end do
     end do
-    !$acc exit data copyout(unew,vnew,pnew)

   end subroutine UpdateNewVariablesKernel

results in ~50% longer runtime. Is there a directive to say "this function will always be called with data on the GPU"? Or something else that I'm missing?
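
For context on the question above: the `present` clause already used in the kernels is the OpenACC way of stating that the listed arrays are expected to be on the device when the routine is called. In the diff, the `enter data` / `exit data` directives still execute once per cycle, just from the driver. A common alternative is a single structured `!$acc data` region around the whole time loop. The program below is a minimal sketch of that pattern under stated assumptions, not the change that was merged here: the module, subroutine, and variable names (`sketch_kernels`, `ScaleKernel`, `a`, `b`, `n`, `nsteps`) are hypothetical, and the loop body is a stand-in for the real kernels.

```fortran
module sketch_kernels
  implicit none
contains
  ! Hypothetical stand-in for a kernel such as UpdateIntermediateVariablesKernel.
  subroutine ScaleKernel(a, b)
    real, intent(in)  :: a(:,:)
    real, intent(out) :: b(:,:)
    integer :: i, j

    ! present(...) states that the caller has already placed a and b
    ! on the device, so no transfer happens at this directive.
    !$acc parallel loop collapse(2) present(a, b)
    do j = 1, size(b, 2)
      do i = 1, size(b, 1)
        b(i,j) = 2.0 * a(i,j)
      end do
    end do
  end subroutine ScaleKernel
end module sketch_kernels

program DataRegionSketch
  use sketch_kernels
  implicit none
  integer, parameter :: n = 64, nsteps = 10   ! illustrative sizes only
  real :: a(n,n), b(n,n)
  integer :: step

  a = 1.0
  b = 0.0

  ! One host-to-device copy of a before the loop and one
  ! device-to-host copy of b after it, instead of one pair per cycle.
  !$acc data copyin(a) copyout(b)
  do step = 1, nsteps
    call ScaleKernel(a, b)
  end do
  !$acc end data

  ! Self-check on the host after the data region has copied b back.
  if (any(abs(b - 2.0) > 1.0e-6)) error stop "unexpected result"
  print *, "ok"
end program DataRegionSketch
```

If the host needs intermediate values inside the loop (for output or diagnostics, say), `!$acc update host(...)` can copy just those arrays back without closing the data region. Without OpenACC enabled the directives are treated as comments and the same code runs on the host.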

@mnlevy1981
Contributor Author

@johnmauff I should have tagged you in that last comment

Offload more initialization and periodicity updates to GPU; also added
nvfortran-noacc target to Makefile to build without acc
mnlevy1981 marked this pull request as ready for review on February 27, 2025 18:36
johnmauff merged commit 2f3a84b into NCAR:main on Feb 27, 2025
1 check passed