Add ACC directives to Fortran #27
This is probably the worst way to control which compiler is used to build the model, but on the other hand it's a simple one.
Also added the -acc=gpu flag to the Makefile (though this might be the default? Adding it didn't change performance, while -noacc slowed things down).
Timing improvements using a compute node on derecho (with single gpu, where applicable):
When using nvfortran, it's helpful to have both executables for comparison
I'll mark this ready for review once I've done a little more optimization of the ACC directives in the driver (single copyin / copyout, more ACC directives around periodicity, etc.)
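As an illustration of what "ACC directives around periodicity" could look like, here is a minimal standalone sketch. It is not taken from the model: the array `f`, the sizes `m` and `n`, and the wrap-around rule (last row and column mirror the first) are all hypothetical, and the real periodicity rules in the driver may differ per variable. The point is only that the halo copies run inside the data region, so the array does not have to return to the host between kernels.

```fortran
! Hypothetical sketch: periodic halo update kept on the device.
! None of these names or sizes come from the PR.
program periodic_sketch
  implicit none
  integer, parameter :: m = 8, n = 8
  real :: f(m+1, n+1)
  integer :: i, j

  ! Fill the interior with distinct values and mark the halo as unset.
  f = -1.0
  do j = 1, n
     do i = 1, m
        f(i, j) = real(i + 10*j)
     end do
  end do

  !$acc data copy(f)

  ! Wrap in the first dimension: extra row mirrors row 1.
  !$acc parallel loop default(present)
  do j = 1, n
     f(m+1, j) = f(1, j)
  end do

  ! Wrap in the second dimension. Running i up to m+1 also fills the
  ! corner, using the value written by the previous loop.
  !$acc parallel loop default(present)
  do i = 1, m+1
     f(i, n+1) = f(i, 1)
  end do

  !$acc end data

  ! Host-side self-check after the single copy back.
  do j = 1, n
     if (f(m+1, j) /= f(1, j)) error stop 'row halo mismatch'
  end do
  do i = 1, m
     if (f(i, n+1) /= f(i, 1)) error stop 'column halo mismatch'
  end do
  if (f(m+1, n+1) /= f(1, 1)) error stop 'corner mismatch'
  print *, 'periodic halo OK'
end program periodic_sketch
```

Compiled without OpenACC support the directives are treated as comments, so the same source can serve as a CPU reference for comparison.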
I added ACC directives around initialization, and that had a noticeable benefit. I also tried to just move the ACC data directives out of the kernels and into the driver, but the following diff
diff --git a/swm_fortran/swm_fortran_driver.F90 b/swm_fortran/swm_fortran_driver.F90
index f159393..740f279 100644
--- a/swm_fortran/swm_fortran_driver.F90
+++ b/swm_fortran/swm_fortran_driver.F90
@@ -160,7 +160,9 @@ Program SWM_Fortran_Driver
do ncycle=1,ITMAX
call cpu_time(c1)
+ !$acc enter data copyin(p,u,v,fsdx,fsdy,cu,cv,h,z)
call UpdateIntermediateVariablesKernel(fsdx,fsdy,p,u,v,cu,cv,h,z)
+ !$acc exit data copyout(cu,cv,z,h)
call cpu_time(c2)
t100 = t100 + (c2 - c1)
@@ -190,7 +192,9 @@ Program SWM_Fortran_Driver
tdtsdy = tdt / dy
call cpu_time(c1)
+ !$acc enter data copyin(tdtsdx,tdtsdy,tdts8,cu,cv,z,h,pold,uold,vold,pnew,unew,vnew)
call UpdateNewVariablesKernel(tdtsdx,tdtsdy,tdts8,pold,uold,vold,cu,cv,h,z,pnew,unew,vnew)
+ !$acc exit data copyout(unew,vnew,pnew)
call cpu_time(c2)
t200 = t200 + (c2-c1)
diff --git a/swm_fortran/swm_fortran_kernels.F90 b/swm_fortran/swm_fortran_kernels.F90
index cff1d69..45ee999 100644
--- a/swm_fortran/swm_fortran_kernels.F90
+++ b/swm_fortran/swm_fortran_kernels.F90
@@ -12,7 +12,6 @@ subroutine UpdateIntermediateVariablesKernel(fsdx,fsdy,p,u,v,cu,cv,h,z)
integer :: i,j
- !$acc enter data copyin(p,u,v,fsdx,fsdy,cu,cv,h,z)
!$acc parallel loop collapse(2) present(p,u,v,fsdx,fsdy)
do j=1,size(cu,2)-1
do i=1,size(cu,1)-1
@@ -24,7 +23,6 @@ subroutine UpdateIntermediateVariablesKernel(fsdx,fsdy,p,u,v,cu,cv,h,z)
v(i,j+1) * v(i,j+1) + v(i,j) * v(i,j))
end do
end do
- !$acc exit data copyout(cu,cv,z,h)
end subroutine UpdateIntermediateVariablesKernel
@@ -36,7 +34,6 @@ subroutine UpdateNewVariablesKernel(tdtsdx,tdtsdy,tdts8,pold,uold,vold,cu,cv,h,z
integer :: i,j
- !$acc enter data copyin(tdtsdx,tdtsdy,tdts8,cu,cv,z,h,pold,uold,vold,pnew,unew,vnew)
!$acc parallel loop collapse(2) present(tdtsdx,tdtsdy,tdts8,cu,cv,z,h,pold,uold,vold)
do j=1,size(unew,2)-1
do i=1,size(unew,1)-1
@@ -49,7 +46,6 @@ subroutine UpdateNewVariablesKernel(tdtsdx,tdtsdy,tdts8,pold,uold,vold,cu,cv,h,z
pnew(i,j) = pold(i,j) - tdtsdx * (cu(i+1,j) - cu(i,j)) - tdtsdy * (cv(i,j+1) - cv(i,j))
end do
end do
- !$acc exit data copyout(unew,vnew,pnew)
end subroutine UpdateNewVariablesKernel
results in ~50% longer runtime. Is there a directive to say "this function will always be called with data on the GPU"? Or something else that I'm missing?
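One possible direction for the question above, offered as a sketch rather than a confirmed fix: in the diff, the `enter data` / `exit data` pairs in the driver still sit inside the `do ncycle` loop, so the arrays would still move between host and device every iteration. A structured `!$acc data` region placed around the whole time loop performs the transfers once, and `default(present)` (or an explicit `present(...)` clause) on the compute construct tells the compiler that the arrays are already on the device. The program below is a standalone illustration; `a`, `b`, `step`, `n` and `itmax` are hypothetical names, not the model's variables.

```fortran
! Hypothetical sketch: one data region around the time loop.
! Names and sizes are illustrative and do not come from the PR.
program data_region_sketch
  implicit none
  integer, parameter :: n = 64, itmax = 10
  real :: a(n, n), b(n, n)
  integer :: ncycle

  a = 1.0
  b = 0.0

  ! a is copied to the device once on entry; b is copied in on entry
  ! and back out once at "end data".
  !$acc data copyin(a) copy(b)
  do ncycle = 1, itmax
     call step(a, b)
  end do
  !$acc end data

  ! Each of the itmax steps adds 1.0, so every element should be 10.0.
  if (abs(b(1, 1) - real(itmax)) > 1.0e-6) error stop 'unexpected result'
  print *, 'b(1,1) =', b(1, 1)

contains

  subroutine step(a, b)
    real, intent(in)    :: a(:, :)
    real, intent(inout) :: b(:, :)
    integer :: i, j
    ! default(present) asserts that the arrays used in this loop are
    ! already on the device, so no transfer is issued here.
    !$acc parallel loop collapse(2) default(present)
    do j = 1, size(b, 2)
       do i = 1, size(b, 1)
          b(i, j) = b(i, j) + a(i, j)
       end do
    end do
  end subroutine step

end program data_region_sketch
```

If any host code between kernels reads or writes the arrays (timing output, periodicity updates that stay on the CPU), it would need an `!$acc update self(...)` or `!$acc update device(...)` at that point, or the host code would have to be offloaded as well.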
@johnmauff I should have tagged you in that last comment
Offload more initialization and periodicity updates to GPU; also added nvfortran-noacc target to Makefile to build without acc
This only updates `swm_fortran_kernels.F90` (and adds a mechanism for building with `nvfortran` instead of `gfortran`)