Refactor of limit vel function for nicer layout for GPU compute#1023
Refactor of limit vel function for nicer layout for GPU compute#1023JorgeG94 wants to merge 27 commits intoNOAA-GFDL:dev/gfdlfrom
Conversation
* `trunc_any_array` is promoted to edge-domain in both directions `(SZIB_(G), SZJB_(G), ...)` * Some i/j indexing errors (e.g. `h(i+1,j)` -> `h(i,j+1)`) (Mostly i- code in the j- block) * Removed `!$omp parallel do` directives (NOTE: maybe we undo this one?) * Style fixes * Index case fixes (e.g. `I` -> `i`) * `end do` -> `enddo`, `end if` -> `endif` * Operator/keyword spacings * Trailing whitespace * Removed unused CFL_based_trunc from vertvisc_CS
Replaced the older thickness H_report conditions with the newer omes based on compute domain and using CS%h_[uv]. Also moved the write_u_accel block to mirror dev/gfdl. No idea if there are performance implications but we can check it out.
Revisions to vertvisc_limit_vel PR to dev/gfdl
Hallberg-NOAA
left a comment
There was a problem hiding this comment.
There are a couple of minor stylistic things that should be corrected here, and there needs to be a proper description of the changes in the PR message, but once these are in place this should be ready to go.
Please also note that the 12 commits in this PR will be squashed into a single commit, unless there is a compelling case to preserve each of the intermediate commits.
! I thought Marshall and I got all of them, sorry about that. I don't mind committs being squashed. Will address ! |
|
This PR modifies Much like before, the first loop tests for CFL violations and logs the smallest velocity magnitude for each This split was required to eliminate some performance bottlenecks observed in the GPU |
marshallward
left a comment
There was a problem hiding this comment.
@JorgeG94 While reviewing this again, I noticed that the trunc_any variable was dropped, causing the second truncation loop to always be executed. This could reduce the performance.
I believe an if (CFL > CS%CFL_trunc) trunc_any = .true. can be added before or after the if (CFL > CS%CFL_report) block, and these second kji blocks could be wrapped with if (trunc_any) tests.
The CI has also detected some trailing whitespace which needs to be removed.
I've also noted a few very minor style changes as inline comments.
And of course, thanks very much for figuring out a way to keep this working on both CPU and GPU!
|
|
|
Apologies, my previous comment about performance can be ignored. I can no longer reproduce the problem. It was likely a compiler flag difference. |
With that put to rest, I think everything else has been addressed? |
|
There are still a few outstanding issues.
|
- truncation moved inside the main big if since they are sequential, limit the scope of the write any variable - fix indendation
|
I think everything that I can see has been addressed, but I think it would also be good for someone other than me to review this. |
| endif | ||
| enddo ; enddo | ||
| do_any_write = .false. | ||
|
|
There was a problem hiding this comment.
trunc_any is never initialized to false, meaning that its (arbitrary) initialization value can determine whether the loops starting at lines 2404 and 2671 are called when there are in fact no truncations. The line trunc_any = .false should be added at about line 2579 and then again at line 2645.
Alternately, trunc_any could perhaps be replaced with the new variable do_any_write where it is used and then eliminated.
There was a problem hiding this comment.
good catch, will fix, thanks so much
There was a problem hiding this comment.
@Hallberg-NOAA There are different conditions for do_any_write (CFL > CS%CFL_report) and trunc_any (CFL > CS%CFL_trunc).
The CFL > CS%CFL_trunc is ultimately (re)tested in the second loop, but it is now wrapped in a do_any_write / CS%CFL_report test. If any CS%CFL_report is true and all CS%CFL_trunc are false, then you will run the loop and recompute CFL for nothing.
Admittedly, these do default to the same value, and even if they were not this is fairly unlikely, but it could have consequences in longer mostly-stable runs with intermittent CFL violations.
The current version of the code removes the trunc_any test. Unless I'm missing something, it makes more sense to me to preserve this test.
There was a problem hiding this comment.
Yes, @marshallward, you are absolutely correct. We do need to retain both do_any_write and any_trunc, and both need to be initialized properly.
|
sorry it seems I just never committed the last fix! |
@marshallward sorry for taking long to open this!
(Description copied from below)
This PR modifies vertvisc_limit_vel by refactoring the single CFL test and velocity truncation jki loop into two kji loops for each stage.
Much like before, the first loop tests for CFL violations and logs the smallest velocity magnitude for each ij column. The second loop independently checks for and applies potential potential truncations in the compute domain. The third write_[uv]_accel loop is preserved, with a new global control flag. The pattern is repeated for u and v.
This split was required to eliminate some performance bottlenecks observed in the GPU vert_friction implemenation.