Skip to content

Refactor of limit vel function for nicer layout for GPU compute#1023

Open
JorgeG94 wants to merge 27 commits intoNOAA-GFDL:dev/gfdlfrom
JorgeG94:refactor_limit_vel
Open

Refactor of limit vel function for nicer layout for GPU compute#1023
JorgeG94 wants to merge 27 commits intoNOAA-GFDL:dev/gfdlfrom
JorgeG94:refactor_limit_vel

Conversation

@JorgeG94
Copy link

@JorgeG94 JorgeG94 commented Jan 21, 2026

@marshallward sorry for taking long to open this!


(Description copied from below)

This PR modifies vertvisc_limit_vel by refactoring the single CFL test and velocity truncation jki loop into two kji loops for each stage.

Much like before, the first loop tests for CFL violations and logs the smallest velocity magnitude for each ij column. The second loop independently checks for and applies potential potential truncations in the compute domain. The third write_[uv]_accel loop is preserved, with a new global control flag. The pattern is repeated for u and v.

This split was required to eliminate some performance bottlenecks observed in the GPU vert_friction implemenation.

JorgeG94 and others added 12 commits January 8, 2026 19:14
* `trunc_any_array` is promoted to edge-domain in both directions
  `(SZIB_(G), SZJB_(G), ...)`

* Some i/j indexing errors (e.g. `h(i+1,j)` -> `h(i,j+1)`)
  (Mostly i- code in the j- block)

* Removed `!$omp parallel do` directives
  (NOTE: maybe we undo this one?)

* Style fixes

  * Index case fixes (e.g. `I` -> `i`)
  * `end do` -> `enddo`, `end if` -> `endif`
  * Operator/keyword spacings
  * Trailing whitespace

* Removed unused CFL_based_trunc from vertvisc_CS
Replaced the older thickness H_report conditions with the newer omes
based on compute domain and using CS%h_[uv].

Also moved the write_u_accel block to mirror dev/gfdl.

No idea if there are performance implications but we can check it out.
Revisions to vertvisc_limit_vel PR to dev/gfdl
Copy link
Member

@Hallberg-NOAA Hallberg-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of minor stylistic things that should be corrected here, and there needs to be a proper description of the changes in the PR message, but once these are in place this should be ready to go.

Please also note that the 12 commits in this PR will be squashed into a single commit, unless there is a compelling case to preserve each of the intermediate commits.

@JorgeG94
Copy link
Author

There are a couple of minor stylistic things that should be corrected here, and there needs to be a proper description of the changes in the PR message, but once these are in place this should be ready to go.

Please also note that the 12 commits in this PR will be squashed into a single commit, unless there is a compelling case to preserve each of the intermediate commits.

! I thought Marshall and I got all of them, sorry about that.

I don't mind committs being squashed. Will address !

@marshallward
Copy link
Member

This PR modifies vertvisc_limit_vel by refactoring the single CFL test and velocity truncation jki loop into two kji loops for each stage.

Much like before, the first loop tests for CFL violations and logs the smallest velocity magnitude for each ij column. The second loop independently checks for and applies potential potential truncations in the compute domain. The third write_[uv]_accel loop is preserved, with a new global control flag. The pattern is repeated for u and v.

This split was required to eliminate some performance bottlenecks observed in the GPU vert_friction implemenation.

Copy link
Member

@marshallward marshallward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JorgeG94 While reviewing this again, I noticed that the trunc_any variable was dropped, causing the second truncation loop to always be executed. This could reduce the performance.

I believe an if (CFL > CS%CFL_trunc) trunc_any = .true. can be added before or after the if (CFL > CS%CFL_report) block, and these second kji blocks could be wrapped with if (trunc_any) tests.

The CI has also detected some trailing whitespace which needs to be removed.

I've also noted a few very minor style changes as inline comments.

And of course, thanks very much for figuring out a way to keep this working on both CPU and GPU!

@marshallward
Copy link
Member

marshallward commented Jan 28, 2026

Some of my suggestions on the GPU seemed to cause a serious (10x) degradation of runtime on the GPU. Or, at the least, my implementation had some issues. So this needs some investigation before moving forward.

@marshallward
Copy link
Member

Apologies, my previous comment about performance can be ignored. I can no longer reproduce the problem. It was likely a compiler flag difference.

@JorgeG94
Copy link
Author

JorgeG94 commented Feb 2, 2026

Apologies, my previous comment about performance can be ignored. I can no longer reproduce the problem. It was likely a compiler flag difference.

With that put to rest, I think everything else has been addressed?

@marshallward
Copy link
Member

There are still a few outstanding issues.

  • trunc_any still needs to be restored. The second kji loop of each direction is always run, even though the CFL is already computed and testable in the first kji loop. I would imagine this is CPU-expensive (and maybe even GPU-expensive).

  • In the current form, the two if (CS%u_trunc_file) blocks are sequential, and I don't see any reason why they should not be merged to a single if-block. Once merged, the scope of do_any_write becomes local to this if-block, so I would move the initialization inside this block. Similarly for CS%v_trunc_file.

    There may be an argument for moving the write_[uv]_accel blocks outside, but I'd suggest addressing it at a later time.

  • There are still a few minor style diffs that need to be fixed:

    • Indentation of L2667
    • if(do_any_write) -> if (do_any_write)

- truncation moved inside the main big if since they are sequential, limit the scope of the write any variable
- fix indendation
@marshallward
Copy link
Member

I think everything that I can see has been addressed, but I think it would also be good for someone other than me to review this.

endif
enddo ; enddo
do_any_write = .false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trunc_any is never initialized to false, meaning that its (arbitrary) initialization value can determine whether the loops starting at lines 2404 and 2671 are called when there are in fact no truncations. The line trunc_any = .false should be added at about line 2579 and then again at line 2645.

Alternately, trunc_any could perhaps be replaced with the new variable do_any_write where it is used and then eliminated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will fix, thanks so much

Copy link
Member

@marshallward marshallward Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Hallberg-NOAA There are different conditions for do_any_write (CFL > CS%CFL_report) and trunc_any (CFL > CS%CFL_trunc).

The CFL > CS%CFL_trunc is ultimately (re)tested in the second loop, but it is now wrapped in a do_any_write / CS%CFL_report test. If any CS%CFL_report is true and all CS%CFL_trunc are false, then you will run the loop and recompute CFL for nothing.

Admittedly, these do default to the same value, and even if they were not this is fairly unlikely, but it could have consequences in longer mostly-stable runs with intermittent CFL violations.

The current version of the code removes the trunc_any test. Unless I'm missing something, it makes more sense to me to preserve this test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @marshallward, you are absolutely correct. We do need to retain both do_any_write and any_trunc, and both need to be initialized properly.

@Hallberg-NOAA Hallberg-NOAA added the refactor Code cleanup with no changes in functionality or results label Feb 13, 2026
@JorgeG94
Copy link
Author

sorry it seems I just never committed the last fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor Code cleanup with no changes in functionality or results

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants