
Build fail in MPAS with nvidiagpu compiler #6470

Closed

ndkeen opened this issue Jun 13, 2024 · 7 comments

Labels: bug, mpas-ocean, nvidia compiler (formerly PGI), pm-gpu (Perlmutter machine at NERSC, GPU nodes)

ndkeen (Contributor) commented Jun 13, 2024

The test SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu has been failing for a while now. I think I mentioned this to @jonbob, who said the failure dates matched a PR that recently went in. I thought I had made an issue, but maybe not.

 0 inform,   0 warnings,   1 severes, 0 fatal for ocn_diagnostics_variables_destroy
Target CMakeFiles/ocn.dir/__/__/core_ocean/shared/mpas_ocn_diagnostics_variables.f90.o built in 0.444529 seconds
gmake[2]: *** [mpas-framework/src/CMakeFiles/ocn.dir/build.make:918: mpas-framework/src/CMakeFiles/ocn.dir/__/__/core_ocean/shared/mpas_ocn_diagnostics_variables.f90.o] Error 2
gmake[2]: *** Waiting for unfinished jobs....
ocn_equation_of_state_linear_density_only:
    181, Generating present(tracerssurfacelayervalue(:,:),density(:,:))
         Generating NVIDIA GPU code
        183, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        184,   ! blockidx%x threadidx%x collapsed
    198, Generating present(tracers(:,:,:),density(:,:))
         Generating NVIDIA GPU code
        200, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        201,   ! blockidx%x threadidx%x collapsed
Target CMakeFiles/ocn.dir/__/__/core_ocean/shared/mpas_ocn_equation_of_state_jm.f90.o built in 2.100368 seconds
ocn_equation_of_state_linear_density_exp:
    315, Generating present(thermalexpansioncoeff(:,:),tracerssurfacelayervalue(:,:),density(:,:),salinecontractioncoeff(:,:))
         Generating NVIDIA GPU code
        320, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        321,   ! blockidx%x threadidx%x collapsed
    341, Generating present(tracers(:,:,:),thermalexpansioncoeff(:,:),density(:,:),salinecontractioncoeff(:,:))
         Generating NVIDIA GPU code
        346, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        347,   ! blockidx%x threadidx%x collapsed
ocn_equation_of_state_wright_density_only:
    211, Generating enter data create(boussinesqpres(:,:),tracertemp(:,:),tracersalt(:,:))
    226, Generating present(boussinesqpres(:,:),tracersalt(:,:),tracertemp(:,:),density(:,:))
xylar (Contributor) commented Jun 13, 2024

I built this myself, and I don't think the output above is relevant (as far as I can tell, it just reflects the parallel build getting killed). The relevant output is:

NVFORTRAN-S-0038-Symbol, topographic_wave_drag, has not been explicitly declared (/pscratch/sd/x/xylar/e3sm_scratch/pm-gpu/SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu.20240613_020012_785h6a/bld/cmake-bld/core_ocean/shared/mpas_ocn_diagnostics_variables.f90: 1023)

This appears to be caused by #6310, which removed the topographic_wave_drag field but missed the OpenACC directive on that line.
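
For illustration, the failure mode looks roughly like this (a hypothetical sketch with invented names, not the actual MPAS code): a field is removed from the module, but an OpenACC data directive still references it, so the compiler flags an undeclared symbol.

module sketch_diagnostics_variables
   implicit none
   real, allocatable :: normalVelocity(:,:)
contains
   subroutine sketch_destroy()
      ! Before the fix, the directive still named the removed field and
      ! triggered NVFORTRAN-S-0038 (symbol not explicitly declared):
      !    !$acc exit data delete(normalVelocity, topographic_wave_drag)
      ! After the fix, the directive only names fields that still exist:
      !$acc exit data delete(normalVelocity)
      deallocate(normalVelocity)
   end subroutine sketch_destroy
end module sketch_diagnostics_variables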

xylar added the bug label Jun 13, 2024
xylar (Contributor) commented Jun 13, 2024

After fixing the above, I'm now seeing:

NVFORTRAN-S-1061-Procedures called in a compute region must have acc routine information - ocn_subgrid_ssh_lookup (/pscratch/sd/x/xylar/e3sm_scratch/pm-gpu/SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu.20240613_024017_46es2q/bld/cmake-bld/core_ocean/shared/mpas_ocn_diagnostics.f90: 2307)
/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin/cmake -E cmake_copy_f90_mod mpas-framework/src/ocn_tracer_advection_mono.mod mpas-framework/src/CMakeFiles/ocn.dir/ocn_tracer_advection_mono.mod.stamp NVHPC
ocn_diagnostic_solve_z_coordinates:
   2307, Accelerator restriction: call to 'ocn_subgrid_ssh_lookup' with no acc routine information

xylar (Contributor) commented Jun 13, 2024

This next issue seems to have been introduced by #6288, and it will be more of a challenge to address. It appears to be caused by calling ocn_subgrid_ssh_lookup within an OpenACC loop without adding the required acc routine directive.
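
There are generally two ways out of this class of error: mark the callee with an acc routine directive so it can be compiled for the device, or hoist the call out of the compute region. A minimal sketch of the first option, with invented names rather than the actual MPAS interfaces:

module sketch_subgrid
   implicit none
contains
   subroutine sketch_ssh_lookup(x, ssh)
      ! An acc routine directive lets this procedure be called from
      ! device code; seq means each call runs on a single thread.
      !$acc routine seq
      real, intent(in) :: x
      real, intent(out) :: ssh
      ssh = 0.5 * x
   end subroutine sketch_ssh_lookup

   subroutine sketch_solve(n, values, ssh)
      integer, intent(in) :: n
      real, intent(in) :: values(n)
      real, intent(out) :: ssh(n)
      integer :: i
      ! Without the routine directive above, this call would produce
      ! NVFORTRAN-S-1061.
      !$acc parallel loop copyin(values) copyout(ssh)
      do i = 1, n
         call sketch_ssh_lookup(values(i), ssh(i))
      end do
   end subroutine sketch_solve
end module sketch_subgrid

The fix that was eventually merged (see the commit below) took the second route instead, moving the call out of the parallel region, since subgrid wetting and drying only needs to run on the CPU.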

xylar (Contributor) commented Jun 13, 2024

@sbrus89, I made #6471 to fix the first issue. Could you make a PR to fix the second one?

xylar (Contributor) commented Jun 13, 2024

Separate PRs probably make sense for these fixes because the two issues are unrelated to each other, but we won't be able to test them on their own because the test isn't currently compiling.

jonbob added a commit that referenced this issue Jun 24, 2024
Fix OpenACC routine issue for subgrid wetting and drying

This PR moves a subgrid subroutine call out of an OpenACC parallel region
to fix the compile problems noted in #6470. Since subgrid wetting and
drying is strictly an MPAS-Ocean standalone feature, it should be fine
for this code to remain CPU-only.

This PR also fixes a couple of issues in mpas_ocn_vmix.F:
* An OpenACC bug related to the use of gang vector collapse(3) on a
  doubly nested loop with variable inner loop bounds (see the sketch
  after this message).
* A missing !$omp parallel region in a calculation for the
  config_use_gotm option.

[BFB] -- mpas-ocean standalone only
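
The collapse bullet above can be illustrated with a hypothetical sketch (invented names, not the actual mpas_ocn_vmix.F code). OpenACC collapse requires a tightly nested, rectangular loop nest, so a collapse clause spanning a loop whose bounds depend on an outer index is invalid; splitting the gang and vector parallelism across the loops is one way to restructure it.

program sketch_vmix
   implicit none
   integer, parameter :: nCells = 8, nVertLevels = 4
   integer :: maxLevelCell(nCells)
   real :: work(nVertLevels, nCells)
   integer :: iCell, k

   maxLevelCell = nVertLevels
   work = 0.0

   ! Invalid: collapse(2) (or collapse(3) in the real code) cannot span
   ! the inner loop because its bound varies with iCell:
   !    !$acc parallel loop gang vector collapse(2)
   ! One valid alternative is gang parallelism over the outer loop and
   ! vector parallelism over the inner one:
   !$acc parallel loop gang copyin(maxLevelCell) copy(work)
   do iCell = 1, nCells
      !$acc loop vector
      do k = 1, maxLevelCell(iCell)
         work(k, iCell) = real(k * iCell)
      end do
   end do

   print *, 'sum of work =', sum(work)
end program sketch_vmix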
jonbob added a commit that referenced this issue Jun 26, 2024
…(PR #6471)

Fix OpenACC deletes for topographic wave drag

Partially addresses #6470

The errors were introduced in #6310, when variables were introduced and
renamed in the OpenACC create directives, but the corresponding changes
to the OpenACC directives that delete them were incomplete.

[BFB]
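
The underlying pattern is the pairing of enter data create with exit data delete: when a field is renamed, both directives have to be updated together, and here the delete side was missed. A hypothetical sketch with invented names, not the actual MPAS fields:

module sketch_topo_drag
   implicit none
   real, allocatable :: topographicWaveDrag(:)
contains
   subroutine sketch_init(n)
      integer, intent(in) :: n
      allocate(topographicWaveDrag(n))
      ! The renamed field is created on the device here ...
      !$acc enter data create(topographicWaveDrag)
   end subroutine sketch_init

   subroutine sketch_destroy()
      ! ... so the matching delete must use the new name as well;
      ! leaving the old name in this directive is what broke the build.
      !$acc exit data delete(topographicWaveDrag)
      deallocate(topographicWaveDrag)
   end subroutine sketch_destroy
end module sketch_topo_drag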
sbrus89 (Contributor) commented Jun 27, 2024

@ndkeen, this appears to be fixed now: https://my.cdash.org/tests/175231189

xylar (Contributor) commented Jul 12, 2024

@ndkeen, can you confirm whether this has been fixed, and close the issue if you agree?

@ndkeen ndkeen closed this as completed Jul 12, 2024