
Conversation

@JaeseungYeom JaeseungYeom commented Apr 8, 2025

This PR enables building ExaEpi on HIP-enabled (AMD GPU) platforms.

For example, to build on tuolumne.llnl.gov, which has four AMD MI300A APUs (each combining a CDNA3 GPU with 4th-gen EPYC CPU cores) per compute node, do the following:

  • Load the necessary modules, including ROCm, MPI, and a C++ compiler: module load PrgEnv-gnu-amd/8.6.0 rocm/6.3.1hangfix cray-mpich/8.1.32
  • Create and enter a build directory: mkdir build; cd build
  • Find the device architecture: rocminfo | awk '((NF==2) && ($1=="Name:") && ($2 ~ /^gfx/)) {print $2}' | uniq
  • Set an environment variable to the HIP device architecture reported above: export AMD_ARCH=gfx942
  • Run cmake: cmake -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=${AMD_ARCH} -DCMAKE_INSTALL_PREFIX=`realpath ..`/install ..
  • Compile: make -j 16
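The architecture-detection step above can be sanity-checked offline by feeding the same awk filter some sample rocminfo-style output (the sample lines below are illustrative, not captured from tuolumne):

```shell
# Verify that the awk filter from the build steps extracts only the gfx
# target names from rocminfo-style "Name:" lines. Sample text is illustrative.
sample='Name:                    AMD EPYC Processor
Name:                    gfx942
Name:                    gfx942'

AMD_ARCH=$(printf '%s\n' "$sample" \
  | awk '((NF==2) && ($1=="Name:") && ($2 ~ /^gfx/)) {print $2}' \
  | uniq)
echo "$AMD_ARCH"   # on an MI300A node this yields gfx942
```

The CPU line is filtered out because it has more than two fields, and uniq collapses the repeated per-device entries to a single architecture name.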

@debog debog requested review from atmyers and tannguyen153 April 8, 2025 11:22
 params.ic_type = ICType::UrbanPop;
 pp.get("urbanpop_filename", params.urbanpop_filename);
-#ifdef AMREX_USE_CUDA
+#if defined(AMREX_USE_CUDA) || defined(AMREX_USE_HIP)
Collaborator

@JaeseungYeom Can you try #if defined(AMREX_USE_GPU) instead of #if defined(AMREX_USE_CUDA) || defined(AMREX_USE_HIP) here?

Collaborator

I think @stevenhofmeyr set the box size to 500 for NVIDIA GPUs only, hence the use of AMREX_USE_CUDA. If we change this to AMREX_USE_GPU, the value will apply to all other GPUs, including Intel and AMD ones.

Collaborator

@tannguyen153 I think that's how it should be. When using GPUs, whether AMD, Intel, or NVIDIA, the box size should be larger to minimize MPI communication between boxes on the same GPU, right?

Collaborator

Right, I think the box size should be large enough for all GPU backends activated by AMREX_USE_GPU. We can also tune the box size for specific GPUs and enumerate the initial values with AMREX_USE_CUDA, AMREX_USE_HIP, AMREX_USE_SYCL, etc.

Author

I am currently testing with that value. I am getting an OOM error with the value set to 100. I will experiment with it and let you know.

[yeom2@tuolumne1038:bin]$ srun -N 4 -n 16 --exclusive ./agent inputs.ca 
Initializing AMReX (25.04-9-g30a9768150c4)...
MPI initialized with 16 MPI processes
MPI initialized with thread support level 0
Initializing HIP...
HIP initialized with 16 devices.
2130.442s: flux-shell[1]: ERROR: oom: Memory cgroup out of memory: killed 1 task on tuolumne1042.
2130.442s: flux-shell[1]: ERROR: oom: memory.peak = 240.32831G

Collaborator

@JaeseungYeom : I just checked the code again - this is a runtime parameter, agent.max_box_size.
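Since it is a ParmParse runtime parameter, it can be changed without rebuilding, either in the inputs file or (following the usual AMReX convention) as a trailing key=value override on the command line; the node and task counts below are only examples:

```
# in the inputs file
agent.max_box_size = 500

# or appended on the command line (AMReX ParmParse override convention)
srun -N 8 -n 32 --exclusive ./agent inputs.ca agent.max_box_size=500
```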

Collaborator

@tannguyen153 : this is specifically for the UrbanPop code, and it defaults to 100 for CPUs and 500 for GPUs. For the census code, it defaults to 16. It is so much larger for UrbanPop because the underlying grid is lat/lng, so many grid points have no communities, unlike the packed allocation for the census code, where the underlying grid does not correspond to physical lat/lng.

Collaborator

@JaeseungYeom : you're likely getting OOM for smaller box sizes because you'll have too many boxes (most of which will be empty).

Author

I confirm that I can avoid the OOM error by using 8 nodes instead of 4. It looks like memory is limited on tuolumne because it is shared between the GPU and the CPU. So, what is the final suggestion? Just remove the HIP flag, or separate it from CUDA? Do you want me to try the UrbanPop data and experiment with agent.max_box_size?

Collaborator

You could experiment with agent.max_box_size to find the best settings. Ideally the code in Utils.cpp will set usable defaults for every common situation.
The defaults should also be described in examples/inputs.defaults, e.g.:

# if ic_type is census
# agent.max_box_size = 16
# if ic_type is urbanpop and using GPUs
# agent.max_box_size = 500
# if ic_type is urbanpop and not using GPUs
# agent.max_box_size = 100

We should add extra info there for any new default cases. These parameters are also described in the docs.
