Allow building with HIP devices #118
base: development
Conversation
```diff
  params.ic_type = ICType::UrbanPop;
  pp.get("urbanpop_filename", params.urbanpop_filename);
-#ifdef AMREX_USE_CUDA
+#if defined(AMREX_USE_CUDA) || defined(AMREX_USE_HIP)
```
@JaeseungYeom Can you try #if defined(AMREX_USE_GPU) instead of #if defined(AMREX_USE_CUDA) || defined(AMREX_USE_HIP) here?
I think @stevenhofmeyr set the box size to 500 for NVIDIA GPUs only, hence the use of AMREX_USE_CUDA. If we change this to AMREX_USE_GPU, this value will apply to all GPUs, including Intel and AMD ones.
@tannguyen153 I think that's how it should be. When using GPUs, whether AMD, Intel, or NVIDIA, the box size should be larger to minimize MPI communication between boxes on the same GPU, right?
Right, I think the box size should be large for all GPUs activated by AMREX_USE_GPU. We could also tune the box size for specific GPUs and enumerate per-backend initial values with AMREX_USE_CUDA, AMREX_USE_HIP, AMREX_USE_SYCL, etc.
I am currently testing with that value. I am getting an OOM error with the value set to 100. I will experiment with it and let you know.
```
[yeom2@tuolumne1038:bin]$ srun -N 4 -n 16 --exclusive ./agent inputs.ca
Initializing AMReX (25.04-9-g30a9768150c4)...
MPI initialized with 16 MPI processes
MPI initialized with thread support level 0
Initializing HIP...
HIP initialized with 16 devices.
2130.442s: flux-shell[1]: ERROR: oom: Memory cgroup out of memory: killed 1 task on tuolumne1042.
2130.442s: flux-shell[1]: ERROR: oom: memory.peak = 240.32831G
```
@JaeseungYeom: I just checked the code again; this is a runtime parameter, `agent.max_box_size`.
@tannguyen153: this is specific to the UrbanPop code, and defaults to 100 for CPUs and 500 for GPUs. For the census code, it defaults to 16. It's so much larger for UrbanPop because the underlying grid is lat/lng, so many grid points have no communities, unlike the packed allocation for the census code, where the underlying grid does not correspond to physical lat/lng.
@JaeseungYeom: you're likely getting OOM errors for smaller box sizes because you'll have too many boxes (most of which will be empty).
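To make the "too many boxes" point concrete, a quick back-of-the-envelope calculation. The grid dimensions below are made up for illustration; UrbanPop's actual lat/lng grid differs:

```python
def num_boxes(grid_dim: int, box_size: int) -> int:
    """Number of boxes a 2D grid of grid_dim x grid_dim cells is split into."""
    per_dim = -(-grid_dim // box_size)  # ceiling division
    return per_dim ** 2

# Halving max_box_size quadruples the box count in 2D; dropping from 500
# to 100 gives 25x more boxes, most of them empty on a sparse lat/lng grid,
# and each box carries its own metadata and allocation overhead.
print(num_boxes(10_000, 500))  # 400
print(num_boxes(10_000, 100))  # 10000
```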
I confirm that I can avoid the OOM error by using 8 nodes instead of 4. It looks like memory is limited on tuolumne because it is shared between the GPU and CPU. So, what is the final suggestion? Just remove the HIP flag, or separate it from CUDA? Do you want me to try the UrbanPop data and play with `agent.max_box_size`?
You could experiment with agent.max_box_size to find the best settings. Ideally the code in Utils.cpp will set usable defaults for every common situation.
The defaults should also be described in examples/inputs.defaults, e.g.:

```
# if ic_type is census
# agent.max_box_size = 16
# if ic_type is urbanpop and using GPUs
# agent.max_box_size = 500
# if ic_type is urbanpop and not using GPUs
# agent.max_box_size = 100
```
We should add extra info there for any new default cases. These parameters are also described in the docs.
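Since `agent.max_box_size` is a runtime parameter, overriding the default from an inputs file needs no rebuild. A sketch of what such an override might look like (the value 256 is an arbitrary example, not a recommendation):

```
agent.max_box_size = 256
```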
This PR allows building ExaEpi on HIP-enabled platforms.
For example, to build on tuolumne.llnl.gov, which has four AMD MI300A APUs with 4th-gen EPYC CPUs per compute node, do the following:
```sh
module load PrgEnv-gnu-amd/8.6.0 rocm/6.3.1hangfix cray-mpich/8.1.32
mkdir build; cd build
rocminfo | awk '((NF==2) && ($1=="Name:") && ($2 ~ /^gfx/)) {print $2}' | uniq
export AMD_ARCH=gfx942
cmake -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=${AMD_ARCH} -DCMAKE_INSTALL_PREFIX=`realpath ..`/install ..
make -j 16
```
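The `rocminfo | awk` step above just extracts the gfx target name for cmake. The following simulates rocminfo-style output to show what the filter yields; the sample lines are fabricated for illustration, since rocminfo only runs on a machine with AMD GPUs:

```shell
# Feed fake rocminfo-style lines through the same awk filter used above:
# keep 2-field "Name: gfxNNN" lines, print the target, and deduplicate.
printf 'Name: gfx942\nName: gfx942\nMarketing Name: MI300A\n' |
  awk '((NF==2) && ($1=="Name:") && ($2 ~ /^gfx/)) {print $2}' | uniq
# prints: gfx942
```

On a real node you would capture this into the variable directly, e.g. `export AMD_ARCH=$(rocminfo | awk '…' | uniq)`, instead of hard-coding gfx942.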