Welcome to AMD's HPC Training Examples Repo!
(Last revision of this README: April 2nd, 2025).
Here you will find a variety of examples to showcase the capabilities of AMD's GPU software stack. Please be aware that the repo is continuously updated to keep up with the most recent releases of the AMD software, and also to increase the number of examples and use cases that we strive to provide for our users.
Please refer to this table of contents to locate the exercises and examples you are interested in, sorted by topic.
- HIP
  - HIP Functionality Checks
    - `query_device`: checks that `hipMemGetInfo` works.
  - Fundamental Examples
    - `basic_examples`: a collection of introductory exercises to get familiar with the HIP API and the HIP build process. Examples include the hipification of some CUDA code, device-to-host data transfer, error checking, and basic GPU kernel implementation. Begin here if you are just starting with HIP. `README`.
    - `Stream_Overlap`: shows how to share the workload of a GPU offload computation among several overlapping HIP streams. Note that AMD GPUs natively support the creation of multiple stream queues on the same GPU. The result is an additional gain in execution time due to the extra parallelism provided by the overlapping streams. `README`.
    - `dgemm`: a (d)GEMM application created as an exercise to showcase simple matrix-matrix multiplications on AMD GPUs. `README`.
    - `hip_stream`: modification of the STREAM benchmark for HIP. `README`.
    - `jacobi`: distributed Jacobi solver, using GPUs to perform the computation and MPI for halo exchanges. `README`.
    - `matrix_addition`: example of a HIP kernel performing a matrix addition.
    - `saxpy`: example of a HIP kernel performing a saxpy operation. `README`.
    - `stencil_examples`: examples of stencil operations with a HIP kernel, including the use of timers and asynchronous copies.
    - `vectorAdd`: example of a HIP kernel performing a vector add. Note that the `CMakeLists.txt` in this directory is a good example of a portable CMakeLists to build with HIP on either AMD or Nvidia GPUs. `README`.
    - `vector_addition_examples`: another example of a HIP kernel performing vector addition, including different versions such as one using shared memory, one with timers, and a CUDA one to try the `HIPIFY` and `hipifly` tools on. The examples in this directory are not part of the HIP test suite.
    - `reduction`: several examples of reduction operations using HIP kernels. `README`.
  - CUDA to HIP Porting
  - `HIP-Optimizations`: a daxpy HIP kernel is used to show how an initial version can be optimized to improve performance. `README`.
  - `HIPFort`: two examples that show how to use the hipfort interface to call hipBLAS functions from Fortran.
    - `hipgemm`: calls the hipBLAS function `hipblasZgemm` from an OpenMP application code written in Fortran, leveraging the hipfort interface. `README`.
    - `matmult`: compares the results of a matrix multiplication done with `hipblasDgemm` using hipBLAS and hipfort with one done using a HIP kernel. For the HIP kernel, a proper interface has to be created, which for hipBLAS is instead provided by hipfort. With this example, users can better understand how hipfort works by being involved themselves in the creation of such an interface. `README`.
  - `HIPStdPar`: several examples showing C++ standard parallelism with HIP on AMD GPUs. `README`.
  - `HIP-OpenMP`: several examples on HIP/OpenMP interoperability in Fortran and C++.
    - C++
      - Call HIP kernels from an OpenMP app and vice versa: this directory contains several examples of how to use OpenMP and HIP in the same application. A detailed explanation of the `saxpy` and `daxpy` examples in this directory is contained in the `README`.
      - `interop`: uses the OpenMP `interop` construct to synchronize a HIP kernel with an OpenMP kernel by placing them on the same HIP stream. The construct does not seem to work correctly at the moment, so a call to `hipStreamSynchronize` is made instead; details in the `README`.
    - Fortran
      - `Calling_DGEMM`: calls a rocBLAS dgemm function from an OpenMP application code written in Fortran. It has two versions: one with explicit memory management done with OpenMP, in the `explicit` directory, and one that uses unified shared memory, in the `usm` directory. `README`.
- MPI-examples
  - Benchmarks: GPU-aware benchmarks (`collective.cpp` and `pt2pt.cpp`) to assess the performance of the communication libraries. `README`. Video of Presentation.
  - GhostExchange: slimmed-down example of an actual physics application, where the solution is initialized on a square 2D domain discretized with a Cartesian grid, and then advanced in parallel using MPI communications with unified shared memory, so host pointers are passed to the MPI calls even if a GPU-aware installation of MPI is used.
    - `GhostExchange_ArrayAssign`: this version uses OpenMP to offload to the GPU. Detailed `README` files are provided for the different versions of the `GhostExchange_ArrayAssign` code, showcasing how to use `Omnitrace` to profile this application. Note that while the timeline tracing tool is now `rocprof-sys`, `Omnitrace` still lives in its dedicated GitHub repository.
    - `GhostExchange_ArrayAssign_HIP`: this version uses HIP to offload to the GPU. Here as well, detailed `README` files are provided for the different versions of the `GhostExchange_ArrayAssign_HIP` code, illustrating how to use `Omnitrace` to profile this application.
    - `GhostExchange3D_ArrayAssign`: a single version of the Ghost Exchange example in 3D, without offloading to the GPU.
- ManagedMemory: programming model exercises; topics covered are the APU programming model, OpenMP, performance portability frameworks (Kokkos and RAJA), and the discrete GPU programming model. Some HIP examples are also available. `README`.
- MLExamples: a rapidly growing directory including a variety of machine learning (ML) and artificial intelligence (AI) related examples.
  - Miscellaneous Examples: a variation of PyTorch's MNIST example code, and a smoke test for MPI4Py using CuPy. Examples with TensorFlow, Horovod, and Hugging Face are also included. `README`.
  - `AI_Surrogates`: a variety of Jupyter notebooks developed to show some applications of AI for science using surrogate models. There are no READMEs for these examples at the moment, so we suggest users work directly with the Jupyter notebooks for details.
  - `PyTorch_Profiling`: a collection of examples showing how to profile PyTorch using AMD tools. `README`.
  - `RAG_LangChainDemo`: a RAG chatbot demo application. `README`.
- Occupancy: example on modifying thread occupancy, using several variants of a matrix-vector multiplication leveraging shared memory and launch bounds.
- OmniperfExamples: several examples showing how to leverage Omniperf (now renamed rocprof-compute) to perform kernel-level optimization using HIP. NOTE: detailed READMEs are provided in each subdirectory, and while their output refers to Omniperf, the commands can be executed in the same way by simply replacing `omniperf` with `rocprof-compute`. `README`. Video of Presentation.
- Omniperf-OpenMP: example showing how to leverage Omniperf (now rocprof-compute) to perform kernel-level optimization using Fortran and OpenMP. `README`.
- Omnitrace
  - Omnitrace on Jacobi: Omnitrace used on the Jacobi solver example. `README`.
  - Omnitrace by Example: Omnitrace used on several versions of the Ghost Exchange example:
    - OpenMP Version: `READMEs` available for each of the different versions of the example code. Video of Presentation.
    - HIP Version: `READMEs` available for each of the different versions of the example code.
- Pragma_Examples: a large variety of examples for OpenMP (in Fortran, C, and C++) and a few for OpenACC.
  - OpenMP: a great many OpenMP examples spanning several languages (C, C++, and Fortran) and levels of complexity. There is an introductory `README` for the OpenMP material, but users are strongly encouraged to browse this directory and its sub-directories in detail to make sure they cover as many examples as possible.
    - C: many examples ranging from simple to complex constructs, device routines, reductions, build examples, and also a Jacobi solver example. This directory contains a `README`, but users are encouraged to browse each sub-directory independently and consult the dedicated READMEs whenever available.
    - C++: more complex exercises that explore optimizations with memory alignment, targeted use of the memory management directives and clauses, and setting ad-hoc parameters such as `num_threads()` and `thread_limit()`. There is also an example called `cpp_classes` that applies OpenMP offloading to a code using C++ classes. There is no specific README at the moment for this directory, so users are encouraged to browse the sub-directories and associated READMEs independently.
    - Fortran: as in the C sub-directory, there is a wide variety of examples here spanning a similar set of cases as the C counterpart. For instance, the Jacobi solver example is also available here in Fortran. A top-level `README` is available, but once again users are strongly encouraged to browse the sub-directories and associated READMEs independently.
    - `Intro`: a collection of mostly C++ examples with some Fortran as well. There is no associated README at the moment, so users will need to inspect the code directly for more details.
    - `USM`: some examples specific to unified shared memory and OpenMP. `README`.
    - `OpenMP_CPU`: some examples of using OpenMP on the CPU.
  - OpenACC: a few examples of offloading to the GPU using OpenACC.
- Speedup_Examples: examples to show the speedup obtained going from a CPU to a GPU implementation. `README`.
- `atomics_openmp`: examples on atomic operations using OpenMP.
- Kokkos: runs the Stream Triad example with a Kokkos implementation. `README`.
- Rocgdb: debugs the `HPCTrainingExamples/HIP/saxpy` example with Rocgdb. `README`. Video of Presentation.
- Rocprof: uses Rocprof to profile `HPCTrainingExamples/HIPIFY/mini-nbody/hip/`. `README`.
- Rocprofv3: uses Rocprofv3 to profile a Jacobi solver example.
  - `HIP`: example showing how to use `rocprofv3` to profile the Jacobi solver example written in HIP and available at `HPCTrainingExamples/HIP/jacobi`.
  - `OpenMP`: various examples on how to use Rocprofv3 to profile OpenMP applications:
    - Jacobi: example showing how to use `rocprofv3` to profile the Jacobi solver example written with OpenMP and available at `HPCTrainingExamples/Pragma_Examples/OpenMP/Fortran/7_jacobi/1_jacobi_usm`. `README`.
  - `Allocations_and_MemoryPool_MI300A`: example showing the importance of reducing dynamic memory allocations on MI300A with unified memory. `README`.
- rocm-blog-codes: accompanying source code examples for select HPC ROCm blogs found at https://rocm.blogs.amd.com. `README`.
- `Libraries`: examples showcasing how to integrate some of the HIP/ROCm libraries into your application code.
  - `matrix_exponential`: an example of how to use rocBLAS to compute the approximate solution of a linear system of ordinary differential equations. `README`.
  - `ConjugateGradient`: example showing how to use rocBLAS and rocSPARSE to solve a linear system with a sparse symmetric positive definite matrix using a conjugate gradient algorithm. `README`.
  - `RocSolverRf`: shows how to solve a sequence of sparse linear systems with refactorization using RocSolverRf. `README`.
- `rocprofiler-systems`: an example of how to use the `rocprof-sys` timeline trace profiler on the Jacobi solver example in `HPCTrainingExamples/HIP/jacobi`.
- `Profile-by-example`: a walk-through of how to profile the Jacobi solver example in `HPCTrainingExamples/HIP/jacobi` on Oak Ridge National Lab's machine Frontier, using `rocprofv3`, `rocprof-sys`, and `rocprof-compute`, effectively providing an example of an all-around profiling effort using AMD tools. `README`.
- `Affinity`: an example showing how to set proper affinity to CPU cores and GPUs. `README`.
- login_info
- Doc: directory with LaTeX and PDF documents that contain some of the most relevant README files, properly formatted for ease of reading. The PDF document is obtained by building the LaTeX document. Note: the document may be out of date compared to the READMEs in the repo, which are the most current source of information for these exercises.
- `tests`: this directory contains a large number of test scripts aimed at testing the installation of the software provided by the scripts in the companion repo `HPCTrainingDock`.
Most of the exercises in this repo can be run as a test suite by doing:

```bash
git clone https://github.com/amd/HPCTrainingExamples && \
cd HPCTrainingExamples && \
cd tests && \
./runTests.sh
```

You can also run a subset of the whole test suite by specifying the subset you are interested in as an input to the `runTests.sh` script. For instance: `./runTests.sh --pytorch`. To see a full list of the possible subsets that can be run: `./runTests.sh --help`.
NOTE: tests can also be run manually from their respective directories, provided the necessary modules have been loaded and the examples have been compiled appropriately.
We recommend users also check out the `rocm-examples` GitHub repo, which has a lot of content on HIP and the ROCm libraries.
We welcome your feedback and contributions; feel free to use this repo to bring up any issues or submit pull requests.
The software made available here is released under the MIT license; more details can be found in `LICENSE.md`.