AMD HPC Training Examples Repo

Welcome to AMD's HPC Training Examples Repo!

(Last revision of this README: April 2nd, 2025).

Here you will find a variety of examples to showcase the capabilities of AMD's GPU software stack. Please be aware that the repo is continuously updated to keep up with the most recent releases of the AMD software, and also to increase the number of examples and use cases that we strive to provide for our users.

Repository Structure

Please refer to this table of contents to locate the exercises and examples you are interested in, sorted by topic.

  1. HIP
    1. HIP Functionality Checks
      1. query_device: checks that hipMemGetInfo works.
    2. Fundamental Examples
      1. basic_examples: a collection of introductory exercises to get familiar with the HIP API and the HIP build process. Examples include a hipification of some CUDA code, device to host data transfer, error checking, and basic GPU kernel implementation. Begin here if you are just starting with HIP. README.
      2. Stream_Overlap: this example shows how to share the workload of a GPU offload computation using several overlapping HIP streams. Note that AMD GPUs natively support the creation of multiple stream queues on the same GPU. The result is an additional gain in terms of time of execution due to the additional parallelism provided by the overlapping streams. README.
      3. dgemm: a (d)GEMM application created as an exercise to showcase simple matrix-matrix multiplications on AMD GPUs. README.
      4. hip_stream: modification of the STREAM benchmark for HIP. README.
      5. jacobi: distributed Jacobi solver, using GPUs to perform the computation and MPI for halo exchanges. README.
      6. matrix_addition: example of a HIP kernel performing a matrix addition.
      7. saxpy: example of a HIP kernel performing a saxpy operation. README.
      8. stencil_examples: examples of stencil operations with a HIP kernel, including the use of timers and asynchronous copies.
      9. vectorAdd: example of a HIP kernel to perform a vector add. Note that the CMakeLists.txt in this directory represents a good example of a portable CMakeLists to build on either AMD or Nvidia GPUs with HIP. README.
      10. vector_addition_examples: another example of a HIP kernel to perform vector addition, including different versions such as one using shared memory, one with timers, and a CUDA one to try HIPIFY and hipifly tools on. The examples in this directory are not part of the HIP test suite.
      11. reduction: several examples of reduction operations using HIP kernels. README.
    3. CUDA to HIP Porting
      1. HIPIFY: example to show how to port CUDA code to HIP with HIPIFY tools. README.
      2. hipifly: example to show how to port CUDA code to HIP with hipifly tools. README.
    4. HIP-Optimizations: a daxpy HIP kernel is used to show how an initial version can be optimized to improve performance. README.
    5. HIPFort: two examples that show how to use the hipfort interface to call hipBLAS functions from Fortran.
      1. hipgemm: call the hipBLAS function hipblasZgemm from an OpenMP application code written in Fortran, leveraging the hipfort interface. README.
      2. matmult: this example compares the results of a matrix multiplication done with hipblasDgemm using hipBLAS and hipfort, with one done using a HIP kernel. For hipBLAS, the Fortran interface is provided by hipfort, whereas for the HIP kernel it has to be written by hand. By creating such an interface themselves, users can better understand how hipfort works. README.
    6. HIPStdPar: several examples showing C++ Std Parallelism with HIP on AMD GPUs. README.
    7. HIP-OpenMP: several examples on HIP/OpenMP interoperability in Fortran and C++.
      1. C++
        1. Call HIP kernels from OpenMP app and vice-versa: this directory contains several examples on how to use OpenMP and HIP in the same application. A detailed explanation of the saxpy and daxpy examples in this directory is contained in the README.
        2. interop: this example uses the OpenMP interop construct to synchronize a HIP kernel with an OpenMP kernel by placing them on the same HIP stream. The construct does not seem to work correctly at the moment, so a call to hipStreamSynchronize is made; details are in the README.
      2. Fortran
        1. Calling_DGEMM: this example calls a rocBLAS dgemm function from an OpenMP application code written in Fortran. It has two versions, one with explicit memory management done with OpenMP, in the explicit directory, and one that uses unified shared memory, in the usm directory. README.
  2. MPI-examples
    1. Benchmarks: GPU-aware benchmarks (collective.cpp and pt2pt.cpp) to assess the performance of the communication libraries. README. Video of Presentation.
    2. GhostExchange: slimmed down example of an actual physics application where the solution is initialized on a square 2D domain discretized with a Cartesian grid, and then advanced in parallel using MPI communications with unified shared memory, so host pointers are passed to the MPI calls, even if a GPU aware installation of MPI is used.
      1. GhostExchange_ArrayAssign: this version uses OpenMP to offload to the GPU. Detailed README files are provided here for the different versions of the GhostExchange_ArrayAssign code, which showcase how to use Omnitrace to profile this application. Note that while the timeline tracing tool is now rocprof-sys, Omnitrace still lives in its dedicated GitHub repository.
      2. GhostExchange_ArrayAssign_HIP: this version uses HIP to offload to the GPU. In this case as well, detailed README files are provided here for the different versions of the GhostExchange_ArrayAssign_HIP code, which illustrate how to use Omnitrace to profile this application.
      3. GhostExchange3D_ArrayAssign: a single version of the Ghost Exchange example in 3D, without offloading to GPU.
  3. ManagedMemory: programming model exercises. Topics covered are the APU programming model, OpenMP, performance portability frameworks (Kokkos and RAJA) and the discrete GPU programming model. Some HIP examples are also available. README.
  4. MLExamples: this is a rapidly growing directory including a variety of machine learning (ML) and artificial intelligence (AI) related examples.
    1. Miscellaneous Examples: a variation of PyTorch's MNIST example code, and a smoke test for MPI4Py using CuPy. Examples with TensorFlow, Horovod and Hugging Face are also included. README.
    2. AI_Surrogates: this directory contains a variety of Jupyter notebooks that have been developed to show some applications of AI for science using surrogate models. There are no READMEs for these examples at the moment and we suggest users work directly with the Jupyter notebooks for details.
    3. PyTorch_Profiling: a collection of examples to show how to profile PyTorch using AMD tools. README.
    4. RAG_LangChainDemo: a RAG Chatbot Demo application. README.
  5. Occupancy: example on modifying thread occupancy, using several variants of a matrix vector multiplication leveraging shared memory and launch bounds.
  6. OmniperfExamples: several examples showing how to leverage Omniperf (now renamed rocprof-compute) to perform kernel level optimization using HIP. NOTE: detailed READMEs are provided in each subdirectory. Although the output shown there refers to Omniperf, the commands can be executed in the same way by replacing omniperf with rocprof-compute. README. Video of Presentation.
  7. Omniperf-OpenMP: example showing how to leverage Omniperf (now rocprof-compute) to perform kernel level optimization using Fortran and OpenMP. README.
  8. Omnitrace
    1. Omnitrace on Jacobi: Omnitrace used on the Jacobi solver example. README.
    2. Omnitrace by Example: Omnitrace used on several versions of the Ghost Exchange example:
      1. OpenMP Version: READMEs available for each of the different versions of the example code. Video of Presentation.
      2. HIP Version: READMEs available for each of the different versions of the example code.
  9. Pragma_Examples: a large variety of examples for OpenMP (in Fortran, C, and C++) and a few for OpenACC.
    1. OpenMP: there are many OpenMP examples that span various languages (C, C++ and Fortran) and various levels of complexity. There is an introductory README for the OpenMP material, but users are strongly encouraged to browse this directory and its sub-directories in detail to go over as many examples as possible.
      1. C: this directory contains many examples that go from simple constructs to complex constructs, device routines, reductions, build examples and also a Jacobi solver example. This directory contains a README but users are encouraged to browse each sub-directory independently and consult the dedicated READMEs anytime they are available.
      2. C++: more complex exercises that explore optimizations with memory alignment, targeted use of the memory management directives and clauses, and setting ad-hoc parameters such as num_threads() and thread_limit(). There is also an example called cpp_classes that applies OpenMP offloading to a code using C++ classes. There is no specific README at the moment for this directory and users are encouraged to browse the sub-directories and associated READMEs independently.
      3. Fortran: as in the C sub-directory, there is a wide variety of examples here that cover a similar set of cases to the C counterpart. For instance, the Jacobi solver example is also available here in Fortran. A top level README is available but once again users are strongly encouraged to browse the sub-directories and associated READMEs independently.
      4. Intro: a collection of mostly C++ examples with some Fortran as well. There is no associated README at the moment so users will need to inspect the code directly for more details.
      5. USM: some examples specific to unified shared memory and OpenMP. README.
      6. OpenMP_CPU: some examples of using OpenMP on the CPU.
    2. OpenACC: a few examples of offloading to GPU using OpenACC.
      1. C: examples of reductions, saxpy and vector addition in C using OpenACC.
      2. Fortran: examples of reductions and vector add in Fortran using OpenACC.
  10. Speedup_Examples: examples to show the speedup obtained going from a CPU to a GPU implementation. README.
  11. atomics_openmp: examples on atomic operations using OpenMP.
  12. Kokkos: runs the Stream Triad example with a Kokkos implementation. README.
  13. Rocgdb: debugs the HPCTrainingExamples/HIP/saxpy example with Rocgdb. README. Video of Presentation.
  14. Rocprof: uses Rocprof to profile HPCTrainingExamples/HIPIFY/mini-nbody/hip/. README.
  15. Rocprofv3: uses Rocprofv3 to profile a Jacobi solver example.
    1. HIP: example showing how to use rocprofv3 to profile the Jacobi solver example written in HIP and available at HPCTrainingExamples/HIP/jacobi.
    2. OpenMP: this directory contains various examples on how to use Rocprofv3 to profile OpenMP applications:
      1. Jacobi: example showing how to use rocprofv3 to profile the Jacobi solver example written with OpenMP and available at HPCTrainingExamples/Pragma_Examples/OpenMP/Fortran/7_jacobi/1_jacobi_usm. README.
      2. Allocations_and_MemoryPool_MI300A: example showing the importance of reducing dynamic memory allocations on MI300A with unified memory. README.
  16. rocm-blog-codes: this directory contains accompanying source code examples for select HPC ROCm blogs found at https://rocm.blogs.amd.com. README.
  17. Libraries: examples showcasing how to integrate some of the HIP/ROCm libraries in your application code.
    1. matrix_exponential: an example on how to use rocBLAS to compute the approximate solution of a linear system of ordinary differential equations. README.
    2. ConjugateGradient: example showing how to use rocBLAS and rocSPARSE to solve a linear system with sparse symmetric positive definite matrix using a conjugate gradient algorithm. README.
  18. RocSolverRf: this example shows how to solve a sequence of sparse linear systems with refactorization using RocSolverRf. README.
  19. rocprofiler-systems: an example of how to use the rocprof-sys timeline trace profile on the Jacobi solver example in HPCTrainingExamples/HIP/jacobi.
  20. Profile-by-example: a walk-through of how to profile the Jacobi solver example in HPCTrainingExamples/HIP/jacobi on Oak Ridge National Lab's machine Frontier, using rocprofv3, rocprof-sys, and rocprof-compute, effectively providing an example of an all-around profiling effort using AMD tools. README.
  21. Affinity: an example to show how to set proper affinity to CPU cores and GPUs. README.
  22. login_info
    1. AAC: instructions on how to log in to the AMD Accelerator Cloud (AAC) resource. README.
  23. Doc: directory with LaTeX and PDF documents that contain some of the most relevant README files properly formatted for ease of reading. The PDF document is obtained by building the LaTeX document. Note: the document may be out of date compared to the READMEs in the repo, which are the most current source of information for these exercises.
  24. tests: this directory contains a large number of test scripts aimed at testing the installation of the software provided by the scripts in the companion repo HPCTrainingDock.
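
Since many of the directories above carry their own READMEs in sub-directories, it can help to list them once the repo is cloned. The following is a minimal sketch, assuming a standard `find` is available; `list_readmes` is a hypothetical helper name, not something provided by the repo.

```shell
# Sketch: list README files under a directory so they can be browsed topic by topic.
# list_readmes is a hypothetical helper, not part of the repo.
list_readmes() {
    # -iname matches README, README.md, Readme.txt, ... regardless of case.
    # Defaults to the current directory when no argument is given.
    find "${1:-.}" -type f -iname 'README*' | sort
}

# Usage, from the top of the cloned repo:
# list_readmes Pragma_Examples/OpenMP
```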

Run the Tests

Most of the exercises in this repo can be run as a test suite by doing:

git clone https://github.com/amd/HPCTrainingExamples && \
cd HPCTrainingExamples && \
cd tests && \
./runTests.sh

You can also run a subset of the whole test suite by specifying the subset you are interested in as an input to the runTests.sh script. For instance: ./runTests.sh --pytorch. To see a full list of the possible subsets that can be run: ./runTests.sh --help.
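
When running subsets one at a time, it may be convenient to keep the output of each run in its own log file. The sketch below assumes the working directory is HPCTrainingExamples/tests; the `run_subset` function, the `RUNTESTS` variable and the log file naming are illustrative choices, not part of the repo, and the exit status reported is simply whatever runTests.sh returns.

```shell
# Sketch: run one subset of the suite and save its output to a log file.
# Assumes the working directory is HPCTrainingExamples/tests.
# RUNTESTS, run_subset and the log file name are illustrative, not part of the repo.
RUNTESTS="${RUNTESTS:-./runTests.sh}"

run_subset() {
    subset="$1"                          # e.g. --pytorch; see ./runTests.sh --help
    log="runTests_${subset#--}.log"      # strips the leading "--": runTests_pytorch.log
    "$RUNTESTS" "$subset" > "$log" 2>&1  # capture stdout and stderr together
    status=$?
    echo "subset ${subset}: exit status ${status}, log in ${log}"
    return "$status"
}

# Usage (commented out so the sketch can be sourced safely):
# run_subset --pytorch
```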

NOTE: tests can also be run manually from their respective directories, provided the necessary modules have been loaded and they have been compiled appropriately.

Additional Resources

We recommend users also check out the rocm-examples GitHub repo, which has a lot of content on HIP and ROCm libraries.

Feedback

We welcome your feedback and contributions. Feel free to use this repo to bring up any issues or submit pull requests. The software made available here is released under the MIT license; more details can be found in LICENSE.md.