
Memory leak with ROCM-aware OpenMPI with UCX 1.17.0 #13170

Open
@StaticObserver

Description


Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5
v5.0.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from source code downloaded from the official webpage https://www.open-mpi.org/

configured with

 ./configure --prefix=/work/home/packages/openmpi/5.0.5 --with-ucx=/work/home/packages/ucx/1.17.0 --with-rocm=/public/software/compiler/dtk-23.10 --with-devel-headers --enable-mpi-fortran=no --enable-mca-no-build=btl-uct --enable-mpi1-compatibility

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 8.8 (Green Obsidian)
  • Computer hardware: AMD GPU architecture gfx906
  • Network type: not sure, InfiniBand?

Details of the problem

There is a memory leak whenever there is communication between GPU cards; memory gets used up very quickly. I wrote a minimal test program to reproduce the issue.

#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <iostream>
#include <vector>
#include <string>

// To simplify, use the same size for sending and receiving
// Here, assume real_t = double, but could also be float, etc.
using real_t = double;

int main(int argc, char* argv[])
{
    // Initialize MPI
    MPI_Init(&argc, &argv);
    // Initialize Kokkos
    Kokkos::initialize(argc, argv);

    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // For simplicity, assume only rank 0 and rank 1 communicate with each other
        if (size < 2) {
            if (rank == 0) {
                std::cerr << "Please run with at least 2 MPI ranks.\n";
            }
            Kokkos::finalize();
            MPI_Finalize();
            return 0;
        }

        // If you want to control the number of test iterations and array size from the command line,
        // you can read the parameters here.
        // Default iteration count (ITER) and array size (N)
        int ITER = 1000;      // Number of communication loops
        int N    = 1024 * 64; // Array size (number of elements). Increase if you want to observe memory behavior

        if (argc > 1) {
            ITER = std::stoi(argv[1]);
        }
        if (argc > 2) {
            N = std::stoi(argv[2]);
        }

        // Print a hint message
        if (rank == 0) {
            std::cout << "Running test with ITER = " << ITER
                      << ", array size = " << N << "\n"
                      << "Monitor memory usage in another terminal, etc.\n"
                      << std::endl;
        }

        // To distinguish sending destinations and receiving sources:
        // rank 0 -> sends to rank 1, receives from rank 1
        // rank 1 -> sends to rank 0, receives from rank 0
        // Other ranks do not perform actual communication
        int sendRank = (rank == 0) ? 1 : 0;
        int recvRank = (rank == 0) ? 1 : 0;

        // Test loop
        for (int iter = 0; iter < ITER; ++iter) {
            // Allocate a new send buffer (sendBuf) on the GPU every iteration.
            // Kokkos::View defaults to the memory space of the default execution
            // space (HIP device memory when Kokkos is built for ROCm);
            // for simplicity we do not consider the Layout here.
            Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace>
                sendBuf("sendBuf", N);

            // If we need to receive, allocate a receive buffer (recvBuf)
            Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace>
                recvBuf("recvBuf", N);

            // First, do a simple initialization for sendBuf (parallel loop)
            Kokkos::parallel_for("init_sendBuf", N, KOKKOS_LAMBDA(const int i){
              sendBuf(i) = static_cast<real_t>(rank + i * 0.001);
            });
            // If synchronization is needed (optional, depends on MPI+Kokkos implementation)
            Kokkos::fence();

            // MPI communication: rank 0 and rank 1 send and receive data from each other
            // For simplicity, here we use MPI_Sendrecv
            if (rank == 0 || rank == 1) {
                MPI_Sendrecv(
                    sendBuf.data(), // Send data pointer
                    N,              // Number of elements to send
                    MPI_DOUBLE,     // Data type
                    sendRank,       // Destination process
                    1234,           // Send tag
                    recvBuf.data(), // Receive data pointer
                    N,
                    MPI_DOUBLE,
                    recvRank,       // Source process
                    1234,           // Receive tag
                    MPI_COMM_WORLD, // Communicator
                    MPI_STATUS_IGNORE
                );
            }

            // (Optional check) verify successful reception
            // Only rank 0 or rank 1 need to check
            if ((iter % 100 == 0) && (rank == 0 || rank == 1)) {
                // Copy recvBuf to host to check
                auto recvHost = Kokkos::create_mirror_view(recvBuf);
                Kokkos::deep_copy(recvHost, recvBuf);

                // Print some debugging information
                // In practice, if the iteration count is large, it's better not to print frequently
                if (iter % 200 == 0) {
                      std::cout << "[Rank " << rank << "] Iter " << iter << ", recvBuf(0) = " << recvHost(0) << "\n";
                }
            }

            // At the end of the loop, sendBuf and recvBuf are no longer used
            // They will be freed when the braces end (due to Kokkos::View's RAII)
            // However, whether the MPI/GPU driver immediately reclaims the IPC handle requires monitoring
        } // end for(ITER)

        if (rank == 0) {
            std::cout << "Test finished. Check if GPU memory usage grew abnormally.\n";
        }
    }

    Kokkos::finalize();
    MPI_Finalize();

    return 0;
}


I compiled the program with CMake; a minimal CMakeLists.txt could be:

cmake_minimum_required(VERSION 3.10)
project(TestIPCIssue LANGUAGES CXX)

find_package(MPI REQUIRED)

find_package(Kokkos REQUIRED)

add_executable(mpi-test mpi-test.cpp)

target_link_libraries(mpi-test Kokkos::kokkos MPI::MPI_CXX)
target_include_directories(mpi-test PUBLIC ${MPI_CXX_INCLUDE_PATH})

One must have Kokkos installed.
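For example, with two ranks the reproducer can be launched as mpirun -np 2 ./mpi-test 10000 1048576 (the two optional arguments are the iteration count and the array size), while GPU memory usage is watched in another terminal, e.g. with rocm-smi.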
If one reuses the same buffers (by moving the Views out of the for loop) instead of creating new buffers every time before the communication, the issue disappears; a minimal sketch of that variant is shown below.
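Sketch of the workaround (a fragment that reuses the names rank, sendRank, recvRank, N, and ITER from the reproducer above; only the placement of the View allocations changes):

// Workaround: allocate the Views once, outside the loop, and reuse them.
Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> sendBuf("sendBuf", N);
Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> recvBuf("recvBuf", N);

for (int iter = 0; iter < ITER; ++iter) {
    // Re-initialize the send buffer each iteration
    Kokkos::parallel_for("init_sendBuf", N, KOKKOS_LAMBDA(const int i){
        sendBuf(i) = static_cast<real_t>(rank + i * 0.001);
    });
    Kokkos::fence();

    // Same exchange as before; the device pointers now stay constant
    // across iterations instead of pointing to freshly allocated memory.
    if (rank == 0 || rank == 1) {
        MPI_Sendrecv(sendBuf.data(), N, MPI_DOUBLE, sendRank, 1234,
                     recvBuf.data(), N, MPI_DOUBLE, recvRank, 1234,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
// With this variant the growth in GPU memory usage does not occur.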
I think this is related to issues #12971 and #12849.
