Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.5
v5.0.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from the source code downloaded from the official webpage https://www.open-mpi.org/ and configured with:
./configure --prefix=/work/home/packages/openmpi/5.0.5 --with-ucx=/work/home/packages/ucx/1.17.0 --with-rocm=/public/software/compiler/dtk-23.10 --with-devel-headers --enable-mpi-fortran=no --enable-mca-no-build=btl-uct --enable-mpi1-compatibility
Please describe the system on which you are running
- Operating system/version: Rocky Linux 8.8 (Green Obsidian)
- Computer hardware: AMD GPU architecture gfx906
- Network type: not sure; InfiniBand?
Details of the problem
There is a memory leak whenever there is communication between GPU cards; memory is used up very quickly. I wrote a minimal test program to reproduce the issue.
#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <iostream>
#include <vector>
#include <string>

// To simplify, use the same size for sending and receiving.
// Here, assume real_t = double, but it could also be float, etc.
using real_t = double;

int main(int argc, char* argv[])
{
  // Initialize MPI
  MPI_Init(&argc, &argv);
  // Initialize Kokkos
  Kokkos::initialize(argc, argv);
  {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // For simplicity, assume only rank 0 and rank 1 communicate with each other
    if (size < 2) {
      if (rank == 0) {
        std::cerr << "Please run with at least 2 MPI ranks.\n";
      }
      Kokkos::finalize();
      MPI_Finalize();
      return 0;
    }

    // If you want to control the number of test iterations and array size from
    // the command line, you can read the parameters here.
    // Default iteration count (ITER) and array size (N)
    int ITER = 1000;    // Number of communication loops
    int N = 1024 * 64;  // Array size (number of elements). Increase it to observe memory behavior
    if (argc > 1) {
      ITER = std::stoi(argv[1]);
    }
    if (argc > 2) {
      N = std::stoi(argv[2]);
    }

    // Print a hint message
    if (rank == 0) {
      std::cout << "Running test with ITER = " << ITER
                << ", array size = " << N << "\n"
                << "Monitor memory usage in another terminal, etc.\n"
                << std::endl;
    }

    // To distinguish sending destinations and receiving sources:
    //   rank 0 -> sends to rank 1, receives from rank 1
    //   rank 1 -> sends to rank 0, receives from rank 0
    //   Other ranks do not perform actual communication
    int sendRank = (rank == 0) ? 1 : 0;
    int recvRank = (rank == 0) ? 1 : 0;

    // Test loop
    for (int iter = 0; iter < ITER; ++iter) {
      // Allocate a new send buffer (sendBuf) on the GPU.
      // Kokkos::View defaults to the Cuda space (if Kokkos_ENABLE_CUDA is enabled),
      // and for simplicity we do not consider the Layout here.
      Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> sendBuf("sendBuf", N);
      // If we need to receive, allocate a receive buffer (recvBuf)
      Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> recvBuf("recvBuf", N);

      // First, do a simple initialization of sendBuf (parallel loop)
      Kokkos::parallel_for("init_sendBuf", N, KOKKOS_LAMBDA(const int i) {
        sendBuf(i) = static_cast<real_t>(rank + i * 0.001);
      });
      // If synchronization is needed (optional, depends on the MPI+Kokkos implementation)
      Kokkos::fence();

      // MPI communication: rank 0 and rank 1 send and receive data from each other.
      // For simplicity, here we use MPI_Sendrecv.
      if (rank == 0 || rank == 1) {
        MPI_Sendrecv(
            sendBuf.data(),   // Send data pointer
            N,                // Number of elements to send
            MPI_DOUBLE,       // Data type
            sendRank,         // Destination process
            1234,             // Send tag
            recvBuf.data(),   // Receive data pointer
            N,
            MPI_DOUBLE,
            recvRank,         // Source process
            1234,             // Receive tag
            MPI_COMM_WORLD,   // Communicator
            MPI_STATUS_IGNORE
        );
      }

      // (Optional check) verify successful reception.
      // Only rank 0 or rank 1 needs to check.
      if ((iter % 100 == 0) && (rank == 0 || rank == 1)) {
        // Copy recvBuf to the host to check
        auto recvHost = Kokkos::create_mirror_view(recvBuf);
        Kokkos::deep_copy(recvHost, recvBuf);
        // Print some debugging information.
        // In practice, if the iteration count is large, it is better not to print frequently.
        if (iter % 200 == 0) {
          std::cout << "[Rank " << rank << "] Iter " << iter
                    << ", recvBuf(0) = " << recvHost(0) << "\n";
        }
      }

      // At the end of each iteration, sendBuf and recvBuf are no longer used.
      // They are freed when they go out of scope (Kokkos::View RAII).
      // However, whether the MPI/GPU driver immediately reclaims the IPC handle requires monitoring.
    }  // end for(ITER)

    if (rank == 0) {
      std::cout << "Test finished. Check if GPU memory usage grew abnormally.\n";
    }
  }
  Kokkos::finalize();
  MPI_Finalize();
  return 0;
}
I compiled the program with CMake; a CMakeLists.txt could be:
cmake_minimum_required(VERSION 3.10)
project(TestIPCIssue LANGUAGES CXX)
find_package(MPI REQUIRED)
find_package(Kokkos REQUIRED)
add_executable(mpi-test mpi-test.cpp)
target_link_libraries(mpi-test Kokkos::kokkos)
target_link_libraries(mpi-test MPI::MPI_CXX)
target_include_directories(mpi-test PUBLIC ${MPI_CXX_INCLUDE_PATH})
One must have Kokkos installed.
If one uses the same buffers (by moving them out of the for loop) instead of creating new buffers every time before the communication, the issue disappears.
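For reference, a minimal sketch of that workaround variant: it reuses the variable names from the reproducer above (rank, sendRank, recvRank, N, ITER) and only replaces the test loop. It is meant to illustrate the buffer-reuse behavior, not to serve as a fix:

// Workaround sketch: allocate the Kokkos::Views once, before the test loop,
// and reuse them for every MPI_Sendrecv. With this variant the GPU memory
// growth described above does not appear.
Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> sendBuf("sendBuf", N);
Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> recvBuf("recvBuf", N);

for (int iter = 0; iter < ITER; ++iter) {
  // Re-initialize the (reused) send buffer each iteration
  Kokkos::parallel_for("init_sendBuf", N, KOKKOS_LAMBDA(const int i) {
    sendBuf(i) = static_cast<real_t>(rank + i * 0.001);
  });
  Kokkos::fence();

  // Same exchange as in the reproducer, but on the persistent buffers
  if (rank == 0 || rank == 1) {
    MPI_Sendrecv(sendBuf.data(), N, MPI_DOUBLE, sendRank, 1234,
                 recvBuf.data(), N, MPI_DOUBLE, recvRank, 1234,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
}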
I think this is related to issues #12971 and #12849.