Description
Hi! I am using dist.all_to_all_single with torch.distributed and torch_ccl. I found that when the send buffer and receive buffer are large (several gigabytes), a segmentation fault occurs.
Here is the test code:
import torch
import torch.distributed as dist
import numpy as np
import os
import torch_ccl

def init_dist_group():
    world_size = int(os.environ.get("PMI_SIZE", -1))
    rank = int(os.environ.get("PMI_RANK", -1))
    dist_url = "env://"
    dist.init_process_group(backend="ccl", init_method="env://",
                            world_size=world_size, rank=rank)
    assert torch.distributed.is_initialized()
    print(f"dist_info RANK: {dist.get_rank()}, SIZE: {dist.get_world_size()}")
    # number of processes in this MPI group
    world_size = dist.get_world_size()
    # MPI rank in this MPI group
    rank = dist.get_rank()
    return (rank, world_size)

# main function
if __name__ == "__main__":
    rank, world_size = init_dist_group()
    # allocate memory for send_buf and recv_buf
    data_size = 250000
    send_buf = torch.zeros((data_size * (world_size - 1), 172), dtype=torch.float32)
    recv_buf = torch.zeros((data_size * (world_size - 1), 172), dtype=torch.float32)
    send_buf_shape = send_buf.shape
    recv_buf_shape = recv_buf.shape
    print("send_buf.shape = {}, recv_buf.shape = {}".format(send_buf.shape, recv_buf.shape), flush=True)
    send_splits = [data_size for i in range(world_size)]
    recv_splits = [data_size for i in range(world_size)]
    send_splits[rank] = 0
    recv_splits[rank] = 0
    print("rank = {}, send_splits = {}, recv_splits = {}".format(rank, send_splits, recv_splits), flush=True)
    assert(sum(send_splits) == send_buf_shape[0])
    assert(sum(recv_splits) == recv_buf_shape[0])
    assert(len(send_splits) == world_size)
    assert(len(recv_splits) == world_size)
    # all_to_all
    dist.all_to_all_single(recv_buf, send_buf, recv_splits, send_splits)
    print("finish!")
When data_size = 25000, it works well. But when I set data_size = 500000, I get the following error output:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 3564580 RUNNING AT g0118
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 3564581 RUNNING AT g0118
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
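For reference, here is the rough buffer-size arithmetic for these two settings with world_size = 16 and float32 elements (4 bytes each); this is my own estimate, not measured output:

# rough per-buffer size arithmetic for the test above (illustrative only)
world_size = 16
for data_size in (25000, 500000):
    rows = data_size * (world_size - 1)      # rows in send_buf / recv_buf
    nbytes = rows * 172 * 4                  # 172 float32 values per row
    print(data_size, rows, nbytes / 2**30)   # 25000  -> ~0.24 GiB per buffer
                                             # 500000 -> ~4.81 GiB per buffer

So the failing case allocates roughly 4.8 GiB (about 5.16e9 bytes) per buffer, which is beyond the 2,147,483,647-byte range of a signed 32-bit integer. I am only guessing, but an overflow of a 32-bit byte count somewhere in the stack would be consistent with a segmentation fault at this size.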
The version of MPI: intel-mpi/2021.8
The version of PyTorch: 1.10.0 (CPU version)
The version of torch_ccl: 1.10.0
The number of MPI processes: 16; each MPI process is mapped to one socket.
Each compute node has 2 CPU sockets and 387 GB of total memory, so this benchmark is run on 8 compute nodes (16 CPU sockets in total).
The command I use to launch MPI is: mpiexec.hydra -n 16 -ppn 2 python test_alltoall.py
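As a possible workaround, the exchange could be broken into several smaller all_to_all_single calls so that no single call moves more than a few hundred megabytes per peer. The sketch below is my own (the helper name chunked_all_to_all_single and the max_rows_per_call parameter are made up, not part of torch.distributed or torch_ccl) and assumes the split sizes are consistent across ranks, as they are in the test above:

import torch
import torch.distributed as dist

def chunked_all_to_all_single(recv_buf, send_buf, recv_splits, send_splits,
                              max_rows_per_call=100000):
    # Hypothetical helper: same result as one big dist.all_to_all_single call,
    # but performed as several calls of at most max_rows_per_call rows per peer.
    world_size = dist.get_world_size()

    # start offset of each peer's region inside send_buf / recv_buf
    send_offsets = [0]
    recv_offsets = [0]
    for s in send_splits:
        send_offsets.append(send_offsets[-1] + s)
    for r in recv_splits:
        recv_offsets.append(recv_offsets[-1] + r)

    # every rank must issue the same number of collective calls
    n_calls = (max(send_splits + recv_splits) + max_rows_per_call - 1) // max_rows_per_call
    n_calls_t = torch.tensor(n_calls)
    dist.all_reduce(n_calls_t, op=dist.ReduceOp.MAX)
    n_calls = int(n_calls_t.item())

    done_send = [0] * world_size   # rows already sent to each peer
    done_recv = [0] * world_size   # rows already received from each peer
    for _ in range(n_calls):
        cur_send = [min(max_rows_per_call, s - d) for s, d in zip(send_splits, done_send)]
        cur_recv = [min(max_rows_per_call, r - d) for r, d in zip(recv_splits, done_recv)]
        # gather this call's slice of every peer's region into one contiguous buffer
        send_chunk = torch.cat([
            send_buf[send_offsets[p] + done_send[p]:
                     send_offsets[p] + done_send[p] + cur_send[p]]
            for p in range(world_size)
        ])
        recv_chunk = torch.empty((sum(cur_recv), recv_buf.shape[1]), dtype=recv_buf.dtype)
        dist.all_to_all_single(recv_chunk, send_chunk, cur_recv, cur_send)
        # scatter the received rows back into each peer's region of recv_buf
        off = 0
        for p in range(world_size):
            n = cur_recv[p]
            recv_buf[recv_offsets[p] + done_recv[p]:
                     recv_offsets[p] + done_recv[p] + n] = recv_chunk[off:off + n]
            off += n
            done_recv[p] += n
            done_send[p] += cur_send[p]

In the test above this would be called as chunked_all_to_all_single(recv_buf, send_buf, recv_splits, send_splits); with max_rows_per_call = 100000 and world_size = 16, each call moves roughly 1 GB per buffer instead of ~4.8 GB.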