
Segmentation fault when the size of the send buffer and recv buffer is large #49

@zhuangbility111

Description

Hi! I am using dist.all_to_all_single with torch.distributed and torch_ccl. I found that when the send buffer and recv buffer are large (several gigabytes), a segmentation fault occurs.
Here is the test code:

import torch 
import torch.distributed as dist
import numpy as np
import os
import torch_ccl  # importing torch_ccl registers the "ccl" backend with torch.distributed

def init_dist_group():
    world_size = int(os.environ.get("PMI_SIZE", -1))
    rank = int(os.environ.get("PMI_RANK", -1))
    dist_url = "env://"
    dist.init_process_group(backend="ccl", init_method="env://", 
                            world_size=world_size, rank=rank)
    assert torch.distributed.is_initialized()
    print(f"dist_info RANK: {dist.get_rank()}, SIZE: {dist.get_world_size()}")
    # number of process in this MPI group
    world_size = dist.get_world_size() 
    # mpi rank in this MPI group
    rank = dist.get_rank()
    return (rank, world_size)

# main function
if __name__ == "__main__":
    rank, world_size = init_dist_group()

    # allocate memory for send_buf and recv_buf
    data_size = 250000
    send_buf = torch.zeros((data_size * (world_size-1), 172), dtype=torch.float32)
    recv_buf = torch.zeros((data_size * (world_size-1), 172), dtype=torch.float32)
    send_buf_shape = send_buf.shape
    recv_buf_shape = recv_buf.shape

    print("send_buf.shape = {}, recv_buf.shape = {}".format(send_buf.shape, recv_buf.shape), flush=True)

    send_splits = [data_size for i in range(world_size)]
    recv_splits = [data_size for i in range(world_size)]
    send_splits[rank] = 0
    recv_splits[rank] = 0

    print("rank = {}, send_splits = {}, recv_splits = {}".format(rank, send_splits, recv_splits), flush=True)

    assert(sum(send_splits) == send_buf_shape[0])
    assert(sum(recv_splits) == recv_buf_shape[0])
    assert(len(send_splits) == world_size)
    assert(len(recv_splits) == world_size)

    # all_to_all
    dist.all_to_all_single(recv_buf, send_buf, recv_splits, send_splits)

    print("finish!")

When data_size = 25000, it works well. But when I set data_size = 500000, I get this error output:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 10 PID 3564580 RUNNING AT g0118
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 11 PID 3564581 RUNNING AT g0118
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
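
For reference, here is a rough back-of-the-envelope size calculation for the two data_size values (my own arithmetic, assuming world_size = 16 and float32; the remark about the 2 GiB signed 32-bit limit is only a guess about the cause, not something I have confirmed):

# Rough buffer-size arithmetic, assuming world_size = 16 and float32 (4 bytes per element).
world_size = 16
for data_size in (25000, 500000):
    elems = data_size * (world_size - 1) * 172   # rows x 172 features
    gib = elems * 4 / 2**30                      # bytes -> GiB
    print(f"data_size={data_size}: {gib:.2f} GiB per buffer")
# data_size=25000  -> ~0.24 GiB per buffer
# data_size=500000 -> ~4.81 GiB per buffer (well past the 2 GiB a signed 32-bit byte count can hold)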

The version of MPI: intel-mpi/2021.8
The version of PyTorch: 1.10.0 (CPU version)
The version of torch_ccl: 1.10.0
The number of MPI processes: 16; each MPI process is mapped to one socket.
Each compute node has 2 CPU sockets and 387 GB of memory in total, so this benchmark is run on 8 compute nodes (16 sockets).
The command I use to launch MPI is: mpiexec.hydra -n 16 -ppn 2 python test_alltoall.py
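
As a temporary workaround on my side (only a sketch under the assumption that the crash is size-related, not a verified fix; chunked_all_to_all and num_chunks are names I made up), splitting the exchange into several smaller all_to_all_single calls along the feature dimension keeps each call far below the failing size:

import torch
import torch.distributed as dist

def chunked_all_to_all(recv_buf, send_buf, recv_splits, send_splits, num_chunks=8):
    # Split the feature dimension into column blocks and run one all_to_all_single
    # per block, so every call moves a much smaller buffer.
    feat_dim = send_buf.shape[1]
    step = (feat_dim + num_chunks - 1) // num_chunks
    for start in range(0, feat_dim, step):
        end = min(start + step, feat_dim)
        # Column slices are non-contiguous, so copy them into contiguous buffers first.
        send_chunk = send_buf[:, start:end].contiguous()
        recv_chunk = torch.empty((recv_buf.shape[0], end - start), dtype=recv_buf.dtype)
        dist.all_to_all_single(recv_chunk, send_chunk, recv_splits, send_splits)
        recv_buf[:, start:end] = recv_chunk

# drop-in replacement for the single call in the test above:
# chunked_all_to_all(recv_buf, send_buf, recv_splits, send_splits)

The extra copies add overhead, but with the sizes above each individual call stays well under 1 GiB.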
