Description
Hi! I am using dist.all_to_all_single with torch.distributed and torch_ccl. I found that when the send buffer and receive buffer are large (several gigabytes), a segmentation fault occurs.
Here is the test code:
import torch
import torch.distributed as dist
import numpy as np
import os
import torch_ccl

def init_dist_group():
    world_size = int(os.environ.get("PMI_SIZE", -1))
    rank = int(os.environ.get("PMI_RANK", -1))
    dist_url = "env://"
    dist.init_process_group(backend="ccl", init_method="env://",
                            world_size=world_size, rank=rank)
    assert torch.distributed.is_initialized()
    print(f"dist_info RANK: {dist.get_rank()}, SIZE: {dist.get_world_size()}")
    # number of processes in this MPI group
    world_size = dist.get_world_size()
    # MPI rank in this MPI group
    rank = dist.get_rank()
    return (rank, world_size)

# main function
if __name__ == "__main__":
    rank, world_size = init_dist_group()
    # allocate memory for send_buf and recv_buf
    data_size = 250000
    send_buf = torch.zeros((data_size * (world_size - 1), 172), dtype=torch.float32)
    recv_buf = torch.zeros((data_size * (world_size - 1), 172), dtype=torch.float32)
    send_buf_shape = send_buf.shape
    recv_buf_shape = recv_buf.shape
    print("send_buf.shape = {}, recv_buf.shape = {}".format(send_buf.shape, recv_buf.shape), flush=True)
    send_splits = [data_size for i in range(world_size)]
    recv_splits = [data_size for i in range(world_size)]
    send_splits[rank] = 0
    recv_splits[rank] = 0
    print("rank = {}, send_splits = {}, recv_splits = {}".format(rank, send_splits, recv_splits), flush=True)
    assert(sum(send_splits) == send_buf_shape[0])
    assert(sum(recv_splits) == recv_buf_shape[0])
    assert(len(send_splits) == world_size)
    assert(len(recv_splits) == world_size)
    # all_to_all
    dist.all_to_all_single(recv_buf, send_buf, recv_splits, send_splits)
    print("finish!")
When data_size = 25000, it works well. But when I set data_size = 500000, I get the following error output:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 3564580 RUNNING AT g0118
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 3564581 RUNNING AT g0118
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
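For reference, here is the rough buffer-size arithmetic for these two settings with world_size = 16 and float32 elements (4 bytes each); this is my own estimate, not measured output:

# rough per-buffer size arithmetic for the test above (illustrative only)
world_size = 16
for data_size in (25000, 500000):
    rows = data_size * (world_size - 1)      # rows in send_buf / recv_buf
    nbytes = rows * 172 * 4                  # 172 float32 values per row
    print(data_size, rows, nbytes / 2**30)   # 25000  -> ~0.24 GiB per buffer
                                             # 500000 -> ~4.81 GiB per buffer

So the failing case allocates roughly 4.8 GiB (about 5.16e9 bytes) per buffer, which is beyond the 2,147,483,647-byte range of a signed 32-bit integer. I am only guessing, but an overflow of a 32-bit byte count somewhere in the stack would be consistent with a segmentation fault at this size.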
The version of MPI: intel-mpi/2021.8
The version of PyTorch: 1.10.0 (CPU version)
The version of torch_ccl: 1.10.0
The number of MPI processes: 16; each MPI process is mapped to one socket.
Each compute node has 2 CPU sockets and 387 GB of total memory, so this benchmark is run on 8 compute nodes (16 CPU sockets in total).
The command I use to launch MPI is: mpiexec.hydra -n 16 -ppn 2 python test_alltoall.py
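As a possible workaround, the exchange could be broken into several smaller all_to_all_single calls so that no single call moves more than a few hundred megabytes per peer. The sketch below is my own (the helper name chunked_all_to_all_single and the max_rows_per_call parameter are made up, not part of torch.distributed or torch_ccl) and assumes the split sizes are consistent across ranks, as they are in the test above:

import torch
import torch.distributed as dist

def chunked_all_to_all_single(recv_buf, send_buf, recv_splits, send_splits,
                              max_rows_per_call=100000):
    # Hypothetical helper: same result as one big dist.all_to_all_single call,
    # but performed as several calls of at most max_rows_per_call rows per peer.
    world_size = dist.get_world_size()

    # start offset of each peer's region inside send_buf / recv_buf
    send_offsets = [0]
    recv_offsets = [0]
    for s in send_splits:
        send_offsets.append(send_offsets[-1] + s)
    for r in recv_splits:
        recv_offsets.append(recv_offsets[-1] + r)

    # every rank must issue the same number of collective calls
    n_calls = (max(send_splits + recv_splits) + max_rows_per_call - 1) // max_rows_per_call
    n_calls_t = torch.tensor(n_calls)
    dist.all_reduce(n_calls_t, op=dist.ReduceOp.MAX)
    n_calls = int(n_calls_t.item())

    done_send = [0] * world_size   # rows already sent to each peer
    done_recv = [0] * world_size   # rows already received from each peer
    for _ in range(n_calls):
        cur_send = [min(max_rows_per_call, s - d) for s, d in zip(send_splits, done_send)]
        cur_recv = [min(max_rows_per_call, r - d) for r, d in zip(recv_splits, done_recv)]
        # gather this call's slice of every peer's region into one contiguous buffer
        send_chunk = torch.cat([
            send_buf[send_offsets[p] + done_send[p]:
                     send_offsets[p] + done_send[p] + cur_send[p]]
            for p in range(world_size)
        ])
        recv_chunk = torch.empty((sum(cur_recv), recv_buf.shape[1]), dtype=recv_buf.dtype)
        dist.all_to_all_single(recv_chunk, send_chunk, cur_recv, cur_send)
        # scatter the received rows back into each peer's region of recv_buf
        off = 0
        for p in range(world_size):
            n = cur_recv[p]
            recv_buf[recv_offsets[p] + done_recv[p]:
                     recv_offsets[p] + done_recv[p] + n] = recv_chunk[off:off + n]
            off += n
            done_recv[p] += n
            done_send[p] += cur_send[p]

In the test above this would be called as chunked_all_to_all_single(recv_buf, send_buf, recv_splits, send_splits); with max_rows_per_call = 100000 and world_size = 16, each call moves roughly 1 GB per buffer instead of ~4.8 GB.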