DDP(model) gets stuck on a cluster when demo.py is run manually #46

@leonardozcm

Torch/torch-ccl/ipex version 1.13.0
cluster node: 2
World_size: 2
Password-less connections are set up between all nodes, and launching with mpirun works as the README describes:

mpirun -f ./hosts -n 2 -ppn 1 -genv OMP_NUM_THREADS=24 python demo.py 

I then tried to run it manually by starting the training on each node separately:

# in node 0
RANK=0 WORLD_SIZE=2 python demo.py
# in node 1
RANK=1 WORLD_SIZE=2 python demo.py
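
For a manual launch like this, the default env:// rendezvous of torch.distributed reads MASTER_ADDR and MASTER_PORT in addition to RANK and WORLD_SIZE. The commands above only show the last two, so the other two are presumably set elsewhere (for example inside demo.py). A minimal pre-flight sketch, where check_rendezvous_env is a hypothetical helper and not part of torch or torch-ccl:

```python
import os

# Variables read by torch.distributed's env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_rendezvous_env(env=None):
    """Return the rendezvous settings, or raise if any are missing or inconsistent.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise RuntimeError(f"RANK={rank} is outside [0, {world_size})")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": rank,
        "world_size": world_size,
    }
```

Calling check_rendezvous_env() at the top of the script on each node and printing the result makes it easy to confirm that both nodes use the same MASTER_ADDR and MASTER_PORT and differ only in RANK.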

The manual launch gets stuck at DDP(model):

/home/cpx/anaconda3/envs/bigdl_test/lib/python3.7/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
1 2
2023-05-05 16:13:22,973 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-05 16:13:22,984 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023:05:05-16:13:30:(3185397) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
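
The log stops after the CCL_WARN line, so it does not show where the process is blocked. One way to narrow that down is a watchdog built on the standard-library faulthandler module, which prints the Python stacks of all threads if a block of code runs longer than a timeout. A minimal sketch, assuming a hypothetical helper named hang_watchdog (not part of torch or torch-ccl); it reports Python frames only, not native oneCCL frames:

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump all Python thread stacks to `file` if the body exceeds timeout_s seconds.

    Hypothetical helper for illustration. Intended use:
        with hang_watchdog(120):
            model = DDP(model)
    """
    # Arm a one-shot timer; exit=False keeps the process alive after the dump.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)
    try:
        yield
    finally:
        # Disarm the timer once the body has finished (or raised).
        faulthandler.cancel_dump_traceback_later()
```

If the body finishes before the timeout, nothing is printed.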

The hang does not occur if I set dist.init_process_group(backend='gloo').
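
To compare the two backends without editing the script between runs, the backend name can be read from the environment. A minimal sketch, assuming a hypothetical DEMO_BACKEND variable and a hypothetical pick_backend helper; neither exists in torch or torch-ccl:

```python
import os

# Backends compared in this report.
KNOWN_BACKENDS = ("ccl", "gloo")


def pick_backend(env=None, default="ccl"):
    """Return the backend named by the hypothetical DEMO_BACKEND variable."""
    env = os.environ if env is None else env
    backend = env.get("DEMO_BACKEND", default).strip().lower()
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unsupported backend {backend!r}, expected one of {KNOWN_BACKENDS}")
    return backend


# Intended use inside the training script (the 'ccl' backend additionally
# requires the torch-ccl bindings to be imported first):
#     dist.init_process_group(backend=pick_backend())
```

With this in place, DEMO_BACKEND=gloo RANK=0 WORLD_SIZE=2 python demo.py would run the same script on gloo.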
