-
Notifications
You must be signed in to change notification settings - Fork 31
Open
Description
Torch/torch-ccl/ipex version 1.13.0
cluster node: 2
World_size: 2
All nodes have password-less connections set, and mpirun works well as the readme says:
mpirun -f ./hosts -n 2 -ppn 1 -genv OMP_NUM_THREADS=24 python demo.py
And I try to run it manually by start training in both of the nodes:
# in node 0
RANK=0 WORLD_SIZE=2 python demo.py
# in node 1
RANK=1 WORLD_SIZE=2 python demo.py
This will stock at DDP(model):
/home/cpx/anaconda3/envs/bigdl_test/lib/python3.7/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
1 2
2023-05-05 16:13:22,973 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-05 16:13:22,984 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023:05:05-16:13:30:(3185397) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
This will not happen if I set dist.init_process_group(backend='gloo')
Metadata
Metadata
Assignees
Labels
No labels