Description
I'm trying to use torch.distributed.launch to launch multi-node training with oneCCL.
On each node, I installed oneCCL and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh
The command on the 1st node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
The command on the 2nd node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py
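For context, here is a minimal sketch of what demo.py might contain for this kind of setup: a standard init_process_group("ccl") script driven by the rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) that torch.distributed.launch exports. The helper name get_dist_env and the single-process defaults are assumptions for illustration, not part of the original script.

```python
# Hypothetical sketch of demo.py for the launch commands above.
# Assumes oneccl_bindings_for_pytorch is installed and setvars.sh was sourced.
import os


def get_dist_env():
    """Read the rendezvous variables exported by torch.distributed.launch.

    Falls back to single-process defaults so the helper can be inspected
    without a launcher running.
    """
    return (
        os.environ.get("MASTER_ADDR", "127.0.0.1"),
        int(os.environ.get("MASTER_PORT", "29500")),
        int(os.environ.get("RANK", "0")),
        int(os.environ.get("WORLD_SIZE", "1")),
    )


def main():
    # torch and the oneCCL bindings are imported lazily so the helper above
    # stays importable on machines where they are not installed.
    import torch
    import torch.distributed as dist
    import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

    _, _, rank, world_size = get_dist_env()
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

    # Minimal collective to verify the ring is up before real training:
    # after all_reduce, each rank should hold the value world_size.
    t = torch.ones(1)
    dist.all_reduce(t)
    print(f"rank {rank}/{world_size}: all_reduce -> {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__" and "WORLD_SIZE" in os.environ:
    # Only run when a launcher has exported the rendezvous environment.
    main()
```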
But on both nodes, training hangs after these messages:
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
I'm wondering how to use torch.distributed.launch to run multi-node training with oneCCL. Are there any specific settings needed?