Description
I'm trying to use torch.distributed.launch to launch multi-node training with oneCCL.
On each node, I installed oneCCL and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh
The command on the 1st node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
The command on the 2nd node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py
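For context, here is a minimal sketch of what demo.py might contain for this kind of setup: a standard init_process_group("ccl") script driven by the rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) that torch.distributed.launch exports. The helper name get_dist_env and the single-process defaults are assumptions for illustration, not part of the original script.

```python
# Hypothetical sketch of demo.py for the launch commands above.
# Assumes oneccl_bindings_for_pytorch is installed and setvars.sh was sourced.
import os


def get_dist_env():
    """Read the rendezvous variables exported by torch.distributed.launch.

    Falls back to single-process defaults so the helper can be inspected
    without a launcher running.
    """
    return (
        os.environ.get("MASTER_ADDR", "127.0.0.1"),
        int(os.environ.get("MASTER_PORT", "29500")),
        int(os.environ.get("RANK", "0")),
        int(os.environ.get("WORLD_SIZE", "1")),
    )


def main():
    # torch and the oneCCL bindings are imported lazily so the helper above
    # stays importable on machines where they are not installed.
    import torch
    import torch.distributed as dist
    import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

    _, _, rank, world_size = get_dist_env()
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

    # Minimal collective to verify the ring is up before real training:
    # after all_reduce, each rank should hold the value world_size.
    t = torch.ones(1)
    dist.all_reduce(t)
    print(f"rank {rank}/{world_size}: all_reduce -> {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__" and "WORLD_SIZE" in os.environ:
    # Only run when a launcher has exported the rendezvous environment.
    main()
```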
But on both nodes, training hangs after these messages:
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
I'm wondering how to use torch.distributed.launch to run multi-node training with oneCCL. Are there any specific settings needed?