
How to use torch.distributed.launch to run multi-node training with oneCCL #48

@jenniew


I'm trying to use torch.distributed.launch to launch multi-node training with oneCCL.
On each node, I installed oneCCL and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh.
The command on 1st node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
The command on 2nd node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py
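For reference, demo.py boils down to a minimal all-reduce check along the lines of the sketch below (a simplified sketch, not the exact script; the tensor shape and prints are illustrative):

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

# torch.distributed.launch exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
# so env:// rendezvous picks them up without extra arguments.
dist.init_process_group(backend="ccl", init_method="env://")

rank = dist.get_rank()
world_size = dist.get_world_size()

# Simple correctness check: after the all-reduce every element should equal
# 0 + 1 + ... + (world_size - 1).
t = torch.ones(4) * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}/{world_size}: {t}")

dist.destroy_process_group()
```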

But on both nodes, it hung after these messages:
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi

How should I use torch.distributed.launch to run multi-node training with oneCCL? Are there any specific settings I need to configure?
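To make the hang easier to localize, I could also run a diagnostic variant along these lines (a sketch; whether setting CCL_ATL_TRANSPORT from Python is early enough, and whether the ccl backend honors the timeout, are assumptions on my part):

```python
import datetime
import os

# Pin the transport the CCL_WARN falls back to; setting this in the shell
# before launching may be required instead of here (assumption).
os.environ.setdefault("CCL_ATL_TRANSPORT", "ofi")

import torch.distributed as dist  # noqa: E402
import oneccl_bindings_for_pytorch  # noqa: F401,E402  # registers the "ccl" backend

# Print what torch.distributed.launch exported, to rule out rendezvous issues.
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(key, "=", os.environ.get(key))

# A finite timeout can turn a silent hang into an explicit error
# (whether the ccl backend honors it is an assumption).
dist.init_process_group(
    backend="ccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=5),
)
print("init done: rank", dist.get_rank(), "of", dist.get_world_size())
```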
