Closed
Description
Hi
I am trying to get multi-node training to work on a Slurm cluster. Single-node training seems fine, but when I run on multiple nodes I get several NCCL warnings like
graph/search.cc:1135 NCCL WARN Could not find a path for pattern 1, falling back to simple order
and
graph/topo.h:230 NCCL WARN Could not find NET with id 0
and eventually it crashes with
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3376, internal error - please report this issue to the NCCL developers, NCCL version 2.25.1
This is with the LocalExecutor, with the fix from #251 applied for multi-node support.
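In case more detailed logs would help narrow this down, NCCL's verbosity can be raised through its standard environment variables `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` before the process group is created. The following is only a minimal sketch: the helper name `nccl_debug_env` is hypothetical, and the choice of subsystems (`INIT`, `NET`, `GRAPH`) is an assumption based on the warnings above coming from the graph and network code paths.

```python
import os


def nccl_debug_env(subsystems=("INIT", "NET", "GRAPH")):
    """Build the environment variables that make NCCL log verbosely.

    Hypothetical helper for illustration; NCCL_DEBUG and NCCL_DEBUG_SUBSYS
    are standard NCCL environment variables.
    """
    return {
        "NCCL_DEBUG": "INFO",
        # Comma-separated list of NCCL subsystems to log
        "NCCL_DEBUG_SUBSYS": ",".join(subsystems),
    }


# Apply on every rank before the distributed process group is initialized,
# so that NCCL picks the settings up when it starts.
os.environ.update(nccl_debug_env())
```

The same variables could instead be exported in the Slurm batch script, which avoids touching the training code.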
Does anyone know what the problem might be?
best
Barry