
NCCL error when trying to run NeMo training multi node #266

@bhaddow

Description

Hi

I am trying to get multi-node training to work on a Slurm cluster. Single-node training works fine, but when I run multi-node training I get several NCCL warnings like

graph/search.cc:1135 NCCL WARN Could not find a path for pattern 1, falling back to simple order

and

graph/topo.h:230 NCCL WARN Could not find NET with id 0

and eventually it crashes with

torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3376, internal error - please report this issue to the NCCL developers, NCCL version 2.25.1

This is with the LocalExecutor, with the fix from #251 applied for multi-node support.
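In case it helps with diagnosis, I can rerun with NCCL's own debug logging enabled. The sketch below uses standard, documented NCCL environment variables (nothing NeMo-specific); the interface name is a placeholder for whatever NIC the cluster actually uses:

```shell
# Standard NCCL debug settings; export these on every node before launching.
export NCCL_DEBUG=INFO                    # log NCCL's init and topology decisions per rank
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET   # focus on the subsystems named in the warnings above
# If NCCL auto-detects the wrong NIC, pin it explicitly; "eth0" is a
# placeholder for the cluster's actual high-speed interface.
export NCCL_SOCKET_IFNAME=eth0
echo "NCCL_DEBUG=$NCCL_DEBUG"
```

With `NCCL_DEBUG=INFO` each rank logs the network devices and topology it detected, which should give more context for the `Could not find NET with id 0` warning.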

Does anyone know what the problem might be?

best
Barry
