
alltoall performance regression after upgrading from 2021.1-beta07-1 to 1.10 #34

@Peach-He

Description


Hi,
We upgraded torch-ccl from 2021.1-beta07-1 to 1.10 and noticed a performance regression for all_to_all: overall, torch-ccl 1.10 is about 2x slower than 2021.1-beta07-1.
System config:

  • single node, 2 processes per node, so no network communication is involved

Any idea on the root cause?

all_to_all profiling for torch-ccl 1.10:
[profiler output screenshot: all2all-ccl1.10]

all_to_all profiling for torch-ccl 2021.1-beta07-1:
[profiler output screenshot: all2all-ccl2021.1-beta07-1]

Test code:

import torch
import extend_distributed as ext_dist

if __name__ == "__main__":
    ext_dist.init_distributed(backend='ccl')
    input = []
    tensor = torch.ones(262144, 16, dtype=torch.bfloat16)
    input.append(tensor)
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            a2a_req = ext_dist.alltoall(input, None)
            ly_sparse = a2a_req.wait()
    print(prof.key_averages().table(sort_by="cpu_time_total"))

For the extend_distributed helper, please refer to https://github.com/IntelAI/models/blob/master/models/recommendation/pytorch/dlrm/training/bfloat16/extend_distributed.py
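
In case it helps to reproduce without the DLRM repo, here is a minimal standalone sketch that calls torch.distributed.all_to_all_single directly instead of the extend_distributed wrapper. Assumptions: the ccl backend is registered by importing torch_ccl (renamed to oneccl_bindings_for_pytorch in later releases), and RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are provided by the launcher (e.g. torchrun --nproc_per_node=2 or mpiexec with the CCL variables set):

import os
import torch
import torch.distributed as dist
import torch_ccl  # assumption: this import registers the 'ccl' backend

if __name__ == "__main__":
    # Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT come from the launcher,
    # e.g. torchrun --nproc_per_node=2 repro.py
    dist.init_process_group(backend="ccl")
    # Same payload as above: a bf16 tensor of shape (262144, 16), flattened and
    # split evenly across the ranks by all_to_all_single.
    inp = torch.ones(262144, 16, dtype=torch.bfloat16).flatten()
    out = torch.empty_like(inp)
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            dist.all_to_all_single(out, inp)
    print(prof.key_averages().table(sort_by="cpu_time_total"))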

Thanks
