
alltoall performance regression after upgrading from 2021.1-beta07-1 to 1.10 #34

@Peach-He

Description


Hi,
We upgraded torch-ccl from 2021.1-beta07-1 to 1.10 and noticed a performance regression for all_to_all: overall, torch-ccl 1.10 is about 2x slower than 2021.1-beta07-1.
System config:

  • single node, 2 processes per node, so no network communication is involved

Any idea on the root cause?

all_to_all profiling for torch-ccl 1.10:
[profiler output screenshot: all2all-ccl1.10]

all_to_all profiling for torch-ccl 2021.1-beta07-1:
[profiler output screenshot: all2all-ccl2021.1-beta07-1]

Test code:

import torch
import extend_distributed as ext_dist

if __name__ == "__main__":
    ext_dist.init_distributed(backend='ccl')
    input = []
    tensor = torch.ones(262144, 16, dtype=torch.bfloat16)
    input.append(tensor)
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            a2a_req = ext_dist.alltoall(input, None)
            ly_sparse = a2a_req.wait()
    print(prof.key_averages().table(sort_by="cpu_time_total"))

For the extend_distributed helper, please refer to https://github.com/IntelAI/models/blob/master/models/recommendation/pytorch/dlrm/training/bfloat16/extend_distributed.py
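
In case it helps to reproduce without the DLRM repo, here is a minimal standalone sketch that calls torch.distributed.all_to_all_single directly instead of the extend_distributed wrapper. Assumptions: the ccl backend is registered by importing torch_ccl (renamed to oneccl_bindings_for_pytorch in later releases), and RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are provided by the launcher (e.g. torchrun --nproc_per_node=2 or mpiexec with the CCL variables set):

import os
import torch
import torch.distributed as dist
import torch_ccl  # assumption: this import registers the 'ccl' backend

if __name__ == "__main__":
    # Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT come from the launcher,
    # e.g. torchrun --nproc_per_node=2 repro.py
    dist.init_process_group(backend="ccl")
    # Same payload as above: a bf16 tensor of shape (262144, 16), flattened and
    # split evenly across the ranks by all_to_all_single.
    inp = torch.ones(262144, 16, dtype=torch.bfloat16).flatten()
    out = torch.empty_like(inp)
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            dist.all_to_all_single(out, inp)
    print(prof.key_averages().table(sort_by="cpu_time_total"))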

Thanks
