ProcessGroupCCL Destructor Not Correctly Called in PT 1.10

Hi torch-ccl community,

I was trying to run the following code with PT 1.10 and the ccl backend:

``` python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torch_ccl  # importing this registers the "ccl" backend

# Rank, world size and master address are expected to come from the launcher's environment.
dist.init_process_group(backend="ccl")


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))


model = ToyModel()
# The problem only shows up with find_unused_parameters=True.
ddp = DDP(model, find_unused_parameters=True)

inp = torch.randn(1, 10)
out = ddp(inp)
```

When `find_unused_parameters=True`, the destructor of `ProcessGroupCCL` was not called. With `find_unused_parameters=False` there was no issue. This would be fine in most cases because the destructor is empty anyway: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111. However, I am trying to build an extension that needs to release resources in `~ProcessGroupCCL()`. If `~ProcessGroupCCL()` is not called, the process hangs on exit. The issue does not occur in PT 1.9, so it looks like an object lifecycle management issue in PyTorch.
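To make the suspected failure mode concrete, here is a minimal pure-Python sketch of how a leftover strong reference keeps a destructor from running. `FakeProcessGroup` and `FakeReducer` are hypothetical stand-ins, not the real torch-ccl or PyTorch classes, and I have not confirmed that this is what actually happens inside DDP in PT 1.10. The sketch also assumes CPython's reference counting.

```python
import gc

released = []  # names of stand-in groups whose destructor has run


class FakeProcessGroup:
    """Hypothetical stand-in for ProcessGroupCCL (not the real class)."""

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # This is where an extension would release its resources.
        released.append(self.name)


class FakeReducer:
    """Hypothetical holder that keeps a strong reference to the group."""

    def __init__(self, pg):
        self.pg = pg


def destructor_runs_after_del(keep_holder):
    """Return True if the destructor ran as soon as the local name was dropped."""
    released.clear()
    pg = FakeProcessGroup("ccl")
    holder = FakeReducer(pg) if keep_holder else None
    del pg        # drop the only reference owned by this scope
    gc.collect()  # rule out a pending cycle collection
    ran = "ccl" in released
    del holder    # releasing the holder finally lets the destructor run
    return ran


print(destructor_runs_after_del(keep_holder=False))
print(destructor_runs_after_del(keep_holder=True))
```

If something similar is going on here, whatever holds the extra reference to the process group when `find_unused_parameters=True` would have to be released before `~ProcessGroupCCL()` can run.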

Would appreciate any insights and help!
