Hi torch-ccl community,
I was trying to run the following code with PT 1.10 + the ccl backend:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch_ccl

dist.init_process_group(backend="ccl")

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))

model = ToyModel()
ddp = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True)
inp = torch.randn(1, 10)
out = ddp(inp)
With find_unused_parameters=True, the destructor of ProcessGroupCCL is not called; with find_unused_parameters=False there is no issue. In most cases this is harmless because the destructor is currently empty anyway (https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111). However, I am building an extension that needs to release resources in ~ProcessGroupCCL(), and if ~ProcessGroupCCL() is never called, the process hangs on exit. The issue does not exist in PT 1.9, so it looks like an object lifecycle management issue with PyTorch.
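For reference, here is a minimal sketch (assuming it is appended to the end of the repro above, so that out, ddp and model are in scope) that explicitly drops the Python-side references and destroys the default process group; gc.collect() and dist.destroy_process_group() are standard APIs, and this is just one way to check whether ~ProcessGroupCCL() can be reached before interpreter shutdown:

    import gc
    import torch.distributed as dist

    # Drop the Python references held by the DDP wrapper and its Reducer,
    # then release the default process group explicitly.
    del out, ddp, model
    gc.collect()                  # break any reference cycles created by the autograd hooks
    dist.destroy_process_group()  # the last reference to ProcessGroupCCL should be released here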
Would appreciate any insights and help!