ProcessGroupCCL Destructor Not Correctly Called in PT 1.10 #35

@Zha0q1

Hi torch-ccl community,

I was trying to run the following code with PT 1.10 and the ccl backend:

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torch_ccl  # importing this registers the "ccl" backend

dist.init_process_group(backend="ccl")

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))

model = ToyModel()
# The problem only shows up with find_unused_parameters=True.
ddp = DDP(model, find_unused_parameters=True)

inp = torch.randn(1, 10)
out = ddp(inp)

When find_unused_parameters=True, the destructor of ProcessGroupCCL is not called correctly. With find_unused_parameters=False there is no issue. In most cases this would be harmless, because the destructor is empty anyway: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111. However, I am building an extension that needs to release resources in ~ProcessGroupCCL(). If ~ProcessGroupCCL() is not called, the process hangs on exit. The issue does not occur in PT 1.9, so it looks like an object lifetime management issue in PyTorch.

Would appreciate any insights and help!
