Run Trainer.fit multiple times under DDP mode #12401

xmlyqing00 · 2022-03-21T21:42:51Z

Hi,

I have a machine learning architecture project that requires modifying the network structure multiple times. I implemented it with PyTorch Lightning. The overall structure is as follows.

Here is the model definition. I omit training_step and validation_step for clarity.

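import pytorch_lightning as pl
from torch import nn

class ToyModel(pl.LightningModule):
  def __init__(self):
    super().__init__()
    self.list = nn.ModuleList()  # layers are appended here over time
  def forward(self, x):
    # apply the layers in the order they were added
    for op in self.list:
      x = op(x)
    return x
  def add(self):
    self.list.append(nn.Layer(...))  # placeholder for the layer being added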

The following main script shows that I want to update the network structure and retrain the model in 10 iterations.

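from pytorch_lightning import Trainer

model = ToyModel()
for iter in range(10):
  model.add()  # grow the network before each training round
  trainer = Trainer(strategy='ddp', gpus=-1)  # the model is passed to fit(), not to Trainer
  trainer.fit(model)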

When iter == 1, the model has already been copied to the different GPUs, so calling model.add() in every process results in different models. To make sure the modification happens only in the main process, I added a check:

model = ToyModel()
for iter in range(10):
  model.add()
  trainer = Trainer(strategy='ddp', gpus=-1)
  trainer.fit(model)
  if not trainer.is_global_zero:
    break  # leave the loop so the other processes exit

But this time the program gets stuck when iter == 1. My questions are:

I have a feeling that native PyTorch using spawn can do this. Do I need to switch back to PyTorch?
Is there a decent way to do this in PyTorch Lightning? Maybe ddp_spawn?

Thanks for your time. Any comments or suggestions are welcome.

rohitgr7 · 2022-03-22T10:16:56Z

Can you try it with ddp_spawn? ddp launches sub-scripts, i.e. it re-executes your complete script once per device.

3 replies

xmlyqing00 Mar 22, 2022
Author

I tried it, and ddp_spawn behaves as I expected. Thanks for your help.

I have a follow-up question: how can I get the output of a test/validation epoch other than through self.log?

rohitgr7 Mar 23, 2022

I have a follow-up question: how can I get the output of a test/validation epoch other than through self.log?

Sorry, I didn't follow you here. Can you explain more?

xmlyqing00 Apr 21, 2022
Author

Oh, I figured it out. It seems that nothing returned from validation_epoch_end reaches the calling program, so I save the result inside the hook and load it back into the program after training.
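The post does not say how the result was saved, so the following is only one possible sketch of the save-then-load workaround. It assumes the epoch outputs can be reduced to plain Python numbers and lists and written to a JSON file; the helper names and the file name are hypothetical. A file is convenient with ddp_spawn because the hook runs in a worker process, so attributes set there are not visible in the launching process.

```python
import json
import os
import tempfile


def save_epoch_outputs(outputs, path):
    """Write per-epoch results (plain Python numbers and lists) to a JSON file."""
    with open(path, "w") as f:
        json.dump(outputs, f)


def load_epoch_outputs(path):
    """Read the results back, e.g. in the main program after training."""
    with open(path) as f:
        return json.load(f)


# Stand-in for what a validation_epoch_end hook might compute and save.
results_path = os.path.join(tempfile.mkdtemp(), "val_outputs.json")
save_epoch_outputs({"epoch": 3, "val_loss": [0.5, 0.25]}, results_path)

# Later, in the main program, once trainer.fit() has returned:
restored = load_epoch_outputs(results_path)
print(restored["val_loss"])
```

Inside validation_epoch_end, tensors would first need converting to Python values (for example with .item()), and guarding the write with self.trainer.is_global_zero keeps several ranks from writing the same file.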
