Implementing a Metric and including a nn.Module doesn't work correctly in parallel #6693

import-antigravity · 2021-03-26T21:08:21Z

I implemented the FID metric, which involves using a pre-trained Inception network. I have the following code to move it to CUDA:

def __init__(self, ...):
    ...
    self._model = InceptionV3(...)
    if cuda.is_available():
        self._model.to('cuda')

When I train on more than one GPU with DDP, this raises the exception RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable. I'm not sure what's causing it. I know device placement is supposed to be handled automatically, but for some reason it's not working for me.

import-antigravity · 2021-03-31T16:59:22Z

Author

Bump

1 reply

jhschwartz Apr 5, 2022

Did you ever resolve this? Having the same problem.

import-antigravity · 2021-03-31T17:06:12Z

Author

Side note: using self.to() causes the same error.

SkafteNicki · 2021-03-31T18:32:24Z

Collaborator

Are you using a shared cluster or machine by any chance? That error can occur when another user is occupying the GPU resources and the GPUs are set to exclusive mode.

7 replies

SkafteNicki Apr 5, 2021
Collaborator

Normally, in Lightning you should not have to move modules to devices, as we do this automatically for you. What happens if you remove the

    if cuda.is_available():
        self._model.to('cuda')

part of your code?

import-antigravity Apr 5, 2021
Author

If I don’t move the module manually, the module does not get moved by the trainer and I get an error. The trainer really doesn’t have any way of controlling the Metric because it’s part of a callback so I’m not sure how that would work. Also I know that Metric extends nn.Module, so in the current version of my code that line is self.to('cuda').

SkafteNicki Apr 5, 2021
Collaborator

Alright, but since you are running in a multi-GPU setting, calling self.to('cuda') is not correct. It should instead be self.to('cuda:0'), self.to('cuda:1'), etc., depending on which GPU the current process is using.
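
A minimal sketch of the per-process device choice described above. It assumes the DDP launcher exports a LOCAL_RANK environment variable for each process; whether that holds in a given setup is an assumption. The helper name device_for_this_process is hypothetical and not part of Lightning or PyTorch.

```python
import os


def device_for_this_process(cuda_available: bool, env=None) -> str:
    """Return the device string for the current DDP process.

    Hypothetical helper: assumes the launcher sets LOCAL_RANK per process.
    """
    env = os.environ if env is None else env
    if not cuda_available:
        return "cpu"
    # Each process should use its own GPU, not the default 'cuda' device.
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"


# Inside the metric one could then call, for example:
#     self.to(device_for_this_process(torch.cuda.is_available()))
print(device_for_this_process(True, {"LOCAL_RANK": "1"}))  # cuda:1
```

The environment mapping is passed in explicitly here only so the helper can be checked without a real multi-GPU launch.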

import-antigravity Apr 5, 2021
Author

Ah interesting. Do you have a documentation link for that so I can learn more?

Alternatively, is there a way I can somehow register the Metric instance with the Trainer so the trainer can actually handle the device management? Thanks!

SkafteNicki Apr 6, 2021
Collaborator

What I would do in your place is equip the model itself with the Inception network needed to calculate FID, so that it gets moved to the correct device during training. Then, in the callback, you can access it through the pl_module argument that most hooks have.
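
A rough sketch of that suggestion, written with plain torch so it runs without a trainer. The class names ModelWithFIDNet and FIDCallback, the attribute fid_net, and the small linear layer standing in for the Inception network are all hypothetical. In a real project the first class would subclass LightningModule and the second would subclass Callback.

```python
import torch
from torch import nn


class ModelWithFIDNet(nn.Module):  # in Lightning: subclass LightningModule
    def __init__(self, fid_net: nn.Module):
        super().__init__()
        self.generator = nn.Linear(4, 4)  # placeholder for the real model
        # Assigning an nn.Module attribute registers it as a submodule,
        # so it follows the parent whenever the parent is moved to a device.
        self.fid_net = fid_net
        for p in self.fid_net.parameters():
            p.requires_grad_(False)  # the feature extractor is not trained


class FIDCallback:  # in Lightning: subclass Callback
    def on_validation_epoch_end(self, trainer, pl_module):
        # Reach the network through pl_module instead of moving it manually.
        device = next(pl_module.parameters()).device
        images = torch.rand(2, 4, device=device)  # placeholder batch
        with torch.no_grad():
            self.features = pl_module.fid_net(images)


stand_in = nn.Linear(4, 3)  # stand-in for the pre-trained Inception network
model = ModelWithFIDNet(stand_in)
callback = FIDCallback()
callback.on_validation_epoch_end(trainer=None, pl_module=model)
```

Because the feature extractor lives on the module the trainer already manages, the callback never needs to call .to() itself.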
