Implementing a Metric and including a nn.Module doesn't work correctly in parallel #6693
Unanswered
import-antigravity
asked this question in
DDP / multi-GPU / multi-node
Replies: 3 comments 8 replies
-
Sidenote: using |
0 replies
-
Are you using a shared cluster/machine by any chance? That error can occur when another user is using the GPU resources and the GPUs are set to exclusive mode.
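One way to check the exclusive-mode theory is to ask `nvidia-smi` for each GPU's compute mode. The snippet below is only a rough sketch: the helper names are made up for illustration, and the `sample` string is an assumed example of what the CSV output looks like, not output captured from the machine in question.

```python
import shutil
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' CSV lines into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Return {gpu_index: compute_mode}, or None if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_modes(result.stdout)


# Assumed example output for a two-GPU machine (hypothetical values).
sample = "0, Default\n1, Exclusive_Process\n"
exclusive = [i for i, m in parse_compute_modes(sample).items() if m != "Default"]
```

If any GPU reports something other than `Default`, a second process trying to create a context on that GPU may be rejected, which would be consistent with the error in the question.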
7 replies
-
I implemented the FID metric, which involves using a pre-trained Inception network. I have the following code to move it to CUDA:
When I train using more than one GPU in DDP, this causes an exception:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I'm not sure what's causing this. I know that this stuff is supposed to be taken care of automatically, but for some reason it's not working for me.
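For reference, here is a minimal sketch of a metric that holds a network without any explicit CUDA call. It is not the code from the question (which is not shown above) and it does not compute FID: `FeatureMetric`, `feature_dim`, and the toy network are hypothetical stand-ins, and the metric is written as a plain `nn.Module`. The idea it illustrates is that a network assigned as an attribute is registered as a submodule, so moving the metric moves the network, and each DDP process can place it on its own device instead of every process defaulting to the same GPU.

```python
import torch
from torch import nn


class FeatureMetric(nn.Module):
    """Hypothetical stand-in for a metric that owns a feature network."""

    def __init__(self, feature_net: nn.Module, feature_dim: int):
        super().__init__()
        # Assigning an nn.Module attribute registers it as a submodule,
        # so metric.to(device) moves its weights too. No .cuda() call here.
        self.feature_net = feature_net.eval()
        for p in self.feature_net.parameters():
            p.requires_grad_(False)
        # Running state kept as buffers so it follows the module across devices.
        self.register_buffer("feature_sum", torch.zeros(feature_dim))
        self.register_buffer("count", torch.zeros((), dtype=torch.long))

    @torch.no_grad()
    def update(self, images: torch.Tensor) -> None:
        feats = self.feature_net(images)
        self.feature_sum += feats.sum(dim=0)
        self.count += feats.shape[0]

    def compute(self) -> torch.Tensor:
        # Mean feature vector; a real FID would also need covariances.
        return self.feature_sum / self.count.clamp(min=1)


# Toy network standing in for the pre-trained Inception model.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4))
metric = FeatureMetric(net, feature_dim=4)

# CPU here so the sketch runs anywhere; under DDP this would be the
# process's own device rather than a bare .cuda().
device = torch.device("cpu")
metric.to(device)
metric.update(torch.randn(2, 3, 8, 8, device=device))
mean_features = metric.compute()
```

Whether this matches the failure in the question depends on the code that was omitted, so treat it as one thing to compare against rather than a diagnosis.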