DDP MultiGPU Training doesn't reduce training time #18187
AlejandroTL asked this question in DDP / multi-GPU / multi-node (unanswered)
Hello!
I want to do multi-GPU training with a model. I have a node with 4 GPUs. Training with just 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch also takes 9 hours. There is no reduction at all, neither in the number of batches per epoch nor in the time.
The way I am calling the trainer is:
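Roughly, it is a standard Lightning setup along these lines (a minimal, self-contained sketch with a toy module and dataset; the device count and the strategy string are the real ones I use, everything else is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    """Placeholder module; my real model uses a torch.nn.Embedding (see below)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(
    TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)),
    batch_size=64,
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                                   # one node, 4 GPUs
    strategy="ddp_find_unused_parameters_true",  # see the explanation further down
    max_epochs=10,
)
trainer.fit(ToyModule(), train_loader)
```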
Lines I see in the logs when training with 4 GPUs are:
How can I actually check whether I am indeed working with 4 GPUs? I know that my system can see them.
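One thing I considered is printing, from each process at the start of training (e.g. in `on_train_start`), what it can actually see. A sketch using plain `torch` / `torch.distributed`:

```python
import torch
import torch.distributed as dist


def report_distributed_setup() -> None:
    """Print what this process can see; call it e.g. from on_train_start."""
    print(f"visible CUDA devices: {torch.cuda.device_count()}")
    if dist.is_available() and dist.is_initialized():
        print(
            f"torch.distributed initialized: world_size={dist.get_world_size()}, "
            f"rank={dist.get_rank()}"
        )
    else:
        print("torch.distributed is NOT initialized -> effectively single-process")
```

Also, if DDP is really active, Lightning inserts a `DistributedSampler` by default, so each rank should see roughly a quarter of the batches per epoch; a progress bar that still shows the full batch count would suggest the job is effectively running on a single process.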
Finally, I use `ddp_find_unused_parameters_true` instead of `ddp` because I use a `torch.nn.Embedding` and not every minibatch retrieves all of its indices, which apparently causes some problems.

Torch version: `2.0.1+cu117`
PyTorch Lightning version: `2.0.6`
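As far as I understand, that strategy string is just shorthand for configuring the DDP strategy explicitly; a minimal sketch of the equivalent (assuming the `pytorch_lightning` 2.0 API) would be:

```python
from pytorch_lightning.strategies import DDPStrategy

# strategy="ddp_find_unused_parameters_true" should be equivalent to passing an
# explicit DDPStrategy; the kwarg is forwarded to DistributedDataParallel.
strategy = DDPStrategy(find_unused_parameters=True)
```

`find_unused_parameters=True` adds some per-iteration overhead, but I would not expect it to cancel out the benefit of 4 GPUs entirely.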
I solved this problem, but now my script gets stuck at the following point during the construction of the DDP process:
I have seen some GitHub issues with similar problems, but no solution was found. Any ideas?
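In the meantime, to get more verbose output about where it hangs, I was planning to enable the standard NCCL / torch.distributed debug variables before constructing the Trainer (a sketch; these are regular environment variables, nothing Lightning-specific):

```python
import os

# More verbose output from NCCL and torch.distributed while DDP is being set up.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```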
Thanks!!!