Some confusion about DDP mode: do I always have to use self.all_gather()? #18330
Unanswered
Struggle-Forever asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
In DDP mode, I found that when I don't use `self.all_gather()` to collect predictions, the early-stopping mechanism doesn't seem to reduce the monitored metric across GPUs/TPUs.
In the code in question I am not using `self.all_gather()`; I only log with `sync_dist=True`. The monitored metric then seems to come from rank 0 alone. However, this is inconsistent with the model's decision to stop early: when it stops, rank 0 has not yet exhausted its patience.
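The original code from the question isn't reproduced here, but a minimal sketch of this kind of setup might look like the following (the model, layer sizes, and hyperparameters are placeholders, not from the original post):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # sync_dist=True asks Lightning to reduce (mean, by default) the
        # logged value across all DDP ranks before callbacks see it.
        self.log("val_loss", loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# EarlyStopping monitors the value logged above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=3)],
)
```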
When I use `self.all_gather()` instead, it works correctly.
Does this mean that in DDP mode, in order to monitor a metric properly, I must always use `self.all_gather()`?
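For reference, a sketch of the `self.all_gather()` pattern described above (the accumulator attributes `_preds` and `_targets` are hypothetical names, and only the validation-related hooks are shown):

```python
import torch
import pytorch_lightning as pl


class LitModelAllGather(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # Hypothetical per-epoch accumulators (not from the original post).
        self._preds = []
        self._targets = []

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self._preds.append(self(x).argmax(dim=-1))
        self._targets.append(y)

    def on_validation_epoch_end(self):
        preds = torch.cat(self._preds)
        targets = torch.cat(self._targets)
        # self.all_gather returns a tensor with a leading world_size
        # dimension, identical on every rank, so the accuracy below is
        # computed from the predictions of all GPUs.
        preds = self.all_gather(preds).flatten()
        targets = self.all_gather(targets).flatten()
        acc = (preds == targets).float().mean()
        # Every rank logs the same value, so no extra syncing is needed.
        self.log("val_acc", acc)
        self._preds.clear()
        self._targets.clear()
```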
Doesn't `sync_dist=True` collect the metric from all the GPUs? My understanding is that in DDP mode the gradients are synchronized automatically, so the user doesn't need to care about them. But other logged values (e.g., accuracy, loss) don't seem to be synchronized; does the user have to handle this explicitly at all times?
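For context, `sync_dist=True` is documented to reduce the logged value across ranks (mean by default). A rough illustration of what such a mean reduction looks like with raw `torch.distributed` (a simplification for intuition, not Lightning's actual implementation):

```python
import torch
import torch.distributed as dist


def synced_mean(local_value: torch.Tensor) -> torch.Tensor:
    """Average a per-rank scalar across all DDP ranks."""
    value = local_value.clone()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)  # sum over all ranks
    return value / dist.get_world_size()          # then average
```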