Some confusion about DDP mode: do I always have to use self.all_gather()? #18330
Unanswered
Struggle-Forever asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
In DDP mode, I found that when I don't use `self.all_gather()` to collect predictions, the early-stopping mechanism doesn't seem to reduce the monitored metric across GPUs/TPUs.
In the code in question I am not using `self.all_gather()`; I only log with `sync_dist=True`. The monitored metric then seems to come from rank 0 alone. However, this is inconsistent with the model's decision to stop early: when it stops, rank 0 has not yet exhausted its patience.
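The original code from the question isn't reproduced here, but a minimal sketch of this kind of setup might look like the following (the model, layer sizes, and hyperparameters are placeholders, not from the original post):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # sync_dist=True asks Lightning to reduce (mean, by default) the
        # logged value across all DDP ranks before callbacks see it.
        self.log("val_loss", loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# EarlyStopping monitors the value logged above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=3)],
)
```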
When I use `self.all_gather()` instead, it works correctly.
Does this mean that in DDP mode, in order to monitor a metric properly, I must always use `self.all_gather()`?
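For reference, a sketch of the `self.all_gather()` pattern described above (the accumulator attributes `_preds` and `_targets` are hypothetical names, and only the validation-related hooks are shown):

```python
import torch
import pytorch_lightning as pl


class LitModelAllGather(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # Hypothetical per-epoch accumulators (not from the original post).
        self._preds = []
        self._targets = []

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self._preds.append(self(x).argmax(dim=-1))
        self._targets.append(y)

    def on_validation_epoch_end(self):
        preds = torch.cat(self._preds)
        targets = torch.cat(self._targets)
        # self.all_gather returns a tensor with a leading world_size
        # dimension, identical on every rank, so the accuracy below is
        # computed from the predictions of all GPUs.
        preds = self.all_gather(preds).flatten()
        targets = self.all_gather(targets).flatten()
        acc = (preds == targets).float().mean()
        # Every rank logs the same value, so no extra syncing is needed.
        self.log("val_acc", acc)
        self._preds.clear()
        self._targets.clear()
```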
Doesn't `sync_dist=True` collect the metric from all the GPUs? My understanding is that in DDP mode the gradients are synchronized automatically, so the user doesn't need to care about them. But other logged values (e.g., accuracy, loss) don't seem to be synchronized; does the user have to handle this explicitly at all times?
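For context, `sync_dist=True` is documented to reduce the logged value across ranks (mean by default). A rough illustration of what such a mean reduction looks like with raw `torch.distributed` (a simplification for intuition, not Lightning's actual implementation):

```python
import torch
import torch.distributed as dist


def synced_mean(local_value: torch.Tensor) -> torch.Tensor:
    """Average a per-rank scalar across all DDP ranks."""
    value = local_value.clone()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)  # sum over all ranks
    return value / dist.get_world_size()          # then average
```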