Commit 1884d7f

mrshenli authored and pytorchmergebot committed
Avoid CPU Sync in SyncBatchNorm When Capturing CUDA Graphs
We recently updated `SyncBatchNorm` to support empty input batches. The new code removes stats from ranks with empty inputs. However, this change breaks CUDA graph capture, because it forces a CPU sync. This commit uses `is_current_stream_capturing()` to guard the new code path, running it only when no CUDA graph is being captured. To support empty inputs while capturing CUDA graphs, we might need to update the CUDA kernels for `batch_norm_backward_elemt` and `batch_norm_gather_stats_with_counts`. See pytorch#78656.

Fixes pytorch#78549

Pull Request resolved: pytorch#78666
Approved by: https://github.com/albanD
1 parent 1eab34d commit 1884d7f
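
As background, here is a minimal sketch of the guard pattern the fix relies on (not from this commit; the `drop_empty_counts` helper, tensor shapes, and graph setup are illustrative assumptions). It uses `torch.cuda.is_current_stream_capturing()` to skip value-dependent indexing while a CUDA graph is being captured:

```python
# A minimal sketch, assuming a CUDA-capable PyTorch build. The helper
# drop_empty_counts is hypothetical; only the guard mirrors the commit.
import torch

def drop_empty_counts(count_all: torch.Tensor) -> torch.Tensor:
    # count_all[mask] produces a tensor whose shape depends on tensor
    # *values*, so PyTorch must copy the mask's true-count back to the
    # CPU -- a device-to-host sync that is illegal during graph capture.
    if not torch.cuda.is_current_stream_capturing():
        mask = count_all.squeeze(-1) >= 1
        count_all = count_all[mask]
    return count_all

counts = torch.ones(4, 1, device="cuda")   # static input for the graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                  # guard takes the sync-free branch
    total = drop_empty_counts(counts).sum()
g.replay()                                 # replays with no CPU sync
```

Outside capture the masked path still runs, so ranks with empty inputs are filtered exactly as before.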

File tree

1 file changed (+13, −5 lines)


torch/nn/modules/_functions.py

Lines changed: 13 additions & 5 deletions
```diff
@@ -67,11 +67,19 @@ def forward(self, input, weight, bias, running_mean, running_var, eps, momentum,
         # world_size * (2C + 1) -> world_size * C, world_size * C, world_size * 1
         mean_all, invstd_all, count_all = torch.split(combined, num_channels, dim=1)
 
-        # remove stats from empty inputs
-        mask = count_all.squeeze(-1) >= 1
-        count_all = count_all[mask]
-        mean_all = mean_all[mask]
-        invstd_all = invstd_all[mask]
+        if not torch.cuda.is_current_stream_capturing():
+            # The lines below force a synchronization between CUDA and CPU, because
+            # the shape of the result count_all depends on the values in mask tensor.
+            # Such synchronizations break CUDA Graph capturing.
+            # See https://github.com/pytorch/pytorch/issues/78549
+            # FIXME: https://github.com/pytorch/pytorch/issues/78656 describes
+            # a better longer-term solution.
+
+            # remove stats from empty inputs
+            mask = count_all.squeeze(-1) >= 1
+            count_all = count_all[mask]
+            mean_all = mean_all[mask]
+            invstd_all = invstd_all[mask]
 
         # calculate global mean & invstd
         mean, invstd = torch.batch_norm_gather_stats_with_counts(
```
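
To see the sync the guarded lines would otherwise trigger, a small probe (illustration only, not part of the commit; the values are made up) can use `torch.cuda.set_sync_debug_mode` to flag implicit device-to-host synchronizations:

```python
# Probe: make the implicit sync from boolean-mask indexing visible.
# Requires a CUDA-capable PyTorch build.
import torch

count_all = torch.tensor([[2.0], [0.0], [3.0]], device="cuda")
torch.cuda.set_sync_debug_mode("warn")   # warn on synchronizing CUDA ops
mask = count_all.squeeze(-1) >= 1        # elementwise compare: stays on device
count_all = count_all[mask]              # warns: sizing the output syncs with CPU
torch.cuda.set_sync_debug_mode("default")
```

During graph capture the same indexing would fail outright, which is why the commit gates it rather than trying to make it capture-safe.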
