[doc] ddp multigpu tutorial - small updates (#3141)
* [doc] ddp multigpu tutorial - small updates
The main fix is to change `diff` blocks into `python` blocks so that the
user can easily copy/paste the code to run parts of the tutorial.
Also fixes the author's link.
---------
Co-authored-by: Svetlana Karslioglu <[email protected]>
       - How to migrate a single-GPU training script to multi-GPU via DDP
       - Setting up the distributed process group
       - Saving and loading models in a distributed setup
-
+
 .. grid:: 1

    .. grid-item::

       :octicon:`code-square;1.0em;` View the code used in this tutorial on `GitHub <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__
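The checklist above ends with saving and loading models in a distributed setup; in practice that comes down to unwrapping the DDP container and writing the checkpoint from a single rank. A minimal sketch under those assumptions (the function name and file path are illustrative placeholders, not part of this diff):

.. code-block:: python

   import torch
   import torch.distributed as dist

   def save_checkpoint(ddp_model, path="checkpoint.pt"):
       # All replicas hold identical weights, so only rank 0 needs to write.
       if dist.get_rank() == 0:
           # ddp_model.module is the original model wrapped by DDP.
           torch.save(ddp_model.module.state_dict(), path)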
@@ -45,11 +45,11 @@ In the `previous tutorial <ddp_series_theory.html>`__, we got a high-level overv
 In this tutorial, we start with a single-GPU training script and migrate that to running it on 4 GPUs on a single node.
 Along the way, we will talk through important concepts in distributed training while implementing them in our code.

-.. note::
+.. note::
    If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
    layers across replicas.

-   Use the helper function
+   Use the helper function
    `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) <https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm>`__ to convert all ``BatchNorm`` layers in the model to ``SyncBatchNorm``.

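As a concrete illustration of the note above, converting a model's ``BatchNorm`` layers is a single call. This is a minimal sketch, not part of the diff; the toy model is a placeholder, and the conversion only affects running-stat synchronization once a distributed process group is initialized and the model is wrapped in DDP:

.. code-block:: python

   import torch
   import torch.nn as nn

   # Placeholder model containing a BatchNorm layer.
   model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3), nn.BatchNorm2d(16), nn.ReLU())

   # Replace every BatchNorm layer with SyncBatchNorm so that running stats
   # are synchronized across all DDP replicas. Do this before wrapping the
   # model in DistributedDataParallel.
   model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)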
@@ -58,27 +58,27 @@ Diff for `single_gpu.py <https://github.com/pytorch/examples/blob/main/distribut
 These are the changes you typically make to a single-GPU training script to enable DDP.

 Imports
-~~~~~~~
+-------
 - ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
   multiprocessing
 - The distributed process group contains all the processes that can
   communicate and synchronize with each other.

-.. code-block:: diff
+.. code-block:: python

-    import torch
-    import torch.nn.functional as F
-    from utils import MyTrainDataset
+    import torch
+    import torch.nn.functional as F
+    from utils import MyTrainDataset

-   + import torch.multiprocessing as mp
-   + from torch.utils.data.distributed import DistributedSampler
-   + from torch.nn.parallel import DistributedDataParallel as DDP
-   + from torch.distributed import init_process_group, destroy_process_group
-   + import os
+    import torch.multiprocessing as mp
+    from torch.utils.data.distributed import DistributedSampler
+    from torch.nn.parallel import DistributedDataParallel as DDP
+    from torch.distributed import init_process_group, destroy_process_group
+    import os


 Constructing the process group
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+------------------------------

 - First, before initializing the group process, call `set_device <https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html?highlight=set_device#torch.cuda.set_device>`__,
   which sets the default GPU for each process. This is important to prevent hangs or excessive memory utilization on `GPU:0`
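To make the imports and the process-group construction concrete, here is a hedged sketch of how these pieces typically fit together after the changes in this diff; the ``ddp_setup`` name, the ``localhost`` address, and the port number are illustrative placeholders rather than part of the tutorial's diff:

.. code-block:: python

   import os

   import torch
   import torch.multiprocessing as mp
   from torch.distributed import init_process_group, destroy_process_group


   def ddp_setup(rank: int, world_size: int):
       # Rendezvous information for the process group; placeholder values
       # for a single-node run.
       os.environ["MASTER_ADDR"] = "localhost"
       os.environ["MASTER_PORT"] = "12355"
       # Bind this process to its GPU before initializing the group to avoid
       # hangs or extra memory allocations on GPU:0.
       torch.cuda.set_device(rank)
       init_process_group(backend="nccl", rank=rank, world_size=world_size)


   def main(rank: int, world_size: int):
       ddp_setup(rank, world_size)
       # ... build the DataLoader with DistributedSampler, wrap the model in
       # DDP, and run the training loop here ...
       destroy_process_group()


   if __name__ == "__main__":
       world_size = torch.cuda.device_count()
       # torch.multiprocessing spawns one training process per GPU and passes
       # each process its index (the rank) as the first argument to main.
       mp.spawn(main, args=(world_size,), nprocs=world_size)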
@@ -90,66 +90,66 @@ Constructing the process group