
Commit abb8ed6

Kiuk Chung authored and facebook-github-bot committed
(torchx/components) simplify rendezvous parameters for dist.ddp, allow users to pass custom port, always use c10d rendezvous, and pick a free random port for single node (#432)
Summary:
Pull Request resolved: #432

Currently the rdzv_backend, rdzv_endpoint and their respective defaults only work in a specific combination. Example: `--rdzv_backend="etcd"` won't work with the default `--rdzv_endpoint`, since the user needs to run an etcd server on rank0's host, which isn't known at launch time.

This PR simplifies the rdzv parameters by:

1. Always using c10d (for both single- and multi-node). We were defaulting to this anyway, and I doubt rdzv_backend != c10d worked for anyone out of the box.
2. Breaking `rdzv_endpoint` into `rdzv_host` and `rdzv_port`, hard-coding rdzv_host to `TORCHX_RANK0_HOST`, and defaulting `rdzv_port=29500` while still giving the user a way to override it based on their firewall settings.
3. Ignoring `rdzv_port` for single-node launches and using `localhost:0`, which lets elastic choose a free random port. This enables running multiple single-node jobs locally without a port conflict (e.g. four -j 1x2 jobs on a devgpu with 8 GPUs).
4. Improving documentation of the component.

Reviewed By: d4l3k

Differential Revision: D35085959

fbshipit-source-id: 422c09b51686558cd6694af1e4c8ea135aa27bb6
1 parent e3bd347 commit abb8ed6
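
As a quick illustration of the simplified interface described above, here is a minimal Python sketch. Only the `ddp` component's `script`, `j`, and new `rdzv_port` parameters are taken from the torchx/components/dist.py diff below; the script name and port value are placeholders.

    from torchx.components.dist import ddp

    # multi-node: c10d rendezvous hosted on rank0's host at the given port
    # (rdzv_port defaults to 29500 when omitted)
    app = ddp(script="main.py", j="2x4", rdzv_port=29501)

    # single node: rdzv_port is ignored and torchelastic picks a free random
    # port, so several single-node jobs can share one host without conflicts
    single_node_app = ddp(script="main.py", j="1x2")

Both calls return a specs.AppDef that can be submitted to any TorchX scheduler.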

File tree: 2 files changed, +23 -89 lines changed

torchx/components/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -91,7 +91,7 @@ def ddp(
             args=[
                 "-m",
                 "torch.distributed.run",
-                "--rdzv_backend=etcd",
+                "--rdzv_backend=c10d",
                 "--rdzv_endpoint=localhost:5900",
                 f"--nnodes={nnodes}",
                 f"--nprocs_per_node={nprocs_per_node}",

torchx/components/dist.py

Lines changed: 22 additions & 88 deletions
@@ -35,96 +35,25 @@
     $ torchx run -s local_cwd dist.ddp -j 1x4 --script main.py
 
     # locally, 2 node x 4 workers (8 total)
-    # remote (needs you to start a local etcd server on port 2379! and have a `python-etcd` library installed)
-    $ torchx run -s local_cwd dist.ddp
-        -j 2x4 \\
-        --rdzv_endpoint localhost:2379 \\
-        --script main.py \\
+    $ torchx run -s local_cwd dist.ddp -j 2x4 --script main.py
 
-    # remote (needs you to setup an etcd server first!)
+    # remote (optionally pass --rdzv_port to use a different master port than the default 29500)
     $ torchx run -s kubernetes -cfg queue=default dist.ddp \\
         -j 2x4 \\
         --script main.py \\
 
 
-There is a lot happening under the hood so we strongly encourage you
-to continue reading the rest of this section
-to get an understanding of how everything works. Also note that
-while ``dist.ddp`` is convenient, you'll find that authoring your own
-distributed component is not only easy (simplest way is to just copy
-``dist.ddp``!) but also leads to better flexbility and maintainability
-down the road since builtin APIs are subject to more changes than
-the more stable specs API. However the choice is yours, feel free to rely on
-the builtins if they meet your needs.
-
-Distributed Training
------------------------
-
-Local Testing
-===================
-
-.. note:: Please follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples` first.
-
-
-Running distributed training locally is a quick way to validate your
-training script. TorchX's local scheduler will create a process per
-replica (``--nodes``). The example below uses `torchelastic <https://pytorch.org/docs/stable/elastic/run.html>`_,
-as the main entrypoint of each node, which in turn spawns ``--nprocs_per_node`` number
-of trainers. In total you'll see ``nnodes*nprocs_per_node`` trainer processes and ``nnodes``
-elastic agent procesess created on your local host.
-
-.. code:: shell-session
-
-    $ torchx run -s local_docker ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
-        --nnodes 2 \\
-        --nproc_per_node 2 \\
-        --rdzv_backend c10d \\
-        --rdzv_endpoint localhost:29500
-
-
-Remote Launching
-====================
-
-.. note:: Please follow the :ref:`schedulers/kubernetes:Prerequisites` first.
-
-The following example demonstrate launching the same job remotely on kubernetes.
-
-.. code:: shell-session
-
-    $ torchx run -s kubernetes -cfg queue=default \\
-        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
-        --nnodes 2 \\
-        --nproc_per_node 2 \\
-        --rdzv_backend etcd \\
-        --rdzv_endpoint etcd-server.default.svc.cluster.local:2379
-    torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
-    torchx 2021-10-18 18:46:55 INFO AppStatus:
-      msg: <NONE>
-      num_restarts: -1
-      roles: []
-      state: PENDING (2)
-      structured_error_msg: <NONE>
-      ui_url: null
-
-    torchx 2021-10-18 18:46:55 INFO Job URL: None
-
-Note that the only difference compared to the local launch is the scheduler (``-s``)
-and ``--rdzv_backend``. etcd will also work in the local case, but we used ``c10d``
-since it does not require additional setup. Note that this is a torchelastic requirement
-not TorchX. Read more about rendezvous `here <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.
-
-.. note:: For GPU training, keep ``nproc_per_node`` equal to the amount of GPUs on the host and
-          change the resource requirements in ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist``
-          method. Modify ``resource_def`` to the number of GPUs that your host has.
-
+Note that the only difference compared to the local launch is the scheduler (``-s``).
+The ``dist.ddp`` builtin uses ``torchelastic`` (more specifically ``torch.distributed.run``)
+under the hood. Read more about torchelastic `here <https://pytorch.org/docs/stable/elastic/run.html>`_.
 
 Components APIs
 -----------------
 """
 import os
 import shlex
 from pathlib import Path
-from typing import Dict, Iterable, Optional, List
+from typing import Dict, Iterable, List, Optional
 
 import torchx
 import torchx.specs as specs
@@ -144,14 +73,16 @@ def ddp(
     j: str = "1x2",
     env: Optional[Dict[str, str]] = None,
     max_retries: int = 0,
-    rdzv_backend: str = "c10d",
-    rdzv_endpoint: Optional[str] = None,
+    rdzv_port: int = 29500,
     mounts: Optional[List[str]] = None,
 ) -> specs.AppDef:
     """
     Distributed data parallel style application (one role, multi-replica).
     Uses `torch.distributed.run <https://pytorch.org/docs/stable/distributed.elastic.html>`_
-    to launch and coordinate pytorch worker processes.
+    to launch and coordinate PyTorch worker processes. Defaults to using ``c10d`` rendezvous backend
+    on rendezvous_endpoint ``$rank_0_host:$rdzv_port``. Note that ``rdzv_port`` parameter is ignored
+    when running on single node, and instead we use port 0 which instructs torchelastic to chose
+    a free random port on the host.
 
     Note: (cpu, gpu, memMB) parameters are mutually exclusive with ``h`` (named resource) where
     ``h`` takes precedence if specified for setting resource requirements.
@@ -170,9 +101,11 @@ def ddp(
         j: {nnodes}x{nproc_per_node}, for gpu hosts, nproc_per_node must not exceed num gpus
         env: environment varibles to be passed to the run (e.g. ENV1=v1,ENV2=v2,ENV3=v3)
         max_retries: the number of scheduler retries allowed
-        rdzv_backend: rendezvous backend (only matters when nnodes > 1)
-        rdzv_endpoint: rendezvous server endpoint (only matters when nnodes > 1), defaults to rank0 host for schedulers that support it
-        mounts: mounts to mount into the worker environment/container (ex. type=<bind/volume>,src=/host,dst=/job[,readonly]). See scheduler documentation for more info.
+        rdzv_port: the port on rank0's host to use for hosting the c10d store used for rendezvous.
+            Only takes effect when running multi-node. When running single node, this parameter
+            is ignored and a random free port is chosen.
+        mounts: mounts to mount into the worker environment/container (ex. type=<bind/volume>,src=/host,dst=/job[,readonly]).
+            See scheduler documentation for more info.
     """
 
     if (script is None) == (m is None):
@@ -196,12 +129,13 @@ def ddp(
     else:
         raise ValueError("failed to compute role_name")
 
-    if rdzv_endpoint is None:
-        rdzv_endpoint = _noquote(f"$${macros.rank0_env}:29500")
-
+    rdzv_backend = "c10d"
     if nnodes == 1:
-        rdzv_backend = "c10d"
-        rdzv_endpoint = "localhost:29500"
+        # using port 0 makes elastic chose a free random port which is ok
+        # for single-node jobs since all workers run under a single agent
+        rdzv_endpoint = "localhost:0"
+    else:
+        rdzv_endpoint = _noquote(f"$${macros.rank0_env}:{rdzv_port}")
 
     if env is None:
         env = {}
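
For reference, a standalone sketch (not part of the commit) of the rendezvous selection that the final dist.py hunk implements. The `TORCHX_RANK0_HOST` name comes from the commit summary; the helper function below is purely illustrative.

    from typing import List

    def rdzv_args(nnodes: int, rdzv_port: int = 29500) -> List[str]:
        # c10d is always used, for both single- and multi-node jobs
        rdzv_backend = "c10d"
        if nnodes == 1:
            # port 0 tells torchelastic to pick a free random port, so multiple
            # single-node jobs can run on one host without port conflicts
            rdzv_endpoint = "localhost:0"
        else:
            # multi-node: host the c10d store on rank0 at the user-chosen port
            rdzv_endpoint = f"$TORCHX_RANK0_HOST:{rdzv_port}"
        return [f"--rdzv_backend={rdzv_backend}", f"--rdzv_endpoint={rdzv_endpoint}"]

    print(rdzv_args(1))  # ['--rdzv_backend=c10d', '--rdzv_endpoint=localhost:0']
    print(rdzv_args(4))  # ['--rdzv_backend=c10d', '--rdzv_endpoint=$TORCHX_RANK0_HOST:29500']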
