(torchx/components) simplify rendezvous parameters for dist.ddp, allow users to pass custom port, always use c10d rendezvous, and pick a free random port for single node (#432)
Summary:
Pull Request resolved: #432
Currently the rdzv_backend, rdzv_endpoint and their respective defaults only work in a specific combination.
Example: `--rdzv_backend="etcd"` won't work with the default `--rdzv_endpoint` since the user needs to run etcd server on the rank0's host, which isn't known at launch time.
This PR simplifies rdzv parameters by:
1. always using c10d (for both single and multi-node). We were defaulting to this anyway, and I doubt rdzv_backend != c10d worked for anyone out of the box.
1. breaking `rdzv_endpoint` into `rdzv_host` and `rdzv_port`, hard-coding `rdzv_host` to `TORCHX_RANK0_HOST`, and defaulting `rdzv_port=29500` while still giving the user a way to override it based on their firewall settings.
1. ignoring `rdzv_port` for single-node launches and using `localhost:0`, which lets elastic choose a free random port. This enables running multiple single-node jobs locally without a port conflict (e.g. four jobs of `-j 1x2` on a devgpu with 8 GPUs).
1. improving the documentation of the component.
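The endpoint selection described in the list above can be sketched as follows. This is a minimal illustration, not the component's actual code: the helper names `parse_j` and `select_rdzv_endpoint` are hypothetical, and the rank0 host is passed in as a plain string rather than resolved from `TORCHX_RANK0_HOST` at launch time.

```python
from typing import Tuple

DEFAULT_RDZV_PORT = 29500  # default port named in the summary above


def parse_j(j: str) -> Tuple[int, int]:
    """Split a ``{nnodes}x{nproc_per_node}`` string such as "2x4" (hypothetical helper)."""
    nnodes, nproc_per_node = j.split("x")
    return int(nnodes), int(nproc_per_node)


def select_rdzv_endpoint(j: str, rank0_host: str, rdzv_port: int = DEFAULT_RDZV_PORT) -> str:
    """Return the rendezvous endpoint for a job of shape ``j`` (hypothetical helper)."""
    nnodes, _ = parse_j(j)
    if nnodes == 1:
        # Single node: rdzv_port is ignored and port 0 asks for a free random port.
        return "localhost:0"
    # Multi-node: the store is hosted on rank0's host at the user-selectable port.
    return f"{rank0_host}:{rdzv_port}"
```

Under these assumptions, `select_rdzv_endpoint("1x2", "node0", 29501)` ignores the custom port and yields `localhost:0`, while `select_rdzv_endpoint("2x4", "node0", 29501)` yields `node0:29501`.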
Reviewed By: d4l3k
Differential Revision: D35085959
fbshipit-source-id: 422c09b51686558cd6694af1e4c8ea135aa27bb6
- to launch and coordinate pytorch worker processes.
+ to launch and coordinate PyTorch worker processes. Defaults to using ``c10d`` rendezvous backend
+ on rendezvous_endpoint ``$rank_0_host:$rdzv_port``. Note that ``rdzv_port`` parameter is ignored
+ when running on single node, and instead we use port 0 which instructs torchelastic to choose
+ a free random port on the host.

  Note: (cpu, gpu, memMB) parameters are mutually exclusive with ``h`` (named resource) where
  ``h`` takes precedence if specified for setting resource requirements.
@@ -170,9 +101,11 @@ def ddp(
  j: {nnodes}x{nproc_per_node}, for gpu hosts, nproc_per_node must not exceed num gpus
  env: environment variables to be passed to the run (e.g. ENV1=v1,ENV2=v2,ENV3=v3)
  max_retries: the number of scheduler retries allowed
- rdzv_backend: rendezvous backend (only matters when nnodes > 1)
- rdzv_endpoint: rendezvous server endpoint (only matters when nnodes > 1), defaults to rank0 host for schedulers that support it
- mounts: mounts to mount into the worker environment/container (ex. type=<bind/volume>,src=/host,dst=/job[,readonly]). See scheduler documentation for more info.
+ rdzv_port: the port on rank0's host to use for hosting the c10d store used for rendezvous.
+     Only takes effect when running multi-node. When running single node, this parameter
+     is ignored and a random free port is chosen.
+ mounts: mounts to mount into the worker environment/container (ex. type=<bind/volume>,src=/host,dst=/job[,readonly]).
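
The "port 0 means a free random port" behavior mentioned in the docstring comes from the operating system: binding a socket to port 0 makes the OS assign an unused ephemeral port. The sketch below demonstrates this with the standard `socket` module only. The helper name `pick_free_port` is hypothetical and is not part of torchx or torchelastic.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Bind to port 0 and report the port the OS assigned (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0: let the OS choose an unused port
        return sock.getsockname()[1]
```

This helper is for illustration only. It releases the port when it returns, so another process could claim the port before it is reused. Passing port 0 directly to the process that hosts the store avoids that gap, because the port is chosen and bound in one step.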