Remove breaking torchrun config for single-node runs #292

Merged
hemildesai merged 2 commits into NVIDIA-NeMo:main from ri-roee:fix-single-node-launcher
Jul 25, 2025

@ri-roee
Contributor

@ri-roee ri-roee commented Jul 15, 2025

I suspect there were upstream changes between now and when this code block was originally written, because it breaks single-node torchrun runs today. I tested this by overwriting this config change locally, and my launcher=torchrun run (on a single node in k8s) started working. For context, the error I was originally seeing was `[default7]:[I715 22:47:11.296780083 socket.cpp:872] [c10d] No socket on (localhost, 0) is listening yet, will retry.`, which led me to believe that the rdzv backend was misconfigured, which in turn led me to the overwriting of c10d.

Signed-off-by: Roee Landesman <roeeland@cisco.com>
@ri-roee ri-roee force-pushed the fix-single-node-launcher branch from c95855d to 1f4dfb3 Compare July 15, 2025 23:29
@hemildesai
Contributor

Did you test this on multi-node? This workaround was added for supporting torchrun on multi-node K8s as dynamic rendezvous was running into connection errors.
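For readers unfamiliar with the two rendezvous modes discussed here, the difference can be sketched as a small helper that builds torchrun flags. This is only an illustration under stated assumptions: `build_torchrun_args`, its parameters, and the default port and job id are hypothetical and are not taken from NeMo-Run or SkyPilot; only the torchrun flag names themselves are real.

```python
from typing import List, Optional


def build_torchrun_args(
    nnodes: int,
    nproc_per_node: int,
    head_addr: str = "localhost",
    port: int = 29500,
    node_rank: Optional[int] = None,
    job_id: str = "example-job",
    use_static_rdzv: bool = False,
) -> List[str]:
    """Hypothetical helper: return torchrun flags for one rendezvous mode.

    Not NeMo-Run's implementation; names and defaults are assumptions.
    """
    args = [f"--nnodes={nnodes}", f"--nproc_per_node={nproc_per_node}"]
    if use_static_rdzv:
        # Static rendezvous: every node must be told its own rank and the
        # address/port of the head node up front.
        if node_rank is None:
            raise ValueError("static rendezvous needs an explicit node_rank")
        args += [
            "--rdzv_backend=static",
            f"--master_addr={head_addr}",
            f"--master_port={port}",
            f"--node_rank={node_rank}",
        ]
    else:
        # Dynamic (c10d) rendezvous: nodes discover each other through a
        # shared endpoint and job id; ranks are assigned at rendezvous time.
        args += [
            "--rdzv_backend=c10d",
            f"--rdzv_endpoint={head_addr}:{port}",
            f"--rdzv_id={job_id}",
        ]
    return args


if __name__ == "__main__":
    # Single-node, dynamic rendezvous
    print(" ".join(["torchrun", *build_torchrun_args(1, 8), "train.py"]))
```

In this sketch the static branch is the kind of configuration the workaround forced, and the c10d branch is the dynamic rendezvous that the multi-node K8s setup was having trouble with.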

@ri-roee
Contributor Author

ri-roee commented Jul 16, 2025

Shoot, you're right! This means that torchrun on single-node k8s clusters is broken with this logic, though. I'll have to dig deeper to understand the root cause.

@ri-roee ri-roee closed this Jul 16, 2025
@romilbhardwaj
Contributor

Hey folks! 👋 SkyPilot recently fixed dynamic rdzv discovery for multi-node in this PR: skypilot-org/skypilot#5960. So using static rdzv should no longer be required. The fix is available in SkyPilot 0.10.

@ri-roee
Contributor Author

ri-roee commented Jul 25, 2025

Amazing, thank you @romilbhardwaj! Recently upgraded #297, so re-opening PR

Signed-off-by: Roee Landesman <roeeland@cisco.com>
@hemildesai hemildesai merged commit 0a25fd6 into NVIDIA-NeMo:main Jul 25, 2025
19 of 21 checks passed
zoeyz101 pushed a commit to zoeyz101/NeMo-Run that referenced this pull request Nov 12, 2025
* remove breaking torchrun config for single-node runs

Signed-off-by: Roee Landesman <roeeland@cisco.com>

* fix lint

Signed-off-by: Roee Landesman <roeeland@cisco.com>

---------

Signed-off-by: Roee Landesman <roeeland@cisco.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
3 participants