Fix for multigpu runs with docker executor #134

Merged
Kipok merged 2 commits into NVIDIA-NeMo:main from Kipok:igitman/docker-fix
Jan 8, 2025

Kipok commented Jan 8, 2025

Without this change, multi-GPU runs with the docker executor fail with the following error:

nemo-run-0/0 --------------------------------------------------------------------------
nemo-run-0/0 While trying to create a regular expression of the node names
nemo-run-0/0 used in this application, the regex parser has detected the
nemo-run-0/0 presence of an illegal character in the following node name:
nemo-run-0/0
nemo-run-0/0   node:  eval_1736300653_nemo-run-0
nemo-run-0/0
nemo-run-0/0 Node names must be composed of a combination of ascii letters,
nemo-run-0/0 digits, dots, and the hyphen ('-') character. See the following
nemo-run-0/0 for an explanation:
nemo-run-0/0
nemo-run-0/0   https://en.wikipedia.org/wiki/Hostname
nemo-run-0/0
nemo-run-0/0 Please correct the error and try again.
nemo-run-0/0 --------------------------------------------------------------------------
nemo-run-0/0 --------------------------------------------------------------------------
nemo-run-0/0 An internal error has occurred in ORTE:
nemo-run-0/0
nemo-run-0/0 [[41165,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/base/plm_base_launch_support.c(555)
nemo-run-0/0
nemo-run-0/0 This is something that should be reported to the developers.
nemo-run-0/0 --------------------------------------------------------------------------
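
The log above rejects the node name `eval_1736300653_nemo-run-0` because it contains underscores, which fall outside the letters, digits, dots, and hyphens the message allows. As a rough illustration only (not necessarily what this PR changes), a name could be checked and sanitized along these lines. The helper names `is_valid_node_name` and `sanitize_node_name` are hypothetical, and the check covers only the character set quoted in the error message, not length or other hostname rules.

```python
import re

# Characters outside the set named in the error message:
# ASCII letters, digits, dots, and the hyphen.
_ILLEGAL = re.compile(r"[^A-Za-z0-9.-]")


def is_valid_node_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, dots, and hyphens."""
    return bool(name) and _ILLEGAL.search(name) is None


def sanitize_node_name(name: str) -> str:
    """Replace each illegal character (such as '_') with a hyphen."""
    cleaned = _ILLEGAL.sub("-", name)
    # Trim leading/trailing separators left over after the replacement.
    return cleaned.strip("-.")


if __name__ == "__main__":
    bad = "eval_1736300653_nemo-run-0"  # node name taken from the log above
    print(is_valid_node_name(bad))      # False
    print(sanitize_node_name(bad))      # eval-1736300653-nemo-run-0
```

Replacing rather than dropping the illegal characters keeps the sanitized name readable and close to the original, which makes log lines easier to match back to the run.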

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Kipok requested a review from hemildesai on January 8, 2025, 01:51
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Kipok merged commit c2a4229 into NVIDIA-NeMo:main on Jan 8, 2025
7 checks passed