New slurm customization parameters (account, containers) #1209
Signed-off-by: Igor Gitman <igitman@nvidia.com>
📝 Walkthrough
This PR extends the nemo_skills pipeline infrastructure to support Slurm account specification and container image overrides across multiple CLI commands and task creation paths. New optional parameters are added to the convert, generate, eval, run_cmd, and start_server commands, with values threaded through to task submission and the underlying HardwareConfig and executor infrastructure.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
nemo_skills/pipeline/nemo_evaluator.py (1)
560-590: ⚠️ Potential issue | 🟡 Minor
Don't silently ignore `exclusive`.
The parameter is accepted and threaded through, but never applied. Either honor it via `sbatch_kwargs` or fail fast when it's set so users don't think they're getting exclusive nodes.
Suggested fail-fast guard
```diff
 def _hardware_for_group(
     partition: Optional[str],
     account: Optional[str],
     num_gpus: Optional[int],
     num_nodes: int,
     qos: Optional[str],
     exclusive: bool,
 ) -> HardwareConfig:
+    if exclusive:
+        raise ValueError("exclusive is not supported for nemo_evaluator jobs yet; remove --exclusive.")
     return HardwareConfig(
         partition=partition,
         account=account,
         num_gpus=num_gpus,
         num_nodes=num_nodes,
```
As per coding guidelines, avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided.
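The other option mentioned above, honoring `exclusive` via `sbatch_kwargs`, could look roughly like the sketch below. The helper name `with_exclusive` is hypothetical, and it assumes that an `"exclusive": True` entry in `sbatch_kwargs` ends up as Slurm's `--exclusive` flag at submission time; that mapping is not confirmed by this PR.

```python
from typing import Optional


def with_exclusive(sbatch_kwargs: Optional[dict], exclusive: bool) -> dict:
    """Return a copy of sbatch_kwargs with the exclusive request applied.

    Hypothetical helper: assumes the executor turns {"exclusive": True}
    into `sbatch --exclusive`.
    """
    merged = dict(sbatch_kwargs or {})  # copy so the caller's dict is untouched
    if exclusive:
        merged["exclusive"] = True
    return merged
```

With a helper like this the flag is either applied or absent, so there is no path where it is accepted and dropped.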
nemo_skills/pipeline/eval.py (1)
816-866: ⚠️ Potential issue | 🟠 Major
Account override is missing for summarize/compute-score tasks.
When a user specifies `--account`, these Slurm tasks still run under the default account and can fail on clusters without a default. Please propagate `account=account` in both `add_task` calls.
🔧 Proposed fix
```diff
 summarize_task = pipeline_utils.add_task(
     exp,
     cmd=command,
     task_name=f"{expname}-{benchmark}-summarize-results",
     log_dir=f"{output_dir}/{benchmark_args.eval_subfolder}/summarized-results",
     container=cluster_config["containers"]["nemo-skills"],
     cluster_config=cluster_config,
+    account=account,
     run_after=run_after,
     reuse_code_exp=reuse_code_exp,
     reuse_code=reuse_code,
     task_dependencies=(
         dependent_tasks if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
     ),
     installation_command=installation_command,
     skip_hf_home_check=skip_hf_home_check,
     sbatch_kwargs=sbatch_kwargs,
 )
@@
 score_task = pipeline_utils.add_task(
     exp,
     cmd=command,
     task_name=f"{expname}-{group}-compute-score",
     log_dir=f"{output_dir}/eval-results/{group}/compute-score-logs",
     container=cluster_config["containers"]["nemo-skills"],
     cluster_config=cluster_config,
+    account=account,
     run_after=run_after,
     reuse_code_exp=reuse_code_exp,
     reuse_code=reuse_code,
     task_dependencies=(
         group_tasks[group] if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
     ),
     installation_command=installation_command,
     skip_hf_home_check=skip_hf_home_check,
     sbatch_kwargs=sbatch_kwargs,
 )
```
As per coding guidelines, avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided. Use dataclasses or **kwargs syntax to handle this automatically.
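One way to read the "use dataclasses or **kwargs" guideline for this case is to bundle the Slurm options once and unpack them into every task call, so a new option such as `account` cannot be forgotten at one call site. The sketch below is illustrative only: `SlurmTaskOptions` and `add_task_stub` are hypothetical stand-ins, not the real `pipeline_utils.add_task` signature.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SlurmTaskOptions:
    """Hypothetical bundle of options shared by every Slurm task in a run."""

    account: Optional[str] = None
    sbatch_kwargs: Optional[dict] = None


def add_task_stub(task_name: str, **kwargs) -> dict:
    """Stand-in for a task-creation call; just records what it was given."""
    return {"task_name": task_name, **kwargs}


opts = SlurmTaskOptions(account="my_account", sbatch_kwargs={"qos": "high"})

# Both tasks receive the same options, so neither can miss the account.
summarize_task = add_task_stub("summarize-results", **asdict(opts))
score_task = add_task_stub("compute-score", **asdict(opts))
```

A misspelled field such as `SlurmTaskOptions(acount="x")` raises `TypeError` at construction, which is the fail-fast behavior the guideline asks for.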
gwarmstrong left a comment
In general looks good. Have a minor comment about goals for the future with this, but I don't think it requires action.
```python
main_container: str = typer.Option(None, help="Override container image for the main evaluation client"),
sandbox_container: str = typer.Option(None, help="Override container image for the sandbox"),
judge_container: str = typer.Option(None, help="Override container image for GPU-based judges (comet, nvembed)"),
```
I think it's a little bulky to have separate override arguments for each container everywhere. I'm not sure there is a better solution, though. We could have overrides like we do for tools, e.g.,
```
++container_overrides.sandbox = "..."
++container_overrides.judge = "..."
```
But then the choice of key is unclear, since our "job components" (judge, main, sandbox, ...) don't map cleanly to a container name (e.g., "judge" -> containers[judge_server_type], main -> containers["nemo-skills"], sandbox -> containers["sandbox"]).
So I think that with the current structure, what you've done is the best choice, but maybe we can eventually work toward something a little more general here.
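To make the key-mapping problem concrete, here is a minimal sketch of what a more general resolver might look like. `resolve_container` and its parameters are hypothetical and not part of this PR; only the component-to-key mapping (main -> `containers["nemo-skills"]`, sandbox -> `containers["sandbox"]`, judge -> `containers[judge_server_type]`) comes from the comment above.

```python
from typing import Optional


def resolve_container(
    cluster_config: dict,
    component: str,
    override: Optional[str] = None,
    judge_server_type: Optional[str] = None,
) -> str:
    """Pick the container image for a job component (hypothetical helper)."""
    if override is not None:
        return override  # an explicit user override always wins
    containers = cluster_config["containers"]
    if component == "main":
        return containers["nemo-skills"]
    if component == "sandbox":
        return containers["sandbox"]
    if component == "judge":
        # The judge has no fixed key; it depends on the server type.
        if judge_server_type is None:
            raise ValueError("judge_server_type is required for the judge component")
        return containers[judge_server_type]
    raise ValueError(f"Unknown job component: {component!r}")
```

The judge branch shows why a flat `container_overrides.<key>` scheme is awkward: the lookup key is only known once the judge server type is.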
Summary by CodeRabbit
- Added `--account` option across pipeline commands to specify custom Slurm accounts for job submission.
- Added container override options (`--main-container`, `--sandbox-container`, `--judge-container`, `--judge-server-container`) for flexible container image selection.