New slurm customization parameters (account, containers) by Kipok · Pull Request #1209 · NVIDIA-NeMo/Skills

Kipok · 2026-02-03T22:04:19Z

Summary by CodeRabbit

New Features
- Added --account option across pipeline commands to specify custom Slurm accounts for job submission.
- Added container override options (--main-container, --sandbox-container, --judge-container, --judge-server-container) for flexible container image selection.
- Enhanced job submission flexibility with non-default resource configurations across convert, eval, generate, run_cmd, start_server, and nemo_evaluator commands.

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{No files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

coderabbitai · 2026-02-03T22:09:56Z

📝 Walkthrough

Walkthrough

This PR extends the nemo_skills pipeline infrastructure to support Slurm account specification and container image overrides across multiple CLI commands and task creation paths. New optional parameters are added to convert, generate, eval, run_cmd, and start_server commands, with values threaded through to task submission and underlying HardwareConfig and executor infrastructure.

Changes

Cohort / File(s)	Summary
CLI Command Extensions `nemo_skills/pipeline/convert.py`, `nemo_skills/pipeline/run_cmd.py`	Added account and container CLI options that are propagated to task submission via add_task calls. Container selection uses provided override or falls back to default mapping.
Generate Pipeline Updates `nemo_skills/pipeline/generate.py`	Extended _create_job_unified to accept account, main_container, and sandbox_container; updated Command construction to use container overrides with fallback to cluster_config defaults; propagated these parameters through multi-model job creation.
Eval Pipeline Updates `nemo_skills/pipeline/eval.py`	Added account, judge_container, main_container, sandbox_container, and judge_server_container parameters throughout eval entrypoint and helper functions (_create_comet_judge_tasks, _create_nvembed_judge_tasks, _create_llm_judge_tasks); container selection uses overrides with fallback to vllm/main defaults.
Server and Evaluator Commands `nemo_skills/pipeline/start_server.py`, `nemo_skills/pipeline/nemo_evaluator.py`	Added account parameter to start_server CLI; introduced account field to _TaskCreationContext in nemo_evaluator with propagation through _hardware_for_group calls; main_container and sandbox_container overrides in start_server with fallback to cluster defaults.
Core Infrastructure `nemo_skills/pipeline/utils/declarative.py`, `nemo_skills/pipeline/utils/exp.py`	Added account field to HardwareConfig dataclass; extended get_executor and add_task function signatures to accept account parameter with fallback to cluster_config["account"]; threaded account through executor creation for server, main, and sandbox tasks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Use run.Script for generate pipeline #1052: Modifies generate.py's _create_job_unified and HardwareConfig in declarative.py, directly overlapping with container and configuration wiring changes in this PR.
Revert "Use run.Script for generate pipeline (#1052)" #1125: Changes generate.py and nemo_evaluator/declarative infrastructure, with potential conflicts in HardwareConfig and task creation interfaces.
Evaluation on OJBench #848: Updates the same pipeline entrypoints and task-creation APIs (eval, generate, run_cmd, start_server), adding complementary CLI parameters and add_task signature extensions.

Suggested reviewers

activatedgeek

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change across all affected files: adding new Slurm customization parameters (account and container overrides) to the pipeline commands.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch igitman/account-arg

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

✨ 136.113.208.247/32 (new)
34.170.211.100/32
35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

nemo_skills/pipeline/nemo_evaluator.py (1)

560-590: ⚠️ Potential issue | 🟡 Minor

Don’t silently ignore exclusive.

The parameter is accepted and threaded through, but never applied. Either honor it via sbatch_kwargs or fail fast when it’s set so users don’t think they’re getting exclusive nodes.
Suggested fail-fast guard
 def _hardware_for_group(
     partition: Optional[str],
     account: Optional[str],
     num_gpus: Optional[int],
     num_nodes: int,
     qos: Optional[str],
     exclusive: bool,
 ) -> HardwareConfig:
+    if exclusive:
+        raise ValueError("exclusive is not supported for nemo_evaluator jobs yet; remove --exclusive.")
     return HardwareConfig(
         partition=partition,
         account=account,
         num_gpus=num_gpus,
         num_nodes=num_nodes,
As per coding guidelines, avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided.

nemo_skills/pipeline/eval.py (1)

816-866: ⚠️ Potential issue | 🟠 Major

Account override is missing for summarize/compute-score tasks.
When a user specifies --account, these Slurm tasks still run under the default account and can fail on clusters without a default. Please propagate account=account in both add_task calls.

🔧 Proposed fix

                 summarize_task = pipeline_utils.add_task(
                     exp,
                     cmd=command,
                     task_name=f"{expname}-{benchmark}-summarize-results",
                     log_dir=f"{output_dir}/{benchmark_args.eval_subfolder}/summarized-results",
                     container=cluster_config["containers"]["nemo-skills"],
                     cluster_config=cluster_config,
+                    account=account,
                     run_after=run_after,
                     reuse_code_exp=reuse_code_exp,
                     reuse_code=reuse_code,
                     task_dependencies=(
                         dependent_tasks if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
                     ),
                     installation_command=installation_command,
                     skip_hf_home_check=skip_hf_home_check,
                     sbatch_kwargs=sbatch_kwargs,
                 )
@@
                 score_task = pipeline_utils.add_task(
                     exp,
                     cmd=command,
                     task_name=f"{expname}-{group}-compute-score",
                     log_dir=f"{output_dir}/eval-results/{group}/compute-score-logs",
                     container=cluster_config["containers"]["nemo-skills"],
                     cluster_config=cluster_config,
+                    account=account,
                     run_after=run_after,
                     reuse_code_exp=reuse_code_exp,
                     reuse_code=reuse_code,
                     task_dependencies=(
                         group_tasks[group] if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
                     ),
                     installation_command=installation_command,
                     skip_hf_home_check=skip_hf_home_check,
                     sbatch_kwargs=sbatch_kwargs,
                 )

As per coding guidelines, Avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided. Use dataclasses or **kwargs syntax to handle this automatically.

gwarmstrong

In general looks good. Have a minor comment about goals for the future with this, but I don't think it requires action.

gwarmstrong · 2026-02-03T22:10:16Z

nemo_skills/pipeline/eval.py

+    main_container: str = typer.Option(None, help="Override container image for the main evaluation client"),
+    sandbox_container: str = typer.Option(None, help="Override container image for the sandbox"),
+    judge_container: str = typer.Option(None, help="Override container image for GPU-based judges (comet, nvembed)"),


I think it's a little bulky to have separate override arguments for each container everywhere. Not sure that there is a better solution though. If we wanted to have overrides like we do for tools, e.g.,

++container_overrides.sandbox = "..." ++container_overrides.judge = "..." `` But then the choice of key is unclear--since our "job components", e.g., Judge, main, sandbox, ... don't map cleanly to a container name (e.g., "judge" -> containers[judge_server_type], main -> containers["nemo-skills"], sandbox -> containers["sandbox"]). So I think with the current structure, what you've done the best choice, but maybe we can eventually work toward something a little more general here.

Kipok added 2 commits February 3, 2026 11:52

Add an option to specify non-default slurm account

b1c2066

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add overrides for containers

5a89b68

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps bot reviewed Feb 3, 2026

View reviewed changes

coderabbitai bot reviewed Feb 3, 2026

View reviewed changes

gwarmstrong approved these changes Feb 3, 2026

View reviewed changes

Kipok added the run GPU tests label Feb 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

New slurm customization parameters (account, containers)#1209

New slurm customization parameters (account, containers)#1209
Kipok wants to merge 2 commits intomainfrom
igitman/account-arg

Kipok commented Feb 3, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

coderabbitai bot commented Feb 3, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

gwarmstrong left a comment

Uh oh!

gwarmstrong Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

Kipok commented Feb 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Feb 3, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

gwarmstrong left a comment

Choose a reason for hiding this comment

Uh oh!

gwarmstrong Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Kipok commented Feb 3, 2026 •

edited by coderabbitai bot

Loading