Skip to content

Comments

New slurm customization parameters (account, containers)#1209

Open
Kipok wants to merge 2 commits intomainfrom
igitman/account-arg
Open

New slurm customization parameters (account, containers)#1209
Kipok wants to merge 2 commits intomainfrom
igitman/account-arg

Conversation

@Kipok
Copy link
Collaborator

@Kipok Kipok commented Feb 3, 2026

Summary by CodeRabbit

  • New Features
    • Added --account option across pipeline commands to specify custom Slurm accounts for job submission.
    • Added container override options (--main-container, --sandbox-container, --judge-container, --judge-server-container) for flexible container image selection.
    • Enhanced job submission flexibility with non-default resource configurations across convert, eval, generate, run_cmd, start_server, and nemo_evaluator commands.

Kipok added 2 commits February 3, 2026 11:52
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

This PR extends the nemo_skills pipeline infrastructure to support Slurm account specification and container image overrides across multiple CLI commands and task creation paths. New optional parameters are added to convert, generate, eval, run_cmd, and start_server commands, with values threaded through to task submission and underlying HardwareConfig and executor infrastructure.

Changes

Cohort / File(s) Summary
CLI Command Extensions
nemo_skills/pipeline/convert.py, nemo_skills/pipeline/run_cmd.py
Added account and container CLI options that are propagated to task submission via add_task calls. Container selection uses provided override or falls back to default mapping.
Generate Pipeline Updates
nemo_skills/pipeline/generate.py
Extended _create_job_unified to accept account, main_container, and sandbox_container; updated Command construction to use container overrides with fallback to cluster_config defaults; propagated these parameters through multi-model job creation.
Eval Pipeline Updates
nemo_skills/pipeline/eval.py
Added account, judge_container, main_container, sandbox_container, and judge_server_container parameters throughout eval entrypoint and helper functions (_create_comet_judge_tasks, _create_nvembed_judge_tasks, _create_llm_judge_tasks); container selection uses overrides with fallback to vllm/main defaults.
Server and Evaluator Commands
nemo_skills/pipeline/start_server.py, nemo_skills/pipeline/nemo_evaluator.py
Added account parameter to start_server CLI; introduced account field to _TaskCreationContext in nemo_evaluator with propagation through _hardware_for_group calls; main_container and sandbox_container overrides in start_server with fallback to cluster defaults.
Core Infrastructure
nemo_skills/pipeline/utils/declarative.py, nemo_skills/pipeline/utils/exp.py
Added account field to HardwareConfig dataclass; extended get_executor and add_task function signatures to accept account parameter with fallback to cluster_config["account"]; threaded account through executor creation for server, main, and sandbox tasks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • activatedgeek
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change across all affected files: adding new Slurm customization parameters (account and container overrides) to the pipeline commands.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch igitman/account-arg

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/pipeline/nemo_evaluator.py (1)

560-590: ⚠️ Potential issue | 🟡 Minor

Don’t silently ignore exclusive.

The parameter is accepted and threaded through, but never applied. Either honor it via sbatch_kwargs or fail fast when it’s set so users don’t think they’re getting exclusive nodes.

Suggested fail-fast guard
 def _hardware_for_group(
     partition: Optional[str],
     account: Optional[str],
     num_gpus: Optional[int],
     num_nodes: int,
     qos: Optional[str],
     exclusive: bool,
 ) -> HardwareConfig:
+    if exclusive:
+        raise ValueError("exclusive is not supported for nemo_evaluator jobs yet; remove --exclusive.")
     return HardwareConfig(
         partition=partition,
         account=account,
         num_gpus=num_gpus,
         num_nodes=num_nodes,

As per coding guidelines, avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided.

nemo_skills/pipeline/eval.py (1)

816-866: ⚠️ Potential issue | 🟠 Major

Account override is missing for summarize/compute-score tasks.
When a user specifies --account, these Slurm tasks still run under the default account and can fail on clusters without a default. Please propagate account=account in both add_task calls.

🔧 Proposed fix
                 summarize_task = pipeline_utils.add_task(
                     exp,
                     cmd=command,
                     task_name=f"{expname}-{benchmark}-summarize-results",
                     log_dir=f"{output_dir}/{benchmark_args.eval_subfolder}/summarized-results",
                     container=cluster_config["containers"]["nemo-skills"],
                     cluster_config=cluster_config,
+                    account=account,
                     run_after=run_after,
                     reuse_code_exp=reuse_code_exp,
                     reuse_code=reuse_code,
                     task_dependencies=(
                         dependent_tasks if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
                     ),
                     installation_command=installation_command,
                     skip_hf_home_check=skip_hf_home_check,
                     sbatch_kwargs=sbatch_kwargs,
                 )
@@
                 score_task = pipeline_utils.add_task(
                     exp,
                     cmd=command,
                     task_name=f"{expname}-{group}-compute-score",
                     log_dir=f"{output_dir}/eval-results/{group}/compute-score-logs",
                     container=cluster_config["containers"]["nemo-skills"],
                     cluster_config=cluster_config,
+                    account=account,
                     run_after=run_after,
                     reuse_code_exp=reuse_code_exp,
                     reuse_code=reuse_code,
                     task_dependencies=(
                         group_tasks[group] if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
                     ),
                     installation_command=installation_command,
                     skip_hf_home_check=skip_hf_home_check,
                     sbatch_kwargs=sbatch_kwargs,
                 )

As per coding guidelines, Avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided. Use dataclasses or **kwargs syntax to handle this automatically.

Copy link
Collaborator

@gwarmstrong gwarmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good. Have a minor comment about goals for the future with this, but I don't think it requires action.

Comment on lines +367 to +369
main_container: str = typer.Option(None, help="Override container image for the main evaluation client"),
sandbox_container: str = typer.Option(None, help="Override container image for the sandbox"),
judge_container: str = typer.Option(None, help="Override container image for GPU-based judges (comet, nvembed)"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a little bulky to have separate override arguments for each container everywhere. Not sure that there is a better solution though. If we wanted to have overrides like we do for tools, e.g.,

++container_overrides.sandbox = "..."
++container_overrides.judge = "..."
``
But then the choice of key is unclear--since our "job components", e.g., Judge, main, sandbox, ... don't map cleanly to a container name (e.g., "judge" -> containers[judge_server_type], main -> containers["nemo-skills"], sandbox -> containers["sandbox"]).

So I think with the current structure, what you've done the best choice, but maybe we can eventually work toward something a little more general here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants