This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes.
The following sections explain how to use Slurm for batched job submissions and for running jobs interactively.
```sh
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
COMMAND="uv run ./examples/run_grpo_math.py" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub
```

Notes:
- Some clusters may or may not need `--gres=gpu:8` to be added to the `sbatch` command.
This command prints the `SLURM_JOB_ID`:

```text
Submitted batch job 1980204
```
Make note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can tail:

```sh
tail -f 1980204-logs/ray-driver.log
```

:::{tip}
A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the Slurm job queue. This means that during debugging sessions, you can avoid submitting a new sbatch command each time. Instead, you can debug and re-submit your NeMo RL job directly from the interactive session.
:::
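The submit-and-tail workflow above can also be scripted. The following is a minimal sketch that parses the job ID from `sbatch`'s default output; the job ID `1980204` and the `<job-id>-logs` directory layout follow the example above, and the `OUTPUT` variable stands in for a real `sbatch` invocation:

```shell
# Example sbatch output, captured here as a literal for illustration.
OUTPUT="Submitted batch job 1980204"

# Take the last whitespace-separated field (the job ID).
JOB_ID="${OUTPUT##* }"

# Derive the driver log path using the <job-id>-logs convention shown above.
LOG="${JOB_ID}-logs/ray-driver.log"

echo "$JOB_ID"   # 1980204
echo "$LOG"      # 1980204-logs/ray-driver.log
```

In practice, `sbatch --parsable` prints only the job ID, so the parsing step can be skipped entirely.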
To run interactively, launch the same command as in the batched job submission above, but omit the `COMMAND` line:
```sh
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub
```

This command prints the `SLURM_JOB_ID`:

```text
Submitted batch job 1980204
```
Once the Ray cluster is up, an attach script is created; run it to connect to the Ray head node, from which you can launch experiments:

```sh
bash 1980204-attach.sh
```

Now that you are on the head node, you can launch the command as follows:
```sh
uv run ./examples/run_grpo_math.py
```

There are several choices for `UV_CACHE_DIR` when using `ray.sub`:

- (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified in the shell environment, and is mounted to the head and worker nodes to serve as a persistent cache between runs.
- Use the warm uv cache from our docker images:

  ```sh
  ... \
  UV_CACHE_DIR=/home/ray/.cache/uv \
  sbatch ... \
      ray.sub
  ```
The first option is generally more efficient, since the cache is not ephemeral and persists from run to run. Users who don't want to persist the cache can use the second option, which is just as performant as the first when the dependencies in `uv.lock` are covered by the warmed cache.
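The default fallback described above can be sketched in shell as follows. This is an illustration of the documented behavior, not the actual `ray.sub` source; the `unset` line is only there to simulate a shell where `UV_CACHE_DIR` was never exported:

```shell
# Demo only: simulate a shell environment where UV_CACHE_DIR is not set.
unset UV_CACHE_DIR

# Slurm exports SLURM_SUBMIT_DIR inside batch jobs; fall back to $PWD for this demo.
SLURM_SUBMIT_DIR="${SLURM_SUBMIT_DIR:-$PWD}"

# When UV_CACHE_DIR is unset, default to a persistent cache under the submit dir.
UV_CACHE_DIR="${UV_CACHE_DIR:-$SLURM_SUBMIT_DIR/uv_cache}"

echo "$UV_CACHE_DIR"
```

Because the `:-` expansion only applies when the variable is unset or empty, exporting `UV_CACHE_DIR` before `sbatch` (as in the second option) takes precedence over the default.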
TBD