This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes.
The following sections explain how to use Slurm for batched job submissions and for running jobs interactively.
```sh
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
COMMAND="uv run ./examples/run_grpo_math.py" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub
```

Notes:
- Some clusters may or may not need `--gres=gpu:8` to be added to the `sbatch` command.
This command prints the `SLURM_JOB_ID`:

```text
Submitted batch job 1980204
```
Make note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can tail:

```sh
tail -f 1980204-logs/ray-driver.log
```

:::{tip}
A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the Slurm job queue. This means that during debugging sessions, you can avoid submitting a new sbatch command each time. Instead, you can debug and re-submit your NeMo RL job directly from the interactive session.
:::
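The submit-and-tail workflow above can also be scripted. The following is a minimal sketch that parses the job ID from `sbatch`'s default output; the job ID `1980204` and the `<job-id>-logs` directory layout follow the example above, and the `OUTPUT` variable stands in for a real `sbatch` invocation:

```shell
# Example sbatch output, captured here as a literal for illustration.
OUTPUT="Submitted batch job 1980204"

# Take the last whitespace-separated field (the job ID).
JOB_ID="${OUTPUT##* }"

# Derive the driver log path using the <job-id>-logs convention shown above.
LOG="${JOB_ID}-logs/ray-driver.log"

echo "$JOB_ID"   # 1980204
echo "$LOG"      # 1980204-logs/ray-driver.log
```

In practice, `sbatch --parsable` prints only the job ID, so the parsing step can be skipped entirely.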
To run interactively, launch the same command as in the batched job submission above, but omit the `COMMAND` line:
```sh
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub
```

This command prints the `SLURM_JOB_ID`:

```text
Submitted batch job 1980204
```
Once the Ray cluster is up, an attach script is created; run it to connect to the Ray head node, from which you can launch experiments:

```sh
bash 1980204-attach.sh
```

Now that you are on the head node, you can launch the command as follows:
```sh
uv run ./examples/run_grpo_math.py
```

There are several choices for `UV_CACHE_DIR` when using `ray.sub`:

- (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified in the shell environment, and is mounted to the head and worker nodes to serve as a persistent cache between runs.
- Use the warm uv cache from our docker images:

  ```sh
  ... \
  UV_CACHE_DIR=/home/ray/.cache/uv \
  sbatch ... \
      ray.sub
  ```
The first option is generally more efficient, since the cache is not ephemeral and persists from run to run. Users who don't want to persist the cache can use the second option, which is just as performant as the first when the dependencies in `uv.lock` are covered by the warmed cache.
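The default fallback described above can be sketched in shell as follows. This is an illustration of the documented behavior, not the actual `ray.sub` source; the `unset` line is only there to simulate a shell where `UV_CACHE_DIR` was never exported:

```shell
# Demo only: simulate a shell environment where UV_CACHE_DIR is not set.
unset UV_CACHE_DIR

# Slurm exports SLURM_SUBMIT_DIR inside batch jobs; fall back to $PWD for this demo.
SLURM_SUBMIT_DIR="${SLURM_SUBMIT_DIR:-$PWD}"

# When UV_CACHE_DIR is unset, default to a persistent cache under the submit dir.
UV_CACHE_DIR="${UV_CACHE_DIR:-$SLURM_SUBMIT_DIR/uv_cache}"

echo "$UV_CACHE_DIR"
```

Because the `:-` expansion only applies when the variable is unset or empty, exporting `UV_CACHE_DIR` before `sbatch` (as in the second option) takes precedence over the default.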
TBD