Kubernetes Operator for Slurm Clusters

Run Slurm on Kubernetes, by SchedMD. A Slinky project.

Overview

Slurm and Kubernetes are workload managers originally designed for different kinds of workloads. In broad strokes, Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, and it can scale its resource pool to meet demand. Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well-defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known in advance.

This project enables the best of both workload managers, unified on Kubernetes. It contains a Kubernetes operator to deploy and manage certain components of Slurm clusters. This repository implements custom controllers and custom resource definitions (CRDs) that manage the lifecycle (creation, upgrade, graceful shutdown) of Slurm clusters.

Slurm Operator Architecture

For additional architectural notes, see the architecture docs.

Slurm Cluster

Slurm clusters are very flexible and can be configured in various ways. Our Slurm Helm chart provides a reference implementation that is highly customizable and tries to expose everything Slurm has to offer.

Slurm Architecture

For additional information about Slurm, see the Slurm docs.

Features

NodeSets

A set of homogeneous Slurm nodes (compute nodes, workers), which are delegated to execute the Slurm workload.

The operator takes the running workload among Slurm nodes into consideration when it needs to scale in, upgrade, or otherwise handle node failures. Slurm nodes are marked as draining before their eventual termination, pending scale-in or upgrade.

The operator supports NodeSet scale to zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale to zero pairs well with NodeSets.

Slurm

Slurm is a full featured HPC workload manager. To highlight a few features:

  • Accounting: collect accounting information for every job and job step executed.
  • Partitions: job queues with sets of resources and constraints (e.g. job size limit, job time limit, users permitted).
  • Reservations: reserve resources for jobs being executed by select users and/or select accounts.
  • Job Dependencies: defer the start of jobs until the specified dependencies have been satisfied.
  • Job Containers: jobs which run an unprivileged OCI container bundle.
  • MPI: launch parallel MPI jobs, supports various MPI implementations.
  • Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
  • Preemption: stop one or more low-priority jobs to let a high-priority job run.
  • QoS: sets of policies affecting scheduling priority, preemption, and resource limits.
  • Fairshare: distribute resources equitably among users and accounts based on historical usage.
  • Node Health Check: periodically check node health via script.

Limitations

  • Kubernetes Version: >= v1.29
  • Slurm Version: >= 24.11
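The minimum versions above can be checked before installing. The following is a minimal sketch, not part of the project: the helper name `version_ge` is hypothetical, and it assumes GNU coreutils `sort -V` is available. The commented usage additionally assumes `jq` is installed and a cluster is reachable.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: compare a version string against a documented minimum.
# Assumption: GNU `sort -V` (version sort) is available on this machine.

# version_ge A B: succeeds (exit 0) when version A >= version B.
# A leading "v" (as in v1.29) is stripped from both arguments.
version_ge() {
  local have="${1#v}" want="${2#v}"
  # After a version sort, the smaller version comes first.
  # A >= B exactly when the smaller of the two is B.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example usage (needs a reachable cluster and jq, so it is left commented out):
#   k8s_version="$(kubectl version -o json | jq -r '.serverVersion.gitVersion')"
#   version_ge "$k8s_version" "v1.29" || echo "Kubernetes is older than v1.29" >&2
```

The same helper can be applied to a Slurm version string, for example `version_ge "24.11.0" "24.11"`.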

Installation

Install the slurm-operator:

helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --namespace=slinky --create-namespace

Install a Slurm cluster:

helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --namespace=slurm --create-namespace
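After both charts are installed, a quick look at the releases and pods can confirm that things came up. This is a sketch only: the function name `check_install` is hypothetical, and the release and namespace names simply mirror the install commands above.

```shell
#!/usr/bin/env bash
# Post-install check sketch. Namespaces match the helm install commands
# shown above (slinky for the operator, slurm for the cluster).

# check_install: list Helm releases and pods in both namespaces.
check_install() {
  helm list --namespace=slinky
  helm list --namespace=slurm
  kubectl get pods --namespace=slinky
  kubectl get pods --namespace=slurm
}

# Run it against the current kube context:
#   check_install
```

If the namespaces or release names were changed at install time, adjust the arguments accordingly.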

For additional instructions, see the quickstart guide.

Upgrades

0.X Releases

Breaking changes may be introduced into newer CRDs. To upgrade between these versions, uninstall all Slinky charts and delete the Slinky CRDs, then install the new release as usual.

helm --namespace=slurm uninstall slurm
helm --namespace=slinky uninstall slurm-operator
kubectl delete crd clusters.slinky.slurm.net
kubectl delete crd nodesets.slinky.slurm.net
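Before installing the new release, it may help to confirm that no Slinky CRDs are left behind. The following is a minimal sketch: the function name `remaining_slinky_crds` is hypothetical, and it assumes the CRDs all belong to the `slinky.slurm.net` group shown in the commands above.

```shell
#!/usr/bin/env bash
# Leftover-CRD check sketch.
# `kubectl get crd -o name` prints one line per CRD, in the form
# customresourcedefinition.apiextensions.k8s.io/<plural>.<group>

# remaining_slinky_crds: print any CRDs still registered in the
# slinky.slurm.net group; prints nothing when none remain.
remaining_slinky_crds() {
  # `|| true` keeps the function from failing when grep finds no match.
  kubectl get crd -o name | grep 'slinky\.slurm\.net$' || true
}

# Example:
#   if [ -n "$(remaining_slinky_crds)" ]; then
#     echo "Slinky CRDs still present" >&2
#   fi
```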

Documentation

Project documentation is located in the docs directory of this repository.

Slinky documentation can be found here.

License

Copyright (C) SchedMD LLC.

Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.