Upgrade slurm-operator fork to Slinky v1.0 with Together-specific features #15
Merged
jhu-svg merged 7 commits into release-1.0 on Jan 30, 2026
Match the shmSize and existingDataClaims handling that was added to nodeset-cr.yaml for consistency. This allows login pods to:
- Have configurable shared memory (/dev/shm) size
- Mount existing PVCs for storage access (/data, etc.)
Summary
Rebase our slurm-operator fork onto upstream Slinky v1.0 (v1beta1 APIs).
Re-introduce the Together-specific behavior from PR #13 (Changes Compiled), adapted to the new 1.0 architecture.
Key changes
Add slinky.slurm.net/node-cordon annotation constant.
Implement updateNodeCordonAnnotation / isNodeCordoned logic in nodeset controller so K8s Nodes are marked when Slurm nodes are drained/undrained.
Keep using upstream pod‑level annotations (AnnotationPodCordon, AnnotationPodDeletionCost) from v1.0.
Add login section to helm/slurm/values.yaml (image, resources, nodeSelector, affinity, tolerations, extra volumes).
Add login helpers in _slurm.tpl.
Add helm/slurm/templates/login/login-deployment.yaml and login-service.yaml to deploy the login pod.
Extend Slurm chart values to support:
  Additional tolerations for controller/accounting/restapi where needed.
  shmSize and persistence.existingDataClaims for compute NodeSets (e.g. /data, /scratch).
Wire these values into the corresponding templates.
Add .Values.operator.tolerations and .Values.operator.affinity to the operator Deployment.
Ensure RBAC grants the update verb on the resources the new behavior touches (e.g. nodes).
Set module path to github.com/togethercomputer/slurm-operator and adjust imports so tcloud can depend on this fork cleanly.
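For reviewers, the node cordon annotation behavior listed under Key changes can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: it works on a plain annotations map instead of a Kubernetes Node object, the function signatures are hypothetical, and the choice of "true" as the annotation value (and of removing the key on undrain) is an assumption.

```go
package main

import "fmt"

// AnnotationNodeCordon is the annotation key named in this PR.
const AnnotationNodeCordon = "slinky.slurm.net/node-cordon"

// isNodeCordoned reports whether the annotations mark the node as cordoned.
// Assumption: only the value "true" counts as cordoned.
func isNodeCordoned(annotations map[string]string) bool {
	return annotations[AnnotationNodeCordon] == "true"
}

// updateNodeCordonAnnotation returns annotations that reflect the Slurm drain
// state, plus a flag saying whether anything changed. Returning the flag lets
// a caller skip the API update when the node is already in the desired state.
// Hypothetical signature: the real controller presumably works on a Node object.
func updateNodeCordonAnnotation(annotations map[string]string, drained bool) (map[string]string, bool) {
	if isNodeCordoned(annotations) == drained {
		return annotations, false
	}
	// Copy so the caller's map is never mutated.
	updated := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		updated[k] = v
	}
	if drained {
		updated[AnnotationNodeCordon] = "true"
	} else {
		delete(updated, AnnotationNodeCordon)
	}
	return updated, true
}

func main() {
	annotations := map[string]string{"example.com/owner": "team-a"}

	// Slurm node drained: annotation is added.
	annotations, changed := updateNodeCordonAnnotation(annotations, true)
	fmt.Println(isNodeCordoned(annotations), changed)

	// Slurm node undrained: annotation is removed.
	annotations, changed = updateNodeCordonAnnotation(annotations, false)
	fmt.Println(isNodeCordoned(annotations), changed)

	// Already undrained: nothing to do.
	_, changed = updateNodeCordonAnnotation(annotations, false)
	fmt.Println(changed)
}
```

The no-op check at the top is the main design point: a reconcile loop that runs repeatedly should only issue a Node update (which needs the RBAC update permission mentioned above) when the desired state differs from the current one.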
Verify
Images in https://github.com/orgs/togethercomputer/packages/container/package/slurm-operator