Summary
I'd like to propose adding Kubernetes-native deployment and monitoring support for CoreWeave SUNK (Slurm on Kubernetes) environments. Since SUNK exposes the standard Slurm interface on top of Kubernetes, GCM already works with SUNK clusters out of the box for most collectors. This proposal focuses on closing the gaps and adding Kubernetes-aware monitoring.
Motivation
SUNK is gaining traction as a deployment model for GPU clusters, especially in cloud-native environments. While GCM's existing collectors work with SUNK (same Slurm CLI/REST interface), there are opportunities to:
- Enrich Slurm data with Kubernetes metadata: correlate Slurm job IDs with K8s pod status, resource requests/limits, and node conditions
- Deploy GCM natively in K8s: run as a sidecar or DaemonSet alongside SUNK components rather than requiring separate host-level installation
- Adapt health checks for containerized environments: checks that depend on host access (dmesg, syslogs) need container-aware alternatives
Proposed Scope
1. Kubernetes metrics collector (`k8s_pod_monitor.py`)
   - [...] (`SinkImpl` protocol, `run_data_collection_loop`)
2. Helm chart adaptation for SUNK deployment
3. Container-aware health checks
4. Documentation
   - [...] `values.yaml` for SUNK integration
Compatibility
Since SUNK exposes the standard Slurm interface, all existing collectors remain fully compatible:

| Component | Compatible | Notes |
| --- | --- | --- |
| All collectors (squeue, sinfo, sshare, sprio, etc.) | Yes | Identical Slurm output |
| Health checks | Partial | Host-level checks need adaptation |
| Exporters/Sinks | Yes | Fully scheduler-agnostic |

Testing
I have access to my own lab environment where I can validate the implementation end-to-end on a SUNK cluster. All new code will include unit tests following the existing patterns.
Timeline
I estimate approximately 4 weeks to deliver a complete implementation, as I have a trip scheduled in the middle of the development period. Happy to break this into smaller PRs if preferred.
Questions for Maintainers
- Does this direction align with the project's vision for "Support for additional schedulers beyond Slurm" from the roadmap?
- Would you prefer this as a single PR or broken into smaller incremental PRs?
- Any specific Kubernetes metrics or health checks you'd like to see prioritized?
Looking forward to your feedback before starting implementation.
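To illustrate the enrichment idea from the Motivation (correlating Slurm jobs with K8s pod status), here is a minimal sketch of the join step only. It assumes each Slurm node name equals the name of the pod backing it, which is an assumption for illustration rather than confirmed SUNK behavior. `SlurmJob`, `PodInfo`, `EnrichedJob`, and `enrich_jobs` are hypothetical names, not existing GCM or SUNK APIs, and the input records are assumed to have been collected already (for example from squeue output and the Kubernetes API).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class SlurmJob:
    """Hypothetical record for one running Slurm job."""
    job_id: str
    nodes: List[str]  # Slurm node names the job is allocated on


@dataclass
class PodInfo:
    """Hypothetical record for one Kubernetes pod."""
    name: str
    phase: str  # Kubernetes pod phase, e.g. "Running" or "Pending"
    k8s_node: str  # Kubernetes node the pod is scheduled on


@dataclass
class EnrichedJob:
    """One row per (job, Slurm node), with pod metadata when a pod matched."""
    job_id: str
    slurm_node: str
    pod_phase: Optional[str]
    k8s_node: Optional[str]


def enrich_jobs(jobs: Iterable[SlurmJob], pods: Iterable[PodInfo]) -> List[EnrichedJob]:
    # ASSUMPTION: a Slurm node name is identical to the backing pod's name.
    # A real cluster may need a different mapping (labels, annotations, etc.).
    pods_by_name = {pod.name: pod for pod in pods}
    rows: List[EnrichedJob] = []
    for job in jobs:
        for node in job.nodes:
            pod = pods_by_name.get(node)
            rows.append(
                EnrichedJob(
                    job_id=job.job_id,
                    slurm_node=node,
                    # Leave pod fields empty when no pod matches this node.
                    pod_phase=pod.phase if pod else None,
                    k8s_node=pod.k8s_node if pod else None,
                )
            )
    return rows


if __name__ == "__main__":
    demo_jobs = [SlurmJob(job_id="101", nodes=["node-0", "node-1"])]
    demo_pods = [PodInfo(name="node-0", phase="Running", k8s_node="host-a")]
    for row in enrich_jobs(demo_jobs, demo_pods):
        print(row)
```

Emitting one row per (job, node) pair keeps unmatched nodes visible with empty pod fields instead of silently dropping them, which seems useful for spotting mapping gaps.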