Description
The operator enables several workflows. Some steps are carried out by the user (creating daskcluster resources), some by the operator (creating pods and services), some are recursive (the operator creates worker groups, then creates pods for those worker groups), and others are handled by Kubernetes itself (deleting a daskcluster resource causes Kubernetes to cascade delete the worker groups, pods, services, etc.).
It would be good to document all of these workflows, something like this:
Installation
- User installs the new Dask cluster and worker group resource types
- User installs the operator daemon
Cluster creation
- User creates cluster resource
- Operator notices the cluster resource and creates the scheduler pod/service and a worker group resource
- Operator notices the worker group resource and creates worker pods
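
The creation steps above can be sketched as a chain of event handlers. This is only an in-memory model to show the recursion (creating a worker group triggers a second handler); `FakeCluster`, the handler names, the `workers` and `replicas` fields, and the pod naming scheme are all hypothetical and do not reflect the operator's real code or the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical)."""

    objects: list = field(default_factory=list)

    def create(self, kind, name, owner=None, **spec):
        obj = {"kind": kind, "name": name, "owner": owner, "spec": spec}
        self.objects.append(obj)
        # Creating an object is the event that the operator "notices".
        handler = HANDLERS.get(kind)
        if handler:
            handler(self, obj)
        return obj


def on_cluster_created(api, cluster):
    # The operator reacts to a new cluster resource.
    name = cluster["name"]
    api.create("Pod", f"{name}-scheduler", owner=name)
    api.create("Service", f"{name}-scheduler", owner=name)
    api.create(
        "WorkerGroup",
        f"{name}-default",
        owner=name,
        replicas=cluster["spec"]["workers"],
    )


def on_worker_group_created(api, group):
    # The operator reacts to the worker group it created a moment ago.
    for i in range(group["spec"]["replicas"]):
        api.create("Pod", f"{group['name']}-worker-{i}", owner=group["name"])


HANDLERS = {
    "Cluster": on_cluster_created,
    "WorkerGroup": on_worker_group_created,
}

api = FakeCluster()
api.create("Cluster", "demo", workers=2)  # the only step the user performs
created = [(o["kind"], o["name"]) for o in api.objects]
```

The user makes a single `create` call; everything else in `created` is produced by the two handlers.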
Cluster scaling
- User modifies cluster worker count
- Operator notices the change and starts/stops pods to match
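
As a rough illustration of the "starts/stops pods to match" step, the function below compares the running worker pods with the desired count and returns which pods to start or stop. The function name, the pod naming scheme, and the choice of which pods to stop (simply the last ones by name) are assumptions made for this sketch, not the operator's actual behaviour.

```python
def reconcile_workers(group, running, desired):
    """Return (to_start, to_stop) so that the worker count becomes `desired`.

    `running` is a list of existing worker pod names (hypothetical scheme).
    """
    running = sorted(running)
    if len(running) > desired:
        # Scale down: stop the surplus pods (picked by name order here).
        return [], running[desired:]

    # Scale up: pick the lowest unused indices for the new pods.
    existing = set(running)
    to_start = []
    i = 0
    while len(running) + len(to_start) < desired:
        name = f"{group}-worker-{i}"
        if name not in existing:
            to_start.append(name)
        i += 1
    return to_start, []


scale_up = reconcile_workers("demo", ["demo-worker-0", "demo-worker-1"], 4)
scale_down = reconcile_workers(
    "demo", ["demo-worker-0", "demo-worker-1", "demo-worker-2"], 1
)
```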
Cluster adaptive mode
- User toggles adaptive setting on the cluster resource
- Operator begins polling the scheduler for the desired number of workers and adjusts the worker count on the cluster resource to match (triggering the scaling workflow whenever the count changes)
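
One polling step of the adaptive workflow could look roughly like this. The cluster is modelled as a plain dict, and `desired_workers` is a hypothetical callable standing in for whatever query the operator sends to the scheduler; the `adaptive` and `workers` keys are likewise assumptions.

```python
def adapt_once(cluster, desired_workers):
    """Run one polling step; return True if the worker count was changed."""
    if not cluster.get("adaptive"):
        return False  # adaptive mode is off, leave the count alone
    desired = desired_workers()
    if desired == cluster["workers"]:
        return False  # nothing to do, so no scaling workflow is triggered
    # Updating the resource is what would trigger the scaling workflow.
    cluster["workers"] = desired
    return True


cluster = {"adaptive": True, "workers": 2}
answers = iter([2, 5, 5, 1])  # simulated scheduler responses
changes = [adapt_once(cluster, lambda: next(answers)) for _ in range(4)]
```

Only the polls that report a different count write to the cluster resource, which keeps the scaling workflow from firing on every poll.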
Cluster deletion
- User deletes cluster resource
- Kubernetes cascade deletes all child resources including worker groups, pods and services
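
The cascade can be pictured with a small ownership map. This is a simplified model, assuming each object has at most one owner identified by name; the object names are made up for the example.

```python
def cascade_delete(objects, name):
    """Delete `name` and everything it transitively owns.

    `objects` maps an object name to its owner's name (or None).
    Returns the objects that survive.
    """
    remaining = {k: v for k, v in objects.items() if k != name}
    while True:
        # An object is orphaned once its owner no longer exists.
        orphans = [
            k
            for k, owner in remaining.items()
            if owner is not None and owner not in remaining
        ]
        if not orphans:
            return remaining
        for k in orphans:
            del remaining[k]


objects = {
    "demo": None,  # the cluster resource
    "demo-scheduler-pod": "demo",
    "demo-scheduler-svc": "demo",
    "demo-default": "demo",  # the worker group
    "demo-default-worker-0": "demo-default",
    "other": None,  # an unrelated resource
}
survivors = cascade_delete(objects, "demo")
```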
Create additional worker groups for heterogeneous clusters
- User creates a new worker group resource (with resources that differ from the default, such as GPUs or high memory)
- Operator notices the new worker group and creates pods
- Operator adopts the worker group resource into the cluster resource so that it will also be cascade deleted
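
Adoption in Kubernetes is expressed through `metadata.ownerReferences` on the child object. The sketch below builds such a reference on plain dicts. The API group/version `kubernetes.dask.org/v1`, the kinds `DaskCluster` and `DaskWorkerGroup`, and the `uid` value are placeholders assumed for illustration; the `adopt` helper is hypothetical and not the operator's own function.

```python
def adopt(child, owner):
    """Add an owner reference to `child` pointing at `owner` (idempotent)."""
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }
    refs = child["metadata"].setdefault("ownerReferences", [])
    # Avoid duplicate references if the handler runs more than once.
    if not any(r["uid"] == ref["uid"] for r in refs):
        refs.append(ref)
    return child


cluster = {
    "apiVersion": "kubernetes.dask.org/v1",  # assumed group/version
    "kind": "DaskCluster",  # assumed kind
    "metadata": {"name": "demo", "uid": "placeholder-uid"},
}
gpu_group = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskWorkerGroup",  # assumed kind
    "metadata": {"name": "demo-gpu"},
}
adopt(gpu_group, cluster)
adopt(gpu_group, cluster)  # a second call leaves the references unchanged
```

Once the reference is in place, deleting the cluster lets the Kubernetes garbage collector remove the user-created worker group along with the default one.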
This could also be a fun time to try out Mermaid diagrams.