# Dask Kubernetes Summer Roadmap
> **Note:** Creating this in rapidsai/deployment so I can use tasklists. When tasklists are GA I'll migrate this issue to the dask/dask-kubernetes repo.
At the end of the summer I want to release V2 of the Dask Kubernetes Operator and fully remove the deprecated classic implementations. This issue outlines the roadmap that we need to complete to get us to a point where we can do that.
High-level goals:

- Improve stability
- Ensure feature completeness compared to other implementations
Some of the sections here may want to be split off into separate issues, and some tasks may want to be broken down into smaller chunks. But this will be the high-level milestone tracker issue for this work.
## Features

### Cluster idle timeout
Cleaning up idle clusters automatically becomes critical for cost reduction when deploying at scale, especially when using GPUs.
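As a rough sketch of what the controller side of this could look like, the decision logic boils down to comparing the cluster's last scheduler activity against a configured timeout. The function and field names here are illustrative, not the operator's actual API:

```python
from datetime import datetime, timedelta

# Illustrative helper, not the operator's actual API: decide whether a
# cluster has been idle long enough for the controller to reap it.
def should_reap(last_activity: datetime, now: datetime, idle_timeout: timedelta) -> bool:
    """Return True when the cluster has been idle for at least idle_timeout."""
    return now - last_activity >= idle_timeout

now = datetime(2023, 6, 1, 12, 0, 0)
print(should_reap(now - timedelta(hours=2), now, timedelta(hours=1)))    # True: idle past the timeout
print(should_reap(now - timedelta(minutes=10), now, timedelta(hours=1)))  # False: recent activity
```

The controller would run this check on a timer and, when it returns `True`, delete the `DaskCluster` resource so all child resources are cleaned up with it.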
### Full Istio support

Currently we have partial Istio support: the scheduler uses it but workers do not. This can be a blocker on clusters that enforce Istio for all comms.
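One possible shape for closing this gap, assuming the standard Istio sidecar-injection annotation; the helper function is hypothetical and the image name is only an example:

```python
# Hypothetical helper: make sure a worker pod template carries the standard
# Istio sidecar-injection annotation so workers join the mesh like the scheduler.
def with_istio_sidecar(pod_template: dict) -> dict:
    metadata = pod_template.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations["sidecar.istio.io/inject"] = "true"
    return pod_template

worker = with_istio_sidecar(
    {"spec": {"containers": [{"name": "worker", "image": "ghcr.io/dask/dask:latest"}]}}
)
print(worker["metadata"]["annotations"])
```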
## UX Improvements

UX can always be improved.

## Fixes

### Replace Pod resources with higher abstractions like Deployment or at least ReplicaSet
Currently we manage bare Pods. This has downsides, such as Pods not being recreated when they are evicted from a node. It would be good to explore higher-level resources and how they could simplify our controller logic.
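To illustrate the idea, the existing worker pod spec could be wrapped in a Deployment manifest so the ReplicaSet controller recreates evicted workers for us. The helper and label keys below are a sketch, not the operator's actual implementation:

```python
# Illustrative sketch: wrap a worker pod spec in a Deployment manifest so
# Kubernetes itself replaces workers that are evicted from a node.
def worker_deployment(cluster_name: str, pod_spec: dict, replicas: int) -> dict:
    # Label keys are examples only; the real operator may use different ones.
    labels = {"dask.org/cluster-name": cluster_name, "dask.org/component": "worker"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{cluster_name}-worker", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }

d = worker_deployment("demo", {"containers": [{"name": "worker", "image": "ghcr.io/dask/dask:latest"}]}, 3)
print(d["kind"], d["spec"]["replicas"])  # Deployment 3
```

Scaling a worker group would then become a patch to `spec.replicas` rather than creating and deleting individual Pods.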
### Ensure patches to DaskCluster and DaskWorkerGroup are propagated to child resources

In CRUD terms we only have create, read and delete implemented for our resources; we also need to correctly handle updating them.

### Ensure scaling/autoscaling is solid

Some users are reporting unwanted behaviour when autoscaling at scale. This needs to be solid.

### Input sanitisation

Currently we rely on the Kubernetes API to reject bad configuration, but this doesn't always happen as we expect. We should do more checking and sanitisation before calling the Kubernetes API.
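A small example of the kind of client-side check this could mean, validating cluster names against the RFC 1123 DNS label rules that Kubernetes enforces for most resource names; the function name is illustrative:

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', must start and end
# with an alphanumeric, at most 63 characters. Kubernetes enforces this for
# most resource names; checking client-side gives users clearer errors up front.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")

def validate_cluster_name(name: str) -> None:
    """Raise ValueError before the manifest ever reaches the Kubernetes API."""
    if not DNS_LABEL.match(name):
        raise ValueError(f"{name!r} is not a valid RFC 1123 DNS label")

validate_cluster_name("my-cluster")  # fine, no exception
try:
    validate_cluster_name("My_Cluster")
except ValueError as e:
    print(e)  # prints the validation error
```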
### Controller idempotency

The controller event handlers should be idempotent so they can safely be called multiple times. Today they are not, which can cause problems when the controller restarts while operations are in flight.
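The shape of an idempotent create handler can be sketched with a fake API; the real controller would treat an HTTP 409 Conflict from the Kubernetes API the same way the stand-in exception is treated here:

```python
# Sketch of an idempotent create handler. FakeAPI stands in for the cluster;
# the real handler would catch a 409 Conflict from the Kubernetes API instead.
class FakeAPI:
    def __init__(self):
        self.store = {}

    def create(self, name: str, obj: dict) -> None:
        if name in self.store:
            raise FileExistsError(name)  # stand-in for HTTP 409 Conflict
        self.store[name] = obj

def ensure_scheduler(api: FakeAPI, cluster: str) -> None:
    """Safe to call any number of times: 'already exists' counts as success."""
    try:
        api.create(f"{cluster}-scheduler", {"kind": "Pod"})
    except FileExistsError:
        pass  # a previous invocation (or a restarted controller) already created it

api = FakeAPI()
ensure_scheduler(api, "demo")
ensure_scheduler(api, "demo")  # second call is a no-op, not a crash
print(len(api.store))  # 1
```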
## Hygiene/Tech Debt

### Migrate Kubernetes client library to kr8s

Today we use `pykube-ng`, `dask_kubernetes.aiopykube`, `kubernetes_asyncio` and `subprocess`/`kubectl` to interact with the Kubernetes API. We should consolidate everything around kr8s, which was spun out of this project with the intention of unifying our API usage.
### Tasks
- [ ] https://github.com/rapidsai/deployment/issues/316
- [ ] Create new helm chart to replace [dask/dask](https://artifacthub.io/packages/helm/dask/dask) with one that uses a DaskCluster instead of scheduler and worker Deployment/Service resources
- [ ] https://github.com/rapidsai/deployment/issues/317