# Dask Kubernetes Summer Roadmap
> **Note:** Creating this in rapidsai/deployment so I can use tasklists. When tasklists are GA I'll migrate this issue to the dask/dask-kubernetes repo.
At the end of the summer I want to release V2 of the Dask Kubernetes Operator and fully remove the deprecated classic implementations. This issue outlines the roadmap that we need to complete to get us to a point where we can do that.
High-level goals:

- Improve stability
- Ensure feature completeness compared to other implementations
Some of the sections here may want to be split off into separate issues, and some tasks may want to be broken down into smaller chunks. But this will be the high-level milestone tracker issue for this work.
## Features

### Cluster idle timeout
Cleaning up idle clusters automatically becomes critical for cost reduction when deploying at scale, especially when using GPUs.
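As a rough sketch of what the controller side of this could look like, the decision logic boils down to comparing the cluster's last scheduler activity against a configured timeout. The function and field names here are illustrative, not the operator's actual API:

```python
from datetime import datetime, timedelta

# Illustrative helper, not the operator's actual API: decide whether a
# cluster has been idle long enough for the controller to reap it.
def should_reap(last_activity: datetime, now: datetime, idle_timeout: timedelta) -> bool:
    """Return True when the cluster has been idle for at least idle_timeout."""
    return now - last_activity >= idle_timeout

now = datetime(2023, 6, 1, 12, 0, 0)
print(should_reap(now - timedelta(hours=2), now, timedelta(hours=1)))    # True: idle past the timeout
print(should_reap(now - timedelta(minutes=10), now, timedelta(hours=1)))  # False: recent activity
```

The controller would run this check on a timer and, when it returns `True`, delete the `DaskCluster` resource so all child resources are cleaned up with it.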
### Full Istio support

Currently we have partial Istio support: the scheduler uses it but workers do not. This can be a blocker on clusters that enforce Istio for all comms.
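One possible shape for closing this gap, assuming the standard Istio sidecar-injection annotation; the helper function is hypothetical and the image name is only an example:

```python
# Hypothetical helper: make sure a worker pod template carries the standard
# Istio sidecar-injection annotation so workers join the mesh like the scheduler.
def with_istio_sidecar(pod_template: dict) -> dict:
    metadata = pod_template.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations["sidecar.istio.io/inject"] = "true"
    return pod_template

worker = with_istio_sidecar(
    {"spec": {"containers": [{"name": "worker", "image": "ghcr.io/dask/dask:latest"}]}}
)
print(worker["metadata"]["annotations"])
```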
## UX Improvements

UX can always be improved.

## Fixes

### Replace Pod resources with higher abstractions like Deployment or at least ReplicaSet
Currently we manage bare Pods. This has downsides, such as Pods not being recreated when they are evicted from a node. It would be good to explore higher-level resources and how they could simplify our controller logic.
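To illustrate the idea, the existing worker pod spec could be wrapped in a Deployment manifest so the ReplicaSet controller recreates evicted workers for us. The helper and label keys below are a sketch, not the operator's actual implementation:

```python
# Illustrative sketch: wrap a worker pod spec in a Deployment manifest so
# Kubernetes itself replaces workers that are evicted from a node.
def worker_deployment(cluster_name: str, pod_spec: dict, replicas: int) -> dict:
    # Label keys are examples only; the real operator may use different ones.
    labels = {"dask.org/cluster-name": cluster_name, "dask.org/component": "worker"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{cluster_name}-worker", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }

d = worker_deployment("demo", {"containers": [{"name": "worker", "image": "ghcr.io/dask/dask:latest"}]}, 3)
print(d["kind"], d["spec"]["replicas"])  # Deployment 3
```

Scaling a worker group would then become a patch to `spec.replicas` rather than creating and deleting individual Pods.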
### Ensure patches to DaskCluster and DaskWorkerGroup are propagated to child resources

In CRUD terms we only have create, read and delete implemented for our resources; we also need to correctly handle updating them.

### Ensure scaling/autoscaling is solid

Some users are reporting unwanted behaviour when autoscaling at scale. This needs to be solid.

### Input sanitisation

Currently we rely on the Kubernetes API to reject bad configuration, but this doesn't always happen as we expect. We should do more checking and sanitisation before calling the Kubernetes API.
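A small example of the kind of client-side check this could mean, validating cluster names against the RFC 1123 DNS label rules that Kubernetes enforces for most resource names; the function name is illustrative:

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', must start and end
# with an alphanumeric, at most 63 characters. Kubernetes enforces this for
# most resource names; checking client-side gives users clearer errors up front.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")

def validate_cluster_name(name: str) -> None:
    """Raise ValueError before the manifest ever reaches the Kubernetes API."""
    if not DNS_LABEL.match(name):
        raise ValueError(f"{name!r} is not a valid RFC 1123 DNS label")

validate_cluster_name("my-cluster")  # fine, no exception
try:
    validate_cluster_name("My_Cluster")
except ValueError as e:
    print(e)  # prints the validation error
```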
### Controller idempotency

The controller event handlers should be idempotent so they can safely be called multiple times. Today they are not, which can cause problems when the controller restarts while operations are in flight.
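The shape of an idempotent create handler can be sketched with a fake API; the real controller would treat an HTTP 409 Conflict from the Kubernetes API the same way the stand-in exception is treated here:

```python
# Sketch of an idempotent create handler. FakeAPI stands in for the cluster;
# the real handler would catch a 409 Conflict from the Kubernetes API instead.
class FakeAPI:
    def __init__(self):
        self.store = {}

    def create(self, name: str, obj: dict) -> None:
        if name in self.store:
            raise FileExistsError(name)  # stand-in for HTTP 409 Conflict
        self.store[name] = obj

def ensure_scheduler(api: FakeAPI, cluster: str) -> None:
    """Safe to call any number of times: 'already exists' counts as success."""
    try:
        api.create(f"{cluster}-scheduler", {"kind": "Pod"})
    except FileExistsError:
        pass  # a previous invocation (or a restarted controller) already created it

api = FakeAPI()
ensure_scheduler(api, "demo")
ensure_scheduler(api, "demo")  # second call is a no-op, not a crash
print(len(api.store))  # 1
```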
## Hygiene/Tech Debt

### Migrate Kubernetes client library to kr8s

Today we use `pykube-ng`, `dask_kubernetes.aiopykube`, `kubernetes_asyncio` and `subprocess`/`kubectl` to interact with the Kubernetes API. We should consolidate everything around kr8s, which was spun out of this project with the intention of unifying our API usage.
### Tasks
- [ ] https://github.com/rapidsai/deployment/issues/316
- [ ] Create new helm chart to replace [dask/dask](https://artifacthub.io/packages/helm/dask/dask) with one that uses a DaskCluster instead of scheduler and worker Deployment/Service resources
- [ ] https://github.com/rapidsai/deployment/issues/317