
CA DRA: review calculating Node utilization for DRA resources #7781

Open
towca opened this issue Jan 29, 2025 · 0 comments
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@towca
Collaborator

towca commented Jan 29, 2025

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Cluster Autoscaler's scale-down logic calculates a utilization value for each Node in the cluster. Only Nodes with utilization below a configured threshold are considered candidates for scale-down. Utilization is computed separately for every resource by dividing the sum of the Pods' requests by the Node's allocatable value (it looks at requests, not "real" usage). If a Node has a GPU, only GPU utilization is taken into account; CPU and memory are ignored. If a Node doesn't have a GPU, its utilization is the max of the CPU and memory utilization values. It's not immediately clear how to extend this model to DRA resources.
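The pre-DRA rule above can be sketched as follows (a minimal illustration, not CA's actual code; resource quantities are simplified to plain integers):

```go
package main

import "fmt"

// utilization returns requested/allocatable for a single resource.
func utilization(requested, allocatable int64) float64 {
	if allocatable == 0 {
		return 0
	}
	return float64(requested) / float64(allocatable)
}

// nodeUtilization sketches the scale-down rule: if the Node exposes GPUs,
// only GPU utilization counts; otherwise the result is the max of the CPU
// and memory utilization values.
func nodeUtilization(cpuReq, cpuAlloc, memReq, memAlloc, gpuReq, gpuAlloc int64) float64 {
	if gpuAlloc > 0 {
		return utilization(gpuReq, gpuAlloc)
	}
	cpu := utilization(cpuReq, cpuAlloc)
	mem := utilization(memReq, memAlloc)
	if cpu > mem {
		return cpu
	}
	return mem
}

func main() {
	// GPU Node: CPU/memory ignored, 1 of 2 GPUs allocated -> 0.5.
	fmt.Println(nodeUtilization(3000, 4000, 1, 2, 1, 2))
	// Non-GPU Node: max(0.25 CPU, 0.75 memory) -> 0.75.
	fmt.Println(nodeUtilization(1000, 4000, 3, 4, 0, 0))
}
```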

This DRA MVP PR extended the utilization model in the following way. All Node-local Devices exposed in ResourceSlices for a given Node are grouped by their Driver and Pool. If the ResourceSlices don't present a complete view of a Pool, utilization is not calculated and the Node is not scaled down until that changes. Allocated Node-local Devices from the scheduled Pods' ResourceClaims are grouped by Driver and Pool as well. Utilization is then calculated separately for every <Driver, Pool> pair by dividing the number of allocated Devices by the total number of Devices for that Driver and Pool. The Node's utilization is the max of these <Driver, Pool> values. Similarly to how GPUs are treated, if DRA resources are exposed for a Node, only DRA utilization is taken into account; CPU and memory are ignored.
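The <Driver, Pool> grouping described above can be sketched like this (illustrative types, not CA's actual implementation):

```go
package main

import "fmt"

// poolKey identifies a <Driver, Pool> pair, mirroring the grouping
// described above (the type names here are invented for illustration).
type poolKey struct {
	Driver string
	Pool   string
}

// draUtilization computes allocated/total per <Driver, Pool> pair and
// returns the max across all pairs, as in the MVP behavior described above.
func draUtilization(total, allocated map[poolKey]int) float64 {
	max := 0.0
	for key, tot := range total {
		if tot == 0 {
			continue
		}
		u := float64(allocated[key]) / float64(tot)
		if u > max {
			max = u
		}
	}
	return max
}

func main() {
	total := map[poolKey]int{
		{"gpu.example.com", "pool-a"}: 4,
		{"nic.example.com", "pool-b"}: 2,
	}
	allocated := map[poolKey]int{
		{"gpu.example.com", "pool-a"}: 1, // 0.25 utilization
		{"nic.example.com", "pool-b"}: 2, // 1.0 utilization
	}
	fmt.Println(draUtilization(total, allocated)) // max across pairs: 1
}
```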

The logic above should work pretty well for Nodes with one local Pool of identical, expensive (compared to CPU/memory) Devices, basically mimicking the current GPU approach. For other scenarios it will behave predictably, but it doesn't seem nearly flexible enough to be usable in practice (I might be wrong here though). What if a given DRA Device is no more expensive than CPU/memory and shouldn't be prioritized? What if it's some fake, "free" Device that shouldn't be counted in utilization at all? Etc.

Describe the solution you'd like.:

IMO we need some way to configure the utilization calculation details from within the ResourceSlices. For example, Devices exposed in ResourceSlices could specify how they should be treated during the calculation (e.g. ignore this Device because it's "free", treat it the same as the other Devices in the Pool, or only count this Device because it dominates the Node's cost).
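One hypothetical shape this could take is a per-Device policy hint. The policy names and types below are invented purely for illustration and are not part of any existing API:

```go
package main

import "fmt"

// devicePolicy is a hypothetical per-Device hint (not part of any current
// API) telling the autoscaler how to treat the Device in utilization math.
type devicePolicy string

const (
	policyDefault  devicePolicy = ""         // count normally
	policyIgnore   devicePolicy = "ignore"   // "free" Device, skip it
	policyDominant devicePolicy = "dominant" // only Devices like this matter
)

type device struct {
	Allocated bool
	Policy    devicePolicy
}

// poolUtilization sketches one way the configurable behavior proposed above
// could work: ignored Devices are excluded entirely, and if any dominant
// Devices exist in the Pool, only they are counted.
func poolUtilization(devices []device) float64 {
	hasDominant := false
	for _, d := range devices {
		if d.Policy == policyDominant {
			hasDominant = true
		}
	}
	total, allocated := 0, 0
	for _, d := range devices {
		if d.Policy == policyIgnore {
			continue
		}
		if hasDominant && d.Policy != policyDominant {
			continue
		}
		total++
		if d.Allocated {
			allocated++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(allocated) / float64(total)
}

func main() {
	devices := []device{
		{Allocated: true, Policy: policyDominant},
		{Allocated: true, Policy: policyIgnore},   // "free", excluded
		{Allocated: false, Policy: policyDefault}, // skipped: dominant present
	}
	fmt.Println(poolUtilization(devices)) // only the dominant Device counts: 1
}
```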

Not sure what the actual solution should be, but it should be designed (not sure if KEP-level or something more lightweight) and discussed with WG Device Management, then implemented.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Jan 29, 2025
@mbrow137 mbrow137 moved this from 🆕 New to 🏗 In Progress in SIG Autoscaler: Dynamic Resource Allocation Feb 4, 2025
@mbrow137 mbrow137 moved this from 🏗 In Progress to 🆕 New in SIG Autoscaler: Dynamic Resource Allocation Feb 4, 2025