CA DRA: review calculating Node utilization for DRA resources #7781
Labels
area/cluster-autoscaler
area/core-autoscaler
wg/device-management
Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Cluster Autoscaler scale-down logic calculates a utilization value for each Node in the cluster. Only Nodes with utilization below a configured threshold are considered candidates for scale-down. Utilization is computed separately for every resource by dividing the sum of the Pods' requests by the Node's allocatable value (it looks at requests, not "real" usage). If a Node has a GPU, only the GPU utilization is taken into account - CPU and memory are ignored. If a Node doesn't have a GPU, its utilization is the max of the CPU and memory utilization values. It's not immediately clear how to extend this model for DRA resources.
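
For illustration, here is a minimal Go sketch of the calculation described above. It is not the actual Cluster Autoscaler code; the function and the resource keys (including `nvidia.com/gpu`) are simplified stand-ins.

```go
// Illustrative sketch only - not the real Cluster Autoscaler code.
package main

import "fmt"

// nodeUtilization returns the value used for scale-down candidacy:
// requests / allocatable per resource, then either the GPU value alone
// (if the Node has a GPU) or max(CPU, memory).
func nodeUtilization(requests, allocatable map[string]float64, hasGPU bool) float64 {
	util := func(resource string) float64 {
		if allocatable[resource] == 0 {
			return 0
		}
		return requests[resource] / allocatable[resource]
	}
	if hasGPU {
		return util("nvidia.com/gpu")
	}
	cpu, mem := util("cpu"), util("memory")
	if cpu > mem {
		return cpu
	}
	return mem
}

func main() {
	requests := map[string]float64{"cpu": 1.5, "memory": 2e9}
	allocatable := map[string]float64{"cpu": 4, "memory": 16e9}
	// Prints 0.38 - CPU utilization dominates memory utilization.
	fmt.Printf("utilization: %.2f\n", nodeUtilization(requests, allocatable, false))
}
```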
The DRA MVP PR (#7530) extended the utilization model in the following way. All Node-local Devices exposed in ResourceSlices for a given Node are grouped by their Driver and Pool. If the ResourceSlices don't present a complete view of a Pool, utilization is not calculated and the Node is not scaled down until that changes. Allocated Node-local Devices from the scheduled Pods' ResourceClaims are also grouped by their Driver and Pool. Utilization is calculated separately for every <Driver, Pool> pair by dividing the number of allocated devices by the total number of devices for that Driver and Pool. The utilization of the Node is the max of these <Driver, Pool> utilization values. Similarly to how GPUs are treated, if DRA resources are exposed for a Node, only DRA utilization is taken into account - CPU and memory are ignored.
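
A rough sketch of that per-<Driver, Pool> calculation, again using simplified stand-in types rather than the actual CA implementation:

```go
// Illustrative sketch only - not the real Cluster Autoscaler code.
package main

import "fmt"

type poolKey struct {
	Driver string
	Pool   string
}

// draUtilization divides allocated devices by total devices per
// <Driver, Pool> pair and returns the maximum of those values.
func draUtilization(totalDevices, allocatedDevices map[poolKey]int) float64 {
	maxUtil := 0.0
	for key, total := range totalDevices {
		if total == 0 {
			continue
		}
		util := float64(allocatedDevices[key]) / float64(total)
		if util > maxUtil {
			maxUtil = util
		}
	}
	return maxUtil
}

func main() {
	total := map[poolKey]int{
		{Driver: "gpu.example.com", Pool: "node-local"}: 8,
		{Driver: "nic.example.com", Pool: "node-local"}: 2,
	}
	allocated := map[poolKey]int{
		{Driver: "gpu.example.com", Pool: "node-local"}: 2,
		{Driver: "nic.example.com", Pool: "node-local"}: 2,
	}
	// Prints 1.00 - the fully allocated NIC pool dominates, so the Node
	// is not a scale-down candidate even though the GPU pool is mostly free.
	fmt.Printf("node DRA utilization: %.2f\n", draUtilization(total, allocated))
}
```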
The logic above should work pretty well for Nodes with one local Pool of identical Devices that are expensive compared to CPU/memory - basically mimicking the current GPU approach. For other scenarios it will behave predictably, but it doesn't seem nearly flexible enough to be usable in practice (I might be wrong here though). What if a given DRA Device is not more expensive than CPU/memory and shouldn't be prioritized? What if it's some fake, "free" Device that shouldn't be taken into account when calculating utilization at all? Etc.
Describe the solution you'd like.:
IMO we need some ability to configure utilization calculation details from within the ResourceSlices. For example, Devices exposed in ResourceSlices could specify how they should be treated when calculating utilization (e.g. ignore this Device because it's "free", treat it the same as the other Devices in the Pool, or only look at this Device because it dominates the Node cost).
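
Purely as a sketch of one possible shape (not a proposal for the actual ResourceSlice API), a hypothetical per-device hint could be interpreted along these lines; all type and attribute names below are invented:

```go
// Hypothetical sketch of a per-device utilization hint - not an existing API.
package main

import "fmt"

// utilizationHint is a hypothetical value a Device could carry, e.g. via a
// well-known ResourceSlice attribute set by the driver.
type utilizationHint string

const (
	hintDefault  utilizationHint = ""         // count like any other device in the Pool
	hintIgnore   utilizationHint = "ignore"   // "free" device, skip when computing utilization
	hintDominant utilizationHint = "dominant" // device dominates Node cost, only it matters
)

type device struct {
	Name      string
	Allocated bool
	Hint      utilizationHint
}

// poolUtilization applies the hints: devices marked "ignore" never count, and
// if any device is marked "dominant", only dominant devices are counted.
func poolUtilization(devices []device) float64 {
	hasDominant := false
	for _, d := range devices {
		if d.Hint == hintDominant {
			hasDominant = true
		}
	}
	total, allocated := 0, 0
	for _, d := range devices {
		if d.Hint == hintIgnore {
			continue
		}
		if hasDominant && d.Hint != hintDominant {
			continue
		}
		total++
		if d.Allocated {
			allocated++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(allocated) / float64(total)
}

func main() {
	devices := []device{
		{Name: "gpu-0", Allocated: true, Hint: hintDominant},
		{Name: "gpu-1", Allocated: false, Hint: hintDominant},
		{Name: "shim-0", Allocated: true, Hint: hintIgnore},
	}
	// Prints 0.50 - the "free" shim device is ignored, only GPUs count.
	fmt.Printf("pool utilization: %.2f\n", poolUtilization(devices))
}
```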
I'm not sure what the actual solution should be, but it should be designed (whether at KEP level or something more lightweight) and discussed with WG Device Management, then implemented.
Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.