Add csinode limit awareness in cluster-autoscaler #8721
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from c6d08e7 to c26409b
@mtrqq Could you do a first-pass review here? This is supposed to follow the way the DRA integration was implemented in CA as much as possible, but instead of the DRA objects we have CSINode objects.
Force-pushed from 10b6319 to 18906a8

Force-pushed from 48c0b28 to 2a287f3
@mtrqq Yes, the PR is ready for review. It was marked WIP because I had to keep rebasing it and make a bunch of changes due to the latest kube rebase. Even the tests that are still failing are most likely unrelated to this change and are happening because of the kube version bump.
Force-pushed from 2a287f3 to 2811f12

Force-pushed from 2811f12 to e673a9b
Bump vendor dependencies for latest k8s master
Fix code to use new framework
Force-pushed from e673a9b to dd1da1e
@gnufied: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
	return nil, err
}

wrappedNodeInfo := framework.WrapSchedulerNodeInfo(schedNodeInfo, nil, nil)
```
It took me some time to understand that you've changed the approach from wrapping the node info object inside the CSI/DRA snapshots to mutating the node info object in place. This aligns with the effort to reduce memory allocations when handling node infos, and I like this part.
But can we come up with a consistent name for this method across snapshots? For example, AugmentNodeInfo.
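For illustration only, here is a minimal sketch of what a consistent in-place mutation hook could look like on the CSI side if both snapshots standardize on a name like AugmentNodeInfo; the local NodeInfo and Snapshot types and field names below are stand-ins for the real framework types, not the PR's actual API:

```go
package snapshot

import (
	storagev1 "k8s.io/api/storage/v1"
)

// NodeInfo is a minimal stand-in for the real framework node info type.
type NodeInfo struct {
	Name    string
	CSINode *storagev1.CSINode
}

// Snapshot is a hypothetical CSI snapshot keyed by node name.
type Snapshot struct {
	csiNodes map[string]*storagev1.CSINode
}

// AugmentNodeInfo mutates the NodeInfo in place (instead of returning a wrapper),
// mirroring the hook the DRA snapshot could expose under the same name.
func (s *Snapshot) AugmentNodeInfo(nodeInfo *NodeInfo) {
	if csiNode, ok := s.csiNodes[nodeInfo.Name]; ok {
		nodeInfo.CSINode = csiNode
	}
}
```

The point is only that both snapshot types expose the same method name and mutate the passed-in object rather than wrapping it.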
| "k8s.io/klog/v2" | ||
| fwk "k8s.io/kube-scheduler/framework" | ||
| schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" | ||
| intreeschedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" |
Why change the import name here? The previous name is consistent across the rest of the codebase, and I don't see how the new name is better.
So, they changed some of the interfaces in kube-scheduler, and I renamed this import because it felt more consistent.
But please ignore the renaming for now; once I rebase my PR on #8827, these renames will be gone.
| "k8s.io/klog/v2" | ||
| fwk "k8s.io/kube-scheduler/framework" | ||
| schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" | ||
| intreeschedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change the import name here? The previous name is consistent across the other parts of the codebase and I don't see how the new name is better
```go
// Provider provides access to CSI node information for the cluster.
type Provider struct {
	csINodesLister v1storagelister.CSINodeLister
```
casing typo: csiNodesLister
```go
}

// AddCSINodes adds a list of CSI nodes to the snapshot.
func (s *Snapshot) AddCSINodes(csiNodes []*storagev1.CSINode) error {
```
Can we reuse AddCSINode here?
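For example, a minimal sketch of the suggested reuse, assuming a single-object AddCSINode method already exists on the snapshot as in this PR (its exact signature is assumed here):

```go
// AddCSINodes adds a list of CSI nodes to the snapshot by delegating to
// AddCSINode, so the per-object validation and bookkeeping live in one place.
// Sketch only; assumes AddCSINode(csiNode *storagev1.CSINode) error exists.
func (s *Snapshot) AddCSINodes(csiNodes []*storagev1.CSINode) error {
	for _, csiNode := range csiNodes {
		if err := s.AddCSINode(csiNode); err != nil {
			return err
		}
	}
	return nil
}
```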
```go
	return result
}

func (n *NodeInfo) AddNodeResourceSlices(slices []*resourceapi.ResourceSlice) *NodeInfo {
```
Please resolve the lint error by adding comments to the new public methods:
`exported method NodeInfo.AddCSINode should have comment or be unexported`
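For example, the lint error for NodeInfo.AddCSINode goes away once the exported method has a doc comment starting with its name; the signature and body below are assumed for illustration, by analogy with AddNodeResourceSlices above:

```go
// AddCSINode attaches the given CSINode to this NodeInfo and returns the
// NodeInfo to allow chaining. (Signature and body assumed for illustration.)
func (n *NodeInfo) AddCSINode(csiNode *storagev1.CSINode) *NodeInfo {
	n.CSINode = csiNode
	return n
}
```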
```go
}

// Validate CSINode if expected
if config.expectedCSINode != nil {
```
These nested conditions are extremely hard to read. While in principle I can understand what's supposed to happen here, it's hard to analyze and will likely be hard to debug. I would advise moving this into a helper function and using short-circuit returns for the validation, for example:
```go
if (config.expectedCSINode == nil) != (nodeInfo.CSINode == nil) {
	return fmt.Errorf("CSINode mismatch: expected %v, got %v", config.expectedCSINode, nodeInfo.CSINode)
}
if len(expectedDrivers) != len(gotDrivers) {
	return fmt.Errorf("expected %d CSI drivers, got %d", len(expectedDrivers), len(gotDrivers))
}
```
The same can be done for the rest of the config.
```diff
 replace github.com/rancher/go-rancher => github.com/rancher/go-rancher v0.1.0

-replace k8s.io/api => k8s.io/api v0.34.1
+replace k8s.io/api => github.com/kubernetes/api v0.0.0-20251107002836-f1737241c064
```
What's the reason behind adding github.com to all k8s dependencies?
```
replace k8s.io/externaljwt => k8s.io/externaljwt v0.34.1

replace k8s.io/kubernetes => github.com/kubernetes/kubernetes v1.35.0-alpha.3.0.20251107154100-609e2e57dacd
```
@towca Is it common practice to update dependencies as part of a PR rather than in the context of a release?
In principle I understand that it can be unavoidable, but do we depend on unmerged branch state here? @gnufied
That's what I understood based on "This includes relevant changes in k/k repo from - https://github.com/gnufied/kubernetes/tree/volume-limits-redux-cas".
Again, I am going to rebase my branch on top of #8827. I am hoping the beta-0 rebase merges into the CAS code, so these replaces will go away.
```go
// if cloudprovider does not provide CSI related stuff, then we can skip the CSI readiness check
if nodeInfo.CSINode == nil {
	newReadyNodes = append(newReadyNodes, node)
	klog.Warningf("No CSI node found for node %s, Skipping CSI readiness check and keeping node in ready list.", node.Name)
```
Warning level seems extreme given the potential noise in the logs, or do we anticipate all nodes to have a matching CSINode?
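If per-node noise does become a problem, one option (sketch only, not necessarily what the PR should do) is a verbosity-gated info log instead of a warning:

```go
// Sketch: log at V(4) instead of Warning to avoid per-node noise on clusters
// where many nodes legitimately have no CSINode object.
klog.V(4).Infof("No CSINode found for node %s, skipping CSI readiness check and keeping node in ready list", node.Name)
```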
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
xref: kubernetes/enhancements#5030

This makes CAS aware of volume limits on new nodes when scaling up for pending pods.
It includes relevant changes in the k/k repo from https://github.com/gnufied/kubernetes/tree/volume-limits-redux-cas, but I have tested the changes together and they work fine.
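For context, the per-driver volume limit that CAS reads comes from the CSINode object's allocatable.count field (storage.k8s.io/v1). A minimal, self-contained sketch of such an object; the driver name, node ID, and count are made-up example values:

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Example CSINode advertising that this node can attach at most 25 volumes
	// for the (made-up) driver "ebs.csi.example.com". The autoscaler uses this
	// limit when deciding whether pending pods with volumes fit on a new node.
	maxVolumes := int32(25)
	csiNode := &storagev1.CSINode{
		ObjectMeta: metav1.ObjectMeta{Name: "template-node"},
		Spec: storagev1.CSINodeSpec{
			Drivers: []storagev1.CSINodeDriver{{
				Name:        "ebs.csi.example.com",
				NodeID:      "i-0123456789abcdef0",
				Allocatable: &storagev1.VolumeNodeResources{Count: &maxVolumes},
			}},
		},
	}
	fmt.Printf("driver %s allows %d volumes\n",
		csiNode.Spec.Drivers[0].Name, *csiNode.Spec.Drivers[0].Allocatable.Count)
}
```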