Skip to content

Conversation

@gnufied
Copy link
Member

@gnufied gnufied commented Oct 31, 2025

xref - kubernetes/enhancements#5030

This makes CAS aware of volume limits on new nodes when scaling for pending pods.

This includes relevant changes in k/k repo from - https://github.com/gnufied/kubernetes/tree/volume-limits-redux-cas

But I have tested the changes together and they work fine.

Add support for counting CSI volume limits when scaling nodes

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler and removed do-not-merge/needs-area labels Oct 31, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 31, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 5, 2025
@towca
Copy link
Collaborator

towca commented Nov 7, 2025

@mtrqq Could you do a first-pass review here? This is supposed to follow the way that DRA integration was implemented in CA as much as possible, but instead of the DRA objects we have CSINode.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 10, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 14, 2025
@mtrqq
Copy link
Contributor

mtrqq commented Nov 14, 2025

@towca Certainly, I'm on my way reading through the KEP.

@gnufied is it PR still WIP?

@k8s-ci-robot k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 18, 2025
@gnufied
Copy link
Member Author

gnufied commented Nov 18, 2025

@mtrqq yes, the PR is ready for review. It was marked WIP, because I had to keep rebasing it and I had to make bunch of changes because of latest kube rebase.

Even the tests that are still failing are most likely unrelated to this change and are happening because of kube version bump.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 20, 2025
@gnufied gnufied changed the title WIP - Add csinode limit awareness in cluster-autoscaler Add csinode limit awareness in cluster-autoscaler Nov 20, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 20, 2025
@k8s-ci-robot
Copy link
Contributor

@gnufied: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-autoscaling-e2e-ca-build dd1da1e link false /test pull-autoscaling-e2e-ca-build

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

return nil, err
}

wrappedNodeInfo := framework.WrapSchedulerNodeInfo(schedNodeInfo, nil, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me sometime to understand that you've changed the approach from wrapping the node info object inside the CSI/DRA snapshots to mutating the node info object inplace, this somewhat aligns with the approach to reduce the memory allocations when handling node infos, I like this part.

But can we come up with a consistent name for this method across snapshots? For example - AugmentNodeInfo

"k8s.io/klog/v2"
fwk "k8s.io/kube-scheduler/framework"
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
intreeschedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change the import name here? The previous name is consistent across the other parts of the codebase and I don't see how the new name is better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, they changed some of the interfaces in kube-scheduler and I renamed this import because it felt more consistent.

But, Please ignore these renaming for now, because once I rebase my PR with #8827 then these renamings will be gone.

"k8s.io/klog/v2"
fwk "k8s.io/kube-scheduler/framework"
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
intreeschedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change the import name here? The previous name is consistent across the other parts of the codebase and I don't see how the new name is better


// Provider provides access to CSI node information for the cluster.
type Provider struct {
csINodesLister v1storagelister.CSINodeLister
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

casing typo: csiNodesLister

}

// AddCSINodes adds a list of CSI nodes to the snapshot.
func (s *Snapshot) AddCSINodes(csiNodes []*storagev1.CSINode) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse AddCSINode here?

return result
}

func (n *NodeInfo) AddNodeResourceSlices(slices []*resourceapi.ResourceSlice) *NodeInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve the error by adding comments to new public methods

exported method NodeInfo.AddCSINode should have comment or be unexported

}

// Validate CSINode if expected
if config.expectedCSINode != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These nested conditions are extremely hard to read, and while in principle I can understand what's supposed to happen here - it's hard to analyse and likely debug. I would advice to move that to helper function and use short-circuit returns for such validation

if config.expectedCSINode == nil || nodeInfo.CSINode == nil 
{
  ...check that nil value match
  return fmt.Errorf("...")
}

if len(expectedDrivers) != len(gotDrivers) {
  return fmt.Errorf("...")
}

Same can be done for the rest of the config

replace github.com/rancher/go-rancher => github.com/rancher/go-rancher v0.1.0

replace k8s.io/api => k8s.io/api v0.34.1
replace k8s.io/api => github.com/kubernetes/api v0.0.0-20251107002836-f1737241c064
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason behind adding github.com to all k8s dependencies?


replace k8s.io/externaljwt => k8s.io/externaljwt v0.34.1

replace k8s.io/kubernetes => github.com/kubernetes/kubernetes v1.35.0-alpha.3.0.20251107154100-609e2e57dacd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@towca is that a common practice to update dependencies as part of the PR and not in the context of the release?

In principle I understand that it can be unavoidable, but do we depend on the unmerged branch state in here? @gnufied

That's what I understood based on "This includes relevant changes in k/k repo from - https://github.com/gnufied/kubernetes/tree/volume-limits-redux-cas"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I am going to rebase my branch on top of #8827 . I am hoping the beta-0 rebase merges in CAS code, so as these replaces go away.

// if cloudprovider does not provide CSI related stuff, then we can skip the CSI readiness check
if nodeInfo.CSINode == nil {
newReadyNodes = append(newReadyNodes, node)
klog.Warningf("No CSI node found for node %s, Skipping CSI readiness check and keeping node in ready list.", node.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning level seems to be extreme due to potential noise level in the logs, or do we anticipate all the nodes to have the matching CsiNode?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants