OCI provider: Remove unprovisioned nodes in error state as requested instead of returning an error #8806
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlamillan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@trungng92 let me know what you think.

@vbhargav875 you as well.
```go
if *node.Id == instanceID {
	return nodePool, nil
} else if ocicommon.InstanceIDUnfulfilled == instanceID && *node.Id == "" {
	return nodePool, nil
}
```
Is it possible that this line is going to cause us to return the wrong node pool? For instance, if a user enabled balance-similar-node-groups and 3 node pools scaled up all at once with "unfulfilled instances", won't getByInstance return the first node pool every time for those unfulfilled instances?
I pushed a commit to only match an unfulfilled instance to a node pool if that pool has a node in CREATING status with an associated error.
In short, if a node in a pool can't be created due to an error, we treat that pool as a match for an unfulfilled instance.
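A minimal sketch of that matching heuristic, using simplified hypothetical types in place of the vendored OCI SDK structs (the real provider uses oke.NodePool / oke.Node and ocicommon.InstanceIDUnfulfilled; names and fields here are illustrative):

```go
package oci

import "fmt"

// Hypothetical, simplified stand-ins for the OCI SDK types.
type Node struct {
	ID             string
	LifecycleState string  // e.g. "CREATING", "ACTIVE"
	NodeError      *string // populated when provisioning failed
}

type NodePool struct {
	ID    string
	Nodes []Node
}

// Stand-in for the ocicommon.InstanceIDUnfulfilled sentinel.
const instanceIDUnfulfilled = "instance_placeholder"

// getByInstance resolves an instance ID to its owning node pool. An
// unfulfilled placeholder ID is matched only to a pool that has a node
// stuck in CREATING with an associated error, so that several pools
// scaling up at once do not all resolve to the first pool in the list.
func getByInstance(pools []*NodePool, instanceID string) (*NodePool, error) {
	for _, pool := range pools {
		for _, node := range pool.Nodes {
			if node.ID == instanceID {
				return pool, nil
			}
			if instanceID == instanceIDUnfulfilled &&
				node.LifecycleState == "CREATING" && node.NodeError != nil {
				return pool, nil
			}
		}
	}
	return nil, fmt.Errorf("no node pool found for instance %q", instanceID)
}
```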
```go
if err != nil {
	return err
}
return c.setSize(nodePoolID, size-1)
```
With this code change, would we ever have a situation where the above if-condition (instanceID == "") is true?
If yes, should we decrement the size of the node pool there as well?
This proposed code change is handling a specific scenario where the Cluster Autoscaler has requested that we remove an unregistered node whose underlying compute instance ID cannot be established. Outside of that scenario, I am not sure why a compute ID could not be determined, or whether decrementing the size of the node pool is the right thing to do.
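For illustration, a rough sketch of the delete path under discussion, reusing the instanceIDUnfulfilled stand-in from the earlier sketch; getSize, setSize, and terminateInstance are hypothetical helpers, not the provider's actual API:

```go
// nodePoolManager is a minimal interface invented for this sketch.
type nodePoolManager interface {
	getSize(nodePoolID string) (int, error)
	setSize(nodePoolID string, size int) error
	terminateInstance(instanceID string) error
}

// removeInstance handles a DeleteNodes request for a single instance ID.
// For the unfulfilled placeholder there is no compute instance to
// terminate, so the failed scale-up is cancelled by decrementing the
// pool's target size instead of returning an error.
func removeInstance(m nodePoolManager, nodePoolID, instanceID string) error {
	size, err := m.getSize(nodePoolID)
	if err != nil {
		return err
	}
	if instanceID == instanceIDUnfulfilled {
		// No instance was ever provisioned (e.g. capacity or quota
		// errors): shrink the target size to cancel the scale-up.
		return m.setSize(nodePoolID, size-1)
	}
	// Normal path: terminate the real instance, then shrink the pool.
	if err := m.terminateInstance(instanceID); err != nil {
		return err
	}
	return m.setSize(nodePoolID, size-1)
}
```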
/assign @trungng92 @vbhargav875 @gvnc
@jlamillan ping me when you have provider consensus and I can move this into master

@jackfrancis: GitHub didn't allow me to assign the following users: trungng92, vbhargav875, gvnc. Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR updates the behavior of OCI's nodepool implementation of DeleteNodes() when it is called with an unfulfilled placeholder node that is in an error state, typically due to capacity or quota issues. Rather than returning an error because the placeholder node has an empty instance ID, the DeleteNodes() request effectively cancels the failed scale-up, allowing the Cluster Autoscaler to recover more quickly and try scaling up a different node pool that meets the scheduling requirements.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
These changes are currently running in an environment where we have proactively configured multiple node pools with different shape configurations in order to make the Cluster Autoscaler more resilient against capacity issues.
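To make the described behavior concrete, an illustrative (not verbatim) view of the cloudprovider entry point, building on the earlier sketches; the nodePool type and the instanceIDFromNode helper are assumptions made for this example, not the provider's real wiring:

```go
import apiv1 "k8s.io/api/core/v1"

// Hypothetical NodeGroup implementation holding its manager and pool ID.
type nodePool struct {
	id      string
	manager nodePoolManager
}

// DeleteNodes is the cloudprovider.NodeGroup entry point. With this
// change, a placeholder node that never received a real instance ID no
// longer yields an error; its failed scale-up is cancelled via
// removeInstance (see the earlier sketch).
func (np *nodePool) DeleteNodes(nodes []*apiv1.Node) error {
	for _, node := range nodes {
		if err := removeInstance(np.manager, np.id, instanceIDFromNode(node)); err != nil {
			return err
		}
	}
	return nil
}

// instanceIDFromNode maps a Kubernetes node to a provider instance ID,
// falling back to the unfulfilled sentinel for placeholder nodes
// (stubbed here; the real provider derives this differently).
func instanceIDFromNode(node *apiv1.Node) string {
	if node.Spec.ProviderID == "" {
		return instanceIDUnfulfilled
	}
	return node.Spec.ProviderID
}
```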
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: