OCI provider: Remove unprovisioned nodes in error state as requested instead of returning an error #8806
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlamillan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@trungng92 let me know what you think.

@vbhargav875 you as well.
```go
if *node.Id == instanceID {
	return nodePool, nil
} else if ocicommon.InstanceIDUnfulfilled == instanceID && *node.Id == "" {
	return nodePool, nil
}
```
Is it possible that this line is going to cause us to return the wrong node pool? For instance, if a user enabled balance-similar-node-groups and 3 node pools scaled up all at once with "unfulfilled instances", won't getByInstance return the first node pool every time for those unfulfilled instances?
I pushed a commit to only match an unfulfilled instance to a node pool if that pool has a node in CREATING status with an associated error.
In short, if a node in a pool can't be created due to an error, we treat that pool as a match for an unfulfilled instance.
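A minimal sketch of that matching heuristic, using simplified hypothetical types in place of the vendored OCI SDK structs (the real provider uses oke.NodePool / oke.Node and ocicommon.InstanceIDUnfulfilled; names and fields here are illustrative):

```go
package oci

import "fmt"

// Hypothetical, simplified stand-ins for the OCI SDK types.
type Node struct {
	ID             string
	LifecycleState string  // e.g. "CREATING", "ACTIVE"
	NodeError      *string // populated when provisioning failed
}

type NodePool struct {
	ID    string
	Nodes []Node
}

// Stand-in for the ocicommon.InstanceIDUnfulfilled sentinel.
const instanceIDUnfulfilled = "instance_placeholder"

// getByInstance resolves an instance ID to its owning node pool. An
// unfulfilled placeholder ID is matched only to a pool that has a node
// stuck in CREATING with an associated error, so that several pools
// scaling up at once do not all resolve to the first pool in the list.
func getByInstance(pools []*NodePool, instanceID string) (*NodePool, error) {
	for _, pool := range pools {
		for _, node := range pool.Nodes {
			if node.ID == instanceID {
				return pool, nil
			}
			if instanceID == instanceIDUnfulfilled &&
				node.LifecycleState == "CREATING" && node.NodeError != nil {
				return pool, nil
			}
		}
	}
	return nil, fmt.Errorf("no node pool found for instance %q", instanceID)
}
```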
```go
if err != nil {
	return err
}
return c.setSize(nodePoolID, size-1)
```
With this code change, would we ever have a situation where the above if-condition (instanceID == "") is true?
If yes, should we decrement the size of the node pool there as well?
This proposed code change is handling a specific scenario where the Cluster Autoscaler has requested that we remove an unregistered node whose underlying compute instance ID cannot be established. Outside of that scenario, I am not sure why a compute ID could not be determined, or whether decrementing the size of the node pool is the right thing to do.
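For illustration, a rough sketch of the delete path under discussion, reusing the instanceIDUnfulfilled stand-in from the earlier sketch; getSize, setSize, and terminateInstance are hypothetical helpers, not the provider's actual API:

```go
// nodePoolManager is a minimal interface invented for this sketch.
type nodePoolManager interface {
	getSize(nodePoolID string) (int, error)
	setSize(nodePoolID string, size int) error
	terminateInstance(instanceID string) error
}

// removeInstance handles a DeleteNodes request for a single instance ID.
// For the unfulfilled placeholder there is no compute instance to
// terminate, so the failed scale-up is cancelled by decrementing the
// pool's target size instead of returning an error.
func removeInstance(m nodePoolManager, nodePoolID, instanceID string) error {
	size, err := m.getSize(nodePoolID)
	if err != nil {
		return err
	}
	if instanceID == instanceIDUnfulfilled {
		// No instance was ever provisioned (e.g. capacity or quota
		// errors): shrink the target size to cancel the scale-up.
		return m.setSize(nodePoolID, size-1)
	}
	// Normal path: terminate the real instance, then shrink the pool.
	if err := m.terminateInstance(instanceID); err != nil {
		return err
	}
	return m.setSize(nodePoolID, size-1)
}
```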
/assign @trungng92 @vbhargav875 @gvnc
@jlamillan ping me when you have provider consensus and I can move this into master

@jackfrancis: GitHub didn't allow me to assign the following users: trungng92, vbhargav875, gvnc. Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR updates the behavior of OCI's nodepool implementation of DeleteNodes() when it is called with an unfulfilled placeholder node that is in an error state, typically due to capacity or quota issues. Rather than returning an error because the placeholder node has an empty instance ID, the DeleteNodes() request effectively cancels the failed scale-up, allowing the Cluster Autoscaler to recover more quickly and try scaling up a different node pool that meets the scheduling requirements.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
These changes are currently running in an environment where we have proactively configured multiple node pools with different shape configurations in order to make the Cluster Autoscaler more resilient against capacity issues.
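To make the described behavior concrete, an illustrative (not verbatim) view of the cloudprovider entry point, building on the earlier sketches; the nodePool type and the instanceIDFromNode helper are assumptions made for this example, not the provider's real wiring:

```go
import apiv1 "k8s.io/api/core/v1"

// Hypothetical NodeGroup implementation holding its manager and pool ID.
type nodePool struct {
	id      string
	manager nodePoolManager
}

// DeleteNodes is the cloudprovider.NodeGroup entry point. With this
// change, a placeholder node that never received a real instance ID no
// longer yields an error; its failed scale-up is cancelled via
// removeInstance (see the earlier sketch).
func (np *nodePool) DeleteNodes(nodes []*apiv1.Node) error {
	for _, node := range nodes {
		if err := removeInstance(np.manager, np.id, instanceIDFromNode(node)); err != nil {
			return err
		}
	}
	return nil
}

// instanceIDFromNode maps a Kubernetes node to a provider instance ID,
// falling back to the unfulfilled sentinel for placeholder nodes
// (stubbed here; the real provider derives this differently).
func instanceIDFromNode(node *apiv1.Node) string {
	if node.Spec.ProviderID == "" {
		return instanceIDUnfulfilled
	}
	return node.Spec.ProviderID
}
```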
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: