
RFC for Degraded NodePool Status Condition #1910

Open · wants to merge 7 commits into main from degraded-nodepool-rfc

Conversation

jigisha620 (Contributor)

Description

Adding RFC for Degraded NodePool Status Condition.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign bwagner5 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot (Contributor)

Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 10, 2025
@jigisha620 force-pushed the degraded-nodepool-rfc branch from 79262b2 to 1bc7741 on January 10, 2025 at 23:24
@coveralls commented Jan 10, 2025

Pull Request Test Coverage Report for Build 12718791708

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.02%) to 81.184%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/disruption/drift.go | 2 | 89.66% |
| pkg/scheduling/requirements.go | 2 | 98.01% |

Totals Coverage Status
  • Change from base Build 12718181288: -0.02%
  • Covered Lines: 9082
  • Relevant Lines: 11187

💛 - Coveralls

@jmdeal (Member) left a comment


Checkpointing


#### Considerations

1. 👎 Heuristics can be wrong and mask failures
Member

Could you elaborate on what type of failures are being masked? As for it being wrong, I'm wondering if we should only ever set `Degraded` to Unknown or True. Maybe we don't ever transition it to False?


One example is that when a network path does not exist due to a misconfigured VPC (network access control lists, subnets, route tables), Karpenter will not be able to provision compute with that NodeClass that joins the cluster until the error is fixed. Crucially, this will continue to charge users for compute that can never be used in a cluster.

To improve visibility of these failure modes, this RFC proposes adding a `Degraded` status condition on the NodePool that indicates to cluster users that there may be a problem with a NodePool/NodeClass combination that needs to be investigated and corrected.
Member

I think, as Jason has called out online and offline, we should make our motivation front and center here. Why do we think that something like this needs to exist? Does it make tracking failures down to NodePools easier? Does alarming get easier with this kind of setup?

Evaluation conditions -

1. We start with an empty buffer with `Degraded: Unknown`.
2. There must be at least 2 failures in the buffer for `Degraded` to transition to `True`; with a buffer size of 10, this corresponds to an 80% success threshold.
Member

One thing that I still feel would be better here is if we considered flipping the polarity of this condition type -- `Degraded: False` meaning that it's healthy feels a bit weird to me, but I get that we'd have to come up with some other word besides "Degraded" that isn't "Ready" and probably isn't "Healthy" to really reflect what this condition is evaluating.

Unsuccessful Launch: -1

[] = 'Degraded: Unknown'
[-1] = 'Degraded: Unknown'
Member

nit: It's slightly confusing to call this "Degraded: Unknown". The only reason that I say that is because this doesn't necessarily mean that we transition the condition to Unknown when the condition is already set -- I know this is said above, but I did find it a tad semantically odd as I was reading through this and trying to parse out the design.
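
To make the evaluation rules in the excerpt concrete, here is a minimal Go sketch of the buffer-to-condition mapping (the function name and exact transition rules are hypothetical; as the comment above notes, the behavior when the condition is already set is still open):

```go
// Hypothetical evaluation of the in-memory launch-result buffer.
// +1 = successful launch/registration, -1 = unsuccessful launch.
func evaluateBuffer(buffer []int) string {
	failures := 0
	for _, r := range buffer {
		if r < 0 {
			failures++
		}
	}
	switch {
	case failures >= 2:
		// At least 2 failures in a 10-entry buffer, i.e. below the ~80% success threshold.
		return "Degraded=True"
	case len(buffer) == 0 || failures == len(buffer):
		// Empty buffer, or too few results to decide (e.g. [-1] in the example above).
		return "Degraded=Unknown"
	default:
		return "Degraded=False"
	}
}
```

Under these assumptions, `[]` and `[-1]` evaluate to Unknown, `[-1, -1]` to True, and `[-1, +1, +1, +1, +1, +1, +1, +1, +1, +1]` to False, matching the examples quoted in the excerpt.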

@jigisha620 force-pushed the degraded-nodepool-rfc branch from 1bc7741 to 3ffdbcc on January 14, 2025 at 23:41
@jigisha620 changed the title from "WIP: RFC for Degraded NodePool Status Condition" to "RFC for Degraded NodePool Status Condition" on Jan 16, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2025
Last Transition Time: 2025-01-13T18:57:20Z
Message:
Observed Generation: 1
Reason: Degraded
@saurav-agarwalla (Contributor) commented Jan 16, 2025

One thing that I had discussed with Reed is making Reason a more structured object and putting a serialized string output of that here since I understand that this has to be a string. That way, we can expose more details including error codes mentioning the reason behind the degradation, expose resource IDs/dependents causing it as well as have more than one reason for the degradation. Making it a structured object will also allow us to parse it better for metrics.

Member

Is reason the right field to surface that level of detail? I agree with the direction, but it seems like message would be more appropriate.
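
As an illustration of the structured-detail idea being discussed (the field names here are hypothetical and not part of the RFC), a serialized object could be carried in `Message` while `Reason` remains a short machine-readable token:

```go
package sketch

import "encoding/json"

// Hypothetical structure serialized into the condition's Message field;
// none of these field names are specified by the RFC.
type degradationDetail struct {
	Code      string   `json:"code"`      // e.g. an error code for the launch/registration failure
	Resources []string `json:"resources"` // IDs of resources implicated in the failure
	Count     int      `json:"count"`     // how many recent launches hit this failure
}

// degradedMessage renders the details as a JSON string suitable for a
// string-typed Message field, so it stays parseable for metrics/tooling.
func degradedMessage(details []degradationDetail) (string, error) {
	b, err := json.Marshal(details)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```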


This RFC proposes enhancing the visibility of these failure modes by introducing a `Degraded` status condition on the NodePool. We can then create new metric/metric-labels around this status condition which will improve the observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `Degraded` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `Degraded` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
Member

For instance, if a NodePool is restricted to a specific zone using the topology.kubernetes.io/zone label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a Degraded status.

Can we enumerate different semantics for failures that we'd want to capture as different .Reasons that should trigger degraded == true, e.g. badSecurityGroup

Member

Major +1 to this -- I think what we need to explore here is how we are going to capture these failure modes -- if we are just relying on the registration timeout being hit, it's going to be tough to know what the reason was that the Node failed to join


### Option 1: In-memory Buffer to store history - Recommended

This option will have an in-memory FIFO buffer, which will grow to a max size of 10 (this can be changed later). This buffer will store data about the success or failure during launch/registration and is evaluated by a controller to determine the relative health of the NodePool. This will be an int buffer and a positive means `Degraded: False`, negative means `Degraded: True` and 0 means `Degraded: Unknown`.
Member

If I understand correctly, the final sentence states that the buffer can have three values: -1 (degraded true), 0 (unknown), 1 (degraded false). I don't think this matches the example, which only has two values, and the values map to launch success / failure, not the actual degraded state right? I think it might be more clear like this:

Suggested change
This option will have an in-memory FIFO buffer, which will grow to a max size of 10 (this can be changed later). This buffer will store data about the success or failure during launch/registration and is evaluated by a controller to determine the relative health of the NodePool. This will be an int buffer and a positive means `Degraded: False`, negative means `Degraded: True` and 0 means `Degraded: Unknown`.
This option will have an in-memory FIFO buffer, which will grow to a max size of 10 (this can be changed later). This buffer will store data about the success or failure during launch/registration and is evaluated by a controller to determine the relative health of the NodePool. This would be implemented as a `[]bool`, where `true` indicates a launch success, and `false` represents a failure. The state of the degraded condition would be based on the number of `false` entries in the buffer.

Contributor Author

Agreed, that's a miss on my end. I can update this to reflect two states instead of positive, negative, neutral.
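
A minimal Go sketch of the `[]bool` variant from the suggestion above (type and method names are illustrative only):

```go
// FIFO buffer of recent launch/registration results for one NodePool.
// true = success, false = failure; capped at maxSize (10 in this proposal).
type healthBuffer struct {
	maxSize int
	results []bool
}

// record appends a result and drops the oldest entry once the cap is reached.
func (b *healthBuffer) record(success bool) {
	b.results = append(b.results, success)
	if len(b.results) > b.maxSize {
		b.results = b.results[1:]
	}
}

// failures counts the false entries, which drive the Degraded evaluation.
func (b *healthBuffer) failures() int {
	n := 0
	for _, ok := range b.results {
		if !ok {
			n++
		}
	}
	return n
}
```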



## Motivation

Karpenter may initiate the creation of nodes based on a NodePool configuration, but these nodes might fail to join the cluster due to unforeseen registration issues that Karpenter cannot anticipate or prevent. An example illustrating this issue is when network connectivity is impeded by an incorrect cluster security group configuration, such as a missing outbound rule that allows access to any IPv4 address. In such cases, Karpenter will continue its attempts to provision compute resources, but these resources will fail to join the cluster until the security group's outbound rule is updated. The critical concern here is that users will incur charges for these compute resources despite their inability to be utilized within the cluster.
Member

To be clear: I don't think that we're really solving for the problem of cost here -- we're still going to be launching instances and retrying

[-1, +1, +1, +1, +1, +1, +1, +1, +1, +1] = 'Degraded: False'

#### Considerations
Member

We discussed this but what happens if we have one success and that causes us to stop trying new NodeClaims -- when that happens, what's the way that we make sure that we eventually get out of a Degraded state

Contributor Author

I haven't updated the RFC since our discussion about that. But @rschalo and I were thinking about expiring the entries in the buffer after some time - maybe 3x the registration ttl.

Contributor

I think one of the options we had discussed was expiring entries some amount of time after the last write so that there is some recency bias.

Contributor Author

Expiring the entries like this should also take into account when the last update was made to the buffer: if updates are frequent enough (we can define the time window), then we don't expire entries until the buffer is full.
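
A rough Go sketch of the expiry ideas floated in this thread, assuming a hypothetical registration TTL and quiet period (neither value is settled in the RFC):

```go
package sketch

import "time"

type result struct {
	success bool
	addedAt time.Time
}

type expiringBuffer struct {
	results   []result
	lastWrite time.Time
	maxSize   int
}

// prune expires entries older than 3x the registration TTL, but skips pruning
// while the buffer is being written to frequently and is not yet full
// (the recency bias discussed above).
func (b *expiringBuffer) prune(now time.Time, registrationTTL, quietPeriod time.Duration) {
	if now.Sub(b.lastWrite) < quietPeriod && len(b.results) < b.maxSize {
		return
	}
	cutoff := now.Add(-3 * registrationTTL)
	kept := b.results[:0]
	for _, r := range b.results {
		if r.addedAt.After(cutoff) {
			kept = append(kept, r)
		}
	}
	b.results = kept
}
```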

1. 👎 Three retries can still be a long time to wait on compute that never provisions correctly.
2. 👎 Setting `Degraded: False` on an update to a NodePool implies Karpenter can vet with certainty that the NodePool is correctly configured, which is misleading.

### How Does this Affect Metrics and Improve Observability?
Member

I'm still a bit fuzzy on the reason that we need this extra label
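
For context on what a status-condition metric label could look like, a hedged Go sketch follows; the metric and label names are hypothetical and may not match Karpenter's actual metrics:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge keyed by NodePool, condition type, and status; a cluster
// administrator could then alert on status="True" for type="Degraded".
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nodepool_status_condition",
		Help: "Status conditions reported per NodePool.",
	},
	[]string{"nodepool", "type", "status"},
)

// setDegraded records the current Degraded status for a NodePool.
func setDegraded(nodePool, status string) {
	nodePoolCondition.WithLabelValues(nodePool, "Degraded", status).Set(1)
}
```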

2. True - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers.
3. False - There has been successful node registration using this NodePool.

The state transition is not unidirectional, meaning it can go from True to False and back to True or Unknown. A NodePool marked as Degraded can still be used for provisioning workloads, as this status isn't a precondition for readiness. However, when multiple NodePools have the same weight, a degraded NodePool will receive lower priority during the provisioning process compared to non-degraded ones.
Contributor

If the other nodepools are healthy, when will we ever retry the degraded nodepool? Also, why not do this for nodepools with different weights?
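
A small Go sketch of the equal-weight deprioritization described in the quoted paragraph (the types and function here are hypothetical):

```go
package sketch

import "sort"

type candidateNodePool struct {
	Name     string
	Weight   int32
	Degraded bool
}

// orderForProvisioning sorts higher-weight NodePools first; among NodePools
// with equal weight, non-degraded ones are preferred over degraded ones.
func orderForProvisioning(pools []candidateNodePool) {
	sort.SliceStable(pools, func(i, j int) bool {
		if pools[i].Weight != pools[j].Weight {
			return pools[i].Weight > pools[j].Weight
		}
		return !pools[i].Degraded && pools[j].Degraded
	})
}
```

Because the degraded NodePool stays in the candidate list rather than being filtered out, it would still be retried whenever the higher-priority NodePools cannot satisfy a provisioning request, which is the concern raised in the comment above.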
