
OCPBUGS-50478: Increase waitForFallbackDegradedConditionTimeout #1789

Merged (1 commit into openshift:master, Feb 7, 2025)

Conversation

@wangke19 (Contributor) commented Jan 24, 2025

This PR fixes:

1. It takes longer (about 15 minutes) for the `StaticPodFallbackRevisionDegraded` condition to become true, so the timeout is increased.

The test case takes about 20 minutes to run:

$ go test -v -timeout 25m ./test/e2e-sno-disruptive; oc get co/kube-apiserver
=== RUN   TestFallback
...
Jan 24 15:20:50.867: Starting the fallback test
Jan 24 15:20:51.653: Setting UnsupportedConfigOverrides to map[]
Jan 24 15:20:52.808: Waiting 2m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
Jan 24 15:22:53.409: The cluster has been in good condition for 2m0s
Jan 24 15:22:53.716: Setting UnsupportedConfigOverrides to map[apiServerArguments:map[non-existing-flag:[true]]]
Jan 24 15:22:55.866: Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 18m0s
Jan 24 15:22:56.343: StaticPodFallbackRevisionDegraded condition hasn't been set yet
...
Jan 24 15:36:56.434: Checking if a NodeStatus has been updated to report the fallback condition
Jan 24 15:36:56.701: The fallback has been reported on node kewang-2418sn1-rmbfx-master-0, failed revision is 23
Jan 24 15:36:56.701: Verifying if a kube-apiserver pod has been annotated with revision: 23 on node: kewang-2418sn1-rmbfx-master-0
Jan 24 15:36:57.210: Setting UnsupportedConfigOverrides to map[]
Jan 24 15:36:58.266: Waiting 2m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
...
--- PASS: TestFallback (1128.05s)
2. After the case passed, the cluster operators were checked and the KAS operator was still not stable; after a few minutes it reached a good state. So the extra conditions StaticPodsAvailable, NodeInstallerProgressing, and NodeControllerDegraded are now checked to confirm the final KAS operator status.
$ oc get co/kube-apiserver
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.18.0-0.nightly-2025-01-23-202230   True        True          True       3h42m   StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-kewang-2418sn1-rmbfx-master-0 was rolled back to revision 15 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2025
@openshift-ci openshift-ci bot requested review from deads2k and sanchezl January 24, 2025 09:23
@wangke19 wangke19 changed the title [WIP]increase waitForFallbackDegradedConditionTimeout [WIP]OCPQE-28167: increase waitForFallbackDegradedConditionTimeout Jan 24, 2025
@openshift-ci-robot commented Jan 24, 2025

@wangke19: This pull request references OCPQE-28167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 24, 2025
@wangke19 wangke19 changed the title [WIP]OCPQE-28167: increase waitForFallbackDegradedConditionTimeout [WIP]OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout Jan 24, 2025
@wangke19 (Contributor Author):

/assign @p0lyn0mial

@wangke19 wangke19 changed the title [WIP]OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout Jan 24, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2025

@p0lyn0mial (Contributor) left a comment

Thanks for fixing the broken test. This is awesome!
I left a few minor comments. Overall it LGTM.

  setUnsupportedConfig(t, cs, getDefaultUnsupportedConfigForCurrentPlatform(t, cs))
- err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, clusterMustBeReadyFor)
+ err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, 5*clusterMustBeReadyFor)
Contributor:

Is this increase to 5*clusterMustBeReadyFor necessary?

Contributor Author:

Yes. After setUnsupportedConfig, the KAS operator is still not stable; it takes some time to recover.

Contributor:

OK, so maybe we should update waitForFallbackDegradedConditionTimeout to return the modified time for all cases?


- return wait.Poll(10*time.Second, waitPollTimeout, func() (bool, error) {
+ return wait.PollUntilContextTimeout(context.Background(), 20*time.Second, waitPollTimeout, true, func(cxt context.Context) (bool, error) {
Contributor:

Could you update the signature of the waitForClusterInGoodState function to accept a context instead?

waitForClusterInGoodState(ctx context.Context, t testing.TB, cs clientSet, waitPollTimeout, mustBeReadyFor time.Duration) error

Contributor Author:

Updated.

@wangke19 (Contributor Author) commented Feb 5, 2025

Test results from a local run:

$  go test -v -timeout 25m ./test/e2e-sno-disruptive
=== RUN   TestFallback
Found configuration for host https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443.
Feb  6 01:25:25.821: Starting the fallback test
Feb  6 01:25:26.983: Setting UnsupportedConfigOverrides to map[]
Feb  6 01:25:28.724: Waiting 1m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
Feb  6 01:26:29.271: The cluster has been in good condition for 1m0s
Feb  6 01:26:29.549: Setting UnsupportedConfigOverrides to map[apiServerArguments:map[non-existing-flag:[true]]]
Feb  6 01:26:30.601: Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 15m0s
Feb  6 01:26:30.883: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:26:51.260: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:27:11.125: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:27:31.120: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:27:50.958: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:28:10.926: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:28:30.895: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:28:50.863: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:29:10.933: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:29:30.856: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:29:50.870: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:30:10.941: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:30:31.110: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:30:50.877: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:31:10.948: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:31:30.905: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:31:50.987: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:32:10.955: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:32:30.924: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:32:50.892: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:33:10.963: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:33:30.931: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:33:50.899: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:34:10.867: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:34:31.862: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:34:51.102: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:35:11.183: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:35:31.255: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:35:51.223: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:36:11.191: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:36:31.262: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:36:51.230: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:37:11.096: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:37:31.166: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:37:51.237: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:38:11.205: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:38:31.099: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:38:51.171: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:39:11.213: Checking if a NodeStatus has been updated to report the fallback condition
Feb  6 01:39:11.455: The fallback has been reported on node kewang-0518sn3-5lcs4-master-0, failed revision is 13
Feb  6 01:39:11.456: Verifying if a kube-apiserver pod has been annotated with revision: 13 on node: kewang-0518sn3-5lcs4-master-0
Feb  6 01:39:12.031: Setting UnsupportedConfigOverrides to map[]
Feb  6 01:39:13.261: Waiting 5m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
Feb  6 01:39:13.570: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 15, currentRevision: 14, targetRevision: 15
Feb  6 01:39:33.844: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 15, currentRevision: 14, targetRevision: 15
Feb  6 01:39:53.914: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:40:13.808: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:40:33.849: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:40:54.291: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:41:13.890: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:44:13.913: The cluster has been in good condition for 5m0s
--- PASS: TestFallback (1128.09s)
PASS
ok  	github.com/openshift/cluster-kube-apiserver-operator/test/e2e-sno-disruptive	1128.123s

@wangke19 (Contributor Author) commented Feb 6, 2025

/test e2e-gcp-operator-single-node

@p0lyn0mial (Contributor) left a comment

A few more nits; overall LGTM, thanks!

  setUnsupportedConfig(t, cs, getDefaultUnsupportedConfigForCurrentPlatform(t, cs))
- err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, clusterMustBeReadyFor)
+ err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, 5*clusterMustBeReadyFor)
Contributor:

OK, so maybe we should update waitForFallbackDegradedConditionTimeout to return the modified time for all cases?

@wangke19 (Contributor Author) commented Feb 6, 2025

OK, so maybe we should update waitForFallbackDegradedConditionTimeout to return the modified time for all cases?

I updated clusterMustBeReadyFor to 5 minutes, which covers the kube-apiserver restart. waitForFallbackDegradedConditionTimeout is updated to 18 minutes; this value is based on multiple test runs. Previous runs of the case always failed, mainly because they did not wait long enough for the fallback Degraded condition.
Before starting a new test we only need to make sure the current state of the cluster is good; I think 1 minute is enough for that, rather than clusterMustBeReadyFor:
ensureClusterInGoodState(ctx, t, cs, clusterStateWaitPollTimeout, 1*time.Minute)

@wangke19 (Contributor Author) commented Feb 6, 2025

Test results from a local run after the update:

--- PASS: TestFallback (1231.08s)
PASS
ok  	github.com/openshift/cluster-kube-apiserver-operator/test/e2e-sno-disruptive	1231.099s

@wangke19 (Contributor Author) commented Feb 6, 2025

/retest

@wangke19 wangke19 force-pushed the sno-fallback-fix branch 3 times, most recently from 308bfef to 4fc7f87 (February 6, 2025 17:23)
@wangke19 (Contributor Author) commented Feb 7, 2025

/test e2e-gcp-operator-single-node

openshift-ci bot commented Feb 7, 2025

@wangke19: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-gcp-operator-single-node
Commit: 707728f
Required: false
Rerun command: /test e2e-gcp-operator-single-node


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@p0lyn0mial (Contributor):

This is awesome! The test had been broken for nearly two years, and thanks to your work, it is finally fixed.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2025
openshift-ci bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, wangke19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2025
@wangke19 (Contributor Author) commented Feb 7, 2025

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 7, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit c3e5c90 into openshift:master Feb 7, 2025
15 of 16 checks passed
@wangke19 wangke19 deleted the sno-fallback-fix branch February 7, 2025 08:44
@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.18

@openshift-cherrypick-robot

@wangke19: new pull request created: #1798

In response to this:

/cherry-pick release-4.18


@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.17

@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.16

@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.15

@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.14

@openshift-cherrypick-robot

@wangke19: new pull request created: #1799

In response to this:

/cherry-pick release-4.17


@openshift-cherrypick-robot

@wangke19: new pull request created: #1800

In response to this:

/cherry-pick release-4.16


@openshift-cherrypick-robot

@wangke19: new pull request created: #1801

In response to this:

/cherry-pick release-4.15


@openshift-cherrypick-robot

@wangke19: new pull request created: #1802

In response to this:

/cherry-pick release-4.14


@openshift-bot (Contributor):

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-kube-apiserver-operator
This PR has been included in build ose-cluster-kube-apiserver-operator-container-v4.19.0-202502071306.p0.gc3e5c90.assembly.stream.el9.
All builds following this will include this PR.

@wangke19 wangke19 changed the title OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout OCPBUGS-50478: Increase waitForFallbackDegradedConditionTimeout Feb 10, 2025
@openshift-ci-robot

@wangke19: Jira Issue OCPBUGS-50478: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-50478 has been moved to the MODIFIED state.


@wangke19 wangke19 restored the sno-fallback-fix branch February 14, 2025 04:23
@wangke19 wangke19 deleted the sno-fallback-fix branch February 14, 2025 04:25