
OCPBUGS-50478: Increase waitForFallbackDegradedConditionTimeout #1789

Merged (1 commit into openshift:master, Feb 7, 2025)

Conversation

@wangke19 (Contributor) commented Jan 24, 2025

This PR fixes:

1. It takes longer (about 15 minutes) for the `StaticPodFallbackRevisionDegraded` condition to become true, so the timeout is increased.

The test case takes about 20 minutes to run:

$ go test -v -timeout 25m ./test/e2e-sno-disruptive; oc get co/kube-apiserver
=== RUN   TestFallback
...
Jan 24 15:20:50.867: Starting the fallback test
Jan 24 15:20:51.653: Setting UnsupportedConfigOverrides to map[]
Jan 24 15:20:52.808: Waiting 2m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
Jan 24 15:22:53.409: The cluster has been in good condition for 2m0s
Jan 24 15:22:53.716: Setting UnsupportedConfigOverrides to map[apiServerArguments:map[non-existing-flag:[true]]]
Jan 24 15:22:55.866: Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 18m0s
Jan 24 15:22:56.343: StaticPodFallbackRevisionDegraded condition hasn't been set yet
...
Jan 24 15:36:56.434: Checking if a NodeStatus has been updated to report the fallback condition
Jan 24 15:36:56.701: The fallback has been reported on node kewang-2418sn1-rmbfx-master-0, failed revision is 23
Jan 24 15:36:56.701: Verifying if a kube-apiserver pod has been annotated with revision: 23 on node: kewang-2418sn1-rmbfx-master-0
Jan 24 15:36:57.210: Setting UnsupportedConfigOverrides to map[]
Jan 24 15:36:58.266: Waiting 2m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
...
--- PASS: TestFallback (1128.05s)
2. After the case passed, the cluster operators were checked and the KAS operator was still not stable; after a few minutes it reached a good state. So the extra conditions StaticPodsAvailable, NodeInstallerProgressing, and NodeControllerDegraded are now checked to confirm the final KAS operator status.
$ oc get co/kube-apiserver
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.18.0-0.nightly-2025-01-23-202230   True        True          True       3h42m   StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-kewang-2418sn1-rmbfx-master-0 was rolled back to revision 15 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2025
@openshift-ci openshift-ci bot requested review from deads2k and sanchezl January 24, 2025 09:23
@wangke19 wangke19 changed the title [WIP]increase waitForFallbackDegradedConditionTimeout [WIP]OCPQE-28167: increase waitForFallbackDegradedConditionTimeout Jan 24, 2025
@openshift-ci-robot commented Jan 24, 2025

@wangke19: This pull request references OCPQE-28167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 24, 2025
@wangke19 wangke19 changed the title [WIP]OCPQE-28167: increase waitForFallbackDegradedConditionTimeout [WIP]OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout Jan 24, 2025
@wangke19 (Contributor Author):

/assign @p0lyn0mial

@wangke19 wangke19 changed the title [WIP]OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout Jan 24, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2025

@p0lyn0mial (Contributor) left a comment

Thanks for fixing the broken test. This is awesome!
I left a few minor comments. Overall it LGTM.

  setUnsupportedConfig(t, cs, getDefaultUnsupportedConfigForCurrentPlatform(t, cs))
- err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, clusterMustBeReadyFor)
+ err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, 5*clusterMustBeReadyFor)
Contributor:

Is this increase to 5*clusterMustBeReadyFor necessary?

Contributor Author:

Yes. After setUnsupportedConfig, the KAS operator is still not stable; it takes some time to recover.

Contributor:

OK, so maybe we should update waitForFallbackDegradedConditionTimeout to return the modified time for all cases?


- return wait.Poll(10*time.Second, waitPollTimeout, func() (bool, error) {
+ return wait.PollUntilContextTimeout(context.Background(), 20*time.Second, waitPollTimeout, true, func(cxt context.Context) (bool, error) {
Contributor:

Could you update the signature of the waitForClusterInGoodState function to accept a context instead?

waitForClusterInGoodState(ctx context.Context, t testing.TB, cs clientSet, waitPollTimeout, mustBeReadyFor time.Duration) error

Contributor Author:

Updated.

@wangke19 (Contributor Author) commented Feb 5, 2025

Test results from a local run:

$  go test -v -timeout 25m ./test/e2e-sno-disruptive
=== RUN   TestFallback
Found configuration for host https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443.
Feb  6 01:25:25.821: Starting the fallback test
Feb  6 01:25:26.983: Setting UnsupportedConfigOverrides to map[]
Feb  6 01:25:28.724: Waiting 1m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
Feb  6 01:26:29.271: The cluster has been in good condition for 1m0s
Feb  6 01:26:29.549: Setting UnsupportedConfigOverrides to map[apiServerArguments:map[non-existing-flag:[true]]]
Feb  6 01:26:30.601: Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 15m0s
Feb  6 01:26:30.883: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:26:51.260: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:27:11.125: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:27:31.120: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:27:50.958: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:28:10.926: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:28:30.895: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:28:50.863: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:29:10.933: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:29:30.856: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:29:50.870: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:30:10.941: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:30:31.110: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:30:50.877: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:31:10.948: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:31:30.905: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:31:50.987: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:32:10.955: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:32:30.924: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:32:50.892: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:33:10.963: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:33:30.931: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:33:50.899: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:34:10.867: unable to get kube-apiserver-operator resource: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:34:31.862: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:34:51.102: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:35:11.183: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:35:31.255: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:35:51.223: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:36:11.191: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:36:31.262: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:36:51.230: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:37:11.096: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:37:31.166: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:37:51.237: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:38:11.205: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:38:31.099: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:38:51.171: StaticPodFallbackRevisionDegraded condition hasn't been set yet
Feb  6 01:39:11.213: Checking if a NodeStatus has been updated to report the fallback condition
Feb  6 01:39:11.455: The fallback has been reported on node kewang-0518sn3-5lcs4-master-0, failed revision is 13
Feb  6 01:39:11.456: Verifying if a kube-apiserver pod has been annotated with revision: 13 on node: kewang-0518sn3-5lcs4-master-0
Feb  6 01:39:12.031: Setting UnsupportedConfigOverrides to map[]
Feb  6 01:39:13.261: Waiting 5m0s for the cluster to be in a good condition, interval = 20s, timeout 10m0s
Feb  6 01:39:13.570: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 15, currentRevision: 14, targetRevision: 15
Feb  6 01:39:33.844: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 15, currentRevision: 14, targetRevision: 15
Feb  6 01:39:53.914: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:40:13.808: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:40:33.849: Get "https://api.kewang-0518sn3.xxx.xxx.xxx.openshift.com:6443/apis/operator.openshift.io/v1/kubeapiservers/cluster": dial tcp xxx.xxx.xxx.70:6443: connect: connection refused
Feb  6 01:40:54.291: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:41:13.890: Node kewang-0518sn3-5lcs4-master-0 is progressing, latestAvailableRevision: 16, currentRevision: 14, targetRevision: 16
Feb  6 01:44:13.913: The cluster has been in good condition for 5m0s
--- PASS: TestFallback (1128.09s)
PASS
ok  	github.com/openshift/cluster-kube-apiserver-operator/test/e2e-sno-disruptive	1128.123s

@wangke19 (Contributor Author) commented Feb 6, 2025

/test e2e-gcp-operator-single-node

@p0lyn0mial (Contributor) left a comment

A few more nits; overall LGTM, thanks!

  setUnsupportedConfig(t, cs, getDefaultUnsupportedConfigForCurrentPlatform(t, cs))
- err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, clusterMustBeReadyFor)
+ err := waitForClusterInGoodState(t, cs, clusterStateWaitPollTimeout, 5*clusterMustBeReadyFor)
Contributor:

OK, so maybe we should update waitForFallbackDegradedConditionTimeout to return the modified time for all cases?

@wangke19 (Contributor Author) commented Feb 6, 2025

OK, so maybe we should update waitForFallbackDegradedConditionTimeout to return the modified time for all cases?

I updated clusterMustBeReadyFor to 5 minutes, which covers the kube-apiserver restart. waitForFallbackDegradedConditionTimeout is updated to 18 minutes; this value is based on multiple test runs. Previous runs of the case always failed, mainly because they did not wait long enough for the fallback Degraded condition.
Before starting a new test we only need to make sure the current state of the cluster is good; I think 1 minute is enough for that, rather than clusterMustBeReadyFor:
ensureClusterInGoodState(ctx, t, cs, clusterStateWaitPollTimeout, 1*time.Minute)

@wangke19 (Contributor Author) commented Feb 6, 2025

Test results from a local run after the update:

--- PASS: TestFallback (1231.08s)
PASS
ok  	github.com/openshift/cluster-kube-apiserver-operator/test/e2e-sno-disruptive	1231.099s

@wangke19 (Contributor Author) commented Feb 6, 2025

/retest

@wangke19 wangke19 force-pushed the sno-fallback-fix branch 3 times, most recently from 308bfef to 4fc7f87 (February 6, 2025 17:23)
@wangke19 (Contributor Author) commented Feb 7, 2025

/test e2e-gcp-operator-single-node

openshift-ci bot commented Feb 7, 2025

@wangke19: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-gcp-operator-single-node
Commit: 707728f
Required: false
Rerun command: /test e2e-gcp-operator-single-node


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@p0lyn0mial (Contributor):

This is awesome! The test had been broken for nearly two years, and thanks to your work, it is finally fixed.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2025
openshift-ci bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, wangke19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2025
@wangke19 (Contributor Author) commented Feb 7, 2025

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 7, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit c3e5c90 into openshift:master Feb 7, 2025
15 of 16 checks passed
@wangke19 wangke19 deleted the sno-fallback-fix branch February 7, 2025 08:44
@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.18

@openshift-cherrypick-robot

@wangke19: new pull request created: #1798

In response to this:

/cherry-pick release-4.18


@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.17

@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.16

@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.15

@wangke19 (Contributor Author) commented Feb 7, 2025

/cherry-pick release-4.14

@openshift-cherrypick-robot

@wangke19: new pull request created: #1799

In response to this:

/cherry-pick release-4.17


@openshift-cherrypick-robot

@wangke19: new pull request created: #1800

In response to this:

/cherry-pick release-4.16


@openshift-cherrypick-robot

@wangke19: new pull request created: #1801

In response to this:

/cherry-pick release-4.15


@openshift-cherrypick-robot

@wangke19: new pull request created: #1802

In response to this:

/cherry-pick release-4.14


@openshift-bot (Contributor):

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-kube-apiserver-operator
This PR has been included in build ose-cluster-kube-apiserver-operator-container-v4.19.0-202502071306.p0.gc3e5c90.assembly.stream.el9.
All builds following this will include this PR.

@wangke19 wangke19 changed the title OCPQE-28167: Increase waitForFallbackDegradedConditionTimeout OCPBUGS-50478: Increase waitForFallbackDegradedConditionTimeout Feb 10, 2025
@openshift-ci-robot

@wangke19: Jira Issue OCPBUGS-50478: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-50478 has been moved to the MODIFIED state.


@wangke19 wangke19 restored the sno-fallback-fix branch February 14, 2025 04:23
@wangke19 wangke19 deleted the sno-fallback-fix branch February 14, 2025 04:25