
Conversation

@jupierce

Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.
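For illustration only, the failure mode described above can be sketched with a generic polling helper (a hypothetical stand-in, not the e2e framework's actual code): a condition that eventually becomes true still "fails" if the timeout budget is shorter than the controller's response time under load.

```python
import time

def wait_for(condition, timeout_s, interval_s=0.5):
    """Poll `condition` until it returns True or `timeout_s` seconds elapse.

    Under temporary controller load the condition may simply take longer to
    become true; a too-short `timeout_s` then reports a flake rather than a
    real failure, which is why extending the budget reduces flakes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The test itself is unchanged; only the deadline moves, so a genuinely broken condition still fails, just later.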

@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Oct 24, 2025
@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.


@jupierce jupierce changed the title UPSTREAM: <carry>: extend k8s suite timeouts for parallel testing load Extend k8s suite timeouts for parallel testing load Oct 25, 2025
@jupierce jupierce changed the title Extend k8s suite timeouts for parallel testing load NO-ISSSUE: Extend k8s suite timeouts for parallel testing load Oct 27, 2025
@jupierce jupierce changed the title NO-ISSSUE: Extend k8s suite timeouts for parallel testing load NO-ISSUE: Extend k8s suite timeouts for parallel testing load Oct 27, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 27, 2025
@openshift-ci-robot

@jupierce: This pull request explicitly references no jira issue.

In response to this:

Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jupierce
Author

/verified by e2e testing

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 27, 2025
@openshift-ci-robot

@jupierce: This PR has been marked as verified by e2e testing.

In response to this:

/verified by e2e testing


Analysis of flakes from the k8s suite has shown consistent examples
of otherwise well-behaved tests failing due to timeouts caused by
temporary load on controllers during parallel testing. Increasing
these timeouts will reduce flakes.
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Oct 27, 2025
@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

Member

@bertinatto bertinatto left a comment


Could we try to get this upstream instead? If we have data showing these timeouts are too short under high load, I think we can make a case for bumping these values.

Edit: and we would avoid carrying a commit.

@benluddy @jacobsee WDYT?

@benluddy

benluddy commented Nov 6, 2025

Could we try to get this upstream instead? If we have data showing these timeouts are too short under high load, I think we can make a case for bumping these values.

Edit: and we would avoid carrying a commit.

Are we sure that the controller load mentioned here is necessary to the tests? kubernetes#131518 comes to mind as an example where a group of E2E tests were failing not because they were invalid tests but because they were generating an unnecessary amount of reconciliation work for controllers, and the controllers could not catch up before the tests timed out.

@jupierce
Author

jupierce commented Nov 6, 2025

As to whether this can go upstream, I don't have a strong opinion. The k8s-suite tests are run in parallel with non-k8s-suite tests from origin. Our load is therefore unique and upstream may not face the same early timeout issues. We also test on a minimum supported CPU configuration. If upstream is testing with more CPU, they may not see it.

@benluddy the analysis uses statistical tools across the huge volume of data we collect in our CI runs. It outputs insights like "Test X fails 6x more often when run at the same time as test Y". The pattern I've seen for these timeout issues is that test X fails 3x-6x more often when run in parallel with any of a dozen other tests (since our testing is randomized, test X can be paired with different tests across CI runs). For the cases I've investigated at the code level, test X is just waiting (i.e. it is not misbehaving). So we can conclude that (a) the other dozen tests are misbehaving, (b) a given controller is not performant, or (c) we are asking too much of the system in the time permitted. (a) and (b) are unlikely, or extremely expensive to fix relative to the reward here, so I'm advocating for (c).

In cases where "Test X fails % more when run at the same time as test Y" but there are only one or two Y's, I'm treating those as test bugs to pursue.
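The pairwise analysis described above can be sketched roughly as follows (a minimal, hypothetical illustration; the real tooling and its data model are not shown in this thread). For each ordered pair (X, Y), it compares X's failure rate when Y is co-scheduled against X's overall failure rate:

```python
from collections import defaultdict

def flake_ratios(runs):
    """runs: iterable of (tests_scheduled_together, tests_that_failed) pairs,
    each a set of test names. Returns {(x, y): ratio} where ratio is X's
    failure rate when co-scheduled with Y divided by X's overall failure rate.
    A ratio well above 1.0 flags Y as correlated with X's failures."""
    fails_with = defaultdict(int)  # (x, y) -> failures of x with y present
    runs_with = defaultdict(int)   # (x, y) -> runs of x with y present
    fails = defaultdict(int)       # x -> total failures of x
    total = defaultdict(int)       # x -> total runs of x
    for tests, failed in runs:
        for x in tests:
            total[x] += 1
            if x in failed:
                fails[x] += 1
            for y in tests:
                if y != x:
                    runs_with[(x, y)] += 1
                    if x in failed:
                        fails_with[(x, y)] += 1
    ratios = {}
    for (x, y), n in runs_with.items():
        base = fails[x] / total[x]
        if base > 0:
            ratios[(x, y)] = (fails_with[(x, y)] / n) / base
    return ratios
```

In this framing, a test X whose ratio is elevated against a dozen different Y's points at shared contention (case (c) above), while elevation against only one or two Y's points at a specific test interaction bug.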

@jupierce
Author

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 14, 2025
@openshift-ci-robot

@jupierce: This PR has been marked as verified by CI.

In response to this:

/verified by CI


@benluddy

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 14, 2025
@openshift-ci

openshift-ci bot commented Nov 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, jupierce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2025
@benluddy

/remove-label backports/unvalidated-commits

@openshift-ci openshift-ci bot removed the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Nov 14, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 891f5bb and 2 for PR HEAD 59a2d7e in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 01f3cb1 and 1 for PR HEAD 59a2d7e in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 737c81e and 0 for PR HEAD 59a2d7e in total

@openshift-ci

openshift-ci bot commented Nov 20, 2025

@jupierce: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                Commit   Required  Rerun command
ci/prow/e2e-aws-ovn-serial               59a2d7e  true      /test e2e-aws-ovn-serial
ci/prow/e2e-aws-ovn-techpreview-serial   59a2d7e  false     /test e2e-aws-ovn-techpreview-serial
ci/prow/e2e-aws-ovn-cgroupsv2            59a2d7e  unknown   /test e2e-aws-ovn-cgroupsv2


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/hold

Revision 59a2d7e was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 20, 2025
