Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: system test parallelization: two-pass approach #23275

Draft
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

edsantiago
Copy link
Member

Split system tests into two: those that can be run in
parallel, and those that can't. Run tests in two passes.
This requires eliminating the per-test leak check and
teardown. I think that's okay.

Tests that can run in parallel:

  • use unique container/pod/volume/network names
    • bonus: added a way to track names to their test,
      so the leak test at end can be useful
  • do not run 'podman rm -a' or 'rmi -a'
  • do not run 'podman ps/images' and expect precise output

Signed-off-by: Ed Santiago [email protected]

None

@openshift-ci openshift-ci bot added release-note-none do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jul 15, 2024
Copy link
Contributor

openshift-ci bot commented Jul 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 15, 2024
@edsantiago edsantiago marked this pull request as draft July 15, 2024 12:16
@edsantiago
Copy link
Member Author

THIS IS NOT EVEN CLOSE TO DREAMING ABOUT MERGING!

@Luap99 I think this approach holds promise. I would like to spend some time pursuing it. Before I do so, WDYT?

@Luap99
Copy link
Member

Luap99 commented Jul 15, 2024

@Luap99 I think this approach holds promise. I would like to spend some time pursuing it. Before I do so, WDYT?

Just some quick thoughts, will add more once I am back.

Syntax wise this seems to be better as there were many files I could not run parallel because one or two tests. Assuming this now runs things parallel across files as well it should utilize the cpu better.

However I do not see how this addresses the functional issues from my PR.
How are we going to debug flakes? There is nothing in the logs for timings etc... It is practically impossible to correlate problematic test interactions (this can be done well in e2e tests as we have a full log with timings)
Reviewing tests for possible conflicts will be hard and we will fail from time to time causing extra flakes.

I think there are nice gains here but honestly I am no longer sure that the ongoing maintenance will not cause to much work on all maintainers.

@edsantiago edsantiago force-pushed the bats-parallel branch 3 times, most recently from f456c95 to e79fca4 Compare July 16, 2024 19:03
@edsantiago
Copy link
Member Author

Okay..... I'm really favorably impressed with this approach. The two-pass requirement sucks, and debugging failures is really hard, but I think the benefits (so far) are outweighing those negatives. Running lots and lots of different tests in parallel, not just from one file, is finding a lot of bugs.

CI is likely to fail because of #23282. This is still very much a WIP. My plan is to break out much of the safename work, commit that separately in individual reviewable PRs, in order to minimize the changes in this one.

Copy link

Cockpit tests failed for commit e79fca4. @martinpitt, @jelly, @mvollmer please check.

@Luap99
Copy link
Member

Luap99 commented Jul 17, 2024

@edsantiago Ok let's do this then. I will try to fix all the related podman bugs which you reported in the next days.

@Luap99
Copy link
Member

Luap99 commented Jul 18, 2024

re tag name:
I would prefer parallel over para as this makes it more clear to readers. And I don't see a problem if the tag name is a bit longer.

@edsantiago
Copy link
Member Author

Full name: my concern is typos. I know that we'll get occasional "parralel" or "parrallel" misspellings and those are hard to catch in review. I've been letting my brain think about this in the background and still haven't come up with any ideas.

The other consideration is a string that's easily greppable in source code and command-line history. ^Rpara (for rerunning tests) is pretty useless. Maybe ci:parallel and just try really hard to catch typos in review?

@Luap99
Copy link
Member

Luap99 commented Jul 18, 2024

Maybe ci:parallel and just try really hard to catch typos in review?

ci:parallel SGTM. Another reason for something like codespell to be part of the actual CI checks.
I am not too concerned about typos, it is not like they would break anything. Also most people likely copy the thing from another test and would not really think about it to much anyway I think.

@Luap99
Copy link
Member

Luap99 commented Jul 18, 2024

In general I would good to get some docs in test/system/README.md that descripe how this parallel mode works and what test can/cannot run in parallel (--all,--latest, output checks like podman ps empty output, etc... )

@Luap99
Copy link
Member

Luap99 commented Jul 18, 2024

Also another flake I saw locally.

   [14:18:02.492492892] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman __completeNoDesc  system connection remove arg
   [14:18:02.522931040] m_t114-lgdlrt8i
   m_t114-lgdlrt8i-root
   :4
   Completion ended with directive: ShellCompDirectiveNoFileComp
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Unexpected non-Debug output line: m_t114-lgdlrt8i
   #| expected: \[Debug\]
   #|   actual: m_t114-

I know what is wrong with that and will do another PR to fix that.

@Luap99
Copy link
Member

Luap99 commented Jul 18, 2024

Also another flake I saw locally.

   [14:18:02.492492892] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman __completeNoDesc  system connection remove arg
   [14:18:02.522931040] m_t114-lgdlrt8i
   m_t114-lgdlrt8i-root
   :4
   Completion ended with directive: ShellCompDirectiveNoFileComp
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Unexpected non-Debug output line: m_t114-lgdlrt8i
   #| expected: \[Debug\]
   #|   actual: m_t114-

I know what is wrong with that and will do another PR to fix that.

Fix in #23326

edsantiago and others added 20 commits September 19, 2024 08:45
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
...not in individual setup_file()s.

Reason: setup_file doesn't have a real safename, because it
doesn't seem to get the bats test number, so we had a
conflict:

   <+132ms> # Error: adding pod to state: name "p-t-bpalhr6r" is in use: pod already exists

Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Big refactor of two kube ones, merged into one

Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2024
Signed-off-by: Ed Santiago <[email protected]>
Copy link

Cockpit tests failed for commit 0261686. @martinpitt, @jelly, @mvollmer please check.

@edsantiago edsantiago force-pushed the bats-parallel branch 3 times, most recently from e109722 to 0cd7a7d Compare September 19, 2024 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/api-change Change to remote API; merits scrutiny machine release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants