Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: system tests: enable parallel tests #23989

Merged
merged 9 commits into from
Sep 17, 2024

Conversation

edsantiago
Copy link
Member

Well, here goes. Let's try enabling parallel system tests.

There's still a ton of work left to do, but I think we're ready enough. The remaining work is in #23275, and I will continue submitting piecemeal PRs over time.

We're gonna see flakes. Sorry. The remaining flakes are really hard to track down. OTOH rerun time will be shorter.

Note to reviewers: the only way to handle this is by looking at individual commits.

None

For the past two months we've been splitting system tests
into two categories: those that CAN be run in parallel,
and those that CANNOT. Much work has been done to replace
hardcoded names (mycontainer, mypod) with safename().
Hundreds of test runs, in CI and on Ed's laptop, have
proven this approach viable.

make {local,remote}system now runs in two steps: first
the serial ones, then the parallel ones. hack/bats will
now recognize the 'ci:parallel' tag and add --jobs (nprocs).

This requires some tweaking of leak_check, because there
can be umpteen tests running (affecting image/container/pod/etc
state) when any given test completes.

Rules for enabling parallelization in tests:

   * use unique container/pod/volume/network names (safename)
   * do not run 'podman rm -a' or 'rmi -a'
   * never use the -l (--latest) option
   * do not run 'podman ps/images' and expect precise output

Signed-off-by: Ed Santiago <[email protected]>
Workaround for containers#23292, where simultaneous 'pod create' commands
will all start a podman-build of the pause image, but only
one of them will be tagged, and the others will leak <none>
images.

Signed-off-by: Ed Santiago <[email protected]>
For tests run in parallel, show file number as |nnn| (vs [nnn])

Teach logformatter to distinguish the two, adding 'p' to anchors
in parallel tests. Necessary because in this scheme we run bats
twice, thus see 'ok 1' twice, and we want to differentiate them.

Signed-off-by: Ed Santiago <[email protected]>
Add a few best-practices examples, and add a whole section
describing the dos and donts of writing parallel-safe tests.

Signed-off-by: Ed Santiago <[email protected]>
When running parallel, multiple tests could be trying to start
the registry at once. Make this parallel-safe.

Also, use a safer port range for the registry. Something
outside of /proc/sys/net/ipv4/ip_local_port_range

Sorry, I'm including a FIXME section that I haven't investigated
deeply enough.

Signed-off-by: Ed Santiago <[email protected]>
Need --layers=false in podman build, otherwise a buildah race
can trigger "layer not known" failures:

   containers/buildah#5674

Signed-off-by: Ed Santiago <[email protected]>
...for dealing with flakes in parallel mode

Signed-off-by: Ed Santiago <[email protected]>
Copy link
Contributor

openshift-ci bot commented Sep 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2024
@mheon
Copy link
Member

mheon commented Sep 17, 2024

LGTM on my end. Let's bite the bullet and get it enabled.

@rhatdan
Copy link
Member

rhatdan commented Sep 17, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 17, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 4dfff40 into containers:main Sep 17, 2024
88 checks passed
@edsantiago edsantiago deleted the enable-bats-parallel branch September 18, 2024 00:07
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would have been great if I had gotten a chance to review this first before merging so quickly. Anyhow no major blocker so it is fine.

gce_instance: *standardvm
timeout_in: 35m
gce_instance: *fastvm
timeout_in: 25m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slowest test in this PR was 23:23 so this looks very close to hitting this already, maybe it was as particular bad run but we should not set it to close IMO
I guess it is fine as you keep converting more files to run parallel so we speed up over the next days/weeks?

#
# This is an example of what NOT to do when enabling parallel tests.
#
# bats test_tags=ci:parallel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely mention the possibility of a file_tag for the whole file? i.e. do not set it if the file already has it set and a author must check if the whole file is parallel

Also examples of what not to do are bad if they do not show what to do instead. This should be in the style of

do not do
run_podman --name mycontainer $IMAGE top
instead do
local cname="c-$(safename)"
run_podman --name $cname $IMAGE top

etc...

@@ -975,6 +976,7 @@ EOF

service_setup $QUADLET_SERVICE_NAME

# FIXME: log.91: Starting, not Started
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what this is about?

Comment on lines +47 to +52

###### FIXME FIXME FIXME TEMPORARY!
###### Trying to understand flake #23725. What happens if we stop
###### doing the network reload?
###### FIXME FIXME FIXME, should we do it in stop_registry??
###### run_podman --storage-driver vfs $(podman_isolation_opts ${PODMAN_LOGIN_WORKDIR}) network reload --all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have been commited?

If we want to avoid any networks errors one way is to switch the registy to use --network host like you did for the CI cache registry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--network=host is incompatible with -p xxx:5000. AFAICT there's no environment-override of 5000, the only way to override is by editing the in-image config.yml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-e REGISTRY_HTTP_ADDR=0.0.0.0:${PORT} (untested)

all options can be set via env https://distribution.github.io/distribution/about/configuration/#override-specific-configuration-options


# Network namespace leak check. List should match what we saw above.
echo
ip netns list > $BATS_SUITE_TMPDIR/netns-post
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really correct for rootless. It is a good start for sure but the issue here is that as rootless you check namespaces for root allowing for false positives even when there is no point. I often different reproducers running as root and rootless running at the same time so this might very well cause me some troubles.

And in general we should check $XDG_RUNTIME)DIR/netns as well as rootless. I work on a commit to fix that up to make it work for rootless.

@@ -5,11 +5,10 @@ debug failures.
Quick Start
===========

Look at [030-run.bats](030-run.bats) for a simple but packed example.
Look at [000-TEMPLATE](000-TEMPLATE) for a simple starting point.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no docs here in the Readme about parallel tests requirements and I assume many will not look into the template file. I think the parallel testing should have its own section here

@edsantiago
Copy link
Member Author

I was expecting a week of back-and-forth in review. I tried to be diligent about cleaning up my test-only comments and code, but missed some. I will clean those up today. Thank you.

(Documentation will be lower priority. I still have a few more tests to submit)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants