
pkg/machine/e2e: fix broken cleanup #23154

Merged
merged 2 commits into containers:main on Jul 2, 2024

Conversation

@Luap99 (Member) commented Jul 1, 2024

pkg/machine/e2e: use tmp file for connections

On Linux and macOS the connections are stored under the home dir by
default, so it is not a problem there, but on Windows we first check
the APPDATA env var and use that dir as config storage. This has the
problem that it is not cleaned up after each test, so connections might
leak into the following test, causing failures there.

Fixes #22844


pkg/machine/e2e: fix broken cleanup

Currently all podman machine rm errors in AfterEach were ignored.
This means some machines leaked and caused issues later on, see #22844.

To fix it, first rework the logic to only remove machines when needed,
at the place where they are created, using DeferCleanup(). However,
DeferCleanup() does not work well together with AfterEach(), as Ginkgo
always runs AfterEach() before DeferCleanup(). Because AfterEach()
deletes the dir, the podman machine rm call cannot be done afterwards.

As such, migrate all cleanup to use DeferCleanup(), and while touching
this anyway, fix the code to remove the per-file duplication and define
the setup/cleanup once in the global scope.

Does this PR introduce a user-facing change?

None
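
To make the DeferCleanup() ordering concrete, here is a minimal Ginkgo v2 sketch of the pattern described in the commit messages above. It is not code from this PR; the helpers newMachine and removeMachine are hypothetical stand-ins for the suite's machineTestBuilder. The temp-dir removal is registered first so that, with DeferCleanup's LIFO ordering, the machine removal registered at creation time runs while the per-test directories still exist.

package e2e_test

import (
    "os"
    "os/exec"
    "path/filepath"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
)

// Hypothetical helpers, used only to illustrate the cleanup ordering; the
// real suite drives podman through its own machineTestBuilder.
func newMachine(name string) error {
    return exec.Command("podman", "machine", "init", name).Run()
}

func removeMachine(name string) {
    // Must run while the per-test home dir still exists.
    _ = exec.Command("podman", "machine", "rm", "-f", name).Run()
}

var _ = Describe("machine cleanup ordering", func() {
    BeforeEach(func() {
        home, err := os.MkdirTemp("", "podman_e2e")
        Expect(err).ToNot(HaveOccurred())
        // Registered first, so it runs last (DeferCleanup is LIFO), i.e.
        // after the machine removal registered inside the spec below.
        DeferCleanup(os.RemoveAll, home)

        Expect(os.Setenv("HOME", home)).To(Succeed())
        // Keep the connections file inside the per-test dir as well, so it
        // cannot leak into the next test (the Windows/APPDATA problem).
        Expect(os.Setenv("PODMAN_CONNECTIONS_CONF",
            filepath.Join(home, "connections.json"))).To(Succeed())
    })

    It("removes the machine it created", func() {
        Expect(newMachine("foo1")).To(Succeed())
        // Register the removal right where the machine is created; it runs
        // before the home dir above is deleted.
        DeferCleanup(removeMachine, "foo1")
    })
})

A spec that never gets as far as creating a machine also never registers a removal, which is the "only remove machines when needed" part of the fix.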

openshift-ci bot added the release-note-none and approved labels on Jul 1, 2024
@Luap99 (Member, Author) commented Jul 1, 2024

@edsantiago @baude @ashley-cui PTAL

@edsantiago (Member) left a comment


LGTM. Two questions inline.

@@ -111,6 +111,9 @@ func setup() (string, *machineTestBuilder) {
if err := os.Unsetenv("SSH_AUTH_SOCK"); err != nil {
Fail("unable to unset SSH_AUTH_SOCK")
}
if err := os.Setenv("PODMAN_CONNECTIONS_CONF", filepath.Join(homeDir, "connections.json")); err != nil {
edsantiago (Member):

In the interest of allowing CI tests to run locally, would it make sense to set this to a tempdir?

Luap99 (Member, Author):

That is already a tmpdir, see above. The entire home dir is overridden, but since Windows uses APPDATA, that does not help there. Of course I could override APPDATA for Windows, but this seems simpler and more consistent to me.
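
To make the Windows point concrete, here is a simplified sketch of the lookup order the tests work around. This is not the actual containers/common code and the directory layouts below are only indicative; the load-bearing points are that an explicit PODMAN_CONNECTIONS_CONF wins, and that on Windows the fallback is derived from APPDATA rather than from the (already overridden) home dir.

package example

import (
    "os"
    "path/filepath"
    "runtime"
)

// connectionsFile is a simplified illustration, not the real implementation;
// the concrete paths are only indicative.
func connectionsFile() string {
    // Explicit override, which is what the e2e setup now sets.
    if p := os.Getenv("PODMAN_CONNECTIONS_CONF"); p != "" {
        return p
    }
    if runtime.GOOS == "windows" {
        // APPDATA wins on Windows, so a per-test home dir does not isolate it.
        if appData := os.Getenv("APPDATA"); appData != "" {
            return filepath.Join(appData, "containers", "podman", "connections.json")
        }
    }
    home, _ := os.UserHomeDir()
    return filepath.Join(home, ".config", "containers", "podman", "connections.json")
}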

Comment on lines +87 to +89
// Some tests create an invalid VM, so the VM does not exist; in that case we have to ignore the error.
// It would be much better if rm -f behaved like other commands and ignored not-exists errors.
if session.ExitCode() == 125 {
edsantiago (Member):

There are a lot of those tests! Would it make sense for those tests to set a SkipMachineCleanup or InvalidVM state flag?

Luap99 (Member, Author):

I don't trust test writers to remove this flag when it is no longer needed, and it doesn't help if a bug causes the machine to be created all of a sudden; then we would leak machines.

IMO the reasonable fix is to make rm -f not error on a non-existing machine, like our other commands do, i.e. podman rm -f blah
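
For reference, a minimal sketch of the cleanup behaviour the quoted hunk implements, with a hypothetical helper shape rather than the suite's actual machineTestBuilder session: treat exit code 125 from podman machine rm -f as "machine does not exist" and ignore it, but fail the spec on any other error.

package e2e_test

import (
    "errors"
    "fmt"
    "os/exec"

    . "github.com/onsi/ginkgo/v2"
)

// cleanupMachine is a hypothetical helper illustrating the exit-code handling
// quoted above; the real suite inspects its own test session instead.
func cleanupMachine(name string) {
    out, err := exec.Command("podman", "machine", "rm", "-f", name).CombinedOutput()
    if err == nil {
        return
    }
    var exitErr *exec.ExitError
    // Some tests intentionally create an invalid VM, so the machine may not
    // exist; podman currently exits with 125 in that case.
    if errors.As(err, &exitErr) && exitErr.ExitCode() == 125 {
        return
    }
    Fail(fmt.Sprintf("podman machine rm -f %s failed: %v\n%s", name, err, string(out)))
}

Such a helper would then be registered with DeferCleanup(cleanupMachine, name) right after the machine is created, so the ignored not-exists case stays confined to the cleanup path.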

openshift-ci bot (Contributor) commented Jul 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago (Member)

LGTM

@ashley-cui (Member)

/lgtm

openshift-ci bot added the lgtm label on Jul 1, 2024
@Luap99 (Member, Author) commented Jul 2, 2024

I think I noticed one weird failure pattern:

$ podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
  [FAILED] Timed out after 240.001s.
...

-> next test
$ podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw f357ac67e822
  Error: truncate /private/tmp/ci/podman_test9067091/.local/share/containers/podman/machine/applehv/foo1-arm64.raw: no such file or directory
  Machine init complete
  To start your machine run:

  	podman machine start f357ac67e822
-> this one is a success despite the error message?! And notice how the error path contains the machine name from the previous failed test.

I see this pattern in basically all my failed runs here.

My best guess is that this was caused by #23068. I know we had the flake before, but the fact that it got that bad all of a sudden suggests to me that something must have changed that causes this.
Looking at the runs there, it took 7 tries: https://cirrus-ci.com/task/5748607108775936

I also pushed #23162 that should hopefully add useful debug output to find the root cause.

openshift-merge-bot merged commit f5d50a6 into containers:main on Jul 2, 2024
89 of 90 checks passed
@Luap99 deleted the machine-test-connection branch on July 2, 2024 at 12:14
@Luap99 (Member, Author) commented Jul 2, 2024

It took 13 tries to get the mac machine test to pass

stale-locking-app bot added the locked - please file new issue/PR label on Oct 1, 2024
stale-locking-app bot locked this PR as resolved and limited conversation to collaborators on Oct 1, 2024
Labels: approved, lgtm, locked - please file new issue/PR, machine, release-note-none
Successfully merging this pull request may close these issues: windows: system connection: unexpected ports (#22844)
3 participants