
pkg/machine/e2e: fix broken cleanup #23154

Merged
merged 2 commits into containers:main on Jul 2, 2024

Conversation

@Luap99 (Member) commented Jul 1, 2024

pkg/machine/e2e: use tmp file for connections

On Linux and macOS the connections are stored under the home dir by
default, so it is not a problem there, but on Windows we first check
the APPDATA env var and use that dir as config storage. This has the
problem that it is not cleaned up after each test, so connections might
leak into the following test, causing failures there.

Fixes #22844


pkg/machine/e2e: fix broken cleanup

Currently all podman machine rm errors in AfterEach were ignored.
This means some machines leaked and caused issues later on, see #22844.

To fix it, first rework the logic to only remove machines when needed,
at the place where they are created, using DeferCleanup(). However,
DeferCleanup() does not work well together with AfterEach(), as Ginkgo
always runs AfterEach() before DeferCleanup(). Because AfterEach()
deletes the dir, the podman machine rm call cannot be done afterwards.

As such, migrate all cleanup to use DeferCleanup(), and while touching
this anyway, fix the code to remove the per-file duplication and define
the setup/cleanup once in the global scope.

Does this PR introduce a user-facing change?

None
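
To make the DeferCleanup() ordering concrete, here is a minimal Ginkgo v2 sketch of the pattern described in the commit messages above. It is not code from this PR; the helpers newMachine and removeMachine are hypothetical stand-ins for the suite's machineTestBuilder. The temp-dir removal is registered first so that, with DeferCleanup's LIFO ordering, the machine removal registered at creation time runs while the per-test directories still exist.

package e2e_test

import (
    "os"
    "os/exec"
    "path/filepath"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
)

// Hypothetical helpers, used only to illustrate the cleanup ordering; the
// real suite drives podman through its own machineTestBuilder.
func newMachine(name string) error {
    return exec.Command("podman", "machine", "init", name).Run()
}

func removeMachine(name string) {
    // Must run while the per-test home dir still exists.
    _ = exec.Command("podman", "machine", "rm", "-f", name).Run()
}

var _ = Describe("machine cleanup ordering", func() {
    BeforeEach(func() {
        home, err := os.MkdirTemp("", "podman_e2e")
        Expect(err).ToNot(HaveOccurred())
        // Registered first, so it runs last (DeferCleanup is LIFO), i.e.
        // after the machine removal registered inside the spec below.
        DeferCleanup(os.RemoveAll, home)

        Expect(os.Setenv("HOME", home)).To(Succeed())
        // Keep the connections file inside the per-test dir as well, so it
        // cannot leak into the next test (the Windows/APPDATA problem).
        Expect(os.Setenv("PODMAN_CONNECTIONS_CONF",
            filepath.Join(home, "connections.json"))).To(Succeed())
    })

    It("removes the machine it created", func() {
        Expect(newMachine("foo1")).To(Succeed())
        // Register the removal right where the machine is created; it runs
        // before the home dir above is deleted.
        DeferCleanup(removeMachine, "foo1")
    })
})

A spec that never gets as far as creating a machine also never registers a removal, which is the "only remove machines when needed" part of the fix.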

openshift-ci bot added the release-note-none and approved labels on Jul 1, 2024
@Luap99 (Member, Author) commented Jul 1, 2024

@edsantiago @baude @ashley-cui PTAL

@edsantiago (Member) left a comment


LGTM. Two questions inline.

@@ -111,6 +111,9 @@ func setup() (string, *machineTestBuilder) {
if err := os.Unsetenv("SSH_AUTH_SOCK"); err != nil {
Fail("unable to unset SSH_AUTH_SOCK")
}
if err := os.Setenv("PODMAN_CONNECTIONS_CONF", filepath.Join(homeDir, "connections.json")); err != nil {
edsantiago (Member):

In the interest of allowing CI tests to run locally, would it make sense to set this to a tempdir?

Luap99 (Member, Author):

That is already a tmpdir, see above. The entire home dir is overridden, but since Windows uses APPDATA, that does not help there. Of course I could override APPDATA for Windows, but this seems simpler and more consistent to me.
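
To make the Windows point concrete, here is a simplified sketch of the lookup order the tests work around. This is not the actual containers/common code and the directory layouts below are only indicative; the load-bearing points are that an explicit PODMAN_CONNECTIONS_CONF wins, and that on Windows the fallback is derived from APPDATA rather than from the (already overridden) home dir.

package example

import (
    "os"
    "path/filepath"
    "runtime"
)

// connectionsFile is a simplified illustration, not the real implementation;
// the concrete paths are only indicative.
func connectionsFile() string {
    // Explicit override, which is what the e2e setup now sets.
    if p := os.Getenv("PODMAN_CONNECTIONS_CONF"); p != "" {
        return p
    }
    if runtime.GOOS == "windows" {
        // APPDATA wins on Windows, so a per-test home dir does not isolate it.
        if appData := os.Getenv("APPDATA"); appData != "" {
            return filepath.Join(appData, "containers", "podman", "connections.json")
        }
    }
    home, _ := os.UserHomeDir()
    return filepath.Join(home, ".config", "containers", "podman", "connections.json")
}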

Comment on lines +87 to +89
// Some tests create an invalid VM, so the VM does not exist; in that case we have to ignore the error.
// It would be much better if rm -f behaved like other commands and ignored not-exists errors.
if session.ExitCode() == 125 {
edsantiago (Member):

There are a lot of those tests! Would it make sense for those tests to set a SkipMachineCleanup or InvalidVM state flag?

Luap99 (Member, Author):

I don't trust test writers to remove this flag when it is no longer needed, and it doesn't help if a bug causes the machine to be created all of a sudden; then we would leak machines.

IMO the reasonable fix is to make rm -f not error on a non-existing machine, like our other commands do, i.e. podman rm -f blah
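
For reference, a minimal sketch of the cleanup behaviour the quoted hunk implements, with a hypothetical helper shape rather than the suite's actual machineTestBuilder session: treat exit code 125 from podman machine rm -f as "machine does not exist" and ignore it, but fail the spec on any other error.

package e2e_test

import (
    "errors"
    "fmt"
    "os/exec"

    . "github.com/onsi/ginkgo/v2"
)

// cleanupMachine is a hypothetical helper illustrating the exit-code handling
// quoted above; the real suite inspects its own test session instead.
func cleanupMachine(name string) {
    out, err := exec.Command("podman", "machine", "rm", "-f", name).CombinedOutput()
    if err == nil {
        return
    }
    var exitErr *exec.ExitError
    // Some tests intentionally create an invalid VM, so the machine may not
    // exist; podman currently exits with 125 in that case.
    if errors.As(err, &exitErr) && exitErr.ExitCode() == 125 {
        return
    }
    Fail(fmt.Sprintf("podman machine rm -f %s failed: %v\n%s", name, err, string(out)))
}

Such a helper would then be registered with DeferCleanup(cleanupMachine, name) right after the machine is created, so the ignored not-exists case stays confined to the cleanup path.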

openshift-ci bot (Contributor) commented Jul 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago (Member)

LGTM

@ashley-cui (Member)

/lgtm

openshift-ci bot added the lgtm label on Jul 1, 2024
@Luap99 (Member, Author) commented Jul 2, 2024

I think I noticed one weird failure pattern:

$ podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
  [FAILED] Timed out after 240.001s.
...

-> next test
$ podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw f357ac67e822
  Error: truncate /private/tmp/ci/podman_test9067091/.local/share/containers/podman/machine/applehv/foo1-arm64.raw: no such file or directory
  Machine init complete
  To start your machine run:

  	podman machine start f357ac67e822
-> this one is a success despite the error message?! And notice how the error path contains the machine name from the previous failed test.

I see this pattern in basically all my failed runs here.

My best guess is that this was caused by #23068. I know we had the flake before, but the fact that it got that bad all of a sudden suggests to me that something must have changed that causes this.
Looking at the runs there, it took 7 tries: https://cirrus-ci.com/task/5748607108775936

I also pushed #23162 that should hopefully add useful debug output to find the root cause.

openshift-merge-bot merged commit f5d50a6 into containers:main on Jul 2, 2024
89 of 90 checks passed
@Luap99 deleted the machine-test-connection branch on July 2, 2024 at 12:14
@Luap99 (Member, Author) commented Jul 2, 2024

It took 13 tries to get the mac machine test to pass

stale-locking-app bot added the locked - please file new issue/PR label on Oct 1, 2024
stale-locking-app bot locked this PR as resolved and limited conversation to collaborators on Oct 1, 2024
Labels: approved, lgtm, locked - please file new issue/PR, machine, release-note-none
Successfully merging this pull request may close these issues: windows: system connection: unexpected ports (#22844)
3 participants