
add ebpf program to trace podman cleanup errors in CI #23487

Open: Luap99 wants to merge 2 commits into main from ebpf
Conversation

Luap99
Member

@Luap99 Luap99 commented Aug 2, 2024

See commits, I just want to check if this even works. Hopefully this allows me to properly understand cleanup errors.

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none labels Aug 2, 2024
Contributor

openshift-ci bot commented Aug 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 2, 2024
@Luap99 Luap99 force-pushed the ebpf branch 2 times, most recently from 4cf3a21 to f8a2000 Compare August 2, 2024 16:22

Cockpit tests failed for commit f241ed0. @martinpitt, @jelly, @mvollmer please check.

Luap99 added a commit to Luap99/automation_images that referenced this pull request Aug 5, 2024
I'd like to run a bpftrace-based program in CI to collect better logs for
specific processes not observed in normal testing, such as the podman
container cleanup command.

Given that running eBPF requires full privileges, and the package pulls
in an entire toolchain of almost 500MB install size, we do not add it to
the container images, to avoid bloating them without reason.

containers/podman#23487

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Member Author

Luap99 commented Aug 6, 2024

In case anyone is interested here is an example log:
https://api.cirrus-ci.com/v1/artifact/task/6197821693493248/cleanup_tracer/podman-cleanup-tracer.log

It doesn't look nice, mostly because the argv lines are printed with zero bytes instead of spaces. I don't have a way to get rid of that inside my program, but I guess I could pipe the output into sed to replace all zero bytes with spaces to make it easier to read.
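A minimal sketch of that post-processing idea (the paths in the sample input are illustrative; `tr` is a simpler choice than sed for a single-byte substitution):

```shell
# argv elements arrive NUL-separated from the tracer; turn the NUL
# bytes into spaces so the cmd lines in the log are human-readable.
printf '/usr/libexec/podman/netavark\0teardown\0/run/netns/netns-example' | tr '\0' ' '
# → /usr/libexec/podman/netavark teardown /run/netns/netns-example
```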

@Luap99
Member Author

Luap99 commented Aug 6, 2024

@edsantiago please pick this one as well for the no retry PR, maybe I can find the cause in this new log.

@edsantiago
Member

edsantiago commented Aug 6, 2024

Here ya go f39 root. [EDIT: direct link to cleanup tracer log]. This is my second run; first one did not have any relevant flakes. This run had only one. I'll rerun at least once more today.

@edsantiago
Member

This one has two (debian root). It was the only failure in the third run. I've pushed one more time, but will not be checking its status again tonight.

@Luap99
Member Author

Luap99 commented Aug 7, 2024

Ok, that doesn't seem to include anything new AFAICS, but it does confirm what I have been saying: the cleanup is called twice. The only way I see this happening is if the cleanup process failed, but in the logged exit codes I clearly see no errors or kill signals sent. So there truly must be a condition in our code that is calling cleanup twice even if there was no error the first time...

@Luap99
Member Author

Luap99 commented Aug 7, 2024

@edsantiago I assume the no-flake-retry change is the only thing I need to trigger this?

I added the parent pid info to my log, as it wasn't clear whether the netavark teardown was run by the same podman command or by two different ones.

@edsantiago
Member

Not to trigger it, more precisely, to see it. I think the error is happening all the time, it's just that normally we have to examine each individual log to find it. Yes, setting retries to 0 should make failures visible.

@Luap99
Member Author

Luap99 commented Aug 7, 2024

Not to trigger it, more precisely, to see it. I think the error is happening all the time, it's just that normally we have to examine each individual log to find it. Yes, setting retries to 0 should make failures visible.

Yes that is what I meant.

@Luap99
Member Author

Luap99 commented Aug 7, 2024

Interesting: I also found the netns error in my log for the cleanup process. So it can happen without us ever seeing it, and without causing a test failure.

stderr 07:36:29.528707 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="Cleaning up container: failed to clean up container e22e54194869285aef9d593f0cdfff9e23d5fdfb14615dbcf71aac829d225a77: cannot get namespace path unless container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8 is running: container is stopped"
stderr 07:36:29.528744 79038    79009    podman      Error: failed to clean up container e22e54194869285aef9d593f0cdfff9e23d5fdfb14615dbcf71aac829d225a77: cannot get namespace path unless container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8 is running: container is stopped
exec   07:36:29.529410 79114    79061    podman       /usr/libexec/podman/netavark --config /run/containers/networks --rootless=false --aardvark-binary=/usr/libexec/podman/aardvark-dns teardown /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46
cmd    07:36:29.529780 79114    79061    /usr/libexec/podman/netavark�--config�/run/containers/networks�--rootless=false�--aardvark-binary=/usr/libexec/podman/aardvark-dns�teardown�/run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46�
exit   07:36:29.617636 79114    79061    netavark     0 0
stderr 07:36:29.638209 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="IPAM error: failed to get ips for container ID edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8 on network podman-default-kube-network"
exec   07:36:29.638942 79150    79038    podman       /usr/libexec/podman/netavark --config /run/containers/networks --rootless=false --aardvark-binary=/usr/libexec/podman/aardvark-dns teardown /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46
cmd    07:36:29.639236 79150    79038    /usr/libexec/podman/netavark�--config�/run/containers/networks�--rootless=false�--aardvark-binary=/usr/libexec/podman/aardvark-dns�teardown�/run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46�
exit   07:36:29.642237 79150    79038    netavark     1 0
stderr 07:36:29.642647 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
stderr 07:36:29.642714 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="netavark: open container netns: open /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46: IO error: No such file or directory (os error 2)"
stderr 07:36:29.642747 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="Unable to clean up network for container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8: \"unmounting network namespace for container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8: failed to remove ns path: remove /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46: no such file or directory, failed to unmount NS: at /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46: no such file or directory\""
exit   07:36:29.656424 79103    78908    podman       0 0
exit   07:36:29.676194 79038    79009    podman       125 0

https://api.cirrus-ci.com/v1/artifact/task/5220866676490240/cleanup_tracer/podman-cleanup-tracer.log
https://api.cirrus-ci.com/v1/artifact/task/5220866676490240/html/int-podman-rawhide-root-host-sqlite.log.html

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Aug 7, 2024
@Luap99 Luap99 force-pushed the ebpf branch 6 times, most recently from 4f5828b to ed8dc4d Compare August 8, 2024 14:43
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2024

Ephemeral COPR build failed. @containers/packit-build please check.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2024
@Luap99
Member Author

Luap99 commented Aug 9, 2024

@edsantiago Great news: I am confident that I found the root cause, a locking/container-state misuse bug that caused us to add the netns back after we had already cleaned it up.
Please give d6bcc14 a try. Hopefully this is enough; I will prepare a proper PR with a better description on Monday.

@edsantiago
Member

One new failure seen, f39 root:

# podman [options] network ls --quiet
net5ef1e5e5df3c173e8948e8e487fc7c21c14438e290de0a5a04e3f8b610f60b47
networkIDTest
podman
podman-default-kube-network
time="2024-08-10T17:56:31Z" level=warning msg="Error reading network config file \"/etc/containers/networks/net25120a9e54ed259eaa0b8b853f1e8cda89d0c538a55825de0e591c2e67950554.json\": EOF"

[FAILED] Unexpected warnings seen on stderr: "time=\"2024-08-10T17:56:31Z\" level=warning msg=\"Error reading network config file \\\"/etc/containers/networks/net25120a9e54ed259eaa0b8b853f1e8cda89d0c538a55825de0e591c2e67950554.json\\\": EOF\""

Looks more like a race in network ls than anything you've changed, but still, it's a new one I don't remember seeing before.

@edsantiago
Member

It took a LOT of runs, but here's what looks like the same issue. rawhide rootless:

$ podman [options] stop --all -t 0
Error: removing container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c network: unmounting network namespace for container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c: failed to remove ns path: remove /run/user/2153/netns/netns-1fac2ba0-0d2a-10cb-7dbe-900b0f29ae60: device or resource busy

@Luap99
Member Author

Luap99 commented Aug 11, 2024

It took a LOT of runs, but here's what looks like the same issue. rawhide rootless:

$ podman [options] stop --all -t 0
Error: removing container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c network: unmounting network namespace for container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c: failed to remove ns path: remove /run/user/2153/netns/netns-1fac2ba0-0d2a-10cb-7dbe-900b0f29ae60: device or resource busy

Not the same issue (EBUSY vs ENOENT); this should be fixed by containers/common#2112, vendored in #23519

@Luap99
Member Author

Luap99 commented Aug 11, 2024

Also, d6bcc14 fixes such a generic issue that it would not surprise me if it fixes other weird stop flakes as well.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2024
@Luap99 Luap99 changed the title WIP: add ebpf program to trace podman cleanup errors in CI add ebpf program to trace podman cleanup errors in CI Sep 3, 2024
@Luap99 Luap99 marked this pull request as ready for review September 3, 2024 09:25
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2024
Add a new program based on bpftrace[1] to trace all podman processes
with their arguments and exit codes/signals. Additionally, this captures
stderr from all podman container cleanup processes spawned by conmon,
which otherwise goes to /dev/null and is never seen in any CI logs.
Hopefully this allows us to debug the strange network cleanup errors
seen in CI; my plan is to add this to the Cirrus setup and upload the
logs so we can check them when the flakes happen.

[1] https://github.com/bpftrace/bpftrace

Signed-off-by: Paul Holzinger <[email protected]>
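The tracer itself is not reproduced in this page, but a much-reduced sketch of what such a bpftrace program could look like follows; the probe choices, the matched binary path, and the output format are assumptions for illustration, not the actual program from this PR:

```bpftrace
// Hypothetical sketch: log every execve of the podman binary
// (the real tracer also follows netavark and conmon children
// and captures their stderr).
tracepoint:syscalls:sys_enter_execve
/str(args->filename) == "/usr/bin/podman"/
{
	printf("exec %-10d %s\n", pid, str(args->filename));
}

// Log process exits so duplicate cleanup invocations show up
// in the trace.
tracepoint:sched:sched_process_exit
/comm == "podman" || comm == "netavark"/
{
	printf("exit %-10d %s\n", pid, comm);
}
```

Running a script like this requires root and the bpftrace package, which is why the commit above installs it only on the CI VMs rather than in the container images.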
In order to get better debug data for cleanup flakes: the argv is
printed with 0 bytes, so replace them with spaces to make the log
readable for humans.

Signed-off-by: Paul Holzinger <[email protected]>