wait: fix handling of multiple conditions with exited #23985

Luap99 · 2024-09-17T15:35:36Z

As it turns on things are not so simple after all...
In podman-py it was reported[1] that waiting might hang, per our docs wait
on multiple conditions should exit once the first one is hit and not all
of them. However because the new wait logic never checked if the context
was cancelled the goroutine kept running until conmon exited and because
we used a waitgroup to wait for all of them to finish it blocked until
that happened.

First we can remove the waitgroup as we only need to wait for one of
them anyway via the channel. While this alone fixes the hang it would
still leak the other goroutine. As there is no way to cancel a goroutine
all the code must check for a cancelled context in the wait loop to no
leak.

Fixes 8a94331 ("libpod: simplify WaitForExit()")
[1] containers/podman-py#425

Does this PR introduce a user-facing change?

None (The bug was only in main so not worth to include it in the changlog)

As it turns on things are not so simple after all... In podman-py it was reported[1] that waiting might hang, per our docs wait on multiple conditions should exit once the first one is hit and not all of them. However because the new wait logic never checked if the context was cancelled the goroutine kept running until conmon exited and because we used a waitgroup to wait for all of them to finish it blocked until that happened. First we can remove the waitgroup as we only need to wait for one of them anyway via the channel. While this alone fixes the hang it would still leak the other goroutine. As there is no way to cancel a goroutine all the code must check for a cancelled context in the wait loop to no leak. Fixes 8a94331 ("libpod: simplify WaitForExit()") [1] containers/podman-py#425 Signed-off-by: Paul Holzinger <[email protected]>

The issue is closed and I recently fixed a number of races (bf74797) in the remote attach API that sound like exactly like the same error that was mentioned in issue containers#9597. As such I think this works, if it start flaking again we can revert this or better fix the actual bug. Signed-off-by: Paul Holzinger <[email protected]>

openshift-ci · 2024-09-17T15:35:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mheon · 2024-09-17T18:06:07Z

It feels like the wait code could use a rearchitecture given how fragile this seems to be, but I don't see how it can be improved that much with our current architecture. We could maybe trigger off of events for some things, but that probably isn't sufficient for all states.

baude · 2024-09-17T20:23:41Z

nice removal of sleep

LGTM

Luap99 · 2024-09-18T08:07:03Z

It feels like the wait code could use a rearchitecture given how fragile this seems to be, but I don't see how it can be improved that much with our current architecture. We could maybe trigger off of events for some things, but that probably isn't sufficient for all states.

yeah using the event stream would make sense for many things but given we allow events-backend=none we should not have it as hard requirement. I don't no matter what we do it will be much better, in the end we always have to keep checking the ctx.Done case all the time as there is no way to cancel actively in go.

mheon · 2024-09-18T12:22:56Z

/lgtm

Luap99 added 2 commits September 17, 2024 17:35

openshift-ci bot added the release-note label Sep 17, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2024

openshift-ci bot assigned mheon Sep 18, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2024

openshift-merge-bot bot merged commit f580ae0 into containers:main Sep 18, 2024
80 checks passed

Luap99 deleted the wait-hang branch September 18, 2024 12:32

edsantiago mentioned this pull request Oct 9, 2024

CI: e2e: pull/run and decrypt from local registry: timeout #24219

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wait: fix handling of multiple conditions with exited #23985

wait: fix handling of multiple conditions with exited #23985

Luap99 commented Sep 17, 2024

openshift-ci bot commented Sep 17, 2024

mheon commented Sep 17, 2024

baude commented Sep 17, 2024

Luap99 commented Sep 18, 2024

mheon commented Sep 18, 2024

wait: fix handling of multiple conditions with exited #23985

wait: fix handling of multiple conditions with exited #23985

Conversation

Luap99 commented Sep 17, 2024

Does this PR introduce a user-facing change?

openshift-ci bot commented Sep 17, 2024

mheon commented Sep 17, 2024

baude commented Sep 17, 2024

Luap99 commented Sep 18, 2024

mheon commented Sep 18, 2024