wait: fix handling of multiple conditions with exited #23985
Conversation
As it turns out, things are not so simple after all... In podman-py it was reported[1] that waiting might hang. Per our docs, waiting on multiple conditions should return once the first one is hit, not once all of them are. However, because the new wait logic never checked whether the context was cancelled, the goroutine kept running until conmon exited, and because we used a WaitGroup to wait for all of them to finish, it blocked until that happened. First, we can remove the WaitGroup, as we only need to wait for one of them anyway via the channel. While this alone fixes the hang, it would still leak the other goroutine. As there is no way to cancel a goroutine, all the code must check for a cancelled context in the wait loop so it does not leak. Fixes 8a94331 ("libpod: simplify WaitForExit()") [1] containers/podman-py#425 Signed-off-by: Paul Holzinger <[email protected]>
The issue is closed, and I recently fixed a number of races (bf74797) in the remote attach API that sound exactly like the error mentioned in issue containers#9597. As such I think this works; if it starts flaking again we can revert this or, better, fix the actual bug. Signed-off-by: Paul Holzinger <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Luap99 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
It feels like the wait code could use a rearchitecture given how fragile this seems to be, but I don't see how it can be improved that much with our current architecture. We could maybe trigger off of events for some things, but that probably isn't sufficient for all states.
Nice removal of sleep. LGTM
Yeah, using the event stream would make sense for many things, but given we allow events-backend=none we should not have it as a hard requirement. I don't think it will be much better no matter what we do; in the end we always have to keep checking the ctx.Done case, as there is no way to actively cancel a goroutine in Go.
/lgtm |
Does this PR introduce a user-facing change?