Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wait: fix handling of multiple conditions with exited #23985

Merged
merged 2 commits into from
Sep 18, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Sep 17, 2024

As it turns on things are not so simple after all...
In podman-py it was reported[1] that waiting might hang, per our docs wait
on multiple conditions should exit once the first one is hit and not all
of them. However because the new wait logic never checked if the context
was cancelled the goroutine kept running until conmon exited and because
we used a waitgroup to wait for all of them to finish it blocked until
that happened.

First we can remove the waitgroup as we only need to wait for one of
them anyway via the channel. While this alone fixes the hang it would
still leak the other goroutine. As there is no way to cancel a goroutine
all the code must check for a cancelled context in the wait loop to no
leak.

Fixes 8a94331 ("libpod: simplify WaitForExit()")
[1] containers/podman-py#425

Does this PR introduce a user-facing change?

None (The bug was only in main so not worth to include it in the changlog)

As it turns on things are not so simple after all...
In podman-py it was reported[1] that waiting might hang, per our docs wait
on multiple conditions should exit once the first one is hit and not all
of them. However because the new wait logic never checked if the context
was cancelled the goroutine kept running until conmon exited and because
we used a waitgroup to wait for all of them to finish it blocked until
that happened.

First we can remove the waitgroup as we only need to wait for one of
them anyway via the channel. While this alone fixes the hang it would
still leak the other goroutine. As there is no way to cancel a goroutine
all the code must check for a cancelled context in the wait loop to no
leak.

Fixes 8a94331 ("libpod: simplify WaitForExit()")
[1] containers/podman-py#425

Signed-off-by: Paul Holzinger <[email protected]>
The issue is closed and I recently fixed a number of races (bf74797)
in the remote attach API that sound like exactly like the same error
that was mentioned in issue containers#9597.

As such I think this works, if it start flaking again we can revert this
or better fix the actual bug.

Signed-off-by: Paul Holzinger <[email protected]>
Copy link
Contributor

openshift-ci bot commented Sep 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2024
@mheon
Copy link
Member

mheon commented Sep 17, 2024

It feels like the wait code could use a rearchitecture given how fragile this seems to be, but I don't see how it can be improved that much with our current architecture. We could maybe trigger off of events for some things, but that probably isn't sufficient for all states.

@baude
Copy link
Member

baude commented Sep 17, 2024

nice removal of sleep

LGTM

@Luap99
Copy link
Member Author

Luap99 commented Sep 18, 2024

It feels like the wait code could use a rearchitecture given how fragile this seems to be, but I don't see how it can be improved that much with our current architecture. We could maybe trigger off of events for some things, but that probably isn't sufficient for all states.

yeah using the event stream would make sense for many things but given we allow events-backend=none we should not have it as hard requirement. I don't no matter what we do it will be much better, in the end we always have to keep checking the ctx.Done case all the time as there is no way to cancel actively in go.

@mheon
Copy link
Member

mheon commented Sep 18, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit f580ae0 into containers:main Sep 18, 2024
80 checks passed
@Luap99 Luap99 deleted the wait-hang branch September 18, 2024 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants