fix: pod finalizer removal and odd pod status #14088

Open

Joibel wants to merge 2 commits into main

Conversation

Joibel
Member

@Joibel Joibel commented Jan 15, 2025

Motivation

Finalizers

If pod finalizers are in use, they should not prevent pod deletion after the pod is complete.

For example: if you have a podGC.strategy of OnPodSuccess with a deleteDelayDuration set, and you delete the owning Workflow during the deleteDelayDuration, then the pod will remain until the deleteDelayDuration expires. If the workflow-controller is restarted during this window, the pod is orphaned with the finalizer still in place.

blockOwnerDeletion in the ownerReference of a pod does not prevent the owner (Workflow) from being deleted in all circumstances.

Wait Running whilst Pod Failed

It is possible for a node to disappear from a cluster as a surprise. In this case the Pod ContainerStatus could remain in running (because the container never went into any further state), whilst the Pod's own Status is in Error. We have seen this in real clusters, but it is rare.

This PR attempts to recognise this case and set the Workflow Node status accordingly.

Modifications

When a pod has a finalizer on it and its workflow node is Fulfilled, the finalizer is no longer needed, so always remove our finalizer if it is present. This allows the workflow to be deleted independently and ownerReference deletion to propagate and delete the pod. It also takes care of some race conditions, and of the case where the only reference to a completed pod is in the delayed cleanup queue, which is not persistent across restarts.
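
To illustrate the decision (a minimal sketch, not the actual controller code): the helper name `shouldRemoveFinalizer`, the caller-supplied `nodeFulfilled` flag, and the `workflows.argoproj.io/status` finalizer string are assumptions for this example.

```go
package cleanup

import (
	corev1 "k8s.io/api/core/v1"
)

// podStatusFinalizer is assumed here for illustration; the controller has
// its own constant for the finalizer it adds to pods.
const podStatusFinalizer = "workflows.argoproj.io/status"

// shouldRemoveFinalizer is a hypothetical helper: once the workflow node for
// this pod is Fulfilled, the finalizer protects nothing, so remove it
// unconditionally. This keeps the pod deletable via ownerReference
// propagation even if the delayed cleanup queue is lost across a restart.
func shouldRemoveFinalizer(pod *corev1.Pod, nodeFulfilled bool) bool {
	if !nodeFulfilled {
		return false
	}
	for _, f := range pod.Finalizers {
		if f == podStatusFinalizer {
			return true
		}
	}
	return false
}
```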

When a pod's status is Failed, always mark the workflow nodes on it as Failed. Previously you could get repeated `leaving phase un-changed: wait container is not yet terminated` log messages, with no path out of that state. This change provides a path out of that state and adds a unit test to show it works. It also acknowledges this state when reconciling ContainerSets.
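
As a rough sketch of the phase assessment (illustrative function name and string phases, not the controller's real types): a Failed pod phase now takes precedence over a wait container whose status never progressed past running.

```go
package cleanup

import (
	corev1 "k8s.io/api/core/v1"
)

// assessNodePhase is illustrative only. The key point: if the pod itself
// reports Failed, fail the node even when the wait container still looks
// like it is running, instead of leaving the node phase unchanged forever.
func assessNodePhase(pod *corev1.Pod) string {
	if pod.Status.Phase == corev1.PodFailed {
		// e.g. the kubelet's node vanished: trust the pod phase.
		return "Failed"
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "wait" && cs.State.Terminated == nil {
			// Wait container has not terminated; keep the node running.
			return "Running"
		}
	}
	return "Succeeded"
}
```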

Verification

Added unit tests

Ran this in production with pod finalizers on and a PodGC strategy enabled. Without this change (vanilla 3.6.2), pods regularly ended up stuck in Terminating with the finalizer still on them. That has not happened with this change.

@Joibel Joibel requested a review from isubasinghe January 15, 2025 15:54
@Joibel Joibel added the area/controller Controller issues, panics label Jan 15, 2025
@isubasinghe
Member

/retest

Member

@isubasinghe isubasinghe left a comment

This makes sense. I'm not entirely confident every edge case is accounted for, but let's ship it for now.

@Joibel Joibel marked this pull request as ready for review January 16, 2025 11:03
@Joibel Joibel requested review from jswxstw and shuangkun January 16, 2025 11:03
@Joibel
Member Author

Joibel commented Jan 16, 2025

@shuangkun and @jswxstw, I'd like to hear your thoughts on these changes if you have time.

@Joibel
Member Author

Joibel commented Jan 16, 2025

I realise now that #13491 does something similar (I originally wrote this patch against the 3.5 codebase).

In the scenario we're seeing with a non-evicted pod, the workflow node still gets stuck with #13491 in place.

case deletePod:
	woc.controller.queuePodForCleanupAfter(pod.Namespace, pod.Name, deletePod, delay)
case removeFinalizer:
@shuangkun
Member

Reasonable. Is it necessary to add

wfc.queuePodForCleanup(p.Namespace, p.Name, removeFinalizer) 

here?

@Joibel
Member Author

Yeah, oops. I had thought the final parameter was `action`. Will fix.

@isubasinghe
Member

Nice catch @shuangkun, I missed this in the review.

Member

Maybe change this section to this:

		action := determinePodCleanupAction(selector, pod.Labels, strategy, workflowPhase, pod.Status.Phase, pod.Finalizers)
		if action == deletePod {
			woc.controller.queuePodForCleanupAfter(pod.Namespace, pod.Name, action, delay)
		} else {
			woc.controller.queuePodForCleanup(pod.Namespace, pod.Name, action)
		}

@jswxstw
Member

jswxstw commented Jan 17, 2025

It is possible for a node to disappear from a cluster as a surprise. In this case the Pod ContainerStatus could remain in running (because the container never went into any further state), whilst the Pod's own Status is in Error. We have seen this in real clusters, but it is rare.

Indeed, this issue kubernetes/kubernetes#98718 in older versions of Kubernetes can also cause inconsistencies between pod status and container status, which can lead to the workflow getting stuck.
