fix: pod finalizer removal and odd pod status #14088

Open

Joibel wants to merge 2 commits into main

Conversation

Joibel
Member

@Joibel Joibel commented Jan 15, 2025

Motivation

Finalizers

If pod finalizers are in use, they should not prevent pod deletion after the pod is complete.

For example: if you have a podGC.strategy of OnPodSuccess with a deleteDelayDuration set, and you delete the owning Workflow during the deleteDelayDuration, then the pod will remain until the deleteDelayDuration expires. If the workflow-controller is restarted during this window, the pod is orphaned with the finalizer still in place.

blockOwnerDeletion in the ownerReference of a pod does not prevent the owner (Workflow) from being deleted in all circumstances.

Wait Running whilst Pod Failed

It is possible for a node to disappear from a cluster as a surprise. In this case the Pod ContainerStatus could remain in running (because the container never went into any further state), whilst the Pod's own Status is in Error. We have seen this in real clusters, but it is rare.

This PR attempts to recognise this case and set the Workflow Node status accordingly.

Modifications

When a pod has a finalizer on it and its workflow node is Fulfilled, the finalizer is no longer needed, so always remove our finalizer if it is present. This allows the workflow to be deleted independently and ownerReference deletion to propagate and delete the pod. It also takes care of some race conditions, and of the case where the only reference to a completed pod is in the delayed cleanup queue, which is not persistent across restarts.
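
To illustrate the decision (a minimal sketch, not the actual controller code): the helper name `shouldRemoveFinalizer`, the caller-supplied `nodeFulfilled` flag, and the `workflows.argoproj.io/status` finalizer string are assumptions for this example.

```go
package cleanup

import (
	corev1 "k8s.io/api/core/v1"
)

// podStatusFinalizer is assumed here for illustration; the controller has
// its own constant for the finalizer it adds to pods.
const podStatusFinalizer = "workflows.argoproj.io/status"

// shouldRemoveFinalizer is a hypothetical helper: once the workflow node for
// this pod is Fulfilled, the finalizer protects nothing, so remove it
// unconditionally. This keeps the pod deletable via ownerReference
// propagation even if the delayed cleanup queue is lost across a restart.
func shouldRemoveFinalizer(pod *corev1.Pod, nodeFulfilled bool) bool {
	if !nodeFulfilled {
		return false
	}
	for _, f := range pod.Finalizers {
		if f == podStatusFinalizer {
			return true
		}
	}
	return false
}
```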

When a pod's status is Failed, always mark the workflow nodes on it as Failed. Previously you could get repeated `leaving phase un-changed: wait container is not yet terminated` log messages, with no path out of that state. This change provides a path out of that state and adds a unit test to show it works. It also acknowledges this state when reconciling ContainerSets.
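
As a rough sketch of the phase assessment (illustrative function name and string phases, not the controller's real types): a Failed pod phase now takes precedence over a wait container whose status never progressed past running.

```go
package cleanup

import (
	corev1 "k8s.io/api/core/v1"
)

// assessNodePhase is illustrative only. The key point: if the pod itself
// reports Failed, fail the node even when the wait container still looks
// like it is running, instead of leaving the node phase unchanged forever.
func assessNodePhase(pod *corev1.Pod) string {
	if pod.Status.Phase == corev1.PodFailed {
		// e.g. the kubelet's node vanished: trust the pod phase.
		return "Failed"
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "wait" && cs.State.Terminated == nil {
			// Wait container has not terminated; keep the node running.
			return "Running"
		}
	}
	return "Succeeded"
}
```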

Verification

Added unit tests

Ran this in production with pod finalizers on and a PodGC strategy enabled. Without this change (vanilla 3.6.2), pods regularly ended up stuck in Terminating with the finalizer still on them. That has not happened with this change.

@Joibel Joibel requested a review from isubasinghe January 15, 2025 15:54
@Joibel Joibel added the area/controller Controller issues, panics label Jan 15, 2025
@isubasinghe
Member

/retest

Member

@isubasinghe isubasinghe left a comment

This makes sense. I'm not entirely confident every edge case is accounted for, but let's ship it for now.

@Joibel Joibel marked this pull request as ready for review January 16, 2025 11:03
@Joibel Joibel requested review from jswxstw and shuangkun January 16, 2025 11:03
@Joibel
Member Author

Joibel commented Jan 16, 2025

@shuangkun and @jswxstw, I'd like to hear your thoughts on these changes if you have time.

@Joibel
Member Author

Joibel commented Jan 16, 2025

I realise now that #13491 does something similar (I originally wrote this patch against the 3.5 codebase).

In the scenario we're seeing with a non-evicted pod, the workflow node still gets stuck with #13491 in place.

case deletePod:
	woc.controller.queuePodForCleanupAfter(pod.Namespace, pod.Name, deletePod, delay)
case removeFinalizer:
@shuangkun
Member

Reasonable. Is it necessary to add

wfc.queuePodForCleanup(p.Namespace, p.Name, removeFinalizer) 

here?

@Joibel
Member Author

Yeah, oops. I had thought the final parameter was `action`. Will fix.

@isubasinghe
Member

Nice catch @shuangkun, I missed this in the review.

Member

Maybe change this section to this:

		action := determinePodCleanupAction(selector, pod.Labels, strategy, workflowPhase, pod.Status.Phase, pod.Finalizers)
		if action == deletePod {
			woc.controller.queuePodForCleanupAfter(pod.Namespace, pod.Name, action, delay)
		} else {
			woc.controller.queuePodForCleanup(pod.Namespace, pod.Name, action)
		}

@jswxstw
Member

jswxstw commented Jan 17, 2025

It is possible for a node to disappear from a cluster as a surprise. In this case the Pod ContainerStatus could remain in running (because the container never went into any further state), whilst the Pod's own Status is in Error. We have seen this in real clusters, but it is rare.

Indeed, this issue kubernetes/kubernetes#98718 in older versions of Kubernetes can also cause inconsistencies between pod status and container status, which can lead to the workflow getting stuck.
