Commit
fix: controller: ensure workflow reconciles task results properly when failing to receive timely updates from the API server

Error scenario: a pod for a step in a workflow has completed, and its task result was properly created and finalized by its wait container (judging from the exit status of the wait container). However, the task result informer in the controller leader has not received any update about it, because the API server or etcd is overloaded.

Currently, the Argo Workflows controller does not handle this scenario properly. It marks the workflow node as succeeded and shows no artifact outputs, even though the artifacts have already been uploaded to the repository.

We ran into this situation in our production instance (v3.5.8). The problem is not easy to reproduce, but a manual fault injection in `workflow/controller/taskresult.go:func (woc *wfOperationCtx) taskResultReconciliation()` simulates it, and with it I reproduced the issue on release v3.6.2:

```diff
+++ workflow/controller/taskresult.go
@@ -1,7 +1,9 @@
 package controller
 
 import (
+	"os"
 	"reflect"
+	"strings"
 	"time"
 
 	log "github.com/sirupsen/logrus"
@@ -62,6 +64,12 @@ func (woc *wfOperationCtx) taskResultReconciliation() {
 	objs, _ := woc.controller.taskResultInformer.GetIndexer().ByIndex(indexes.WorkflowIndex, woc.wf.Namespace+"/"+woc.wf.Name)
 	woc.log.WithField("numObjs", len(objs)).Info("Task-result reconciliation")
+	if strings.Contains(woc.wf.Name, "-xhu-debug-") {
+		if _, err := os.Stat("/tmp/xhu-debug-control"); err != nil {
+			return
+		}
+	}
```

The change forcefully marks the workflow as having an incomplete TaskResult in assessNodeStatus.

This fix does not handle the case where a pod failed. There are too many potential failure scenarios (for example, the wait container might not be able to insert a task result), and a retry is probably needed after a failure anyway. The loss in that case is probably not as great as for a successful pod.

Signed-off-by: Xiaofan Hu <[email protected]>
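
For illustration only, the condition described above can be sketched as a standalone Go program. This is not the code from this commit: `podView` and `taskResultOutstanding` are hypothetical names, and the sketch assumes that a zero exit code from the wait container means its task result was written, as the message infers. It only shows when a node would be treated as still waiting for its task result instead of being finalized without outputs.

```go
package main

import "fmt"

// podView is a hypothetical, simplified summary of one step's pod.
// It is not an Argo Workflows type.
type podView struct {
	Succeeded          bool // the pod completed successfully
	WaitExitCode       int  // exit code of the wait container
	TaskResultObserved bool // the controller's informer has delivered the task result
}

// taskResultOutstanding reports whether the pod looks finished and its wait
// container exited cleanly, while the controller has not yet observed the
// task result. Failed pods are deliberately not covered, matching the scope
// stated in the commit message.
func taskResultOutstanding(p podView) bool {
	return p.Succeeded && p.WaitExitCode == 0 && !p.TaskResultObserved
}

func main() {
	stale := podView{Succeeded: true, WaitExitCode: 0, TaskResultObserved: false}
	synced := podView{Succeeded: true, WaitExitCode: 0, TaskResultObserved: true}

	fmt.Println(taskResultOutstanding(stale))  // true: keep the workflow marked incomplete
	fmt.Println(taskResultOutstanding(synced)) // false: safe to finalize the node
}
```

In the fault-injection setup above, the `stale` case corresponds to the window in which `/tmp/xhu-debug-control` does not exist and reconciliation returns early.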