Error updating workflow: rpc error: code = Unavailable needs retries #14100

Open
3 of 4 tasks
tooptoop4 opened this issue Jan 20, 2025 · 0 comments · May be fixed by #14107

tooptoop4 commented Jan 20, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

I am running 10,000s of workflows a day, and all have been fine except one that I noticed was still running for more than 24 hours. The only step within it showed as 'Running', but when I looked at the pod it was completed (all 3 containers inside had completed as well). I saw this in the workflow controller logs:

{"time":"2025-01-18T23:35:00.111391931Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.974Z\" level=info msg=\"Outbound nodes of mywf-2812951024 is [mywf-468570289]\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111400081Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.974Z\" level=info msg=\"node mywf-2812951024 phase Running -> Succeeded\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111405241Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.974Z\" level=info msg=\"node mywf-2812951024 finished: 2025-01-18 23:34:39.974935407 +0000 UTC\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111411672Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"node mywf-835224713 phase Running -> Succeeded\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111417542Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"node mywf-835224713 message: retryStrategy.expression evaluated to false\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111421852Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"node mywf-835224713 finished: 2025-01-18 23:34:39.975516899 +0000 UTC\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111427012Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"Updated phase Running -> Succeeded\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111431822Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"Marking workflow completed\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111437272Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"Marking workflow as pending archiving\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111453102Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:47.911Z\" level=warning msg=\"Waited for 7.93542039s, request: Update:https://clusteripredact:443/apis/argoproj.io/v1alpha1/namespaces/myns/workflows/mywf\""
{"time":"2025-01-18T23:35:00.111459222Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:47.911Z\" level=warning msg=\"Error updating workflow: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:52824->ip2redact:2379: read: connection timed out \" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:43.810561717Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:43.806Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"
{"time":"2025-01-18T23:36:28.781077898Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:55.733Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"
{"time":"2025-01-18T23:43:00.05557023Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:41:15.809Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"
{"time":"2025-01-18T23:50:00.102084482Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:49:57.811Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"

This seems to be the relevant code:

woc.log.Warnf("Error updating workflow: %v %s", err, apierr.ReasonForError(err))
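
For illustration only, here is a minimal sketch (not the controller's actual code) of what "needs retries" could look like: classify the gRPC "Unavailable" / etcd timeout errors from the logs above as transient and retry the workflow Update with a backoff. The names `isTransient`, `updateWorkflowWithRetry`, `wfClient`, and `wf` are placeholders I made up for this sketch.

```go
// Sketch: retry the workflow Update on transient API-server errors instead of
// logging a single warning and giving up. Placeholder names, not controller code.
package sketch

import (
	"context"
	"strings"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

// isTransient treats API-server throttling/timeouts and the "Unavailable"
// etcd read errors seen in this report as retryable.
func isTransient(err error) bool {
	if apierr.IsServerTimeout(err) || apierr.IsTooManyRequests(err) || apierr.IsServiceUnavailable(err) {
		return true
	}
	msg := err.Error()
	return strings.Contains(msg, "code = Unavailable") ||
		strings.Contains(msg, "etcdserver: request timed out")
}

// updateWorkflowWithRetry retries the Update with client-go's default backoff
// as long as the error is classified as transient.
func updateWorkflowWithRetry(ctx context.Context, wfClient wfclientset.Interface, wf *wfv1.Workflow) (*wfv1.Workflow, error) {
	var updated *wfv1.Workflow
	err := retry.OnError(retry.DefaultBackoff, isTransient, func() error {
		var uerr error
		updated, uerr = wfClient.ArgoprojV1alpha1().Workflows(wf.Namespace).Update(ctx, wf, metav1.UpdateOptions{})
		return uerr
	})
	return updated, err
}
```

This only illustrates the retry-on-transient-error idea; the actual fix would also need to handle conflicts and re-fetching the latest workflow object, which the linked PR addresses.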

Even manually deleting the pod did not help the workflow become Succeeded; it stays Running.

I have seen a couple of other 'Unavailable' errors around the same time; however, all of those workflows appear to have completed gracefully:

2nd case (note that it is a Delete, not an Update):

{"time":"2025-01-18T23:35:00.111465423Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:47.913Z\" level=warning msg=\"Waited for 14.912253722s, request: Delete:https://clusteripredact:443/apis/argoproj.io/v1alpha1/namespaces/myns/workflows/anotherwf\""
{"time":"2025-01-18T23:35:00.111471863Z","stream":"stdout","_p":"F","log":"E0118 23:34:47.913688       7 gc_controller.go:170] rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:52824->ip3redact:2379: read: connection timed out"

3rd case (seems to be for pod creation):

{"time":"2025-01-18T23:35:23.364429081Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.329Z\" level=warning msg=\"Waited for 6.247140327s, request: Create:https://clusteripredact:443/api/v1/namespaces/myns/pods\""
{"time":"2025-01-18T23:35:23.364484852Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"Transient error: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\""
{"time":"2025-01-18T23:35:23.364491832Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"Transient error: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\""
{"time":"2025-01-18T23:35:23.364501663Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"Mark node yetanotherwf(0) as Pending, due to: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\" namespace=myns workflow=yetanotherwf"
{"time":"2025-01-18T23:35:23.364508413Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"node yetanotherwf-162523283 message: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\" namespace=myns workflow=yetanotherwf"
{"time":"2025-01-18T23:35:27.138726861Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:26.333Z\" level=info msg=\"node changed\" namespace=myns new.message= new.phase=Running new.progress=0/1 nodeID=yetanotherwf-162523283 old.message=\"rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\" old.phase=Pending old.progress=0/1 workflow=yetanotherwf"

4th case (seems to be a DeleteCollection of workflowtaskresults):

{"time":"2025-01-18T23:37:00.05854048Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:36:43.888Z\" level=warning msg=\"Waited for 10.540198872s, request: DeleteCollection:https://clusteripredact:443/apis/argoproj.io/v1alpha1/namespaces/myns/workflowtaskresults?labelSelector=workflows.argoproj.io%2Fworkflow%3Deventmorewf\""

Even activeDeadlineSeconds is being ignored!

I am also seeing "Non-transient error: etcdserver: request timed out".

Version(s)

3.4.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

see above

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

n/a