Error updating workflow: rpc error: code = Unavailable needs retries #14100

Open
3 of 4 tasks
tooptoop4 opened this issue Jan 20, 2025 · 0 comments · May be fixed by #14107

tooptoop4 commented Jan 20, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

I am running 10,000s of workflows a day, and all have been fine except one that I noticed was still running for more than 24 hours. The only step within it showed as 'Running', but when I looked at the pod it was completed (all 3 containers inside had completed as well). I saw this in the workflow controller logs:

{"time":"2025-01-18T23:35:00.111391931Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.974Z\" level=info msg=\"Outbound nodes of mywf-2812951024 is [mywf-468570289]\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111400081Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.974Z\" level=info msg=\"node mywf-2812951024 phase Running -> Succeeded\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111405241Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.974Z\" level=info msg=\"node mywf-2812951024 finished: 2025-01-18 23:34:39.974935407 +0000 UTC\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111411672Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"node mywf-835224713 phase Running -> Succeeded\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111417542Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"node mywf-835224713 message: retryStrategy.expression evaluated to false\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111421852Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"node mywf-835224713 finished: 2025-01-18 23:34:39.975516899 +0000 UTC\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111427012Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"Updated phase Running -> Succeeded\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111431822Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"Marking workflow completed\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111437272Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:39.975Z\" level=info msg=\"Marking workflow as pending archiving\" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:00.111453102Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:47.911Z\" level=warning msg=\"Waited for 7.93542039s, request: Update:https://clusteripredact:443/apis/argoproj.io/v1alpha1/namespaces/myns/workflows/mywf\""
{"time":"2025-01-18T23:35:00.111459222Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:47.911Z\" level=warning msg=\"Error updating workflow: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:52824->ip2redact:2379: read: connection timed out \" namespace=myns workflow=mywf"
{"time":"2025-01-18T23:35:43.810561717Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:43.806Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"
{"time":"2025-01-18T23:36:28.781077898Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:55.733Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"
{"time":"2025-01-18T23:43:00.05557023Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:41:15.809Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"
{"time":"2025-01-18T23:50:00.102084482Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:49:57.811Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=myns/mywf"

This seems to be the relevant code:

woc.log.Warnf("Error updating workflow: %v %s", err, apierr.ReasonForError(err))
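
For illustration only, here is a minimal sketch (not the controller's actual code) of what "needs retries" could look like: classify the gRPC "Unavailable" / etcd timeout errors from the logs above as transient and retry the workflow Update with a backoff. The names `isTransient`, `updateWorkflowWithRetry`, `wfClient`, and `wf` are placeholders I made up for this sketch.

```go
// Sketch: retry the workflow Update on transient API-server errors instead of
// logging a single warning and giving up. Placeholder names, not controller code.
package sketch

import (
	"context"
	"strings"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

// isTransient treats API-server throttling/timeouts and the "Unavailable"
// etcd read errors seen in this report as retryable.
func isTransient(err error) bool {
	if apierr.IsServerTimeout(err) || apierr.IsTooManyRequests(err) || apierr.IsServiceUnavailable(err) {
		return true
	}
	msg := err.Error()
	return strings.Contains(msg, "code = Unavailable") ||
		strings.Contains(msg, "etcdserver: request timed out")
}

// updateWorkflowWithRetry retries the Update with client-go's default backoff
// as long as the error is classified as transient.
func updateWorkflowWithRetry(ctx context.Context, wfClient wfclientset.Interface, wf *wfv1.Workflow) (*wfv1.Workflow, error) {
	var updated *wfv1.Workflow
	err := retry.OnError(retry.DefaultBackoff, isTransient, func() error {
		var uerr error
		updated, uerr = wfClient.ArgoprojV1alpha1().Workflows(wf.Namespace).Update(ctx, wf, metav1.UpdateOptions{})
		return uerr
	})
	return updated, err
}
```

This only illustrates the retry-on-transient-error idea; the actual fix would also need to handle conflicts and re-fetching the latest workflow object, which the linked PR addresses.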

Even manually deleting the pod did not help the workflow become Succeeded; it stays Running.

I have seen a couple of other 'Unavailable' errors around the same time; however, all of those workflows appear to have completed gracefully:

2nd case (note that it is a Delete, not an Update):

{"time":"2025-01-18T23:35:00.111465423Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:34:47.913Z\" level=warning msg=\"Waited for 14.912253722s, request: Delete:https://clusteripredact:443/apis/argoproj.io/v1alpha1/namespaces/myns/workflows/anotherwf\""
{"time":"2025-01-18T23:35:00.111471863Z","stream":"stdout","_p":"F","log":"E0118 23:34:47.913688       7 gc_controller.go:170] rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:52824->ip3redact:2379: read: connection timed out"

3rd case (seems to be for pod creation):

{"time":"2025-01-18T23:35:23.364429081Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.329Z\" level=warning msg=\"Waited for 6.247140327s, request: Create:https://clusteripredact:443/api/v1/namespaces/myns/pods\""
{"time":"2025-01-18T23:35:23.364484852Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"Transient error: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\""
{"time":"2025-01-18T23:35:23.364491832Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"Transient error: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\""
{"time":"2025-01-18T23:35:23.364501663Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"Mark node yetanotherwf(0) as Pending, due to: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\" namespace=myns workflow=yetanotherwf"
{"time":"2025-01-18T23:35:23.364508413Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:16.330Z\" level=info msg=\"node yetanotherwf-162523283 message: rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\" namespace=myns workflow=yetanotherwf"
{"time":"2025-01-18T23:35:27.138726861Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:35:26.333Z\" level=info msg=\"node changed\" namespace=myns new.message= new.phase=Running new.progress=0/1 nodeID=yetanotherwf-162523283 old.message=\"rpc error: code = Unavailable desc = error reading from server: read tcp ip1redact:36478->ip2redact:2379: read: connection timed out\" old.phase=Pending old.progress=0/1 workflow=yetanotherwf"

4th case (seems to be a DeleteCollection of workflowtaskresults):

{"time":"2025-01-18T23:37:00.05854048Z","stream":"stdout","_p":"F","log":"time=\"2025-01-18T23:36:43.888Z\" level=warning msg=\"Waited for 10.540198872s, request: DeleteCollection:https://clusteripredact:443/apis/argoproj.io/v1alpha1/namespaces/myns/workflowtaskresults?labelSelector=workflows.argoproj.io%2Fworkflow%3Deventmorewf\""

Even activeDeadlineSeconds is being ignored!

I am also seeing "Non-transient error: etcdserver: request timed out".

Version(s)

3.4.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

see above

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

n/a