During production support I noticed an issue: when cadence-frontend goes down for some reason, the Java client does not try to reconnect to cadence-frontend.
This leads to an outage of the client worker node.
Hopefully an auto-reconnect feature can be added to cadence-java-client. In our use case we deploy Cadence on ECS Fargate, and due to AWS issues the service can restart for no explicit reason. In that case all the worker nodes lose their connection.
Using Quartz ^ seems like a good option, but also overkill.
Hello @szaluzhskiy @jchenseated, which versions are you using? I tried to reproduce it with 2.7.8, but it's working as expected.
What I did:
After starting the workers and making sure they were up by running some hello world workflows, I shut down the Cadence server.
After ~2 minutes, the client throws this exception once the request times out:
15:34:14.079 [Workflow Poller taskList="HelloActivity", domain="sample": 1] ERROR c.u.cadence.internal.worker.Poller - Failure in thread Workflow Poller taskList="HelloActivity", domain="sample": 1
org.apache.thrift.TException: Rpc error:<ErrorResponse id=35 errorType=Timeout message=Request timeout after 121003ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:505)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:480)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.pollForDecisionTask(WorkflowServiceTChannel.java:860)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$PollForDecisionTask$9(WorkflowServiceTChannel.java:848)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:525)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.PollForDecisionTask(WorkflowServiceTChannel.java:847)
at com.uber.cadence.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:74)
at com.uber.cadence.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:34)
at com.uber.cadence.internal.worker.Poller$PollExecutionTask.run(Poller.java:254)
at com.uber.cadence.internal.worker.Poller$PollLoopTask.run(Poller.java:225)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
helloHello noop activity1Indeed
helloHello noop activity2Indeed
helloHello noop activity3Indeed
Then I started the server again, and the client reconnected to it and was able to run workflows.
I guess the problem is that rpcLongPollTimeoutMillis is 125 seconds (~2 minutes) by default. So when the server restarts, your worker can experience about 2 minutes of downtime. (I think we should decrease the default to 10 seconds; I will make a PR later.)
If that's the problem, you can use setRpcLongPollTimeout in ClientOptions to adjust the value.
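To make the arithmetic behind that guess concrete, here is a minimal sketch of a simplified model. It is not Cadence's actual Poller logic. It assumes that a poll issued while the server is unreachable hangs until the long poll timeout expires, and that the next poll is issued immediately afterwards with no backoff. The class and method names are hypothetical and exist only for this illustration.

```java
// Simplified model (an assumption, not the real cadence-java-client Poller):
// every poll issued while the server is down blocks for the full long poll
// timeout, then the worker immediately issues the next poll.
final class LongPollRecoveryModel {

    /**
     * Returns the time (ms) at which the first poll that can reach the
     * restarted server is issued, given when polling started, the long poll
     * timeout, and the time at which the server is reachable again.
     */
    static long firstSuccessfulPollStartMillis(
            long firstPollStartMillis, long longPollTimeoutMillis, long serverBackAtMillis) {
        if (longPollTimeoutMillis <= 0) {
            throw new IllegalArgumentException("longPollTimeoutMillis must be positive");
        }
        long t = firstPollStartMillis;
        // Each poll started before the server is back is assumed to time out.
        while (t < serverBackAtMillis) {
            t += longPollTimeoutMillis;
        }
        return t;
    }

    /**
     * Extra downtime (ms) the worker sees after the server is already back,
     * for a poll that was issued at time 0, the moment the server went down.
     */
    static long extraDowntimeMillis(long longPollTimeoutMillis, long serverBackAtMillis) {
        return firstSuccessfulPollStartMillis(0, longPollTimeoutMillis, serverBackAtMillis)
                - serverBackAtMillis;
    }

    public static void main(String[] args) {
        long serverBackAtMillis = 30_000; // hypothetical 30 second restart
        for (long timeout : new long[] {125_000, 10_000}) {
            System.out.printf(
                    "timeout=%d ms -> extra downtime=%d ms%n",
                    timeout, extraDowntimeMillis(timeout, serverBackAtMillis));
        }
    }
}
```

Under these assumptions, a 30 second restart costs the worker an extra 95 seconds with a 125 second timeout and no extra time with a 10 second timeout. In this model the extra downtime is always shorter than the timeout itself, which is why lowering the value shortens recovery.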
Hi @longquanzheng, I also experienced the same issue, with errorResponse id=-1. I am also on version 2.7.8 and still get the same errors. By the way, are you still facing the same issue, @szaluzhskiy @jchenseated? Thank you.
Activity
jchenseated commented on Sep 10, 2020
szaluzhskiy commented on Sep 10, 2020
@jchenseated It seems that, unfortunately, there is no reconnection option in cadence-java-client at the moment.
jchenseated commented on Sep 11, 2020
@szaluzhskiy Yeah. Does the community remedy work for you, though?
szaluzhskiy commented on Nov 10, 2020
yes, it works
longquanzheng commented on Dec 15, 2020
longquanzheng commented on Dec 16, 2020
ekomanurung commented on May 2, 2023