Conversation

Collaborator

@ramonfigueiredo ramonfigueiredo commented Jan 5, 2026

Link the Issue(s) this Pull Request is related to.

Summarize your change.
Fix intermittent "Connection reset by peer" and "gRPC connection interrupted" errors that cause CueGUI to stop updating job and frame details.

Changes:

  • Configure gRPC keepalive to preserve long-lived connections through load balancers and firewalls (30s ping interval, 10s timeout)
  • Add channel health monitoring with automatic reconnection after 3 consecutive failures
  • Integrate health tracking in FrameMonitorTree, FrameMonitor, and ThreadPool to detect and recover from stale connections

These changes prevent idle connections from being dropped by network infrastructure and ensure graceful recovery when connection issues occur.
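For reference, the keepalive behaviour described above reduces to a handful of gRPC channel options. A minimal sketch (the endpoint and constant names are illustrative, not the exact code in this change):

```python
import grpc

# Illustrative keepalive options matching the description above
# (30s ping interval, 10s timeout); not the exact constants from the PR.
KEEPALIVE_OPTIONS = [
    ('grpc.keepalive_time_ms', 30000),           # send a keepalive ping every 30s
    ('grpc.keepalive_timeout_ms', 10000),        # wait 10s for the ping ack
    ('grpc.keepalive_permit_without_calls', 1),  # ping even with no active RPCs
]

# 'cuebot-host:8443' is a placeholder endpoint.
channel = grpc.insecure_channel('cuebot-host:8443', options=KEEPALIVE_OPTIONS)
```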

@ramonfigueiredo ramonfigueiredo changed the title [CueGUI] Add gRPC keepalive and automatic channel recovery [cuegui] Add gRPC keepalive and automatic channel recovery Jan 5, 2026
@ramonfigueiredo ramonfigueiredo self-assigned this Jan 5, 2026
@ramonfigueiredo ramonfigueiredo force-pushed the fix/cuegui-grpc-connection-stability branch 2 times, most recently from 9162eca to d0316f3 Compare January 6, 2026 02:17
Fix intermittent "Connection reset by peer" and "gRPC connection interrupted" errors that cause CueGUI to stop updating job and frame details.

Changes:
- Configure gRPC keepalive to preserve long-lived connections through load balancers and firewalls (30s ping interval, 10s timeout)
- Add channel health monitoring with automatic reconnection after 3 consecutive failures
- Integrate health tracking in FrameMonitorTree, FrameMonitor, and ThreadPool to detect and recover from stale connections
- Add unit tests for connection health tracking functionality

These changes prevent idle connections from being dropped by network infrastructure and ensure graceful recovery when connection issues occur.
@ramonfigueiredo ramonfigueiredo force-pushed the fix/cuegui-grpc-connection-stability branch from d0316f3 to 9a03ea4 Compare January 6, 2026 02:37
@ramonfigueiredo ramonfigueiredo marked this pull request as ready for review January 6, 2026 02:51
@ramonfigueiredo
Collaborator Author

@DiegoTavares / @lithorus
Ready for review!

Collaborator

@DiegoTavares DiegoTavares left a comment


I don't think this PR is going in the right direction.

  • Some of gRPC's channel configuration attributes need to be set on both the server and the client. This PR doesn't touch the server.
  • The gRPC reconnection design is fallible and not broad enough to handle all cases.
  • The gRPC channel health tracking mechanism pollutes the code with static calls to pycue's Cuebot class, which in my opinion doesn't make sense when the Cuebot class has everything it needs to track successful/failed calls independently.

('grpc.keepalive_timeout_ms', keepalive_timeout_ms),
('grpc.keepalive_permit_without_calls', keepalive_permit_without_calls),
# Allow client to send keepalive pings even without data
('grpc.http2.max_pings_without_data', 0),

I don't recommend removing the limit on pings without data. If you open a Python shell and import pycue, for example, a connection will be created; if you never send any data, the channel will live forever until the shell is closed. This has the potential to overwhelm the server with too many open, empty channels.

The default value is 2; maybe we can increase it to 10. But given how noisy CueGUI's communication with Cuebot is, I doubt there's a period of inactivity during which pings would be sent without payload.
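For illustration only, a hedged sketch of that alternative (keeping a finite limit rather than disabling it; 10 is just the example value floated above):

```python
# The PR as written disables the limit entirely:
#     ('grpc.http2.max_pings_without_data', 0)   # 0 = unlimited data-less pings
# Sketch of the suggested alternative: keep a bounded limit (gRPC's default is 2).
options = [
    ('grpc.http2.max_pings_without_data', 10),
]
```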

# Minimum time between pings (allows more frequent pings)
('grpc.http2.min_time_between_pings_ms', 10000),
# Don't limit ping strikes (server may reject too many pings)
('grpc.http2.min_ping_interval_without_data_ms', 5000),

This is mainly a server-side configuration, and it needs to be set on both server and client. Reducing the empty-ping interval might have the opposite effect of what you expect.

From the gRPC documentation:

Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings as described in A8-client-side-keepalive.md. Some scenarios where this can happen are

  • if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false while the client has set this to true resulting in keepalive pings being sent even when there is no call in flight.
  • if the client's GRPC_ARG_KEEPALIVE_TIME_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.
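To illustrate the pairing the FAQ describes: the client's ping cadence has to be acceptable to the server side. Cuebot's server is Java, so the Python server below is purely a sketch of the matching options, not a proposed change:

```python
from concurrent import futures

import grpc

# Client: ping every 30s, even without in-flight calls.
client_channel = grpc.insecure_channel('cuebot-host:8443', options=[
    ('grpc.keepalive_time_ms', 30000),
    ('grpc.keepalive_permit_without_calls', 1),
])

# Server: must tolerate that cadence, otherwise it answers with
# GOAWAY / ENHANCE_YOUR_CALM. (Illustrative only; Cuebot is Java.)
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4), options=[
    ('grpc.http2.min_ping_interval_without_data_ms', 30000),
    ('grpc.keepalive_permit_without_calls', 1),
])
```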

return False

@staticmethod
def checkChannelHealth():

I can't find where this method is being called outside of a unit test context. Am I missing something?

Comment on lines +525 to +526
# Record successful call for connection health tracking
Cuebot.recordSuccessfulCall()

Using a static method call to act as a channel health check is error-prone and not recommended. What is the rationale behind having this in FrameMonitorTree and not in LayerMonitorTree, for example? It looks like a patch for a symptom rather than a treatment of the actual illness.

If you want to implement logic to keep track of successful calls, please use the RetryOnRpcErrorClientInterceptor class, which intercepts every call to the gRPC channel.
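As a rough sketch of that idea (names and structure are illustrative; pycue's actual RetryOnRpcErrorClientInterceptor differs), the interceptor can own the success/failure counters instead of the widgets:

```python
import grpc

class ConnectionHealthInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Sketch: track call outcomes centrally, inside the interceptor."""

    MAX_CONSECUTIVE_FAILURES = 3  # illustrative threshold

    def __init__(self):
        self.consecutive_failures = 0

    def intercept_unary_unary(self, continuation, client_call_details, request):
        call = continuation(client_call_details, request)
        # The returned object behaves like grpc.Call: code() is StatusCode.OK on success.
        if call.code() == grpc.StatusCode.OK:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return call

    def channel_looks_broken(self):
        return self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES
```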

Comment on lines +534 to +536
# Record failed call and potentially reset the channel
if Cuebot.recordFailedCall():
logger.info("Channel reset due to connection issues, retrying")

Move this check to RetryOnRpcErrorClientInterceptor as explained above.

Comment on lines +78 to +82
DEFAULT_KEEPALIVE_TIME_MS = 30000 # Send keepalive ping every 30 seconds
DEFAULT_KEEPALIVE_TIMEOUT_MS = 10000 # Wait 10 seconds for keepalive response
DEFAULT_KEEPALIVE_PERMIT_WITHOUT_CALLS = True # Send keepalive even when no active RPCs
DEFAULT_MAX_CONNECTION_IDLE_MS = 0 # Disable max idle time (keep connection open)
DEFAULT_MAX_CONNECTION_AGE_MS = 0 # Disable max connection age

Not all these constants are being used when creating the channel. Please sanitize.

Comment on lines +222 to +223
keepalive_permit_without_calls = Cuebot.Config.get(
'cuebot.keepalive_permit_without_calls', DEFAULT_KEEPALIVE_PERMIT_WITHOUT_CALLS)

This config has to be set on both server and client to have effect. As this is not configured on the server, it has no effect.

@DiegoTavares
Collaborator

One way to implement this is to pass a control variable to RetryOnRpcErrorClientInterceptor that flags when an intercepted call fails with a status code indicating the channel needs to be reset. Then, in getStub, check whether the flag has been set and, if so, reset the channel (resetChannel).
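A sketch of that approach, with hypothetical names (needs_reset, _channel, _service_map) standing in for whatever the real Cuebot plumbing uses; only RetryOnRpcErrorClientInterceptor, getStub, and resetChannel come from the comment above:

```python
import grpc

class RetryOnRpcErrorClientInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Sketch: raise a flag when a call fails with a channel-fatal status code."""

    # Hypothetical choice of codes that mean "reset the channel".
    FATAL_CODES = (grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.UNKNOWN)

    def __init__(self):
        self.needs_reset = False

    def intercept_unary_unary(self, continuation, client_call_details, request):
        call = continuation(client_call_details, request)
        if call.code() in self.FATAL_CODES:
            self.needs_reset = True
        return call


class Cuebot:
    """Only the parts relevant to the sketch; real pycue code differs."""

    interceptor = RetryOnRpcErrorClientInterceptor()
    _channel = None       # hypothetical handle to the current gRPC channel
    _service_map = {}     # hypothetical mapping of service name -> stub class

    @classmethod
    def resetChannel(cls):
        # Hypothetical: tear down and recreate the channel, interceptor attached.
        cls._channel = grpc.intercept_channel(
            grpc.insecure_channel('cuebot-host:8443'), cls.interceptor)

    @classmethod
    def getStub(cls, name):
        # Check the control flag before handing out a stub; reset if needed.
        if cls.interceptor.needs_reset:
            cls.interceptor.needs_reset = False
            cls.resetChannel()
        return cls._service_map[name](cls._channel)
```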

@ramonfigueiredo ramonfigueiredo marked this pull request as draft January 6, 2026 17:24
Collaborator Author

ramonfigueiredo commented Jan 6, 2026

Thanks for the review, @DiegoTavares.

I'll submit a revised solution incorporating your recommendations shortly, most likely as a new PR. I’m closing this one for now.

Since the issue is difficult to reproduce, I will keep investigating better solutions. I'll keep you posted on any progress.
