Conversation

Collaborator

@ramonfigueiredo ramonfigueiredo commented Jan 5, 2026

Link the Issue(s) this Pull Request is related to.

Summarize your change.
Fix intermittent "Connection reset by peer" and "gRPC connection interrupted" errors that cause CueGUI to stop updating job and frame details.

Changes:

  • Configure gRPC keepalive to preserve long-lived connections through load balancers and firewalls (30s ping interval, 10s timeout)
  • Add channel health monitoring with automatic reconnection after 3 consecutive failures
  • Integrate health tracking in FrameMonitorTree, FrameMonitor, and ThreadPool to detect and recover from stale connections

These changes prevent idle connections from being dropped by network infrastructure and ensure graceful recovery when connection issues occur.
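For reference, the keepalive behaviour described above reduces to a handful of gRPC channel options. A minimal sketch (the endpoint and constant names are illustrative, not the exact code in this change):

```python
import grpc

# Illustrative keepalive options matching the description above
# (30s ping interval, 10s timeout); not the exact constants from the PR.
KEEPALIVE_OPTIONS = [
    ('grpc.keepalive_time_ms', 30000),           # send a keepalive ping every 30s
    ('grpc.keepalive_timeout_ms', 10000),        # wait 10s for the ping ack
    ('grpc.keepalive_permit_without_calls', 1),  # ping even with no active RPCs
]

# 'cuebot-host:8443' is a placeholder endpoint.
channel = grpc.insecure_channel('cuebot-host:8443', options=KEEPALIVE_OPTIONS)
```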

@ramonfigueiredo ramonfigueiredo changed the title [CueGUI] Add gRPC keepalive and automatic channel recovery [cuegui] Add gRPC keepalive and automatic channel recovery Jan 5, 2026
@ramonfigueiredo ramonfigueiredo self-assigned this Jan 5, 2026
@ramonfigueiredo ramonfigueiredo force-pushed the fix/cuegui-grpc-connection-stability branch 2 times, most recently from 9162eca to d0316f3 Compare January 6, 2026 02:17
Fix intermittent "Connection reset by peer" and "gRPC connection interrupted" errors that cause CueGUI to stop updating job and frame details.

Changes:
- Configure gRPC keepalive to preserve long-lived connections through load balancers and firewalls (30s ping interval, 10s timeout)
- Add channel health monitoring with automatic reconnection after 3 consecutive failures
- Integrate health tracking in FrameMonitorTree, FrameMonitor, and ThreadPool to detect and recover from stale connections
- Add unit tests for connection health tracking functionality

These changes prevent idle connections from being dropped by network infrastructure and ensure graceful recovery when connection issues occur.
@ramonfigueiredo ramonfigueiredo force-pushed the fix/cuegui-grpc-connection-stability branch from d0316f3 to 9a03ea4 Compare January 6, 2026 02:37
@ramonfigueiredo ramonfigueiredo marked this pull request as ready for review January 6, 2026 02:51
@ramonfigueiredo
Collaborator Author

@DiegoTavares / @lithorus
Ready for review!

Collaborator

@DiegoTavares DiegoTavares left a comment


I don't think this PR is going in the right direction.

  • Some of gRPC's channel configuration attributes need to be set on both the server and the client. This PR doesn't touch the server.
  • The gRPC reconnection design is fallible and not broad enough to handle all cases.
  • The gRPC channel health tracking mechanism pollutes the code with static calls to pycue's Cuebot class, which in my opinion doesn't make sense when the Cuebot class has everything it needs to track successful/failed calls independently.

('grpc.keepalive_timeout_ms', keepalive_timeout_ms),
('grpc.keepalive_permit_without_calls', keepalive_permit_without_calls),
# Allow client to send keepalive pings even without data
('grpc.http2.max_pings_without_data', 0),

I don't recommend removing the limit on pings without data. If you open a Python shell and import pycue, for example, a connection will be created; if you never send any data, the channel will live forever until the shell is closed. This has the potential to overwhelm the server with too many open, empty channels.

The default value is 2; maybe we can increase it to 10. But given how noisy CueGUI's communication with Cuebot is, I doubt there's a period of inactivity during which pings would be sent without payload.
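For illustration only, a hedged sketch of that alternative (keeping a finite limit rather than disabling it; 10 is just the example value floated above):

```python
# The PR as written disables the limit entirely:
#     ('grpc.http2.max_pings_without_data', 0)   # 0 = unlimited data-less pings
# Sketch of the suggested alternative: keep a bounded limit (gRPC's default is 2).
options = [
    ('grpc.http2.max_pings_without_data', 10),
]
```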

# Minimum time between pings (allows more frequent pings)
('grpc.http2.min_time_between_pings_ms', 10000),
# Don't limit ping strikes (server may reject too many pings)
('grpc.http2.min_ping_interval_without_data_ms', 5000),

This is mainly a server-side configuration, and it needs to be set on both server and client. Reducing the empty-ping interval might have the opposite effect of what you expect.

From the gRPC documentation:

Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings as described in A8-client-side-keepalive.md. Some scenarios where this can happen are

  • if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false while the client has set this to true resulting in keepalive pings being sent even when there is no call in flight.
  • if the client's GRPC_ARG_KEEPALIVE_TIME_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.
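To illustrate the pairing the FAQ describes: the client's ping cadence has to be acceptable to the server side. Cuebot's server is Java, so the Python server below is purely a sketch of the matching options, not a proposed change:

```python
from concurrent import futures

import grpc

# Client: ping every 30s, even without in-flight calls.
client_channel = grpc.insecure_channel('cuebot-host:8443', options=[
    ('grpc.keepalive_time_ms', 30000),
    ('grpc.keepalive_permit_without_calls', 1),
])

# Server: must tolerate that cadence, otherwise it answers with
# GOAWAY / ENHANCE_YOUR_CALM. (Illustrative only; Cuebot is Java.)
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4), options=[
    ('grpc.http2.min_ping_interval_without_data_ms', 30000),
    ('grpc.keepalive_permit_without_calls', 1),
])
```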

return False

@staticmethod
def checkChannelHealth():

I can't find where this method is being called outside of a unit test context. Am I missing something?

Comment on lines +525 to +526
# Record successful call for connection health tracking
Cuebot.recordSuccessfulCall()

Using a static method call to act as a channel health check is error-prone and not recommended. What is the rationale behind having this in FrameMonitorTree and not in LayerMonitorTree, for example? It looks like a patch for a symptom rather than a treatment of the actual illness.

If you want to implement logic to keep track of successful calls, please use the RetryOnRpcErrorClientInterceptor class, which intercepts every call to the gRPC channel.
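As a rough sketch of that idea (names and structure are illustrative; pycue's actual RetryOnRpcErrorClientInterceptor differs), the interceptor can own the success/failure counters instead of the widgets:

```python
import grpc

class ConnectionHealthInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Sketch: track call outcomes centrally, inside the interceptor."""

    MAX_CONSECUTIVE_FAILURES = 3  # illustrative threshold

    def __init__(self):
        self.consecutive_failures = 0

    def intercept_unary_unary(self, continuation, client_call_details, request):
        call = continuation(client_call_details, request)
        # The returned object behaves like grpc.Call: code() is StatusCode.OK on success.
        if call.code() == grpc.StatusCode.OK:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return call

    def channel_looks_broken(self):
        return self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES
```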

Comment on lines +534 to +536
# Record failed call and potentially reset the channel
if Cuebot.recordFailedCall():
logger.info("Channel reset due to connection issues, retrying")

Move this check to RetryOnRpcErrorClientInterceptor as explained above.

Comment on lines +78 to +82
DEFAULT_KEEPALIVE_TIME_MS = 30000 # Send keepalive ping every 30 seconds
DEFAULT_KEEPALIVE_TIMEOUT_MS = 10000 # Wait 10 seconds for keepalive response
DEFAULT_KEEPALIVE_PERMIT_WITHOUT_CALLS = True # Send keepalive even when no active RPCs
DEFAULT_MAX_CONNECTION_IDLE_MS = 0 # Disable max idle time (keep connection open)
DEFAULT_MAX_CONNECTION_AGE_MS = 0 # Disable max connection age

Not all these constants are being used when creating the channel. Please sanitize.

Comment on lines +222 to +223
keepalive_permit_without_calls = Cuebot.Config.get(
'cuebot.keepalive_permit_without_calls', DEFAULT_KEEPALIVE_PERMIT_WITHOUT_CALLS)

This config has to be set on both server and client to have effect. As this is not configured on the server, it has no effect.

@DiegoTavares
Collaborator

One way to implement this is to pass a control variable to RetryOnRpcErrorClientInterceptor that flags when an intercepted call fails with a status code indicating the channel needs to be reset. Then, in getStub, check whether the flag has been set and, if so, reset the channel (resetChannel).
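A sketch of that approach, with hypothetical names (needs_reset, _channel, _service_map) standing in for whatever the real Cuebot plumbing uses; only RetryOnRpcErrorClientInterceptor, getStub, and resetChannel come from the comment above:

```python
import grpc

class RetryOnRpcErrorClientInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Sketch: raise a flag when a call fails with a channel-fatal status code."""

    # Hypothetical choice of codes that mean "reset the channel".
    FATAL_CODES = (grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.UNKNOWN)

    def __init__(self):
        self.needs_reset = False

    def intercept_unary_unary(self, continuation, client_call_details, request):
        call = continuation(client_call_details, request)
        if call.code() in self.FATAL_CODES:
            self.needs_reset = True
        return call


class Cuebot:
    """Only the parts relevant to the sketch; real pycue code differs."""

    interceptor = RetryOnRpcErrorClientInterceptor()
    _channel = None       # hypothetical handle to the current gRPC channel
    _service_map = {}     # hypothetical mapping of service name -> stub class

    @classmethod
    def resetChannel(cls):
        # Hypothetical: tear down and recreate the channel, interceptor attached.
        cls._channel = grpc.intercept_channel(
            grpc.insecure_channel('cuebot-host:8443'), cls.interceptor)

    @classmethod
    def getStub(cls, name):
        # Check the control flag before handing out a stub; reset if needed.
        if cls.interceptor.needs_reset:
            cls.interceptor.needs_reset = False
            cls.resetChannel()
        return cls._service_map[name](cls._channel)
```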

@ramonfigueiredo ramonfigueiredo marked this pull request as draft January 6, 2026 17:24
Collaborator Author

ramonfigueiredo commented Jan 6, 2026

Thanks for the review, @DiegoTavares.

I'll submit a revised solution incorporating your recommendations shortly, most likely as a new PR. I’m closing this one for now.

Since the issue is difficult to reproduce, I will keep investigating better solutions. I'll keep you posted on any progress.
