-
Notifications
You must be signed in to change notification settings - Fork 235
[cuegui] Add gRPC keepalive and automatic channel recovery #2138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cuegui] Add gRPC keepalive and automatic channel recovery #2138
Conversation
9162eca to
d0316f3
Compare
Fix intermittent "Connection reset by peer" and "gRPC connection interrupted" errors that cause CueGUI to stop updating job and frame details. Changes: - Configure gRPC keepalive to preserve long-lived connections through load balancers and firewalls (30s ping interval, 10s timeout) - Add channel health monitoring with automatic reconnection after 3 consecutive failures - Integrate health tracking in FrameMonitorTree, FrameMonitor, and ThreadPool to detect and recover from stale connections - Add unit tests for connection health tracking functionality These changes prevent idle connections from being dropped by network infrastructure and ensure graceful recovery when connection issues occur.
d0316f3 to
9a03ea4
Compare
|
@DiegoTavares / @lithorus |
DiegoTavares
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this PR is going the right direction.
- The some of gprc'a channel configuration attributes need to be set on both server and client. This PR doesn't touch the server.
- The grpc reconnection design is fallible and not broad enough to handle all cases.
- The grpc channel health tracking mechanism pollutes the code with a static call to pycue's Cuebot class, which in my opinion doesn't make sense when the Cuebot class has everything it needs to track success/failure calls independently.
| ('grpc.keepalive_timeout_ms', keepalive_timeout_ms), | ||
| ('grpc.keepalive_permit_without_calls', keepalive_permit_without_calls), | ||
| # Allow client to send keepalive pings even without data | ||
| ('grpc.http2.max_pings_without_data', 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recommend removing empty ping limitations. If you open a python shell and import pycue for example, a connection will be created, if you don't sent any data, the channel will live forever until the shell is closed. This has the potential to overwhelm the server with too many opened empty channels.
The default value is 2, maybe we can increase it to 10. But given how noisy cuegui's communication with Cuebot is, I doubt there's a period of inactivity where pings are being sent without payload.
| # Minimum time between pings (allows more frequent pings) | ||
| ('grpc.http2.min_time_between_pings_ms', 10000), | ||
| # Don't limit ping strikes (server may reject too many pings) | ||
| ('grpc.http2.min_ping_interval_without_data_ms', 5000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mainly a server side configuration, and it needs to be set in both server and client. Reducing the empty-ping interval might have the opposite effect of what you expect.
Read grpc.config
Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings as described in A8-client-side-keepalive.md. Some scenarios where this can happen are
- if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false while the client has set this to true resulting in keepalive pings being sent even when there is no call in flight.
- if the client's GRPC_ARG_KEEPALIVE_TIME_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.
| return False | ||
|
|
||
| @staticmethod | ||
| def checkChannelHealth(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find where this method is being called outside of a unit test context. Am I missing something?
| # Record successful call for connection health tracking | ||
| Cuebot.recordSuccessfulCall() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a static method call to act as a channel health-check is error prune and not recommended. What is the rationale behind having this on frameMonitorTree and not LayerMonitorTree for example? It looks like a patch for a symptom and not the facing the actual illness.
If you want to implement a logic to keep track of successful calls, please use the class RetryOnRpcErrorClientInterceptor which intercepts every call to the grpc channel.
| # Record failed call and potentially reset the channel | ||
| if Cuebot.recordFailedCall(): | ||
| logger.info("Channel reset due to connection issues, retrying") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this check to RetryOnRpcErrorClientInterceptor as explained above.
| DEFAULT_KEEPALIVE_TIME_MS = 30000 # Send keepalive ping every 30 seconds | ||
| DEFAULT_KEEPALIVE_TIMEOUT_MS = 10000 # Wait 10 seconds for keepalive response | ||
| DEFAULT_KEEPALIVE_PERMIT_WITHOUT_CALLS = True # Send keepalive even when no active RPCs | ||
| DEFAULT_MAX_CONNECTION_IDLE_MS = 0 # Disable max idle time (keep connection open) | ||
| DEFAULT_MAX_CONNECTION_AGE_MS = 0 # Disable max connection age |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all these constants are being used when creating the channel. Please sanitize.
| keepalive_permit_without_calls = Cuebot.Config.get( | ||
| 'cuebot.keepalive_permit_without_calls', DEFAULT_KEEPALIVE_PERMIT_WITHOUT_CALLS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config has to be set on both server and client to have effect. As this is not configured on the server, it has no effect.
|
One way to implement this is to pass a control variable to |
|
Thanks for the review, @DiegoTavares . I'll submit a revised solution incorporating your recommendations shortly, most likely as a new PR. I’m closing this one for now. Since the issue is difficult to reproduce, I am keeping investigating for better solutions. I'll keep you posted on any progress. |
Link the Issue(s) this Pull Request is related to.
Summarize your change.
Fix intermittent "Connection reset by peer" and "gRPC connection interrupted" errors that cause CueGUI to stop updating job and frame details.
Changes:
These changes prevent idle connections from being dropped by network infrastructure and ensure graceful recovery when connection issues occur.