rls: Fix flaky test Test/ControlChannelConnectivityStateMonitoring #8055

Open
wants to merge 6 commits into
base: master

eshitachandwani
Member

Fixes #5468

The test flakes because:

    • The test creates a ClientConn with a service config that sets the LB policy to "rls".
    • The RLS LB policy initializes the control channel to the RLS server. As part of this, it spawns a goroutine to monitor connectivity state changes on the control channel.
    • By the time the first RPC succeeds, this goroutine may not have had a chance to run yet.
    • At this point, the test stops the RLS server. This moves the control channel to IDLE, and only now does the monitoring goroutine get to run, having already missed the first transition to READY.

FIX: Use a channel to make sure the goroutine has started.

  1. Our current state change API is lossy because state changes can be lost between WaitForStateChange returning and the caller invoking GetState.
    FIX: Use grpcsync.PubSub to subscribe to state changes so that we do not lose any of them.
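
To make the lossiness concrete, here is a minimal sketch that does not use gRPC at all. State, tracker, Set, Get and Subscribe are hypothetical stand-ins, not the grpc-go API. A caller that only reads the current value after several transitions sees just the last one, while a subscriber that was registered up front sees every transition in order.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical stand-in for connectivity.State.
type State string

const (
	Idle       State = "IDLE"
	Connecting State = "CONNECTING"
	Ready      State = "READY"
)

// tracker holds the current state and forwards every change to subscribers.
// It is an illustration only, not how grpc-go implements this.
type tracker struct {
	mu      sync.Mutex
	current State
	subs    []func(State)
}

// Set records a new state and notifies every subscriber in order.
func (t *tracker) Set(s State) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.current = s
	for _, fn := range t.subs {
		fn(s)
	}
}

// Get returns only the most recent state; earlier transitions are gone.
func (t *tracker) Get() State {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.current
}

// Subscribe registers a callback that is invoked for every later Set call.
func (t *tracker) Subscribe(fn func(State)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.subs = append(t.subs, fn)
}

// run drives three transitions and returns what a late poller and an early
// subscriber each observed.
func run() (polled State, seen []State) {
	t := &tracker{current: Idle}
	t.Subscribe(func(s State) { seen = append(seen, s) })
	for _, s := range []State{Connecting, Ready, Idle} {
		t.Set(s)
	}
	return t.Get(), seen
}

func main() {
	polled, seen := run()
	fmt.Println("poller sees:", polled)
	fmt.Println("subscriber sees:", seen)
}
```

In this sketch the poller never learns that READY happened, which mirrors the situation the flaky test runs into.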

RELEASE NOTES: N/A

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.

Project coverage is 82.26%. Comparing base (e0d191d) to head (b25261c).
Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| balancer/rls/control_channel.go | 90.32% | 1 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8055      +/-   ##
==========================================
- Coverage   82.29%   82.26%   -0.04%     
==========================================
  Files         387      387              
  Lines       39065    39083      +18     
==========================================
+ Hits        32150    32152       +2     
- Misses       5584     5612      +28     
+ Partials     1331     1319      -12     
| Files with missing lines | Coverage Δ |
| balancer/rls/control_channel.go | 88.66% <90.32%> (-4.47%) ⬇️ |

... and 31 files with indirect coverage changes

@easwars
Contributor

easwars commented Feb 3, 2025

Can you try 10K or 1M runs in forge before and after the fix to ensure that flakes are eliminated by the fix?

func (c *ccStateSubscriber) OnMessage(msg any) {
st, ok := msg.(connectivity.State)
if !ok {
return // Ignore invalid messages
Contributor

This is an error we don't expect to happen in practice and if it does, it indicates a severe programming error. I would be OK to add a panic here that includes the type being received.

Member Author

Right! Done.
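
For reference, a minimal sketch of what a panic that includes the received type could look like. State and subscriber are hypothetical stand-ins for connectivity.State and the subscriber type in this PR, and tryDeliver exists only so the example can show the panic message without crashing.

```go
package main

import "fmt"

// State is a hypothetical stand-in for connectivity.State.
type State int

// subscriber is a hypothetical stand-in for the subscriber type in this PR.
type subscriber struct {
	states []State
}

// OnMessage records state changes and panics on any other message type,
// naming the offending type in the panic message.
func (s *subscriber) OnMessage(msg any) {
	st, ok := msg.(State)
	if !ok {
		panic(fmt.Sprintf("received unexpected message type %T, want State", msg))
	}
	s.states = append(s.states, st)
}

// tryDeliver calls OnMessage and returns the panic message, if any, as a
// string. It returns "" when OnMessage does not panic.
func tryDeliver(s *subscriber, msg any) (panicMsg string) {
	defer func() {
		if r := recover(); r != nil {
			panicMsg = fmt.Sprint(r)
		}
	}()
	s.OnMessage(msg)
	return ""
}

func main() {
	s := &subscriber{}
	fmt.Println(tryDeliver(s, State(2)) == "")
	fmt.Println(tryDeliver(s, "READY"))
}
```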

stateSubscriber := &ccStateSubscriber{
state: buffer.NewUnbounded(),
}
unsubscribe := internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(cc.cc, stateSubscriber)
Contributor

When a new subscriber is added to the pubsub, it receives the most recent message posted on the pubsub. This means that if there were N messages posted on the pubsub when a new subscriber is added, the subscriber only receives the most recently posted message. This might not be good enough for our purposes here. So, I suggest making the following changes.

  • Get rid of the ccStateSubscriber. Instead store the buffer.Unbounded as a field of controlChannel.
  • Initialize the unbounded buffer when the control channel is created in newControlChannel.
  • Change grpc.Dial to grpc.NewClient in newControlChannel.
  • Register the subscriber right after creating the ClientConn to the RLS server, but before calling Connect on it. This will ensure that the subscriber will receive every single state change on the ClientConn.
    • Implement the OnMessage method on the controlChannel type and pass it to the call to internal.SubscribeToConnectivityStateChanges
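
The ordering concern above can be sketched with a toy pubsub. This is a synchronous, single-goroutine stand-in that only mimics the "new subscriber receives the most recent message" behavior described in this comment; pubSub, Subscribe, Publish and observe are hypothetical names and this is not the grpcsync implementation.

```go
package main

import "fmt"

// State is a hypothetical stand-in for connectivity.State.
type State string

// pubSub is a toy stand-in that replays only the most recently published
// message to a newly added subscriber. It is not safe for concurrent use.
type pubSub struct {
	last *State
	subs []func(State)
}

// Subscribe registers fn and, if anything was published before, delivers
// only the latest message to it.
func (p *pubSub) Subscribe(fn func(State)) {
	if p.last != nil {
		fn(*p.last)
	}
	p.subs = append(p.subs, fn)
}

// Publish remembers s as the latest message and delivers it to all
// current subscribers.
func (p *pubSub) Publish(s State) {
	p.last = &s
	for _, fn := range p.subs {
		fn(s)
	}
}

// observe publishes `before`, then subscribes, then publishes `after`, and
// returns everything the subscriber saw.
func observe(before, after []State) []State {
	p := &pubSub{}
	for _, s := range before {
		p.Publish(s)
	}
	var seen []State
	p.Subscribe(func(s State) { seen = append(seen, s) })
	for _, s := range after {
		p.Publish(s)
	}
	return seen
}

func main() {
	all := []State{"CONNECTING", "READY", "IDLE"}
	fmt.Println("subscribed before any state change:", observe(nil, all))
	fmt.Println("subscribed after the state changes:", observe(all, nil))
}
```

The late subscriber only sees IDLE and never learns about READY, which is why registering before calling Connect matters.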

Member Author

Done.

Comment on lines 178 to 179
unsubscribe()
stateSubscriber.state.Close()
Contributor

This would move to controlChannel.close.

Member Author

Done.

for {
// Wait for the control channel to become READY.
for s := cc.cc.GetState(); s != connectivity.Ready; s = cc.cc.GetState() {
var s any
Contributor

Can we define this variable to be of the concrete type connectivity.State instead?

Member Author

The buffer.Get() function, used here as <-stateSubscriber.state.Get(), returns a channel of type any, and we cannot do a type assertion on a channel directly. Also, we already do a type assertion in the OnMessage function, so IMO it should be okay. If we have to do the type assertion here, we will have to do it on the s value after each Get() call (we can make a helper function that does the get and the type assertion, for reuse). What would you suggest?
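
A possible shape for such a helper, as a sketch only. It reads from a plain channel of any rather than from buffer.Unbounded, and State and nextState are hypothetical names. As far as I know, the real buffer.Unbounded also expects a Load() call after each read from Get(); that step is left out here.

```go
package main

import "fmt"

// State is a hypothetical stand-in for connectivity.State.
type State int

const (
	Idle State = iota
	Connecting
	Ready
)

// nextState reads one value from ch and asserts it to State, so callers can
// work with the concrete type. ok is false if the channel was closed.
func nextState(ch <-chan any) (s State, ok bool) {
	v, open := <-ch
	if !open {
		return 0, false
	}
	st, isState := v.(State)
	if !isState {
		panic(fmt.Sprintf("unexpected message type %T, want State", v))
	}
	return st, true
}

func main() {
	ch := make(chan any, 2)
	ch <- Connecting
	ch <- Ready
	close(ch)
	for {
		s, ok := nextState(ch)
		if !ok {
			break
		}
		fmt.Println("is READY:", s == Ready)
	}
}
```

With a helper like this, the loop variable in monitorConnectivityState could be declared with the concrete state type, and the type assertion would live in one place.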

@@ -176,11 +197,15 @@ func (cc *controlChannel) monitorConnectivityState() {
first = false

// Wait for the control channel to move out of READY.
cc.cc.WaitForStateChange(ctx, connectivity.Ready)
if cc.cc.GetState() == connectivity.Shutdown {
for s = <-stateSubscriber.state.Get(); s == connectivity.Ready; s = <-stateSubscriber.state.Get() {
Contributor

Do we need a for loop here? We know for a fact that we are in READY. So, the first time we actually read anything out of the unbounded buffer, we can be sure that we have moved out of READY. Am I missing something?

Member Author

Right!
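
A sketch of the resulting loop shape, under stated assumptions: states arrive on a plain typed channel instead of the unbounded buffer, READY is never delivered twice in a row, and the backToReady counter stands in for whatever the real code does when the channel returns to READY. State, monitor and feed are hypothetical names.

```go
package main

import "fmt"

// State is a hypothetical stand-in for connectivity.State.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	Shutdown
)

// monitor consumes states in order and returns how many times the channel
// came back to READY after having left it. The first READY is not counted.
func monitor(states <-chan State) int {
	backToReady := 0
	first := true
	for {
		// Wait for READY. A loop is needed here because several
		// non-READY states may arrive before READY does.
		for {
			s, ok := <-states
			if !ok || s == Shutdown {
				return backToReady
			}
			if s == Ready {
				break
			}
		}
		if !first {
			backToReady++
		}
		first = false

		// We are in READY, so the next state read means we have moved out
		// of READY. A single read suffices; no loop is needed.
		s, ok := <-states
		if !ok || s == Shutdown {
			return backToReady
		}
	}
}

// feed returns a closed channel preloaded with the given states.
func feed(states ...State) <-chan State {
	ch := make(chan State, len(states))
	for _, s := range states {
		ch <- s
	}
	close(ch)
	return ch
}

func main() {
	n := monitor(feed(Connecting, Ready, Idle, Connecting, Ready, Shutdown))
	fmt.Println("returned to READY:", n)
}
```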

Successfully merging this pull request may close these issues.

Flaky test: 1/10k: ControlChannelConnectivityStateMonitoring
3 participants