KAFKA-16024: SaslPlaintextConsumerTest#testCoordinatorFailover is flaky #20774
This reverts commit 78828b0.
core/src/test/scala/integration/kafka/api/AbstractConsumerTest.scala
LGTM.
Run 100 times on local, no failure.
@brandboat In AbstractConsumerTest#brokerPropertyOverrides, it sets controlled.shutdown.enable=false. I think that is why we need to wait up to broker.session.timeout.ms for a new election. The initial intention was to speed up shutdown. If we modify the value to true for this case, can we avoid the flaky test?
@FrankYang0529, good point! Interestingly, setting this ends up slowing down the test instead. I think it's ok to set controlled.shutdown.enable=true in this test case, of course. I have to run another 500 runs to make sure the flakiness won't happen again.
I ran kafka.api.SaslSslConsumerTest.testCoordinatorFailover 500 times locally without any failures.
LGTM. Thanks for the patch.
broker.session.timeout.ms defaults to 9s. When a broker goes
offline, the group coordinator may take up to this long to be
re-elected.
leaves very little buffer. If the metadata hasn’t refreshed yet, the
consumer may still send an OFFSET_COMMIT request to the offline
coordinator, leading to transient failures.
This patch enables controlled.shutdown.enable to allow the broker to
notify the controller before shutting down. This speeds up the test by
triggering an immediate failover instead of waiting for the broker
session timeout (default: 9s) to expire.
Reviewers: TaiJuWu [email protected], PoAn Yang [email protected]
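For readers who want to see the shape of the change, here is a minimal sketch of turning on controlled.shutdown.enable in a set of broker properties. It is an illustration only, not the code from this patch: the class ControlledShutdownOverrideSketch and the helper withControlledShutdown are hypothetical names, and the real override lives in AbstractConsumerTest#brokerPropertyOverrides (Scala). The only Kafka-specific detail used is the config key itself.

```java
import java.util.Properties;

public class ControlledShutdownOverrideSketch {

    // Hypothetical helper (not part of Kafka): returns a copy of the given
    // broker properties with controlled shutdown switched on.
    static Properties withControlledShutdown(Properties brokerProps) {
        Properties copy = new Properties();
        copy.putAll(brokerProps);
        // controlled.shutdown.enable is the broker config discussed above;
        // "true" lets the broker notify the controller before it stops.
        copy.setProperty("controlled.shutdown.enable", "true");
        return copy;
    }

    public static void main(String[] args) {
        // Start from the value the test previously used.
        Properties before = new Properties();
        before.setProperty("controlled.shutdown.enable", "false");

        Properties after = withControlledShutdown(before);
        System.out.println(after.getProperty("controlled.shutdown.enable"));
    }
}
```

The helper copies the input instead of mutating it so the original settings stay available for comparison; an override hook in a test harness would more likely mutate the passed-in Properties directly.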