Skip to content

Conversation

@brandboat
Copy link
Member

@brandboat brandboat commented Oct 25, 2025

  • broker.session.timeout.ms defaults to 9s. When a broker goes
    offline, the group coordinator may take up to this long to be
    re-elected.
  • The commit callback retry timeout is currently 10 seconds, which
    leaves very little buffer. If the metadata hasn’t refreshed yet, the
    consumer may still send an OFFSET_COMMIT request to the offline
    coordinator, leading to transient failures.

This patch enable controlled.shutdown.enable to allow the broker to
notify the controller before shutting down. This speeds up the test by
triggering an immediate failover instead of waiting for the broker
session timeout (default: 9s) to expire.

Reviewers: TaiJuWu [email protected], PoAn Yang [email protected]

@github-actions github-actions bot added core Kafka Broker tests Test fixes (including flaky tests) small Small PRs labels Oct 25, 2025
@brandboat brandboat changed the title (WIP) KAFKA-16024: SaslPlaintextConsumerTest#testCoordinatorFailover is flaky KAFKA-16024: SaslPlaintextConsumerTest#testCoordinatorFailover is flaky Oct 27, 2025
This reverts commit 78828b0.
Copy link
Collaborator

@TaiJuWu TaiJuWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Run 100 times on local, no failure.

Copy link
Member

@FrankYang0529 FrankYang0529 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandboat In AbstractConsumerTest#brokerPropertyOverrides, it sets controlled.shutdown.enable=false. I think that is why we need to wait up to broker.session.timeout.ms for a new election. The initial intention was to speed up shutdown. If we modify the value to true for this case, can we avoid the flaky test?

@brandboat
Copy link
Member Author

The initial intention was to speed up shutdown.

@FrankYang0529, good point! Interestingly, setting this ends up slowing down the test instead. I think it's ok to set controlled.shutdown.enable=true in this test case, of course. I have to run another 500 runs to make sure flaky won't happen again.

@brandboat brandboat marked this pull request as draft October 29, 2025 09:09
@brandboat brandboat marked this pull request as ready for review October 29, 2025 10:33
@brandboat
Copy link
Member Author

I ran kafka.api.SaslSslConsumerTest.testCoordinatorFailover 500 times locally without any failures. ./gradlew core:test --tests kafka.api.SaslSslConsumerTest.testCoordinatorFailover runtime dropped from 31s → 19s on my local dev.

Copy link
Member

@FrankYang0529 FrankYang0529 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the patch.

@FrankYang0529 FrankYang0529 merged commit 2daf5be into apache:trunk Oct 29, 2025
31 of 32 checks passed
@brandboat brandboat deleted the KAFKA-16024 branch October 29, 2025 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Kafka Broker small Small PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants