Coordinator & Tablet server failover at same time may cause follower blocking in fetching #629
Closed
Fluss version
0.6.0 (latest release)
Please describe the bug 🐞
One bucket with 3 replicas: server0, server1, and server3. server0 is the leader.
1: The coordinator server fails. server0 and server1 also fail.
2: server0 and server1 recover, but their IPs/connections are reset. The follower server3 keeps fetching from the old connection/IP of server0.
3: The coordinator server recovers, finds that nothing has changed, performs no leader re-election, and sends notifyLeader to server3.
4: Since the leader has not changed, server3 is expected to skip becoming a follower, so it does not reset its fetcher for server0. The fetcher then hangs forever, which prevents the follower from joining the ISR.
Note: in step 4, skipping is the expected behavior, although the current code does not skip becoming a follower. See issue #620.
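To make step 4 concrete, here is a minimal sketch of the skip logic as I understand it. This is not Fluss code: the class `FollowerSkipSketch`, the method `onNotifyLeader`, and the idea that the notification carries the leader's current endpoint are all assumptions made for illustration.

```java
/** Hypothetical model of a follower handling a leader notification; not Fluss code. */
final class FollowerSkipSketch {
    private int leaderId;
    private int leaderEpoch;
    /** Endpoint the follower's fetcher is currently bound to. */
    private String fetcherEndpoint;

    FollowerSkipSketch(int leaderId, int leaderEpoch, String fetcherEndpoint) {
        this.leaderId = leaderId;
        this.leaderEpoch = leaderEpoch;
        this.fetcherEndpoint = fetcherEndpoint;
    }

    /**
     * Handles a leader notification (assumed shape).
     * Returns true if the fetcher was reset to the given endpoint.
     */
    boolean onNotifyLeader(int newLeaderId, int newLeaderEpoch, String currentEndpoint) {
        if (newLeaderId == leaderId && newLeaderEpoch == leaderEpoch) {
            // Leader unchanged: skip becoming follower.
            // The fetcher keeps whatever endpoint it had, even if it is stale.
            return false;
        }
        leaderId = newLeaderId;
        leaderEpoch = newLeaderEpoch;
        fetcherEndpoint = currentEndpoint;
        return true;
    }

    String fetcherEndpoint() {
        return fetcherEndpoint;
    }

    public static void main(String[] args) {
        // server3 follows server0 (leader id 0, epoch 5) at its old address.
        FollowerSkipSketch server3 = new FollowerSkipSketch(0, 5, "10.0.0.1:9123");
        // server0 came back on a new address, but leader id and epoch are unchanged.
        boolean reset = server3.onNotifyLeader(0, 5, "10.0.0.9:9123");
        System.out.println("fetcher reset: " + reset);
        System.out.println("fetcher endpoint: " + server3.fetcherEndpoint());
    }
}
```

In this model the fetcher stays bound to the old address after the notification, which is the stuck state described in step 4.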
Solution
We should probably follow Kafka's approach: in step 4, the fetch request should fail because the connection is broken, and the follower then retries and reconnects to server0.
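A rough sketch of what I mean, under stated assumptions: the names `ReconnectingFetcherSketch`, `LeaderConnection`, `ConnectionFactory`, and `fetchWithRetry` are hypothetical and are not Fluss or Kafka APIs. It also assumes a fetch over a broken connection surfaces as an `IOException` (for example via a request timeout) instead of blocking forever.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical fetcher that reconnects after a failed fetch; not Fluss or Kafka code. */
final class ReconnectingFetcherSketch {

    /** Hypothetical open connection to the leader. */
    interface LeaderConnection {
        /** Assumed to throw IOException when the connection is broken or times out. */
        String fetch() throws IOException;
    }

    /** Hypothetical factory that opens a connection to an endpoint. */
    interface ConnectionFactory {
        LeaderConnection connect(String endpoint) throws IOException;
    }

    private final Supplier<String> endpointResolver;
    private final ConnectionFactory factory;
    private LeaderConnection connection; // null means "connect before the next fetch"

    ReconnectingFetcherSketch(Supplier<String> endpointResolver, ConnectionFactory factory) {
        this.endpointResolver = endpointResolver;
        this.factory = factory;
    }

    /** Tries to fetch, re-resolving the leader endpoint and reconnecting after each failure. */
    String fetchWithRetry(int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (connection == null) {
                    // Look up the leader's current endpoint instead of reusing a cached one.
                    connection = factory.connect(endpointResolver.get());
                }
                return connection.fetch();
            } catch (IOException e) {
                last = e;
                connection = null; // drop the broken connection so the next attempt reconnects
            }
        }
        throw new IOException("fetch failed after " + maxAttempts + " attempts", last);
    }
}
```

The point of the design is that the recovery does not depend on the coordinator triggering a new become-follower transition: the failed fetch itself drops the stale connection, and the next attempt looks up server0's current address.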