Coordinator & Tablet server failover at same time may cause follower blocking in fetching #629

luoyuxia · 2025-03-18T12:41:30Z

Search before asking

I searched in the issues and found nothing similar.

Fluss version

0.6.0 (latest release)

Please describe the bug 🐞

One bucket has 3 replicas, on server0, server1, and server3; server0 is the leader.
1: The coordinator server fails, and server0 and server1 fail as well.
2: server0 and server1 recover, but their IPs/connections are reset; the follower server3 keeps fetching over the old connection/IP of server0.
3: The coordinator server recovers, finds that nothing has changed, performs no leader re-election, and sends notifyLeader to server3.
4: Since the leader has not changed, server3 is expected to skip becoming a follower, so it won't reset its fetcher for server0. The fetcher then hangs forever, and the follower can't join the ISR.

Note: in step 4, skipping is the expected behavior, although the current code does not skip becoming a follower. See issue #620.

Solution

We should probably follow Kafka's approach: in step 4, the fetch request should fail because the connection is broken, and the follower should then retry by reconnecting to server0.
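A minimal sketch of that fail-then-reconnect behavior, under stated assumptions: `ReconnectingFetcher`, `LeaderConnection`, and the `connector` supplier are hypothetical names for illustration only, not Fluss or Kafka APIs. The idea is that a failed fetch drops the stale connection, so the next attempt re-resolves the leader's current address instead of waiting on the old one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are Fluss or Kafka APIs.
final class ReconnectingFetcher {

    /** Stand-in for a connection to the leader; fetch() fails once the connection is broken. */
    interface LeaderConnection {
        String fetch() throws IOException;
    }

    // Assumed to re-resolve the leader's current address and open a fresh connection on each call.
    private final Supplier<LeaderConnection> connector;
    private LeaderConnection connection;

    ReconnectingFetcher(Supplier<LeaderConnection> connector) {
        this.connector = connector;
    }

    /** Fetches from the leader, discarding the connection and reconnecting after each failure. */
    String fetchWithRetry(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (connection == null) {
                    connection = connector.get();
                }
                return connection.fetch();
            } catch (IOException e) {
                last = e;
                connection = null; // drop the stale connection so the next attempt reconnects
            }
        }
        throw new UncheckedIOException("fetch failed after " + maxAttempts + " attempts", last);
    }
}
```

This only helps if a fetch on a dead connection actually fails rather than blocking, so it presumably needs to be paired with some form of request timeout.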

Are you willing to submit a PR?

I'm willing to submit a PR!

luoyuxia · 2025-03-18T12:42:54Z

cc @swuferhong

swuferhong · 2025-03-19T01:13:41Z

@luoyuxia I also found that all synchronous CompletableFuture.get() calls in the server, such as RemoteLeaderEndpoint#fetchLocalLogEndOffset().get(), RemoteLeaderEndpoint#fetchLocalLogStartOffset().get(), RemoteLeaderEndpoint#fetchLog().get(), and LogTieringTask#commitRemoteLogManifest(), could block indefinitely if the CoordinatorServer or another TabletServer shuts down or its IP changes.

The reasons are as you said:

Our network client doesn't have a timeout mechanism, so a request may hang forever.
Currently, the server has no mechanism to update its Gateway.
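To illustrate the first point, here is a rough sketch of a bounded wait built on the standard `CompletableFuture.get(long, TimeUnit)` overload. `BoundedWait`, `await`, and `RequestTimeoutException` are hypothetical names, and this is not necessarily how the network client should implement timeouts; it only shows that a bounded `get` turns a silent hang into a failure the caller can handle.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Fluss.
final class BoundedWait {

    /** Unchecked signal that the response did not arrive within the allowed time. */
    static final class RequestTimeoutException extends RuntimeException {
        RequestTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Waits for the future for at most the given timeout instead of blocking forever. */
    static <T> T await(CompletableFuture<T> future, Duration timeout) {
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // give up on the pending request
            throw new RequestTimeoutException("no response within " + timeout, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("request failed", e.getCause());
        }
    }
}
```

A caller could then catch the timeout, refresh its view of the target server, and retry, rather than staying stuck on a connection that will never answer.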

We may need to resolve these together, so I created a parent issue, #632, to track this. cc @wuchong

luoyuxia · 2025-03-27T07:35:21Z

Should be fixed via #633

luoyuxia mentioned this issue Mar 18, 2025

Coordinator failover will cause isr jitter #620

Closed

2 tasks

luoyuxia added component=server component=kv labels Mar 18, 2025

luoyuxia added this to the v0.7 milestone Mar 18, 2025

luoyuxia closed this as completed Mar 27, 2025
