Coordinator & Tablet server failover at same time may cause follower blocking in fetching #629

Closed
1 of 2 tasks
luoyuxia opened this issue Mar 18, 2025 · 3 comments

Comments

@luoyuxia
Collaborator

luoyuxia commented Mar 18, 2025

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.6.0 (latest release)

Please describe the bug 🐞

One bucket has 3 replicas on server0, server1, and server3; server0 is the leader.
1: The coordinator server fails; server0 and server1 also fail.
2: server0 and server1 recover, but their IP/connection has been reset. The follower, server3, keeps fetching over the old connection/IP of server0.
3: The coordinator server recovers, finds that nothing has changed, performs no leader re-election, and sends notifyLeader to server3.
4: Since the leader has not changed, server3 is expected to skip becoming a follower, so it won't reset its fetcher for server0. The fetch then hangs forever, which prevents the follower from joining the ISR.

Note: in step 4, skipping is the expected behavior, although the current code does not skip becoming a follower. See issue #620.
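
To make step 4 concrete, here is a minimal sketch of a follower that skips a notifyLeader when the leader id is unchanged. `FollowerState`, `onNotifyLeader`, and the endpoints are hypothetical names and made-up values for illustration, not Fluss classes or APIs; the sketch only shows why the fetcher keeps pointing at the stale endpoint.

```java
public class NotifyLeaderSketch {

    /** Hypothetical follower-side state for one bucket (not a Fluss class). */
    public static final class FollowerState {
        public int leaderId;
        public String fetchEndpoint;

        public FollowerState(int leaderId, String fetchEndpoint) {
            this.leaderId = leaderId;
            this.fetchEndpoint = fetchEndpoint;
        }

        /** Returns true if the fetcher was reset, false if the notification was skipped. */
        public boolean onNotifyLeader(int notifiedLeaderId, String notifiedEndpoint) {
            if (notifiedLeaderId == leaderId) {
                // Leader unchanged: skip, so the fetcher keeps its old endpoint.
                return false;
            }
            leaderId = notifiedLeaderId;
            fetchEndpoint = notifiedEndpoint;
            return true;
        }
    }

    public static void main(String[] args) {
        // Made-up addresses: server0 restarted and came back on a new endpoint.
        FollowerState server3 = new FollowerState(0, "10.0.0.1:9123");
        boolean reset = server3.onNotifyLeader(0, "10.0.0.7:9123");
        System.out.println("fetcher reset: " + reset);
        System.out.println("still fetching from: " + server3.fetchEndpoint);
    }
}
```

In this sketch the follower ends up holding the old endpoint after the notification, which corresponds to the hang described in step 4.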

Solution

We should probably follow Kafka's approach: in step 4, the fetch request should fail because the connection is broken, and the follower should then retry by reconnecting to server0.
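
A rough sketch of that fail-then-reconnect behavior, assuming a hypothetical `LeaderConnection` interface and a `connector` supplier that re-resolves the leader's current address; neither is a Fluss or Kafka API. A real fetcher would likely also back off between attempts.

```java
import java.io.IOException;
import java.util.function.Supplier;

public class ReconnectingFetcherSketch {

    /** Hypothetical stand-in for a connection to the leader. */
    public interface LeaderConnection {
        String fetch() throws IOException;
    }

    /**
     * Fetches once; on failure, obtains a fresh connection from the connector
     * and retries, up to maxAttempts fetch attempts in total.
     */
    public static String fetchWithReconnect(
            Supplier<LeaderConnection> connector, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        LeaderConnection conn = connector.get();
        for (int attempt = 1; ; attempt++) {
            try {
                return conn.fetch();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                conn = connector.get(); // reconnect instead of reusing the broken connection
            }
        }
    }

    public static void main(String[] args) throws IOException {
        final int[] connects = {0};
        // First connection simulates the stale one; the second one works.
        Supplier<LeaderConnection> connector = () -> {
            connects[0]++;
            if (connects[0] == 1) {
                return () -> { throw new IOException("connection reset"); };
            }
            return () -> "records";
        };
        System.out.println(fetchWithReconnect(connector, 3));
    }
}
```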

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@luoyuxia
Collaborator Author

cc @swuferhong

@swuferhong
Collaborator

@luoyuxia I also found that all synchronous CompletableFuture.get() calls in the server, such as RemoteLeaderEndpoint#fetchLocalLogEndOffset().get(), RemoteLeaderEndpoint#fetchLocalLogStartOffset().get(), RemoteLeaderEndpoint#fetchLog().get(), and LogTieringTask#commitRemoteLogManifest(), could block indefinitely if the CoordinatorServer or another TabletServer shuts down or changes its IP.

The reasons are as you said:

  1. Our network client doesn't have a timeout mechanism, so a call can hang forever.
  2. Currently, the Gateway in the server lacks an update mechanism.

We may need to resolve these together, so I created a parent issue, #632, to track this. cc @wuchong
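
For the first point, one possible direction is to bound every blocking wait. The sketch below uses only the standard `CompletableFuture.get(long, TimeUnit)` overload; `getWithTimeout` and the timeout values are hypothetical, and this is not necessarily how #632 or #633 address the problem.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {

    /** Waits for the future for at most timeoutMs, cancelling it on timeout. */
    public static <T> T getWithTimeout(CompletableFuture<T> future, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // mark the pending request as abandoned
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates an RPC whose response never arrives (e.g. the peer's IP changed).
        CompletableFuture<Long> neverCompletes = new CompletableFuture<>();
        try {
            getWithTimeout(neverCompletes, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out; the caller can refresh its gateway and retry");
        }
    }
}
```

With a plain `get()`, the call in `main` would never return; with the bounded variant, the caller gets a `TimeoutException` and a chance to rebuild the connection, which ties into the second point.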

@luoyuxia
Collaborator Author

Should be fixed via #633
