
CASSANDRA-19580: Fix replacing node stuck in hibernation state (5.0) #3974

Open
Wants to merge 1 commit into base: cassandra-5.0


szymon-miezal (Contributor)

When a node is being replaced, the replacing node announces itself as hibernated (hibernate is one of the silent shutdown states). If the replacement fails, other nodes continue to see the replacing node in this state. As a result, when the replacing node is started again, it does not receive gossip messages from the seed, and startup fails with an exception.
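To make the failure mode concrete, here is a toy Java model of a seed's view of the replacing node. It is a minimal sketch, not Cassandra's Gossiper: the `Status` enum, the `peerWillGossipTo` rule, and the address are hypothetical simplifications, assuming only what the description states, namely that a peer which still sees the node as hibernated will not gossip to it.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class HibernateStuckSketch {
    // Hypothetical, simplified gossip statuses (not Cassandra's VersionedValue constants).
    enum Status { NORMAL, HIBERNATE, SHUTDOWN }

    // Statuses a node can be in without announcing its shutdown (simplified).
    static final Set<Status> SILENT_SHUTDOWN_STATES = EnumSet.of(Status.HIBERNATE);

    // A seed's last-known status for each endpoint, keyed by address.
    static final Map<String, Status> peerView = new HashMap<>();

    // Assumed rule for this sketch: the seed gossips to an endpoint unless it
    // still believes that endpoint is in a silent shutdown state.
    static boolean peerWillGossipTo(String endpoint) {
        Status s = peerView.getOrDefault(endpoint, Status.NORMAL);
        return !SILENT_SHUTDOWN_STATES.contains(s);
    }

    public static void main(String[] args) {
        String replacing = "10.0.0.5"; // hypothetical address of the replacing node
        // Announced when the replacement starts.
        peerView.put(replacing, Status.HIBERNATE);
        // The replacement fails and the process exits without another announcement,
        // so the seed's view is still HIBERNATE when the node is restarted.
        System.out.println("seed gossips to restarted node: " + peerWillGossipTo(replacing));
    }
}
```

Running `main` prints `seed gossips to restarted node: false`, because nothing ever overwrites the stale hibernate entry.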

This patch adds an explicit shutdown announcement via gossip when the replacement fails, so that other nodes know the node was shut down explicitly (because of the exception) instead of still seeing it as hibernated. This allows other nodes, seeds in particular, to contact the replacing node at its next startup, so the replacement can be retried.
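The fix can be modeled the same way. In this hypothetical sketch, `replaceNode` announces hibernate, fails during a stand-in streaming step, and, when the fix is enabled, announces shutdown before exiting. The method names, the `Status` enum, and the map-based `seedView` are illustrative assumptions; the actual patch's code is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplaceShutdownSketch {
    // Hypothetical, simplified gossip statuses.
    enum Status { NORMAL, HIBERNATE, SHUTDOWN }

    // Hypothetical stand-in for gossip: every announcement updates the seed's view.
    static final Map<String, Status> seedView = new HashMap<>();

    static void announce(String endpoint, Status status) {
        seedView.put(endpoint, status);
    }

    // Hypothetical replacement flow; announceShutdownOnFailure toggles the fix.
    static void replaceNode(String self, boolean announceShutdownOnFailure) {
        announce(self, Status.HIBERNATE); // silent state while replacing
        try {
            streamDataFromPeers(); // stand-in step that always fails in this sketch
            announce(self, Status.NORMAL);
        } catch (RuntimeException e) {
            if (announceShutdownOnFailure) {
                // The fix: tell peers explicitly that this node shut down,
                // instead of leaving them with the stale hibernate state.
                announce(self, Status.SHUTDOWN);
            }
            throw e;
        }
    }

    static void streamDataFromPeers() {
        throw new RuntimeException("simulated replacement failure");
    }

    // Runs a failed replacement and returns what the seed believes afterwards.
    static Status afterFailedReplace(String self, boolean fix) {
        try {
            replaceNode(self, fix);
        } catch (RuntimeException ignored) {
            // The node exits after the failure.
        }
        return seedView.get(self);
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + afterFailedReplace("10.0.0.5", false));
        System.out.println("with fix:    " + afterFailedReplace("10.0.0.6", true));
    }
}
```

Running `main` prints `without fix: HIBERNATE` and `with fix:    SHUTDOWN`. Only the explicit announcement replaces the stale silent state, which is what lets the seed contact the node again when it restarts.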
