This repository has been archived by the owner on Sep 25, 2020. It is now read-only.

Allow Nodes to mark other nodes Rejected only if not itself in Suspect / Faulty lists #84

Open
robins opened this issue May 27, 2015 · 2 comments

Comments

@robins

robins commented May 27, 2015

While Issue #42 is still pending ... I just wanted to check whether Ringpop currently blocks all 'Reject M' messages from a node N when node N itself is in the suspect / faulty lists of all other nodes?

(The use case would be when partitioned sets mark nodes in the other partition as faulty; if the network then restores, the reject messages would pass over to the other partition, thereby marking alive nodes in the opposite partition as faulty.)

(I've just come from watching the Ringpop@Rackspace video and reading the SWIM paper ... so pardon me if I'm missing the elephant in the room.)

@jwolski
Contributor

jwolski commented Jun 21, 2015

@robins Thanks for your question. As implemented now, faulty members are not pinged. Ringpop would not expect a ping to be sent from such a member, and if one arrived, Ringpop would not do the right thing (which would be to mark the sender as alive in its membership).

The resolution to issue #42 will be to periodically ping faulty members, likely at a lower rate than the normal protocol period, and a faulty member will have to assert its aliveness that way.
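For illustration only, here's a minimal TypeScript sketch of both behaviours: the current one (faulty members are never chosen as ping targets) and the rough shape of the #42 idea (a slower ping loop over faulty members). The member/membership types, helper names, and timing values are assumptions for this example, not Ringpop's actual internals:

```typescript
// A minimal, hypothetical sketch -- names and structure do not mirror Ringpop's code.
type Status = 'alive' | 'suspect' | 'faulty';

interface Member {
  address: string;
  status: Status;
}

class Membership {
  constructor(private members: Map<string, Member>) {}

  // Current behavior: faulty members are never chosen as ping targets.
  pingableMembers(): Member[] {
    return [...this.members.values()].filter((m) => m.status !== 'faulty');
  }

  faultyMembers(): Member[] {
    return [...this.members.values()].filter((m) => m.status === 'faulty');
  }
}

// Assumed helper: sends a ping and resolves true if an ack comes back.
declare function ping(member: Member): Promise<boolean>;

const PROTOCOL_PERIOD_MS = 200;      // normal protocol period (illustrative value)
const FAULTY_PING_PERIOD_MS = 5000;  // slower cadence for faulty members (illustrative value)

function startProtocol(membership: Membership): void {
  // Normal protocol period: pick a random non-faulty member and ping it.
  setInterval(() => {
    const targets = membership.pingableMembers();
    if (targets.length > 0) {
      void ping(targets[Math.floor(Math.random() * targets.length)]);
    }
  }, PROTOCOL_PERIOD_MS);

  // Sketch of the issue #42 idea: ping faulty members too, but at a lower rate,
  // so a recovered member gets a chance to reassert its aliveness.
  setInterval(() => {
    for (const m of membership.faultyMembers()) {
      void ping(m);
    }
  }, FAULTY_PING_PERIOD_MS);
}
```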

Does that answer your question? I was confused by what you meant about 'Reject M' messages. But I tried my best to answer. Let me know!

@robins
Author

robins commented Jun 22, 2015

Thanks @jwolski ... but I believe I didn't explain myself well earlier.

I'll try to elaborate with a worst-case scenario. The issue here isn't so much about whether faulty members are pinged, but about whether 'requests / messages' (not pings) from faulty members are processed or not...

Let's assume that, owing to network issues, 20 nodes got split into two clusters (sets): A (with nodes 1-10) and B (with nodes 11-20). If the network has been disconnected for long enough, all nodes in set B would be ready to mark nodes 1-10 (in set A) as faulty... and vice versa. Now, if the network comes back alive just before that announcement, we're essentially going to have a bloodbath when nodes in set A announce that nodes 11-20 are faulty and vice versa... If nothing else, we're going to see a huge (unnecessary) drop in alive nodes during such network reconnects.

As the title suggests, this could be avoided / mitigated if (just like pings) reject messages from members currently in the faulty list are not processed.
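For illustration, here is a minimal TypeScript sketch of the guard being proposed; the types and function names are hypothetical and are not taken from Ringpop's code:

```typescript
// A minimal, hypothetical sketch of the proposed guard; not Ringpop's actual message handling.
type Status = 'alive' | 'suspect' | 'faulty';

interface Update {
  sender: string;      // node that produced the update
  subject: string;     // node the update is about
  status: Status;      // 'faulty' corresponds to the "Reject" case discussed above
  incarnation: number;
}

class MembershipView {
  private statuses = new Map<string, Status>();

  statusOf(address: string): Status | undefined {
    return this.statuses.get(address);
  }

  apply(update: Update): void {
    this.statuses.set(update.subject, update.status);
  }
}

// Proposed behavior: just as faulty members are not pinged, a 'faulty' (reject)
// update is dropped when its sender is itself in our faulty list. After a
// partition heals, this stops the two sides from declaring each other faulty
// in one burst; members can instead be rehabilitated gradually.
function handleUpdate(view: MembershipView, update: Update): void {
  if (update.status === 'faulty' && view.statusOf(update.sender) === 'faulty') {
    return; // ignore rejects from members we currently consider faulty
  }
  view.apply(update);
}
```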
