This repository has been archived by the owner on Sep 25, 2020. It is now read-only.

Allow Nodes to mark other nodes Rejected only if not itself in Suspect / Faulty lists #84

Open
robins opened this issue May 27, 2015 · 2 comments

Comments

@robins

robins commented May 27, 2015

While Issue #42 is still pending ... I just wanted to check whether Ringpop currently blocks all 'Reject M' messages from a node N when node N itself is in the suspect / faulty lists of all other nodes?

(The use case would be when partitioned sets mark nodes in the other partition as faulty; if the network then restores, the reject messages would pass over to the other partition, thereby marking alive nodes in the opposite partition as faulty.)

(I've just come from watching the Ringpop@Rackspace video and reading the SWIM paper ... so pardon me if I'm missing the elephant in the room.)

@jwolski
Contributor

jwolski commented Jun 21, 2015

@robins Thanks for your question. As implemented now, faulty members are not pinged. Ringpop would not expect a ping to be sent from such a member, and if one arrived, Ringpop would not do the right thing (which would be to mark the sender as alive in its membership).

The resolution to issue #42 will be to periodically ping faulty members, likely at a lower rate than the normal protocol period, and a faulty member will have to assert its aliveness that way.
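For illustration only, here's a minimal TypeScript sketch of both behaviours: the current one (faulty members are never chosen as ping targets) and the rough shape of the #42 idea (a slower ping loop over faulty members). The member/membership types, helper names, and timing values are assumptions for this example, not Ringpop's actual internals:

```typescript
// A minimal, hypothetical sketch -- names and structure do not mirror Ringpop's code.
type Status = 'alive' | 'suspect' | 'faulty';

interface Member {
  address: string;
  status: Status;
}

class Membership {
  constructor(private members: Map<string, Member>) {}

  // Current behavior: faulty members are never chosen as ping targets.
  pingableMembers(): Member[] {
    return [...this.members.values()].filter((m) => m.status !== 'faulty');
  }

  faultyMembers(): Member[] {
    return [...this.members.values()].filter((m) => m.status === 'faulty');
  }
}

// Assumed helper: sends a ping and resolves true if an ack comes back.
declare function ping(member: Member): Promise<boolean>;

const PROTOCOL_PERIOD_MS = 200;      // normal protocol period (illustrative value)
const FAULTY_PING_PERIOD_MS = 5000;  // slower cadence for faulty members (illustrative value)

function startProtocol(membership: Membership): void {
  // Normal protocol period: pick a random non-faulty member and ping it.
  setInterval(() => {
    const targets = membership.pingableMembers();
    if (targets.length > 0) {
      void ping(targets[Math.floor(Math.random() * targets.length)]);
    }
  }, PROTOCOL_PERIOD_MS);

  // Sketch of the issue #42 idea: ping faulty members too, but at a lower rate,
  // so a recovered member gets a chance to reassert its aliveness.
  setInterval(() => {
    for (const m of membership.faultyMembers()) {
      void ping(m);
    }
  }, FAULTY_PING_PERIOD_MS);
}
```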

Does that answer your question? I was confused by what you meant about 'Reject M' messages. But I tried my best to answer. Let me know!

@robins
Author

robins commented Jun 22, 2015

Thanks @jwolski ... but I believe I didn't explain myself well earlier.

I'll try to elaborate with a worst-case scenario. The issue here isn't so much about whether faulty members are pinged, but about whether 'requests / messages' (not pings) from faulty members are processed or not...

Let's assume that, owing to network issues, 20 nodes got split into two clusters (sets): A (with nodes 1-10) and B (with nodes 11-20). If the network has been disconnected for long enough, all nodes in set B would be ready to mark nodes 1-10 (in set A) as faulty... and vice versa. Now, if the network comes back alive just before that announcement, we're essentially going to have a bloodbath when nodes in set A announce that nodes 11-20 are faulty and vice versa... If nothing else, we're going to see a huge (unnecessary) drop in alive nodes during such network reconnects.

As the title suggests, this could be avoided / mitigated if (just like pings) reject messages from members currently in the faulty list are not processed.
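For illustration, here is a minimal TypeScript sketch of the guard being proposed; the types and function names are hypothetical and are not taken from Ringpop's code:

```typescript
// A minimal, hypothetical sketch of the proposed guard; not Ringpop's actual message handling.
type Status = 'alive' | 'suspect' | 'faulty';

interface Update {
  sender: string;      // node that produced the update
  subject: string;     // node the update is about
  status: Status;      // 'faulty' corresponds to the "Reject" case discussed above
  incarnation: number;
}

class MembershipView {
  private statuses = new Map<string, Status>();

  statusOf(address: string): Status | undefined {
    return this.statuses.get(address);
  }

  apply(update: Update): void {
    this.statuses.set(update.subject, update.status);
  }
}

// Proposed behavior: just as faulty members are not pinged, a 'faulty' (reject)
// update is dropped when its sender is itself in our faulty list. After a
// partition heals, this stops the two sides from declaring each other faulty
// in one burst; members can instead be rehabilitated gradually.
function handleUpdate(view: MembershipView, update: Update): void {
  if (update.status === 'faulty' && view.statusOf(update.sender) === 'faulty') {
    return; // ignore rejects from members we currently consider faulty
  }
  view.apply(update);
}
```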
