
Handoff and TCP RECV timeout #994

@martinsumner

Description


Summary of Confusion/Problem:

Handoffs on busy riak_kv clusters often fail, with the reported issue being TCP RECV timeouts.

The handoff code still includes references to support for very old versions:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L129-L133

There is also a reference to a re-write, happening soon, to resolve the confusion:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L198-L202

Handoff timeouts are disruptive, as the handoff must start again from the beginning (in terms of the fold over the vnode) and re-send all of the data at the next attempt.

When leaving/joining, this can be controlled through handoff concurrency, but this isn't perfect and can still lead to continuous failures.

Hinted handoffs can be even more problematic, particularly when receiving vnodes are already subject to high loads. There is very little tuning/testing that can be done to predict what a safe transfer limit might be. There is also a vicious circle that can form: if hinted handoffs fail, the receiving cluster may be subject to increasing amounts of anti-entropy and read-repair load, which will cause hinted handoffs to continue to fail.

This issue is exacerbated by the fact that the timeouts reported as TCP RECV timeouts are not necessarily low-level network timeouts, but are more likely the result of the receiver failing to respond to an application-level (OSI L7) sync message. There is also confusion between the timeout settings:

Calling the function get_handoff_receive_timeout/0 returns the handoff_timeout, which defaults to the TCP_TIMEOUT:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L514-L515
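A minimal sketch of that behaviour, assuming the setting is read from the riak_core application environment (the literal default below is illustrative only; the linked code is authoritative):

```erlang
%% Sketch only: the default value here is illustrative, not the actual constant.
-define(TCP_TIMEOUT, 60000).

get_handoff_receive_timeout() ->
    %% Despite its name, this reads handoff_timeout (not
    %% handoff_receive_timeout), falling back to TCP_TIMEOUT when unset.
    application:get_env(riak_core, handoff_timeout, ?TCP_TIMEOUT).
```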

There is also a different handoff_receive_timeout on the sender, derived from a receiver setting:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L293-L298

Here the sender's timeout defaults to a third of the receiver's setting, if that setting is present; but by default it is not set, and the sender does not fall back to a third of the default chosen when the setting is unset on the receiver side:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_receiver.erl#L78
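Restating that derivation as a hypothetical helper (the linked lines are authoritative; the unset branch is exactly the ambiguity described):

```erlang
%% Sketch only: what the sender actually uses when the receiver setting
%% is absent is the point of confusion.
sender_sync_timeout() ->
    case application:get_env(riak_core, handoff_receive_timeout) of
        {ok, RecvTimeout} ->
            RecvTimeout div 3;   %% a third of the receiver setting
        undefined ->
            %% NOT a third of the receiver's own built-in default;
            %% some other value applies here (see the linked code)
            undefined
    end.
```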

How should these timeouts be tuned to prevent failures? A lower handoff_receive_timeout on the sender side will lead to more frequent SYNCs, and hence make a backlog (and therefore a timeout) less likely; but it has different consequences within the receiver.
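For anyone experimenting, an illustrative advanced.config fragment covering the two settings discussed above; the key names follow the code linked above, and the values are examples rather than recommendations:

```erlang
[
 {riak_core,
  [
   %% sender-side value returned by get_handoff_receive_timeout/0
   {handoff_timeout, 120000},
   %% receiver-side timeout; when set, the sender syncs at a third of this
   {handoff_receive_timeout, 180000}
  ]}
].
```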

Proposal:

There are some broad changes proposed:

  • Remove all legacy configuration messages, and legacy options (e.g. non-batched sending).

  • Less haste even if that means less speed - is the cost of a round-trip worth paying on every batch if it prevents timeouts? i.e. don't send a batch until a sync has been received confirming that the last batch was processed (see the sketch after this list). Completing a transfer first time will take less time overall than running faster and having to repeat the transfer. Speed is dependent on coordination.

  • Make timeout configurations easier to reason about.

  • Log progress of transfers, so that proactively running riak-admin transfers is not required to see progress, and progress can be reviewed after the event.

  • Potentially compress batches before sending (i.e. using term_to_binary with the compressed option for compatibility).
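A minimal sketch of the sync-before-next-batch idea from the second bullet, combined with the compressed batch encoding from the last bullet. The function names and wire format are hypothetical stand-ins for the existing handoff sync mechanism, and the socket is assumed to be in passive ({active, false}) mode:

```erlang
send_batches(Socket, Batches, Timeout) ->
    lists:foreach(
      fun(Batch) ->
              %% term_to_binary with the compressed option keeps the
              %% encoding readable by any binary_to_term on the receiver,
              %% so no version negotiation is required
              Bin = term_to_binary(Batch, [compressed]),
              ok = gen_tcp:send(Socket, Bin),
              %% do not send the next batch until the receiver has
              %% confirmed this one is processed
              ok = wait_for_sync(Socket, Timeout)
      end,
      Batches).

wait_for_sync(Socket, Timeout) ->
    %% any reply is treated as confirmation in this sketch
    case gen_tcp:recv(Socket, 0, Timeout) of
        {ok, _Ack}      -> ok;
        {error, Reason} -> exit({handoff_sync_failed, Reason})
    end.
```

The trade-off is one extra round-trip per batch; as argued above, a transfer that completes first time should still beat one that runs faster but has to restart.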

The aim must be to make these changes without adding another generation of forward/backward compatibility checking. Avoid making any change that requires version checking.

Related:

There was some debate, some time ago, about the wisdom of performing read-repairs to fallbacks. There was an intention at some stage to add a configuration option to read-repair to primaries only. This is related to hinted handoff performance, as the volume of changes handed off during a hinted handoff is adversely impacted by read-repairs: if a node recovers with its data intact, it will receive in handoff not just the data missed since it went down, but all the data read since it went down.

There is some similarity between this issue and basho/riak_repl#817. There may be a factor here in changes to gen_tcp within OTP, especially with regards to timeout defaults. Using the {active, once} setting and allowing large backlogs to form can lead to unexpected and hard-to-troubleshoot failures. It is better to avoid the consequences of backlogs buffering in TCP receive windows.
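For illustration, a sketch of an {active, once} receive loop (not the riak_core_handoff_receiver code; handle_batch/1 is a hypothetical handler). The socket delivers one message and then stays quiet until inet:setopts/2 re-arms it, so unprocessed data accumulates in the TCP receive window rather than the process mailbox; a slow receiver then surfaces as a sender-side RECV timeout rather than anything more descriptive:

```erlang
loop(Socket) ->
    receive
        {tcp, Socket, Data} ->
            ok = handle_batch(Data),                     %% hypothetical handler
            ok = inet:setopts(Socket, [{active, once}]), %% re-arm for one more message
            loop(Socket);
        {tcp_closed, Socket} ->
            ok;
        {tcp_error, Socket, Reason} ->
            exit({handoff_socket_error, Reason})
    end.
```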
