This issue discusses how Synapse synchronises room state after an outage. The current (v1.137.0) heuristic is:
- Hit `/state_ids` to get the total number of state events S and auth events A.
- Calculate how many events are missing, M.
- If `M*10 >= (S+A)`, request the complete state via `/state`, else use `/event`.
This heuristic basically says: if we are missing >= 10% of the events listed in `/state_ids`, we will use `/state`.
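For illustration, a minimal sketch of that decision (hypothetical names, not Synapse's actual function):

```python
def should_fetch_full_state(
    num_state_events: int,  # S: state events listed by /state_ids
    num_auth_events: int,   # A: auth events listed by /state_ids
    num_missing: int,       # M: events from those lists we don't have locally
) -> bool:
    """True => fetch everything via /state; False => fetch each missing
    event individually via /event."""
    # Equivalent to: missing fraction M / (S + A) >= 10%.
    return num_missing * 10 >= (num_state_events + num_auth_events)
```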
The tradeoff here is between bandwidth and time. `/state` returns everything, which is great on time but terrible on bandwidth. `/event` returns exactly the missing events, but is slow because we make one request per event, with at most 5 in flight concurrently.
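For concreteness, a rough sketch of that per-event path with bounded concurrency (`fetch_event` is a hypothetical stand-in for the real federation request; this is not Synapse's actual implementation):

```python
import asyncio

MAX_CONCURRENT_FETCHES = 5

async def fetch_event(destination: str, event_id: str) -> dict:
    # Stand-in for GET /_matrix/federation/v1/event/{event_id}.
    raise NotImplementedError

async def fetch_missing_events(destination: str, event_ids: list[str]) -> dict:
    sem = asyncio.Semaphore(MAX_CONCURRENT_FETCHES)

    async def fetch_one(event_id: str) -> tuple[str, dict]:
        async with sem:
            return event_id, await fetch_event(destination, event_id)

    # If any single fetch raises, gather propagates the failure: this is
    # the brittleness discussed below.
    results = await asyncio.gather(*(fetch_one(eid) for eid in event_ids))
    return dict(results)
```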
Unfortunately, there are problems with `/event`. Previously, we were far too lax about simply continuing on when `/event` failed, causing room state to diverge and critical invariants to be broken. See:
- Fix a bug which could corrupt auth chains #18746
- Don't persist known bad room state from /state_ids #18877
However, the fix in #18877 makes `/event` terribly brittle: a single network failure across the hundreds of missing events will cause the entire house of cards to collapse, and we will fail to make forward progress. This is exacerbated by https://github.com/element-hq/synapse/blob/v1.137.0/synapse/federation/federation_client.py#L94, which means a single network failure will cause subsequent requests for that event ID to fail for at least 1 minute, so any events which depend on that event will also fail to be processed for at least 1 minute.
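The cached-failure behaviour is roughly the following (an illustrative sketch, not the code linked above):

```python
import time

FAILURE_CACHE_TTL_SECS = 60

# event_id -> wall-clock time of the last failed fetch
_recent_failures: dict[str, float] = {}

def fetch_event_cached(event_id: str) -> dict:
    failed_at = _recent_failures.get(event_id)
    if failed_at is not None and time.time() - failed_at < FAILURE_CACHE_TTL_SECS:
        # A previous attempt failed recently, so fail immediately without
        # retrying. Any event whose auth/prev events include this event is
        # now also blocked for the remainder of the minute.
        raise RuntimeError(f"recently failed to fetch {event_id}")
    try:
        return fetch_event_over_federation(event_id)
    except Exception:
        _recent_failures[event_id] = time.time()
        raise

def fetch_event_over_federation(event_id: str) -> dict:
    # Stand-in for the real federation request (hypothetical helper).
    raise NotImplementedError
```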
This makes federation catch-up extremely slow. In the worst case, we never make forward progress, because the connection is so patchy that we never see all of the `/event` requests succeed at the same time.
There are a few options here, ranging from simple to complex:
- More aggressively use `/state` (e.g. as a fallback if we fail to persist all events via `/event`; see the sketch after this list). This is implementable today, at the cost of more bandwidth being used.
- Add a bulk `/event` endpoint, ensuring that we can't get these partial failures. This requires an MSC and a spec change.
- Go nuclear and use a set reconciliation algorithm. This would be an entire project to do.
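As a sketch of the first option, reusing the hypothetical `fetch_missing_events` helper from above (again, illustrative only, not the actual change):

```python
async def get_state_after_outage(
    destination: str, room_id: str, event_id: str, missing_ids: list[str]
) -> dict:
    try:
        # Bandwidth-friendly path: fetch only the missing events.
        return await fetch_missing_events(destination, missing_ids)
    except Exception:
        # Any single /event failure poisons the batch, so rather than
        # stalling, pay the bandwidth cost and fetch the full state.
        return await fetch_full_state(destination, room_id, event_id)

async def fetch_full_state(destination: str, room_id: str, event_id: str) -> dict:
    # Stand-in for GET /_matrix/federation/v1/state/{room_id}?event_id=...
    raise NotImplementedError
```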
For now, #18877 will be adjusted to hit `/state` to ensure we make forward progress.