This issue discusses how Synapse synchronises room state after an outage. The current (v1.137.0) heuristic is:
- Hit `/state_ids` to get the total number of state events S and auth events A.
- Calculate how many events are missing, M.
- If `M*10 >= (S+A)`, request the complete state via `/state`, else use `/event`.
This heuristic basically says: if we are missing >= 10% of the events listed in `/state_ids`, we will use `/state`.
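For illustration, a minimal sketch of that decision (hypothetical names, not Synapse's actual function):

```python
def should_fetch_full_state(
    num_state_events: int,  # S: state events listed by /state_ids
    num_auth_events: int,   # A: auth events listed by /state_ids
    num_missing: int,       # M: events from those lists we don't have locally
) -> bool:
    """True => fetch everything via /state; False => fetch each missing
    event individually via /event."""
    # Equivalent to: missing fraction M / (S + A) >= 10%.
    return num_missing * 10 >= (num_state_events + num_auth_events)
```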
The tradeoff here is between bandwidth and time. `/state` returns everything, which is great on time but terrible on bandwidth. `/event` returns exactly the missing events, but is slow because we make one request per event, with at most 5 in flight concurrently.
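For concreteness, a rough sketch of that per-event path with bounded concurrency (`fetch_event` is a hypothetical stand-in for the real federation request; this is not Synapse's actual implementation):

```python
import asyncio

MAX_CONCURRENT_FETCHES = 5

async def fetch_event(destination: str, event_id: str) -> dict:
    # Stand-in for GET /_matrix/federation/v1/event/{event_id}.
    raise NotImplementedError

async def fetch_missing_events(destination: str, event_ids: list[str]) -> dict:
    sem = asyncio.Semaphore(MAX_CONCURRENT_FETCHES)

    async def fetch_one(event_id: str) -> tuple[str, dict]:
        async with sem:
            return event_id, await fetch_event(destination, event_id)

    # If any single fetch raises, gather propagates the failure: this is
    # the brittleness discussed below.
    results = await asyncio.gather(*(fetch_one(eid) for eid in event_ids))
    return dict(results)
```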
Unfortunately, there are problems with `/event`. Previously, we were far too lax about simply continuing on when `/event` failed, causing room state to diverge and critical invariants to be broken. See:
- Fix a bug which could corrupt auth chains #18746
- Don't persist known bad room state from /state_ids #18877
However, the fix in #18877 makes `/event` terribly brittle: a single network failure across the hundreds of missing events will cause the entire house of cards to collapse, and we will fail to make forward progress. This is exacerbated by https://github.com/element-hq/synapse/blob/v1.137.0/synapse/federation/federation_client.py#L94, which means a single network failure will cause subsequent requests for that event ID to fail for at least 1 minute, so any events which depend on that event will also fail to be processed for at least 1 minute.
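The cached-failure behaviour is roughly the following (an illustrative sketch, not the code linked above):

```python
import time

FAILURE_CACHE_TTL_SECS = 60

# event_id -> wall-clock time of the last failed fetch
_recent_failures: dict[str, float] = {}

def fetch_event_cached(event_id: str) -> dict:
    failed_at = _recent_failures.get(event_id)
    if failed_at is not None and time.time() - failed_at < FAILURE_CACHE_TTL_SECS:
        # A previous attempt failed recently, so fail immediately without
        # retrying. Any event whose auth/prev events include this event is
        # now also blocked for the remainder of the minute.
        raise RuntimeError(f"recently failed to fetch {event_id}")
    try:
        return fetch_event_over_federation(event_id)
    except Exception:
        _recent_failures[event_id] = time.time()
        raise

def fetch_event_over_federation(event_id: str) -> dict:
    # Stand-in for the real federation request (hypothetical helper).
    raise NotImplementedError
```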
This makes federation catch-up extremely slow. In the worst case, we never make forward progress, because the connection is so patchy that we never see all of the `/event` requests succeed at the same time.
There are a few options here, ranging from simple to complex:
- More aggressively use `/state` (e.g. as a fallback if we fail to persist all events via `/event`; see the sketch after this list). This is implementable today, at the cost of more bandwidth being used.
- Add a bulk `/event` endpoint, ensuring that we can't get these partial failures. This requires an MSC and a spec change.
- Go nuclear and use a set reconciliation algorithm. This would be an entire project to do.
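As a sketch of the first option, reusing the hypothetical `fetch_missing_events` helper from above (again, illustrative only, not the actual change):

```python
async def get_state_after_outage(
    destination: str, room_id: str, event_id: str, missing_ids: list[str]
) -> dict:
    try:
        # Bandwidth-friendly path: fetch only the missing events.
        return await fetch_missing_events(destination, missing_ids)
    except Exception:
        # Any single /event failure poisons the batch, so rather than
        # stalling, pay the bandwidth cost and fetch the full state.
        return await fetch_full_state(destination, room_id, event_id)

async def fetch_full_state(destination: str, room_id: str, event_id: str) -> dict:
    # Stand-in for GET /_matrix/federation/v1/state/{room_id}?event_id=...
    raise NotImplementedError
```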
For now, #18877 will be adjusted to hit `/state` to ensure we make forward progress.