proposal: control service synchronization #4876

@tzaeschke

Description

It seems that path databases of control services (CS) are not synchronized when an AS is running multiple CS.

Impact

  • Most clients contact only one CS when requesting paths. If the CSes are not synchronized, the client will not see all paths, or will see no paths at all (if that CS never received any).
  • If one CS acts as a fallback, the fallback will not work because it never received the paths from the other CS.
  • In core ASes, the CSes may be queried by CSes in other ASes. Many paths will not be made available if a CS does not contain all paths. Core ASes in particular are likely to run multiple CS instances.

History

As far as I understand, CS synchronization was originally done with ZooKeeper, see #50. However, ZooKeeper was later removed, see #1025, and CS synchronization was dropped.

Sketch of solution

Any implementation should:

  • Synchronize paths quickly, within seconds or minutes.
  • Allow a crashed CS to recover missed updates, or even to rebuild the whole path database.
  • Note: paths that are received through CS synchronization within an AS probably do not need to be verified.

Proposal

One solution is for each CS to regularly poll the other CSes for updates. To avoid timing issues, we assume that the clocks of all CSes are reasonably in sync (we could also use a system with timestamps for each other CS, but that seems like overkill).
Every CS should record in its local database when it last synced with another CS. This timestamp can be used in a sync request to indicate how far back PCBs should be sent. It also allows rebuilding path databases after a crash or prolonged downtime.
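The polling idea can be sketched as follows. This is only an illustration under stated assumptions, not SCION code: `PCB`, `ControlService`, `pcbs_since` and `poll` are hypothetical names, a PCB is reduced to an ID plus a receive timestamp, and clocks are assumed to be in sync as described above. Whether the last-sync time is kept per peer (as here) or as a single value is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PCB:
    """Hypothetical stand-in for a beacon; a real PCB carries far more."""
    segment_id: str
    received_at: float  # time the PCB first entered the AS, in seconds


@dataclass
class ControlService:
    name: str
    path_db: dict = field(default_factory=dict)      # segment_id -> PCB
    last_synced: dict = field(default_factory=dict)  # peer name -> timestamp

    def store(self, pcb):
        self.path_db[pcb.segment_id] = pcb

    def pcbs_since(self, timestamp):
        """Answer a sync request: every PCB stored at or after `timestamp`."""
        return [p for p in self.path_db.values() if p.received_at >= timestamp]

    def poll(self, peer, now):
        """Pull everything the peer stored since our last sync with it."""
        # No recorded sync (e.g. after a crash) means a full rebuild from 0.0.
        since = self.last_synced.get(peer.name, 0.0)
        added = 0
        for pcb in peer.pcbs_since(since):
            if pcb.segment_id not in self.path_db:  # skip duplicates
                self.store(pcb)
                added += 1
        self.last_synced[peer.name] = now
        return added


cs1 = ControlService("cs1")
cs2 = ControlService("cs2")
cs1.store(PCB("seg-a", received_at=10.0))
cs1.store(PCB("seg-b", received_at=20.0))
first = cs2.poll(cs1, now=25.0)    # initial sync: everything cs1 has
cs1.store(PCB("seg-c", received_at=30.0))
second = cs2.poll(cs1, now=35.0)   # incremental sync: only the new PCB
```

Because the request uses `>=` and the receiver drops PCBs it already has, a PCB that arrives exactly at the sync timestamp is transferred at most once more and never lost.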

See also discussion here: https://scionproto.slack.com/archives/C8ADA9CEP/p1767972450040339

Alternatively, it should be possible to use a protobuf subscription/stream instead of polling. The stream would start with a request to send all PCBs received after a given timestamp. Once the initial sync is complete, the stream would remain open so that the sender can immediately push new PCBs to all subscribers.
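The two phases of such a stream (replay, then live push) can be sketched in-process as follows. This is a hedged illustration only: `StreamingCS`, `subscribe` and `receive` are hypothetical names, plain callbacks stand in for an open stream, and a PCB is reduced to an ID and a timestamp.

```python
class StreamingCS:
    """Hypothetical CS that pushes PCBs to subscribers instead of being polled."""

    def __init__(self, name):
        self.name = name
        self.path_db = {}      # segment_id -> receive timestamp
        self.subscribers = []  # one callback per open stream

    def subscribe(self, since, deliver):
        # Phase 1: replay every PCB stored at or after `since`, oldest first.
        backlog = sorted(
            (ts, seg) for seg, ts in self.path_db.items() if ts >= since
        )
        for ts, seg in backlog:
            deliver(seg, ts)
        # Phase 2: keep the "stream" open for future PCBs.
        self.subscribers.append(deliver)

    def receive(self, segment_id, timestamp):
        """Store a PCB and push it to all subscribers; ignore duplicates."""
        if segment_id in self.path_db:
            return False
        self.path_db[segment_id] = timestamp
        for deliver in self.subscribers:
            deliver(segment_id, timestamp)
        return True


sender = StreamingCS("cs1")
replica = StreamingCS("cs2")
sender.receive("seg-a", 10.0)                         # arrives before the stream opens
sender.subscribe(since=0.0, deliver=replica.receive)  # initial sync replays seg-a
sender.receive("seg-b", 20.0)                         # pushed immediately
```

The duplicate check in `receive` also keeps a bidirectional subscription from looping: a PCB pushed back to a CS that already has it is simply dropped.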

To be designed:

  • Full topo model? Does every CS poll/subscribe to every other CS? That doesn't scale well but would be best for availability in case one CS disappears.
  • Star topo model / chain topo model? Does every CS poll/subscribe to only one other CS? Which one? What happens if that one goes down?
  • We also need to consider that for initial sync (after a crash) a CS should receive ALL PCBs from its peer. However, after initial sync, a peer should only send PCBs that it received from outside the AS (star topo, full topo) or from a parent (chain topo). This should avoid sending PCBs unnecessarily.
    **Initially:** As a simple alternative, it may be fine to forward all PCBs (after initial sync is finished) regardless of where they come from, as long as they are not already present in the local path DB; this would create some network overhead but considerably simplify the architecture.
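To get a rough feel for the overhead of the simple alternative, the sketch below counts messages for a single PCB that enters a full-mesh AS at one CS. It is a toy model under assumptions that are not part of the proposal: `flood` is a hypothetical helper, delivery is synchronous and in order, and a CS never sends a PCB back to the peer it just got it from.

```python
from collections import deque


def flood(n, filter_by_origin):
    """Deliver one external PCB to CS 0 in a full mesh of n CSes.

    Returns (messages sent inside the AS, whether every CS has the PCB).
    """
    have = [False] * n
    messages = 0
    queue = deque([(None, 0)])  # (sender, receiver); None means outside the AS
    while queue:
        sender, receiver = queue.popleft()
        if sender is not None:
            messages += 1
        if have[receiver]:
            continue  # already in the local path DB: drop, do not forward
        have[receiver] = True
        if filter_by_origin and sender is not None:
            continue  # learned from inside the AS: do not forward
        for peer in range(n):
            if peer != receiver and peer != sender:
                queue.append((receiver, peer))
    return messages, all(have)


filtered = flood(4, filter_by_origin=True)   # forward only externally received PCBs
simple = flood(4, filter_by_origin=False)    # forward anything new to the local DB
```

In this model the origin filter needs n - 1 messages per PCB and the simple rule needs (n - 1)^2, so the extra cost stays small for the handful of CS instances a typical AS runs.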

I am happy to discuss this in detail later.
