control/netmap: reconnect if a keepalive hasn't been received recently #210

@dylan-tailscale

Description

The control plane purposely disables layer 4 TCP keepalives [1] on control connections in favor of layer 7 keepalives via the MapResponse::keep_alive field [2]. As a result, the control client can end up with a TCP stream that it still considers ESTABLISHED even though the control plane has effectively dropped the connection. This happens most often when the machine running a tailscale-rs device sleeps or suspends and then wakes up: ss/netstat show the control plane TCP connection as established, but we receive no netmap messages, because the control plane has written us off as dead. Any sleep/suspend that lasts longer than the control plane's layer 7 keepalive interval can therefore break a tailscale-rs device, requiring a process restart.

To fix this, the control client needs to track the time since the last MapResponse message was received [3]. If the control client hasn't received a message in ~120 seconds, it should tear down the existing TCP connection and start a new one.
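
A minimal sketch of the watchdog described above, using only the standard library. The `MapWatchdog` type, its method names, and the `KEEPALIVE_TIMEOUT` constant are hypothetical and not part of tailscale-rs; the 120-second value is the rough figure suggested in this issue. Timestamps are passed in rather than read internally so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical default: the issue suggests roughly 120 seconds.
pub const KEEPALIVE_TIMEOUT: Duration = Duration::from_secs(120);

/// Tracks when the last MapResponse arrived on the control connection.
/// Illustrative only; not an existing tailscale-rs type.
#[derive(Debug)]
pub struct MapWatchdog {
    last_rx: Instant,
    timeout: Duration,
}

impl MapWatchdog {
    /// Start the watchdog as if a message had just been received at `now`.
    pub fn new(now: Instant, timeout: Duration) -> Self {
        Self { last_rx: now, timeout }
    }

    /// Call for every valid MapResponse, whether or not keep_alive is set.
    pub fn on_map_response(&mut self, now: Instant) {
        self.last_rx = now;
    }

    /// True once `timeout` has elapsed since the last received message.
    /// The caller should then tear down the TCP connection and reconnect.
    pub fn is_expired(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_rx) >= self.timeout
    }

    /// Time left before expiry (zero if already expired); usable as a
    /// sleep or timer duration in the client's receive loop.
    pub fn remaining(&self, now: Instant) -> Duration {
        self.timeout
            .saturating_sub(now.saturating_duration_since(self.last_rx))
    }
}
```

In a receive loop, the client would call `on_map_response(Instant::now())` for each message and reconnect once `is_expired(Instant::now())` returns true. One caveat to check during implementation: on some platforms the monotonic clock behind `Instant` may not advance while the machine is suspended, in which case the watchdog would fire up to one full timeout after wake rather than immediately.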

Footnotes

  1. https://www.rfc-editor.org/info/rfc9293/#name-tcp-keep-alives

  2. This was done for battery life reasons; in ~2021, some mobile devices would wake the main processor to handle receiving/sending TCP keepalives, which depletes battery life. By disabling layer 4 keepalives for control connections and moving the keepalive concept to layer 7, the control plane helps preserve battery life on those devices, but requires all clients to manually track connection liveness and reconnect after extended idle time. There's been some internal discussion about re-introducing layer 4 keepalives to the control connection, as most mobile devices have a low-power processor/DSP that can handle TCP keepalives without waking the whole device/depleting battery life excessively. However, it still may not be possible if system APIs don't (sanely) expose TCP keepalive state to the application, and requires some up-front research/design.

  3. Per the tailscaled code, any valid message received resets the watchdog timer, not just ones with MapResponse.KeepAlive set.

Labels: enhancement, tech debt
