fix(server): replace stall-timeout with TCP keepalive (stop aborting slow clients)#33

Merged
Zorlin merged 1 commit into main from fix/remove-stream-stall-timeout
Jun 21, 2026

@Zorlin Zorlin commented Jun 21, 2026

Collaborator

Why

#31 added a progress-based stall timeout. It could not tell a dead peer
from a slow one — both make zero read progress — so it aborted legitimate
slow clients mid-download. k8s08 (BIOS/i386 iPXE on a slow box) read
initramfs to ~40–54 %, iPXE paused >10 s, the timer slammed the connection,
iPXE re-fetched from zero, re-paused: an infinite restart loop — it could
never finish. Confirmed in the logs: "Client stalled … aborting" fired every
~10 s on k8s08's /boot/i386/initramfs, while 51 clean x86_64 serves and the
other machines' installs sailed through.
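
The restart loop described above can be reproduced with a small deterministic model. This is only a sketch: the function names, the 45 % pause point and the 12 s pause are illustrative assumptions chosen to fall inside the ranges reported above, not values taken from the server code.

```rust
/// One simulated download attempt. The client reads `total` units but pauses
/// for `pause_secs` once it reaches `pause_at`. Returns Err(offset) if a
/// progress timer of `stall_timeout` seconds fires during that pause.
/// (Hypothetical model; not the server's actual code.)
fn attempt(
    total: u64,
    pause_at: u64,
    pause_secs: u64,
    stall_timeout: Option<u64>,
) -> Result<(), u64> {
    match stall_timeout {
        // Zero read progress for longer than the limit: the timer cannot
        // know the peer is merely slow, so it aborts.
        Some(limit) if pause_at < total && pause_secs > limit => Err(pause_at),
        // No timer armed, or the pause is short enough: the transfer completes.
        _ => Ok(()),
    }
}

/// Retry from zero after every abort, as iPXE does. Returns the number of
/// the first attempt that completes, or None if none of them do.
fn download(
    total: u64,
    pause_at: u64,
    pause_secs: u64,
    stall_timeout: Option<u64>,
    max_attempts: u32,
) -> Option<u32> {
    (1..=max_attempts)
        .find(|_| attempt(total, pause_at, pause_secs, stall_timeout).is_ok())
}

fn main() {
    // 10 s progress timer, client pauses 12 s at 45 %: every retry aborts.
    println!("with timer:    {:?}", download(100, 45, 12, Some(10), 1000));
    // Same client with no progress timer: the first attempt completes.
    println!("without timer: {:?}", download(100, 45, 12, None, 1000));
}
```

Because the pause recurs at the same point on every retry, raising the retry count does not help; only removing the timer (or making it longer than the pause) lets the client finish.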

Fix

  1. Revert the progress-timeout — restore plain tx.send backpressure. The
    stream now paces to the client's own read rate, so a slow client drains
    slowly and completes.
  2. TCP keepalive + TCP_NODELAY on every accepted connection via axum 0.8's
    ListenerExt::tap_io (socket2 SockRef, 60 s idle → probe every 15 s).
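
The backpressure behaviour in step 1 can be sketched with a bounded channel. The example below uses the standard library's `sync_channel` as a stand-in for the server's `tx.send` (the real channel type is not shown in this PR), and the function name, chunk size, and delays are assumptions for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

/// Stream `chunks` chunks through a bounded channel to a reader that sleeps
/// `read_delay` per chunk. Returns how many chunks the reader received.
/// (Hypothetical helper; a stand-in for the server's streaming path.)
fn stream_with_backpressure(chunks: usize, capacity: usize, read_delay: Duration) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    // The slow reader stands in for the client draining the socket.
    let reader = thread::spawn(move || {
        let mut received = 0;
        for _chunk in rx {
            thread::sleep(read_delay);
            received += 1;
        }
        received
    });

    // The producer has no timer: `send` simply blocks while the channel is
    // full, so the stream is paced by the reader's own rate.
    for i in 0..chunks {
        if tx.send(vec![i as u8; 16]).is_err() {
            break; // the reader went away; stop producing
        }
    }
    drop(tx); // close the channel so the reader's loop ends

    reader.join().expect("reader thread panicked")
}

fn main() {
    let got = stream_with_backpressure(20, 2, Duration::from_millis(5));
    println!("reader received {got} of 20 chunks");
}
```

The producer never decides that the reader is too slow; it only stops when the receiving side is actually gone, which is the property the progress timer lacked.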

Keepalive is the correct dead-peer detector: it probes the socket and only
closes it when the peer is truly unresponsive (no ACKs). A slow client still
ACKs, so it is never harmed — the exact property a progress timer lacks.
Idle-vanished peers are reaped in ~1–2 min instead of the OS default (~hours);
streams that stall with unacknowledged data in flight are reaped when the
kernel's TCP retransmission timeout expires.
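
As a rough check on these timings, the worst-case time to reap an idle dead peer is the idle time plus one probe interval per unanswered probe. The sketch below assumes that formula. The 60 s idle time and 15 s interval come from this PR, but the probe count is not stated here, so the value 4 is a hypothetical chosen for illustration. The Linux defaults used for comparison are 7200 s idle, 75 s interval, and 9 probes.

```rust
use std::time::Duration;

/// Worst-case time before an idle, unresponsive peer is dropped:
/// the idle period before the first probe, plus one interval for each
/// probe that goes unanswered. (Hypothetical helper for illustration.)
fn idle_detection_time(idle: Duration, interval: Duration, probes: u32) -> Duration {
    idle + interval * probes
}

fn main() {
    // Settings from this PR, with an ASSUMED probe count of 4.
    let tuned = idle_detection_time(Duration::from_secs(60), Duration::from_secs(15), 4);
    // Linux defaults: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
    let default = idle_detection_time(Duration::from_secs(7200), Duration::from_secs(75), 9);

    println!("tuned:   {} s", tuned.as_secs());
    println!("default: {} s", default.as_secs());
}
```

The result depends on the probe count: if the OS default of 9 probes were left in place, the same 60 s / 15 s settings would give 195 s rather than 120 s.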

Verified

🤖 Generated with Claude Code

The progress-based stall timeout (#31) could not tell a dead peer from a
slow one — both make no read progress — so it aborted legitimate slow
clients mid-download. k8s08 (BIOS/i386 iPXE on a slow box) read initramfs
to ~40-54%, iPXE paused >10s, the timer slammed the connection, iPXE
re-fetched from zero, re-paused: an infinite restart loop that never
finished.

Restore plain tx.send backpressure (the stream paces to the client's own
read rate, so a slow client drains slowly and completes) and add TCP
keepalive + TCP_NODELAY on every accepted connection via axum 0.8's
ListenerExt::tap_io. Keepalive is the correct dead-peer detector: it
probes the socket and only closes it when the peer is truly unresponsive
(no ACKs) — a slow client still ACKs, so it is never harmed. Idle-vanished
peers are reaped in ~1-2min instead of the OS default of hours; stalled
in-flight streams are reaped by TCP retransmission.

Co-Authored-By: Claude <noreply@anthropic.com>
@Zorlin Zorlin merged commit fb3e84d into main Jun 21, 2026
4 checks passed
@Zorlin Zorlin deleted the fix/remove-stream-stall-timeout branch June 22, 2026 07:58