fix(server): replace stall-timeout with TCP keepalive (stop aborting slow clients) #33
Merged
The progress-based stall timeout (#31) could not tell a dead peer from a slow one (both make no read progress), so it aborted legitimate slow clients mid-download. k8s08 (BIOS/i386 iPXE on a slow box) read initramfs to ~40-54%, iPXE paused >10s, the timer slammed the connection, iPXE re-fetched from zero, re-paused: an infinite restart loop that never finished.

Restore plain tx.send backpressure (the stream paces to the client's own read rate, so a slow client drains slowly and completes) and add TCP keepalive + TCP_NODELAY on every accepted connection via axum 0.8's ListenerExt::tap_io. Keepalive is the correct dead-peer detector: it probes the socket and only closes it when the peer is truly unresponsive (no ACKs). A slow client still ACKs, so it is never harmed. Idle-vanished peers are reaped in ~1-2min instead of the OS default of hours; stalled in-flight streams are reaped by TCP retransmission.

Co-Authored-By: Claude <noreply@anthropic.com>
Why
#31 added a progress-based stall timeout. It could not tell a dead peer
from a slow one — both make zero read progress — so it aborted legitimate
slow clients mid-download. k8s08 (BIOS/i386 iPXE on a slow box) read
initramfs to ~40–54 %, iPXE paused >10 s, the timer slammed the connection,
iPXE re-fetched from zero, re-paused: an infinite restart loop — it could
never finish. Confirmed in the logs:
`Client stalled … aborting` firing every ~10 s on k8s08's
`/boot/i386/initramfs`, while 51 clean `x86_64` serves and the other machines' installs sailed through.
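The ambiguity described above can be sketched in a few lines. This is an illustrative model, not the code from #31: the function name, the gap representation, and the 12 s pause are assumptions chosen to mirror the ">10 s" pause reported for k8s08. It shows that a timer which only watches read progress returns the same verdict for a vanished peer and for a client that pauses and would have resumed.

```rust
/// Hypothetical model of a progress-based stall timer.
/// Each entry is the gap in seconds between two moments of read progress;
/// `None` marks a peer that never reads again (a dead peer).
fn progress_timer_aborts(gaps_secs: &[Option<u64>], stall_timeout_secs: u64) -> bool {
    gaps_secs.iter().any(|gap| match gap {
        // A pause longer than the timeout trips the timer, even if the
        // client would have resumed reading afterwards.
        Some(secs) => *secs > stall_timeout_secs,
        // No further progress at all: the timer eventually fires.
        None => true,
    })
}

fn main() {
    // Assumed timeout of 10 s, matching the ">10 s" pause in the report.
    let stall_timeout_secs = 10;

    let dead_peer: &[Option<u64>] = &[Some(1), Some(1), None];
    // Hypothetical slow client: pauses 12 s once, then keeps reading.
    let slow_client: &[Option<u64>] = &[Some(1), Some(1), Some(12), Some(1)];
    let steady_client: &[Option<u64>] = &[Some(1), Some(2), Some(1)];

    // The timer cannot tell the first two apart: both are aborted.
    assert!(progress_timer_aborts(dead_peer, stall_timeout_secs));
    assert!(progress_timer_aborts(slow_client, stall_timeout_secs));
    assert!(!progress_timer_aborts(steady_client, stall_timeout_secs));

    println!("dead peer aborted:   {}", progress_timer_aborts(dead_peer, stall_timeout_secs));
    println!("slow client aborted: {}", progress_timer_aborts(slow_client, stall_timeout_secs));
}
```

Because an aborted download restarts from zero, a client that pauses at roughly the same point on every attempt is aborted on every attempt, which is the restart loop seen on k8s08.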
Fix
- Restore plain `tx.send` backpressure. The stream now paces to the client's own read rate, so a slow client drains slowly and completes.
- Add TCP keepalive + `TCP_NODELAY` on every accepted connection via axum 0.8's `ListenerExt::tap_io` (socket2 `SockRef`, 60 s idle → probe every 15 s).

Keepalive is the correct dead-peer detector: it probes the socket and only
closes it when the peer is truly unresponsive (no ACKs). A slow client still
ACKs, so it is never harmed — the exact property a progress timer lacks.
Idle-vanished peers are reaped in ~1–2 min instead of the OS default (~hours);
stalled in-flight streams are reaped by TCP retransmission.
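As a rough check on the "~1–2 min" figure, the worst-case time to reap an idle, vanished peer is the idle period plus one probe interval per unanswered probe. The sketch below uses the 60 s idle and 15 s interval stated above. The probe count is not given in this PR, so the values 1 to 4 are assumptions for illustration only. The comparison row uses the stock Linux defaults (7200 s idle, 75 s interval, 9 probes).

```rust
use std::time::Duration;

/// Worst-case time for TCP keepalive to declare an idle, vanished peer dead:
/// the idle period before the first probe, plus one interval for each
/// probe that goes unanswered.
fn keepalive_reap_time(idle: Duration, interval: Duration, probes: u32) -> Duration {
    idle + interval * probes
}

fn main() {
    let idle = Duration::from_secs(60); // from the PR: 60 s idle
    let interval = Duration::from_secs(15); // from the PR: probe every 15 s

    // Probe count is an assumption; the PR text does not state it.
    for probes in 1..=4u32 {
        let t = keepalive_reap_time(idle, interval, probes);
        println!("{probes} probe(s): reaped after {} s", t.as_secs());
    }

    // Stock Linux defaults for comparison: 7200 s idle, 75 s interval, 9 probes.
    let linux_default =
        keepalive_reap_time(Duration::from_secs(7200), Duration::from_secs(75), 9);
    println!("Linux defaults: reaped after {} s", linux_default.as_secs());
}
```

Under these assumptions the result ranges from 75 s to 120 s, consistent with the stated ~1–2 min, against a little over two hours with the Linux defaults.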
Verified
startup.
intent (dead-peer cleanup) is now handled correctly by keepalive.
🤖 Generated with Claude Code