fix(server): replace stall-timeout with TCP keepalive (stop aborting slow clients) by Zorlin · Pull Request #33 · riffcc/dragonfly

Zorlin · 2026-06-21T17:59:56Z

Why

#31 added a progress-based stall timeout. It could not tell a dead peer
from a slow one, since both make zero read progress, so it aborted legitimate
slow clients mid-download. k8s08 (BIOS/i386 iPXE on a slow box) read
initramfs to ~40–54 %, iPXE paused for >10 s, the timer closed the connection,
and iPXE re-fetched from zero and paused again: an infinite restart loop that
could never finish. The logs confirm it: "Client stalled … aborting" fired every
~10 s on k8s08's /boot/i386/initramfs, while 51 clean x86_64 serves and the
other machines' installs completed without trouble.

Fix

Revert the progress timeout and restore plain tx.send backpressure. The
stream now paces to the client's own read rate, so a slow client drains
slowly and completes.
Enable TCP keepalive and TCP_NODELAY on every accepted connection via axum 0.8's
ListenerExt::tap_io (socket2 SockRef; 60 s idle, then a probe every 15 s).
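
To illustrate the backpressure half of the fix, here is a minimal sketch using only the standard library. It is not the server's code: std's bounded `sync_channel` stands in for whatever channel sits behind `tx.send` in the real handler (presumably an async one), and the function name `stream_with_backpressure` and its parameters are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Streams `chunks` to a deliberately slow reader through a bounded channel.
/// Returns the chunks the reader received and the largest number of chunks
/// the producer was ever ahead of the reader.
/// (Hypothetical helper for illustration; not taken from the PR.)
fn stream_with_backpressure(
    chunks: Vec<Vec<u8>>,
    capacity: usize,
    read_delay: Duration,
) -> (Vec<Vec<u8>>, usize) {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    let sent = Arc::new(AtomicUsize::new(0));
    let sent_by_producer = Arc::clone(&sent);

    // Producer: no timer and no abort. `send` simply blocks while the
    // buffer is full, so the producer runs at the reader's pace.
    let producer = thread::spawn(move || {
        for chunk in chunks {
            if tx.send(chunk).is_err() {
                break; // the reader hung up
            }
            sent_by_producer.fetch_add(1, Ordering::SeqCst);
        }
        // `tx` is dropped here, which ends the reader's loop below.
    });

    // Reader: a slow client that pauses after every chunk.
    let mut received = Vec::new();
    let mut max_lead = 0usize;
    for chunk in rx {
        received.push(chunk);
        let lead = sent.load(Ordering::SeqCst).saturating_sub(received.len());
        max_lead = max_lead.max(lead);
        thread::sleep(read_delay);
    }

    producer.join().expect("producer thread panicked");
    (received, max_lead)
}
```

However long the reader pauses, every chunk arrives in order, and the producer never gets more than `capacity` chunks ahead. Nothing in this path decides that a quiet reader is dead, which is why dead-peer detection has to come from somewhere else.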

Keepalive is the correct dead-peer detector: it probes the socket and only
closes it when the peer is truly unresponsive (no ACKs). A slow client still
ACKs, so it is never harmed — the exact property a progress timer lacks.
Idle-vanished peers are reaped in ~1–2 min instead of the OS default (~hours);
stalled in-flight streams are reaped by TCP retransmission.
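
The difference between the two detectors can be sketched as a toy model. The `Peer` struct and the three functions below are hypothetical names introduced for illustration. The probe count is a parameter because the PR does not state it, and the formula assumes the usual behaviour where a connection is dropped after the idle period plus one probe interval per unanswered probe. The model also ignores that keepalive probes are only sent on an idle connection.

```rust
use std::time::Duration;

/// What the server can observe about a peer (toy model, not the PR's code).
#[derive(Clone, Copy)]
struct Peer {
    /// Whether the peer's TCP stack still ACKs keepalive probes.
    acks_probes: bool,
    /// How long the peer has gone without reading any bytes.
    quiet_for: Duration,
}

/// Progress-based stall timer: looks only at read progress, so a slow
/// peer and a dead peer are indistinguishable to it.
fn stall_timer_aborts(peer: Peer, limit: Duration) -> bool {
    peer.quiet_for > limit
}

/// Keepalive: closes the socket only when probes go unanswered.
fn keepalive_closes(peer: Peer) -> bool {
    !peer.acks_probes
}

/// Time from the last traffic until a vanished idle peer is declared dead:
/// the idle period, then `probes` unanswered probes spaced `interval` apart.
fn idle_detection_time(idle: Duration, interval: Duration, probes: u32) -> Duration {
    idle + interval * probes
}
```

A peer that is quiet for 12 s but still ACKs is aborted by a 10 s stall timer and left alone by keepalive; a peer that no longer ACKs is closed by both. With the 60 s idle time and 15 s interval above, four unanswered probes give 60 + 4 × 15 = 120 s.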

Verified

190 tests pass; clippy + fmt clean.
Deployed to prod: API 200/200, initramfs full 200/26910594 B in 0.15 s, clean
startup.
Supersedes the behavioral change of #31, fix(server): abort stalled artifact transfers (slowloris / flaky client): the stall timeout is gone. The
intent of #31 (dead-peer cleanup) is now handled correctly by keepalive.

🤖 Generated with Claude Code

The progress-based stall timeout (#31) could not tell a dead peer from a slow one (both make no read progress), so it aborted legitimate slow clients mid-download. k8s08 (BIOS/i386 iPXE on a slow box) read initramfs to ~40-54%, iPXE paused >10s, the timer slammed the connection, iPXE re-fetched from zero, re-paused: an infinite restart loop that never finished.

Restore plain tx.send backpressure (the stream paces to the client's own read rate, so a slow client drains slowly and completes) and add TCP keepalive + TCP_NODELAY on every accepted connection via axum 0.8's ListenerExt::tap_io.

Keepalive is the correct dead-peer detector: it probes the socket and only closes it when the peer is truly unresponsive (no ACKs). A slow client still ACKs, so it is never harmed. Idle-vanished peers are reaped in ~1-2min instead of the OS default of hours; stalled in-flight streams are reaped by TCP retransmission.

Co-Authored-By: Claude <noreply@anthropic.com>

Zorlin merged commit fb3e84d into main Jun 21, 2026
4 checks passed

Zorlin deleted the fix/remove-stream-stall-timeout branch June 22, 2026 07:58

Zorlin merged 1 commit into main from fix/remove-stream-stall-timeout
