fix(server): abort stalled artifact transfers (slowloris / flaky client) #31
Merged
read_file_as_stream's tx.send(chunk).await had no timeout, and the HTTP server has no app-level write timeout. A client that stops reading mid-transfer - a network blip (k8s03's initramfs stuck at 57%), a half-open socket, or a slowloris - pinned the streaming task, its open file handle, and buffered memory until the OS TCP timeout (~hours). Confirmed with a slowloris probe: 25 glacial 1 KB/s clients held 31 :3000 connections indefinitely.

Add STREAM_STALL_TIMEOUT (60s) + a send_chunk helper that aborts the transfer when the client makes zero progress for the window. Slow-but-progressing clients never trip it (a send only stalls when no channel slot frees). Aborting closes the connection so iPXE retries instead of hanging at a partial percentage, and bounds slowloris resource hold.

Tests: send_chunk Sent / ReceiverGone / Stalled (the last uses a paused tokio clock to reproduce the stall with no real sleeps).

Co-Authored-By: Claude <noreply@anthropic.com>
For a customer imaging a machine, 60s of zero progress is too long to sit before the transfer aborts and iPXE retries. 10s still only trips on a client making zero progress (a slow-but-progressing client never stalls a send), and 10s of no reads is an unambiguous stall on a local-bridge link where healthy transfers run 80+ MB/s. Bounds the slowloris hold tighter too.

Co-Authored-By: Claude <noreply@anthropic.com>
Why
k8s03 got stuck at 57% on /boot/i386/initramfs (kernel loaded fine).
Diagnosis: read_file_as_stream's tx.send(chunk).await has no timeout,
and the HTTP server has no app-level write timeout. So a client that stops
reading mid-transfer — a network blip, a half-open socket, or a slowloris —
pins the streaming task, its open file handle, and buffered memory until the
OS TCP timeout (~hours). iPXE's download bar freezes at whatever % it
reached, and the server just waits.
Confirmed with a slowloris probe against prod: 25 glacial 1 KB/s clients
held 31 connections to :3000 indefinitely (each holding a streaming task +
file handle + ~2 MB buffer). Healthy transfers still zipped through (parked
.awaits don't block workers), but the resource hold is the vector.
Fix
STREAM_STALL_TIMEOUT (60 s) + a send_chunk helper that aborts the transfer
when the client makes zero progress for the whole window. Slow-but-progressing
clients never trip it: tx.send only stalls when no channel slot frees, i.e.
the client read nothing. Aborting closes the connection so iPXE retries instead
of hanging at a partial %, and bounds the slowloris resource hold.
Wired into both the full-file chunk loop and the single-chunk range path.
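The stall rule above can be sketched in a dependency-free form. This is not the PR's code: the real send_chunk is async and bounds tokio's tx.send(chunk).await, whereas the sketch below swaps in std's bounded sync_channel and polls try_send against a deadline so that it runs without external crates. The SendOutcome enum, the function signature, the 1 ms poll interval, and the 200 ms window are assumptions for illustration; only the three outcome names come from this PR's test list.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of one bounded send (hypothetical type; names mirror the PR's tests).
#[derive(Debug, PartialEq)]
enum SendOutcome {
    Sent,
    ReceiverGone,
    Stalled,
}

/// Synchronous stand-in for the PR's async helper (an assumption, not its code).
/// Every call starts a fresh deadline, so only a full window in which no
/// channel slot frees up is reported as a stall.
fn send_chunk(tx: &SyncSender<Vec<u8>>, chunk: Vec<u8>, stall: Duration) -> SendOutcome {
    let deadline = Instant::now() + stall;
    let mut pending = chunk;
    loop {
        match tx.try_send(pending) {
            Ok(()) => return SendOutcome::Sent,
            Err(TrySendError::Disconnected(_)) => return SendOutcome::ReceiverGone,
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return SendOutcome::Stalled;
                }
                // Channel still full: take the chunk back and poll again shortly.
                pending = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

fn main() {
    let stall = Duration::from_millis(200);

    // Slow but progressing reader: one chunk every 20 ms never trips the window.
    let (tx, rx) = sync_channel::<Vec<u8>>(1);
    let reader = thread::spawn(move || {
        let mut received: usize = 0;
        while rx.recv().is_ok() {
            received += 1;
            thread::sleep(Duration::from_millis(20));
        }
        received
    });
    for _ in 0..10 {
        assert_eq!(send_chunk(&tx, vec![0u8; 1024], stall), SendOutcome::Sent);
    }
    drop(tx); // closing the channel lets the reader loop finish
    assert_eq!(reader.join().unwrap(), 10);

    // Stalled reader: the receiver stays alive but never reads.
    let (tx, rx) = sync_channel::<Vec<u8>>(1);
    assert_eq!(send_chunk(&tx, vec![0u8; 1024], stall), SendOutcome::Sent); // fills the only slot
    assert_eq!(send_chunk(&tx, vec![0u8; 1024], stall), SendOutcome::Stalled);

    // Receiver dropped: the send fails at once instead of waiting out the window.
    drop(rx);
    assert_eq!(send_chunk(&tx, vec![0u8; 1024], stall), SendOutcome::ReceiverGone);
}
```

Because the deadline restarts on every chunk, the window measures time without any progress, not total transfer time. That is why a 1 KB/s reader that keeps draining the channel is left alone while a reader that drains nothing is cut off.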
Tests
send_chunk: Sent (receiver drains), ReceiverGone (receiver dropped), and
Stalled. The last uses a paused tokio clock (start_paused + time::advance)
to reproduce the exact stall condition deterministically, with no real sleeps.
193 tests pass; clippy clean; fmt clean.
Verified
Deployed to prod (10.7.1.100): API 200/200, initramfs Range → 206, suffix → 206,
full fetch 200 / 26910594 B in 0.09 s. No regression.
🤖 Generated with Claude Code