fix(server): abort stalled artifact transfers (slowloris / flaky client) #31

Merged
Zorlin merged 2 commits into main from fix/stream-stall-timeout on Jun 21, 2026

@Zorlin Zorlin commented Jun 21, 2026

Collaborator

Why

k8s03 got stuck at 57% on /boot/i386/initramfs (kernel loaded fine).
Diagnosis: read_file_as_stream's tx.send(chunk).await has no timeout,
and the HTTP server has no app-level write timeout. So a client that stops
reading mid-transfer — a network blip, a half-open socket, or a slowloris —
pins the streaming task, its open file handle, and buffered memory until the
OS TCP timeout (~hours). iPXE's download bar freezes at whatever % it
reached, and the server just waits.

Confirmed with a slowloris probe against prod: 25 glacial 1 KB/s clients
held 31 connections to :3000 indefinitely (each holding a streaming task +
file handle + ~2 MB buffer). Healthy transfers still zipped through (parked
.awaits don't block workers), but the resource hold is the vector.

Fix

STREAM_STALL_TIMEOUT (60 s) + a send_chunk helper that aborts the transfer
when the client makes zero progress for the whole window. Slow-but-progressing
clients never trip it — tx.send only stalls when no channel slot frees, i.e.
the client read nothing. Aborting closes the connection so iPXE retries instead
of hanging at a partial %, and bounds the slowloris resource hold.

Wired into both the full-file chunk loop and the single-chunk range path.

Tests

send_chunk: Sent (receiver drains), ReceiverGone (receiver dropped), and
Stalled — the last uses a paused tokio clock (start_paused + time::advance)
to reproduce the exact stall condition deterministically, with no real sleeps.
193 tests pass; clippy clean; fmt clean.

Verified

Deployed to prod (10.7.1.100): API 200/200, initramfs Range → 206, suffix → 206,
full fetch 200 / 26910594 B in 0.09 s. No regression.

🤖 Generated with Claude Code

Zorlin and others added 2 commits June 21, 2026 18:10
read_file_as_stream's tx.send(chunk).await had no timeout, and the HTTP
server has no app-level write timeout. A client that stops reading
mid-transfer - a network blip (k8s03's initramfs stuck at 57%), a
half-open socket, or a slowloris - pinned the streaming task, its open
file handle, and buffered memory until the OS TCP timeout (~hours).
Confirmed with a slowloris probe: 25 glacial 1 KB/s clients held 31
:3000 connections indefinitely.

Add STREAM_STALL_TIMEOUT (60s) + a send_chunk helper that aborts the
transfer when the client makes zero progress for the window. Slow-but-
progressing clients never trip it (a send only stalls when no channel
slot frees). Aborting closes the connection so iPXE retries instead of
hanging at a partial percentage, and bounds slowloris resource hold.

Tests: send_chunk Sent / ReceiverGone / Stalled (the last uses a
paused tokio clock to reproduce the stall with no real sleeps).

Co-Authored-By: Claude <noreply@anthropic.com>

For a customer imaging a machine, 60s of zero progress is too long to wait
before the transfer aborts and iPXE retries. 10s still only trips on a
client making zero progress (a slow-but-progressing client never stalls a
send), and 10s of no reads is an unambiguous stall on a local-bridge link
where healthy transfers run 80+ MB/s. This also bounds the slowloris hold
more tightly.

Co-Authored-By: Claude <noreply@anthropic.com>
@Zorlin Zorlin merged commit bd5b256 into main Jun 21, 2026
4 checks passed
@Zorlin Zorlin deleted the fix/stream-stall-timeout branch June 21, 2026 17:22