fix(server): eliminate blocking I/O from async paths (boot-storm stalls) by Zorlin · Pull Request #30 · riffcc/dragonfly

Zorlin · 2026-06-21T16:49:24Z

Root cause

Under a boot storm (10+ machines imaging concurrently), the daemon would
stall on clients — initramfs transfers hanging partway — even though no
single handler was "slow". The cause is synchronous I/O running directly
on tokio worker threads.

By default, tokio's multi-threaded runtime runs only about one worker
thread per CPU core. Any blocking call made on a worker (std::fs::*,
Path::exists(), block_in_place, synchronous DNS) pins that worker for its
duration. With ten machines all needing the runtime, one machine hitting
a multi-millisecond blocker stalls everyone else's transfers, initramfs
included. The artifact handlers themselves stream correctly (verified),
but a blocker elsewhere on the runtime starves them.
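
The starvation argument above can be made concrete with a toy model. This is only a sketch, not code from the PR: it assumes jobs are dispatched FIFO to whichever worker frees up first, and that each job holds its worker for its full cost without yielding (which is how a blocking call behaves, and unlike a well-behaved async task that yields at every .await). The function name `finish_times` and the 300 ms and 1 ms costs are hypothetical.

```rust
/// Toy FIFO model of a fixed worker pool. Job i holds one worker for
/// `costs_ms[i]` milliseconds with no yielding (how a blocking call behaves).
/// Returns the time in ms at which each job finishes.
fn finish_times(workers: usize, costs_ms: &[u64]) -> Vec<u64> {
    assert!(workers > 0, "need at least one worker");
    // free_at[w] is the time at which worker w next becomes idle.
    let mut free_at = vec![0u64; workers];
    let mut finished = Vec::with_capacity(costs_ms.len());
    for &cost in costs_ms {
        // Hand the job to the worker that becomes idle first
        // (ties go to the lowest-numbered worker).
        let (idx, start) = free_at
            .iter()
            .copied()
            .enumerate()
            .min_by_key(|&(_, t)| t)
            .expect("workers > 0");
        free_at[idx] = start + cost;
        finished.push(start + cost);
    }
    finished
}
```

Under these assumptions, `finish_times(2, &[300, 1, 1, 1, 1])` gives `[300, 1, 2, 3, 4]`: one pinned worker leaves the short jobs queuing on the remaining one. `finish_times(2, &[300, 300, 1, 1])` gives `[300, 300, 301, 301]`: once every worker is pinned, even 1 ms jobs wait out the full blocker.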

What this changes

Eliminates blocking I/O from every async serving path reachable during
imaging (3 commits):

- HTTP artifact & ISO/admin handlers (api.rs): every Path::exists() and
  std::fs::metadata/write/create_dir_all/read_dir call converted to async
  (tokio::fs::try_exists/metadata/write/create_dir_all); read_dir
  listings moved to spawn_blocking. Covers: spark/efi/grub/memtest/ipxe-efi,
  static, pxelinux, boot_asset, debian boot_asset, ipxe_artifact,
  os_image, stream cache, mage/ipxe download + apkovl generation (the
  std::fs::write of a full 295 MB modloop was the single worst offender),
  ISO list/boot-mode/delete, dev-mode toggle, credential-delete.
- Proxmox discovery reverse-DNS (discovery.rs):
  block_in_place(dns_lookup) (one per host, seconds when unreachable, on a
  worker) → concurrent spawn_blocking on the blocking pool.
- UI/auth password-file reads (ui.rs, auth.rs): converted to async
  read_to_string/try_exists/remove_file.
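
For the read_dir listings, the directory walk itself stays synchronous; what changes is the thread it runs on. Below is a minimal sketch of a blocking listing function of the kind that could be handed to tokio::task::spawn_blocking. The name `list_files_blocking`, its signature, and the filter-by-extension behavior are hypothetical and not taken from the PR. The block uses only std so it compiles on its own; the tokio call appears only in a comment.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Blocking directory listing: (file name, size in bytes) for every regular
/// file in `dir` whose extension equals `ext`, sorted by file name.
/// Hypothetical helper for illustration, not code from the PR.
///
/// In an async handler this would be moved off the worker thread, e.g.
///   tokio::task::spawn_blocking(move || list_files_blocking(&dir, "iso")).await
fn list_files_blocking(dir: &Path, ext: &str) -> io::Result<Vec<(String, u64)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // Each metadata() call is a blocking syscall.
        let meta = entry.metadata()?;
        let path = entry.path();
        let matches = path.extension().and_then(|e| e.to_str()) == Some(ext);
        if meta.is_file() && matches {
            let name = entry.file_name().to_string_lossy().into_owned();
            out.push((name, meta.len()));
        }
    }
    // read_dir order is platform-dependent, so sort for a stable listing.
    out.sort();
    Ok(out)
}
```

Keeping the whole walk in one closure means a single hop to the blocking pool per listing, rather than one per directory entry.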

Verified

- cargo clippy -p dragonfly-server clean; cargo fmt --check clean.
- cargo test -p dragonfly-server: 190 passed, 0 failed.
- Deployed to prod (10.7.1.100): API 200/200; initramfs Range → 206
  (1024 B), suffix range → 206 (2048 B); full fetch 200 / 26910594 B;
  15 concurrent fetches complete in 1 s with all bytes. No regression.
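
To read the Range results above: a 206 response carries exactly the bytes selected by the request's Range header. The sketch below resolves a single-range header against a resource length, following the byte-range rules in RFC 9110. The PR does not state which headers were sent; `bytes=0-1023` and `bytes=-2048` are assumptions chosen to match the reported 1024 B and 2048 B. The function names are hypothetical and this is not the server's implementation.

```rust
/// Resolve a single-range `Range` header value against a resource of `len`
/// bytes. Returns the inclusive (first, last) byte positions, or None if the
/// header is malformed or cannot be satisfied. Multi-range requests
/// (comma-separated) are not handled and also yield None.
fn resolve_range(header: &str, len: u64) -> Option<(u64, u64)> {
    let spec = header.strip_prefix("bytes=")?;
    let (first, last) = spec.split_once('-')?;
    if len == 0 {
        return None;
    }
    if first.is_empty() {
        // Suffix range "bytes=-N": the final N bytes of the resource.
        let n: u64 = last.parse().ok()?;
        if n == 0 {
            return None;
        }
        return Some((len.saturating_sub(n), len - 1));
    }
    let start: u64 = first.parse().ok()?;
    if start >= len {
        return None;
    }
    let end = if last.is_empty() {
        len - 1 // open-ended "bytes=N-"
    } else {
        // A last position past the end is clamped to the final byte.
        last.parse::<u64>().ok()?.min(len - 1)
    };
    if end < start {
        return None;
    }
    Some((start, end))
}

/// Number of body bytes a 206 response carries for an inclusive range.
fn range_len(range: (u64, u64)) -> u64 {
    range.1 - range.0 + 1
}
```

With the full-fetch size of 26910594 B, `bytes=0-1023` resolves to positions 0 through 1023 (1024 B), and `bytes=-2048` resolves to positions 26908546 through 26910593 (2048 B).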

Out of scope (not reached during imaging)

Setup/startup-only functions remain synchronous by design — they run at
startup or explicit reconfiguration, never during steady-state imaging, and
converting their std::process::Command/tar logic is a separate, riskier
change: mode::configure_* / deploy_k3s_*, ha::ensure_rqlite_binary,
lib::run startup. Flagging for a follow-up if zero blocking I/O during
reconfiguration is also wanted.

🤖 Generated with Claude Code

These handlers ran synchronous filesystem calls (Path::exists, std::fs::metadata/write/create_dir_all/read_dir) directly on the tokio worker thread. tokio has only ~CPU-count workers, so under a boot storm a single multi-ms sync op pins a worker; with ten machines imaging at once, one blocker stalls every other client's transfer (initramfs included) - the daemon hangs on clients even though no single handler is slow.

Convert the whole serving path to async I/O:
- artifact handlers: spark/efi/grub/memtest/ipxe-efi, static, pxelinux, boot_asset, debian boot_asset, ipxe_artifact, os_image, stream cache
- mage/ipxe download + apkovl generation (std::fs::write of full files was the worst offender - modloop is 295MB written synchronously)
- ISO list -> spawn_blocking(read_dir+metadata); boot-mode, delete
- dev-mode toggle + credential-delete admin handlers

exists() -> tokio::fs::try_exists(); read_dir -> spawn_blocking.

Co-Authored-By: Claude <noreply@anthropic.com>

discover_proxmox_handler ran tokio::task::block_in_place(dns_lookup) for every discovered host, on a worker thread. Reverse-DNS stalls for seconds when a host is unreachable, pinning the worker the whole time - and discovery runs concurrently with imaging. Offload each lookup to the blocking pool (spawn_blocking) and run them concurrently via join_all, so discovery can never stall artifact transfers for the rest of the daemon.

Co-Authored-By: Claude <noreply@anthropic.com>

settings_page / settings_page_section / update_settings / generate_default_credentials read and removed initial_password.txt via synchronous std::fs on the worker. Convert to tokio::fs::read_to_string / try_exists / remove_file so an open dashboard or a login during a boot storm cannot pin a worker.

Co-Authored-By: Claude <noreply@anthropic.com>

Zorlin and others added 3 commits June 21, 2026 17:46

Zorlin merged commit 3044c5e into main Jun 21, 2026
4 checks passed

Zorlin deleted the fix/blocking-artifact-serving branch June 22, 2026 07:58

Zorlin merged 3 commits into main from fix/blocking-artifact-serving
