fix(server): eliminate blocking I/O from async paths (boot-storm stalls)#30

Merged
Zorlin merged 3 commits into main from fix/blocking-artifact-serving on Jun 21, 2026
@Zorlin Zorlin commented Jun 21, 2026

Collaborator

Root cause

Under a boot storm (10+ machines imaging concurrently), the daemon would
stall on clients, with initramfs transfers hanging partway, even though no
single handler was "slow". The cause is synchronous I/O running directly
on tokio worker threads.

tokio's multi-threaded runtime has only ~CPU-count workers. Any blocking
call on a worker (std::fs::*, Path::exists(), block_in_place,
sync DNS) pins that worker for its duration. With ten machines all needing
the runtime, one machine hitting a multi-ms blocker stalls everyone
else's transfers, including initramfs. The artifact handlers themselves
stream correctly (verified), but a blocker elsewhere on the runtime
starves them.

What this changes

Eliminates blocking I/O from every async serving path reachable during
imaging (3 commits):

  1. HTTP artifact & ISO/admin handlers (api.rs) — every Path::exists(),
    std::fs::metadata/write/create_dir_all/read_dir converted to async
    (tokio::fs::try_exists/metadata/write/create_dir_all); read_dir
    listings moved to spawn_blocking. Covers: spark/efi/grub/memtest/ipxe-efi,
    static, pxelinux, boot_asset, debian boot_asset, ipxe_artifact,
    os_image, stream cache, mage/ipxe download + apkovl generation (the
    std::fs::write of a full 295 MB modloop was the single worst offender),
    ISO list/boot-mode/delete, dev-mode toggle, credential-delete.
  2. Proxmox discovery reverse-DNS (discovery.rs) —
    block_in_place(dns_lookup) (one per host, seconds when unreachable, on a
    worker) → concurrent spawn_blocking on the blocking pool.
  3. UI/auth password-file reads (ui.rs, auth.rs) — async
    read_to_string/try_exists/remove_file.

Verified

  • cargo clippy -p dragonfly-server clean; cargo fmt --check clean.
  • cargo test -p dragonfly-server: 190 passed, 0 failed.
  • Deployed to prod (10.7.1.100): API 200/200; initramfs Range → 206 (1024 B),
    suffix range → 206 (2048 B); full fetch 200 / 26910594 B; 15 concurrent
    fetches complete in 1 s with all bytes. No regression.

Out of scope (not reached during imaging)

Setup/startup-only functions remain synchronous by design: they run at
startup or explicit reconfiguration, never during steady-state imaging, and
converting their std::process::Command/tar logic is a separate, riskier
change. These are mode::configure_* / deploy_k3s_*, ha::ensure_rqlite_binary,
and lib::run startup. Flagging for a follow-up if zero blocking I/O during
reconfiguration is also wanted.

🤖 Generated with Claude Code

Zorlin and others added 3 commits June 21, 2026 17:46
These handlers ran synchronous filesystem calls (Path::exists,
std::fs::metadata/write/create_dir_all/read_dir) directly on the tokio
worker thread. tokio has only ~CPU-count workers, so under a boot storm a
single multi-ms sync op pins a worker; with ten machines imaging at once,
one blocker stalls every other client's transfer (initramfs included) -
the daemon hangs on clients even though no single handler is slow.

Convert the whole serving path to async I/O:
- artifact handlers: spark/efi/grub/memtest/ipxe-efi, static, pxelinux,
  boot_asset, debian boot_asset, ipxe_artifact, os_image, stream cache
- mage/ipxe download + apkovl generation (std::fs::write of full files was
  the worst offender - modloop is 295MB written synchronously)
- ISO list -> spawn_blocking(read_dir+metadata); boot-mode, delete
- dev-mode toggle + credential-delete admin handlers

exists() -> tokio::fs::try_exists(); read_dir -> spawn_blocking.

Co-Authored-By: Claude <noreply@anthropic.com>
discover_proxmox_handler ran tokio::task::block_in_place(dns_lookup) for
every discovered host, on a worker thread. Reverse-DNS stalls for seconds
when a host is unreachable, pinning the worker the whole time - and
discovery runs concurrently with imaging. Offload each lookup to the
blocking pool (spawn_blocking) and run them concurrently via join_all, so
discovery can never stall artifact transfers for the rest of the daemon.

Co-Authored-By: Claude <noreply@anthropic.com>
settings_page / settings_page_section / update_settings /
generate_default_credentials read and removed initial_password.txt via
synchronous std::fs on the worker. Convert to tokio::fs::read_to_string /
try_exists / remove_file so an open dashboard or a login during a boot
storm cannot pin a worker.

Co-Authored-By: Claude <noreply@anthropic.com>
@Zorlin Zorlin merged commit 3044c5e into main Jun 21, 2026
4 checks passed
@Zorlin Zorlin deleted the fix/blocking-artifact-serving branch June 22, 2026 07:58