fix(server): eliminate blocking I/O from async paths (boot-storm stalls) #30
Merged
These handlers ran synchronous filesystem calls (Path::exists, std::fs::metadata/write/create_dir_all/read_dir) directly on the tokio worker thread. tokio has only ~CPU-count workers, so under a boot storm a single multi-ms sync op pins a worker; with ten machines imaging at once, one blocker stalls every other client's transfer (initramfs included). The daemon hangs on clients even though no single handler is slow.

Convert the whole serving path to async I/O:
- artifact handlers: spark/efi/grub/memtest/ipxe-efi, static, pxelinux, boot_asset, debian boot_asset, ipxe_artifact, os_image, stream cache
- mage/ipxe download + apkovl generation (std::fs::write of full files was the worst offender: modloop is 295MB written synchronously)
- ISO list -> spawn_blocking(read_dir+metadata); boot-mode, delete
- dev-mode toggle + credential-delete admin handlers

exists() -> tokio::fs::try_exists(); read_dir -> spawn_blocking.

Co-Authored-By: Claude <noreply@anthropic.com>
discover_proxmox_handler ran tokio::task::block_in_place(dns_lookup) for every discovered host, on a worker thread. Reverse-DNS stalls for seconds when a host is unreachable, pinning the worker the whole time, and discovery runs concurrently with imaging. Offload each lookup to the blocking pool (spawn_blocking) and run them concurrently via join_all, so discovery can never stall artifact transfers for the rest of the daemon.

Co-Authored-By: Claude <noreply@anthropic.com>
settings_page / settings_page_section / update_settings / generate_default_credentials read and removed initial_password.txt via synchronous std::fs on the worker. Convert to tokio::fs::read_to_string / try_exists / remove_file so an open dashboard or a login during a boot storm cannot pin a worker.

Co-Authored-By: Claude <noreply@anthropic.com>
Root cause
Under a boot storm (10+ machines imaging concurrently), the daemon would
stall on clients — initramfs transfers hanging partway — even though no
single handler was "slow". The cause is synchronous I/O running directly
on tokio worker threads.
tokio's multi-threaded runtime has only ~CPU-count workers. Any blocking
syscall on a worker (std::fs::*, Path::exists(), block_in_place, sync DNS)
pins that worker for its duration. With ten machines all needing
the runtime, one machine hitting a multi-ms blocker stalls everyone
else's transfers — including initramfs. The artifact handlers themselves
stream correctly (verified), but a blocker elsewhere on the runtime
starves them.
What this changes
Eliminates blocking I/O from every async serving path reachable during
imaging (3 commits):
- api.rs: every Path::exists() and std::fs::metadata/write/create_dir_all/read_dir call converted to async (tokio::fs::try_exists/metadata/write/create_dir_all); read_dir listings moved to spawn_blocking. Covers: spark/efi/grub/memtest/ipxe-efi, static, pxelinux, boot_asset, debian boot_asset, ipxe_artifact, os_image, stream cache, mage/ipxe download + apkovl generation (the std::fs::write of a full 295 MB modloop was the single worst offender), ISO list/boot-mode/delete, dev-mode toggle, credential-delete.
- discovery.rs: block_in_place(dns_lookup) (one per host, seconds when unreachable, on a worker) → concurrent spawn_blocking on the blocking pool.
- ui.rs, auth.rs: async read_to_string/try_exists/remove_file.

Verified
- cargo clippy -p dragonfly-server clean; cargo fmt --check clean.
- cargo test -p dragonfly-server: 190 passed, 0 failed.
- suffix range → 206 (2048 B); full fetch 200 / 26910594 B; 15-concurrent completes in 1 s with all bytes. No regression.
Out of scope (not reached during imaging)
Setup/startup-only functions remain synchronous by design — they run at
startup or explicit reconfiguration, never during steady-state imaging, and
converting their std::process::Command/tar logic is a separate, riskier
change: mode::configure_*/deploy_k3s_*, ha::ensure_rqlite_binary, lib::run
startup. Flagging for a follow-up if zero blocking I/O during
reconfiguration is also wanted.
🤖 Generated with Claude Code