Skip to content

fix: fail fast on insufficient disk space instead of retry loop#1534

Open
ianbmacdonald wants to merge 4 commits intolemonade-sdk:mainfrom
ianbmacdonald:fix/disk-full-download-retry
Open

fix: fail fast on insufficient disk space instead of retry loop#1534
ianbmacdonald wants to merge 4 commits intolemonade-sdk:mainfrom
ianbmacdonald:fix/disk-full-download-retry

Conversation

@ianbmacdonald
Copy link
Copy Markdown
Collaborator

@ianbmacdonald ianbmacdonald commented Apr 4, 2026

Summary

  • Adds a pre-flight disk space check before downloading model files, comparing available space against the total download size. Fails immediately with a clear error message (e.g., "download requires 16.8 GB but only 0.0 GB is available") instead of starting the download.
  • Detects CURLE_WRITE_ERROR (code 23) + low available disk as a fatal non-retryable condition, preventing the retry loop from deleting the partial file and re-downloading from scratch repeatedly.
  • Adds a disk_full flag to DownloadResult to short-circuit the retry loop.

Problem

When a large model download (e.g., 16.8 GB Gemma 4) fills the disk, curl returns error 23 (CURLE_WRITE_ERROR). The existing retry logic treated this as a non-resumable error: it deleted the ~13 GB partial file and retried from zero — hitting the same wall each time in an infinite loop, burning bandwidth and I/O. In a session today, one host was redownloading a 16GB Gemma4 model at 1Gbps for an hour, as I walked away after the download started, and it just looped .. 2.5min per 16GB a LOT of times .. copied a lemonade config with the wrong HF_HOME from another device into lemonade's conf.d

Test plan

  • Tested on a machine with insufficient disk space — error returned instantly with clear message
  • Verified the error surfaces correctly in both the Download Manager panel and toast notification
  • Verify normal downloads still work on machines with sufficient space
  • Verify resume behavior is unaffected for network interruptions

🤖 Looked a the diffs; 100% vibed with Claude including deploy and testing using debian package on Debian 13

@ianbmacdonald ianbmacdonald marked this pull request as ready for review April 4, 2026 03:47
@ianbmacdonald ianbmacdonald force-pushed the fix/disk-full-download-retry branch from 77567fc to 656eccb Compare April 5, 2026 14:18
@ianbmacdonald
Copy link
Copy Markdown
Collaborator Author

The latest changes address the main correctness gaps in the pre-flight disk check:

50a92a1 now subtracts bytes already on disk from the required space calculation, so resumed downloads are checked against remaining bytes rather than total model size.
12b51e5 guards the remaining-bytes calculation against size_t underflow and caps partial-file credit to the manifest size, which matters when manifest entries have size=0 or incomplete size metadata.

@ianbmacdonald
Copy link
Copy Markdown
Collaborator Author

Did not really come up with a way to test this in CI synthetically. It was tested in the real scenario which prompted the fix. Agents suggested it might be done with a helper, wrapped with a free-space query, with synthetic manifests and don-disk byte counts, but seems like overkill to try and fake a clearly signalled scenario in a bunch of different CI environments.

One integration note: if this branch is later rebased onto or folded into #1412, the pre-flight accounting in download_from_manifest() will need one small adaptation. #1412 writes per-file download_path values into the manifest for multi-repo downloads, so the disk-space pre-check needs to mirror the real download loop and use:

file_desc.value("download_path", download_path)

when computing completed-file and .partial credit, rather than assuming everything lives under the manifest’s top-level download_path. Otherwise it will undercount bytes already present in secondary repo snapshots.

@jeremyfowers
Copy link
Copy Markdown
Member

@claude review. Will this work on Windows, macOS, and Linux? Are there any drawbacks to merging this?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@ianbmacdonald ianbmacdonald force-pushed the fix/disk-full-download-retry branch from 12b51e5 to baa0e6a Compare April 8, 2026 03:29
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

ianbmacdonald and others added 3 commits April 8, 2026 17:33
When a download fails with CURLE_WRITE_ERROR (code 23) due to disk
full, the retry logic would delete the partial file and restart from
zero — repeating indefinitely and wasting bandwidth.

This adds:
- Pre-flight disk space check before downloading, comparing available
  space against total download size
- Detection of CURLE_WRITE_ERROR + low disk as a fatal (non-retryable)
  condition with a clear error message
- disk_full flag on DownloadResult to short-circuit the retry loop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pre-flight check compared the full manifest size against available
space, ignoring completed files and .partial resume data already on disk.
This caused false "Insufficient disk space" errors when resuming after
freeing just enough space for the remaining bytes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cap per-file credit to manifest size and clamp the subtraction to zero.
Prevents unsigned underflow when manifest contains size=0 entries but
partial files exist on disk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald force-pushed the fix/disk-full-download-retry branch from baa0e6a to a066920 Compare April 8, 2026 21:34
The pre-flight disk space check assumed all files live under the
manifest's top-level download_path. With lemonade-sdk#1412's multi-repo support,
each file entry can have its own download_path pointing to a
repo-specific snapshot directory. Mirror the actual download loop by
using file_desc.value("download_path", download_path) so completed
and partial files in secondary repo snapshots are correctly credited.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald
Copy link
Copy Markdown
Collaborator Author

Now that #1412 has merged, the integration note from the earlier comment has been addressed in 5e1e827.

The pre-flight disk space check in download_from_manifest() was using the manifest's top-level download_path for all files, which undercounts bytes already on disk for multi-repo models where secondary checkpoints (text_encoder, vae, etc.) live in their own repo-specific snapshot directories.

The fix mirrors the actual download loop by using file_desc.value("download_path", download_path) so completed and partial files in secondary repo snapshots are correctly credited against the space requirement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants