fix: fail fast on insufficient disk space instead of retry loop #1534
ianbmacdonald wants to merge 4 commits into lemonade-sdk:main
Conversation
Force-pushed from 77567fc to 656eccb
The latest changes address the main correctness gaps in the pre-flight disk check: 50a92a1 now subtracts bytes already on disk from the required space calculation, so resumed downloads are checked against remaining bytes rather than total model size.
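The remaining-bytes comparison can be sketched against `std::filesystem::space`. This is a minimal illustration of the idea, not the actual lemonade-sdk code; the function name and error message are assumptions:

```cpp
#include <cstdint>
#include <filesystem>
#include <stdexcept>
#include <string>

// Pre-flight check (illustrative): compare the bytes still needed, after
// crediting completed files and .partial data already on disk, against the
// free space on the filesystem that holds the download directory.
void preflight_disk_check(const std::filesystem::path& download_dir,
                          std::uint64_t remaining_bytes) {
    std::error_code ec;
    auto info = std::filesystem::space(download_dir, ec);
    if (ec) return;  // free-space query failed; let the download proceed
    if (info.available < remaining_bytes) {
        throw std::runtime_error(
            "Insufficient disk space: need " + std::to_string(remaining_bytes) +
            " bytes, only " + std::to_string(info.available) + " available");
    }
}
```

Querying `available` (space usable by an unprivileged process) rather than `free` matters on filesystems with reserved blocks.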
I did not come up with a good way to test this synthetically in CI; it was tested in the real scenario that prompted the fix. Agents suggested wrapping the free-space query in a helper and driving it with synthetic manifests and on-disk byte counts, but that seems like overkill for faking a clearly signalled scenario across a bunch of different CI environments.

One integration note: if this branch is later rebased onto or folded into #1412, the pre-flight accounting in download_from_manifest() will need one small adaptation. #1412 writes per-file download_path values into the manifest for multi-repo downloads, so the disk-space pre-check needs to mirror the real download loop and use file_desc.value("download_path", download_path) when computing completed-file and .partial credit, rather than assuming everything lives under the manifest's top-level download_path. Otherwise it will undercount bytes already present in secondary repo snapshots.
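The per-file fallback described above can be sketched as follows. The real code uses nlohmann::json's `value()`; this sketch substitutes a plain `std::map` stand-in so it is self-contained, and the names are illustrative:

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for one manifest file entry. In the real manifest
// this is a JSON object and the lookup is
// file_desc.value("download_path", download_path).
using FileDesc = std::map<std::string, std::string>;

// Mirror the download loop: prefer the per-file download_path written by
// the multi-repo support in #1412, falling back to the manifest-level one.
std::string effective_download_path(const FileDesc& file_desc,
                                    const std::string& manifest_download_path) {
    auto it = file_desc.find("download_path");
    return it != file_desc.end() ? it->second : manifest_download_path;
}
```

Using the same fallback in both the pre-check and the download loop is what keeps the completed-file and .partial credit from undercounting secondary repo snapshots.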
@claude review. Will this work on Windows, macOS, and Linux? Are there any drawbacks to merging this?
I'll analyze this and get back to you. |
Force-pushed from 12b51e5 to baa0e6a
When a download fails with CURLE_WRITE_ERROR (code 23) due to disk full, the retry logic would delete the partial file and restart from zero, repeating indefinitely and wasting bandwidth. This adds:

- Pre-flight disk space check before downloading, comparing available space against total download size
- Detection of CURLE_WRITE_ERROR + low disk as a fatal (non-retryable) condition with a clear error message
- disk_full flag on DownloadResult to short-circuit the retry loop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
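The fail-fast flow this commit describes can be sketched roughly as below. The `disk_full` flag comes from the commit message; the other fields and the retry wrapper are illustrative assumptions, not the exact lemonade-sdk code:

```cpp
#include <string>

// Sketch of the result type: disk_full marks a fatal condition that
// retrying cannot fix, so the loop must not delete the partial file
// and start over.
struct DownloadResult {
    bool success = false;
    bool disk_full = false;
    std::string error;
};

// Hypothetical retry wrapper: a disk_full result breaks out immediately
// instead of burning further attempts (and bandwidth).
template <typename DownloadFn>
DownloadResult download_with_retries(DownloadFn attempt, int max_retries) {
    DownloadResult result;
    for (int i = 0; i <= max_retries; ++i) {
        result = attempt();
        if (result.success || result.disk_full) break;  // fail fast
    }
    return result;
}
```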
The pre-flight check compared the full manifest size against available space, ignoring completed files and .partial resume data already on disk. This caused false "Insufficient disk space" errors when resuming after freeing just enough space for the remaining bytes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cap per-file credit to manifest size and clamp the subtraction to zero. Prevents unsigned underflow when manifest contains size=0 entries but partial files exist on disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
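The clamped accounting from this commit can be sketched as a small pure function (the name is illustrative): credit bytes already on disk against a file's manifest size, never more than the size itself, so the subtraction cannot underflow an unsigned type.

```cpp
#include <algorithm>
#include <cstdint>

// Per-file credit, clamped: if the manifest reports size=0 but a partial
// file exists on disk, the credit is capped at 0 and the subtraction
// stays in range instead of wrapping to a huge unsigned value.
std::uint64_t remaining_bytes(std::uint64_t manifest_size,
                              std::uint64_t bytes_on_disk) {
    std::uint64_t credit = std::min(bytes_on_disk, manifest_size);
    return manifest_size - credit;  // >= 0 by construction
}
```

Without the `std::min` cap, `manifest_size - bytes_on_disk` with `size=0` and a non-empty partial file would wrap around and make the pre-check demand absurd amounts of free space.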
Force-pushed from baa0e6a to a066920
The pre-flight disk space check assumed all files live under the manifest's top-level download_path. With lemonade-sdk#1412's multi-repo support, each file entry can have its own download_path pointing to a repo-specific snapshot directory. Mirror the actual download loop by using file_desc.value("download_path", download_path) so completed and partial files in secondary repo snapshots are correctly credited. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now that #1412 has merged, the integration note from the earlier comment has been addressed in 5e1e827. The pre-flight disk space check in download_from_manifest() now mirrors the actual download loop by using file_desc.value("download_path", download_path), so completed and partial files in secondary repo snapshots are correctly credited.

Summary
- Treats CURLE_WRITE_ERROR (code 23) + low available disk as a fatal, non-retryable condition, preventing the retry loop from deleting the partial file and re-downloading from scratch repeatedly.
- Adds a disk_full flag to DownloadResult to short-circuit the retry loop.

Problem
When a large model download (e.g., 16.8 GB Gemma 4) fills the disk, curl returns error 23 (CURLE_WRITE_ERROR). The existing retry logic treated this as a non-resumable error: it deleted the ~13 GB partial file and retried from zero, hitting the same wall each time in an infinite loop, burning bandwidth and I/O. In a session today, one host redownloaded a 16 GB Gemma 4 model at 1 Gbps for an hour: I walked away after the download started and it just looped, roughly 2.5 minutes per 16 GB, a LOT of times. The disk had filled because I had copied a lemonade config with the wrong HF_HOME from another device into lemonade's conf.d.

Test plan
🤖 Looked at the diffs; 100% vibed with Claude, including deploy and testing with the Debian package on Debian 13.