
fix: resume downloads across HuggingFace commit hash changes#1536

Open
ianbmacdonald wants to merge 9 commits into lemonade-sdk:main from ianbmacdonald:fix/download-resume-across-commits

Conversation


@ianbmacdonald ianbmacdonald commented Apr 4, 2026

Summary

  • Before creating a new snapshot directory, checks for existing in-progress downloads (.download_manifest.json + .partial files) in other snapshot directories under the same model cache path.
  • If multiple orphaned snapshots exist, picks the one with the most progress (largest total partial bytes) and cleans up the rest.
  • Fixes the manifest's download_path to point to the actual snapshot being resumed.
  • On completion, renames the old snapshot directory to the latest commit hash so the cache stays current.

Problem

When a model download is paused and resumed, the HuggingFace API returns the latest commit SHA, which may differ from the SHA in effect when the download started (e.g., unsloth pushes frequently). The code creates a new snapshot directory for each commit hash, orphaning the old .partial files and restarting the download from zero. On a fast-moving repo, every pause/resume cycle wastes the entire download so far.

In testing, 3 pause/resume cycles on Gemma 4 (16.8 GB) created 3 orphaned snapshot directories with 8.5 GB, 4.5 GB, and 4.2 GB of wasted partial downloads.

Test plan

  • Verified resume finds existing 8.5 GB partial in old snapshot and resumes from it
  • Verified snapshot renamed from old commit hash to new commit hash on completion
  • Verified no orphaned directories or partial files remain after completion
  • Verified model registers with correct path in the new snapshot directory
  • Verified fresh downloads (no existing partials) work normally
  • Verified with multiple orphaned snapshots: picks largest partial (1 GB over 100 MB), cleans up smaller orphan

🤖 Generated with Claude Code

@ianbmacdonald ianbmacdonald marked this pull request as ready for review April 4, 2026 04:13
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from f788eb2 to c895f54 on April 5, 2026 14:19
@ianbmacdonald
Collaborator Author

Added some testing and addressed some agent review feedback.

1921182 makes orphan detection and directory-based download-status checks recursive for .partial files, which is important for nested paths under a snapshot.

226cfb8 finishes that off by updating the parallel add_model_to_cache() path and fixing the nested-partial regression test so the manifest filename actually matches the nested partial path being resumed.

The new tests use a real HF-backed /pull, but they do not depend on a live upstream repo changing its commit hash during the test window. Instead they seed orphaned snapshot state locally and verify the selection/cleanup behavior from there. The main cost is extra runtime and network traffic.

One integration note for later: if this branch is rebased onto or merged after #1412, the orphan-resume logic in download_from_huggingface() will need a follow-up pass for #1412's per-file download_path manifest format and multi-repo snapshot layout. The recursive .partial scan changes should carry over directly, but the resume/rename path handling is still written around a single main-repo snapshot to stay compatible with current main.

@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from 226cfb8 to e19e7b1 on April 8, 2026 03:29
ianbmacdonald and others added 4 commits April 8, 2026 17:33
When a model download is paused and resumed, the HF API may return a
different commit SHA if the upstream repo received new commits. This
caused the download to create a new snapshot directory, orphaning the
existing .partial files and restarting from zero.

Before creating a new snapshot, check for any existing in-progress
downloads (manifest + .partial files) in other snapshot directories.
If found, resume from the existing partial files, then rename the
snapshot to the latest commit hash on completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a model download is paused and resumed, the HF API may return a
different commit SHA if the upstream repo received new commits. This
caused the download to create a new snapshot directory, orphaning the
existing .partial files and restarting from zero.

Before creating a new snapshot, check for any existing in-progress
downloads (manifest + .partial files) in other snapshot directories.
If multiple orphans exist, pick the one with the most progress (largest
total partial bytes), clean up the rest, and resume. On completion,
rename the snapshot to the latest commit hash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use recursive_directory_iterator instead of directory_iterator in both
the orphan snapshot scanner and the download status check. Without this,
nested .partial files (e.g. split_files/vae/model.safetensors.partial)
are invisible — orphans with only nested partials are ignored during
resume, and models with nested partials falsely report downloaded=true.

Add integration tests that plant orphaned snapshots with manifests and
partial files, then pull via the real HF endpoint to verify:
- Best orphan selection: largest partial bytes wins, smaller orphans
  are cleaned up
- Nested-path detection: orphan with only subdirectory partials is
  found and reused instead of creating a duplicate snapshot

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix nested-path test: manifest "name" now matches the nested partial
  path so download_from_manifest() actually resumes it instead of
  downloading fresh and leaving the orphaned partial behind.
- Fix add_model_to_cache() to use recursive_directory_iterator (third
  instance of the same non-recursive scan bug, missed in prior commit).
- Add manifest_filename parameter to _make_orphan_snapshot helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from e19e7b1 to 7a39854 on April 8, 2026 21:34
ianbmacdonald and others added 5 commits April 8, 2026 17:42
… models

The orphan-resume logic overwrote the top-level download_path in the
manifest, which also clobbered per-file download_path entries for
non-main repos (text_encoder, vae). Now only updates entries that
match the old main snapshot path, leaving non-main repo paths intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback on the orphan-resume-across-commits logic:

- Count completed file sizes + partial sizes (not just partials) when
  picking the best orphaned snapshot to resume from
- Move individual files into target snapshot dir instead of renaming
  the directory, avoiding collision with an existing current-hash snapshot
- Fall through to normal download path after resume so the fresh HF API
  manifest catches any upstream file set changes
- Add test_030_pull_resumes_orphaned_snapshot integration test that seeds
  an orphaned snapshot with a manifest and partial stubs, then verifies
  pull resumes and completes successfully

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Remove download_from_manifest(existing_manifest) call from orphan-resume
  path entirely. The stale manifest's URLs or file list may not match
  upstream. Instead, just relocate salvageable files and fall through to
  the normal path which builds a fresh manifest from the current HF API.

- Add size verification in download_from_manifest: when a completed file
  exists on disk, compare its size against the manifest. If sizes differ
  (upstream updated the file between commits), remove and re-download it.

- Fix test to read refs/main for authoritative snapshot hash instead of
  picking an arbitrary directory entry.

- Add test_031_pull_resumes_orphaned_snapshot_multi_repo covering the
  multi-repo download_path handling with per-repo cache directories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the current-hash snapshot already exists during orphan relocation,
apply a comparison-aware merge policy:
- Keep a complete dest file over any orphan copy
- When both are .partial, keep whichever is larger (more progress)
- Only move orphan files that don't exist or are smaller at dest

Add test_032_pull_orphan_collision_with_existing_snapshot that seeds both
an orphaned snapshot and a partial current-hash snapshot, then verifies
the merge policy preserves the better data from each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /delete endpoint wipes the entire models--org--repo directory for
non-shared repos, destroying the seeded orphan snapshots before pull
can find them. Since the pull endpoint always runs
download_from_huggingface (do_not_upgrade defaults to false), the
tests can just manipulate disk state and re-pull without needing
/delete. This ensures:

- test_030: orphan snapshot survives to be found by the scanner
- test_031: no dependency on test ordering for repo sharing
- test_032: both current-hash and orphan snapshots coexist on disk
  so the collision merge path is actually exercised

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald
Collaborator Author

Now that #1412 has merged and this branch is rebased onto it, the integration notes from the earlier comment have been addressed across several commits:

Multi-repo manifest support (e8e71cf, 073520e):

  • Orphan-resume path now preserves per-file download_path entries for non-main repos (text_encoder, vae, etc.) — only updates entries that match the old main snapshot path
  • Orphan detection heuristic counts completed file sizes + partial sizes (not just partials) using manifest file entries with their per-repo paths

Stale manifest and content verification (24ae02e):

  • Removed download_from_manifest(existing_manifest) call from orphan-resume entirely — the stale manifest's URLs or file list may not match upstream after a commit change
  • Instead, just relocate salvageable files (completed + partials) and fall through to the normal download path, which builds a fresh manifest from the current HF API response
  • Added size verification in download_from_manifest: when a completed file exists on disk, its size is compared against the manifest and re-downloaded if mismatched

Snapshot collision handling (a282ba7):

  • Merge policy when relocating into an existing current-hash snapshot: keep complete files over partials, keep the larger .partial when both exist, only move what improves on dest

Test coverage (073520e, a282ba7, 70c58d0):

  • test_030: single-repo orphan-resume using refs/main for authoritative hash
  • test_031: multi-repo orphan-resume with per-file download_path entries
  • test_032: collision path where both orphan and current-hash snapshots coexist
  • Tests manipulate disk state and re-pull directly (no /delete calls, which would wipe the seeded orphan state)
