
fix: resume downloads across HuggingFace commit hash changes#1536

Open
ianbmacdonald wants to merge 9 commits into lemonade-sdk:main from ianbmacdonald:fix/download-resume-across-commits

Conversation


@ianbmacdonald ianbmacdonald commented Apr 4, 2026

Summary

  • Before creating a new snapshot directory, checks for existing in-progress downloads (.download_manifest.json + .partial files) in other snapshot directories under the same model cache path.
  • If multiple orphaned snapshots exist, picks the one with the most progress (largest total partial bytes) and cleans up the rest.
  • Fixes the manifest's download_path to point to the actual snapshot being resumed.
  • On completion, renames the old snapshot directory to the latest commit hash so the cache stays current.

Problem

When a model download is paused and resumed, the HuggingFace API returns the latest commit SHA, which may differ from the SHA in effect when the download started (e.g., unsloth pushes frequently). The code creates a new snapshot directory for each commit hash, orphaning the old .partial files and restarting the download from zero. On a fast-moving repo, every pause/resume cycle wastes the entire download so far.

In testing, 3 pause/resume cycles on Gemma 4 (16.8 GB) created 3 orphaned snapshot directories with 8.5 GB, 4.5 GB, and 4.2 GB of wasted partial downloads.

Test plan

  • Verified resume finds existing 8.5 GB partial in old snapshot and resumes from it
  • Verified snapshot renamed from old commit hash to new commit hash on completion
  • Verified no orphaned directories or partial files remain after completion
  • Verified model registers with correct path in the new snapshot directory
  • Verified fresh downloads (no existing partials) work normally
  • Verified with multiple orphaned snapshots: picks largest partial (1 GB over 100 MB), cleans up smaller orphan

🤖 Generated with Claude Code

@ianbmacdonald ianbmacdonald marked this pull request as ready for review April 4, 2026 04:13
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from f788eb2 to c895f54 on April 5, 2026 14:19
@ianbmacdonald
Collaborator Author

Added some testing and addressed some agent review feedback.

1921182 makes orphan detection and directory-based download-status checks recursive for .partial files, which is important for nested paths under a snapshot.

226cfb8 finishes that off by updating the parallel add_model_to_cache() path and fixing the nested-partial regression test so the manifest filename actually matches the nested partial path being resumed.

The new tests use a real HF-backed /pull, but they do not depend on a live upstream repo changing its commit hash during the test window. Instead they seed orphaned snapshot state locally and verify the selection/cleanup behavior from there. The main cost is extra runtime and network traffic.

One integration note for later: if this branch is rebased onto or merged after #1412, the orphan-resume logic in download_from_huggingface() will need a follow-up pass for #1412's per-file download_path manifest format and multi-repo snapshot layout. The recursive .partial scan changes should carry over directly, but the resume/rename path handling is still written around a single main-repo snapshot to stay compatible with current main.

@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from 226cfb8 to e19e7b1 on April 8, 2026 03:29
ianbmacdonald and others added 4 commits April 8, 2026 17:33
When a model download is paused and resumed, the HF API may return a
different commit SHA if the upstream repo received new commits. This
caused the download to create a new snapshot directory, orphaning the
existing .partial files and restarting from zero.

Before creating a new snapshot, check for any existing in-progress
downloads (manifest + .partial files) in other snapshot directories.
If found, resume from the existing partial files, then rename the
snapshot to the latest commit hash on completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a model download is paused and resumed, the HF API may return a
different commit SHA if the upstream repo received new commits. This
caused the download to create a new snapshot directory, orphaning the
existing .partial files and restarting from zero.

Before creating a new snapshot, check for any existing in-progress
downloads (manifest + .partial files) in other snapshot directories.
If multiple orphans exist, pick the one with the most progress (largest
total partial bytes), clean up the rest, and resume. On completion,
rename the snapshot to the latest commit hash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use recursive_directory_iterator instead of directory_iterator in both
the orphan snapshot scanner and the download status check. Without this,
nested .partial files (e.g. split_files/vae/model.safetensors.partial)
are invisible — orphans with only nested partials are ignored during
resume, and models with nested partials falsely report downloaded=true.

Add integration tests that plant orphaned snapshots with manifests and
partial files, then pull via the real HF endpoint to verify:
- Best orphan selection: largest partial bytes wins, smaller orphans
  are cleaned up
- Nested-path detection: orphan with only subdirectory partials is
  found and reused instead of creating a duplicate snapshot

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix nested-path test: manifest "name" now matches the nested partial
  path so download_from_manifest() actually resumes it instead of
  downloading fresh and leaving the orphaned partial behind.
- Fix add_model_to_cache() to use recursive_directory_iterator (third
  instance of the same non-recursive scan bug, missed in prior commit).
- Add manifest_filename parameter to _make_orphan_snapshot helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from e19e7b1 to 7a39854 on April 8, 2026 21:34
ianbmacdonald and others added 5 commits April 8, 2026 17:42
… models

The orphan-resume logic overwrote the top-level download_path in the
manifest, which also clobbered per-file download_path entries for
non-main repos (text_encoder, vae). Now only updates entries that
match the old main snapshot path, leaving non-main repo paths intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback on the orphan-resume-across-commits logic:

- Count completed file sizes + partial sizes (not just partials) when
  picking the best orphaned snapshot to resume from
- Move individual files into target snapshot dir instead of renaming
  the directory, avoiding collision with an existing current-hash snapshot
- Fall through to normal download path after resume so the fresh HF API
  manifest catches any upstream file set changes
- Add test_030_pull_resumes_orphaned_snapshot integration test that seeds
  an orphaned snapshot with a manifest and partial stubs, then verifies
  pull resumes and completes successfully

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Remove download_from_manifest(existing_manifest) call from orphan-resume
  path entirely. The stale manifest's URLs or file list may not match
  upstream. Instead, just relocate salvageable files and fall through to
  the normal path which builds a fresh manifest from the current HF API.

- Add size verification in download_from_manifest: when a completed file
  exists on disk, compare its size against the manifest. If sizes differ
  (upstream updated the file between commits), remove and re-download it.

- Fix test to read refs/main for authoritative snapshot hash instead of
  picking an arbitrary directory entry.

- Add test_031_pull_resumes_orphaned_snapshot_multi_repo covering the
  multi-repo download_path handling with per-repo cache directories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the current-hash snapshot already exists during orphan relocation,
apply a comparison-aware merge policy:
- Keep a complete dest file over any orphan copy
- When both are .partial, keep whichever is larger (more progress)
- Only move orphan files that don't exist or are smaller at dest

Add test_032_pull_orphan_collision_with_existing_snapshot that seeds both
an orphaned snapshot and a partial current-hash snapshot, then verifies
the merge policy preserves the better data from each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /delete endpoint wipes the entire models--org--repo directory for
non-shared repos, destroying the seeded orphan snapshots before pull
can find them. Since the pull endpoint always runs
download_from_huggingface (do_not_upgrade defaults to false), the
tests can just manipulate disk state and re-pull without needing
/delete. This ensures:

- test_030: orphan snapshot survives to be found by the scanner
- test_031: no dependency on test ordering for repo sharing
- test_032: both current-hash and orphan snapshots coexist on disk
  so the collision merge path is actually exercised

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald
Collaborator Author

Now that #1412 has merged and this branch is rebased onto it, the integration notes from the earlier comment have been addressed across several commits:

Multi-repo manifest support (e8e71cf, 073520e):

  • Orphan-resume path now preserves per-file download_path entries for non-main repos (text_encoder, vae, etc.) — only updates entries that match the old main snapshot path
  • Orphan detection heuristic counts completed file sizes + partial sizes (not just partials) using manifest file entries with their per-repo paths

Stale manifest and content verification (24ae02e):

  • Removed download_from_manifest(existing_manifest) call from orphan-resume entirely — the stale manifest's URLs or file list may not match upstream after a commit change
  • Instead, just relocate salvageable files (completed + partials) and fall through to the normal download path, which builds a fresh manifest from the current HF API response
  • Added size verification in download_from_manifest: when a completed file exists on disk, its size is compared against the manifest and re-downloaded if mismatched

Snapshot collision handling (a282ba7):

  • Merge policy when relocating into an existing current-hash snapshot: keep complete files over partials, keep the larger .partial when both exist, only move what improves on dest

Test coverage (073520e, a282ba7, 70c58d0):

  • test_030: single-repo orphan-resume using refs/main for authoritative hash
  • test_031: multi-repo orphan-resume with per-file download_path entries
  • test_032: collision path where both orphan and current-hash snapshots coexist
  • Tests manipulate disk state and re-pull directly (no /delete calls, which would wipe the seeded orphan state)
