fix: resume downloads across HuggingFace commit hash changes #1536
ianbmacdonald wants to merge 9 commits into lemonade-sdk:main
Conversation
f788eb2 to c895f54
Added some testing and addressed some agent review feedback.

1921182 makes orphan detection and directory-based download-status checks recursive for `.partial` files, which is important for nested paths under a snapshot. 226cfb8 finishes that off by updating the parallel `add_model_to_cache()` path and fixing the nested-partial regression test so the manifest filename actually matches the nested partial path being resumed.

The new test shape uses a real HF-backed `/pull`, but it does not depend on a live upstream repo changing commit hash during the test window. Instead it seeds orphaned snapshot state locally and verifies selection/cleanup behavior from there. The main cost is extra runtime/network.

One integration note for later: if this branch is rebased onto or merged after #1412, the orphan-resume logic in `download_from_huggingface()` will need a follow-up pass for #1412's per-file `download_path` manifest format and multi-repo snapshot layout. The recursive `.partial` scan changes should carry over directly, but the resume/rename path handling is still written around a single main-repo snapshot to stay compatible with current main.
226cfb8 to e19e7b1
When a model download is paused and resumed, the HF API may return a different commit SHA if the upstream repo received new commits. This caused the download to create a new snapshot directory, orphaning the existing `.partial` files and restarting from zero.

Before creating a new snapshot, check for any existing in-progress downloads (manifest + `.partial` files) in other snapshot directories. If found, resume from the existing partial files, then rename the snapshot to the latest commit hash on completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a model download is paused and resumed, the HF API may return a different commit SHA if the upstream repo received new commits. This caused the download to create a new snapshot directory, orphaning the existing `.partial` files and restarting from zero.

Before creating a new snapshot, check for any existing in-progress downloads (manifest + `.partial` files) in other snapshot directories. If multiple orphans exist, pick the one with the most progress (largest total partial bytes), clean up the rest, and resume. On completion, rename the snapshot to the latest commit hash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use `recursive_directory_iterator` instead of `directory_iterator` in both the orphan snapshot scanner and the download status check. Without this, nested `.partial` files (e.g. `split_files/vae/model.safetensors.partial`) are invisible: orphans with only nested partials are ignored during resume, and models with nested partials falsely report `downloaded=true`.

Add integration tests that plant orphaned snapshots with manifests and partial files, then pull via the real HF endpoint to verify:

- Best orphan selection: largest partial bytes wins, smaller orphans are cleaned up
- Nested-path detection: orphan with only subdirectory partials is found and reused instead of creating a duplicate snapshot

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix nested-path test: manifest "name" now matches the nested partial path so `download_from_manifest()` actually resumes it instead of downloading fresh and leaving the orphaned partial behind.
- Fix `add_model_to_cache()` to use `recursive_directory_iterator` (third instance of the same non-recursive scan bug, missed in the prior commit).
- Add a `manifest_filename` parameter to the `_make_orphan_snapshot` helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e19e7b1 to 7a39854
… models

The orphan-resume logic overwrote the top-level `download_path` in the manifest, which also clobbered per-file `download_path` entries for non-main repos (text_encoder, vae). Now only entries that match the old main snapshot path are updated, leaving non-main repo paths intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback on the orphan-resume-across-commits logic:

- Count completed file sizes plus partial sizes (not just partials) when picking the best orphaned snapshot to resume from
- Move individual files into the target snapshot dir instead of renaming the directory, avoiding a collision with an existing current-hash snapshot
- Fall through to the normal download path after resume so the fresh HF API manifest catches any upstream file-set changes
- Add a `test_030_pull_resumes_orphaned_snapshot` integration test that seeds an orphaned snapshot with a manifest and partial stubs, then verifies pull resumes and completes successfully

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Remove the `download_from_manifest(existing_manifest)` call from the orphan-resume path entirely. The stale manifest's URLs or file list may not match upstream. Instead, just relocate salvageable files and fall through to the normal path, which builds a fresh manifest from the current HF API.
- Add size verification in `download_from_manifest`: when a completed file exists on disk, compare its size against the manifest. If the sizes differ (upstream updated the file between commits), remove and re-download it.
- Fix the test to read `refs/main` for the authoritative snapshot hash instead of picking an arbitrary directory entry.
- Add `test_031_pull_resumes_orphaned_snapshot_multi_repo` covering the multi-repo `download_path` handling with per-repo cache directories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the current-hash snapshot already exists during orphan relocation, apply a comparison-aware merge policy:

- Keep a complete dest file over any orphan copy
- When both are `.partial`, keep whichever is larger (more progress)
- Only move orphan files that don't exist or are smaller at dest

Add `test_032_pull_orphan_collision_with_existing_snapshot`, which seeds both an orphaned snapshot and a partial current-hash snapshot, then verifies the merge policy preserves the better data from each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `/delete` endpoint wipes the entire `models--org--repo` directory for non-shared repos, destroying the seeded orphan snapshots before pull can find them. Since the pull endpoint always runs `download_from_huggingface` (`do_not_upgrade` defaults to false), the tests can just manipulate disk state and re-pull without needing `/delete`. This ensures:

- test_030: the orphan snapshot survives to be found by the scanner
- test_031: no dependency on test ordering for repo sharing
- test_032: both current-hash and orphan snapshots coexist on disk, so the collision merge path is actually exercised

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now that #1412 has merged and this branch is rebased onto it, the integration notes from the earlier comment have been addressed across several commits:

- Multi-repo manifest support (e8e71cf, 073520e)
- Stale manifest and content verification (24ae02e)
- Snapshot collision handling (a282ba7)
- Test coverage (073520e, a282ba7, 70c58d0)
Summary
Before creating a new snapshot, check for existing in-progress downloads (`.download_manifest.json` + `.partial` files) in other snapshot directories under the same model cache path, and update `download_path` to point to the actual snapshot being resumed.
Problem
When a model download is paused and resumed, the HuggingFace API returns the latest commit SHA, which may differ from when the download started (e.g., unsloth pushes frequently). The code creates a new snapshot directory for each commit hash, orphaning the old `.partial` files and restarting from zero. On a fast-moving repo, each pause/resume wastes the entire download.

In testing, 3 pause/resume cycles on Gemma 4 (16.8 GB) created 3 orphaned snapshot directories with 8.5 GB, 4.5 GB, and 4.2 GB of wasted partial downloads.
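For orientation, an illustrative (assumed, not taken from the repo) HF-style cache layout after one pause/resume across an upstream commit looks like:

```
models--unsloth--<repo>/
├── refs/main                          # now points at <new-sha>
└── snapshots/
    ├── <old-sha>/                     # orphaned: holds the .partial progress
    │   └── model-00001.safetensors.partial
    └── <new-sha>/                     # fresh snapshot, restarted from zero
```

The fix scans sibling snapshot directories like `<old-sha>` for partial progress before letting `<new-sha>` start from zero.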
Test plan
🤖 Generated with Claude Code