
Conversation

@hanouticelina
Contributor

This PR adds a new CLI command that checks cached files against their checksums on the Hub. It verifies all cached revisions of a repo, or a specific snapshot if a revision is provided.

Under the hood, it lists remote files for each revision using list_repo_tree, maps them to local snapshots, and compares the two sets to find files that are missing locally or on the Hub. Then, for each file present on both sides, it computes and compares checksums.
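For readers who want the shape of that flow in code, here is a rough, hedged sketch (not the PR's actual implementation: sketch_verify and its structure are illustrative, it assumes the snapshot directory is already resolved, and it assumes RepoFile.lfs exposes a sha256; list_repo_tree and sha_fileobj are real huggingface_hub APIs):

from pathlib import Path

from huggingface_hub import list_repo_tree
from huggingface_hub.hf_api import RepoFile
from huggingface_hub.utils.sha import sha_fileobj


def sketch_verify(repo_id: str, snapshot_dir: Path, revision: str = "main") -> None:
    # 1. List remote files for the revision, indexed by path (folders skipped).
    remote = {
        entry.path: entry
        for entry in list_repo_tree(repo_id, revision=revision, recursive=True)
        if isinstance(entry, RepoFile)
    }
    # 2. Index local snapshot files by path relative to the snapshot root.
    local = {
        p.relative_to(snapshot_dir).as_posix(): p
        for p in snapshot_dir.rglob("*")
        if p.is_file()
    }
    # 3. Compare the sets: files missing locally vs. extra files not on the Hub.
    missing = sorted(remote.keys() - local.keys())
    extra = sorted(local.keys() - remote.keys())
    # 4. Compare checksums for files present on both sides. LFS entries expose
    #    a sha256; regular entries only have a git blob id (omitted here).
    for rel_path in sorted(remote.keys() & local.keys()):
        entry = remote[rel_path]
        if entry.lfs is None:
            continue  # git blob hashing left out of this sketch
        with local[rel_path].open("rb") as stream:
            actual = sha_fileobj(stream).hex()
        if actual != entry.lfs.sha256:
            print(f"checksum mismatch: {rel_path}")
    print(f"{len(missing)} file(s) missing locally, {len(extra)} extra local file(s)")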

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin
Contributor

Wauplin commented Oct 22, 2025

Hey! Thanks for opening this PR. Here are some high-level thoughts about this feature:

  • 💯% agree that the purpose of the command is to compute file checksums
  • as requested in #3298 (command for verifying local files), I would make the command compatible with local directories as well (not necessarily the cache). It is a bit counter-intuitive given the naming hf cache verify, but it's still fine IMO. Another possibility would be to have hf verify directly, but it's less self-explanatory.
  • I don't think we should scan the entire cache only to verify one repo and/or one revision. Scanning the cache is a heavy task (i.e. listing all files from all revisions of all repos) and most of that work is useless if we target only one repo
  • I think it's fine to assume we want to be able to target a single folder per command execution. This makes the CLI much easier to extend with the "generic" arguments like --repo-type, --revision, --local-dir, etc. existing in the hf download command.
  • I don't think the command should fail on missing files. It's quite common for someone to download only a subset of a repo, in which case the verify command should not fail if the downloaded files are valid. Same for files that are present locally but not on the remote. So having optional flags like --fail-on-missing-files and --fail-on-extra-files makes sense IMO (see the sketch after the CLI examples below).
    • without these flags, I'd say it's ok to print a warning on missing/extra files with a message like "12 local files do not exist on the remote repo. Use --fail-on-extra-files for more details."

In the end, the CLI I suggest would look like this:

hf cache verify <repo-id> [--repo-type ...] [--revision ...] [--cache-dir ...] [--token ...] [--local-dir ...] [--fail-on-missing-files] [--fail-on-extra-files]

# Verify main revision of "deepseek-ai/DeepSeek-OCR" in cache
hf cache verify deepseek-ai/DeepSeek-OCR

# Verify specific revision
hf cache verify deepseek-ai/DeepSeek-OCR --revision refs/pr/1
hf cache verify deepseek-ai/DeepSeek-OCR --revision abcdef123

# Verify using private repo
hf cache verify me/private-model --token ...

# Verify dataset
hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset

# Verify local dir
hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo
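
A minimal sketch of the suggested warn-vs-fail behavior (hypothetical: only the flag names come from this comment; the function and messages are illustrative):

def report_set_differences(
    missing: list[str], extra: list[str], *, fail_on_missing: bool, fail_on_extra: bool
) -> int:
    # Hypothetical helper: warn by default, fail only when the flags are set.
    if missing:
        msg = f"{len(missing)} remote files are missing locally."
        if fail_on_missing:
            print(f"ERROR: {msg}")
            return 1
        print(f"WARNING: {msg} Pass --fail-on-missing-files to make this an error.")
    if extra:
        msg = f"{len(extra)} local files do not exist on the remote repo."
        if fail_on_extra:
            print(f"ERROR: {msg}")
            return 1
        print(f"WARNING: {msg} Pass --fail-on-extra-files to make this an error.")
    return 0  # checksum results for common files still decide the final exit code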

Let me know what you think. I might not have thought of all possible use cases, so happy to get it challenged ^^

Base automatically changed from v1.0-release to main October 23, 2025 12:48
@hanouticelina
Contributor Author

agh the commit history is messed up since we merged v1.0-release into main. fixing it now!

@hanouticelina hanouticelina marked this pull request as ready for review October 24, 2025 15:34
@hanouticelina hanouticelina requested a review from Wauplin October 24, 2025 15:34
Contributor

@Wauplin Wauplin left a comment

(haven't reviewed the tests)

Comment on lines 149 to 158
except OSError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"io-error:{exc}", algorithm="io")
    )
    continue
except ValueError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"meta-error:{exc}", algorithm="meta")
    )
    continue
Contributor

I don't understand this part 😕 Shouldn't "algorithm" be git-hash and sha256? Also, why could an OSError or ValueError be raised, since compute_file_hash does not raise anything?

Contributor Author

yes, the algorithm is either git-sha1 or sha256. i added a catch for the OSError because compute_file_hash opens the local file and reads it, which can fail with one of the OSError subclasses. we indeed know that the file exists in advance, but by the time compute_file_hash opens the path, the file could have been deleted, replaced, or had its permissions changed. a bit of an edge case maybe?
and yes, no need for a ValueError catch (i thought we were accessing some optional field of the remote_entry object).
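
A minimal, self-contained illustration of the race being described: the file exists when the file list is built, but is gone by the time it is opened for hashing, so open() raises FileNotFoundError, an OSError subclass (the deletion is simulated here; the file name is arbitrary):

import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.safetensors"
    path.write_bytes(b"weights")
    assert path.is_file()  # the existence check passes...
    path.unlink()          # ...then another process removes the file
    try:
        with path.open("rb") as stream:  # what compute_file_hash does next
            stream.read()
    except OSError as exc:
        print(f"io-error:{exc}")  # io-error:[Errno 2] No such file or directory: ...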

@hanouticelina hanouticelina requested a review from Wauplin October 29, 2025 17:28
@hanouticelina
Contributor Author

thanks @Wauplin for the very thorough review! I addressed all your comments and refactored the logic a bit

Contributor

@Wauplin Wauplin left a comment

Thanks for the iteration! This time I've checked the tests which look great 🤗

Left a last round of comments but overall looks good :)

Comment on lines +131 to +135
except OSError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"io-error:{exc}", algorithm="io")
    )
    continue
Contributor

I would avoid handling this use case. If it really happens a lot, we might reintroduce a try-except, but for now I think we can safely assume that someone running hf cache verify won't be modifying the files at the same time.

Comment on lines +3151 to +3155
    verification = verify_maps(
        remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision
    )

    return replace(verification, verified_path=root)
Contributor

Suggested change
-    verification = verify_maps(
-        remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision
-    )
-    return replace(verification, verified_path=root)
+    return verify_maps(
+        verified_path=root,
+        remote_by_path=remote_by_path,
+        local_by_path=local_by_path,
+        revision=remote_revision,
+    )

I feel it's cleaner if we pass the verified_path to verify_maps directly so the FolderVerification dataclass is directly instantiated with the correct values. The issue with the current implementation is that FolderVerification.verified_path is typed as Optional[Path] while in reality it shouldn't be optional (internally it's currently optional only to make type annotations happy).

    mismatches: list[Mismatch]
    missing_paths: list[str]
    extra_paths: list[str]
    verified_path: Optional[Path] = None
Contributor

^here is what I meant above

Suggested change
-    verified_path: Optional[Path] = None
+    verified_path: Path

)


def compute_file_hash(path: Path, algorithm: HashAlgo, *, git_hash_cache: dict[Path, str]) -> str:
Contributor

Suggested change
-def compute_file_hash(path: Path, algorithm: HashAlgo, *, git_hash_cache: dict[Path, str]) -> str:
+def compute_file_hash(path: Path, algorithm: HashAlgo) -> str:

I feel that the git_hash_cache is never used in practice since the key is the path and we iterate over unique paths. So better to simplify the logic by not having a cache at all.

Contributor

If the cache is relevant, shouldn't we use it for both git-hash and sha256?

Comment on lines +84 to +97
    if algorithm == "sha256":
        with path.open("rb") as stream:
            return sha_fileobj(stream).hex()

    if algorithm == "git-sha1":
        try:
            return git_hash_cache[path]
        except KeyError:
            with path.open("rb") as stream:
                digest = git_hash(stream.read())
            git_hash_cache[path] = digest
            return digest

    raise ValueError(f"Unsupported hash algorithm: {algorithm}")
Contributor

Suggested change
-    if algorithm == "sha256":
-        with path.open("rb") as stream:
-            return sha_fileobj(stream).hex()
-    if algorithm == "git-sha1":
-        try:
-            return git_hash_cache[path]
-        except KeyError:
-            with path.open("rb") as stream:
-                digest = git_hash(stream.read())
-            git_hash_cache[path] = digest
-            return digest
-    raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+    with path.open("rb") as stream:
+        if algorithm == "sha256":
+            return sha_fileobj(stream).hex()
+        if algorithm == "git-sha1":
+            return git_hash(stream.read())
+    raise ValueError(f"Unsupported hash algorithm: {algorithm}")

^ I think this is how the logic could be simplified without a cache.
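
Put together, the simplified helper and a quick self-check could look like this (a sketch: HashAlgo is assumed to be a Literal alias mirroring the PR, and git_hash/sha_fileobj are the same helpers the PR already uses, assumed importable from huggingface_hub.utils.sha):

import tempfile
from pathlib import Path
from typing import Literal

from huggingface_hub.utils.sha import git_hash, sha_fileobj

HashAlgo = Literal["sha256", "git-sha1"]  # assumed alias, mirroring the PR's usage


def compute_file_hash(path: Path, algorithm: HashAlgo) -> str:
    # Single open() shared by both algorithms, as suggested above.
    with path.open("rb") as stream:
        if algorithm == "sha256":
            return sha_fileobj(stream).hex()  # streamed sha256 (LFS files)
        if algorithm == "git-sha1":
            return git_hash(stream.read())  # git blob hash (regular files)
    raise ValueError(f"Unsupported hash algorithm: {algorithm}")


# Quick check on a throwaway file:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
path = Path(tmp.name)
print(compute_file_hash(path, "sha256"))    # 2cf24db...
print(compute_file_hash(path, "git-sha1"))  # b6fc4c6...
path.unlink()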
