
Conversation

@hanouticelina
Contributor

This PR adds a new CLI command that checks cached files against their checksums on the Hub. It verifies all cached revisions of a repo, or a specific snapshot if a revision is provided.

Under the hood, it lists remote files for each revision using list_repo_tree, maps them to local snapshots, and compares the two sets to find files that are missing locally or on the Hub. Then, for each file present on both sides, it computes and compares checksums.
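For readers who want the shape of that flow in code, here is a rough, hedged sketch (not the PR's actual implementation: sketch_verify and its structure are illustrative, it assumes the snapshot directory is already resolved, and it assumes RepoFile.lfs exposes a sha256; list_repo_tree and sha_fileobj are real huggingface_hub APIs):

from pathlib import Path

from huggingface_hub import list_repo_tree
from huggingface_hub.hf_api import RepoFile
from huggingface_hub.utils.sha import sha_fileobj


def sketch_verify(repo_id: str, snapshot_dir: Path, revision: str = "main") -> None:
    # 1. List remote files for the revision, indexed by path (folders skipped).
    remote = {
        entry.path: entry
        for entry in list_repo_tree(repo_id, revision=revision, recursive=True)
        if isinstance(entry, RepoFile)
    }
    # 2. Index local snapshot files by path relative to the snapshot root.
    local = {
        p.relative_to(snapshot_dir).as_posix(): p
        for p in snapshot_dir.rglob("*")
        if p.is_file()
    }
    # 3. Compare the sets: files missing locally vs. extra files not on the Hub.
    missing = sorted(remote.keys() - local.keys())
    extra = sorted(local.keys() - remote.keys())
    # 4. Compare checksums for files present on both sides. LFS entries expose
    #    a sha256; regular entries only have a git blob id (omitted here).
    for rel_path in sorted(remote.keys() & local.keys()):
        entry = remote[rel_path]
        if entry.lfs is None:
            continue  # git blob hashing left out of this sketch
        with local[rel_path].open("rb") as stream:
            actual = sha_fileobj(stream).hex()
        if actual != entry.lfs.sha256:
            print(f"checksum mismatch: {rel_path}")
    print(f"{len(missing)} file(s) missing locally, {len(extra)} extra local file(s)")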

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin
Contributor

Wauplin commented Oct 22, 2025

Hey! Thanks for opening this PR. Here are some high-level thoughts about this feature:

  • 💯% agree that the purpose of the command is to compute file checksums
  • as requested in #3298 (command for verifying local files), I would make the command compatible with local directories as well (not necessarily the cache). It is a bit counter-intuitive given the naming hf cache verify, but it's still fine IMO. Another possibility would be to have hf verify directly, but it's less self-explanatory.
  • I don't think we should scan the entire cache only to verify one repo and/or one revision. Scanning the cache is a heavy task (i.e. listing all files from all revisions of all repos) and most of that work is useless if we target only one repo
  • I think it's fine to assume we want to be able to target a single folder per command execution. This makes the CLI much easier to extend with the "generic" arguments like --repo-type, --revision, --local-dir, etc. existing in the hf download command.
  • I don't think the command should fail on missing files. It's quite common for someone to download only a subset of a repo, in which case the verify command should not fail if the downloaded files are valid. Same for files that are present locally but not on the remote. So having optional flags like --fail-on-missing-files and --fail-on-extra-files makes sense IMO (see the sketch after the CLI examples below).
    • without these flags, I'd say it's ok to print a warning on missing/extra files with a message like "12 local files do not exist on the remote repo. Use --fail-on-extra-files for more details."

In the end, the CLI I suggest would look like this:

hf cache verify <repo-id> [--repo-type ...] [--revision ...] [--cache-dir ...] [--token ...] [--local-dir ...] [--fail-on-missing-files] [--fail-on-extra-files]

# Verify main revision of "deepseek-ai/DeepSeek-OCR" in cache
hf cache verify deepseek-ai/DeepSeek-OCR

# Verify specific revision
hf cache verify deepseek-ai/DeepSeek-OCR --revision refs/pr/1
hf cache verify deepseek-ai/DeepSeek-OCR --revision abcdef123

# Verify using private repo
hf cache verify me/private-model --token ...

# Verify dataset
hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset

# Verify local dir
hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo
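
A minimal sketch of the suggested warn-vs-fail behavior (hypothetical: only the flag names come from this comment; the function and messages are illustrative):

def report_set_differences(
    missing: list[str], extra: list[str], *, fail_on_missing: bool, fail_on_extra: bool
) -> int:
    # Hypothetical helper: warn by default, fail only when the flags are set.
    if missing:
        msg = f"{len(missing)} remote files are missing locally."
        if fail_on_missing:
            print(f"ERROR: {msg}")
            return 1
        print(f"WARNING: {msg} Pass --fail-on-missing-files to make this an error.")
    if extra:
        msg = f"{len(extra)} local files do not exist on the remote repo."
        if fail_on_extra:
            print(f"ERROR: {msg}")
            return 1
        print(f"WARNING: {msg} Pass --fail-on-extra-files to make this an error.")
    return 0  # checksum results for common files still decide the final exit code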

Let me know what you think. I might not have thought of all possible use cases, so happy to get it challenged ^^

Base automatically changed from v1.0-release to main October 23, 2025 12:48
@hanouticelina
Contributor Author

agh the commit history is messed up since we merged v1.0-release into main. fixing it now!

@hanouticelina hanouticelina marked this pull request as ready for review October 24, 2025 15:34
@hanouticelina hanouticelina requested a review from Wauplin October 24, 2025 15:34
Contributor

@Wauplin Wauplin left a comment

(haven't reviewed the tests)

Comment on lines 149 to 158
except OSError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"io-error:{exc}", algorithm="io")
    )
    continue
except ValueError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"meta-error:{exc}", algorithm="meta")
    )
    continue
Contributor

I don't understand this part 😕 Shouldn't "algorithm" be git-hash and sha256? Also, why could an OSError or ValueError be raised, since compute_file_hash does not raise anything?

Contributor Author

yes, the algorithm is either git-sha1 or sha256. i added a catch for the OSError because compute_file_hash opens the local file and reads it, which can fail with one of the OSError subclasses. we indeed know that the file exists in advance, but by the time compute_file_hash opens the path, the file could have been deleted, replaced, or had its permissions changed. a bit of an edge case maybe?
and yes, no need for a ValueError catch (i thought we were accessing some optional field of the remote_entry object).
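
A minimal, self-contained illustration of the race being described: the file exists when the file list is built, but is gone by the time it is opened for hashing, so open() raises FileNotFoundError, an OSError subclass (the deletion is simulated here; the file name is arbitrary):

import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.safetensors"
    path.write_bytes(b"weights")
    assert path.is_file()  # the existence check passes...
    path.unlink()          # ...then another process removes the file
    try:
        with path.open("rb") as stream:  # what compute_file_hash does next
            stream.read()
    except OSError as exc:
        print(f"io-error:{exc}")  # io-error:[Errno 2] No such file or directory: ...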

@hanouticelina hanouticelina requested a review from Wauplin October 29, 2025 17:28
@hanouticelina
Contributor Author

thanks @Wauplin for the very thorough review! I addressed all your comments and refactored the logic a bit

Contributor

@Wauplin Wauplin left a comment

Thanks for the iteration! This time I've checked the tests which look great 🤗

Left a last round of comments but overall looks good :)

Comment on lines +131 to +135
except OSError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"io-error:{exc}", algorithm="io")
    )
    continue
Contributor

I would avoid handling this use case. If it really happens a lot, we might reintroduce a try-except, but for now I think we can safely assume that someone running hf cache verify won't be modifying the files at the same time.

Comment on lines +3151 to +3155
    verification = verify_maps(
        remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision
    )

    return replace(verification, verified_path=root)
Contributor

Suggested change
-    verification = verify_maps(
-        remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision
-    )
-    return replace(verification, verified_path=root)
+    return verify_maps(
+        verified_path=root,
+        remote_by_path=remote_by_path,
+        local_by_path=local_by_path,
+        revision=remote_revision,
+    )

I feel it's cleaner if we pass the verified_path to verify_maps directly so the FolderVerification dataclass is directly instantiated with the correct values. The issue with the current implementation is that FolderVerification.verified_path is typed as Optional[Path] while in reality it shouldn't be optional (internally it's currently optional only to make type annotations happy).

    mismatches: list[Mismatch]
    missing_paths: list[str]
    extra_paths: list[str]
    verified_path: Optional[Path] = None
Contributor

^here is what I meant above

Suggested change
-    verified_path: Optional[Path] = None
+    verified_path: Path

)


def compute_file_hash(path: Path, algorithm: HashAlgo, *, git_hash_cache: dict[Path, str]) -> str:
Contributor

Suggested change
-def compute_file_hash(path: Path, algorithm: HashAlgo, *, git_hash_cache: dict[Path, str]) -> str:
+def compute_file_hash(path: Path, algorithm: HashAlgo) -> str:

I feel that the git_hash_cache is never used in practice since the key is the path and we iterate over unique paths. So better to simplify the logic by not having a cache at all.

Contributor

If the cache is relevant, shouldn't we use it for both git-hash and sha256?

Comment on lines +84 to +97
    if algorithm == "sha256":
        with path.open("rb") as stream:
            return sha_fileobj(stream).hex()

    if algorithm == "git-sha1":
        try:
            return git_hash_cache[path]
        except KeyError:
            with path.open("rb") as stream:
                digest = git_hash(stream.read())
            git_hash_cache[path] = digest
            return digest

    raise ValueError(f"Unsupported hash algorithm: {algorithm}")
Contributor

Suggested change
-    if algorithm == "sha256":
-        with path.open("rb") as stream:
-            return sha_fileobj(stream).hex()
-    if algorithm == "git-sha1":
-        try:
-            return git_hash_cache[path]
-        except KeyError:
-            with path.open("rb") as stream:
-                digest = git_hash(stream.read())
-            git_hash_cache[path] = digest
-            return digest
-    raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+    with path.open("rb") as stream:
+        if algorithm == "sha256":
+            return sha_fileobj(stream).hex()
+        if algorithm == "git-sha1":
+            return git_hash(stream.read())
+    raise ValueError(f"Unsupported hash algorithm: {algorithm}")

^ I think this is how the logic could be simplified without a cache.
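
Put together, the simplified helper and a quick self-check could look like this (a sketch: HashAlgo is assumed to be a Literal alias mirroring the PR, and git_hash/sha_fileobj are the same helpers the PR already uses, assumed importable from huggingface_hub.utils.sha):

import tempfile
from pathlib import Path
from typing import Literal

from huggingface_hub.utils.sha import git_hash, sha_fileobj

HashAlgo = Literal["sha256", "git-sha1"]  # assumed alias, mirroring the PR's usage


def compute_file_hash(path: Path, algorithm: HashAlgo) -> str:
    # Single open() shared by both algorithms, as suggested above.
    with path.open("rb") as stream:
        if algorithm == "sha256":
            return sha_fileobj(stream).hex()  # streamed sha256 (LFS files)
        if algorithm == "git-sha1":
            return git_hash(stream.read())  # git blob hash (regular files)
    raise ValueError(f"Unsupported hash algorithm: {algorithm}")


# Quick check on a throwaway file:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
path = Path(tmp.name)
print(compute_file_hash(path, "sha256"))    # 2cf24db...
print(compute_file_hash(path, "git-sha1"))  # b6fc4c6...
path.unlink()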
