[CLI] Add hf cache verify
#3461
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey! Thanks for opening this PR. Here are some high-level thoughts about this feature:
In the end, the CLI I suggest would look like this:

```sh
hf cache verify <repo-id> [--repo-type ...] [--revision ...] [--cache-dir ...] [--token ...] [--local-dir ...] [--fail-on-missing-files] [--fail-on-extra-files]
```

```sh
# Verify the main revision of "deepseek-ai/DeepSeek-OCR" in the cache
hf cache verify deepseek-ai/DeepSeek-OCR

# Verify a specific revision
hf cache verify deepseek-ai/DeepSeek-OCR --revision refs/pr/1
hf cache verify deepseek-ai/DeepSeek-OCR --revision abcdef123

# Verify a private repo
hf cache verify me/private-model --token ...

# Verify a dataset
hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset

# Verify a local dir
hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo
```

Let me know what you think. I might not have thought of all possible use cases, so happy to get it challenged ^^
agh the commit history is messed up since we merged

Force-pushed from 1611e8d to a3e5a67
(haven't reviewed the tests)
```python
except OSError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"io-error:{exc}", algorithm="io")
    )
    continue
except ValueError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"meta-error:{exc}", algorithm="meta")
    )
    continue
```
I don't understand this part 😕 Shouldn't "algorithm" be git-hash and sha256? Also, why could an OSError or ValueError be raised, since compute_file_hash doesn't raise anything?
yes, the algorithm is either git-sha1 or sha256. I added a catch for the OSError because compute_file_hash opens the local file and reads it, and that can fail with one of the OSError subclasses. We do know in advance that the file exists, but by the time compute_file_hash opens the path, the file could have been deleted, replaced, or had its permissions changed. A bit of an edge case maybe?

And yes, no need for a ValueError catch (I thought we were accessing some optional field of the remote_entry object).
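To illustrate the race being described, here is a minimal sketch; the file name and the compute_file_hash body are hypothetical, reduced to the part that can fail:

```python
import hashlib
from pathlib import Path

def compute_file_hash(path: Path) -> str:
    # Opens and reads the file at hashing time; if the file disappeared or its
    # permissions changed after the earlier existence check, open() raises an
    # OSError subclass (FileNotFoundError, PermissionError, ...).
    with path.open("rb") as stream:
        return hashlib.sha256(stream.read()).hexdigest()

path = Path("model.bin")      # hypothetical cached file
path.write_bytes(b"weights")  # the file exists when the snapshot is listed...
path.unlink()                 # ...but another process deletes it before hashing
try:
    compute_file_hash(path)
except OSError as exc:
    print(f"io-error:{exc}")  # what the except branch above would have reported
```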
thanks @Wauplin for the very thorough review! I addressed all your comments and refactored the logic a bit.
Thanks for the iteration! This time I've checked the tests which look great 🤗
Left a last round of comments but overall looks good :)
```python
except OSError as exc:
    mismatches.append(
        Mismatch(path=rel_path, expected="<unavailable>", actual=f"io-error:{exc}", algorithm="io")
    )
    continue
```
I would avoid handling this use case. If it really happens a lot, we might reintroduce a try-except, but for now I think we can safely assume that someone running hf cache verify won't be modifying the files at the same time.
```python
verification = verify_maps(
    remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision
)

return replace(verification, verified_path=root)
```
```diff
-verification = verify_maps(
-    remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision
-)
-return replace(verification, verified_path=root)
+return verify_maps(
+    verified_path=root,
+    remote_by_path=remote_by_path,
+    local_by_path=local_by_path,
+    revision=remote_revision,
+)
```
I feel it's cleaner if we pass the verified_path to verify_maps directly, so the FolderVerification dataclass is instantiated with the correct values from the start. The issue with the current implementation is that FolderVerification.verified_path is typed as Optional[Path] while in reality it shouldn't be optional (internally it's currently optional only to make type annotations happy).
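To make the suggestion concrete, here is a minimal sketch of the shapes implied by the snippets in this thread; the verify_maps body and exact signatures are assumptions, not the PR's actual code:

```python
from dataclasses import dataclass
from pathlib import Path

# Field names reconstructed from the snippets in this thread.
@dataclass(frozen=True)
class Mismatch:
    path: str
    expected: str
    actual: str
    algorithm: str

@dataclass(frozen=True)
class FolderVerification:
    mismatches: list[Mismatch]
    missing_paths: list[str]
    extra_paths: list[str]
    verified_path: Path  # required at construction, no Optional default

def verify_maps(
    *, verified_path: Path, remote_by_path: dict, local_by_path: dict, revision: str
) -> FolderVerification:
    # Compare the remote and local file maps (checksum logic elided in this sketch).
    return FolderVerification(
        mismatches=[],
        missing_paths=sorted(set(remote_by_path) - set(local_by_path)),
        extra_paths=sorted(set(local_by_path) - set(remote_by_path)),
        verified_path=verified_path,  # set here, no dataclasses.replace() afterwards
    )
```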
```python
mismatches: list[Mismatch]
missing_paths: list[str]
extra_paths: list[str]
verified_path: Optional[Path] = None
```
^here is what I meant above
```diff
-verified_path: Optional[Path] = None
+verified_path: Path
```
```python
def compute_file_hash(path: Path, algorithm: HashAlgo, *, git_hash_cache: dict[Path, str]) -> str:
```
```diff
-def compute_file_hash(path: Path, algorithm: HashAlgo, *, git_hash_cache: dict[Path, str]) -> str:
+def compute_file_hash(path: Path, algorithm: HashAlgo) -> str:
```
I feel that the git_hash_cache is never used in practice, since the key is the path and we iterate over unique paths. So better to simplify the logic by not having a cache at all.
If the cache is relevant, let's use it for both git-hash and sha256, shouldn't we? (See the sketch below.)
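For illustration, one way such a shared cache could look, memoizing on the (path, algorithm) pair. This is a hypothetical sketch, not the PR's code, and the import path for the helpers is an assumption:

```python
from functools import lru_cache
from pathlib import Path

# Import path assumed; these are the helpers the reviewed snippet below uses.
from huggingface_hub.utils.sha import git_hash, sha_fileobj

@lru_cache(maxsize=None)
def compute_file_hash(path: Path, algorithm: str) -> str:
    # Memoized per (path, algorithm) pair, so git-sha1 and sha256 results are
    # both cached. Path and str are hashable, so lru_cache works as-is.
    with path.open("rb") as stream:
        if algorithm == "sha256":
            return sha_fileobj(stream).hex()
        if algorithm == "git-sha1":
            return git_hash(stream.read())
    raise ValueError(f"Unsupported hash algorithm: {algorithm}")
```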
```python
if algorithm == "sha256":
    with path.open("rb") as stream:
        return sha_fileobj(stream).hex()

if algorithm == "git-sha1":
    try:
        return git_hash_cache[path]
    except KeyError:
        with path.open("rb") as stream:
            digest = git_hash(stream.read())
        git_hash_cache[path] = digest
        return digest

raise ValueError(f"Unsupported hash algorithm: {algorithm}")
```
```diff
-if algorithm == "sha256":
-    with path.open("rb") as stream:
-        return sha_fileobj(stream).hex()
-if algorithm == "git-sha1":
-    try:
-        return git_hash_cache[path]
-    except KeyError:
-        with path.open("rb") as stream:
-            digest = git_hash(stream.read())
-        git_hash_cache[path] = digest
-        return digest
-raise ValueError(f"Unsupported hash algorithm: {algorithm}")
+with path.open("rb") as stream:
+    if algorithm == "sha256":
+        return sha_fileobj(stream).hex()
+    if algorithm == "git-sha1":
+        return git_hash(stream.read())
+raise ValueError(f"Unsupported hash algorithm: {algorithm}")
```
^ I think this is how the logic could be simplified without a cache.
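For context on the git-sha1 side: git computes a blob's id as SHA-1 over a `blob <size>\0` header followed by the content, which is presumably what the git_hash helper does. A standalone sketch:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    # Git hashes a blob as sha1(b"blob <len>\0" + content), so a plain sha1
    # over the raw file contents would not match the blob ids on the Hub.
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

assert git_blob_sha1(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```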
This PR adds a new CLI command that checks cached files against their checksums on the Hub. It verifies all cached revisions for a repo, or specific snapshots if a revision is provided.

Under the hood, it lists the remote files for each revision using list_repo_tree, maps them to local snapshots, and compares the two sets to find files that are missing locally or on the Hub. Then, for each file, it computes and compares checksums.
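For reference, a rough sketch of that flow against the public HfApi.list_repo_tree API. The attribute access (RepoFile.blob_id, RepoFile.lfs.sha256) follows the documented hf_api types, but the snapshot mapping and the elided checksum step are simplified assumptions, not the PR's actual implementation:

```python
from pathlib import Path
from huggingface_hub import HfApi, RepoFile

def sketch_verify(repo_id: str, snapshot_dir: Path, revision: str = "main", repo_type: str = "model"):
    """Approximation of the verification flow (not the PR's actual code)."""
    api = HfApi()

    # 1. List remote files and record the expected checksum for each path:
    #    LFS files expose a sha256, regular files a git blob id.
    remote_by_path: dict[str, tuple[str, str]] = {}
    for entry in api.list_repo_tree(repo_id, revision=revision, repo_type=repo_type, recursive=True):
        if isinstance(entry, RepoFile):  # skip RepoFolder entries
            if entry.lfs is not None:
                remote_by_path[entry.path] = ("sha256", entry.lfs.sha256)
            else:
                remote_by_path[entry.path] = ("git-sha1", entry.blob_id)

    # 2. Map local snapshot files by their path relative to the snapshot root.
    local_by_path = {
        p.relative_to(snapshot_dir).as_posix(): p
        for p in snapshot_dir.rglob("*") if p.is_file()
    }

    # 3. Compare the two sets to find files missing locally or extra locally.
    missing_paths = sorted(set(remote_by_path) - set(local_by_path))
    extra_paths = sorted(set(local_by_path) - set(remote_by_path))

    # 4. For files present on both sides, compute and compare checksums (elided here).
    return missing_paths, extra_paths
```

Comparing the two key sets directly is what would surface the files reported by the proposed --fail-on-missing-files and --fail-on-extra-files flags.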