39 changes: 39 additions & 0 deletions docs/source/en/guides/cli.md
@@ -673,6 +673,45 @@ Deleted 3 unreferenced revision(s); freed 2.4G.

As with the other cache commands, `--dry-run`, `--yes`, and `--cache-dir` are available. Refer to the [Manage your cache](./manage-cache) guide for more examples.

## hf cache verify

Use `hf cache verify` to validate local files against their checksums on the Hub. You can verify either a cache snapshot or a regular local directory.

Examples:

```bash
# Verify main revision of a model in cache
>>> hf cache verify deepseek-ai/DeepSeek-OCR

# Verify a specific revision
>>> hf cache verify deepseek-ai/DeepSeek-OCR --revision refs/pr/5
>>> hf cache verify deepseek-ai/DeepSeek-OCR --revision ef93bf4a377c5d5ed9dca78e0bc4ea50b26fe6a4

# Verify a private repo
>>> hf cache verify me/private-model --token hf_***

# Verify a dataset
>>> hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset

# Verify files in a local directory
>>> hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo
```

By default, the command warns about missing or extra files. Use flags to turn these warnings into errors:

```bash
>>> hf cache verify deepseek-ai/DeepSeek-OCR --fail-on-missing-files --fail-on-extra-files
```

On success, you will see a summary:

```text
✅ Verified 13 file(s) for 'deepseek-ai/DeepSeek-OCR' (model) in ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-OCR/snapshots/<commit-hash>
All checksums match.
```

If mismatches are detected, the command prints a detailed list and exits with a non-zero status.
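Under the hood, the Hub already stores a checksum for every file: a git blob OID for regular files and a SHA-256 digest for LFS files, so verification amounts to recomputing those hashes locally and comparing. As an illustrative sketch (not the library's internal code), the two hash flavors can be computed like this:

```python
import hashlib


def git_blob_oid(data: bytes) -> str:
    # git identifies a blob by sha1("blob <size>\0" + content)
    header = b"blob " + str(len(data)).encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def lfs_sha256(data: bytes) -> str:
    # LFS-tracked files are identified by a plain SHA-256 of their content
    return hashlib.sha256(data).hexdigest()
```

For example, the empty blob hashes to the well-known git OID `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, which is what the Hub would report for an empty file committed without LFS.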

## hf repo tag create

The `hf repo tag create` command allows you to tag, untag, and list tags for repositories.
20 changes: 20 additions & 0 deletions docs/source/en/guides/manage-cache.md
@@ -479,6 +479,26 @@ HFCacheInfo(
)
```

### Verify your cache

`huggingface_hub` can verify that your cached files match the checksums stored on the Hub. Use the `hf cache verify` command to check file consistency for a specific revision of a repository:

```bash
>>> hf cache verify meta-llama/Llama-3.2-1B-Instruct
✅ Verified 13 file(s) for 'meta-llama/Llama-3.2-1B-Instruct' (model) in ~/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6
All checksums match.
```
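The snapshot path printed above follows the cache's predictable layout: `<repo_type>s--<org>--<name>/snapshots/<commit>`. A small helper sketch (hypothetical, assuming the standard layout) that reconstructs where verification looks:

```python
from pathlib import Path


def snapshot_path(cache_dir: str, repo_id: str, commit: str, repo_type: str = "model") -> Path:
    # e.g. models--meta-llama--Llama-3.2-1B-Instruct/snapshots/<commit>
    repo_folder = f"{repo_type}s--{repo_id.replace('/', '--')}"
    return Path(cache_dir) / repo_folder / "snapshots" / commit
```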

Verify a specific cached revision:

```bash
>>> hf cache verify meta-llama/Llama-3.1-8B-Instruct --revision 0e9e39f249a16976918f6564b8830bc894c89659
```

> [!TIP]
> Check the [`hf cache verify` CLI reference](../package_reference/cli#hf-cache-verify) for more details about the usage and a complete list of options.

### Clean your cache

Scanning your cache is interesting but what you really want to do next is usually to
32 changes: 32 additions & 0 deletions docs/source/en/package_reference/cli.md
@@ -152,6 +152,7 @@ $ hf cache [OPTIONS] COMMAND [ARGS]...
* `ls`: List cached repositories or revisions.
* `prune`: Remove detached revisions from the cache.
* `rm`: Remove cached repositories or revisions.
* `verify`: Verify checksums for a single repo...

### `hf cache ls`

@@ -210,6 +211,37 @@ $ hf cache rm [OPTIONS] TARGETS...
* `--dry-run / --no-dry-run`: Preview deletions without removing anything. [default: no-dry-run]
* `--help`: Show this message and exit.

### `hf cache verify`

Verify checksums for a single repo revision from cache or a local directory.

Examples:
- Verify main revision in cache: `hf cache verify gpt2`
- Verify specific revision: `hf cache verify gpt2 --revision refs/pr/1`
- Verify dataset: `hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset`
- Verify local dir: `hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo`

**Usage**:

```console
$ hf cache verify [OPTIONS] REPO_ID
```

**Arguments**:

* `REPO_ID`: The ID of the repo (e.g. `username/repo-name`). [required]

**Options**:

* `--repo-type [model|dataset|space]`: The type of repository (model, dataset, or space). [default: model]
* `--revision TEXT`: Git revision id which can be a branch name, a tag, or a commit hash.
* `--cache-dir TEXT`: Cache directory to use when verifying files from cache (defaults to Hugging Face cache).
* `--local-dir TEXT`: If set, verify files under this directory instead of the cache.
* `--fail-on-missing-files`: Fail if some files exist on the remote but are missing locally.
* `--fail-on-extra-files`: Fail if some files exist locally but are not present on the remote revision.
* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens.
* `--help`: Show this message and exit.

## `hf download`

Download files from the Hub.
3 changes: 3 additions & 0 deletions src/huggingface_hub/__init__.py
@@ -288,6 +288,7 @@
"upload_file",
"upload_folder",
"upload_large_folder",
"verify_repo_checksums",
"whoami",
],
"hf_file_system": [
@@ -968,6 +969,7 @@
"upload_file",
"upload_folder",
"upload_large_folder",
"verify_repo_checksums",
"webhook_endpoint",
"whoami",
]
@@ -1302,6 +1304,7 @@ def __dir__():
upload_file, # noqa: F401
upload_folder, # noqa: F401
upload_large_folder, # noqa: F401
verify_repo_checksums, # noqa: F401
whoami, # noqa: F401
)
from .hf_file_system import (
105 changes: 104 additions & 1 deletion src/huggingface_hub/cli/cache.py
@@ -37,7 +37,7 @@
tabulate,
)
from ..utils._parsing import parse_duration, parse_size
from ._cli_utils import typer_factory
from ._cli_utils import RepoIdArg, RepoTypeOpt, RevisionOpt, TokenOpt, get_hf_api, typer_factory


cache_cli = typer_factory(help="Manage local cache directory.")
@@ -634,3 +634,106 @@ def prune(

strategy.execute()
print(f"Deleted {counts.total_revision_count} unreferenced revision(s); freed {strategy.expected_freed_size_str}.")


@cache_cli.command()
def verify(
repo_id: RepoIdArg,
repo_type: RepoTypeOpt = RepoTypeOpt.model,
revision: RevisionOpt = None,
cache_dir: Annotated[
Optional[str],
typer.Option(
help="Cache directory to use when verifying files from cache (defaults to Hugging Face cache).",
),
] = None,
local_dir: Annotated[
Optional[str],
typer.Option(
help="If set, verify files under this directory instead of the cache.",
),
] = None,
fail_on_missing_files: Annotated[
bool,
typer.Option(
"--fail-on-missing-files",
help="Fail if some files exist on the remote but are missing locally.",
),
] = False,
fail_on_extra_files: Annotated[
bool,
typer.Option(
"--fail-on-extra-files",
help="Fail if some files exist locally but are not present on the remote revision.",
),
] = False,
token: TokenOpt = None,
) -> None:
"""Verify checksums for a single repo revision from cache or a local directory.

Examples:
- Verify main revision in cache: `hf cache verify gpt2`
- Verify specific revision: `hf cache verify gpt2 --revision refs/pr/1`
- Verify dataset: `hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset`
- Verify local dir: `hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo`
"""

if local_dir is not None and cache_dir is not None:
print("Cannot pass both --local-dir and --cache-dir. Use one or the other.")
raise typer.Exit(code=2)

api = get_hf_api(token=token)

result = api.verify_repo_checksums(
repo_id=repo_id,
repo_type=repo_type.value,
revision=revision,
local_dir=local_dir,
cache_dir=cache_dir,
token=token,
)

exit_code = 0

has_mismatches = bool(result.mismatches)
if has_mismatches:
print("❌ Checksum verification failed for the following file(s):")
for m in result.mismatches:
print(f" - {m['path']}: expected {m['expected']} ({m['algorithm']}), got {m['actual']}")
exit_code = 1

if result.missing_paths:
if fail_on_missing_files:
print("Missing files (present remotely, absent locally):")
for p in result.missing_paths:
print(f" - {p}")
exit_code = 1
else:
warning = (
f"{len(result.missing_paths)} remote file(s) are missing locally. "
"Use --fail-on-missing-files for details."
)
print(f"⚠️ {warning}")

if result.extra_paths:
if fail_on_extra_files:
print("Extra files (present locally, absent remotely):")
for p in result.extra_paths:
print(f" - {p}")
exit_code = 1
else:
warning = (
f"{len(result.extra_paths)} local file(s) do not exist on the remote repo. "
"Use --fail-on-extra-files for details."
)
print(f"⚠️ {warning}")

verified_location = result.verified_path

if exit_code != 0:
print(f"❌ Verification failed for '{repo_id}' ({repo_type.value}) in {verified_location}.")
print(f" Revision: {result.revision}")
raise typer.Exit(code=exit_code)

print(f"✅ Verified {result.checked_count} file(s) for '{repo_id}' ({repo_type.value}) in {verified_location}")
print(" All checksums match.")
78 changes: 77 additions & 1 deletion src/huggingface_hub/hf_api.py
@@ -105,11 +105,13 @@
from .utils._auth import _get_token_from_environment, _get_token_from_file, _get_token_from_google_colab
from .utils._deprecation import _deprecate_arguments
from .utils._typing import CallableT
from .utils._verification import collect_local_files, resolve_local_root, verify_maps
from .utils.endpoint_helpers import _is_emission_within_threshold


if TYPE_CHECKING:
from .inference._providers import PROVIDER_T
from .utils._verification import FolderVerification

R = TypeVar("R") # Return type
CollectionItemType_T = Literal["model", "dataset", "space", "paper", "collection"]
@@ -596,7 +598,7 @@ class RepoFile:
The file's size, in bytes.
blob_id (`str`):
The file's git OID.
lfs (`BlobLfsInfo`):
lfs (`BlobLfsInfo`, *optional*):
The file's LFS metadata.
last_commit (`LastCommitInfo`, *optional*):
The file's last commit metadata. Only defined if [`list_repo_tree`] and [`get_paths_info`]
@@ -3080,6 +3082,79 @@ def list_repo_tree(
for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
yield (RepoFile(**path_info) if path_info["type"] == "file" else RepoFolder(**path_info))

@validate_hf_hub_args
def verify_repo_checksums(
self,
repo_id: str,
*,
repo_type: Optional[str] = None,
revision: Optional[str] = None,
local_dir: Optional[Union[str, Path]] = None,
cache_dir: Optional[Union[str, Path]] = None,
token: Union[str, bool, None] = None,
) -> "FolderVerification":
"""
Verify local files for a repo against Hub checksums.

Args:
repo_id (`str`):
A namespace (user or an organization) and a repo name separated by a `/`.
repo_type (`str`, *optional*):
The type of the repository to verify (`"model"`, `"dataset"` or `"space"`).
Defaults to `"model"`.
revision (`str`, *optional*):
The revision of the repository to verify. Defaults to the `"main"` branch.
local_dir (`str` or `Path`, *optional*):
The local directory to verify.
cache_dir (`str` or `Path`, *optional*):
The cache directory to verify.
token (`Union[bool, str, None]`, *optional*):
A valid user access token (string). Defaults to the locally saved
token, which is the recommended method for authentication (see
https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
To disable authentication, pass `False`.

Returns:
[`FolderVerification`]: a structured result containing the verification details.

Raises:
[`~utils.RepositoryNotFoundError`]:
If repository is not found (error 404): wrong repo_id/repo_type, private but not authenticated or repo
does not exist.
[`~utils.RevisionNotFoundError`]:
If revision is not found (error 404) on the repo.

"""

if repo_type is None:
repo_type = constants.REPO_TYPE_MODEL

if local_dir is not None and cache_dir is not None:
raise ValueError("Pass either `local_dir` or `cache_dir`, not both.")

root, remote_revision = resolve_local_root(
repo_id=repo_id,
repo_type=repo_type,
revision=revision,
cache_dir=Path(cache_dir) if cache_dir is not None else None,
local_dir=Path(local_dir) if local_dir is not None else None,
)
local_by_path = collect_local_files(root)

# get remote entries
remote_by_path: dict[str, Union[RepoFile, RepoFolder]] = {}
for entry in self.list_repo_tree(
repo_id=repo_id, recursive=True, revision=remote_revision, repo_type=repo_type, token=token
):
remote_by_path[entry.path] = entry

return verify_maps(
remote_by_path=remote_by_path,
local_by_path=local_by_path,
revision=remote_revision,
verified_path=root,
)

@validate_hf_hub_args
def list_repo_refs(
self,
@@ -10733,6 +10808,7 @@ def _parse_revision_from_pr_url(pr_url: str) -> str:
list_repo_commits = api.list_repo_commits
list_repo_tree = api.list_repo_tree
get_paths_info = api.get_paths_info
verify_repo_checksums = api.verify_repo_checksums

get_model_tags = api.get_model_tags
get_dataset_tags = api.get_dataset_tags