[HfFileSystem] improve cache for multiprocessing fork and multithreading #3500
+131
−9
The "fork" start method in multiprocessing doesn't work well with the instances cache.
Contrary to "spawn", which pickles the instances and repopulates the cache in the subprocess, "fork" doesn't repopulate the instances. "fork" does keep the old cache, but it can't reuse the old instances because the `fs_token` used to identify instances changes in subprocesses.

I fixed that by making `fs_token` independent of the process id. This implied improving a fsspec metaclass 😬

Finally, I improved the multithreading case: a new thread can now reuse the cache from the main thread.
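To illustrate the idea (not the actual huggingface_hub/fsspec implementation): a minimal sketch of a caching metaclass whose instance token is derived only from the class and constructor arguments, deliberately excluding the process id and thread id, so the same token is computed in a forked subprocess or a new thread. `make_fs_token`, `CachedMeta`, and `DummyFileSystem` are hypothetical names for this sketch.

```python
import hashlib

def make_fs_token(cls_name, *args, **kwargs):
    # Hypothetical helper: the token depends only on the class name and
    # constructor arguments -- notably NOT on os.getpid() or
    # threading.get_ident() -- so a forked subprocess or a new thread
    # computes the same token and can hit the inherited cache.
    payload = repr((cls_name, args, sorted(kwargs.items())))
    return hashlib.sha256(payload.encode()).hexdigest()

class CachedMeta(type):
    # Minimal stand-in for an fsspec-style instance-caching metaclass.
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}

    def __call__(cls, *args, **kwargs):
        token = make_fs_token(cls.__name__, *args, **kwargs)
        if token not in cls._cache:
            instance = super().__call__(*args, **kwargs)
            instance.fs_token = token
            cls._cache[token] = instance
        return cls._cache[token]

class DummyFileSystem(metaclass=CachedMeta):
    def __init__(self, endpoint="https://huggingface.co", token=None):
        self.endpoint = endpoint
        self.token = token

fs1 = DummyFileSystem()
fs2 = DummyFileSystem()
assert fs1 is fs2  # same arguments -> same token -> cached instance reused
```

Because the token is process-independent, the `_cache` dict that "fork" copies into the child process is keyed by tokens the child will actually recompute, so the inherited instances become reusable instead of orphaned.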
Minor: I needed to make HfHubHTTPError picklable / deepcopyable.
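For context on the pickling point: exceptions whose `__init__` stores extra state beyond what they pass to `Exception.__init__` can silently lose that state (or fail outright) on pickle/deepcopy round-trips, because the default reconstruction only replays `exc.args`. A common fix is implementing `__reduce__`. This is a generic sketch with a hypothetical `MyHTTPError`, not the actual HfHubHTTPError code:

```python
import copy
import pickle

class MyHTTPError(Exception):
    # Hypothetical exception carrying extra state (server_message) that is
    # not part of exc.args, so the default pickle behavior would drop it.
    def __init__(self, message, server_message=None):
        super().__init__(message)
        self.server_message = server_message

    def __reduce__(self):
        # Tell pickle (and copy.deepcopy, which reuses this protocol) how to
        # rebuild the exception with all of its original constructor arguments.
        return (self.__class__, (self.args[0], self.server_message))

err = MyHTTPError("404 Client Error", server_message="Repo not found")
restored = pickle.loads(pickle.dumps(err))
assert restored.server_message == "Repo not found"
assert copy.deepcopy(err).server_message == "Repo not found"
```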
TODO:
related to #3443
This improvement avoids calling the API again in DataLoader workers when using "fork", letting them reuse the data files list from the parent process (see https://huggingface.co/blog/streaming-datasets)