Member

@lhoestq commented Oct 29, 2025

The "fork" start method in multiprocessing doesn't work well with the instances cache.

Unlike "spawn", which pickles the instances and repopulates the cache in the subprocess, "fork" doesn't repopulate the instances. "fork" does keep the old cache, but it can't reuse the old instances because the fs_token used to identify instances changes in subprocesses.
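To make the cache miss concrete, here is a minimal sketch with a toy cache. It is not fsspec's actual code: the `toy_token` helper and the pid values are hypothetical, and the only thing taken from the description above is the assumption that the token mixes in the process id.

```python
import hashlib


def toy_token(pid, *args):
    """Hypothetical token: hashes the process id together with the arguments."""
    return hashlib.sha256(repr((pid, args)).encode()).hexdigest()


# The parent process (pid 1000, an arbitrary value) creates and caches an instance.
instances_cache = {}
parent_token = toy_token(1000, "https://huggingface.co")
instances_cache[parent_token] = "instance created in the parent"

# After a fork the cache dict is still there, but the child has another pid,
# so the same arguments hash to a different token and the lookup misses.
child_token = toy_token(2000, "https://huggingface.co")
cache_hit_in_child = child_token in instances_cache
```

The cached entry is still in memory in the child, it just can't be found under the new token.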

I fixed that by making fs_token independent of the process id. This required changing an fsspec metaclass 😬
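A rough sketch of the idea, assuming a caching metaclass whose token is derived only from the class and its arguments. `PidIndependentCache` and `ToyFileSystem` are hypothetical names for illustration and do not reflect the actual fsspec or huggingface_hub implementation.

```python
import hashlib


class PidIndependentCache(type):
    """Toy metaclass: caches instances under a token that ignores the process id."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._instances = {}  # one cache per class

    def __call__(cls, *args, **kwargs):
        # The token depends only on the class and its arguments, so a forked
        # child computes the same token as its parent.
        key = repr((cls.__qualname__, args, sorted(kwargs.items())))
        token = hashlib.sha256(key.encode()).hexdigest()
        if token not in cls._instances:
            instance = super().__call__(*args, **kwargs)
            instance.fs_token = token
            cls._instances[token] = instance
        return cls._instances[token]


class ToyFileSystem(metaclass=PidIndependentCache):
    def __init__(self, endpoint="https://example.org", token=None):
        self.endpoint = endpoint
        self.token = token
```

With this, `ToyFileSystem(endpoint="https://example.org")` returns the same object every time it is called with the same arguments, whichever process id is current.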

Finally, I improved the multithreading case: a new instance can now reuse the cache from the main thread.
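One way to picture the multithreading case, as a sketch only: the description doesn't say which cache is reused or how, so this assumes a per-instance listing cache that a new instance seeds from the first instance created for the same endpoint. `ToyFS`, `listing_cache` and `listing_seen_by_new_thread` are all hypothetical.

```python
import threading


class ToyFS:
    """Toy filesystem whose new instances start from the first instance's cache."""

    _first_instances = {}  # endpoint -> first instance created for it
    _lock = threading.Lock()

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.listing_cache = {}
        with ToyFS._lock:
            first = ToyFS._first_instances.setdefault(endpoint, self)
        if first is not self:
            # Start from a copy of what the first instance already listed.
            self.listing_cache = dict(first.listing_cache)


def listing_seen_by_new_thread(endpoint, path):
    """Create a fresh instance in a worker thread and return its cached listing."""
    result = {}

    def work():
        result["listing"] = ToyFS(endpoint).listing_cache.get(path)

    thread = threading.Thread(target=work)
    thread.start()
    thread.join()
    return result["listing"]
```

If the main thread has already populated `listing_cache` for a path, the instance created in the worker thread sees that listing without fetching it again.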

Minor: I needed to make HfHubHTTPError picklable / deepcopyable.
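For context, exceptions whose `__init__` needs more than the positional `args` often fail to unpickle or deepcopy, because the default reduction rebuilds them from `args` alone. The sketch below shows one common remedy with a hypothetical `ToyHTTPError`; it is not necessarily how HfHubHTTPError was changed.

```python
import copy
import pickle


class ToyHTTPError(Exception):
    """Hypothetical error with a required keyword-only argument."""

    def __init__(self, message, *, status_code):
        super().__init__(message)
        self.status_code = status_code

    def __reduce__(self):
        # The default reduction would call ToyHTTPError(message) and fail
        # because status_code is missing; pass it along explicitly instead.
        return (_rebuild_toy_error, (type(self), self.args[0], self.status_code))


def _rebuild_toy_error(cls, message, status_code):
    """Module-level helper so pickle can find it by name."""
    return cls(message, status_code=status_code)


original = ToyHTTPError("Not Found", status_code=404)
via_pickle = pickle.loads(pickle.dumps(original))
via_deepcopy = copy.deepcopy(original)
```

Both copies keep the message and the status code, since `copy.deepcopy` goes through the same `__reduce__` hook as `pickle`.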

TODO:

  • tests

related to #3443

With this improvement, DataLoader workers started with "fork" no longer call the API again and instead reuse the data files list from the parent process (see https://huggingface.co/blog/streaming-datasets)
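The effect can be sketched with the standard library alone, assuming a module-level cache stands in for the filesystem's cache and a counter stands in for real API requests. `list_data_files`, `STATS` and `run_forked_worker` are hypothetical names, and the "fork" start method is not available on Windows.

```python
import multiprocessing as mp

STATS = {"api_calls": 0}  # counts simulated API requests in this process
_FILE_LIST_CACHE = {}  # repo id -> list of data files


def list_data_files(repo_id):
    """Return the data files of a repo, hitting the fake API only on a cache miss."""
    if repo_id not in _FILE_LIST_CACHE:
        STATS["api_calls"] += 1  # stands in for a real network request
        _FILE_LIST_CACHE[repo_id] = [f"{repo_id}/part-{i}.parquet" for i in range(3)]
    return _FILE_LIST_CACHE[repo_id]


def _worker(conn, repo_id):
    calls_before = STATS["api_calls"]
    files = list_data_files(repo_id)
    # Report the files and how many new API calls this worker had to make.
    conn.send((files, STATS["api_calls"] - calls_before))
    conn.close()


def run_forked_worker(repo_id):
    """Run the worker in a forked child and return (files, new_api_calls)."""
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(child_conn, repo_id))
    proc.start()
    result = parent_conn.recv()
    proc.join()
    return result
```

If the parent calls `list_data_files` before forking, the child inherits the populated cache and reports zero new API calls.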

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq changed the title from "[HfFileSystem] improve cache for multiprocessing fork" to "[HfFileSystem] improve cache for multiprocessing fork and multithreading" on Oct 29, 2025
@lhoestq marked this pull request as ready for review on October 30, 2025 13:03
@lhoestq force-pushed the improve-hffs-cache-for-mp-fork branch from 177b790 to 82a58ee on October 30, 2025 15:01
@lhoestq requested a review from Wauplin on October 30, 2025 16:10