[HfFileSystem] improve cache for multiprocessing fork and multithreading #3500
+131
−9
The "fork" start method in multiprocessing doesn't work well with the instances cache.
Contrary to "spawn", which pickles the instances and repopulates the cache in the subprocess, "fork" doesn't repopulate the instances. "fork" does keep the old cache, but it can't reuse the old instances because the `fs_token` used to identify instances changes in subprocesses.

I fixed that by making `fs_token` independent of the process id. This implied improving a fsspec metaclass 😬

Finally, I improved the multithreading case: a new thread can now reuse the cache from the main thread.
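To illustrate the idea (not the actual huggingface_hub/fsspec implementation): a minimal sketch of a caching metaclass whose instance token is derived only from the class and constructor arguments, deliberately excluding the process id and thread id, so the same token is computed in a forked subprocess or a new thread. `make_fs_token`, `CachedMeta`, and `DummyFileSystem` are hypothetical names for this sketch.

```python
import hashlib

def make_fs_token(cls_name, *args, **kwargs):
    # Hypothetical helper: the token depends only on the class name and
    # constructor arguments -- notably NOT on os.getpid() or
    # threading.get_ident() -- so a forked subprocess or a new thread
    # computes the same token and can hit the inherited cache.
    payload = repr((cls_name, args, sorted(kwargs.items())))
    return hashlib.sha256(payload.encode()).hexdigest()

class CachedMeta(type):
    # Minimal stand-in for an fsspec-style instance-caching metaclass.
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}

    def __call__(cls, *args, **kwargs):
        token = make_fs_token(cls.__name__, *args, **kwargs)
        if token not in cls._cache:
            instance = super().__call__(*args, **kwargs)
            instance.fs_token = token
            cls._cache[token] = instance
        return cls._cache[token]

class DummyFileSystem(metaclass=CachedMeta):
    def __init__(self, endpoint="https://huggingface.co", token=None):
        self.endpoint = endpoint
        self.token = token

fs1 = DummyFileSystem()
fs2 = DummyFileSystem()
assert fs1 is fs2  # same arguments -> same token -> cached instance reused
```

Because the token is process-independent, the `_cache` dict that "fork" copies into the child process is keyed by tokens the child will actually recompute, so the inherited instances become reusable instead of orphaned.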
Minor: I needed to make HfHubHTTPError picklable / deepcopyable.
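For context on the pickling point: exceptions whose `__init__` stores extra state beyond what they pass to `Exception.__init__` can silently lose that state (or fail outright) on pickle/deepcopy round-trips, because the default reconstruction only replays `exc.args`. A common fix is implementing `__reduce__`. This is a generic sketch with a hypothetical `MyHTTPError`, not the actual HfHubHTTPError code:

```python
import copy
import pickle

class MyHTTPError(Exception):
    # Hypothetical exception carrying extra state (server_message) that is
    # not part of exc.args, so the default pickle behavior would drop it.
    def __init__(self, message, server_message=None):
        super().__init__(message)
        self.server_message = server_message

    def __reduce__(self):
        # Tell pickle (and copy.deepcopy, which reuses this protocol) how to
        # rebuild the exception with all of its original constructor arguments.
        return (self.__class__, (self.args[0], self.server_message))

err = MyHTTPError("404 Client Error", server_message="Repo not found")
restored = pickle.loads(pickle.dumps(err))
assert restored.server_message == "Repo not found"
assert copy.deepcopy(err).server_message == "Repo not found"
```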
TODO:
related to #3443
This improvement avoids calling the API again in DataLoader workers when using "fork", letting them reuse the data files list from the parent process (see https://huggingface.co/blog/streaming-datasets)