Feature request
At the moment `.from_generator` can only create a dataset that lives in the cache. The cached dataset cannot be loaded with `load_from_disk` because the cache folder is missing `state.json`. So the only way to convert this cached dataset into a regular `Dataset` is `save_to_disk`, which creates a full copy of what is already in the cache. For large datasets this can waste a lot of space. In my case the saving operation failed, so I am stuck with a large cached dataset and no clear way to convert it into a `Dataset` that I can use.

The requested feature is a way to load a cached dataset using `.load_from_disk`. Alternatively, `.from_generator` could create the dataset at a specified location so that it can be loaded from there with `.load_from_disk`.
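For concreteness, here is a minimal sketch of the failure mode (the paths are illustrative and the exact error text may differ between versions):

```python
from datasets import Dataset, load_from_disk

def gen():
    yield {"text": "hello"}  # stand-in for the real generator

# from_generator writes its Arrow shards under cache_dir, in practice
# inside a nested builder subfolder such as
# gen_cache/generator/default-<hash>/0.0.0/
ds = Dataset.from_generator(gen, cache_dir="./gen_cache")

# Pointing load_from_disk at the cache fails, because from_generator
# never writes a state.json there:
load_from_disk("./gen_cache")  # raises: not a `Dataset` directory
```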
Motivation
I have the following workflow, which has exposed some awkwardness in the Datasets saving/caching:

- I created a cached dataset using `.from_generator`, which was cached in a folder. This dataset is rather large (~600GB) with many shards.
- I tried to save this dataset with `.save_to_disk` to another location so that I can use it later as a `Dataset`. This essentially creates another copy (for a total of 1.2TB!) of what is already in the cache... In my case the saving operation keeps dying for some reason, so I am stuck with a cached dataset and no copy.
- Now I am trying to "save" the existing cached dataset, but it is not clear how to access the cached files after `.from_generator` has finished, e.g. from a different process. I should not even be looking at the cache, but I really do not want to waste another 2 hours generating the set only for it to fail again (I already did this a couple of times).
- I tried `.load_from_disk`, but it does not work with cached files and complains that this is not a `Dataset` (!).
- I looked at `.from_file`, which takes a single file, but the cache has many files (shards), so I am not sure how to make this work; a possible workaround is sketched after this list.
- I tried `.load_dataset`, but this seems to either "download" a copy (of files that are already on the local file system!) which I would then need to save, or, with `streaming=True`, produce an `IterableDataset` which I would then need to convert (via the cache) into a `Dataset` so that I can save it. With both options I would end up with three copies of the same dataset, for a total of ~2TB! I am hoping there is another way to do this...
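The `.from_file` route might be salvageable by stitching the shards together; something like the sketch below might work (untested at this scale, and the glob pattern is a guess at the cache layout):

```python
from glob import glob
from datasets import Dataset, concatenate_datasets

# Guess at where from_generator left the shards (layout may differ)
shards = sorted(glob("gen_cache/generator/default-*/0.0.0/*.arrow"))

# Dataset.from_file memory-maps a single Arrow file, and
# concatenate_datasets combines the pieces without copying the data
ds = concatenate_datasets([Dataset.from_file(path) for path in shards])
```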
Maybe I am missing something here: I looked at the docs and forums but had no luck. I have a bunch of Arrow files cached by `Dataset.from_generator` and no clean way to turn them into a `Dataset` that I can use.
This all could be so much easier if `load_from_disk` could recognize the cached files and produce a `Dataset`: after the cache is created, I would not have to "save" it again and could just load it whenever I need it. At the moment `load_from_disk` needs `state.json`, which is missing from the cache folder. So perhaps `.from_generator` could be made to "finalize" the dataset once it is done (e.g. create `state.json`) so that it can be loaded easily. Or `.from_generator` could take a `save_to_dir` parameter, in addition to `cache_dir`, to be used for the whole process, including creating the `state.json` at the end.
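For illustration, the second option could look like this (hypothetical: `save_to_dir` does not exist in the current API):

```python
from datasets import Dataset, load_from_disk

def gen():
    yield {"text": "hello"}  # stand-in for the real generator

# Hypothetical parameter: write a finalized, loadable dataset directly
ds = Dataset.from_generator(gen, save_to_dir="/data/my_dataset")

# Later, from any process, without making a second copy:
ds = load_from_disk("/data/my_dataset")
```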
As a proof of concept, I created `state.json` by hand and `load_from_disk` worked using the cache! So that seems to be the missing piece here.
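A sketch of that proof of concept (the keys mirror what `save_to_disk` writes in recent versions; the exact fields may vary between versions, and the fingerprint here is just a placeholder):

```python
import json
import os

# Hypothetical path to the cache subfolder that holds the Arrow shards
cache_dir = "gen_cache/generator/default-xxxx/0.0.0"
arrow_files = sorted(f for f in os.listdir(cache_dir) if f.endswith(".arrow"))

# Approximation of the state that save_to_disk normally writes
state = {
    "_data_files": [{"filename": name} for name in arrow_files],
    "_fingerprint": "placeholder",  # save_to_disk writes a real hash here
    "_format_columns": None,
    "_format_kwargs": {},
    "_format_type": None,
    "_output_all_columns": False,
    "_split": None,
}

with open(os.path.join(cache_dir, "state.json"), "w") as f:
    json.dump(state, f, indent=2)
```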
Your contribution
Time permitting, I can look into `.from_generator` to see whether adding `state.json` is feasible.