better correspondence between cached and saved datasets created using from_generator #7420

@vttrifonov

Description

Feature request

At the moment .from_generator can only create a dataset that lives in the cache. The cached dataset cannot be loaded with load_from_disk because the cache folder is missing state.json, so the only way to convert this cached dataset into a regular Dataset is save_to_disk, which has to create a copy of what is already in the cache. For large datasets this can waste a lot of space. In my case the saving operation failed, so I am stuck with a large cached dataset and no clear way to turn it into a Dataset that I can use. The requested feature is a way to load a cached dataset with .load_from_disk. Alternatively, .from_generator could create the dataset at a specified location so that it can be loaded from there with .load_from_disk.
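To make the limitation concrete, here is a minimal sketch (the generator and the cache path are illustrative; the exact subfolder layout inside the cache is internal to the library):

```python
from datasets import Dataset, load_from_disk

def gen():
    for i in range(10):
        yield {"x": i}

# from_generator writes arrow shards into the cache folder.
ds = Dataset.from_generator(gen, cache_dir="./hf_cache")

# The cache folder holds the .arrow shards (and dataset_info.json) but no
# state.json, so load_from_disk rejects it, complaining that the directory
# is not a Dataset.
load_from_disk("./hf_cache/generator/...")  # illustrative path; fails today
```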

Motivation

I have the following workflow, which has exposed some awkwardness in Datasets' saving/caching.

  1. I created a dataset using .from_generator, which cached it in a folder. The dataset is rather large (~600GB) with many shards.
  2. I tried to save this dataset with .save_to_disk to another location so that I can use it later as a Dataset. This essentially creates another copy (for a total of 1.2TB!) of what is already in the cache (see the sketch after this list)... In my case the saving operation keeps dying for some reason, so I am stuck with a cached dataset and no copy.
  3. Now I am trying to "save" the existing cached dataset, but it is not clear how to access the cached files after .from_generator has finished, e.g. from a different process. I should not even be looking at the cache, but I really do not want to waste another 2 hours regenerating the set only for it to fail again (I have already done this a couple of times).
  • I tried .load_from_disk but it does not work with the cached files and complains that the folder is not a Dataset (!).
  • I looked at .from_file, but it takes a single file and the cache has many (shards), so I am not sure how to make this work.
  • I tried .load_dataset but this seems to either "download" a copy (of files which are already on the local file system!) that I will then need to save, or, with streaming=True, produce an IterableDataset which I then need to convert (through the cache) to a Dataset so that I can save it. With both options I will end up with 3 copies of the same dataset, for a total of ~2TB! I am hoping there is another way to do this...
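For reference, steps 1-2 above correspond to something like the following (make_examples stands in for my actual generator; the paths are placeholders):

```python
from datasets import Dataset

def make_examples():  # stand-in for the real generator
    yield {"text": "..."}

# Step 1: generate once; ~600GB of arrow shards land under cache_dir.
ds = Dataset.from_generator(make_examples, cache_dir="/big_disk/hf_cache")

# Step 2: save_to_disk rewrites every shard to a second location, doubling
# the on-disk footprint to ~1.2TB before the cache can be cleaned up.
ds.save_to_disk("/big_disk/my_dataset")
```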

Maybe I am missing something here: I looked at the docs and forums but had no luck. I have a bunch of arrow files cached by Dataset.from_generator and no clean way to turn them into a Dataset that I can use.

This all could be so much easier if load_from_disk could recognize the cached files and produce a Dataset: after the cache is created I would not have to "save" it again, and I could just load it when I need it. At the moment load_from_disk needs state.json, which is missing from the cache folder. So perhaps .from_generator could be made to "finalize" the dataset (e.g. create state.json) once it is done, so that it can be loaded easily. Or provide .from_generator with a save_to_dir parameter, in addition to cache_dir, which is used for the whole process, including creating the state.json at the end.
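From the caller's side, the proposal could look like this (save_to_dir is the suggested parameter, not something that exists today):

```python
from datasets import Dataset, load_from_disk

def make_examples():  # stand-in for the real generator
    yield {"text": "..."}

# Proposed: write the shards plus state.json straight to save_to_dir, so the
# result is immediately loadable and no second copy is ever made.
ds = Dataset.from_generator(make_examples, save_to_dir="/big_disk/my_dataset")

# Later, possibly from a different process:
ds = load_from_disk("/big_disk/my_dataset")
```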

As a proof of concept I created state.json by hand, and load_from_disk worked on the cache folder! So that file seems to be the only missing piece here.
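The proof of concept was roughly the following (a sketch, assuming the cache folder already holds the arrow shards and a dataset_info.json; the cache path and the fingerprint are placeholders, and the keys mirror the state.json that save_to_disk writes):

```python
import json
from pathlib import Path

from datasets import load_from_disk

cache_dir = Path("/big_disk/hf_cache/generator/default-.../0.0.0")  # placeholder path

# Collect the arrow shards that .from_generator left behind, in order.
shards = sorted(p.name for p in cache_dir.glob("*.arrow"))

# Hand-write the state.json that load_from_disk expects; the field set
# mirrors what Dataset.save_to_disk produces.
state = {
    "_data_files": [{"filename": name} for name in shards],
    "_fingerprint": "0000000000000000",  # placeholder; save_to_disk writes the real fingerprint
    "_format_columns": None,
    "_format_kwargs": {},
    "_format_type": None,
    "_output_all_columns": False,
    "_split": "train",
}
(cache_dir / "state.json").write_text(json.dumps(state, indent=2))

# Now the cached shards load in place, with no extra copy.
ds = load_from_disk(str(cache_dir))
```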

Your contribution

Time permitting, I can look into .from_generator to see if adding state.json is feasible.
