better correspondence between cached and saved datasets created using from_generator #7420

@vttrifonov

Description

Feature request

At the moment .from_generator can only create a dataset that lives in the cache. The cached dataset cannot be loaded with load_from_disk because the cache folder is missing state.json, so the only way to convert this cached dataset into a regular Dataset is save_to_disk, which has to create a copy of what is already in the cache. For large datasets this can waste a lot of space. In my case the saving operation failed, so I am stuck with a large cached dataset and no clear way to turn it into a Dataset that I can use. The requested feature is a way to load a cached dataset with .load_from_disk. Alternatively, .from_generator could create the dataset at a specified location so that it can be loaded from there with .load_from_disk.
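To make the limitation concrete, here is a minimal sketch (the generator and the cache path are illustrative; the exact subfolder layout inside the cache is internal to the library):

```python
from datasets import Dataset, load_from_disk

def gen():
    for i in range(10):
        yield {"x": i}

# from_generator writes arrow shards into the cache folder.
ds = Dataset.from_generator(gen, cache_dir="./hf_cache")

# The cache folder holds the .arrow shards (and dataset_info.json) but no
# state.json, so load_from_disk rejects it, complaining that the directory
# is not a Dataset.
load_from_disk("./hf_cache/generator/...")  # illustrative path; fails today
```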

Motivation

I have the following workflow, which has exposed some awkwardness in Datasets' saving/caching.

  1. I created a dataset using .from_generator, which cached it in a folder. The dataset is rather large (~600GB) with many shards.
  2. I tried to save this dataset with .save_to_disk to another location so that I can use it later as a Dataset. This essentially creates another copy (for a total of 1.2TB!) of what is already in the cache (see the sketch after this list)... In my case the saving operation keeps dying for some reason, so I am stuck with a cached dataset and no copy.
  3. Now I am trying to "save" the existing cached dataset, but it is not clear how to access the cached files after .from_generator has finished, e.g. from a different process. I should not even be looking at the cache, but I really do not want to waste another 2 hours regenerating the set only for it to fail again (I have already done this a couple of times).
  • I tried .load_from_disk but it does not work with the cached files and complains that the folder is not a Dataset (!).
  • I looked at .from_file, but it takes a single file and the cache has many (shards), so I am not sure how to make this work.
  • I tried .load_dataset but this seems to either "download" a copy (of files which are already on the local file system!) that I will then need to save, or, with streaming=True, produce an IterableDataset which I then need to convert (through the cache) to a Dataset so that I can save it. With both options I will end up with 3 copies of the same dataset, for a total of ~2TB! I am hoping there is another way to do this...
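For reference, steps 1-2 above correspond to something like the following (make_examples stands in for my actual generator; the paths are placeholders):

```python
from datasets import Dataset

def make_examples():  # stand-in for the real generator
    yield {"text": "..."}

# Step 1: generate once; ~600GB of arrow shards land under cache_dir.
ds = Dataset.from_generator(make_examples, cache_dir="/big_disk/hf_cache")

# Step 2: save_to_disk rewrites every shard to a second location, doubling
# the on-disk footprint to ~1.2TB before the cache can be cleaned up.
ds.save_to_disk("/big_disk/my_dataset")
```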

Maybe I am missing something here: I looked at the docs and forums but had no luck. I have a bunch of arrow files cached by Dataset.from_generator and no clean way to turn them into a Dataset that I can use.

This all could be so much easier if load_from_disk could recognize the cached files and produce a Dataset: after the cache is created I would not have to "save" it again, and I could just load it when I need it. At the moment load_from_disk needs state.json, which is missing from the cache folder. So perhaps .from_generator could be made to "finalize" the dataset (e.g. create state.json) once it is done, so that it can be loaded easily. Or provide .from_generator with a save_to_dir parameter, in addition to cache_dir, which is used for the whole process, including creating the state.json at the end.
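From the caller's side, the proposal could look like this (save_to_dir is the suggested parameter, not something that exists today):

```python
from datasets import Dataset, load_from_disk

def make_examples():  # stand-in for the real generator
    yield {"text": "..."}

# Proposed: write the shards plus state.json straight to save_to_dir, so the
# result is immediately loadable and no second copy is ever made.
ds = Dataset.from_generator(make_examples, save_to_dir="/big_disk/my_dataset")

# Later, possibly from a different process:
ds = load_from_disk("/big_disk/my_dataset")
```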

As a proof of concept I created state.json by hand, and load_from_disk worked on the cache folder! So that file seems to be the only missing piece here.
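The proof of concept was roughly the following (a sketch, assuming the cache folder already holds the arrow shards and a dataset_info.json; the cache path and the fingerprint are placeholders, and the keys mirror the state.json that save_to_disk writes):

```python
import json
from pathlib import Path

from datasets import load_from_disk

cache_dir = Path("/big_disk/hf_cache/generator/default-.../0.0.0")  # placeholder path

# Collect the arrow shards that .from_generator left behind, in order.
shards = sorted(p.name for p in cache_dir.glob("*.arrow"))

# Hand-write the state.json that load_from_disk expects; the field set
# mirrors what Dataset.save_to_disk produces.
state = {
    "_data_files": [{"filename": name} for name in shards],
    "_fingerprint": "0000000000000000",  # placeholder; save_to_disk writes the real fingerprint
    "_format_columns": None,
    "_format_kwargs": {},
    "_format_type": None,
    "_output_all_columns": False,
    "_split": "train",
}
(cache_dir / "state.json").write_text(json.dumps(state, indent=2))

# Now the cached shards load in place, with no extra copy.
ds = load_from_disk(str(cache_dir))
```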

Your contribution

Time permitting, I can look into .from_generator to see if adding state.json is feasible.
