Allow passing data_dir to load_dataset for HuggingfaceDatasetLoader to enable more datasets from HF #2811

Open
Pablo1785 opened this issue Feb 14, 2025 · 1 comment
Labels
dataset Related to `burn-dataset` enhancement Enhance existing features

Comments

@Pablo1785
Contributor

Pablo1785 commented Feb 14, 2025

Feature description

Add a .with_huggingface_data_dir() method to HuggingfaceDatasetLoader so that data_dir can be set for datasets that require it.

Feature motivation

When trying to download "facebook/covost2" using HuggingfaceDatasetLoader, I received this error message:

raise ManualDownloadError(
datasets.exceptions.ManualDownloadError: The dataset covost2 with config fr_en requires manual data.
Please follow the manual download instructions:
 Please download the Common Voice Corpus 4 in fr from https://commonvoice.mozilla.org/en/datasets and unpack it with `tar xvzf fr.tar`. Make sure to pass the path to the directory in which you unpacked the downloaded file as `data_dir`: `datasets.load_dataset('covost2', data_dir="path/to/dir")`

Manual data can be loaded with:
 datasets.load_dataset("facebook/covost2", data_dir="<path/to/manual/data>")

Unfortunately, there currently seems to be no way to pass data_dir through to load_dataset. I imagine there are more datasets that require a similar manual step.

(Optional) Suggest a Solution

Add a .with_huggingface_data_dir() method on HuggingfaceDatasetLoader, plus Python code that forwards the value to load_dataset, analogous to the cache_dir handling.
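
As a rough illustration of the Python side of this suggestion, here is a minimal sketch of how an optional data_dir could be forwarded to load_dataset alongside cache_dir. The script layout, the flag names (--name, --subset, --cache-dir, --data-dir) and the helper functions are hypothetical and are not taken from the actual burn-dataset importer; only the data_dir and cache_dir keyword arguments of datasets.load_dataset are assumed to exist as documented.

```python
import argparse


def parse_args(argv=None):
    """Parse importer-style arguments. All flag names here are hypothetical."""
    parser = argparse.ArgumentParser(
        description="Sketch: forward an optional data_dir to load_dataset"
    )
    parser.add_argument("--name", required=True, help="dataset name, e.g. facebook/covost2")
    parser.add_argument("--subset", default=None, help="dataset config, e.g. fr_en")
    parser.add_argument("--cache-dir", default=None, help="optional Hugging Face cache directory")
    parser.add_argument(
        "--data-dir",
        default=None,
        help="optional directory holding manually downloaded data",
    )
    return parser.parse_args(argv)


def build_load_kwargs(args):
    """Collect keyword arguments for load_dataset, skipping options that were not set."""
    kwargs = {}
    if args.cache_dir is not None:
        kwargs["cache_dir"] = args.cache_dir
    if args.data_dir is not None:
        kwargs["data_dir"] = args.data_dir
    return kwargs


def load(args):
    """Call load_dataset with the dataset name, config and any optional directories."""
    # Imported lazily so the argument handling can be exercised
    # without the `datasets` package installed.
    from datasets import load_dataset

    return load_dataset(args.name, args.subset, **build_load_kwargs(args))
```

With this shape, a loader that never sets the data directory produces the same load_dataset call as before, so the new option would stay opt-in.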

EDIT: After some more digging, manual download seems to be a thing in a bunch of HF datasets, so this could cover a lot of ground.

@laggui laggui added enhancement Enhance existing features dataset Related to `burn-dataset` labels Feb 14, 2025
@laggui
Member

laggui commented Feb 14, 2025

I'm not super familiar with all mechanisms for HF datasets, but the suggestion sounds reasonable. Adds a bit more flexibility 🙂
