Iterating over values of a column in the IterableDataset #7381

Open
TopCoder2K opened this issue Jan 28, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@TopCoder2K
TopCoder2K commented Jan 28, 2025

Feature request

I would like to be able to iterate (and re-iterate if needed) over a column of an IterableDataset instance. The following example shows the proposed API:

from datasets import IterableDataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]  # proposed API: a re-iterable view over the "text" column

for v in texts:
    print(v)  # Prints "Good" and "Bad"

for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Motivation

In real-world problems, huge neural networks such as Transformers are not always the best option, so there is a need to experiment with different methods. While 🤗Datasets is perfectly adapted to 🤗Transformers, it can be inconvenient to use with other libraries. Retrieving a particular column is one such case (e.g., gensim's FastText requires only lists of strings, not dictionaries).
There are ways to achieve the desired functionality, but they are awkward (forum). It would be great if there were a built-in solution.
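
For context, the kind of workaround in question can be sketched as a small wrapper class. This is only an illustration under stated assumptions, not part of 🤗Datasets: `ColumnView` and `Examples` are hypothetical names, and the sketch assumes the wrapped dataset restarts from the beginning every time it is iterated.

```python
from typing import Any, Dict, Iterable, Iterator


class ColumnView:
    """Hypothetical re-iterable view over one column of dict examples."""

    def __init__(self, dataset: Iterable[Dict[str, Any]], column: str) -> None:
        self.dataset = dataset
        self.column = column

    def __iter__(self) -> Iterator[Any]:
        # A fresh pass over the wrapped dataset is started on every call,
        # which is what allows the view to be iterated more than once.
        for example in self.dataset:
            yield example[self.column]


class Examples:
    """Hypothetical stand-in for a streaming dataset that restarts on each iter()."""

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        yield {"text": "Good", "label": 0}
        yield {"text": "Bad", "label": 1}


texts = ColumnView(Examples(), "text")
first_pass = list(texts)   # values from the first iteration
second_pass = list(texts)  # the same values again on re-iteration
print(first_pass)
print(second_pass)
```

Every user who needs a single column has to write and maintain a wrapper like this, which is the inconvenience a built-in `ds["text"]` would remove.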

Your contribution

Theoretically, I can submit a PR, but I have very little knowledge of the internal structure of 🤗Datasets, so some help may be needed.
Moreover, I can only work on weekends, since I have a full-time job. However, the feature does not seem to be popular, so there is no need to implement it as fast as possible.

@TopCoder2K TopCoder2K added the enhancement New feature or request label Jan 28, 2025
@lhoestq
Member

lhoestq commented Feb 3, 2025

I'd be in favor of that! I've seen many people implement their own iterables that wrap a dataset just to iterate over a single column, so this would make things more practical.

Kinda related: #5847
