Dataset Features
- Support async functions in map() by @lhoestq in #7384
  - Especially useful to download content like images or call inference APIs

        prompt = "Answer the following question: {question}. You should think step by step."

        async def ask_llm(example):
            return await query_model(prompt.format(question=example["question"]))

        ds = ds.map(ask_llm)
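To illustrate why async functions help with I/O-bound work such as inference API calls, here is a standalone sketch that uses only asyncio. It does not use datasets and is not how map() is implemented. `fake_query_model` is a hypothetical stand-in for a real inference call, and the 0.1 s delay and the 20 examples are arbitrary choices.

```python
import asyncio
import time

PROMPT = "Answer the following question: {question}. You should think step by step."


async def fake_query_model(text: str) -> str:
    # Hypothetical stand-in for a network call to an inference API.
    await asyncio.sleep(0.1)
    return f"(answer to) {text}"


async def ask_llm(example: dict) -> dict:
    # Same shape as the release-note example, but returning a new column.
    answer = await fake_query_model(PROMPT.format(question=example["question"]))
    return {**example, "answer": answer}


async def process_all(examples: list) -> list:
    # Await all coroutines concurrently; gather keeps results in input order.
    return await asyncio.gather(*(ask_llm(ex) for ex in examples))


def run_demo():
    examples = [{"question": f"What is {i} + {i}?"} for i in range(20)]
    start = time.perf_counter()
    results = asyncio.run(process_all(examples))
    return results, time.perf_counter() - start
```

Because the simulated calls overlap while waiting, `run_demo()` should finish in well under the two seconds that twenty sequential 0.1 s calls would take.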
- Add repeat method to datasets by @alex-hh in #7198

      ds = ds.repeat(10)
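As a rough mental model, and assuming repeat(n) yields the whole dataset n times in a row (like concatenating n copies), the row order can be sketched in plain Python. `repeat_rows` is a hypothetical helper for illustration only and is not part of datasets.

```python
from itertools import chain, repeat


def repeat_rows(rows: list, num_times: int) -> list:
    # Plain-Python analogue: the full sequence of rows, num_times in a row.
    return list(chain.from_iterable(repeat(rows, num_times)))


rows = [{"text": "a"}, {"text": "b"}]
repeated = repeat_rows(rows, 3)
```

Under that assumption the rows come out as a, b, a, b, a, b rather than a, a, a, b, b, b.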
- Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in #7370
  - Add support for "pandas" and "polars" formats in IterableDatasets
  - This enables optimized data processing using pandas or polars functions with zero-copy, e.g.

        import polars as pl
        from datasets import load_dataset

        ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
        ds = ds.with_format("polars")
        expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
        ds = ds.map(lambda df: df.with_columns(expr), batched=True)
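To make clear what the pattern in the example above captures, here is a small sketch using Python's built-in re module instead of polars. Polars uses a different regex engine, so treat this as an approximation that should agree for a simple pattern like this one. `extract_boxed` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the release-note example, written as a raw string.
BOXED = re.compile(r"boxed\{(.*)\}")


def extract_boxed(solution: str):
    """Return the text captured inside boxed{...}, or None if absent."""
    match = BOXED.search(solution)
    return match.group(1) if match else None


simple = extract_boxed(r"The answer is \boxed{42}.")
nested = extract_boxed(r"So x = \boxed{\frac{1}{2}} as required.")
missing = extract_boxed("No final answer here.")
```

Note that `.*` is greedy, so the capture runs to the last closing brace on the line. That keeps nested braces such as `\frac{1}{2}` intact, but it would also swallow any text between two separate boxed expressions on the same line.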
- Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207
  - IterableDatasets with "numpy" format are now much faster
What's Changed
- don't import soundfile in tests by @lhoestq in #7340
- minor video docs on how to install by @lhoestq in #7341
- Fix typo in arrow_dataset by @AndreaFrancis in #7328
- remove filecheck to enable symlinks by @fschlatt in #7133
- Webdataset special columns in last position by @lhoestq in #7349
- Bump hfh to 0.24 to fix ci by @lhoestq in #7350
- fsspec 2024.12.0 by @lhoestq in #7352
- changes to MappedExamplesIterable to resolve #7345 by @vttrifonov in #7353
- Catch OSError for arrow by @lhoestq in #7348
- Remove .h5 from imagefolder extensions by @lhoestq in #7374
- Add Pandas, PyArrow and Polars docs by @lhoestq in #7382
- Optimized sequence encoding for scalars by @lukasgd in #7393
- Update docs by @lhoestq in #7395
- Update README.md by @lhoestq in #7396
- Release: 3.3.0 by @lhoestq in #7398
New Contributors
- @AndreaFrancis made their first contribution in #7328
- @vttrifonov made their first contribution in #7353
- @lukasgd made their first contribution in #7393
Full Changelog: 3.2.0...3.3.0