3.3.0

@lhoestq released this 14 Feb 10:15
· 1 commit to main since this release
e9dae36

Dataset Features

  • Support async functions in map() by @lhoestq in #7384

    • Especially useful to download content like images or call inference APIs
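    # query_model is a user-supplied coroutine (not part of datasets), assumed to return a dict of new columns
    prompt = "Answer the following question: {question}. You should think step by step."

    async def ask_llm(example):
        return await query_model(prompt.format(question=example["question"]))

    ds = ds.map(ask_llm)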
  • Add repeat method to datasets by @alex-hh in #7198

    ds = ds.repeat(10)
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in #7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
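    import polars as pl
    from datasets import load_dataset

    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    # Capture the contents of boxed{...} in each solution as a new "value_solution" column
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)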
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207

    • IterableDatasets with "numpy" format are now much faster
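
The async map() snippet above relies on a `query_model` coroutine that the release notes do not define. A minimal sketch of what such a function could look like, using only the standard library: `query_model` here is a hypothetical stub that merely simulates latency, and `run_all` is an illustrative helper that is not part of datasets or a description of how map() schedules calls internally.

```python
import asyncio

PROMPT = "Answer the following question: {question}. You should think step by step."


async def query_model(text: str) -> dict:
    """Hypothetical stand-in for a real inference call; it only simulates latency."""
    await asyncio.sleep(0.01)
    return {"answer": f"(stub answer, prompt was {len(text)} characters)"}


async def ask_llm(example: dict) -> dict:
    # Return a dict so the result can be merged into the example as new columns.
    return await query_model(PROMPT.format(question=example["question"]))


async def run_all(examples: list) -> list:
    # Illustrative helper: await every call concurrently, which is where
    # async functions pay off for I/O-bound work such as API requests.
    return await asyncio.gather(*(ask_llm(ex) for ex in examples))


examples = [{"question": "What is 2 + 2?"}, {"question": "Name a prime number."}]
results = asyncio.run(run_all(examples))
print(results)
```

With datasets 3.3.0 or later, the same `ask_llm` coroutine should be usable directly as `ds.map(ask_llm)` on a dataset that has a "question" column, with the stub replaced by a real client call.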

Full Changelog: 3.2.0...3.3.0