
Releases: huggingface/datasets

0.4.0

11 Aug 09:20

Datasets Features

  • add from_pandas and from_dict
  • add shard method
  • add rename/remove/cast columns methods
  • faster select method
  • add concatenate datasets
  • add support for taking samples using numpy arrays
  • add export to TFRecords
  • add features parameter when loading from text/json/pandas/csv or when using the map transform
  • add support for nested features for json
  • add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
  • add support for post-processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset so that, for example, passages can be queried for Open-Domain QA
  • add indexing using FAISS or ElasticSearch:
    • add add_faiss_index and add_elasticsearch_index methods
    • add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
    • add search and search_batch to query the index and return examples ids
    • add save_faiss_index/load_faiss_index to save/load a serialized faiss index
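
Conceptually, get_nearest_examples scores a query embedding against every stored embedding and returns the closest rows. A minimal NumPy-only sketch of that idea, with illustrative names (this is not the library's implementation, which delegates the actual search to FAISS or ElasticSearch):

```python
import numpy as np

def get_nearest(embeddings, query, k=3):
    """Return indices and scores of the k most similar rows,
    using inner-product similarity (what a flat FAISS index computes)."""
    scores = embeddings @ query       # one similarity score per stored vector
    top = np.argsort(-scores)[:k]     # highest scores first
    return top.tolist(), scores[top].tolist()

# toy "index" of four 2-d embedding vectors
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.7, 0.7],
                [-1.0, 0.0]])
indices, scores = get_nearest(emb, np.array([1.0, 0.2]), k=2)  # rows 0 and 2 win
```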

Datasets changes

  • new: PG19
  • new: ANLI
  • new: WikiSQL
  • new: qa_zre
  • new: MWSC
  • new: AG news
  • new: SQuADShifts
  • new: doc red
  • new: Wiki DPR
  • new: fever
  • new: hyperpartisan news detection
  • new: pandas
  • new: text
  • new: emotion
  • new: quora
  • new: BioMRC
  • new: web questions
  • new: search QA
  • new: LinCE
  • new: TREC
  • new: Style Change Detection
  • new: 20newsgroup
  • new: social bias frames
  • new: Emo
  • new: web of science
  • new: sogou news
  • new: crd3
  • update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
  • update: xtreme - add PAWS-X.es
  • update: xsum - manual download is no longer required.
  • new processed: Natural Questions

Metrics Features

  • add seed parameter for metrics that do sampling, like rouge
  • better installation messages
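
The seed parameter matters because rouge-style scoring draws random samples; fixing the seed makes the resampling, and therefore the reported score, reproducible. A library-free sketch of seeded bootstrap estimation (an illustration of the idea, not the metric's actual code):

```python
import random

def bootstrap_mean(values, n_resamples=1000, seed=42):
    """Estimate a mean by resampling with replacement; a fixed seed
    makes the sampling (and thus the estimate) deterministic."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)

a = bootstrap_mean([0.2, 0.4, 0.6, 0.8], seed=7)
b = bootstrap_mean([0.2, 0.4, 0.6, 0.8], seed=7)  # same seed, identical result
```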

Metrics changes

  • new: bleurt
  • update seqeval: fix entity extraction (more info here)

Bug fixes

  • fix bug in map and select that was causing memory issues
  • fix pyarrow version check
  • fix text/json/pandas/csv caching when loading different files in a row
  • fix metrics caching when they have different config names
  • fix cache that was not discarded when there's a KeyboardInterrupt during .map
  • fix sacrebleu tokenizer's parameter
  • fix docstrings of metrics when multiple instances are created

More Tests

  • add tests for features handling in dataset transforms
  • add tests for dataset builders
  • add tests for metrics loading

Backward compatibility

  • because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json

0.3.0

19 Jun 09:36

New methods to transform a dataset:

  • dataset.shuffle: create a shuffled dataset
  • dataset.train_test_split: create a train and a test split (similar to sklearn)
  • dataset.sort: create a dataset sorted according to a certain column
  • dataset.select: create a dataset with rows selected following the given list of indices
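
The shuffle-then-split behavior of dataset.train_test_split can be sketched in a few lines (illustrative only; the real method also handles caching and typed Arrow tables):

```python
import random

def train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test split."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_size)
    return {"test": [rows[i] for i in indices[:n_test]],
            "train": [rows[i] for i in indices[n_test:]]}

splits = train_test_split(list(range(8)), test_size=0.25, seed=123)
```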

Other features:

  • Better instructions for datasets that require manual download

    Important: if you load datasets that require manual downloads with an older version of nlp, instructions won't be shown and an error will be raised

  • Better access to dataset information (for instance dataset.features['label'] or dataset.dataset_size)

Datasets:

  • New: cos_e v1.0
  • New: rotten_tomatoes
  • New: German and Italian Wikipedia

New docs:

  • documentation about splitting a dataset

Bug fixes:

  • fix metric.compute that couldn't write to file
  • fix squad_v2 imports

0.2.1

12 Jun 16:27

New datasets:

  • ELI5
  • CompGuessWhat?!
  • BookCorpus
  • Piaf
  • Allociné
  • BlendedSkillTalk

New features:

  • .filter method
  • option to do batching for metrics
  • make datasets deterministic
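
Conceptually, the .filter method keeps the rows for which a predicate returns True, preserving order. A library-free sketch with hypothetical names:

```python
def filter_rows(rows, predicate):
    """Keep only rows for which predicate(row) is truthy,
    preserving the original row order."""
    return [row for row in rows if predicate(row)]

rows = [{"text": "good",  "label": 1},
        {"text": "bad",   "label": 0},
        {"text": "great", "label": 1}]
positives = filter_rows(rows, lambda r: r["label"] == 1)  # keeps 2 of 3 rows
```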

New commands:

  • nlp-cli upload_dataset
  • nlp-cli upload_metric
  • nlp-cli s3_datasets {ls,rm}
  • nlp-cli s3_metrics {ls,rm}

New datasets + Apache Beam, new metrics, bug fixes

29 May 15:43

Datasets changes

  • New: germeval14
  • New: wmt
  • New: Ubuntu dialog corpus
  • New: squad spanish
  • New: Qanta
  • New: arcd
  • New: Natural Questions (needs to be processed using a beam pipeline)
  • New: C4 (needs to be processed using a beam pipeline)
  • Skip the processing: wikipedia (English and French versions are now already processed)
  • Skip the processing: wiki40b (English version is now already processed)
  • Renamed: anli -> art
  • Better instructions: xsum
  • Add .filter() for arrow datasets
  • Add instruction message for manual data when required

Metrics changes:

  • New: BERTScore
  • Allow adding examples element by element or by batch to compute a metric score
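
Adding examples by element or by batch boils down to accumulating predictions and references before a final compute step. A minimal sketch of that pattern with a toy accuracy metric (illustrative only; not the library's Metric class):

```python
class MetricAccumulator:
    """Collect predictions/references one example or one batch at a time,
    then compute a score over everything collected (here: accuracy)."""
    def __init__(self):
        self.preds, self.refs = [], []

    def add(self, pred, ref):              # element-wise
        self.preds.append(pred)
        self.refs.append(ref)

    def add_batch(self, preds, refs):      # batch-wise
        self.preds.extend(preds)
        self.refs.extend(refs)

    def compute(self):
        correct = sum(p == r for p, r in zip(self.preds, self.refs))
        return correct / len(self.preds)

m = MetricAccumulator()
m.add(1, 1)                  # one example
m.add_batch([0, 1], [1, 1])  # a batch of two
score = m.compute()          # 2 of 3 correct
```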

Commands:

  • New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
  • New: nlp-cli run_beam: to run an Apache Beam pipeline to process a dataset in the cloud

Bug fixes:

  • .map now returns the right values when run on different splits of the same dataset
  • Fix the input format of the squad metric to match the format of the squad dataset
  • Fix download from google drive for small files
  • For datasets like glue or scientific papers, require the user to pick one sub-dataset, to make things less confusing

More tests

  • Local tests of dataset processing scripts
  • AWS tests of dataset processing scripts
  • Tests for arrow dataset methods
  • Tests for arrow reader methods

First release

15 May 11:48

First release of the nlp library.

Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md

Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb

This is a beta release and the API is not expected to be stabilized yet (in particular the API for the metrics).

Documentation and tests are also still sparse.