Releases · huggingface/datasets
0.4.0
Datasets Features
- add from_pandas and from_dict
- add shard method
- add rename/remove/cast columns methods
- faster select method
- add concatenate datasets
- add support for taking samples using numpy arrays
- add export to TFRecords
- add features parameter when loading from text/json/pandas/csv or when using the map transform
- add support for nested features for json
- add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
- add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, so that passages can be queried for Open Domain QA, for example
- add indexing using FAISS or ElasticSearch:
- add add_faiss_index and add_elasticsearch_index methods
- add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
- add search and search_batch to query the index and return example ids
- add save_faiss_index/load_faiss_index to save/load a serialized faiss index
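A minimal sketch of a few of these methods, assuming the nlp module name this version shipped under (FAISS indexing additionally requires faiss to be installed); the toy columns and vectors are purely illustrative:

```python
import nlp
import numpy as np

# Build a Dataset from an in-memory dict (from_pandas works analogously)
dataset = nlp.Dataset.from_dict(
    {"text": ["good movie", "bad movie"], "embeddings": [[0.1, 0.2], [0.9, 0.8]]}
)

# Shard the dataset, or glue several datasets back together
first_half = dataset.shard(num_shards=2, index=0)
doubled = nlp.concatenate_datasets([dataset, dataset])

# Pass an explicit schema to a map transform via the new features parameter
features = nlp.Features(
    {"text": nlp.Value("string"), "embeddings": nlp.Sequence(nlp.Value("float32"))}
)
dataset = dataset.map(lambda example: example, features=features)

# Index the vector column with FAISS and retrieve the closest example
dataset.add_faiss_index(column="embeddings")
scores, examples = dataset.get_nearest_examples(
    "embeddings", np.array([0.1, 0.2], dtype=np.float32), k=1
)
```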
Datasets changes
- new: PG19
- new: ANLI
- new: WikiSQL
- new: qa_zre
- new: MWSC
- new: AG news
- new: SQuADShifts
- new: doc red
- new: Wiki DPR
- new: fever
- new: hyperpartisan news detection
- new: pandas
- new: text
- new: emotion
- new: quora
- new: BioMRC
- new: web questions
- new: search QA
- new: LinCE
- new: TREC
- new: Style Change Detection
- new: 20newsgroup
- new: social bias frames
- new: Emo
- new: web of science
- new: sogou news
- new: crd3
- update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
- update: xtreme - add PAWS-X.es
- update: xsum - manual download is no longer required.
- new processed: Natural Questions
Metrics Features
- add seed parameter for metrics that do sampling, like rouge
- better installation messages
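For instance, rouge aggregates scores with bootstrap sampling, so pinning the seed makes compute reproducible. A hedged sketch, assuming the seed keyword is forwarded to the metric through load_metric:

```python
import nlp

# Pin the sampling seed so repeated compute() calls return identical rouge scores
metric = nlp.load_metric("rouge", seed=42)
results = metric.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat sat on a mat"],
)
```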
Metrics changes
- new: bleurt
- update: seqeval - fix entities extraction
Bug fixes
- fix bug in map and select that was causing memory issues
- fix pyarrow version check
- fix text/json/pandas/csv caching when loading different files in a row
- fix metrics caching when they have different config names
- fix cache that was not discarded when there's a KeyboardInterrupt during .map
- fix sacrebleu tokenizer's parameter
- fix docstrings of metrics when multiple instances are created
More Tests
- add tests for features handling in dataset transforms
- add tests for dataset builders
- add tests for metrics loading
Backward compatibility
- Because the dataset_info.json file format changed, older versions of the library (<0.4.0) won't be able to load datasets that have a post processing field in dataset_info.json
0.3.0
New methods to transform a dataset:
- dataset.shuffle: create a shuffled dataset
- dataset.train_test_split: create a train and a test split (similar to sklearn)
- dataset.sort: create a dataset sorted according to a certain column
- dataset.select: create a dataset with rows selected following the given list of indices
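A minimal sketch of the four methods (using from_dict from the later 0.4.0 release just to get a self-contained toy dataset, and the nlp module name of the time):

```python
import nlp

dataset = nlp.Dataset.from_dict({"text": ["a", "b", "c", "d"], "label": [3, 1, 2, 0]})

shuffled = dataset.shuffle(seed=42)                # shuffled copy of the rows
splits = dataset.train_test_split(test_size=0.25)  # {"train": ..., "test": ...}
ordered = dataset.sort("label")                    # rows ordered by the label column
subset = dataset.select([0, 2])                    # keep only rows 0 and 2
```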
Other features:
- Better instructions for datasets that require manual download. Important: if you load datasets that require manual downloads with an older version of nlp, instructions won't be shown and an error will be raised
- Better access to dataset information (for instance dataset.features['label'] or dataset.dataset_size)
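A short sketch of the new information access, assuming the nlp module name (rotten_tomatoes is one of the datasets added in this release):

```python
import nlp

dataset = nlp.load_dataset("rotten_tomatoes", split="train")
print(dataset.features["label"])  # the ClassLabel schema of the label column
print(dataset.dataset_size)       # size of the full dataset in bytes
```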
Datasets:
- New: cos_e v1.0
- New: rotten_tomatoes
- New: german and italian wikipedia
New docs:
- documentation about splitting a dataset
Bug fixes:
- fix metric.compute that couldn't write to file
- fix squad_v2 imports
0.2.1
New datasets:
- ELI5
- CompGuessWhat?!
- BookCorpus
- Piaf
- Allociné
- BlendedSkillTalk
New features:
- .filter method
- option to do batching for metrics
- make datasets deterministic
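A sketch of the new .filter method, using Allociné (one of the new datasets above) and assuming its integer label column, with the nlp module name of the time:

```python
import nlp

# Keep only positive reviews; filter runs the predicate over every example
dataset = nlp.load_dataset("allocine", split="train")
positives = dataset.filter(lambda example: example["label"] == 1)
print(len(dataset), "->", len(positives))
```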
New commands:
- nlp-cli upload_dataset
- nlp-cli upload_metric
- nlp-cli s3_datasets {ls,rm}
- nlp-cli s3_metrics {ls,rm}
New datasets + Apache Beam, new metrics, bug fixes
Datasets changes
- New: germeval14
- New: wmt
- New: Ubuntu dialog corpus
- New: squad spanish
- New: Quanta
- New: arcd
- New: Natural Questions (needs to be processed using a beam pipeline)
- New: C4 (needs to be processed using a beam pipeline)
- Skip the processing: wikipedia (english and french version are now already processed)
- Skip the processing: wiki40b (english version is now already processed)
- Renamed: anli -> art
- Better instructions: xsum
- Add .filter() for arrow datasets
- Add instruction message for manual data when required
Metrics changes:
- New: BERTScore
- Allow adding examples by element or by batch to compute a metric score (see the sketch below)
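A sketch of the two accumulation styles with the sacrebleu metric, assuming the add/add_batch keyword names of the metrics API and the nlp module name:

```python
import nlp

metric = nlp.load_metric("sacrebleu")

# Accumulate element by element...
metric.add(prediction="the cat sat", reference=["the cat sat"])
# ...or a whole batch at once
metric.add_batch(predictions=["a dog ran"], references=[["a dog ran"]])

score = metric.compute()  # computed over everything added so far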
Commands:
- New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
- New: nlp-cli run_beam: to run an apache beam pipeline to process a dataset in the cloud
Bug fixes:
- Now .map returns the right values when run on different splits of the same dataset
- Fix the input format of the squad metric to fit the format of the squad dataset
- Fix download from google drive for small files
- For datasets with multiple sub-datasets, like glue or scientific_papers, force the user to pick one sub-dataset to make things less confusing
More tests
- Local tests of dataset processing scripts
- AWS tests of dataset processing scripts
- Tests for arrow dataset methods
- Tests for arrow reader methods
First release
First release of the nlp library.
Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md
Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb
This is a beta release and the API is not expected to be stabilized yet (in particular the API for the metrics).
Documentation and tests are also still sparse.