Releases · huggingface/datasets
0.4.0
Datasets Features
- add from_pandas and from_dict
- add shard method
- add rename/remove/cast columns methods
- faster select method
- add concatenate datasets
- add support for taking samples using numpy arrays
- add export to TFRecords
- add features parameter when loading from text/json/pandas/csv or when using the map transform
- add support for nested features for json
- add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
- add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, so that passages can be queried for Open Domain QA, for example
- add indexing using FAISS or ElasticSearch:
- add add_faiss_index and add_elasticsearch_index methods
- add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
- add search and search_batch to query the index and return example ids
- add save_faiss_index/load_faiss_index to save/load a serialized faiss index
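A minimal sketch of a few of these methods, assuming the nlp module name this version shipped under (FAISS indexing additionally requires faiss to be installed); the toy columns and vectors are purely illustrative:

```python
import nlp
import numpy as np

# Build a Dataset from an in-memory dict (from_pandas works analogously)
dataset = nlp.Dataset.from_dict(
    {"text": ["good movie", "bad movie"], "embeddings": [[0.1, 0.2], [0.9, 0.8]]}
)

# Shard the dataset, or glue several datasets back together
first_half = dataset.shard(num_shards=2, index=0)
doubled = nlp.concatenate_datasets([dataset, dataset])

# Pass an explicit schema to a map transform via the new features parameter
features = nlp.Features(
    {"text": nlp.Value("string"), "embeddings": nlp.Sequence(nlp.Value("float32"))}
)
dataset = dataset.map(lambda example: example, features=features)

# Index the vector column with FAISS and retrieve the closest example
dataset.add_faiss_index(column="embeddings")
scores, examples = dataset.get_nearest_examples(
    "embeddings", np.array([0.1, 0.2], dtype=np.float32), k=1
)
```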
Datasets changes
- new: PG19
- new: ANLI
- new: WikiSQL
- new: qa_zre
- new: MWSC
- new: AG news
- new: SQuADShifts
- new: doc red
- new: Wiki DPR
- new: fever
- new: hyperpartisan news detection
- new: pandas
- new: text
- new: emotion
- new: quora
- new: BioMRC
- new: web questions
- new: search QA
- new: LinCE
- new: TREC
- new: Style Change Detection
- new: 20newsgroup
- new: social bias frames
- new: Emo
- new: web of science
- new: sogou news
- new: crd3
- update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
- update: xtreme - add PAWS-X.es
- update: xsum - manual download is no longer required.
- new processed: Natural Questions
Metrics Features
- add seed parameter for metrics that do sampling, like rouge
- better installation messages
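For instance, rouge aggregates scores with bootstrap sampling, so pinning the seed makes compute reproducible. A hedged sketch, assuming the seed keyword is forwarded to the metric through load_metric:

```python
import nlp

# Pin the sampling seed so repeated compute() calls return identical rouge scores
metric = nlp.load_metric("rouge", seed=42)
results = metric.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat sat on a mat"],
)
```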
Metrics changes
- new: bleurt
- update: seqeval - fix entities extraction
Bug fixes
- fix bug in map and select that was causing memory issues
- fix pyarrow version check
- fix text/json/pandas/csv caching when loading different files in a row
- fix metrics caching when they have different config names
- fix cache that was not discarded when there's a KeyboardInterrupt during .map
- fix sacrebleu tokenizer's parameter
- fix docstrings of metrics when multiple instances are created
More Tests
- add tests for features handling in dataset transforms
- add tests for dataset builders
- add tests for metrics loading
Backward compatibility
- Because the dataset_info.json file format changed, older versions of the library (<0.4.0) won't be able to load datasets that have a post processing field in dataset_info.json
0.3.0
New methods to transform a dataset:
- dataset.shuffle: create a shuffled dataset
- dataset.train_test_split: create a train and a test split (similar to sklearn)
- dataset.sort: create a dataset sorted according to a certain column
- dataset.select: create a dataset with rows selected following the given list of indices
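A minimal sketch of the four methods (using from_dict from the later 0.4.0 release just to get a self-contained toy dataset, and the nlp module name of the time):

```python
import nlp

dataset = nlp.Dataset.from_dict({"text": ["a", "b", "c", "d"], "label": [3, 1, 2, 0]})

shuffled = dataset.shuffle(seed=42)                # shuffled copy of the rows
splits = dataset.train_test_split(test_size=0.25)  # {"train": ..., "test": ...}
ordered = dataset.sort("label")                    # rows ordered by the label column
subset = dataset.select([0, 2])                    # keep only rows 0 and 2
```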
Other features:
- Better instructions for datasets that require manual download. Important: if you load datasets that require manual downloads with an older version of nlp, instructions won't be shown and an error will be raised
- Better access to dataset information (for instance dataset.features['label'] or dataset.dataset_size)
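A short sketch of the new information access, assuming the nlp module name (rotten_tomatoes is one of the datasets added in this release):

```python
import nlp

dataset = nlp.load_dataset("rotten_tomatoes", split="train")
print(dataset.features["label"])  # the ClassLabel schema of the label column
print(dataset.dataset_size)       # size of the full dataset in bytes
```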
Datasets:
- New: cos_e v1.0
- New: rotten_tomatoes
- New: german and italian wikipedia
New docs:
- documentation about splitting a dataset
Bug fixes:
- fix metric.compute that couldn't write to file
- fix squad_v2 imports
0.2.1
New datasets:
- ELI5
- CompGuessWhat?!
- BookCorpus
- Piaf
- Allociné
- BlendedSkillTalk
New features:
- .filter method
- option to do batching for metrics
- make datasets deterministic
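A sketch of the new .filter method, using Allociné (one of the new datasets above) and assuming its integer label column, with the nlp module name of the time:

```python
import nlp

# Keep only positive reviews; filter runs the predicate over every example
dataset = nlp.load_dataset("allocine", split="train")
positives = dataset.filter(lambda example: example["label"] == 1)
print(len(dataset), "->", len(positives))
```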
New commands:
- nlp-cli upload_dataset
- nlp-cli upload_metric
- nlp-cli s3_datasets {ls,rm}
- nlp-cli s3_metrics {ls,rm}
New datasets + Apache Beam, new metrics, bug fixes
Datasets changes
- New: germeval14
- New: wmt
- New: Ubuntu dialog corpus
- New: squad spanish
- New: Quanta
- New: arcd
- New: Natural Questions (needs to be processed using a beam pipeline)
- New: C4 (needs to be processed using a beam pipeline)
- Skip the processing: wikipedia (english and french version are now already processed)
- Skip the processing: wiki40b (english version is now already processed)
- Renamed: anli -> art
- Better instructions: xsum
- Add .filter() for arrow datasets
- Add instruction message for manual data when required
Metrics changes:
- New: BERTScore
- Allow adding examples by element or by batch to compute a metric score (see the sketch below)
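A sketch of the two accumulation styles with the sacrebleu metric, assuming the add/add_batch keyword names of the metrics API and the nlp module name:

```python
import nlp

metric = nlp.load_metric("sacrebleu")

# Accumulate element by element...
metric.add(prediction="the cat sat", reference=["the cat sat"])
# ...or a whole batch at once
metric.add_batch(predictions=["a dog ran"], references=[["a dog ran"]])

score = metric.compute()  # computed over everything added so far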
Commands:
- New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
- New: nlp-cli run_beam: to run an apache beam pipeline to process a dataset in the cloud
Bug fixes:
- Now .map returns the right values when run on different splits of the same dataset
- Fix the input format of the squad metric to fit the format of the squad dataset
- Fix download from google drive for small files
- For datasets with multiple sub-datasets, like glue or scientific_papers, force the user to pick one sub-dataset to make things less confusing
More tests
- Local tests of dataset processing scripts
- AWS tests of dataset processing scripts
- Tests for arrow dataset methods
- Tests for arrow reader methods
First release
First release of the nlp library.
Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md
Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb
This is a beta release and the API is not expected to be stabilized yet (in particular the API for the metrics).
Documentation and tests are also still sparse.