Releases · tensorflow/datasets

27 Feb 11:46

github-actions

v4.8.3

3e96a32

Added

Changed

Deprecated

Python 3.7 support: this version and future versions use Python 3.8.

Removed

Fixed

Flag ignore_verifications from Hugging Face's datasets.load_dataset is
deprecated, and used to cause errors in tfds.load('huggingface:foo').

Security

17 Jan 20:41

github-actions

v4.8.2

21f7b20

Deprecated

Python 3.7 support: this is the last version of TFDS supporting Python 3.7.
Future versions will use Python 3.8.

Fixed

tfds new and tfds build better support the new recommended datasets
organization, where individual datasets have their own package under
datasets/, builder class is called Builder and is defined within module
${dsname}_dataset_builder.py.
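
As an illustration of the layout described above, the sketch below computes where a dataset's builder module would be expected to live. The helper name `builder_module_path` and the dataset name `my_dataset` are hypothetical; only the `datasets/` directory, the `Builder` class name, and the `${dsname}_dataset_builder.py` pattern come from the note. It does not call into TFDS.

```python
from pathlib import PurePosixPath

# Class name that, per the note, each builder module is expected to define.
BUILDER_CLASS_NAME = "Builder"


def builder_module_path(dsname: str, root: str = "datasets") -> PurePosixPath:
    """Return the expected builder module path for a dataset package.

    Hypothetical helper: it only encodes the naming convention from the
    release note (datasets/<dsname>/<dsname>_dataset_builder.py).
    """
    return PurePosixPath(root) / dsname / f"{dsname}_dataset_builder.py"


print(builder_module_path("my_dataset"))
# datasets/my_dataset/my_dataset_dataset_builder.py
```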

Security

02 Jan 18:30

github-actions

v4.8.1

ce201c6

Changed

Added file valid_tags.txt to not break builds.
TFDS no longer relies on TensorFlow DTypes. We chose NumPy DTypes to keep the
typing expressiveness, while dropping the heavy dependency on TensorFlow. We
migrated all our internal datasets. Please migrate accordingly:
- tf.bool: np.bool_
- tf.string: np.str_
- tf.int64, tf.int32, etc: np.int64, np.int32, etc
- tf.float64, tf.float32, etc: np.float64, np.float32, etc
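
The mapping above can be applied mechanically to existing dataset code. The following is a minimal sketch of such a rewrite using regular expressions; the helper `migrate_dtypes` is hypothetical (it is not part of TFDS), and it assumes the code refers to dtypes through the `tf.` and `np.` prefixes. Unsigned integer types are included on the assumption that they fall under "etc".

```python
import re

# Explicit renames taken from the list above.
_EXPLICIT = {"tf.bool": "np.bool_", "tf.string": "np.str_"}

# Numeric dtypes keep their name and only change prefix (tf.int64 -> np.int64).
_NUMERIC = re.compile(r"\btf\.(u?int(?:8|16|32|64)|float(?:16|32|64))\b")


def migrate_dtypes(source: str) -> str:
    """Rewrite TensorFlow dtype references in `source` to NumPy dtypes."""
    for old, new in _EXPLICIT.items():
        # Word boundaries keep names such as tf.strings untouched.
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return _NUMERIC.sub(r"np.\1", source)


print(migrate_dtypes("tfds.features.Tensor(shape=(2,), dtype=tf.int64)"))
# tfds.features.Tensor(shape=(2,), dtype=np.int64)
```

A rewrite like this only covers the textual substitution; the migrated module still needs `import numpy as np`.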

21 Dec 11:09

github-actions

v4.8.0

76e2dd9

Added

[API] DatasetBuilder's description and citations can be specified in
dedicated README.md and CITATIONS.bib files, within the dataset package
(see https://www.tensorflow.org/datasets/add_dataset).
Tags can be associated with datasets in the TAGS.txt file. For
now, they are only used in the generated documentation.
[API][Experimental] New ViewBuilder to define datasets as transformations
of existing datasets. Also adds tfds.transform with functionality to apply
transformations.
Loggers are also called on tfds.as_numpy(...); the base Logger class has a
new corresponding method.
tfds.core.DatasetBuilder can have a default limit for the number of
simultaneous downloads. tfds.download.DownloadConfig can override it.
tfds.features.Audio supports storing raw audio data for lazy decoding.
The number of shards can be overridden when preparing a dataset:
builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42)).
Alternatively, you can configure the min and max shard size if you want TFDS
to compute the number of shards for you, but want to have control over the
shard sizes.

Changed

Deprecated

Removed

Fixed

Security

05 Oct 10:23

marcenacp

v4.7.0

f00f1e3

Added

[API] Added TfDataBuilder that is handy for storing experimental ad hoc TFDS datasets in notebook-like environments such that they can be versioned, described, and easily shared with teammates.
[API] Added options to create format-specific dataset builders. The new API now includes a number of NLP-specific builders, such as:
- CoNLL
- CoNLL-U
[API] Added tfds.beam.inc_counter to reduce beam.metrics.Metrics.counter boilerplate
[API] Added options to group together existing TFDS datasets into dataset collections and to perform simple operations over them.
[Documentation] update, specifically:
- New guide on format-specific dataset builders;
- New guide on adding new dataset collections to TFDS;
- Updated TFDS CLI documentation.
[TFDS CLI] Supports custom config through JSON (e.g. tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}')
New datasets:
- conll2003
- universal_dependency 2.10
- bucc
- i_naturalist2021
- mtnt Machine Translation of Noisy Text.
- placesfull
- tatoeba
- user_libri_audio
- user_libri_text
- xtreme_pos
- yahoo_ltrc
Updated datasets:
- C4 was updated to version 3.1.
- common_voice was updated to a more recent snapshot.
- wikipedia was updated with the 20220620 snapshot.
New dataset collections, such as xtreme and LongT5
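
For the TFDS CLI item above, hand-writing the JSON passed to --config is error-prone once the config grows. A minimal sketch of generating the command line from a Python dict follows; the helper `build_command` is hypothetical, and only the `tfds build my_dataset --config=...` form is taken from the note.

```python
import json
import shlex


def build_command(dataset: str, config: dict) -> str:
    """Return a shell-safe `tfds build` command with a JSON --config value."""
    config_json = json.dumps(config)
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return f"tfds build {shlex.quote(dataset)} --config={shlex.quote(config_json)}"


print(build_command("my_dataset", {"name": "my_custom_config", "description": "Abc"}))
# tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}'
```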

Changed

The base Logger class expects more information to be passed to the as_dataset method. This should only be relevant to people who have implemented and registered custom Logger class(es).
You can set DEFAULT_BUILDER_CONFIG_NAME in a DatasetBuilder to change the default config if it shouldn't be the first builder config defined in BUILDER_CONFIGS.
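
The selection rule described for DEFAULT_BUILDER_CONFIG_NAME can be sketched in plain Python. This illustrates the behavior as stated in the note and is not the TFDS implementation; the `Config` class and `default_config` function are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """Stand-in for a builder config; not a TFDS class."""
    name: str


def default_config(
    configs: Sequence[Config], default_name: Optional[str] = None
) -> Config:
    """Pick the default config: the named one if given, else the first."""
    if default_name is None:
        return configs[0]
    for config in configs:
        if config.name == default_name:
            return config
    raise ValueError(f"No config named {default_name!r}")


configs = [Config("small"), Config("large")]
print(default_config(configs).name)           # small
print(default_config(configs, "large").name)  # large
```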

Deprecated

Removed

Fixed

Various datasets
On Linux, when loading a dataset from a directory that is not your home (~) directory, a new ~ directory is no longer created in the current directory (fixes #4117).

Security

02 Jun 09:21

pierrot0

v4.6.0

1ef4c06

Added

Support for community datasets on GCS.
[API] tfds.builder_from_directory and tfds.builder_from_directories, see
https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
[API] Dash ("-") support in split names.
[API] file_format argument to download_and_prepare method, allowing user
to specify an alternative file format to store prepared data (e.g. "riegeli").
[API] file_format to DatasetInfo string representation.
[API] Expose the return value of Beam pipelines. This allows for users to
read the Beam metrics.
[API] Expose Feature tf_example_spec to public.
[API] doc kwarg on Features, to describe a feature.
[Documentation] Features description is shown on TFDS Catalog.
[Documentation] More metadata about HuggingFace datasets in TFDS catalog.
[Performance] Parallel load of metadata files.
[Testing] TFDS tests are now run using GitHub Actions, with misc improvements
such as caching and sharding.
[Testing] Improvements to MockFs.
New datasets.

Changed

[API] num_shards is now optional in the shard name.

Removed

TFDS pathlib API, migrated to a self-contained etils.epath (see
https://github.com/google/etils).

Fixed

Various datasets.
Dataset builders that are defined adhoc (e.g. in Colab).
Better DatasetNotFoundError messages.
Don't set deterministic on a global level but locally in interleave, so it
only applies to interleave and not all transformations.
Google drive downloader.

As always, thank you to all contributors!

31 Jan 15:45

ccl-core

v4.5.2

47baec1

Release notes:

Fix import bug on Windows (#3709)
Updated documentation

31 Jan 12:10

Conchylicultor

v4.5.1

8297306

Release notes:

Fix import bug on Windows (#3709)
Add split=tfds.split_for_jax_process('train') (alias of tfds.even_splits('train', n=jax.process_count())[jax.process_index()])

26 Jan 09:44

ccl-core

v4.5.0

cc7f631

This is the last version of TFDS supporting Python 3.6. Future versions will use Python 3.7.

Better split API:
- Splits can be selected using shards: split='train[3shard]'
- Underscore supported in numbers for better readability: split='train[:500_000]'
- Select the union of all splits with split='all'
- tfds.even_splits is more precise and flexible:
  - Return splits exactly of the same size when passed tfds.even_splits('train', n=3, drop_remainder=True)
  - Works on subsplits tfds.even_splits('train[:75%]', n=3) or even nested
  - Can be composed with other splits: tfds.even_splits('train', n=3)[0] + 'test'
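
To make the drop_remainder option concrete, here is a small pure-Python sketch of cutting a split into n contiguous slices and rendering them as split strings. It illustrates the idea only and is not the tfds.even_splits implementation: the function names are hypothetical, and giving the leftover examples to the first slices is an assumption.

```python
from typing import List, Tuple


def even_split_ranges(
    num_examples: int, n: int, drop_remainder: bool = False
) -> List[Tuple[int, int]]:
    """Return n contiguous (start, end) index ranges over num_examples."""
    base, remainder = divmod(num_examples, n)
    ranges = []
    start = 0
    for i in range(n):
        # Assumption: without drop_remainder, leftovers go to the first slices.
        extra = 1 if (not drop_remainder and i < remainder) else 0
        ranges.append((start, start + base + extra))
        start += base + extra
    return ranges


def as_split_specs(split: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Render index ranges using the 'name[start:end]' slicing syntax."""
    return [f"{split}[{start}:{end}]" for start, end in ranges]


print(as_split_specs("train", even_split_ranges(10, 3)))
# ['train[0:4]', 'train[4:7]', 'train[7:10]']
print(as_split_specs("train", even_split_ranges(10, 3, drop_remainder=True)))
# ['train[0:3]', 'train[3:6]', 'train[6:9]']
```

With drop_remainder=True every slice has exactly the same size, at the cost of leaving the last example unused in this 10-example case.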
FeatureConnectors:
- Faster dataset generation (using tfrecords)
- Features now have serialize_example / deserialize_example methods to encode/decode example to proto: example_bytes = features.serialize_example(example_data)
- Audio now supports encoding='zlib' for better compression
- Features specs exposed in proto for better compatibility with other languages
Better testing:
- Mock dataset now supports nested datasets
- Customize the number of sub examples
Documentation update:
- Community datasets: https://www.tensorflow.org/datasets/community_catalog/overview
- New guide on TFDS and determinism
RLDS:
- Nested datasets features are supported
- New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes
Misc:
- Create beam pipeline using TFDS as input with tfds.beam.ReadFromTFDS
- Support setting the file formats in tfds build --file_format=tfrecord
- Typing annotations exposed in tfds.typing
- tfds.ReadConfig has a new assert_cardinality=False to disable cardinality
- Add a tfds.display_progress_bar(True) for functional control
- Support for huge number of shards (>99999)
- DatasetInfo exposes .release_notes

And of course, new datasets, bug fixes,...

Thank you to all our contributors for improving TFDS!

28 Jul 12:29

Conchylicultor

v4.4.0

76f8591

API:

Add PartialDecoding support, to decode only a subset of the features (for performance)
Catalog now exposes links to KnowYourData visualisations
tfds.as_numpy supports datasets with None
Datasets generated with disable_shuffling=True are now read in generation order.
Loading datasets from files now supports custom tfds.features.FeatureConnector
tfds.testing.mock_data now supports
- non-scalar tensors with dtype tf.string
- builder_from_files and path-based community datasets
File format automatically restored (for datasets generated with tfds.builder(..., file_format=)).
Many new reinforcement learning datasets
Various bug fixes and internal improvements like:
- Dynamically set the number of worker threads during extraction
- Update the progress bar during download even if downloads are cached

Dataset creation:

Add tfds.features.LabeledImage for semantic segmentation (like image but with additional info.features['image_label'].name label metadata)
Add float32 support for tfds.features.Image (e.g. for depth map)
All FeatureConnectors can now have a None dimension anywhere (previously restricted to the first position).
tfds.features.Tensor() can have an arbitrary number of dynamic dimensions (Tensor(..., shape=(None, None, 3, None)))
tfds.features.Tensor can now be serialised as bytes, instead of float/int values (to allow better compression): Tensor(..., encoding='zlib')
Add script to add TFDS metadata files to existing TF-record (see doc).
New guide on common implementation gotchas
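
As a reading aid for the dynamic-dimension notes above, a None entry in a shape can be thought of as a wildcard that matches any size along that axis. The sketch below encodes that interpretation in plain Python; `shape_matches` is a hypothetical helper, not a TFDS function.

```python
from typing import Optional, Sequence


def shape_matches(spec: Sequence[Optional[int]], actual: Sequence[int]) -> bool:
    """Check a concrete shape against a spec where None means 'any size'."""
    if len(spec) != len(actual):
        return False  # The rank is always fixed, even with dynamic dimensions.
    return all(s is None or s == a for s, a in zip(spec, actual))


spec = (None, None, 3, None)
print(shape_matches(spec, (5, 7, 3, 2)))  # True
print(shape_matches(spec, (5, 7, 4, 2)))  # False
```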

Thank you all for your support and contribution!
