Releases: tensorflow/datasets

v4.8.3

27 Feb 11:46

Added

Changed

Deprecated

  • Python 3.7 support: this version and future versions use Python 3.8.

Removed

Fixed

  • The ignore_verifications flag of Hugging Face's datasets.load_dataset is
    deprecated and used to cause errors in tfds.load('huggingface:foo').

Security

v4.8.2

17 Jan 20:41

Deprecated

  • Python 3.7 support: this is the last version of TFDS supporting Python 3.7.
    Future versions will use Python 3.8.

Fixed

  • tfds new and tfds build better support the new recommended dataset
    organization, in which each dataset has its own package under datasets/
    and the builder class is called Builder and is defined in the module
    ${dsname}_dataset_builder.py.
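
As a rough illustration of the layout described above, the sketch below computes where the builder module for a given dataset name would live. The helper name expected_builder_module and the example dataset name my_dataset are hypothetical and are not part of TFDS.

```python
from pathlib import PurePosixPath


def expected_builder_module(dsname: str, root: str = "datasets") -> PurePosixPath:
    """Return the module path implied by the recommended organization.

    Hypothetical helper for illustration only; it is not a TFDS function.
    """
    # Each dataset has its own package: <root>/<dsname>/
    # The `Builder` class is defined in <dsname>_dataset_builder.py inside it.
    return PurePosixPath(root) / dsname / f"{dsname}_dataset_builder.py"


print(expected_builder_module("my_dataset"))
# prints: datasets/my_dataset/my_dataset_dataset_builder.py
```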

Security

v4.8.1

02 Jan 18:30

Changed

  • Added the file valid_tags.txt so that builds do not break.
  • TFDS no longer relies on TensorFlow DTypes. We chose NumPy DTypes to keep the
    typing expressiveness while dropping the heavy dependency on TensorFlow. We
    migrated all our internal datasets. Please migrate accordingly:
    • tf.bool: np.bool_
    • tf.string: np.str_
    • tf.int64, tf.int32, etc: np.int64, np.int32, etc
    • tf.float64, tf.float32, etc: np.float64, np.float32, etc
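
The mapping above can be applied mechanically to existing builder code. The following is a minimal sketch of such a rewrite, working on source text so that it needs neither TensorFlow nor NumPy installed. The function name migrate_dtypes is hypothetical, and the sketch only covers the dtypes listed in the note (bool, string, and sized int/float types).

```python
import re

# Explicit renames from the release note.
_SPECIAL = {"tf.bool": "np.bool_", "tf.string": "np.str_"}
# Sized numeric dtypes keep their name: tf.int64 -> np.int64, tf.float32 -> np.float32.
_NUMERIC = re.compile(r"\btf\.((?:int|float)\d+)\b")


def migrate_dtypes(source: str) -> str:
    """Rewrite tf.* dtype references in `source` to their np.* equivalents."""
    for old, new in _SPECIAL.items():
        # Word boundaries keep e.g. `tf.strings` untouched.
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return _NUMERIC.sub(r"np.\1", source)


print(migrate_dtypes("tfds.features.Tensor(shape=(), dtype=tf.int64)"))
# prints: tfds.features.Tensor(shape=(), dtype=np.int64)
```

The rewritten code still needs an `import numpy as np` line, which this sketch does not add.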

v4.8.0

21 Dec 11:09

Added

  • [API] DatasetBuilder's description and citations can be specified in
    dedicated README.md and CITATIONS.bib files, within the dataset package
    (see https://www.tensorflow.org/datasets/add_dataset).
  • Tags can be associated with datasets in the TAGS.txt file. For
    now, they are only used in the generated documentation.
  • [API][Experimental] New ViewBuilder to define datasets as transformations
    of existing datasets. Also adds tfds.transform with functionality to apply
    transformations.
  • Loggers are also called on tfds.as_numpy(...); the base Logger class has a
    new corresponding method.
  • tfds.core.DatasetBuilder can have a default limit for the number of
    simultaneous downloads. tfds.download.DownloadConfig can override it.
  • tfds.features.Audio supports storing raw audio data for lazy decoding.
  • The number of shards can be overridden when preparing a dataset:
    builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42)).
    Alternatively, you can configure the min and max shard size if you want TFDS
    to compute the number of shards for you, but want to have control over the
    shard sizes.
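
To make the relationship between shard-size bounds and shard count concrete, here is a small arithmetic sketch. It is not TFDS's actual algorithm, which the note does not describe. The function name pick_num_shards and its tie-breaking policy (prefer more shards, let the maximum size win on conflict) are assumptions made for illustration.

```python
import math


def pick_num_shards(total_bytes: int, min_shard_bytes: int, max_shard_bytes: int) -> int:
    """Pick a shard count so shard sizes fall between the two bounds when possible.

    Illustrative only; not the computation TFDS performs.
    """
    if total_bytes < 0 or not 0 < min_shard_bytes <= max_shard_bytes:
        raise ValueError("need total_bytes >= 0 and 0 < min_shard_bytes <= max_shard_bytes")
    # Fewest shards such that no shard exceeds the maximum size.
    fewest = max(1, math.ceil(total_bytes / max_shard_bytes))
    # Most shards such that every shard still reaches the minimum size.
    most = max(1, total_bytes // min_shard_bytes)
    # Assumed policy: prefer more shards; if the bounds conflict, honour the maximum size.
    return max(fewest, most)


print(pick_num_shards(1000, 100, 400))  # prints: 10
print(pick_num_shards(150, 100, 120))   # prints: 2
```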

Changed

Deprecated

Removed

Fixed

Security

v4.7.0

05 Oct 10:23
f00f1e3

Added

  • [API] Added TfDataBuilder that is handy for storing experimental ad hoc TFDS datasets in notebook-like environments such that they can be versioned, described, and easily shared with teammates.
  • [API] Added options to create format-specific dataset builders. The new API now includes a number of NLP-specific builders, such as:
  • [API] Added tfds.beam.inc_counter to reduce beam.metrics.Metrics.counter boilerplate
  • [API] Added options to group together existing TFDS datasets into dataset collections and to perform simple operations over them.
  • [Documentation] Updates, specifically:
    • New guide on format-specific dataset builders;
    • New guide on adding new dataset collections to TFDS;
    • Updated TFDS CLI documentation.
  • [TFDS CLI] Supports custom config through JSON (e.g. tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}')
  • New datasets:
  • Updated datasets:
    • C4 was updated to version 3.1.
    • common_voice was updated to a more recent snapshot.
    • wikipedia was updated with the 20220620 snapshot.
  • New dataset collections, such as xtreme and LongT5
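
Quoting the JSON passed to the CLI's --config option by hand is error-prone. The sketch below builds the command line from a Python dict using only the standard library. The helper name build_command is hypothetical, and the dataset and config values are the ones from the CLI example above.

```python
import json
import shlex


def build_command(dataset: str, config: dict) -> str:
    """Return a shell-safe `tfds build` command with a JSON --config value."""
    config_json = json.dumps(config)
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return f"tfds build {shlex.quote(dataset)} --config={shlex.quote(config_json)}"


print(build_command("my_dataset", {"name": "my_custom_config", "description": "Abc"}))
# prints: tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}'
```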

Changed

  • The base Logger class expects more information to be passed to the as_dataset method. This should only be relevant to people who have implemented and registered custom Logger class(es).
  • You can set DEFAULT_BUILDER_CONFIG_NAME in a DatasetBuilder to change the default config if it shouldn't be the first builder config defined in BUILDER_CONFIGS.
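
The selection rule for the default config can be sketched with plain Python stand-ins. The Config class and the default_config function below are hypothetical and do not use TFDS. They only mirror the behaviour described above: the first entry is the default unless a default name is set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for a builder config; only the name matters here."""
    name: str


def default_config(configs: Sequence[Config], default_name: Optional[str] = None) -> Config:
    """Return the first config, unless `default_name` selects another one."""
    if default_name is None:
        return configs[0]
    for config in configs:
        if config.name == default_name:
            return config
    raise ValueError(f"no config named {default_name!r}")


configs = [Config("small"), Config("large")]
print(default_config(configs).name)           # prints: small
print(default_config(configs, "large").name)  # prints: large
```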

Deprecated

Removed

Fixed

  • Various datasets
  • On Linux, when loading a dataset from a directory that is not your home (~) directory, a new ~ directory is no longer created in the current directory (fixes #4117).

Security

v4.6.0

02 Jun 09:21

Added

  • Support for community datasets on GCS.
  • [API] tfds.builder_from_directory and tfds.builder_from_directories, see
    https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
  • [API] Dash ("-") support in split names.
  • [API] file_format argument to download_and_prepare method, allowing user
    to specify an alternative file format to store prepared data (e.g. "riegeli").
  • [API] file_format to DatasetInfo string representation.
  • [API] Expose the return value of Beam pipelines. This allows users to
    read the Beam metrics.
  • [API] Expose Feature tf_example_spec to public.
  • [API] doc kwarg on Features, to describe a feature.
  • [Documentation] Features description is shown on TFDS Catalog.
  • [Documentation] More metadata about HuggingFace datasets in TFDS catalog.
  • [Performance] Parallel load of metadata files.
  • [Testing] TFDS tests are now run using GitHub Actions, with misc
    improvements such as caching and sharding.
  • [Testing] Improvements to MockFs.
  • New datasets.

Changed

  • [API] num_shards is now optional in the shard name.

Removed

Fixed

  • Various datasets.
  • Dataset builders that are defined ad hoc (e.g. in Colab).
  • Better DatasetNotFoundError messages.
  • Don't set deterministic on a global level but locally in interleave, so it
    only applies to interleave and not to all transformations.
  • Google drive downloader.

As always, thank you to all contributors!

v4.5.2

31 Jan 15:45

Release notes:

  • Fix import bug on Windows (#3709)
  • Updated documentation

v4.5.1

31 Jan 12:10

Release notes:

  • Fix import bug on Windows (#3709)
  • Add split=tfds.split_for_jax_process('train') (alias of tfds.even_splits('train', n=jax.process_count())[jax.process_index()])
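
The idea behind the alias is that each process reads its own, non-overlapping slice of the split. The pure-Python sketch below shows one way to compute such slices as split strings. The helper names and the way the remainder is distributed (one extra example to the first few parts) are assumptions; they are not necessarily what TFDS does internally.

```python
from typing import List


def even_split_specs(split: str, num_examples: int, n: int) -> List[str]:
    """Divide `num_examples` into n contiguous, non-overlapping slice specs."""
    base, extra = divmod(num_examples, n)
    specs = []
    start = 0
    for i in range(n):
        # Assumption: the first `extra` parts each get one additional example.
        size = base + (1 if i < extra else 0)
        specs.append(f"{split}[{start}:{start + size}]")
        start += size
    return specs


def spec_for_process(split: str, num_examples: int, process_index: int, process_count: int) -> str:
    """Pick the slice for one process, mirroring the indexing in the alias above."""
    return even_split_specs(split, num_examples, process_count)[process_index]


print(even_split_specs("train", 10, 3))
# prints: ['train[0:4]', 'train[4:7]', 'train[7:10]']
```

In a real JAX program, process_index and process_count would come from jax.process_index() and jax.process_count().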

v4.5.0

26 Jan 09:44

This is the last version of TFDS supporting Python 3.6. Future versions will use Python 3.7.

  • Better split API:

    • Splits can be selected using shards: split='train[3shard]'
    • Underscore supported in numbers for better readability: split='train[:500_000]'
    • Select the union of all splits with split='all'
    • tfds.even_splits is more precise and flexible:
      • Returns splits of exactly the same size when passed tfds.even_splits('train', n=3, drop_remainder=True)
      • Works on subsplits, e.g. tfds.even_splits('train[:75%]', n=3), or even nested ones
      • Can be composed with other splits: tfds.even_splits('train', n=3)[0] + 'test'
  • FeatureConnectors:

    • Faster dataset generation (using tfrecords)
    • Features now have serialize_example / deserialize_example methods to encode/decode example to proto: example_bytes = features.serialize_example(example_data)
    • Audio now supports encoding='zlib' for better compression
    • Features specs exposed in proto for better compatibility with other languages
  • Better testing:

    • Mock dataset now supports nested datasets
    • Customize the number of sub examples
  • Documentation update:

  • RLDS:

    • Nested datasets features are supported
    • New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes
  • Misc:

    • Create beam pipeline using TFDS as input with tfds.beam.ReadFromTFDS
    • Support setting the file formats in tfds build --file_format=tfrecord
    • Typing annotations exposed in tfds.typing
    • tfds.ReadConfig has a new assert_cardinality=False option to disable the cardinality assertion
    • Add tfds.display_progress_bar(True) for functional control
    • Support for a huge number of shards (>99999)
    • DatasetInfo exposes .release_notes
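
The underscore syntax in split strings such as 'train[:500_000]' matches Python's own integer literals, which int() already accepts. The sketch below parses the absolute-index form of a split string to show this. The parser parse_slice is hypothetical, handles only the name[start:stop] form, and is not the TFDS implementation.

```python
import re
from typing import Optional, Tuple

# Only the `name[start:stop]` form with absolute indices is handled here.
_SPEC = re.compile(r"^(?P<name>[\w-]+)\[(?P<start>[\d_]*):(?P<stop>[\d_]*)\]$")


def parse_slice(spec: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Split a spec like 'train[:500_000]' into (name, start, stop)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")

    def to_int(text: str) -> Optional[int]:
        # int() accepts underscores between digits, e.g. int('500_000') == 500000.
        return int(text) if text else None

    return match["name"], to_int(match["start"]), to_int(match["stop"])


print(parse_slice("train[:500_000]"))
# prints: ('train', None, 500000)
```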

And of course, new datasets, bug fixes,...

Thank you to all our contributors for improving TFDS!

v4.4.0

28 Jul 12:29

API:

  • Add PartialDecoding support, to decode only a subset of the features (for performance)
  • Catalog now exposes links to KnowYourData visualisations
  • tfds.as_numpy supports datasets with None
  • Datasets generated with disable_shuffling=True are now read in generation order.
  • Loading datasets from files now supports custom tfds.features.FeatureConnector
  • tfds.testing.mock_data now supports
    • non-scalar tensors with dtype tf.string
    • builder_from_files and path-based community datasets
  • File format automatically restored (for datasets generated with tfds.builder(..., file_format=)).
  • Many new reinforcement learning datasets
  • Various bug fixes and internal improvements like:
    • Dynamically set the number of worker threads during extraction
    • Update the progress bar during download even if downloads are cached

Dataset creation:

  • Add tfds.features.LabeledImage for semantic segmentation (like image but with additional info.features['image_label'].name label metadata)
  • Add float32 support for tfds.features.Image (e.g. for depth map)
  • All FeatureConnectors can now have a None dimension anywhere (previously restricted to the first position).
  • tfds.features.Tensor() can have an arbitrary number of dynamic dimensions (Tensor(..., shape=(None, None, 3, None)))
  • tfds.features.Tensor can now be serialised as bytes, instead of float/int values (to allow better compression): Tensor(..., encoding='zlib')
  • Add script to add TFDS metadata files to existing TF-record (see doc).
  • New guide on common implementation gotchas
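
The zlib encoding for tensors mentioned above stores values as compressed bytes rather than as individual float/int values. The standard-library sketch below shows the general idea of packing numbers into bytes, compressing them, and restoring them. The function names are hypothetical, and the byte layout (little-endian float32) is an assumption, not the format TFDS uses.

```python
import struct
import zlib
from typing import List, Sequence


def encode_floats(values: Sequence[float]) -> bytes:
    """Pack values as little-endian float32 bytes, then zlib-compress them."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return zlib.compress(raw)


def decode_floats(blob: bytes) -> List[float]:
    """Reverse encode_floats: decompress, then unpack 4-byte floats."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


values = [0.0] * 1000  # highly redundant data compresses well
blob = encode_floats(values)
print(len(blob) < 4 * len(values))    # prints: True
print(decode_floats(blob) == values)  # prints: True
```

How much space is saved depends on the data; noisy values compress far less than the constant tensor used here.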

Thank you all for your support and contribution!