Releases: tensorflow/datasets

v4.3.0

07 May 13:09

API:
• Add dataset.info.splits['train'].num_shards to expose the number of shards to the user
• Add tfds.features.Dataset to have a field containing sub-datasets (e.g. used in RL datasets)
• Add dtype and tf.uint16 support to tfds.features.Video
• Add DatasetInfo.license field to add redistributing information
• Better tfds.benchmark(ds) (compatible with any iterator, not just tf.data, better colab representation)

Other:
• Faster tfds.as_numpy() (avoid extra tf.Tensor <> np.array copy)
• Better tfds.as_dataframe visualisation (Sequence, ragged tensor, semantic masks with use_colormap)
• (experimental) Community datasets support, to allow dynamically importing datasets defined outside the TFDS repository.
• (experimental) Add a Hugging Face compatibility wrapper to use Hugging Face datasets directly in TFDS.
• (experimental) Riegeli format support
• (experimental) Add DatasetInfo.disable_shuffling to force examples to be read in generation order.
• Add .copy, .format methods to GPath objects
• Many bug fixes

Testing:
• Support custom BuilderConfig in DatasetBuilderTest
• DatasetBuilderTest now has a dummy_data class property which can be used in setUpClass
• Add add_tfds_id and cardinality support to tfds.testing.mock_data

And of course, many new datasets and dataset updates.

We would like to thank all the TFDS contributors!

v4.2.0

06 Jan 15:41

API:

  • Add tfds build to the CLI. See documentation.
  • DownloadManager now returns Pathlib-like objects
  • Datasets returned by tfds.as_numpy are compatible with len(ds)
  • New tfds.features.Dataset to represent nested datasets
  • Add tfds.ReadConfig(add_tfds_id=True) to add a unique id to the example ex['tfds_id'] (e.g. b'train.tfrecord-00012-of-01024__123')
  • Add num_parallel_calls option to tfds.ReadConfig to override the default AUTOTUNE option
  • tfds.ImageFolder now supports tfds.decode.SkipDecoder
  • Add multichannel audio support to tfds.features.Audio
  • Better tfds.as_dataframe visualization (ffmpeg video if installed, bounding boxes,...)
  • Add try_gcs to tfds.builder(..., try_gcs=True)
  • Simpler BuilderConfig definition: class VERSION and RELEASE_NOTES are applied to all BuilderConfig. Config description is now optional.
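
The tfds_id example above appears to join the shard file name and an index within that shard with a double underscore. A minimal sketch of splitting such an id, assuming that layout holds in general (parse_tfds_id is a hypothetical helper, not a TFDS API, and the pattern is inferred only from the one example in these notes):

```python
import re

# Hypothetical helper (not part of TFDS). The pattern is inferred from the
# single example in the release notes: b'train.tfrecord-00012-of-01024__123'.
_TFDS_ID_RE = re.compile(
    rb'(?P<split>[^.]+)\.tfrecord-(?P<shard>\d+)-of-(?P<num_shards>\d+)__(?P<index>\d+)'
)


def parse_tfds_id(tfds_id: bytes) -> dict:
    """Splits an id into split name, shard number, shard count, and example index."""
    match = _TFDS_ID_RE.fullmatch(tfds_id)
    if match is None:
        raise ValueError(f'Unrecognized tfds_id: {tfds_id!r}')
    return {
        'split': match['split'].decode(),
        'shard': int(match['shard']),
        'num_shards': int(match['num_shards']),
        'index': int(match['index']),
    }


parsed = parse_tfds_id(b'train.tfrecord-00012-of-01024__123')
print(parsed)
```

The id is treated as opaque bytes by most code; parsing it like this is only useful for debugging which shard an example came from.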

Breaking compatibility changes:

  • Removed configs for all text datasets. Only plain text version is kept. For example: multi_nli/plain_text -> multi_nli.
  • To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys, which are non-deterministic, and to restrict keys to str, bytes and int). New errors likely indicate an issue in the dataset implementation.
  • tfds.core.benchmark now returns a pd.DataFrame (instead of a dict)
  • tfds.units is not visible anymore from the public API
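
To illustrate the key validation described above, here is a rough sketch of the kind of check involved. It is not the TFDS implementation: check_key is a hypothetical function, and the only rule taken from the notes is that keys are restricted to str, bytes and int.

```python
import pathlib


def check_key(key):
    """Hypothetical check: accept only str, bytes, or int example keys."""
    if not isinstance(key, (str, bytes, int)):
        raise TypeError(
            f'Invalid key type {type(key).__name__}: expected str, bytes or int.'
        )
    return key


# A stable identifier derived from the data is accepted.
ok_key = check_key('image_00042')

# A path object is rejected: absolute paths differ between machines, so
# using them as keys would make the example order non-deterministic.
try:
    check_key(pathlib.Path('/tmp/extracted/image_00042.png'))
    path_rejected = False
except TypeError:
    path_rejected = True

print(ok_key, path_rejected)
```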

Bug fixes:

  • Support 0-len sequence with images of dynamic shape (Fix #2616)
  • Progress bar is now correctly updated when copying files.
  • Many bug fixes (GPath consistency with pathlib, s3 compatibility, TQDM visual artifacts, GCS crash on windows, re-download when checksums updated,...)
  • Better debugging and error message (e.g. human readable size,...)
  • Allow max_examples_per_splits=0 in tfds build --max_examples_per_splits=0 to test _split_generators only (without _generate_examples).

And of course, many new datasets and dataset updates.

Thank you to the community for their many valuable contributions and for supporting us in this project!

v4.1.0

04 Nov 12:02
  • When generating a dataset, if download fails for any reason, it is now possible to manually download the data. See doc.

  • Simplification of the dataset creation API.

    • We've made it easier to create datasets outside the TFDS repository (see our updated dataset creation guide).
    • _split_generators should now return {'split_name': self._generate_examples(), ...} (but current datasets are backward compatible).
    • All datasets inherit from tfds.core.GeneratorBasedBuilder. Converting a dataset to Beam now only requires changing _generate_examples (see example and doc).
    • tfds.core.SplitGenerator and tfds.core.BeamBasedBuilder are deprecated and will be removed in a future version.
  • Better pathlib.Path, os.PathLike compatibility:

    • dl_manager.manual_dir now returns a pathlib-like object. Example:
    text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()
    • Note: Other dl_manager.download, .extract,... will return pathlib-like objects in future versions
    • FeatureConnector,... and most functions should accept PathLike objects. Let us know if some functions you need are missing.
    • Add a tfds.core.as_path to create pathlib.Path-like objects compatible with GCS (e.g. tfds.core.as_path('gs://my-bucket/labels.csv').read_text()).
  • Other bug fixes and improvements. E.g.

    • Add verify_ssl= option to tfds.download.DownloadConfig to disable SSL certificate verification during download.
    • BuilderConfig are now compatible with Beam datasets #2348
    • --record_checksums now assumes the new dataset-as-folder model
    • tfds.features.Image can accept encoded image bytes directly (useful when used with img_name, img_bytes = dl_manager.iter_archive('images.zip')).
    • The API doc now shows deprecated methods, and abstract methods to overwrite are now documented.
    • You can generate imagenet2012 with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
  • And of course new datasets
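
As a rough illustration of the two API changes above (the dict returned by _split_generators and the pathlib-like manual_dir), the sketch below uses a plain class and a fake download manager so that it runs without TFDS installed. MyDataset, the file name and the split boundaries are all hypothetical; a real dataset would inherit from tfds.core.GeneratorBasedBuilder.

```python
import pathlib
import tempfile
import types


class MyDataset:
    """Plain stand-in for a tfds.core.GeneratorBasedBuilder subclass."""

    def _split_generators(self, dl_manager):
        # manual_dir is pathlib-like, so `/` and read_text() work directly.
        text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()
        lines = text.splitlines()
        # New style: map each split name to its example generator.
        return {
            'train': self._generate_examples(lines[:2]),
            'test': self._generate_examples(lines[2:]),
        }

    def _generate_examples(self, lines):
        for i, line in enumerate(lines):
            # Yield (key, example); the key is an int, not a file path.
            yield i, {'text': line}


with tempfile.TemporaryDirectory() as tmp:
    manual_dir = pathlib.Path(tmp)
    (manual_dir / 'downloaded-text.txt').write_text('a\nb\nc\n')
    # Fake download manager exposing only the attribute used here.
    dl_manager = types.SimpleNamespace(manual_dir=manual_dir)
    splits = MyDataset()._split_generators(dl_manager)
    examples = {name: list(gen) for name, gen in splits.items()}

print(examples)
```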

Thank you to all our contributors for improving TFDS!

v4.0.1

09 Oct 17:45
  • Fix tfds.load when generation code isn't present
  • Improve GCS compatibility.

Thanks @carlthome for reporting and fixing the issue.

v4.0.0

06 Oct 19:15

API changes, new features:

  • Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
  • tfds.load can now load a dataset without using the generation class. So tfds.load('my_dataset:1.0.0') can work even if MyDataset.VERSION == '2.0.0' (See #2493).
  • Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for detail)
  • tfds.testing.mock_data does not require metadata files anymore!
  • Add tfds.as_dataframe(ds, ds_info) with custom visualisation (example)
  • Add tfds.even_splits to generate subsplits (e.g. tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...])
  • Add new DatasetBuilder.RELEASE_NOTES property
  • tfds.features.Image now supports PNG with 4-channels
  • tfds.ImageFolder now supports custom shape, dtype
  • Downloaded URLs are available through MyDataset.url_infos
  • Add skip_prefetch option to tfds.ReadConfig
  • as_supervised=True support for tfds.show_examples, tfds.as_dataframe
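
The subsplit strings shown for tfds.even_splits above can be reproduced with a few lines of plain Python. This is only an illustration of the percent-slicing idea, not the TFDS implementation: even_split_specs is a hypothetical name, and rounding the boundaries to the nearest integer percent is an assumption chosen to match the two values given in the notes.

```python
def even_split_specs(split: str, n: int) -> list:
    """Hypothetical re-creation of evenly sized percent-slice strings."""
    # Boundaries at 0%, 100/n %, 2*100/n %, ..., 100%, rounded to integers.
    bounds = [round(i * 100 / n) for i in range(n + 1)]
    return [f'{split}[{lo}%:{hi}%]' for lo, hi in zip(bounds, bounds[1:])]


specs = even_split_specs('train', n=3)
print(specs)
```

Each resulting string is a regular split slice, so it could in principle be passed anywhere a split name is accepted.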

Breaking compatibility changes:

  • tfds.as_numpy() now returns an iterable which can be iterated multiple times. To migrate next(ds) -> next(iter(ds))
  • Rename tfds.features.text.Xyz -> tfds.deprecated.text.Xyz
  • Remove DatasetBuilder.IN_DEVELOPMENT property
  • Remove tfds.core.disallow_positional_args (should use Py3 *, instead)
  • tfds.features can now be saved/loaded, you may have to overwrite FeatureConnector.from_json_content and FeatureConnector.to_json_content to support this feature.
  • Stop testing against TF 1.15. Requires Python 3.6.8+.
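
The tfds.as_numpy() migration above comes down to the difference between an iterable and an iterator. The sketch below shows that difference with a small stand-in class; ReiterableExamples is hypothetical and only mimics the behavior described in the notes.

```python
class ReiterableExamples:
    """Hypothetical stand-in for the object returned by tfds.as_numpy()."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        # A fresh iterator on every call, so the data can be read many times.
        return iter(self._examples)


ds = ReiterableExamples([{'label': 0}, {'label': 1}])

# Old pattern: next(ds) fails because an iterable is not itself an iterator.
try:
    next(ds)
    next_works = True
except TypeError:
    next_works = False

# New pattern: create an iterator explicitly.
first = next(iter(ds))

# Iterating twice yields the same examples both times.
first_pass = list(ds)
second_pass = list(ds)
print(next_works, first, first_pass == second_pass)
```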

Other bug fixes:

  • Better archive extension detection for dl_manager.download_and_extract
  • Fix tfds.__version__ in TFDS nightly to be PEP440 compliant
  • Fix crash when GCS not available
  • Script to detect dead-urls
  • Improved open-source workflow, contributor guide, documentation
  • Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...

And of course, new datasets and dataset updates.

A gigantic thanks to our community, which has helped us debug issues and implement many features, and especially to vijayphoenix@ for being a major contributor.

v3.2.1

12 Aug 10:05
  • Fix an issue with GCS on Windows.

v3.2.0

10 Jul 21:39

Future breaking change:

  • The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.

New features

API:

  • Add a tfds.ImageFolder and tfds.TranslateFolder to easily create custom datasets with your custom data.
  • Add a tfds.ReadConfig(input_context=) to shard dataset, for better multi-worker compatibility (#1426).
  • The default data_dir can be controlled by the TFDS_DATA_DIR environment variable.
  • Better usability when developing datasets outside TFDS
    • Downloads are always cached
    • Checksums are optional
  • Added a tfds.show_statistics(ds_info) to display FACETS OVERVIEW. Note: This requires the dataset to have been generated with the statistics.
  • Open source various scripts to help deployment/documentation (Generate catalog documentation, export all metadata files,...)
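
A minimal sketch of how an environment variable such as TFDS_DATA_DIR can control a default directory. This is not TFDS code: resolve_data_dir is a hypothetical helper, and both the precedence order (explicit argument, then environment variable, then fallback) and the ~/tensorflow_datasets fallback are assumptions made for the example.

```python
import os
import pathlib


def resolve_data_dir(data_dir=None):
    """Hypothetical helper: explicit argument, then TFDS_DATA_DIR, then a fallback."""
    if data_dir is not None:
        return pathlib.Path(data_dir)
    env_dir = os.environ.get('TFDS_DATA_DIR')
    if env_dir:
        return pathlib.Path(env_dir)
    # Assumed fallback location for this sketch.
    return pathlib.Path.home() / 'tensorflow_datasets'


# Set the variable for this process only; the path is an example value.
os.environ['TFDS_DATA_DIR'] = '/tmp/my_tfds_data'
resolved = resolve_data_dir()
print(resolved)
```

In practice the variable would usually be exported in the shell or job configuration before the Python process starts.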

Documentation:

  • Catalog display images (example)
  • Catalog shows which datasets have been recently added and are only available in tfds-nightly

Breaking compatibility change:

  • Fix deterministic example order on Windows when path was used as key (this only impacts a few datasets). Now example order should be the same on all platforms.
  • Remove tfds.load('image_label_folder') in favor of the more user-friendly tfds.ImageFolder

Other:

  • Various performance improvements for both generation and reading (e.g. use __slots__, fix parallelisation bug in tf.data.TFRecordReader,...)
  • Various fixes (typo, types annotations, better error messages, fixing dead links, better windows compatibility,...)

Thanks to all our contributors who help improve the state of datasets for the entire research community!

v3.1.0

30 Apr 00:18

Breaking compatibility change:

  • Rename tfds.core.NamedSplit, tfds.core.SplitBase -> tfds.Split. Now tfds.Split.TRAIN,... are instances of tfds.Split
  • Remove deprecated num_shards argument from tfds.core.SplitGenerator. This argument was ignored as shards are automatically computed.

Future breaking compatibility changes:

  • Rename interleave_parallel_reads -> interleave_cycle_length for tfds.ReadConfig.
  • Invert ds, ds_info argument order for tfds.show_examples
  • The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.

Other changes:

  • Testing: Add support for custom decoders in tfds.testing.mock_data
  • Documentation: shows which datasets are only present in tfds-nightly
  • Documentation: display images for supported datasets
  • API: Add tfds.builder_cls(name) to access a DatasetBuilder class by name
  • API: Add info.splits['train'].filenames for access to the tf-record files.
  • API: Add tfds.core.add_data_dir to register an additional data dir
  • Remove most ds.with_options which were applied by TFDS. Now use tf.data default.
  • Other bug fixes and improvements (better error messages, Windows compatibility,...)

Thank you all for your contributions, and helping us make TFDS better for everyone!

v3.0.0

16 Apr 03:03

Breaking changes:

  • Legacy mode tfds.experiment.S3 has been removed
  • New image_classification section. Some datasets have been moved there from images.
  • in_memory argument has been removed from as_dataset/tfds.load (small datasets are now auto-cached).
  • DownloadConfig does not append the dataset name anymore (manual data should be in <manual_dir>/ instead of <manual_dir>/<dataset_name>/)
  • Tests now check that all dl_manager.download urls have registered checksums. To opt-out, add SKIP_CHECKSUMS = True to your DatasetBuilderTestCase.
  • tfds.load now always returns tf.compat.v2.Dataset. If you're still using tf.compat.v1:
    • Use tf.compat.v1.data.make_one_shot_iterator(ds) rather than ds.make_one_shot_iterator()
    • Use isinstance(ds, tf.compat.v2.Dataset) instead of isinstance(ds, tf.data.Dataset)
  • tfds.Split.ALL has been removed from the API.

Future breaking change:

  • The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.
  • num_shards argument of tfds.core.SplitGenerator is currently ignored and will be removed in the next version.

Features:

  • DownloadManager is now picklable (can be used inside Beam pipelines)
  • tfds.features.Audio:
    • Support float as returned value
    • Expose sample_rate through info.features['audio'].sample_rate
    • Support for encoding audio features from file objects
  • Various bug fixes, better error messages, documentation improvements
  • More datasets

Thank you to all our contributors for helping us make TFDS better for everyone!

v2.1.0

25 Feb 21:51

New features:

  • Datasets expose info.dataset_size and info.download_size. Datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can be read with 2.1.0 however).
  • Auto-caching small datasets. in_memory argument is deprecated and will be removed in a future version.
  • Datasets expose their cardinality num_examples = tf.data.experimental.cardinality(ds) (Requires tf-nightly or TF >= 2.2.0)
  • Get the number of examples in a sub-split with: info.splits['train[70%:]'].num_examples