
Method download_and_prepare poorly documented (+Tedlium broken) #2608

Open
@knightcode


Description of issue

Using this bit of Python:

  import logging

  import apache_beam as beam
  import tensorflow_datasets as tfds

  logger = logging.getLogger(__name__)

  dl_config = tfds.download.DownloadConfig(
    beam_options=beam.options.pipeline_options.PipelineOptions(flags=[]),
    compute_stats=tfds.download.ComputeStatsMode.SKIP)
  builder = tfds.builder("tedlium/release3", data_dir="./datasets")
  logger.info(f"info: {builder.info}")
  builder.download_and_prepare(
    download_dir="./downloads",
    download_config=dl_config)

Initially, the build threw a KeyError on some file in the tarball, which I worked around with this change in core/download/extractor.py:

      try:
        extract_file = tar.extractfile(member)
      except KeyError:
        # Skip members that can't be extracted instead of aborting the whole build.
        print("Failed extracting: {}".format(member))
        continue
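
For reference, here is a way to reproduce that KeyError outside of TFDS, assuming the failing members are links whose targets are missing from the archive (tarfile's extractfile resolves link members and raises KeyError when it can't find the target). I'm not certain that's exactly what the TEDLIUM tarball is hitting, but it matches the symptom:

  import io
  import tarfile

  # Build an in-memory archive containing a single dangling symlink.
  buf = io.BytesIO()
  with tarfile.open(fileobj=buf, mode="w") as tar:
    link = tarfile.TarInfo(name="dangling_link")
    link.type = tarfile.SYMTYPE
    link.linkname = "missing_target"  # target is never added to the archive
    tar.addfile(link)

  buf.seek(0)
  with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
      try:
        extract_file = tar.extractfile(member)  # raises KeyError: link target not found
      except KeyError:
        print("Failed extracting: {}".format(member))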

The process has been running for 6 days straight now, and I have no idea what it's currently doing. The last output, from about 5.5 days ago, was this:

2020-10-16T08:26:59-0400 INFO Generating split validation
2020-10-16T08:26:59-0400 INFO Generating split test
2020-10-16T08:26:59-0400 INFO Generating split train
2020-10-16T08:27:00-0400 INFO ==================== <function annotate_downstream_side_inputs at 0x165f74430> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function fix_side_input_pcoll_coders at 0x165f74550> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function lift_combiners at 0x165f745e0> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function expand_sdf at 0x165f74670> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function expand_gbk at 0x165f74700> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function sink_flattens at 0x165f74820> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function greedily_fuse at 0x165f748b0> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function read_to_impulse at 0x165f74940> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function impulse_to_input at 0x165f749d0> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function sort_stages at 0x165f74c10> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function setup_timer_mapping at 0x165f74b80> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function populate_data_channel_coders at 0x165f74ca0> ====================
2020-10-16T08:27:00-0400 INFO Creating state cache with size 100
2020-10-16T08:27:00-0400 INFO Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x1661f1940> for environment ref_Environment_default_environment_1 (beam:env:embedded_python:v1, b'')
2020-10-16T08:27:00-0400 INFO Running ((((ref_AppliedPTransform_test/Create/Impulse_47)+(ref_AppliedPTransform_test/Create/FlatMap(<lambda at core.py:2826>)_48))+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/AddRandomKeys_51))+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/Map(reify_timestamps)_53))+(test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Write)
2020-10-16T08:27:00-0400 INFO Running (((((((test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_55))+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/RemoveRandomKeys_56))+(ref_AppliedPTransform_test/Create/Map(decode)_57))+(ref_AppliedPTransform_test/FlatMap(_generate_examples_from_stm_file)_58))+(ref_AppliedPTransform_test/Encode_59))+(ref_AppliedPTransform_test/SerializeBucketize_60))+(test/GroupByBucket/Write)
2020-10-16T09:01:11-0400 INFO Running ((((ref_AppliedPTransform_train/Create/Impulse_90)+(ref_AppliedPTransform_train/Create/FlatMap(<lambda at core.py:2826>)_91))+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/AddRandomKeys_94))+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/Map(reify_timestamps)_96))+(train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Write)
2020-10-16T09:01:11-0400 INFO Running (((((((train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_98))+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/RemoveRandomKeys_99))+(ref_AppliedPTransform_train/Create/Map(decode)_100))+(ref_AppliedPTransform_train/FlatMap(_generate_examples_from_stm_file)_101))+(ref_AppliedPTransform_train/Encode_102))+(ref_AppliedPTransform_train/SerializeBucketize_103))+(train/GroupByBucket/Write)

To me, download_and_prepare sounds like downloading a file, extracting it, and maybe moving some files around. I don't know why that would take this long. The 54GB file is downloaded. It's extracted. And the working directory has been at a constant 217GB. Yet, somehow, the disk is still filling up with... something. It doesn't seem to be using anything in /tmp. My home directory's storage has been constant. So I have no idea where this crap is getting saved.
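
For what it's worth, the sanity check I ended up wanting is something like the sketch below: cap each split so the pipeline finishes quickly and you can see what gets written where. I'm assuming max_examples_per_split exists on DownloadConfig in your TFDS version (check help(tfds.download.DownloadConfig) first); as far as I can tell, the output shards are staged under data_dir in a *.incomplete* directory until the build finishes.

  import apache_beam as beam
  import tensorflow_datasets as tfds

  # Quick sanity run, not a fix: cap each split at a handful of examples so
  # download_and_prepare finishes fast and the output location becomes visible.
  # max_examples_per_split is assumed to exist on this TFDS version.
  dl_config = tfds.download.DownloadConfig(
    beam_options=beam.options.pipeline_options.PipelineOptions(flags=[]),
    compute_stats=tfds.download.ComputeStatsMode.SKIP,
    max_examples_per_split=100)
  builder = tfds.builder("tedlium/release3", data_dir="./datasets")
  builder.download_and_prepare(
    download_dir="./downloads",
    download_config=dl_config)

Watching data_dir while that runs would at least answer the "where is it writing" question for the full build.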

The code is difficult to trace, and I can't figure out what reasonable work it could be doing that I actually want it to do. The documentation should describe what the bulk of the work actually entails, so that I can tell whether this is normal processing.
