Description of issue
Using this bit of Python:

```python
dl_config = tfds.download.DownloadConfig(
    beam_options=beam.options.pipeline_options.PipelineOptions(flags=[]),
    compute_stats=tfds.download.ComputeStatsMode.SKIP)
builder = tfds.builder("tedlium/release3", data_dir="./datasets")
logger.info(f"info: {builder.info}")
builder.download_and_prepare(
    download_dir="./downloads",
    download_config=dl_config)
```
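Since the pipeline runs on Beam's embedded local runner, most of what it does is only visible through Python logging. One way to get more visibility before calling `download_and_prepare` is to turn up log verbosity; this is a sketch, not part of the original report, and the logger names `apache_beam` and `tensorflow_datasets` are the conventional package logger names, which I'm assuming apply to the versions in use:

```python
import logging

# Send all log records to stderr with timestamps, matching the format of the
# output shown later in this report.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Turn the Beam and TFDS loggers up to DEBUG for per-stage detail.
# (These logger names follow the usual package-name convention; adjust
# them if your installed versions log under different names.)
for name in ("apache_beam", "tensorflow_datasets"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```

With this in place, the long silent stretches between Beam stage transitions at least produce per-stage DEBUG records instead of nothing.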
Initially this threw a `KeyError` on some file in the tarball, which I fixed with this workaround in `core/download/extractor.py`:

```python
try:
    extract_file = tar.extractfile(member)
except KeyError:
    print("Failed extracting: {}".format(member))
    continue
```
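For reference, `tarfile.TarFile.extractfile` raises `KeyError` when a member (typically a link entry) refers to a target that isn't present in the archive, which is consistent with the error above. Here is a self-contained sketch of the same skip-on-`KeyError` pattern using only the standard library; the helper name `safe_extract_members` is mine, not from the TFDS code:

```python
import io
import tarfile

def safe_extract_members(tar):
    """Yield (member, fileobj) pairs, skipping members that raise KeyError."""
    for member in tar.getmembers():
        try:
            fileobj = tar.extractfile(member)
        except KeyError:
            # Same recovery as the workaround above: log and move on.
            print("Failed extracting: {}".format(member.name))
            continue
        if fileobj is not None:  # directories yield None
            yield member, fileobj

# Build a tiny tarball in memory to demonstrate the helper.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    contents = {m.name: f.read() for m, f in safe_extract_members(tar)}
```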
The process has now been running for 6 days straight, and I have no idea what it's currently doing. The last output, from about 5.5 days ago, was this:
```
2020-10-16T08:26:59-0400 INFO Generating split validation
2020-10-16T08:26:59-0400 INFO Generating split test
2020-10-16T08:26:59-0400 INFO Generating split train
2020-10-16T08:27:00-0400 INFO ==================== <function annotate_downstream_side_inputs at 0x165f74430> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function fix_side_input_pcoll_coders at 0x165f74550> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function lift_combiners at 0x165f745e0> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function expand_sdf at 0x165f74670> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function expand_gbk at 0x165f74700> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function sink_flattens at 0x165f74820> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function greedily_fuse at 0x165f748b0> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function read_to_impulse at 0x165f74940> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function impulse_to_input at 0x165f749d0> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function sort_stages at 0x165f74c10> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function setup_timer_mapping at 0x165f74b80> ====================
2020-10-16T08:27:00-0400 INFO ==================== <function populate_data_channel_coders at 0x165f74ca0> ====================
2020-10-16T08:27:00-0400 INFO Creating state cache with size 100
2020-10-16T08:27:00-0400 INFO Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x1661f1940> for environment ref_Environment_default_environment_1 (beam:env:embedded_python:v1, b'')
2020-10-16T08:27:00-0400 INFO Running ((((ref_AppliedPTransform_test/Create/Impulse_47)+(ref_AppliedPTransform_test/Create/FlatMap(<lambda at core.py:2826>)_48))+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/AddRandomKeys_51))+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/Map(reify_timestamps)_53))+(test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Write)
2020-10-16T08:27:00-0400 INFO Running (((((((test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_55))+(ref_AppliedPTransform_test/Create/MaybeReshuffle/Reshuffle/RemoveRandomKeys_56))+(ref_AppliedPTransform_test/Create/Map(decode)_57))+(ref_AppliedPTransform_test/FlatMap(_generate_examples_from_stm_file)_58))+(ref_AppliedPTransform_test/Encode_59))+(ref_AppliedPTransform_test/SerializeBucketize_60))+(test/GroupByBucket/Write)
2020-10-16T09:01:11-0400 INFO Running ((((ref_AppliedPTransform_train/Create/Impulse_90)+(ref_AppliedPTransform_train/Create/FlatMap(<lambda at core.py:2826>)_91))+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/AddRandomKeys_94))+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/Map(reify_timestamps)_96))+(train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Write)
2020-10-16T09:01:11-0400 INFO Running (((((((train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_98))+(ref_AppliedPTransform_train/Create/MaybeReshuffle/Reshuffle/RemoveRandomKeys_99))+(ref_AppliedPTransform_train/Create/Map(decode)_100))+(ref_AppliedPTransform_train/FlatMap(_generate_examples_from_stm_file)_101))+(ref_AppliedPTransform_train/Encode_102))+(ref_AppliedPTransform_train/SerializeBucketize_103))+(train/GroupByBucket/Write)
```
To me, `download_and_prepare` sounds like downloading a file, extracting it, and maybe moving some files around. I don't know why that would take this long. The 54GB file is downloaded. It's extracted. And the working directory has stayed at a constant 217GB. Yet, somehow, the disk is still filling up with... something. It doesn't seem to be using anything in `/tmp`. My home directory's storage has been constant. So I have no idea where this data is getting saved.
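One way to track down where the space is actually going is to scan the candidate roots for large, recently modified files. This is a stdlib sketch of my own, not something TFDS provides, and the thresholds are arbitrary:

```python
import os
import time

def recent_large_paths(root, min_bytes=1 << 20, max_age_s=3600):
    """Return (size, path) pairs for files under `root` that are at least
    `min_bytes` large and were modified within the last `max_age_s` seconds,
    biggest first."""
    now = time.time()
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_size >= min_bytes and now - st.st_mtime <= max_age_s:
                hits.append((st.st_size, path))
    return sorted(hits, reverse=True)
```

Running this over the `data_dir`, the `download_dir`, and `tempfile.gettempdir()` while the job is live should reveal which tree is growing, since only files touched in the last hour show up.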
The code is difficult to trace, and I can't figure out what reasonable work it could be doing that I actually want done. It seems like the documentation should reflect what the bulk of the work actually entails, so that I can tell whether or not this is normal processing.