
Update retry algorithm to be more robust #6

Open
egrace479 opened this issue Jun 5, 2024 · 3 comments
Labels: enhancement (New feature or request), structure (Refactoring or architecture, general code organization)

Comments

@egrace479 (Member)

That being said, retry algorithms (at least robust ones) for internet protocols are normally written with an exponential escalation of wait time (such as 1, 2, 4, 8, 16, 32, ... seconds). In that case, a user may want to specify at which point to give up and log a failure, for example --max-retries and/or --max-wait.

Originally posted by @hlapp in #1 (comment)
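
For illustration, here's a minimal sketch of that escalation; the max_retries and max_wait parameters are hypothetical, mirroring the flags suggested above, and none of this is current cautious-robot behavior:

import time

import requests

def get_with_backoff(url, max_retries=5, max_wait=32):
    """Retry with exponentially escalating waits (1, 2, 4, 8, ... seconds)."""
    for attempt in range(max_retries + 1):
        response = requests.get(url)
        if response.status_code not in (429, 500, 502, 503, 504):
            return response
        if attempt < max_retries:
            time.sleep(min(2 ** attempt, max_wait))  # 1, 2, 4, ... capped at max_wait
    return None  # out of retries; the caller logs the failure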

Reply: It waits after a failed attempt if the response is any of the following: 429, 500, 502, 503, or 504. It doesn't wait between successful downloads. It retries the designated responses up to a maximum number of times; otherwise it just logs the response in the error log (along with the index, filename, and URL).

Setting a maximum wait time on a request would probably be a good idea as well. urllib3.request seems to handle much of this when passed a Retry object. @thompsonmj had also suggested HTTPAdapter as an option, which also uses Retry.

Seems reasonable to use HTTPAdapter, since cautious-robot is already using requests. We must also consider streaming interruption, as noted here.
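
As a sketch of what the HTTPAdapter-plus-Retry wiring could look like (the parameter values here are illustrative, not settled):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                                     # overall retry budget
    backoff_factor=1,                            # exponential waits between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # the responses retried today
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))
response = session.get("https://example.com/image.jpg")  # placeholder URL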

egrace479 added the enhancement and structure labels on Jun 5, 2024
@johnbradley

The requests HTTPAdapter with the urllib3 Retry strategy looks good for some of the retry needs. The streaming interruption will still need to be handled separately, though.
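
For the streaming side, a rough sketch of one way to handle it; restarting the whole file and the exact exception list are assumptions, not a settled design:

import requests

def download_stream(session, url, dest_path, attempts=3):
    """Re-request the whole file if the stream breaks partway through."""
    for _ in range(attempts):
        try:
            with session.get(url, stream=True, timeout=30) as response:
                response.raise_for_status()
                with open(dest_path, "wb") as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
            return True
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError,
                requests.exceptions.Timeout):
            continue  # stream was interrupted mid-body; start the download over
    return False  # exhausted attempts; the caller logs the failure

Passing in a session with the Retry adapter mounted would cover the initial response codes, while this loop covers mid-stream failures.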

@johnbradley

Sometimes when downloading files we end up reaching a threshold where our IP address gets blocked for a while by a remote server. In that case you typically have to wait a few hours, and I wouldn't expect or want the command to wait that long. Instead, can we re-run the cautious-robot command and have it skip already-downloaded images?

@egrace479 (Member, Author)

Sometimes when downloading files we end up reaching a threshold where our IP address gets blocked for a while by a remote server. In that case you typically have to wait a few hours, and I wouldn't expect or want the command to wait that long. Instead, can we re-run the cautious-robot command and have it skip already-downloaded images?

Right now I believe it relies on adjusting the start index to avoid re-downloading the image. However, I could add a line here checking for the image:

if os.path.exists(image_dir_path / image_name):
    continue  # skip: this image was already downloaded on a previous run
