
Update retry algorithm to be more robust #6

Open
egrace479 opened this issue Jun 5, 2024 · 3 comments
Labels: enhancement (New feature or request), structure (Refactoring or architecture, general code organization)

Comments

@egrace479 (Member)

That being said, retry algorithms (at least robust ones) for internet protocols are normally written with an exponential escalation of wait time (such as 1, 2, 4, 8, 16, 32, ... seconds). In that case, a user may want to specify at which point to give up and log a failure, for example --max-retries and/or --max-wait.

Originally posted by @hlapp in #1 (comment)
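
For illustration, here's a minimal sketch of that escalation; the max_retries and max_wait parameters are hypothetical, mirroring the flags suggested above, and none of this is current cautious-robot behavior:

import time

import requests

def get_with_backoff(url, max_retries=5, max_wait=32):
    """Retry with exponentially escalating waits (1, 2, 4, 8, ... seconds)."""
    for attempt in range(max_retries + 1):
        response = requests.get(url)
        if response.status_code not in (429, 500, 502, 503, 504):
            return response
        if attempt < max_retries:
            time.sleep(min(2 ** attempt, max_wait))  # 1, 2, 4, ... capped at max_wait
    return None  # out of retries; the caller logs the failure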

Reply: It waits after a failed attempt if the response is any of the following: 429, 500, 502, 503, or 504. It doesn't wait between successful downloads. It retries the designated responses up to a maximum number of times; otherwise it just logs the response in the error log (along with the index, filename, and URL).

Setting a maximum wait time on a request would probably be a good idea as well. urllib3.request seems to handle much of this when passed a Retry object. @thompsonmj had also suggested HTTPAdapter as an option, which also uses Retry.

Seems reasonable to use HTTPAdapter, since cautious-robot is already using requests. We must also consider streaming interruption, as noted here.
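
As a sketch of what the HTTPAdapter-plus-Retry wiring could look like (the parameter values here are illustrative, not settled):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                                     # overall retry budget
    backoff_factor=1,                            # exponential waits between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # the responses retried today
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))
response = session.get("https://example.com/image.jpg")  # placeholder URL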

egrace479 added the enhancement and structure labels on Jun 5, 2024
@johnbradley

The requests HTTPAdapter with the urllib3 Retry strategy looks good for some of the retry needs. The streaming interruption will still need to be handled separately, though.
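
For the streaming side, a rough sketch of one way to handle it; restarting the whole file and the exact exception list are assumptions, not a settled design:

import requests

def download_stream(session, url, dest_path, attempts=3):
    """Re-request the whole file if the stream breaks partway through."""
    for _ in range(attempts):
        try:
            with session.get(url, stream=True, timeout=30) as response:
                response.raise_for_status()
                with open(dest_path, "wb") as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
            return True
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError,
                requests.exceptions.Timeout):
            continue  # stream was interrupted mid-body; start the download over
    return False  # exhausted attempts; the caller logs the failure

Passing in a session with the Retry adapter mounted would cover the initial response codes, while this loop covers mid-stream failures.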

@johnbradley

Sometimes when downloading files we end up reaching a threshold where our IP address gets blocked for a while by a remote server. In that case you typically have to wait a few hours, and I wouldn't expect or want the command to wait that long. Instead, can we re-run the cautious-robot command and have it skip already-downloaded images?

@egrace479 (Member, Author)

Sometimes when downloading files we end up reaching a threshold where our IP address gets blocked for a while by a remote server. In that case you typically have to wait a few hours, and I wouldn't expect or want the command to wait that long. Instead, can we re-run the cautious-robot command and have it skip already-downloaded images?

Right now I believe it relies on adjusting the start index to avoid re-downloading the image. However, I could add a line here checking for the image:

if os.path.exists(image_dir_path / image_name):
    continue  # skip: this image was already downloaded on a previous run
