
Add adaptive retry logic in RCI call for non-terminal errors #4499

Open · wants to merge 2 commits into base: dev
Conversation

tshan2001 (Contributor) commented on Feb 11, 2025

Summary

Add retry with exponential backoff when receiving a non-terminal error from RCI (RegisterContainerInstance) calls, to prevent retry storms.

Implementation details

A wrapper, registerContainerInstanceWithRetry, is added around the original registerContainerInstance method. It uses the RetryWithBackoffCtx function from the retry package. Upon receiving a failure from RCI, we examine the error type to determine whether it is terminal: if so, we break out of the retry loop; otherwise, we retry with an increased backoff. The max backoff time is capped at ~3 minutes to ensure we don't wait too long between retries.
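For illustration, the wrapper could look roughly like the sketch below. retry.RetryWithBackoffCtx, client.rciRetryBackoff, and client.registerContainerInstance all appear in the diff excerpts later in this conversation; the method signature, the isTerminalRCIError helper, and the pattern of returning nil to break out of the retry loop while preserving the terminal error are assumptions made for this sketch, not necessarily the PR's exact code.

    // Sketch only. Identifiers not shown in the PR diff (the signature,
    // isTerminalRCIError) are illustrative assumptions.
    func (client *APIECSClient) registerContainerInstanceWithRetry(
        ctx context.Context) (string, string, error) {

        var containerInstanceARN, availabilityZone string
        var errFromRCI error
        err := retry.RetryWithBackoffCtx(ctx, client.rciRetryBackoff,
            func() error {
                containerInstanceARN, availabilityZone, errFromRCI =
                    client.registerContainerInstance(ctx)
                if errFromRCI != nil && isTerminalRCIError(errFromRCI) {
                    // Terminal error: return nil to stop retrying;
                    // errFromRCI still carries the failure to the caller.
                    return nil
                }
                // nil on success; a non-terminal error triggers another
                // attempt after an increased backoff delay.
                return errFromRCI
            })
        if err != nil {
            return "", "", err
        }
        return containerInstanceARN, availabilityZone, errFromRCI
    }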

Testing

A new test, TestRegisterContainerInstanceWithRetryNonTerminalError, has been added to cover both the happy and unhappy cases.

New tests cover the changes: yes

Description for the changelog

Add adaptive retry logic in RCI call for non-terminal errors.

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tshan2001 tshan2001 requested a review from a team as a code owner February 11, 2025 19:57
@tshan2001 tshan2001 force-pushed the master branch 2 times, most recently from 99cedbd to 34b63af on February 11, 2025 20:48
@tshan2001 tshan2001 requested a review from amogh09 February 11, 2025 21:53
    defer cancel()
    err := retry.RetryWithBackoffCtx(ctx, backoff,
        func() error {
            containerInstanceARN, availabilityZone, errFromRCI = client.registerContainerInstance(

Question: Don't we already use the default retryer from the AWS SDK under the hood in this call? That would mean that we perform additional retries on the actual network call.

Is this intentional?

tshan2001 (Contributor Author) replied:

Synced offline; adding a summary here. The AWS SDK retryer only retries 3 times, probably within 10 seconds in total, which is meant to address transient network issues. This change adds more control over the overall initialization workflow. Without it, after the default 3 retries the agent would exit and restart, and after around 3 seconds it would repeat the same process, so we are essentially retrying 3 times every ~15 seconds. We want more control over this process: the backoff we add can go up to 3 minutes, which helps alleviate systematic account-level throttling.
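For context, a backoff capped at ~3 minutes could be constructed along these lines. This is a sketch only: the NewExponentialBackoff constructor shape and the specific parameter values are assumptions; only the ~3 minute cap comes from the discussion above.

    // Sketch: parameter values are assumptions, not taken from the PR.
    rciRetryBackoff := retry.NewExponentialBackoff(
        time.Second,   // initial delay between RCI attempts (assumed)
        3*time.Minute, // max delay: never wait longer than ~3 minutes between retries
        0.2,           // jitter, to spread retries across agents (assumed)
        2.0,           // multiplier: roughly double the delay after each failure (assumed)
    )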

xxx0624 previously approved these changes on Feb 13, 2025
    err := retry.RetryWithBackoffCtx(ctx, client.rciRetryBackoff,
        func() error {
            // Reset the backoff such that retries from past calls won't impact the current call.
            client.rciRetryBackoff.Reset()

We want to reset it for every retry? Will we still get exponential backoffs? How?

tshan2001 (Contributor Author) replied:

Thanks for the catch! Published a new revision that moves the reset outside of the sub-function within the retry loop.
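As a sketch of the fix (identifiers from the diff; the surrounding code is assumed), Reset now runs once before the retry loop starts, so each RCI call begins with a fresh backoff while retries within that call still grow exponentially:

    // Reset once, before the retry loop, so backoff state from a previous
    // call doesn't shorten this call's delays.
    client.rciRetryBackoff.Reset()
    err := retry.RetryWithBackoffCtx(ctx, client.rciRetryBackoff,
        func() error {
            // The backoff now grows across retries within this call.
            containerInstanceARN, availabilityZone, errFromRCI =
                client.registerContainerInstance(ctx)
            return errFromRCI
        })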
