-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
google: retry transient errors in DefaultTokenSource #203
Comments
Sounds good. Any suggestions about how many retries and within which time frame? |
The usual correct answer for this sort of thing seems to be randomized exponential backoff forever (capping the sleep time at a few seconds), until the context is cancelled or an error that doesn't appear to be transient is encountered. I don't see any obvious reason not to do that here. |
(Within Google see |
We cannot do it until forever because the client is not historically documented that it might be blocking forever if context is not cancelled, so it would be a breaking change.
gax has a retrier with backoff: https://godoc.org/github.com/googleapis/gax-go#WithRetry. |
I'm not sure I agree it's a breaking change. What's the difference between retrying transient errors for a long time, and an HTTP request to the metadata server that actually just takes a long time? If you decide to put a cap on the maximum time you spend retrying, how do you choose it? 18 hours might as well be forever, but is 30 minutes? At what point does it become compatible with the old behavior? The context is how the user has already specified how long they're willing to wait, and continues to be if you retry until the context is cancelled. Cool, the gax algorithm looks good (I'd suggest |
It matters because the metadata package has already a definite timeout which is 2 secs, see https://github.com/GoogleCloudPlatform/google-cloud-go/blob/master/compute/metadata/metadata.go#L67. Each metadata request is going through this package which is capped at 2 secs already.
Will reimplement, I was not looking closely and didn't see that the status codes were grpc-specific. |
Thanks, I didn't realize there was a timeout. I still think the context is the one true way for the user to express how long they want to wait for the blocking call, but capping the wait time to a small number times the max timeout would still be very helpful. |
As discussed in GoogleCloudPlatform/gcsfuse#187, the GCE metadata server is unreliable, which can cause DefaultTokenSource to return errors. Supposedly retrying works in these cases. (This is nothing but rumors to me; I haven't experienced this but am responsible for a product whose users have.)
It appears (googleapis/oauth2client#581) that the plan for the Python library is to add retries for at least "connection refused" errors. Perhaps the Go library should do the same?
The text was updated successfully, but these errors were encountered: