Credentials timeout errors are deceptive #1118
Comments
Hi @elrob, it looks like a 5-second timeout kicked in before you observed the desired number of retries. Can you try disabling the timeout (using TimeoutConfig::disabled()) and passing it to ConfigLoader::timeout_config to see if the behavior matches your expectation?
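A minimal sketch of that suggestion, assuming a recent aws-config release and a tokio runtime (on older releases, `aws_config::from_env()` takes the place of `aws_config::defaults(...)`; the `aws_sdk_s3` client is just an illustrative example):

```rust
use aws_config::timeout::TimeoutConfig;
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    // Disable client-level timeouts so that only retry behavior is in play.
    let sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(TimeoutConfig::disabled())
        .load()
        .await;

    // Build a service client from the shared config as usual, e.g.
    // let client = aws_sdk_s3::Client::new(&sdk_config);
    let _ = sdk_config;
}
```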
This is an interesting one! There is a clue here as to what's going on: the TimedOutError in the connector doesn't have any fields, which means this is actually the timeout coming from the identity cache. That's the real problem here: we're trying to load credentials and it's timing out after 5 seconds (a pretty sensible timeout for credentials; if it takes longer than 5 seconds to load credentials, there are other problems). The error is a little misleading because it looks like a connector error even though we never sent the request (we should fix that), but the timeout here is coming from the credentials chain. That's why it isn't retried.

cc @ysaito1001, this appears to be the problematic line: https://github.com/smithy-lang/smithy-rs/blob/e3f0de42db727d5419948b03c7d5a3773b07e34b/rust-runtime/aws-smithy-runtime-api/src/client/orchestrator.rs#L273

We are incorrectly assuming that any dispatch failure is a connector error, which isn't the case. The error here was actually a failure to load credentials. We need to figure out how to fix the error type here.
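A rough illustration (not the SDK's own classification logic) of how a caller might inspect the error today to spot this situation; the `SdkError`/`DispatchFailure` method names are taken from recent aws-smithy-runtime-api versions and should be treated as a sketch:

```rust
use aws_smithy_runtime_api::client::result::SdkError;

// Report whether a DispatchFailure was flagged as a timeout. Per the
// discussion above, such a timeout can come from credentials loading rather
// than from the HTTP connector, since no request was ever sent.
fn explain_dispatch_failure<E, R>(err: &SdkError<E, R>) -> &'static str {
    match err {
        SdkError::DispatchFailure(f) if f.is_timeout() => {
            "timed out before the request was sent (possibly while loading credentials)"
        }
        SdkError::DispatchFailure(f) if f.is_io() => "I/O error while dispatching the request",
        SdkError::DispatchFailure(_) => "other dispatch failure",
        _ => "not a dispatch failure",
    }
}
```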
@ysaito1001 @rcoh My assumption remains the same, though: I'd expect the credentials provider to retry after a timeout, up to the configured number of retries. Or, if it isn't the credentials provider, I'd expect the client to retry after a timeout up to the configured number of retries. Perhaps the credentials provider's STS client needs a separately configured retry configuration? I actually looked for that but couldn't work out how to configure it.
Good eyes @rcoh! True, the error type is definitely something we want to clarify.

That depends on the underlying retry classifiers (those that implement the ClassifyRetry trait). For instance, if you do not get a response from STS and fail to load credentials, I don't think it gets retried regardless of the number of retries configured. Can you turn on verbose logging and share what you see?
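For reference, one common way to surface the SDK's internal retry and credential-loading logs in a test binary; this assumes a `tracing-subscriber` dependency with the `env-filter` feature, and the `RUST_LOG` targets shown are an assumption that may vary by SDK version:

```rust
fn init_logging() {
    // Honors RUST_LOG, e.g.
    //   RUST_LOG=aws_config=debug,aws_smithy_runtime=debug cargo run
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();
}
```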
Just wanted to clarify some things here. The timeout is at the top level of loading credentials (at the cache level). It is set to 5 seconds. This is configurable, but we don't really recommend messing with it, because credentials taking longer than 5 seconds to load is almost always indicative of other problems or anti-patterns. For example, this can occur if you are simultaneously creating a large number of clients.

You can configure this timeout to be longer, but the timeout is almost certainly not the problem here. I would dive deeper into where your credentials are coming from, how often they're being refreshed, etc. It's also not out of the question that you're triggering an SDK bug around credentials caching!
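For completeness, a rough sketch of raising that credentials-load timeout via the lazy identity cache builder. The `IdentityCache::lazy().load_timeout(...)` and `ConfigLoader::identity_cache` calls are assumed from recent aws-config/aws-smithy-runtime versions and may differ in older releases:

```rust
use std::time::Duration;

use aws_config::BehaviorVersion;
use aws_smithy_runtime::client::identity::IdentityCache;

#[tokio::main]
async fn main() {
    // Raise the credentials-load timeout from the 5s default to 10s.
    // (Per the comment above, prefer fixing slow credential sources instead.)
    let sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .identity_cache(
            IdentityCache::lazy()
                .load_timeout(Duration::from_secs(10))
                .build(),
        )
        .load()
        .await;
    let _ = sdk_config;
}
```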
Thanks. I haven't seen this issue appear in EKS but it appeared a few times for me when testing something out on my machine. I can live with it assuming it doesn't appear on EKS in production. |
I don't seem to see any retries in the logs, even with verbose logging enabled.
Leaving this open to make the error message less confusing when credentials providers time out (also tracking in smithy-lang/smithy-rs#2950). |
Describe the bug
Today, while testing, I've been seeing this error:
This is despite the fact that I have this configuration set:
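The original configuration snippet did not survive in this copy of the issue. Based on the description under "Expected Behavior" below (20 retries), it was presumably something along these lines; the exact values, the use of `aws_config::defaults`, and the surrounding runtime setup are assumptions:

```rust
use aws_config::retry::RetryConfig;
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    let sdk_config = aws_config::defaults(BehaviorVersion::latest())
        // max_attempts counts the initial attempt, so 21 ≈ 20 retries.
        .retry_config(RetryConfig::standard().with_max_attempts(21))
        .load()
        .await;
    let _ = sdk_config;
}
```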
Expected Behavior
I assume this error is transient, so it should be resolved by retries even if my local network is unstable?
Based on my RetryConfig, it should have retried 20 times, but I assume it didn't, given how quickly the error appears.
Current Behavior
Occasional errors like the one above, which appear immediately rather than after retrying up to 20 times.
Reproduction Steps
Maybe just configure a client as above and run a command with no network connection. The error appears quickly, and the configured retries don't seem to be honored.
Possible Solution
No response
Additional Information/Context
No response
Version