Credentials timeout errors are deceptive #1118

elrob · 2024-03-29T09:23:58Z

Describe the bug

Today, when testing I've been seeing this error:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: TimedOutError(5s), connection: Unknown } })

This is despite the fact that I have this configuration set:

    let aws_sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .retry_config(RetryConfig::standard().with_max_attempts(20))
        .load()
        .await;

Expected Behavior

I assume that this error is transient so should be resolved with retries even if my local network is unstable?
Or based on my RetryConfig, it should have retried 20 times but I assume it didn't based on how quickly the error appears.

Current Behavior

Occasional errors like this which seem to happen immediately rather than retrying up to 20 times.

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: TimedOutError(5s), connection: Unknown } })

Reproduction Steps

Maybe just configure a client as above and run a command with no network connection. The error appears quickly and retries don't seem to be obeyed.

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v1.1.9
│   ├── aws-credential-types v1.1.8
│   │   ├── aws-smithy-async v1.2.0
│   │   ├── aws-smithy-runtime-api v1.2.0
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-sigv4 v1.2.0
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.7
│   │   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-http v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-ssooidc v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-sts v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.7
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.1.8 (*)
├── aws-sdk-elasticloadbalancingv2 v1.19.0
│   ├── aws-credential-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-query v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.7 (*)
│   ├── aws-types v1.1.8 (*)



### Environment details (OS name and version, etc.)

macOS 13.6.3 (22G436)

### Logs

_No response_

The text was updated successfully, but these errors were encountered:

ysaito1001 · 2024-03-29T21:10:03Z

Hi @elrob, looks like a timeout of 5 seconds kicked in before you observed the desired number of retries. Can you try disabling timeout (using TimeoutConfig::disabled()) and pass it to ConfigLoader::timeout_config to see if the behavior matches your expectation?

rcoh · 2024-04-01T19:37:50Z

This is an interesting one! There is a clue here as to what's going on:

TimedOutError(5s)

the TimedOutError in the connector doesn't have any fields—this is actually the timeout coming from identity cache!

That's the actual problem here! We're trying to load credentials and it's timing out after 5 seconds (this is a pretty sensible timeout for credentials, if it's taking longer than 5 seconds to load credentials, there are other problems!)

The error is a little misleading—it looks like a connector error because we never sent the request (we should fix that), but the timeout here is coming from the credentials chain. That's why it's not retried.

cc @ysaito1001 this appears to be the problematic line: https://github.com/smithy-lang/smithy-rs/blob/e3f0de42db727d5419948b03c7d5a3773b07e34b/rust-runtime/aws-smithy-runtime-api/src/client/orchestrator.rs#L273

We are incorrectly assuming that any dispatch failure is a connector error which isn't the case. The error here was actually a failure to load credentials. We need to figure out how to fix the error type here.

elrob · 2024-04-02T01:20:51Z

@ysaito1001 @rcoh
Thank you!
Yeah, I also thought it might be coming from the credentials provider. There is some annoying flakey ~~malware~~ security software installed on my machine which often causes temporary connection issues.

My assumption remains the same though, I'd expect the credentials provider to retry after a timeout up to the configured number of retries. Or if it isn't the credentials provider, I'd expect the client to retry after a timeout up to the configured number of retries. Perhaps the credentials provider STS client needs a separately configured retry configuration? I actually looked for that but didn't work out how I could configure that.

ysaito1001 · 2024-04-03T18:41:37Z

Good eyes @rcoh! True, that's definitely something we want to clarify the error type for.

I'd expect the credentials provider to retry after a timeout up to the configured number of retries.

That depends on the underlying retry classifiers (those that implement the ClassifyRetry trait). For instance, if you do not get a response from STS and fail to load credentials, then I don't think it gets retried regardless of the number of retries configured. Can you turn on tracing_subscriber e.g. tracing_subscriber::fmt::init() and see whether an retry occurs or does not retry at all before you error out?

rcoh · 2024-04-04T10:52:57Z

Just wanted to clarify some things here. The timeout is on the top level of loading credentials (at the cache level). It is set to 5 seconds (this is configurable), but we don't really recommend messing with it.

Re: ClassifyRetry: Credentials retries do exist, but they're configured specifically for each provider and don't operate the same way as general-purpose retries (nor are they configurable in the same way). They will show up in the logs, however.

This is because credentials taking longer than 5 seconds to load almost always is indicative of other problems or anti patterns.

For example, this can occur if you are simultaneously creating a large number of clients or aws_config instances. There is currently an open issue around credentials being cached per client instead of on the root SdkConfig (but you still should not be creating a large number of clients!)

You can configure this timeout to be longer, but the timeout is almost certainly not the problem here. I would dive deeper into where your credentials are coming from, how often they're being refreshed, etc. — it's also not out of the question that you're triggering an SDK bug around credentials caching!

elrob · 2024-04-04T11:45:51Z

Thanks. I haven't seen this issue appear in EKS but it appeared a few times for me when testing something out on my machine. I can live with it assuming it doesn't appear on EKS in production.

elrob · 2024-04-04T11:52:07Z

I don't seem to see any retries in the logs even with RUST_LOG=trace.
This example happens locally on my machine if I disable my network connection (turn off wifi).
Whether or not this should be retryable is something I'll leave to you to decide.

{"timestamp":"2024-04-04T11:42:33.950270Z","level":"INFO","fields":{"message":"first credentials will be loaded from Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }","base":"Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:33.950301Z","level":"INFO","fields":{"message":"which will be used to assume a role","role_arn":"RoleArn { role_arn: \"arn:aws:iam::REDACTED\", external_id: None, session_name: Some(\"sso\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.227689Z","level":"WARN","fields":{"message":"failed to load base credentials","error":"an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }))"},"target":"aws_config::profile::credentials","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.228331Z","level":"WARN","fields":{"message":"provider failed to provide credentials","provider":"Profile","error":"an error occurred while loading credentials: an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }) }))"},"target":"aws_config::meta::credentials::chain","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}

jdisanti · 2024-04-09T20:43:50Z

Leaving this open to make the error message less confusing when credentials providers time out (also tracking in smithy-lang/smithy-rs#2950).

elrob added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 29, 2024

ysaito1001 removed the needs-triage This issue or PR still needs to be triaged. label Mar 29, 2024

joseluisjimenez1 mentioned this issue Apr 9, 2024

Sink(Elasticsearch): Dropping events with AWS auth strategy vectordotdev/vector#20266

Open

jdisanti mentioned this issue Apr 9, 2024

Make it obvious that errors in credentials providers originate in credentials providers smithy-lang/smithy-rs#2950

Open

jdisanti added the p2 This is a standard priority issue label Apr 9, 2024

jdisanti changed the title ~~Retries not working as expected?~~ Credentials timeout errors are deceptive Apr 9, 2024

xizhem mentioned this issue Apr 16, 2024

Upticks of S3 timeout errors after SDK upgrade #1130

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Credentials timeout errors are deceptive #1118

Credentials timeout errors are deceptive #1118

elrob commented Mar 29, 2024

ysaito1001 commented Mar 29, 2024

rcoh commented Apr 1, 2024 •

edited

Loading

elrob commented Apr 2, 2024

ysaito1001 commented Apr 3, 2024

rcoh commented Apr 4, 2024 •

edited

Loading

elrob commented Apr 4, 2024

elrob commented Apr 4, 2024

jdisanti commented Apr 9, 2024

Credentials timeout errors are deceptive #1118

Credentials timeout errors are deceptive #1118

Comments

elrob commented Mar 29, 2024

Describe the bug

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

Additional Information/Context

Version

ysaito1001 commented Mar 29, 2024

rcoh commented Apr 1, 2024 • edited Loading

elrob commented Apr 2, 2024

ysaito1001 commented Apr 3, 2024

rcoh commented Apr 4, 2024 • edited Loading

elrob commented Apr 4, 2024

elrob commented Apr 4, 2024

jdisanti commented Apr 9, 2024

rcoh commented Apr 1, 2024 •

edited

Loading

rcoh commented Apr 4, 2024 •

edited

Loading