Call to KDS 'put_records' fails intermittently with 'Connection reset by Peer' within lambda extension #1106
Comments
Is this happening when running the Lambda locally, or on a real Lambda function? There are apparently some issues with the local simulator.
Hey, thanks for the reply. This is happening with the extension deployed to AWS and attached to a Lambda running Node.js.
Update: I'm getting another error that seems potentially related:
After looking a bit, I found this issue, which seems to indicate that the problem may be related to connection pooling. I've tried to find whether anyone else has hit this with KDS / Lambda extensions, agnostic of the actual language, but some of these errors seem specific to the usage of. Are there any similar issues others have experienced that could lead to a resolution? I'm unsure what client-side logic would resolve this issue.
@rcoh hey there, just updating. I've added a bit more diagnostic information over on the other issue. There may be some nuances here between the lifecycle of the extension and the AWS SDK client connections used / reused for SDK calls. Do you see potential for conflicts there? Regarding the error from the above comment:
This was caused by some experimenting with the timeout_ms configuration on the extension's log buffer. I added a lot more technical info over on the issue I linked.
I had a similar issue in the past when pushing data to AWS S3. My findings:

Observed in Wireshark: both errors you mention (os error 104 "connection reset" and "incomplete message") are the same thing, caused by the server closing (resetting) the connection with an RST rather than a graceful FIN close. Hyper just throws a different error depending on a race over when and how it learns about the closure while attempting to flush data: os error 104 is thrown when the OS informs hyper of the closed connection while it is writing to the closed socket, while "incomplete message" is thrown when hyper learns of it while reading from the closed socket.

So the above is the normal workflow; the main task is to identify why the server is closing the connection. In my case I solved it as follows:

a) Connection pool: hyper's default idle-connection timeout is 90s, while for S3 the server-side timeout is actually around 20s. So when hyper picks an idle connection from the pool for the next request, it may already have been closed by the server, leading to those errors.

b) The server may also close connections when various policies are violated. I found that I was exceeding both the requests-per-second rate and the request-size-per-second rate. As soon as the rates were exceeded, the server started resetting connections at random, leading to these errors intermittently. I solved it by tuning the data in flight as well as the concurrent connection count (e.g. max idle pool size = 500), because even with data in flight limited, small requests in a high-bandwidth, low-latency scenario could still exceed the request-rate limit.

For me, this reduced the "incomplete message" error count from hundreds of thousands to almost zero.
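For context, the pool idle timeout and per-host idle cap described above are both exposed on hyper's client builder. Below is a minimal sketch, assuming hyper 0.14 and a plain-HTTP connector for brevity; how such a client gets wired into the AWS SDK depends on the SDK / smithy-client version, so this only illustrates the knobs, not the SDK integration.

```rust
use std::time::Duration;

use hyper::client::HttpConnector;

// Sketch only: a real client would wrap a TLS connector
// (hyper-rustls / hyper-tls) the same way.
fn build_pooled_client() -> hyper::Client<HttpConnector, hyper::Body> {
    hyper::Client::builder()
        // Drop idle connections before the server does (assumed ~20s server-side
        // idle timeout), so the pool never hands back an already-reset socket.
        .pool_idle_timeout(Duration::from_secs(15))
        // Cap idle connections kept per host; tune alongside request rate.
        .pool_max_idle_per_host(10)
        .build_http()
}
```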
@satyamagarwal249 thanks for providing some more data points! Do you have an example of how this is done? I have no results for
This ticket seems like a good analogous breakdown of the problem being experienced here. @rcoh, I can't quite narrow in on a fix for this problem. The other ticket in the Lambda extension repository has been closed because the issue seems to be related to the SDK library / the underlying hyper configuration for the client. That ticket is here, and it contains a lot of information that I'd rather not copy-paste into this one. This error pattern also seems to occur sometimes with the Node.js AWS SDK, and from research the fix there appears to be setting a. Is there any potential root cause / fix you're seeing with respect to the linked tickets or the input above?
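Not a root-cause fix, but one commonly used mitigation on the Rust side is to make sure the SDK's retry configuration can re-drive a request on a fresh connection after a transient reset. A hedged sketch, assuming aws-config's standard retry mode; the attempt count is arbitrary, and whether a given dispatch failure is classified as retryable depends on the SDK's retry classifier:

```rust
use aws_config::retry::RetryConfig;

#[tokio::main]
async fn main() {
    // Allow a few attempts so an intermittent "connection reset by peer" on
    // one pooled connection can be retried on a new one.
    let config = aws_config::from_env()
        .retry_config(RetryConfig::standard().with_max_attempts(5))
        .load()
        .await;

    let _kinesis = aws_sdk_kinesis::Client::new(&config);
    // ... put_records calls as in the linked example ...
}
```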
Describe the bug
Hey!
I've derived an example from this repository: HERE
But instead of pushing to Firehose, it pushes to KDS. See the minimal example.
I've added some extra logic to my version of the above code, where I provide custom credentials to the KDS client that's instantiated, but otherwise my implementation is mostly the same. Is there a common reason for the Connection reset by peer error? It seems like the extension doesn't spin up the logs processor unless I invoke my Lambda again, but this could just be because the async processing means any logs made in the Processor's call method aren't emitted until they're resolved. I've seen some calls to Kinesis succeed, but others fail unexpectedly with this error:
The above error is logged during a match on the result of the future that is pinned inside a Box in the example, expanded from this value HERE.
Please note that the error is intermittent: sometimes the call to KDS works, and other times it fails randomly.
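For reference, the call pattern in question is roughly the following. This is a hedged sketch, not the code from the linked repository: it assumes a recent aws-sdk-kinesis (module paths and builder fallibility differ across SDK versions), and the partition key and helper name are placeholders.

```rust
use aws_sdk_kinesis::{primitives::Blob, types::PutRecordsRequestEntry, Client};

// Hypothetical helper: push one log payload to the stream and surface the
// intermittent failure by matching on the send() result.
async fn push_log(
    client: &Client,
    stream_name: &str,
    payload: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    let entry = PutRecordsRequestEntry::builder()
        .data(Blob::new(payload))
        .partition_key("extension-logs") // placeholder partition key
        .build()?; // fallible builder in recent SDK versions

    match client
        .put_records()
        .stream_name(stream_name)
        .records(entry)
        .send()
        .await
    {
        Ok(_output) => {
            println!("put_records succeeded");
            Ok(())
        }
        // Intermittently this surfaces as SdkError::DispatchFailure with
        // "Connection reset by peer" (os error 104).
        Err(err) => Err(Box::new(err)),
    }
}
```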
I created an issue here in the lambda extension repository, but one of the maintainers mentioned this could be an issue with the SDK. I am thinking it may be a result of the lifecycle of the extension causing connection interference with the requests to KDS.
Any guidance would be much appreciated!
Expected Behavior
Lambda extension pushes logs to KDS with no issues
Current Behavior
Lambda extension fails to push logs to KDS on an intermittent / irregular basis
Reproduction Steps
https://github.com/dgarcia-collegeboard/aws-rust-lambda-extension-kinesis-example/blob/main/src/main.rs
The above code pushes to a KDS based on set env var
Possible Solution
No response
Additional Information/Context
Relevant issue link:
awslabs/aws-lambda-rust-runtime#837
Version
Environment details (OS name and version, etc.)
AWS NodeJS runtime for lambda
Logs
DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" } }), connection: Unknown } })