Fix Error Handling For Connection Manager #507

waahm7 · 2025-03-19T21:31:21Z

Description of changes:
Our current connection manager tries to acquire up to max_connections connections, and if any connection attempt fails, it fails all the excess acquisition requests with that error. For example, if max_connections is 50 and a caller tries to acquire 1000 connections, a single failed connection attempt fails the 950 excess requests and also the pending acquisitions, all with the same error. So we can end up failing 999 connection requests just because the first connection attempt failed. These failed requests count against the retry bucket, drain it, and leak the error to the user.

Fix it so that we fail only N requests when we fail to acquire N connections, instead of failing everything because one request failed.
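The difference in accounting can be sketched in isolation. This is a simplified model and not the code in source/connection_manager.c: the struct `pool_model` and the functions `fail_count_old` and `fail_count_new` are hypothetical names, and the model ignores details such as pending settings.

```c
#include <stddef.h>

/* Hypothetical, simplified model of the manager's counters. */
struct pool_model {
    size_t pending_acquisitions; /* requests waiting for a connection */
    size_t pending_connections;  /* connection attempts still in flight */
};

/* Old behavior as described above (simplified): once any attempt fails,
 * fail every acquisition that is not backed by an in-flight attempt. */
static size_t fail_count_old(const struct pool_model *m) {
    if (m->pending_acquisitions > m->pending_connections) {
        return m->pending_acquisitions - m->pending_connections;
    }
    return 0;
}

/* New behavior: fail at most one acquisition per failed attempt, and never
 * more than the number of acquisitions without an in-flight attempt. */
static size_t fail_count_new(const struct pool_model *m, size_t new_connection_failures) {
    size_t pending = m->pending_acquisitions;
    size_t failed = 0;
    for (size_t i = 0; i < new_connection_failures && pending > m->pending_connections; ++i) {
        --pending;
        ++failed;
    }
    return failed;
}
```

In this model, with 1000 pending acquisitions and 49 attempts still in flight after one failure, `fail_count_old` reports 951 failed requests while `fail_count_new` with one failure reports 1.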

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov-commenter · 2025-03-20T15:53:22Z

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.46%. Comparing base (e3a9cab) to head (29ac206).

Files with missing lines	Patch %	Lines
source/connection_manager.c	88.23%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #507      +/-   ##
==========================================
- Coverage   79.47%   79.46%   -0.01%     
==========================================
  Files          27       27              
  Lines       11666    11674       +8     
==========================================
+ Hits         9271     9277       +6     
- Misses       2395     2397       +2


TingDaoK

Looks like we don't provide an API for the user to cancel pending acquisitions.
It's okay that we don't check the error code, but I'd expect some layer to check whether the error code is non-retriable and then kill all the pending acquisitions. We probably need an API for the user to do so.

waahm7 · 2025-03-24T20:32:28Z

@TingDaoK One possible option is to just drop the connection manager if they want to kill all pending acquisitions. I thought about that API, but I think it would be better to just wait for the use case instead of adding an API that no one uses. The existing behavior of killing all pending acquisitions was a surprise and no one has asked for a way to kill pending acquisitions for years.

TingDaoK · 2025-03-25T20:22:56Z

source/connection_manager.c

+        for (size_t i = 0; i < new_connection_failures &&
+                           manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS];


Should manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] be an assert?

We just decreased AWS_HCMCT_PENDING_CONNECTIONS by new_connection_failures. If this condition ever fails, it means that at some point we had more AWS_HCMCT_PENDING_CONNECTIONS than pending_acquisition_count, which is probably wrong.

And it is confusing.
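One way to read this suggestion is sketched below. It is an illustration under stated assumptions, not the manager's code: `struct counters` and `fail_acquisitions_asserting` are hypothetical names, and the sketch assumes every in-flight connection attempt was started on behalf of a waiting acquisition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters discussed in this comment. */
struct counters {
    size_t pending_acquisition_count; /* acquisitions waiting for a connection */
    size_t pending_connections;       /* connection attempts still in flight */
};

/* Fails one acquisition per failed attempt. The relation between the two
 * counters is asserted as an invariant instead of being tested in the loop
 * condition. Returns the number of acquisitions failed. */
static size_t fail_acquisitions_asserting(struct counters *c, size_t new_connection_failures) {
    /* The failed attempts are no longer in flight. */
    assert(c->pending_connections >= new_connection_failures);
    c->pending_connections -= new_connection_failures;

    size_t failed = 0;
    for (size_t i = 0; i < new_connection_failures; ++i) {
        /* Assumed invariant: there is still a waiting acquisition that is
         * not covered by an in-flight attempt. */
        assert(c->pending_acquisition_count > c->pending_connections);
        --c->pending_acquisition_count;
        ++failed;
    }
    return failed;
}
```

If the assumption does not hold in the real manager, the assert would fire in debug builds, which is the point of making the expectation explicit.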

TingDaoK · 2025-03-25T20:26:27Z

source/connection_manager.c

        aws_mutex_unlock(&manager->lock);
+        if (has_pending_acquisitions) {
+            s_aws_http_connection_manager_execute_transaction(&updated_work);


I am not a fan of this recursive call happening in the middle of the function.
Let's move it to the tail of the function.
https://www.geeksforgeeks.org/tail-recursion/

Also, it could lead to all the connections failing inside the recursion before any callbacks start to fire, which does not seem to be expected.
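The ordering concern can be shown with a toy trace. This is only a sketch: `struct trace`, `execute_mid_recursion`, and `execute_tail_loop` are hypothetical names, "C" stands for a pass that attempts connections, and "F" stands for firing that pass's completion callbacks.

```c
#include <stddef.h>
#include <string.h>

#define TRACE_CAP 64

/* Records the order of steps; initialize buf to an empty string before use. */
struct trace {
    char buf[TRACE_CAP];
};

static void trace_add(struct trace *t, const char *step) {
    strncat(t->buf, step, TRACE_CAP - strlen(t->buf) - 1);
}

/* Follow-up transaction started from the middle of the function: every
 * inner pass attempts its connections before the outer pass has fired its
 * callbacks. Expects passes >= 1. */
static void execute_mid_recursion(struct trace *t, int passes) {
    trace_add(t, "C");
    if (passes > 1) {
        execute_mid_recursion(t, passes - 1);
    }
    trace_add(t, "F");
}

/* Follow-up moved to the tail, written here as a loop: each pass fires its
 * callbacks before the next pass begins. */
static void execute_tail_loop(struct trace *t, int passes) {
    while (passes-- > 0) {
        trace_add(t, "C");
        trace_add(t, "F");
    }
}
```

With three passes, the mid-function version records "CCCFFF" and the tail version records "CFCFCF".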

TingDaoK · 2025-03-25T20:42:24Z

source/connection_manager.c

-        /* fail acquisition as one connection cannot be used any more */
-        while (manager->pending_acquisition_count >
-               manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] + manager->pending_settings_count) {
+        if (manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS]) {


Assert?
I spent a while trying to figure out what those numbers are. We only start connecting when there is an acquisition from the user.

TingDaoK · 2025-03-25T20:46:11Z

source/connection_manager.c

@@ -1218,14 +1217,18 @@ static void s_aws_http_connection_manager_execute_transaction(struct aws_connect
    for (size_t i = 0; i < work->new_connections; ++i) {
        if (s_aws_http_connection_manager_new_connection(manager)) {
            ++new_connection_failures;
-            representative_error = aws_last_error();
+            int error = aws_last_error();
            if (push_errors) {


Let's just assert on the array list allocation failure. That code was probably written when the allocation could still fail and we wanted to handle it properly.

TingDaoK · 2025-03-26T16:43:45Z

> @TingDaoK One possible option is to just drop the connection manager if they want to kill all pending acquisitions. I thought about that API, but I think it would be better to just wait for the use case instead of adding an API that no one uses. The existing behavior of killing all pending acquisitions was a surprise and no one has asked for a way to kill pending acquisitions for years.

I guess if we don't provide the API now, and there is a workaround then people will just stick with the workaround and live with it.

waahm7 added 8 commits March 19, 2025 12:20

add tests with async connect

2e74baf

fix the test

616bd70

fix win warning

1f84ecb

Move extra step back

ad3ef27

online complete pending acqusition if there is one

784d5f9

just assert that acquisitions are still there

9afe70c

has pending acquires should happen after failing

e92bf1c

need to cleanup the work if we don't execute it

77edbff

rename

29ac206

TingDaoK reviewed Mar 24, 2025

TingDaoK reviewed Mar 25, 2025