Fix Error Handling For Connection Manager #507

Open
wants to merge 9 commits into
base: main

Conversation

waahm7
Contributor

@waahm7 waahm7 commented Mar 19, 2025

Description of changes:
Our current connection manager tries to acquire up to max_connections connections, and if any connection acquisition fails, it fails all of the excess acquisition requests with that error. For example, if max_connections is 50 and someone tries to acquire 1000 connections, a single failed connection acquisition fails the 950 excess requests and also the pending connection acquisitions, all with the same error. So we might end up failing 999 acquisition requests just because the first connection acquisition failed. These failures count against the retry bucket, which can empty the bucket and leak the error to the user.

Fix this so that we fail only N requests when we fail to acquire N connections, instead of failing everything because one request failed.
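
To make the intended accounting concrete, here is a minimal sketch of the "N failures fail at most N acquisitions" rule. It is not the code from this PR: `sketch_manager`, its two counters, and `sketch_fail_one_per_failure` are hypothetical stand-ins, and the real manager tracks more state than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the connection manager's counters. */
struct sketch_manager {
    size_t pending_acquisition_count; /* acquisitions waiting for a connection */
    size_t pending_connects_count;    /* connection attempts still in flight */
};

/*
 * Account for `new_connection_failures` failed connection attempts.
 * Each failure fails at most one pending acquisition, and only while there
 * are more pending acquisitions than in-flight connection attempts.
 * Returns the number of acquisitions that were failed.
 */
size_t sketch_fail_one_per_failure(struct sketch_manager *m, size_t new_connection_failures) {
    assert(new_connection_failures <= m->pending_connects_count);
    m->pending_connects_count -= new_connection_failures;

    size_t failed = 0;
    for (size_t i = 0; i < new_connection_failures &&
                       m->pending_acquisition_count > m->pending_connects_count;
         ++i) {
        --m->pending_acquisition_count; /* this acquisition completes with the error */
        ++failed;
    }
    return failed;
}
```

With 1000 pending acquisitions, 50 attempts in flight, and one failed attempt, this model fails exactly one acquisition and leaves 999 waiting.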

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-commenter

codecov-commenter commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.46%. Comparing base (e3a9cab) to head (29ac206).

Files with missing lines Patch % Lines
source/connection_manager.c 88.23% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #507      +/-   ##
==========================================
- Coverage   79.47%   79.46%   -0.01%     
==========================================
  Files          27       27              
  Lines       11666    11674       +8     
==========================================
+ Hits         9271     9277       +6     
- Misses       2395     2397       +2     


Contributor

@TingDaoK TingDaoK left a comment

Looks like we don't provide an API for the user to cancel pending acquisitions.
It's okay that we don't check the error code, but I'd expect some level to check whether the error code is non-retriable and then kill all the pending acquisitions? And we probably need some API for the user to do so.

@waahm7
Contributor Author

waahm7 commented Mar 24, 2025

@TingDaoK One possible option is to just drop the connection manager if they want to kill all pending acquisitions. I thought about that API, but I think it would be better to just wait for the use case instead of adding an API that no one uses. The existing behavior of killing all pending acquisitions was a surprise and no one has asked for a way to kill pending acquisitions for years.

Comment on lines +1241 to +1242
for (size_t i = 0; i < new_connection_failures &&
manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS];
Contributor

Should manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] be an assert?

We just decreased AWS_HCMCT_PENDING_CONNECTIONS by new_connection_failures. If this condition is ever false, it means that at some point we had more AWS_HCMCT_PENDING_CONNECTIONS than pending_acquisition_count, which is probably wrong.

And it is confusing.
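
A rough sketch of what the assert variant could look like, assuming the invariant described in this comment really holds (pending connections never exceed pending acquisitions). `sketch_counts` and `sketch_fail_with_assert` are hypothetical names, and `pending_connections` stands in for `internal_ref[AWS_HCMCT_PENDING_CONNECTIONS]`; plain `assert` is used here rather than any library macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters under discussion. */
struct sketch_counts {
    size_t pending_acquisition_count;
    size_t pending_connections; /* stands in for internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] */
};

/*
 * Fail exactly one pending acquisition per failed connection attempt.
 * The loop condition from the diff becomes an assertion: if connections are
 * only started on behalf of acquisitions, it can never be false here.
 */
size_t sketch_fail_with_assert(struct sketch_counts *c, size_t new_connection_failures) {
    /* Assumed invariant before accounting for the failures. */
    assert(c->pending_connections <= c->pending_acquisition_count);
    assert(new_connection_failures <= c->pending_connections);

    c->pending_connections -= new_connection_failures;

    size_t failed = 0;
    for (size_t i = 0; i < new_connection_failures; ++i) {
        /* Follows from the invariant above, so a violation signals a bug. */
        assert(c->pending_acquisition_count > c->pending_connections);
        --c->pending_acquisition_count;
        ++failed;
    }
    return failed;
}
```

If the invariant can legitimately be violated, for example when a pending acquisition is served by some other path while connects are still in flight, the runtime check in the loop condition is the safer choice.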

aws_mutex_unlock(&manager->lock);
if (has_pending_acquisitions) {
s_aws_http_connection_manager_execute_transaction(&updated_work);
Contributor

I am not a fan of this recursive call happening in the middle of the function.
Let's move it to the tail of the function.
https://www.geeksforgeeks.org/tail-recursion/

Also, it could lead to all the connection failures happening inside the recursion before the callbacks start to fire, which does not seem to be expected.
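
One way to read this suggestion is to finish everything for the current pass, callbacks included, and only then start the follow-up work, which turns the recursion into a loop. The sketch below is a toy model under that assumption; `sketch_state`, `sketch_run_one_pass`, and `sketch_execute_transaction` are hypothetical and do not mirror the real `s_aws_http_connection_manager_execute_transaction`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy state; the real manager tracks far more than this. */
struct sketch_state {
    size_t pending_acquisitions;  /* acquisitions still waiting */
    size_t connections_in_use;    /* successful connections handed out */
    size_t max_connections;
    size_t failing_attempts_left; /* test knob: how many upcoming attempts fail */
    size_t failure_callbacks;     /* failure callbacks fired so far */
    size_t passes;                /* completed passes */
};

/*
 * Run one pass: try `new_connections` connects, fire this pass's callbacks,
 * and return how many replacement connects the next pass should attempt.
 */
size_t sketch_run_one_pass(struct sketch_state *s, size_t new_connections) {
    assert(s->connections_in_use + new_connections <= s->max_connections);

    for (size_t i = 0; i < new_connections && s->pending_acquisitions > 0; ++i) {
        --s->pending_acquisitions;
        if (s->failing_attempts_left > 0) {
            --s->failing_attempts_left;
            ++s->failure_callbacks; /* one failed attempt fails one acquisition */
        } else {
            ++s->connections_in_use; /* acquisition served by a new connection */
        }
    }
    ++s->passes; /* all callbacks for this pass are done at this point */

    size_t free_slots = s->max_connections - s->connections_in_use;
    return free_slots < s->pending_acquisitions ? free_slots : s->pending_acquisitions;
}

/* Follow-up work starts only after the previous pass has fully completed. */
void sketch_execute_transaction(struct sketch_state *s, size_t new_connections) {
    while (new_connections > 0) {
        new_connections = sketch_run_one_pass(s, new_connections);
    }
}
```

In this shape the callbacks from one pass always fire before the next pass starts connecting, and the call stack does not grow with the number of retries.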

/* fail acquisition as one connection cannot be used any more */
while (manager->pending_acquisition_count >
manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] + manager->pending_settings_count) {
if (manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS]) {
Contributor

Assert?
I spent a while trying to figure out what those numbers are. We only start connecting when there is some acquisition from the user.

@@ -1218,14 +1217,18 @@ static void s_aws_http_connection_manager_execute_transaction(struct aws_connect
for (size_t i = 0; i < work->new_connections; ++i) {
if (s_aws_http_connection_manager_new_connection(manager)) {
++new_connection_failures;
representative_error = aws_last_error();
int error = aws_last_error();
if (push_errors) {
Contributor

Let's just assert on the array list allocation failure. That code was probably written when the allocation could still fail and we wanted to handle it properly.

@TingDaoK
Contributor

> @TingDaoK One possible option is to just drop the connection manager if they want to kill all pending acquisitions. I thought about that API, but I think it would be better to just wait for the use case instead of adding an API that no one uses. The existing behavior of killing all pending acquisitions was a surprise and no one has asked for a way to kill pending acquisitions for years.

I guess if we don't provide the API now and there is a workaround, people will just stick with the workaround and live with it.

3 participants