Fix Error Handling For Connection Manager #507
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #507 +/- ##
==========================================
- Coverage 79.47% 79.46% -0.01%
==========================================
Files 27 27
Lines 11666 11674 +8
==========================================
+ Hits 9271 9277 +6
- Misses 2395 2397 +2
It looks like we don't provide an API for the user to cancel pending acquisitions.
It's okay that we don't check the error code here, but I'd expect some layer to check whether the error code is non-retriable and then kill all the pending acquisitions. We probably also need an API for the user to do so.
@TingDaoK One possible option is to just drop the connection manager if they want to kill all pending acquisitions. I thought about that API, but I think it would be better to just wait for the use case instead of adding an API that no one uses. The existing behavior of killing all pending acquisitions was a surprise and no one has asked for a way to kill pending acquisitions for years.
for (size_t i = 0; i < new_connection_failures &&
                   manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS];
Should manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] be an assert?
We just decreased AWS_HCMCT_PENDING_CONNECTIONS by new_connection_failures. If this condition happens, it means that at some point we had more AWS_HCMCT_PENDING_CONNECTIONS than pending_acquisition_count, which is probably wrong.
It is also confusing.
aws_mutex_unlock(&manager->lock);
if (has_pending_acquisitions) {
    s_aws_http_connection_manager_execute_transaction(&updated_work);
I am not a fan of this recursive call happening in the middle of the function.
Let's move it to the tail of the function.
https://www.geeksforgeeks.org/tail-recursion/
Also, it could lead to all the connections failing inside the recursion and only then the callbacks starting to fire, which does not seem to be expected.
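To illustrate the ordering concern, here is a minimal sketch, not the connection manager's code. The names model_log, model_record, model_execute_tail and model_execute_mid are hypothetical, and each "transaction" is reduced to recording its depth at the moment its callbacks would fire. It only shows how the position of the follow-up call changes the order in which callbacks run.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_EVENTS 16

/* Hypothetical event log: records which transaction's callbacks fired, in order. */
struct model_log {
    int events[MODEL_MAX_EVENTS];
    size_t count;
};

static void model_record(struct model_log *log, int event) {
    assert(log->count < MODEL_MAX_EVENTS);
    log->events[log->count++] = event;
}

/* Follow-up work in tail position, written as a loop: each transaction's
 * callbacks fire before the next transaction starts. */
static void model_execute_tail(struct model_log *log, int depth) {
    while (depth > 0) {
        model_record(log, depth); /* callbacks of the current transaction */
        --depth;                  /* follow-up transaction runs on the next iteration */
    }
}

/* Follow-up work in the middle of the function: the nested transaction runs
 * to completion first, so the outer transaction's callbacks fire last. */
static void model_execute_mid(struct model_log *log, int depth) {
    if (depth == 0) {
        return;
    }
    model_execute_mid(log, depth - 1); /* nested transaction, including its callbacks */
    model_record(log, depth);          /* outer callbacks only fire afterwards */
}
```

With three chained transactions, the tail version records 3, 2, 1 while the mid-function version records 1, 2, 3: the outermost caller hears back last, after everything nested has already failed or completed.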
/* fail acquisition as one connection cannot be used any more */
while (manager->pending_acquisition_count >
       manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS] + manager->pending_settings_count) {
if (manager->pending_acquisition_count > manager->internal_ref[AWS_HCMCT_PENDING_CONNECTIONS]) {
Should this be an assert?
I spent a while trying to figure out what those numbers are. We only start connecting when there is an acquisition from the user.
@@ -1218,14 +1217,18 @@ static void s_aws_http_connection_manager_execute_transaction(struct aws_connect
for (size_t i = 0; i < work->new_connections; ++i) {
    if (s_aws_http_connection_manager_new_connection(manager)) {
        ++new_connection_failures;
        representative_error = aws_last_error();
        int error = aws_last_error();
        if (push_errors) {
Let's just assert on the array list allocation failure. That code was probably written when the allocation could still fail and we wanted to handle it properly.
I guess if we don't provide the API now and there is a workaround, then people will just stick with the workaround and live with it.
Description of changes:
Our current connection manager will try to acquire max_connections, and if any connection acquisition request fails, it will fail all the excess connection acquisition requests with an error. For example, if max_connections is 50 and someone tries to acquire 1000 connections and connection acquisition fails for any reason, we will fail the 950 excess requests, plus the pending connection acquires, with the same error. So we might end up failing 999 connection requests just because the first connection acquisition failed. These failed requests count against the retry bucket, which results in emptying the bucket and leaking that error to the user.
Fix it so that we only fail N requests when we fail to acquire N connections, and don't fail everything just because one request failed.
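The difference can be sketched with a simplified model of the counters. This is an illustration under stated assumptions, not the manager's implementation: model_manager, model_fail_excess and model_fail_n are hypothetical names, pending_connect_count stands in for internal_ref[AWS_HCMCT_PENDING_CONNECTIONS], and pending_settings_count is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the manager's counters. */
struct model_manager {
    size_t pending_acquisition_count; /* acquisitions waiting for a connection */
    size_t pending_connect_count;     /* connection attempts still in flight */
};

/* Previous behavior as described above: after a failure, fail every
 * acquisition that is not backed by an in-flight connection attempt.
 * Returns the number of acquisitions failed. */
static size_t model_fail_excess(struct model_manager *m, size_t new_connection_failures) {
    assert(new_connection_failures <= m->pending_connect_count);
    m->pending_connect_count -= new_connection_failures;

    size_t failed = 0;
    while (m->pending_acquisition_count > m->pending_connect_count) {
        --m->pending_acquisition_count;
        ++failed;
    }
    return failed;
}

/* Proposed behavior: fail at most one acquisition per failed connection
 * attempt, leaving the rest queued. Returns the number of acquisitions failed. */
static size_t model_fail_n(struct model_manager *m, size_t new_connection_failures) {
    assert(new_connection_failures <= m->pending_connect_count);
    m->pending_connect_count -= new_connection_failures;

    size_t failed = 0;
    for (size_t i = 0;
         i < new_connection_failures && m->pending_acquisition_count > m->pending_connect_count;
         ++i) {
        --m->pending_acquisition_count;
        ++failed;
    }
    return failed;
}
```

In this model, with 1000 queued acquisitions and 50 connection attempts in flight, a single failed attempt makes model_fail_excess fail 951 acquisitions at once, while model_fail_n fails exactly one and leaves 999 queued.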
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.