Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mysql-connect_retries_on_failure not effective to failures like connection refused #4597

Open
xykhappy opened this issue Jul 31, 2024 · 3 comments

Comments

@xykhappy
Copy link

For the hostgroup with a single backend, if the backend server goes down, the ProxySQL will return the error "Max connect timeout reached while reaching hostgroup" after the "mysql-connect_timeout_server_max" duration elapses if the backend is marked as "SHUNNED". In contrast, if establishing a direct connection without ProxySQL, the client will get a "connection refused" error or error code 111 immediately, which makes more sense.

After debugging the latest binary, it appears that ProxySQL keeps attempting to get a good connection at intervals controlled by "mysql-connect_retries_delay" until "mysql-connect_timeout_server_max" is reached according to the method MySQL_Session::handler_again___status_CONNECTING_SERVER.

bool MySQL_Session::handler_again___status_CONNECTING_SERVER(int *_rc) {

And the retry is not even controlled by "mysql-connect_retries_on_failure" (as it should be) because the process always falls into this condition in the scenario I mentioned above.
if (mybe->server_myds->myconn==NULL) {

If debug further, we will find the method MyHGC::get_random_MySrvC returns NULL all the time during the retry for this single "SHUNNED" backend.
MySrvC *MyHGC::get_random_MySrvC(char * gtid_uuid, uint64_t gtid_trxid, int max_lag_ms, MySQL_Session *sess) {

According to this method, the ProxySQL should attempt to bring the "SHUNNED" backend online but it fails all the time because the "mysrvc->time_last_detected_error" is always in the future related to "mysql-monitor_ping_interval".

					// if Monitor is enabled and mysql-monitor_ping_interval is
					// set too high, ProxySQL will unshun hosts that are not
					// available. For this reason time_last_detected_error will
					// be tuned in the future
					if (mysql_thread___monitor_enabled) {
						int a = mysql_thread___shun_recovery_time_sec;
						int b = mysql_thread___monitor_ping_interval;
						b = b/1000;
						if (b > a) {
							t = t + (b - a);
						}
					}
					mysrvc->time_last_detected_error = t;

I'm not sure whether the logic is intentionally designed this way. However, the retry time should be controlled by 'mysql-connect_retries_on_failure' rather than its current implementation. Additionally, I believe it would be more effective to return the exact failure error to the client.

@renecannao
Copy link
Contributor

Hi @xykhappy .

A premise of this issue:
if you get error 111 when trying to connect, that means that there is no process listening on that port.
111 is OS error code 111: Connection refused , that means that the 3 way TCP handshake couldn't happen .

The issue you are facing is actually a feature.
In fact, one of the main feature of ProxySQL is to handle backend failures and failover in a way that is completely transparent to the clients.

if the backend server goes down, the ProxySQL will return the error "Max connect timeout reached while reaching hostgroup" [...] In contrast, if establishing a direct connection without ProxySQL, the client will get a "connection refused" error or error code 111 immediately, which makes more sense

It doesn't really make sense once you understand the difference between the two setups.

If you are attempting directly to a backend, you get error 111 because there is no service.
That means that the connection was not established.

If the client connects to ProxySQL , the connection to the client's backend (ProxySQL) was successful.
Then, when the client tries to execute a query and there are no backend avaiable, ProxySQL is not able to execute the query.
Should you expect the client to see its connection (that was successfully created) being dropped?
For sure it can't get an error 111 , because 111 means Connection refused , and it is not possible to return to that stage.
Maybe (I would argue no) the client could receive a Lost connection .

If a client connects to ProxySQL , it runs a query successfully, and the backend is restarted, would you expect ProxySQL to drop the client's connection?
The way ProxySQL operates would allow you to restart a backend without any disruption to the client connections.

In a nutshell, you are trying to compare a direct connection to a backend to running a query through proxysql .
They are two completely different entities (connection vs query) in two completely different setups.

Furthermore, please remember that ProxySQL is a reverse proxy and not a forward proxy.
The client sends the query to the proxy , and the proxy will returns a result to the client: the client is completely unaware of the backends' involvement and status.

@xykhappy
Copy link
Author

xykhappy commented Jul 31, 2024

Hi @renecannao,

Thanks for the quick response!
I completely agree with you regarding the handling of backend failures. And I have updated the issue title to prevent any confusion.
To make my point more clear, I think the ProxySQL should offer users the same level of control to the retry logic of the "connection refused" error as it does to other failure types. In general, the ProxySQL should honor not only the "mysql-connect_retries_delay" but also "mysql-connect_retries_on_failure" when processing such error. In current implementation, it keeps retrying until "mysql-connect_timeout_server_max" duration elapses, which causes a slow failure.
In our scenario, during an Aurora failover, we expect the process to fail rapidly to prevent connection creation requests from accumulating on the client side and subsequently overwhelming the backend Aurora instance. Conversely, we require a relatively large "mysql-connect_timeout_server_max" and "mysql-connect_timeout_server" (around 5 seconds) to ensure the client can endure a certain level of performance degradation in Aurora. This presents a contradiction since both situations depend on the "mysql-connect_timeout_server_max" setting.

Hope it makes more sense this time.

@xykhappy xykhappy changed the title Get "Max connect timeout reached" rather than "connection refused" for single broken backend in a hostgroup mysql-connect_retries_on_failure not effective to failures like connection refused Jul 31, 2024
@xykhappy
Copy link
Author

Created the pull request for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants