grpc connection closes immediately from client side when server is on AWS, but not in local docker #4279
Comments
Hi, I am facing the same problem! Thanks
Hey @nicolascaiitec, To resolve the issue, I need some more information:
To help troubleshoot the issue, you can try enabling gRPC trace logs by following the instructions in this link: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md#grpc_trace
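For anyone following along, here is a minimal sketch of one way to turn on the trace logs mentioned above from Python. It assumes the `GRPC_VERBOSITY` and `GRPC_TRACE` environment variables described in the linked troubleshooting guide; the helper name `enable_grpc_tracing` is made up for illustration, and the amount of output may differ between grpcio versions.

```python
import os


def enable_grpc_tracing(tracers="all", verbosity="DEBUG"):
    """Set the gRPC logging variables for this process and return them.

    `tracers` is a comma-separated list of tracer names ("all" enables
    every tracer, which is very noisy).
    """
    settings = {"GRPC_VERBOSITY": verbosity, "GRPC_TRACE": tracers}
    os.environ.update(settings)
    return settings


# Call this before importing grpc or flwr, because the gRPC core reads
# these variables when it initializes.
enable_grpc_tracing()
```

The same effect can be had by exporting the two variables in the shell or in the container environment before starting the client or the server.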
@PaulaDelgado-Santos I actually solved it by downgrading flwr to version 1.6. I tried various versions; I needed XGBoost support, and flwr 1.6 was a good compromise. @Robert-Steiner, the Dockerfile:
This runs the server. The clients use the same requirements.txt (now with flwr 1.6.0). The requirements.txt originally had flwr==1.11.0, but I have now solved it with flwr 1.6.0: the connection no longer closes immediately on the client side without returning any error. I think the problem may be related to the version of grpcio. The certificates did not influence the problem: I tried removing the certificate requirements from both the server and the client, and the problem was still there when deployed on AWS, while locally it worked just fine.
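Since the suspicion above is a grpcio version difference, it may help to print the versions that are actually installed in each environment (local client, local container, ECS task) and compare them. This is only a sketch using the standard library; the helper name `installed_versions` and the choice of packages to inspect are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("flwr", "grpcio", "protobuf")):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running the same script on the client machine and inside the server container shows whether pinning flwr also changed the resolved grpcio version.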
@Robert-Steiner has this been dealt with? If so, can we close it?
Hi @nicolascaiitec, Thank you for raising this. Have you tried setting it up with the newest version? Could you try with the official Docker images and let us know? If the problem still persists, could you please paste the whole set-up here?
Hi, I tested. It still does not work. I tried with certificates and got an SSL problem, with the client running locally and the server on AWS. I removed the certificate requirements on both sides, and the server on AWS does not seem to receive anything, while the client immediately closes the connection:
Locally, flwr 1.14.0 works fine. With the server deployed on AWS ECS, it does not work. flwr 1.6.0 works fine. When I have time, I will try to give more information to reproduce the problem.
Hm, could you maybe test using this: https://flower.ai/docs/framework/docker/tutorial-deploy-on-multiple-machines.html? That might help. Best regards
I am actually using the old way of starting client and server:
and
Not the SuperLink and SuperNode, which I have seen added in the newer versions. Might this be the cause? With SuperLink and SuperNode I have problems reproducing the examples from the site locally: I get various dependency errors, or the command to run the federation after updating pyproject.toml gives me an error.
What is your question?
I have a Docker container running locally that contains the server.
The client runs locally on my host machine, and when I connect the client to the server it works normally. The fit, the aggregation, and all the rounds are fine.
But when the same Docker container runs on AWS ECS (as an ECS service), the server is listening, and then I try to connect one client. The client immediately closes the connection without error:
The server does not have any log. Just listening.
When I try to connect the client to the server on AWS/ECS without passing certificates, it fails and there is a log entry on AWS/ECS: E0000 00:00:1727877451.667359 36 ssl_transport_security.cc:1654] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
This suggests that the client is indeed able to reach the AWS server. But somehow, when I pass certificates, the client closes the connection immediately without any error.
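A note on the log above: WRONG_VERSION_NUMBER on the server side usually means the peer sent bytes that were not a TLS handshake, which is what one would expect when a client connects without certificates (plaintext) to a TLS-enabled server. To check independently of Flower which protocol an endpoint actually speaks, something like the following standard-library probe could be used. It is a rough diagnostic sketch: the function name `probe_tls` is made up, certificate verification is deliberately disabled, and it says nothing about whether the gRPC layer itself works.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Classify an endpoint as 'tls', 'not-tls', or 'unreachable'.

    'tls'         -> a TLS handshake completed
    'not-tls'     -> TCP connected, but the TLS handshake failed or timed out
    'unreachable' -> the TCP connection itself failed
    """
    context = ssl.create_default_context()
    # Diagnostic only: we want to know whether TLS is spoken at all,
    # not whether the certificate is valid.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"

    try:
        with context.wrap_socket(raw, server_hostname=host):
            return "tls"
    except OSError:  # ssl.SSLError and socket timeouts are OSError subclasses
        return "not-tls"
    finally:
        raw.close()
```

For example, `probe_tls("my-server.example.com", 8080)` (placeholder host and port) run from the client machine against both the local container and the ECS deployment would show whether the two endpoints behave differently at the TLS level, for instance because something in front of the ECS service handles the connection differently.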
I also tried not requiring certificates on the server and not passing them from the client. The output is the same.
This was not happening before I made some major changes to the code when upgrading from flwr 1.3.0 to flwr 1.9.0.
I also tried the latest flwr 1.11.1. The output is the same.
I do not know how to fix this. It was working just fine with flwr 1.3.0. I then did some refactoring to adapt to the classes of flwr 1.9.0 (fit, fitresponse, evaluate, evaluateResponse, etc.). It works fine in a Docker image/container that runs locally. The same image is used on AWS/ECS.
I cannot understand what is wrong.
Any help is appreciated.
p.s.
Let me add some details:
The server keeps listening, even though one client immediately closes the connection (the minimum number of clients required for the federation is 2 in this case).
It works fine with flwr 1.3.0, without touching any configuration or network settings in AWS ECS. The code upgrade from flwr 1.3.0 to 1.9.0 changed nothing in the networking logic. I have just adapted the fit and evaluate classes to use the new classes FitRes, EvaluateRes, FitIns, EvaluateIns, Parameters, etc. Apart from this, the only change is:
fl.client.start_numpy_client (flwr 1.3.0) ----> now using fl.client.start_client
Other than this, there are no network or config changes.