grpc connection closes immediately from client side when server is on AWS, but not in local docker #4279

Open
nicolascaiitec opened this issue Oct 2, 2024 · 8 comments
Labels
part: communication (Issues/PRs that affect federated communication, e.g. gRPC); stale (issue/PR hasn't been updated within 3 weeks)

Comments

@nicolascaiitec

nicolascaiitec commented Oct 2, 2024

What is your question?

I have a Docker container running locally that contains the server.

The client runs locally on my host machine, and when I connect it to the server everything works normally: fit, aggregation, and all the rounds are fine.

But when the Docker container runs on AWS ECS (as an ECS service), the server is listening, and when I try to connect one client, the client immediately closes the connection without an error:

DEBUG:flwr:Opened secure gRPC connection using certificates
DEBUG:flwr:ChannelConnectivity.IDLE
DEBUG:flwr:ChannelConnectivity.CONNECTING
DEBUG:flwr:ChannelConnectivity.READY
DEBUG:flwr:gRPC channel closed
INFO :      Disconnect and shut down
INFO:flwr:Disconnect and shut down

The server does not log anything. It just keeps listening.

When I try to connect the client to the server on AWS/ECS without passing certificates, it fails, and there is a log entry on AWS/ECS: E0000 00:00:1727877451.667359 36 ssl_transport_security.cc:1654] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

This suggests that the client is indeed able to reach the AWS server. But somehow, when certificates are passed, the client closes the connection immediately without any error.
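
One possible reading of the WRONG_VERSION_NUMBER entry, offered as an assumption rather than a confirmed diagnosis: a client without certificates opens a plaintext HTTP/2 connection, and the TLS-enabled server tries to parse the HTTP/2 connection preface as a TLS record header. The sketch below illustrates why those bytes cannot pass as a TLS record. The helper names `classify_first_bytes` and `tls_header_view` are hypothetical and not part of flwr or grpcio.

```python
# Hypothetical helpers for illustration only; not part of flwr or grpcio.
H2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"  # HTTP/2 client connection preface
TLS_HANDSHAKE = 0x16  # TLS record content type used for handshake messages


def classify_first_bytes(data: bytes) -> str:
    """Guess which protocol a client is speaking from its first bytes."""
    if data[:4] == H2_PREFACE[:4]:
        return "plaintext-http2"
    # A TLS record starts with a content type byte and a 0x03 major version.
    if len(data) >= 3 and data[0] == TLS_HANDSHAKE and data[1] == 0x03:
        return "tls"
    return "unknown"


def tls_header_view(data: bytes) -> tuple[int, int, int]:
    """Read bytes 0-2 the way a TLS record parser would: (type, major, minor)."""
    return data[0], data[1], data[2]


print(classify_first_bytes(H2_PREFACE))  # plaintext-http2
print(tls_header_view(H2_PREFACE))       # (80, 82, 73)
```

Read as a TLS header, the preface yields content type 80 and "version" 82.73 instead of a handshake record with major version 3, which is consistent with a version-number complaint from the TLS library. If this reading is right, the error only confirms that the client's bytes reach the server process; it does not explain the silent close when certificates are passed.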

I also tried not requiring certificates on the server and not passing them from the client. The output is the same.

This did not happen before I made some major changes to the code, including upgrading from flwr 1.3.0 to flwr 1.9.0.
I also tried the latest release, flwr 1.11.1, with the same result.

I do not know how to fix this. It was working fine with flwr 1.3.0. I then refactored the code to adapt to the classes of flwr 1.9.0 (fit, fitresponse, evaluate, evaluateResponse, etc.). It works fine in a Docker container that runs locally, built from the same image that is deployed on AWS/ECS.

I cannot understand what is wrong.

Any help is appreciated.

p.s.

Let me add some details:

  • The server keeps listening even though the one client immediately closes the connection (the minimum number of clients required for federation is 2 in this case).

  • It works fine with flwr 1.3.0 without touching any configuration or network settings in AWS ECS. The upgrade from flwr 1.3.0 to 1.9.0 changed nothing in the networking logic. I only adapted the fit and evaluate methods to use the new classes FitRes, EvaluateRes, FitIns, EvaluateIns, Parameters, etc. The only other change is:

  • fl.client.start_numpy_client (flwr 1.3.0) is now replaced by fl.client.start_client

Other than this, there are no network or config changes.

@PaulaDelgado-Santos

Hi, I am facing the same problem! Thanks

@Robert-Steiner
Member

Hey @nicolascaiitec,
great to see that you want to run the server on AWS ECS.

To resolve the issue, I need some more information:

To help troubleshoot the issue, you can try enabling gRPC trace logs by following the instructions in this link: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md#grpc_trace
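
As a minimal sketch of that suggestion: gRPC reads the `GRPC_VERBOSITY` and `GRPC_TRACE` environment variables, so they can be set from Python before `grpc` (or `flwr`, which imports it) is loaded. The function name `enable_grpc_tracing` is hypothetical, and the tracer names `http` and `tcp` are taken from gRPC's environment-variable documentation; which tracers are available and how verbose they are may differ between grpcio versions.

```python
import os


def enable_grpc_tracing(tracers: str = "http,tcp", verbosity: str = "DEBUG") -> dict:
    """Set gRPC's logging environment variables for the current process.

    Call this before importing grpc or flwr so the settings are in place
    when the gRPC core library initializes.
    """
    os.environ["GRPC_VERBOSITY"] = verbosity
    os.environ["GRPC_TRACE"] = tracers
    return {key: os.environ[key] for key in ("GRPC_VERBOSITY", "GRPC_TRACE")}


enable_grpc_tracing()
# import flwr as fl  # import only after the variables are set
```

The same variables can be set on the ECS task definition for the server container, so that both sides of the connection produce trace output.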

@nicolascaiitec
Author

@PaulaDelgado-Santos I actually solved it by downgrading flwr to version 1.6. I tried various versions; I needed XGBoost, and flwr 1.6 was a good compromise.

@Robert-Steiner
I am not using the official images. I just have a backend environment with the following specs:

Dockerfile:

FROM public.ecr.aws/docker/library/python:3.10.13-bookworm

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

This image runs the server. The clients use the same requirements.txt (with flwr 1.6.0).

The requirements.txt originally pinned flwr==1.11.0. I have now solved the problem with flwr 1.6.0: the client no longer closes the connection immediately without returning an error.

I think the problem may be related to the version of grpcio.
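
To test the grpcio hypothesis, it may help to record which versions are actually installed in the working (flwr 1.6.0) and failing (flwr 1.11.0) environments, on both the client and the server image. The sketch below uses only the standard library; the function name `installed_versions` is hypothetical, and including `protobuf` in the default list is an assumption about which dependencies are worth comparing.

```python
from importlib import metadata


def installed_versions(packages=("flwr", "grpcio", "protobuf")):
    """Return {package: version or None} for the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Running this in each environment and comparing the output would show whether the flwr downgrade also changed grpcio, which a pin on flwr alone in requirements.txt does not reveal.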

The certificates did not influence the problem. I tried removing the certificate requirements from both the server and the client; the problem was still there when deployed on AWS, while locally it worked fine.

@WilliamLindskog WilliamLindskog added stale If issue/PR hasn't been updated within 3 weeks. part: communication Issues/PRs that affect federated communication e.g. gRPC. labels Dec 11, 2024
@WilliamLindskog
Contributor

@Robert-Steiner has this been dealt with? If so, can we close it?

@WilliamLindskog
Contributor

Hi @nicolascaiitec,

Thank you for raising this. Have you tried setting it up with the newest flwr version, 1.14? Versions 1.6 and 1.3 are no longer supported, and launching the newer ClientApp and ServerApp on AWS should work.

Could you try with the official Docker images and let us know? If the problem still persists, could you please paste the whole set-up here?

@nicolascaiitec
Author

Hi, I tested it. It still does not work. With certificates, I got an SSL problem between the local client and the server on AWS. I removed the certificate requirements on both sides, and the server on AWS does not seem to receive anything, while the client immediately closes the connection:

DEBUG:flwr:Opened insecure gRPC connection (no certificates were passed)
DEBUG:flwr:ChannelConnectivity.IDLE
DEBUG:flwr:ChannelConnectivity.CONNECTING
DEBUG:flwr:ChannelConnectivity.READY
DEBUG:flwr:gRPC channel closed
INFO :      Disconnect and shut down
INFO:flwr:Disconnect and shut down

Locally, flwr 1.14.0 works fine. With the server deployed on AWS ECS, it does not work.

flwr 1.6.0 works fine.

When I have time, I will try to give more information to reproduce the problem.

@WilliamLindskog
Contributor

Hm, could you maybe test using this: https://flower.ai/docs/framework/docker/tutorial-deploy-on-multiple-machines.html? That might help.

Best regards
Will

@nicolascaiitec
Author

I am actually using the old way of starting client and server:

    import flwr as fl

    fl.client.start_client(  # start_numpy_client is deprecated
        server_address=server_address,
        client=...,
        root_certificates=ca_cert,
    )

and

import flwr as fl

hist = fl.server.start_server(
    server_address=...,
    server=MyServer(client_manager=MyClientManager(data), strategy=strategy),
    strategy=strategy,
    client_manager=...,
    config=fl.server.ServerConfig(num_rounds=num_rounds),
)

I am not using the SuperLink and SuperNode, which I have seen were added in the newer versions. Might that be the cause?

With SuperLink and SuperNode, I have problems reproducing the examples from the site locally: I get various dependency errors, or the command to run the federation after updating pyproject.toml gives me an error.
SuperLink and SuperNode still seem experimental, or meant only for simulations. We actually need more than simulations.
