grpc connection closes immediately from client side when server is on AWS, but not in local docker #4279

nicolascaiitec · 2024-10-02T14:19:05Z

What is your question?

I have a docker container running locally that contains the server.

The client runs locally on my host machine, and when I connect it to the server everything works normally: fit, aggregation, and all the rounds complete fine.

But when I run the same Docker container on AWS ECS (as an ECS service), the server is listening, and when I try to connect one client, the client immediately closes the connection without any error:

DEBUG:flwr:Opened secure gRPC connection using certificates
DEBUG:flwr:ChannelConnectivity.IDLE
DEBUG:flwr:ChannelConnectivity.CONNECTING
DEBUG:flwr:ChannelConnectivity.READY
DEBUG:flwr:gRPC channel closed
INFO :      Disconnect and shut down
INFO:flwr:Disconnect and shut down

The server does not log anything. It just keeps listening.

When I try to connect the client to the server on AWS/ECS without passing certificates, it fails, and this log appears on AWS/ECS: E0000 00:00:1727877451.667359 36 ssl_transport_security.cc:1654] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

This suggests that the client is indeed able to reach the AWS server. But somehow, when certificates are passed, the client closes the connection immediately without any error.

I also tried not requiring certificates on the server and not passing them from the client. The output is the same.
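One way to check this independently of Flower is to test whether the endpoint completes a TLS handshake at all. The following is a minimal sketch using only the Python standard library; the function name `probe_tls` is hypothetical, and certificate verification is deliberately disabled because the goal is only to see whether the port speaks TLS, not to validate the certificate.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Return the negotiated TLS version (e.g. 'TLSv1.3'), or None if the
    TCP connection or the TLS handshake fails.

    Hypothetical diagnostic helper, not part of flwr or grpcio.
    """
    ctx = ssl.create_default_context()
    # Verification is switched off on purpose: this only checks whether
    # the endpoint answers with TLS, not whether its certificate is valid.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version()
    except OSError:
        # ssl.SSLError is a subclass of OSError, so this covers refused
        # connections, timeouts, and failed handshakes alike.
        return None
```

If this returns a TLS version for the address the client connects to, something on that address is answering with TLS; if it returns None, the handshake never completes there.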

This was not happening before I made some major changes to the code, upgrading from flwr 1.3.0 to flwr 1.9.0.
I also tried the latest flwr 1.11.1, with the same output.

I do not know how to fix this. It was working just fine with flwr 1.3.0. I then did some refactoring to adapt to the classes of flwr 1.9.0 (fit, FitRes, evaluate, EvaluateRes, etc.). It works fine in a Docker container that runs locally, built from the same image that is deployed on AWS/ECS.

I cannot understand what is wrong.

Any help is appreciated.

p.s.

Let me add some details:

The server keeps listening, even though the one client immediately closes the connection (the minimum number of clients required for the federation is 2 in this case).
It works fine with flwr 1.3.0, without touching any configuration or network settings in AWS ECS. The code upgrade from flwr 1.3.0 to 1.9.0 changed nothing in the networking logic. I only adapted fit and evaluate to use the new classes FitRes, EvaluateRes, FitIns, EvaluateIns, Parameters, etc. The only other change is:
fl.client.start_numpy_client (flwr 1.3.0) ----> now using fl.client.start_client

Apart from this, there are no network or configuration changes.

PaulaDelgado-Santos · 2024-10-07T10:14:05Z

Hi, I am facing the same problem! Thanks

Robert-Steiner · 2024-10-07T13:35:06Z

Hey @nicolascaiitec,
great to see that you want to run the server on AWS ECS.

To resolve the issue, I need some more information:

What Docker images are you using? Are you using the official flwr images?
Are you persisting the state or using the in-memory database?
Did you generate the certificates using the script at https://github.com/adap/flower/blob/main/dev/certificates/generate.sh?

To help troubleshoot the issue, you can try enabling gRPC trace logs by following the instructions in this link: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md#grpc_trace
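As a rough sketch of what that could look like from Python: gRPC reads the `GRPC_VERBOSITY` and `GRPC_TRACE` environment variables, so they need to be set before `grpc` (and therefore `flwr`) is imported. The helper name `enable_grpc_trace` and the default tracer selection below are assumptions for illustration; the linked troubleshooting page lists the tracers that are actually available.

```python
import os


def enable_grpc_trace(tracers=("tcp", "http", "handshaker"),
                      verbosity="debug", env=None):
    """Set the gRPC logging environment variables.

    Hypothetical helper: the tracer names are an assumed selection and
    should be checked against the gRPC troubleshooting documentation.
    Pass a dict as `env` to build the variables without touching
    os.environ (useful when launching a subprocess).
    """
    env = os.environ if env is None else env
    env["GRPC_VERBOSITY"] = verbosity
    env["GRPC_TRACE"] = ",".join(tracers)
    return env


# Call this at the very top of the client script, before importing flwr:
# enable_grpc_trace()
# import flwr as fl
```

The same two variables can also be exported in the shell or set in the ECS task definition, which avoids changing the code at all.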

nicolascaiitec · 2024-10-14T06:22:20Z

@PaulaDelgado-Santos I actually solved it by downgrading flwr to version 1.6. I tried various versions; I needed XGBoost, and flwr 1.6 was a good compromise.

@Robert-Steiner
I am not using the official images. I just have a backend environment with the following specs:

dockerfile:

FROM public.ecr.aws/docker/library/python:3.10.13-bookworm

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

This image runs the server. The clients use the same requirements.txt (now with flwr 1.6.0).

requirements.txt had flwr==1.11.0 when the problem occurred. With flwr 1.6.0 the problem is solved: the client no longer closes the connection immediately without returning any error.

I think the problem may be related to the grpcio version.

The certificates did not influence the problem. I tried removing the certificate requirements from both the server and the client; the problem was still there when deployed on AWS, while locally it worked just fine.
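To compare what pip actually resolved in the working and failing environments, a small check like the following can be run in both the server container and the client. This is only a sketch using the standard library; the function name `installed_versions` is hypothetical, and including protobuf in the list is an assumption about which dependencies might matter.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("flwr", "grpcio", "protobuf")):
    """Map each distribution name to its installed version, or None if
    the package is not installed in this environment."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}")
```

Since requirements.txt pins only flwr, the grpcio version is chosen by pip at build time, so the output for flwr 1.6.0 and flwr 1.11.0 may differ in more than the flwr line.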

WilliamLindskog · 2025-01-24T19:35:39Z

@Robert-Steiner has this been dealt with? If so, can we close it?

WilliamLindskog · 2025-01-29T12:59:04Z

Hi @nicolascaiitec,

Thank you for raising this. Have you tried setting it up with the newest flwr version (1.14.x)? Versions 1.6 and 1.3 are no longer supported, and launching the newer ClientApp and ServerApp on AWS should work.

Could you try with the official Docker images and let us know? If the problem still persists, could you please paste the whole set-up here?

nicolascaiitec · 2025-01-31T16:09:30Z

Hi, I tested it. It still does not work. With certificates, I got an SSL problem between the local client and the server on AWS. I removed the certificate requirements on both sides, and the server on AWS seems not to receive anything, while the client immediately closes the connection:

DEBUG:flwr:Opened insecure gRPC connection (no certificates were passed)
DEBUG:flwr:ChannelConnectivity.IDLE
DEBUG:flwr:ChannelConnectivity.CONNECTING
DEBUG:flwr:ChannelConnectivity.READY
DEBUG:flwr:gRPC channel closed
INFO :      Disconnect and shut down
INFO:flwr:Disconnect and shut down

Locally, flwr 1.14.0 works fine. With the server deployed on AWS ECS, it does not work.

flwr 1.6.0 works fine.

When I have time, I will try to provide more information to reproduce the problem.

WilliamLindskog · 2025-01-31T19:00:59Z

Hm, could you maybe test using this: https://flower.ai/docs/framework/docker/tutorial-deploy-on-multiple-machines.html? That might help.

Best regards
Will

nicolascaiitec · 2025-02-10T15:11:29Z

I am actually using the old way of starting client and server:

    import flwr as fl

    fl.client.start_client(  # start_numpy_client is deprecated
        server_address=server_address,
        client=...,  # elided in the original post
        root_certificates=ca_cert,
    )

and

hist = fl.server.start_server(
    server_address=...,  # elided in the original post
    server=MyServer(client_manager=MyClientManager(data), strategy=strategy),
    strategy=strategy,
    client_manager=...,  # elided in the original post
    config=fl.server.ServerConfig(num_rounds=num_rounds),
)

I am not using SuperLink and SuperNode, which I have seen were added in the newer versions. Could this be the cause?

With SuperLink and SuperNode, I have trouble reproducing the examples from the site locally: I get various dependency errors, or the command to run the federation after updating pyproject.toml gives me an error.
It seems SuperLink and SuperNode are still experimental or intended only for simulations. We actually need more than simulations.

nicolascaiitec added the question label Oct 2, 2024

WilliamLindskog added the stale and part: communication labels Dec 11, 2024

WilliamLindskog removed the question label Jan 24, 2025
