Server stops waiting for client evaluation results when both server-side and client-side evaluation are used #4519

Open
MikeRiz521 opened this issue Nov 18, 2024 · 3 comments
Labels
bug Something isn't working needs reproduction details Issues that require additional information or context from the author before proceeding part: communication Issues/PRs that affect federated communication e.g. gRPC. state: open Open Issue/PR that is not yet under review.

Comments

@MikeRiz521

Describe the bug

When the model is evaluated both centrally on the server and in a federated manner on the clients, the server finishes its own evaluation for the current round and sends the evaluate message to the clients. After a certain amount of time, the server then appears to abruptly close the connection to the clients, logs the aggregate_evaluate steps as failures, and proceeds to the next round, leaving the clients with nowhere to send their evaluation results.

Steps/Code to Reproduce

Implement both an evaluate_fn and an evaluate_metrics_aggregation_fn in the federated learning strategy.
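A minimal sketch of such a setup, assuming Flower's built-in FedAvg strategy is the one in use (the issue does not say which strategy was used). The function bodies are hypothetical placeholders, not the author's evaluation code, and the flwr import is kept inside build_strategy so the two callbacks can be read and tested without Flower installed:

```python
from typing import Dict, List, Optional, Tuple, Union

Scalar = Union[bool, bytes, float, int, str]
Metrics = Dict[str, Scalar]


def evaluate_fn(
    server_round: int, parameters: list, config: Dict[str, Scalar]
) -> Optional[Tuple[float, Metrics]]:
    """Centralized (server-side) evaluation callback.

    Placeholder body: a real implementation would load `parameters`
    into the model and score it on a held-out dataset.
    """
    loss, accuracy = 0.0, 0.0  # hypothetical values, for illustration only
    return loss, {"accuracy": accuracy}


def weighted_average(metrics: List[Tuple[int, Metrics]]) -> Metrics:
    """Aggregate client metrics, weighting each client by its example count."""
    total = sum(num_examples for num_examples, _ in metrics)
    if total == 0:
        return {}
    accuracy = sum(n * float(m["accuracy"]) for n, m in metrics) / total
    return {"accuracy": accuracy}


def build_strategy():
    """Wire both callbacks into one strategy (requires the flwr package)."""
    from flwr.server.strategy import FedAvg  # assumption: FedAvg is used

    return FedAvg(
        evaluate_fn=evaluate_fn,
        evaluate_metrics_aggregation_fn=weighted_average,
    )
```

With both arguments set, each round runs the server-side evaluate_fn and then the federated evaluation whose results are passed to weighted_average.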

Expected Results

The server should collect the distributed evaluation results from the clients, aggregate them with the configured evaluate_metrics_aggregation_fn function, and then proceed with the next round of federated training.

Actual Results

This error appears after a while in all of the clients:
I0000 00:00:1731929155.490662 16628 chttp2_transport.cc:1182] ipv4:127.0.0.1:8080: Got goaway [11] err=UNAVAILABLE:GOAWAY received; Error code: 11; Debug Text: ping_timeout {grpc_status:14, http2_error:11, created_time:"2024-11-18T13:25:55.490653401+02:00"}
I0000 00:00:1731929155.491912 16628 chttp2_transport.cc:1182] ipv4:127.0.0.1:8080: Got goaway [11] err=UNAVAILABLE:GOAWAY received; Error code: 11; Debug Text: ping_timeout {created_time:"2024-11-18T13:25:55.491907782+02:00", http2_error:11, grpc_status:14}

After some experimentation, this appears to be a timing issue: if the client evaluation step takes only a short time, everything works as expected. However, when running only client-side evaluation, this time limit does not exist and evaluation can last for an hour or more. Is there perhaps a way to configure the server's gRPC settings to keep the connection alive for longer?
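For context, the ping_timeout debug text suggests that an HTTP/2 keepalive ping went unanswered, although the thread does not confirm this. In plain gRPC, keepalive behaviour is controlled through channel arguments, sketched below. The argument names are standard gRPC ones; the helper names keepalive_options and make_server, and the values used, are hypothetical. Whether and how Flower's start_server exposes these options is not established in this issue, so treat this as a generic gRPC illustration rather than a confirmed fix:

```python
from typing import List, Tuple


def keepalive_options(
    keepalive_time_s: int, keepalive_timeout_s: int
) -> List[Tuple[str, int]]:
    """Build gRPC channel arguments that relax keepalive timing.

    keepalive_time_s:    interval between keepalive pings.
    keepalive_timeout_s: how long to wait for a ping acknowledgement
                         before the connection is considered dead.
    """
    return [
        ("grpc.keepalive_time_ms", keepalive_time_s * 1000),
        ("grpc.keepalive_timeout_ms", keepalive_timeout_s * 1000),
        # Allow pings even when no RPC is currently in flight.
        ("grpc.keepalive_permit_without_calls", 1),
        # 0 removes the cap on pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]


def make_server(options: List[Tuple[str, int]]):
    """Create a bare gRPC server with the given options (requires grpcio)."""
    from concurrent import futures

    import grpc

    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=4), options=options
    )


# Example values only: ping every 60 s, tolerate 10 minutes without an ack.
options = keepalive_options(keepalive_time_s=60, keepalive_timeout_s=600)
```

Keepalive settings generally need to be compatible on both ends of the connection, so a client-side channel would take matching options.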

@MikeRiz521 MikeRiz521 added the bug Something isn't working label Nov 18, 2024
@adam-narozniak
Contributor

Could you provide the full code you used?

@MikeRiz521
Author

Do you require the entire evaluation code for both the server and the clients? I'd rather not share it since it is part of a research work, but I can share the arguments passed to start_server if that is helpful.

@WilliamLindskog WilliamLindskog added needs reproduction details Issues that require additional information or context from the author before proceeding state: open Open Issue/PR that is not yet under review. part: communication Issues/PRs that affect federated communication e.g. gRPC. labels Dec 11, 2024
@WilliamLindskog
Contributor

Hi @MikeRiz521,

Thank you for raising this. If you can give us as much code as possible for reproduction, that would help speed up reviewing this bug.
