Support Serverless orchestration and Async strategies #4273

zzsi · 2024-09-30T18:17:09Z

Describe the type of feature and its functionality.

We have used flwr for a large scale (100s of TB) medical imaging use case. Thank you for this great library which made our life much easier.

When operating thousands of long running experiments, we faced a few pain points, mainly:

Managing multiple servers for individual experiments became tedious, fragile, and unsustainable.
Different institutions (clients) have very different data and compute properties, which makes training speed and duration uneven and unstable, to the point that synchronization became a bottleneck.

To scratch our own itch, we implemented flwr_serverless as a wrapper of flwr for both Sync and Async strategies. It allows federated training to run without a central server that aggregates models. The core federation functionality is passed through to flwr strategies. We summarized our learnings on public data in this tech report. With the added robustness due to serverless+async, this implementation addressed our pain points and allowed us to do large scale experimentation using flwr FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.

PS: I should probably have submitted this feature request a year ago, but better late than never to contribute upstream. I also noticed related work on Async, for which there seems to be an increasing need in practical deployments.

best,
ZZ AT kungfu.ai

Describe step by step what files and adjustments are you planning to include.

We implemented SyncFederatedNode and AsyncFederatedNode to handle communication of model weights through a shared "Folder" (e.g. S3). For TensorFlow/Keras, we implemented a FlwrFederatedCallback that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented a torch integration yet, but it could be similar.

An example usage:

# Create a FL Node that has a strategy and a shared folder.
from tensorflow import keras  # Assumes the Keras bundled with TensorFlow.
from flwr.server.strategy import FedAvg  # This is a flwr federated strategy.
from flwr_serverless import AsyncFederatedNode, S3Folder
from flwr_serverless.keras import FlwrFederatedCallback

strategy = FedAvg()
shared_folder = S3Folder(directory="mybucket/experiment1")
node = AsyncFederatedNode(strategy=strategy, shared_folder=shared_folder)

# Create a keras Callback with the FL node.
# steps_per_epoch and batch_size come from your own training setup.
num_examples_per_epoch = steps_per_epoch * batch_size  # number of examples used in each epoch
callback = FlwrFederatedCallback(
    node,
    num_examples_per_epoch=num_examples_per_epoch,
    save_model_before_aggregation=False,
    save_model_after_aggregation=False,
)

# Join the federated learning, by fitting the model with the federated callback.
model = keras.Model(...)
model.compile(...)
model.fit(dataset, callbacks=[callback])
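
To make the "no central server" idea concrete, here is a minimal sketch of what an asynchronous node could do on each update: publish its weights to a shared store, then average whatever entries are currently there. This is not the flwr_serverless implementation. `InMemoryFolder`, `ToyAsyncNode`, and `weighted_average` are hypothetical names, and a plain dictionary stands in for S3.

```python
import numpy as np


class InMemoryFolder:
    """Hypothetical stand-in for a shared folder such as S3: a key-value store."""

    def __init__(self):
        self._store = {}

    def put(self, node_id, weights, num_examples):
        # Each node overwrites only its own entry, so writers never conflict.
        self._store[node_id] = ([w.copy() for w in weights], num_examples)

    def items(self):
        return list(self._store.items())


def weighted_average(entries):
    """FedAvg-style average of weight lists, weighted by example counts."""
    total = sum(n for _, n in entries)
    num_layers = len(entries[0][0])
    return [
        sum(weights[i] * (n / total) for weights, n in entries)
        for i in range(num_layers)
    ]


class ToyAsyncNode:
    """Hypothetical async node: publish local weights, then merge what is there."""

    def __init__(self, node_id, folder):
        self.node_id = node_id
        self.folder = folder

    def update(self, local_weights, num_examples):
        # 1. Publish this node's latest weights.
        self.folder.put(self.node_id, local_weights, num_examples)
        # 2. Aggregate over every entry currently in the folder, without
        #    waiting for slower peers (this is what makes it asynchronous).
        entries = [value for _, value in self.folder.items()]
        return weighted_average(entries)


folder = InMemoryFolder()
node_a = ToyAsyncNode("a", folder)
node_b = ToyAsyncNode("b", folder)

# Node A finishes first and sees only its own weights.
merged_a = node_a.update([np.array([1.0, 1.0])], num_examples=100)
# Node B finishes later and merges with A's published weights.
merged_b = node_b.update([np.array([3.0, 3.0])], num_examples=300)
```

A synchronous variant would differ only in step 2: it would block until every expected peer has published weights for the current round before averaging.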

Is there something else you want to add?

No response


leeloodub · 2024-11-21T18:54:33Z

Thanks @zzsi - I am not part of the Flower team, but I have been looking for something similar in terms of a serverless implementation. I would be interested in understanding whether it makes sense to create an architecture proposal around this for the Flower team, and a PR.

Additionally, instead of having a central storage capability, I was hoping to have a peer-to-peer system, and I believe you have comments in your code about it. For that, I would imagine clients could gossip updates to their model weights to each other.

Will comment more on your repo, commenting here since your repo has not had any activity in the past 8 months.
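
As an illustration of the gossip idea mentioned above, here is a minimal sketch of pairwise gossip averaging, assuming each peer holds a NumPy weight vector and that peers are paired at random each round. `gossip_round` is a hypothetical helper, not part of flwr or flwr_serverless, and real peers would exchange weights over the network instead of sharing a list.

```python
import random

import numpy as np


def gossip_round(weights, rng):
    """One round of pairwise gossip averaging over a list of weight vectors."""
    order = list(range(len(weights)))
    rng.shuffle(order)
    # Pair up peers; with an odd count the last peer sits this round out.
    for i, j in zip(order[0::2], order[1::2]):
        mean = (weights[i] + weights[j]) / 2.0
        weights[i] = mean.copy()
        weights[j] = mean.copy()
    return weights


rng = random.Random(0)
peers = [np.array([float(k)]) for k in range(4)]  # toy one-parameter "models"
initial_mean = np.mean([w[0] for w in peers])

for _ in range(20):
    peers = gossip_round(peers, rng)

# Gap between the largest and smallest peer value after gossiping.
spread = max(w[0] for w in peers) - min(w[0] for w in peers)
final_mean = np.mean([w[0] for w in peers])
```

Each pairwise exchange keeps the sum of the two vectors unchanged, so the global mean is preserved while the peers drift toward agreement, with no shared storage involved.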

zzsi · 2024-11-29T04:12:53Z

Hi @leeloodub, apologies for my delayed reply. P2P is also a great option that we considered, but we opted for central storage for its simplicity. Doing an architecture proposal and PR sounds good to me. Do you have a use case in mind that motivates the implementation?

WilliamLindskog · 2025-01-28T21:29:32Z

Hi @zzsi,

this would be a great contribution to the framework. Would you be interested in opening a PR? We can support you throughout the process.

zzsi · 2025-01-31T17:57:24Z

@WilliamLindskog yes, will do, appreciate it!

WilliamLindskog · 2025-01-31T18:10:00Z

@zzsi Great! Looking forward to it. Let me know if I can help in any other way.

WilliamLindskog · 2025-03-07T14:53:29Z

Hi @zzsi, just checking in here. Did you open a PR?

zzsi · 2025-03-07T15:33:46Z

@WilliamLindskog I have some work in progress, half done, sorry for the delay.

We only had a tensorflow example and I would like to add a torch quickstart. I will put more time on it next week.

WilliamLindskog · 2025-03-07T17:41:11Z

Okay great! Just thought I would check in. Lmk if we can help.

zzsi · 2025-03-07T18:25:15Z

@WilliamLindskog I am looking into simulating a serverless federated run. The current tool seems to assume a client/server setup. Do you have recommendations, or a particular way to simulate this that you prefer?

WilliamLindskog · 2025-03-12T16:42:40Z

Good question! @zzsi

I don't think there is a particular way to implement this. Since Flower usually expects a ClientApp and a ServerApp to be passed, maybe we can pass an empty ServerApp or dynamically "change" the location of the server component where the aggregation happens.

Interested in hearing your thoughts.

Spaarsh · 2025-03-12T17:30:54Z

Hey all! I subscribed to the thread because I am also doing research in async FL. Thought about giving my 2 cents on this.

@WilliamLindskog In the entire Serverless FL setup, the controller acts as the only static node and keeps a client registry as well. It sounds similar to what a DNS does in a network (or it can do what a DNS does). It would make more sense to wrap the ServerApp around the controller. Of course, aggregation will take place elsewhere. But we could essentially reduce the existing ServerApp to a client registry and, possibly, an orchestration tool. The best thing is that it also doesn't require us to modify how the ClientApp treats this new "Server" App.
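
A minimal sketch of the "registry only" role described above, assuming clients announce themselves with periodic heartbeats. `ClientRegistry` and its methods are hypothetical and do not use or represent Flower's ServerApp API. The sketch only shows that such a component can track membership without ever touching model weights.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ClientRegistry:
    """Hypothetical registry: tracks which clients are alive, aggregates nothing."""

    ttl_seconds: float = 60.0
    _last_seen: dict = field(default_factory=dict)

    def heartbeat(self, client_id, now=None):
        # Clients call this periodically to announce that they are still training.
        self._last_seen[client_id] = time.time() if now is None else now

    def active_clients(self, now=None):
        # A client counts as active if its last heartbeat is within the TTL.
        now = time.time() if now is None else now
        return sorted(
            cid
            for cid, seen in self._last_seen.items()
            if now - seen <= self.ttl_seconds
        )


registry = ClientRegistry(ttl_seconds=60.0)
registry.heartbeat("hospital-a", now=0.0)
registry.heartbeat("hospital-b", now=50.0)

active_at_55 = registry.active_clients(now=55.0)    # both heartbeats are recent
active_at_100 = registry.active_clients(now=100.0)  # hospital-a has timed out
```

Clients could query such a registry to discover peers, while aggregation stays on the clients or in shared storage.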

zzsi added the "feature request" label (This issue or comment suggests an additional feature.) Sep 30, 2024

WilliamLindskog added the "stale" (If issue/PR hasn't been updated within 3 weeks.) and "part: communication" (Issues/PRs that affect federated communication e.g. gRPC.) labels Dec 11, 2024

WilliamLindskog mentioned this issue Mar 12, 2025

Does the Flower support decentralized federated learning? #3149

Closed
