Support Serverless orchestration and Async strategies #4273
Comments
Thanks @zzsi - I am not part of the Flower team, but I have been looking for something similar in terms of a serverless implementation. I would be interested in understanding whether it makes sense to create an architecture proposal around this for the Flower team, and a PR. Additionally, instead of having a central storage capability, I was hoping to have a peer-to-peer system, and I believe you have comments in your code about it. For that, I would imagine clients could gossip updates on their model weights to each other. I will comment more on your repo; I am commenting here since your repo has not had any activity in the past 8 months.
Hi @leeloodub, apologies for my delayed reply. P2P is also a great option that we considered, but we opted for central storage for its simplicity. Doing an architecture proposal and PR sounds good to me. Do you have a use case in mind that motivates the implementation?
Hi @zzsi, this would be a great contribution to the framework. Would you be interested in opening a PR? We can support you throughout the process.
@WilliamLindskog yes, will do, appreciate it!
@zzsi Great! Looking forward to it. Let me know if I can help in any other way.
Hi @zzsi, just checking in here. Did you open a PR?
@WilliamLindskog I have some work in progress, half done, sorry for the delay. We only had a tensorflow example and I would like to add a torch quickstart. I will put more time on it next week.
Okay great! Just thought I would check in. Lmk if we can help.
@WilliamLindskog I am looking into simulating a serverless federated run. The current tool seems to assume client/server. Do you have recommendations, or a particular way to simulate this that you prefer?
Good question! @zzsi There is no particular way to implement this I think. Since Flower usually expects a Interested in hearing your thoughts.
Hey all! I subscribed to the thread because I am also doing research in async FL. Thought about giving my 2 cents on this. @WilliamLindskog In the entire Serverless FL setup, the controller acts as the only static node and keeps a client registry as well. It sounds similar to what a DNS does in a network (or it can do what a DNS does). It would make more sense to wrap the
Describe the type of feature and its functionality.
We have used `flwr` for a large-scale (100s of TB) medical imaging use case. Thank you for this great library, which made our life much easier.

When operating thousands of long-running experiments, we faced a few pain points, mainly:

To scratch our own itch, we implemented flwr_serverless as a wrapper of `flwr` for both Sync and Async strategies. It allows federated training to run without a central server that aggregates models. The core federation functionality is passed through to `flwr` strategies. We summarized our learnings on public data in this tech report. With the added robustness due to serverless+async, this implementation addressed our pain points and allowed us to do large-scale experimentation using `flwr` FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.

PS: I should probably have submitted this feature request a year ago, but better late than never to contribute upstream. I also noticed related work on Async, for which there seems to be an increasing need in practical deployments.
best,
ZZ AT kungfu.ai
Describe step by step what files and adjustments are you planning to include.
We implemented `SyncFederatedNode` and `AsyncFederatedNode` to handle communication of model weights to a shared "Folder" (e.g. S3). For TensorFlow/Keras, we implemented a `FlwrFederatedCallback` that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented torch integration, but it could be similar.

An example usage:
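The original snippet is not present in this copy of the issue, so the block below is not the flwr_serverless API. It is only a minimal, hypothetical sketch of the idea described above: each node writes its weights to a shared folder and averages them with whatever its peers have already posted, with no central server. The names `SharedFolder`, `AsyncNode`, and `weighted_average` are assumptions made for illustration, the folder is an in-memory dict standing in for something like S3, and weights are plain lists of floats rather than real model tensors.

```python
from typing import Dict, List, Tuple

Weights = List[float]  # flat parameter list, standing in for real model tensors
Entry = Tuple[Weights, int]  # (weights, number of training examples)


class SharedFolder:
    """Hypothetical in-memory stand-in for shared storage (e.g. an S3 prefix)."""

    def __init__(self) -> None:
        self._store: Dict[str, Entry] = {}

    def put(self, node_id: str, weights: Weights, num_examples: int) -> None:
        # Each node overwrites only its own slot.
        self._store[node_id] = (list(weights), num_examples)

    def get_all(self) -> List[Entry]:
        return list(self._store.values())


def weighted_average(entries: List[Entry]) -> Weights:
    """FedAvg-style mean, weighted by each node's example count."""
    total = sum(n for _, n in entries)
    dim = len(entries[0][0])
    return [sum(w[i] * n for w, n in entries) / total for i in range(dim)]


class AsyncNode:
    """Hypothetical async node: publish, then aggregate without waiting for peers."""

    def __init__(self, node_id: str, folder: SharedFolder) -> None:
        self.node_id = node_id
        self.folder = folder

    def update(self, local_weights: Weights, num_examples: int) -> Weights:
        self.folder.put(self.node_id, local_weights, num_examples)
        # Aggregate over whatever is in the folder right now, including our own entry.
        return weighted_average(self.folder.get_all())


folder = SharedFolder()
node_a = AsyncNode("a", folder)
node_b = AsyncNode("b", folder)

print(node_a.update([1.0, 2.0], num_examples=10))  # only node a has posted so far
print(node_b.update([3.0, 4.0], num_examples=30))  # averages a and b, weighted 10:30
```

In this sketch, a synchronous variant would differ only in that `update` would wait until every expected peer has posted weights for the current round before averaging. In a Keras setup, a call like `update` would presumably be triggered from a training callback at the end of each epoch, which is the role the issue assigns to `FlwrFederatedCallback`.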
Is there something else you want to add?
No response