KEP-2401: Kubeflow LLM Trainer V2 #2410
Conversation
It's worth noting that we'll preprocess datasets for users with built-in dataset classes or a customized one. If users want to preprocess datasets themselves, they need to implement a custom data class with the specified methods and pass it to the Python SDK.
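For illustration only, here is a minimal sketch of what such a custom data class might look like; the class name, the method names (`load`, `preprocess`), and the `dataset_class=` argument are assumptions for the example, not an interface defined by this KEP.

```python
import json


class CustomInstructionDataset:
    """Hypothetical custom data class a user could hand to the Python SDK."""

    def __init__(self, path: str):
        self.path = path

    def load(self):
        """Read raw records (e.g. JSONL prompt/response pairs) from self.path."""
        with open(self.path) as f:
            return [json.loads(line) for line in f]

    def preprocess(self, tokenizer):
        """Tokenize prompt/response pairs into model inputs."""
        return [
            tokenizer(f"{r['prompt']}\n{r['response']}", truncation=True)
            for r in self.load()
        ]


# Hypothetical SDK call shape (argument name is an assumption):
# TrainingClient().train(..., dataset_class=CustomInstructionDataset)
```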
In the future, we'll provide users with more options for launchers (`torchtune`, `accelerate`), frameworks (TensorFlow, JAX, etc.), and fine-tuning techniques (RLHF, distillation, etc.).
torchtune would be convenient for Llama-stack https://github.com/meta-llama/llama-stack/blob/529708215c5ad54e1ef41ba3e68d3a2af8d563b0/llama_stack/providers/inline/post_training/torchtune/post_training.py#L32
It would probably be strategic for us to make this work nicely with llama-stack. I think that could be a great growth lever.
Yeah, but torchtune currently lacks multi-node training support: pytorch/torchtune#2018. We may need to wait for multi-node support to land, or we won't be able to scale our training on Kubernetes.
Oh, that's unfortunate.
Hey @Electronic-Waste and @franciscojavierarceo, we just merged official multi-node support in pytorch/torchtune#2301. It's available in nightly builds and builds from source right now, but will be released in the next week or so in our v0.6 version.
This is great news @joecummings, thanks for letting us know!
I think, if we initially add Kubernetes blueprints for LLMs that torchtune supports, we should be good: https://github.com/pytorch/torchtune/tree/main?tab=readme-ov-file#models.
@joecummings Thanks for updating this. It's really big news for us!
I noticed that in pytorch/torchtune#2348 the README added a description of multi-node support, but only full finetuning supports multi-node mode: https://github.com/pytorch/torchtune#finetuning-recipes. May I ask whether torchtune plans to support multi-node mode for other finetuning methods in the future?
I think, if we initially add Kubernetes blueprints for LLMs that torchtune supports, we should be good: https://github.com/pytorch/torchtune/tree/main?tab=readme-ov-file#models.
@andreyvelich SGTM.
Yep, we'll have multi node support for all our distributed recipes soon!
Technically, multi node is already supported in all our distributed recipes. The reason we don't claim support is b/c we haven't officially benchmarked it + we want to land other optimizations like initialization of the process group on GPU if available and 2D parallelism.
mixed_precision: bool = True        # enable mixed-precision training
use_fp16: bool = False              # use fp16 rather than bf16 for mixed precision
fsdp_cpu_offload: bool = False      # offload FSDP parameters to CPU
sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD  # shard params, grads, and optimizer state
You may want to utilize the new APIs for FSDP2 as this composes well with QLoRA (among other things). The docs are noticeably lacking, but the best resource is probably this ref from torchtitan's repo.
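For reference, a minimal sketch of the FSDP2 `fully_shard` API, assuming PyTorch ≥ 2.6 (where it is exported from `torch.distributed.fsdp`; older releases expose it under `torch.distributed._composable.fsdp`), an already-initialized process group (e.g. via `torchrun`), and a model that exposes its transformer blocks as `model.layers` (an assumption for the example):

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard


def apply_fsdp2(model: torch.nn.Module) -> torch.nn.Module:
    """Shard a transformer with FSDP2 (per-parameter DTensor sharding)."""
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,   # compute/communication in bf16
        reduce_dtype=torch.float32,   # gradient reduction in fp32
    )
    # Shard each transformer block first so parameters are gathered and
    # resharded per block, then wrap the root module.
    for block in model.layers:
        fully_shard(block, mp_policy=mp_policy)
    fully_shard(model, mp_policy=mp_policy)
    return model
```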
Amazing, thanks @joecummings!
Thanks for the info @joecummings!
sharding_group_size: int = 0        # requires HSDP to be set
replica_group_size: int = 0         # requires HSDP to be set
checkpoint_type: StateDictType = StateDictType.SHARDED_STATE_DICT  # save sharded state-dict checkpoints
fsdp_activation_checkpointing: bool = True                         # enable activation checkpointing
Cool opportunity here to support activation offloading, too, which @janeyx99 was nice enough to develop and implement in torchtune! pytorch/torchtune#1443
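As a rough illustration of the idea (not torchtune's actual implementation), PyTorch's built-in `torch.autograd.graph.save_on_cpu` hook offloads activations saved for backward to host memory and copies them back during the backward pass; the sketch below assumes a CUDA device is available:

```python
import torch
from torch.autograd.graph import save_on_cpu

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# Activations saved for backward are moved to pinned CPU memory instead of
# staying resident on the GPU, trading transfer time for memory headroom.
with save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()
```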
@joecummings That's amazing!
1|52|Loss: 2.3697006702423096: 0%|▏ | 52/25880 [00:24<3:55:01, 1.83it/s]
``` | ||
(**Note**: We need to create a new plugin for `torchtune` so that it fits into the YAML-based fine-tuning configurations. We may also need to explore how to integrate the recipes provided by `torchtune`.)
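As a very rough sketch of what such a plugin could do (the function name and parameters below are assumptions; only the `tune run` CLI shape comes from torchtune), it could translate TrainJob parameters into a `tune run` command instead of a `torchrun` one:

```python
def build_torchtune_command(
    num_nodes: int,
    nproc_per_node: int,
    recipe: str = "full_finetune_distributed",
    config: str = "llama3_1/8B_full",
    overrides: dict[str, str] | None = None,
) -> list[str]:
    """Build the trainer container command for a torchtune run (illustrative only)."""
    cmd = [
        "tune", "run",
        "--nnodes", str(num_nodes),
        "--nproc_per_node", str(nproc_per_node),
        recipe,
        "--config", config,
    ]
    # torchtune accepts key=value config overrides after --config; a plugin
    # could map TrainJob parameters (dataset path, LR, etc.) onto these.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


# e.g. build_torchtune_command(2, 8, overrides={"batch_size": "4"})
```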
Happy to chat more about the best way to integrate recipes here!
Absolutely, @joecummings!
Do you have any community calls where we can discuss it further?
If you are available, we can talk about it during the next Kubeflow Training Working Group call on Feb 19th at 5pm UTC.
Ahh, unfortunate timing - we just had a call last Thursday. More than willing to find some additional time for us to meet, but I'm also free to join your meeting on February 19th so we can talk more then :)
Should security, i.e. hard multi-tenancy, Istio support, and the Pod Security Standards restricted profile, be part of the KEP?
This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170
We will collect the final community feedback by Feb 12 and start the implementation after that.
Open Questions
Since we use `torchrun` as the launcher for LLM Trainer, do we need to support more launchers like `torchtune` and `accelerate` in the future?

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0