KEP-2401: Kubeflow LLM Trainer V2 #2410

Open · wants to merge 10 commits into master from doc/KEP-2401

Conversation

Electronic-Waste (Member)

This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170

We will collect the final community feedback by February 12 and start the implementation after that.

Open Questions

  1. Since we adopt `torchrun` as the launcher for the LLM Trainer, do we need to support more launchers, such as `torchtune` and `accelerate`, in the future?
  2. Do we want to support Adapter Prompt Tuning and Prefix Tuning?

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, varshaprasad96, truc0, astefanutti, seanlaii.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

Approvers can indicate their approval by writing /approve in a comment and cancel it by writing /approve cancel.

@coveralls commented Feb 1, 2025

Pull Request Test Coverage Report for Build 13089269276

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on doc/KEP-2401 at 100.0%

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 13016586638 | 100.0% |
| Covered Lines | 85 |
| Relevant Lines | 85 |

💛 - Coveralls


It's also worth noting that we will preprocess datasets for users with built-in dataset classes or a customized one. If users want to preprocess datasets themselves, they need to implement a custom data class with the specified methods and pass it to the Python SDK (see the sketch below).

In the future, we'll provide users with more options for launchers (`torchtune`, `accelerate`), frameworks (TensorFlow, JAX, etc.), and fine-tuning techniques (RLHF, Distillation, etc.).
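
To make the custom preprocessing path concrete, here is a minimal, hypothetical sketch of such a data class. The class name, the `preprocess` method, and the `dataset_class` parameter used at the end are illustrative assumptions, not the finalized Trainer V2 interface.

```python
# Hypothetical sketch only: the method contract and the SDK parameter shown
# below are assumptions for illustration, not the final Trainer V2 API.
from datasets import load_dataset


class AlpacaStyleDataClass:
    """User-defined preprocessing for an instruction-tuning dataset."""

    def __init__(self, dataset_name: str, tokenizer, max_length: int = 2048):
        self.dataset_name = dataset_name
        self.tokenizer = tokenizer
        self.max_length = max_length

    def preprocess(self):
        # Load the raw dataset, build one prompt/response string per record,
        # and tokenize it so the trainer receives model-ready features.
        raw = load_dataset(self.dataset_name, split="train")

        def to_features(example):
            text = f"{example['instruction']}\n{example['output']}"
            return self.tokenizer(
                text, truncation=True, max_length=self.max_length
            )

        return raw.map(to_features, remove_columns=raw.column_names)
```

The user would then pass an instance to the SDK's fine-tuning entry point, for example something like `TrainingClient().train(..., dataset_class=AlpacaStyleDataClass("tatsu-lab/alpaca", tokenizer))`, where the parameter name is again hypothetical.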
Contributor
It would probably be strategic for us to make this work nicely with llama-stack. I think that could be a great growth lever.

Member Author
Yeah, but torchtune currently lacks multi-node training support: pytorch/torchtune#2018. We may need to wait for multi-node support to land, or we won't be able to scale our training on Kubernetes.

Contributor
Oh, that's unfortunate.

@joecummings
Hey @Electronic-Waste and @franciscojavierarceo, we just merged official multi-node support in pytorch/torchtune#2301. It's available in nightly builds and builds from source right now, but will be released in the next week or so in our v0.6 version.

Member
This is great news @joecummings, thanks for letting us know!

I think, if we initially add Kubernetes blueprints for LLMs that torchtune supports, we should be good: https://github.com/pytorch/torchtune/tree/main?tab=readme-ov-file#models.

@Electronic-Waste (Member Author) commented Feb 11, 2025
@joecummings Thanks for the update. This is really big news for us!

I noticed that pytorch/torchtune#2348 added a description of multi-node support to the README, but only full finetuning supports multi-node mode: https://github.com/pytorch/torchtune#finetuning-recipes. May I ask whether torchtune plans to support multi-node mode for the other finetuning methods in the future?

> I think, if we initially add Kubernetes blueprints for LLMs that torchtune supports, we should be good: https://github.com/pytorch/torchtune/tree/main?tab=readme-ov-file#models.

@andreyvelich SGTM.

@joecummings
Yep, we'll have multi-node support for all our distributed recipes soon!

Technically, multi-node is already supported in all our distributed recipes. The reason we don't claim support is that we haven't officially benchmarked it, and we want to land other optimizations like initializing the process group on GPU if available and 2D parallelism.
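
For reference, here is a minimal sketch of the GPU process-group initialization mentioned above, assuming a recent PyTorch where `init_process_group` accepts a `device_id` argument and a launcher such as `torchrun` that sets the rendezvous environment variables; this is not torchtune's implementation.

```python
# Minimal sketch, not torchtune's implementation: eagerly initialize the NCCL
# process group on the local GPU when one is available, otherwise fall back
# to gloo on CPU. Assumes RANK/WORLD_SIZE/LOCAL_RANK are set by the launcher.
import os

import torch
import torch.distributed as dist


def init_distributed() -> torch.device:
    if torch.cuda.is_available():
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
        # Passing device_id lets recent PyTorch releases create the NCCL
        # communicators eagerly on this GPU instead of lazily at first use.
        dist.init_process_group(backend="nccl", device_id=device)
    else:
        device = torch.device("cpu")
        dist.init_process_group(backend="gloo")
    return device
```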

```python
mixed_precision: bool = True
use_fp16: bool = False
fsdp_cpu_offload: bool = False
sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
```

@joecummings
You may want to utilize the new APIs for FSDP2 as this composes well with QLoRA (among other things). The docs are noticeably lacking, but the best resource is probably this ref from torchtitan's repo.
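
For readers unfamiliar with it, here is a minimal sketch of the FSDP2-style API, assuming PyTorch ≥ 2.6 (which exposes `fully_shard` and `MixedPrecisionPolicy` from `torch.distributed.fsdp`); the module layout and dtypes are illustrative, not the trainer's actual wiring.

```python
# Minimal FSDP2 sketch: shard each transformer block, then the root module,
# with bf16 compute and fp32 gradient reduction. Assumes the model exposes
# its transformer blocks as `model.layers` (illustrative assumption).
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard


def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,  # dtype for compute and communication
        reduce_dtype=torch.float32,  # dtype for gradient reduction
    )
    # Shard per block first so parameters are grouped at block granularity,
    # then shard the root module to cover the remaining parameters.
    for block in model.layers:
        fully_shard(block, mp_policy=mp_policy)
    fully_shard(model, mp_policy=mp_policy)
    return model
```

This per-module composition is what lets FSDP2 combine cleanly with techniques like QLoRA, as noted above.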

Contributor
Amazing, thanks @joecummings!

Member Author
Thanks for the info @joecummings!

```python
sharding_group_size: int = 0  # requires hsdp to be set.
replica_group_size: int = 0  # requires hsdp to be set.
checkpoint_type: StateDictType = StateDictType.SHARDED_STATE_DICT
fsdp_activation_checkpointing: bool = True
```

@joecummings
Cool opportunity here to support activation offloading, too, which @janeyx99 was nice enough to develop and implement in torchtune! pytorch/torchtune#1443
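
As a rough illustration of the idea (not the torchtune implementation from pytorch/torchtune#1443), PyTorch's built-in `save_on_cpu` saved-tensors hook offloads activations kept for backward to CPU memory during the forward pass and copies them back when backward runs:

```python
# Rough illustration of activation offloading with PyTorch's built-in
# save_on_cpu hook; torchtune's implementation in pytorch/torchtune#1443 is
# a more sophisticated, training-oriented version of the same idea.
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).to(device)
x = torch.randn(8, 4096, device=device, requires_grad=True)

# Tensors saved for backward are packed to (pinned) CPU memory in forward
# and unpacked back to the original device when backward needs them.
with save_on_cpu(pin_memory=(device == "cuda")):
    loss = model(x).sum()
loss.backward()
```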

Member Author
@joecummings That's amazing!

```
1|52|Loss: 2.3697006702423096: 0%|| 52/25880 [00:24<3:55:01, 1.83it/s]
```

(**Note**: We need to create a new plugin for `torchtune` so that it fits into the YAML-based fine-tuning configurations. We may also need to explore how to integrate the recipes provided by `torchtune`.)

@joecummings
Happy to chat more about the best way to integrate recipes here!

Member
Absolutely, @joecummings!
Do you have any community calls where we can discuss it further?
If you are available, we can talk about it during the next Kubeflow Training Working Group call on Feb 19th at 5pm UTC.

@joecummings
Ahh, unfortunate timing - we just had a call last Thursday. More than willing to find some additional time for us to meet, but I'm also free to join your meeting on February 19th so we can talk more then :)

@juliusvonkohout (Member)

Should security, i.e. hard multi-tenancy, Istio support, and the restricted Pod Security Standards profile, be part of the KEP?
