KEP-2401: Kubeflow LLM Trainer V2 #2410

Open · wants to merge 10 commits into master from doc/KEP-2401

Conversation

Electronic-Waste (Member)

This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170

We will collect the final community feedback by February 12 and start the implementation after that.

Open Questions

  1. Since we adopt `torchrun` as the launcher for the LLM Trainer, do we need to support more launchers, such as `torchtune` and `accelerate`, in the future?
  2. Do we want to support Adapter Prompt Tuning and Prefix Tuning?

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, varshaprasad96, truc0, astefanutti, seanlaii.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

Approvers can indicate their approval by writing /approve in a comment and cancel it by writing /approve cancel.

@coveralls commented Feb 1, 2025

Pull Request Test Coverage Report for Build 13089269276

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on doc/KEP-2401 at 100.0%

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 13016586638 | 100.0% |
| Covered Lines | 85 |
| Relevant Lines | 85 |

💛 - Coveralls


It's also worth noting that we will preprocess datasets for users with built-in dataset classes or a customized one. If users want to preprocess datasets themselves, they need to implement a custom data class with the specified methods and pass it to the Python SDK (see the sketch below).

In the future, we'll provide users with more options for launchers (`torchtune`, `accelerate`), frameworks (TensorFlow, JAX, etc.), and fine-tuning techniques (RLHF, Distillation, etc.).
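
To make the custom preprocessing path concrete, here is a minimal, hypothetical sketch of such a data class. The class name, the `preprocess` method, and the `dataset_class` parameter used at the end are illustrative assumptions, not the finalized Trainer V2 interface.

```python
# Hypothetical sketch only: the method contract and the SDK parameter shown
# below are assumptions for illustration, not the final Trainer V2 API.
from datasets import load_dataset


class AlpacaStyleDataClass:
    """User-defined preprocessing for an instruction-tuning dataset."""

    def __init__(self, dataset_name: str, tokenizer, max_length: int = 2048):
        self.dataset_name = dataset_name
        self.tokenizer = tokenizer
        self.max_length = max_length

    def preprocess(self):
        # Load the raw dataset, build one prompt/response string per record,
        # and tokenize it so the trainer receives model-ready features.
        raw = load_dataset(self.dataset_name, split="train")

        def to_features(example):
            text = f"{example['instruction']}\n{example['output']}"
            return self.tokenizer(
                text, truncation=True, max_length=self.max_length
            )

        return raw.map(to_features, remove_columns=raw.column_names)
```

The user would then pass an instance to the SDK's fine-tuning entry point, for example something like `TrainingClient().train(..., dataset_class=AlpacaStyleDataClass("tatsu-lab/alpaca", tokenizer))`, where the parameter name is again hypothetical.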
Contributor
It would probably be strategic for us to make this work nicely with llama-stack. I think that could be a great growth lever.

Member Author
Yeah, but torchtune currently lacks multi-node training support: pytorch/torchtune#2018. We may need to wait for multi-node support to land, or we won't be able to scale our training on Kubernetes.

Contributor
Oh, that's unfortunate.

@joecummings
Hey @Electronic-Waste and @franciscojavierarceo, we just merged official multi-node support in pytorch/torchtune#2301. It's available in nightly builds and builds from source right now, but will be released in the next week or so in our v0.6 version.

Member
This is great news @joecummings, thanks for letting us know!

I think, if we initially add Kubernetes blueprints for LLMs that torchtune supports, we should be good: https://github.com/pytorch/torchtune/tree/main?tab=readme-ov-file#models.

@Electronic-Waste (Member Author) commented Feb 11, 2025
@joecummings Thanks for the update. This is really big news for us!

I noticed that pytorch/torchtune#2348 added a description of multi-node support to the README, but only full finetuning supports multi-node mode: https://github.com/pytorch/torchtune#finetuning-recipes. May I ask whether torchtune plans to support multi-node mode for the other finetuning methods in the future?

> I think, if we initially add Kubernetes blueprints for LLMs that torchtune supports, we should be good: https://github.com/pytorch/torchtune/tree/main?tab=readme-ov-file#models.

@andreyvelich SGTM.

@joecummings
Yep, we'll have multi-node support for all our distributed recipes soon!

Technically, multi-node is already supported in all our distributed recipes. The reason we don't claim support is that we haven't officially benchmarked it, and we want to land other optimizations like initializing the process group on GPU if available and 2D parallelism.
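
For reference, here is a minimal sketch of the GPU process-group initialization mentioned above, assuming a recent PyTorch where `init_process_group` accepts a `device_id` argument and a launcher such as `torchrun` that sets the rendezvous environment variables; this is not torchtune's implementation.

```python
# Minimal sketch, not torchtune's implementation: eagerly initialize the NCCL
# process group on the local GPU when one is available, otherwise fall back
# to gloo on CPU. Assumes RANK/WORLD_SIZE/LOCAL_RANK are set by the launcher.
import os

import torch
import torch.distributed as dist


def init_distributed() -> torch.device:
    if torch.cuda.is_available():
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
        # Passing device_id lets recent PyTorch releases create the NCCL
        # communicators eagerly on this GPU instead of lazily at first use.
        dist.init_process_group(backend="nccl", device_id=device)
    else:
        device = torch.device("cpu")
        dist.init_process_group(backend="gloo")
    return device
```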

```python
mixed_precision: bool = True
use_fp16: bool = False
fsdp_cpu_offload: bool = False
sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
```

@joecummings
You may want to utilize the new APIs for FSDP2 as this composes well with QLoRA (among other things). The docs are noticeably lacking, but the best resource is probably this ref from torchtitan's repo.
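
For readers unfamiliar with it, here is a minimal sketch of the FSDP2-style API, assuming PyTorch ≥ 2.6 (which exposes `fully_shard` and `MixedPrecisionPolicy` from `torch.distributed.fsdp`); the module layout and dtypes are illustrative, not the trainer's actual wiring.

```python
# Minimal FSDP2 sketch: shard each transformer block, then the root module,
# with bf16 compute and fp32 gradient reduction. Assumes the model exposes
# its transformer blocks as `model.layers` (illustrative assumption).
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard


def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,  # dtype for compute and communication
        reduce_dtype=torch.float32,  # dtype for gradient reduction
    )
    # Shard per block first so parameters are grouped at block granularity,
    # then shard the root module to cover the remaining parameters.
    for block in model.layers:
        fully_shard(block, mp_policy=mp_policy)
    fully_shard(model, mp_policy=mp_policy)
    return model
```

This per-module composition is what lets FSDP2 combine cleanly with techniques like QLoRA, as noted above.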

Contributor
Amazing, thanks @joecummings!

Member Author
Thanks for the info @joecummings!

```python
sharding_group_size: int = 0  # requires hsdp to be set.
replica_group_size: int = 0  # requires hsdp to be set.
checkpoint_type: StateDictType = StateDictType.SHARDED_STATE_DICT
fsdp_activation_checkpointing: bool = True
```

@joecummings
Cool opportunity here to support activation offloading, too, which @janeyx99 was nice enough to develop and implement in torchtune! pytorch/torchtune#1443
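
As a rough illustration of the idea (not the torchtune implementation from pytorch/torchtune#1443), PyTorch's built-in `save_on_cpu` saved-tensors hook offloads activations kept for backward to CPU memory during the forward pass and copies them back when backward runs:

```python
# Rough illustration of activation offloading with PyTorch's built-in
# save_on_cpu hook; torchtune's implementation in pytorch/torchtune#1443 is
# a more sophisticated, training-oriented version of the same idea.
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).to(device)
x = torch.randn(8, 4096, device=device, requires_grad=True)

# Tensors saved for backward are packed to (pinned) CPU memory in forward
# and unpacked back to the original device when backward needs them.
with save_on_cpu(pin_memory=(device == "cuda")):
    loss = model(x).sum()
loss.backward()
```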

Member Author
@joecummings That's amazing!

```
1|52|Loss: 2.3697006702423096: 0%|| 52/25880 [00:24<3:55:01, 1.83it/s]
```

(**Note**: We need to create a new plugin for `torchtune` so that it fits into the YAML-based fine-tuning configurations. We may also need to explore how to integrate the recipes provided by `torchtune`.)

@joecummings
Happy to chat more about the best way to integrate recipes here!

Member
Absolutely, @joecummings!
Do you have any community calls where we can discuss it further?
If you are available, we can talk about it during the next Kubeflow Training Working Group call on Feb 19th at 5pm UTC.

@joecummings
Ahh, unfortunate timing - we just had a call last Thursday. More than willing to find some additional time for us to meet, but I'm also free to join your meeting on February 19th so we can talk more then :)

@juliusvonkohout (Member)

Should security, i.e. hard multi-tenancy, Istio support, and the restricted Pod Security Standards profile, be part of the KEP?
