
Add volume and volume mounts arguments to TrainingClient.create_job API #2449

Merged 1 commit into kubeflow:release-1.9 on Feb 25, 2025

Conversation

astefanutti (Contributor)

What this PR does / why we need it:

This PR adds the volumes and volume_mounts arguments to the create_job API.

This is particularly needed to mount shared storage for distributed jobs.
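
For illustration, here is a minimal sketch of how the new arguments could be used. Only volumes and volume_mounts come from this PR; the job name, training function, PVC name, and mount path are hypothetical, and the remaining arguments follow the existing create_job API:

from kubeflow.training import TrainingClient
from kubernetes.client import (
    V1PersistentVolumeClaimVolumeSource,
    V1Volume,
    V1VolumeMount,
)

def train_func():
    # User-defined training function; checkpoints written under
    # /mnt/checkpoints land on the shared volume.
    ...

# Mount an existing PVC into every replica so checkpoints are shared
# across the distributed job (PVC and paths are hypothetical).
TrainingClient().create_job(
    name="pytorch-dist",
    train_func=train_func,
    num_workers=4,
    volumes=[
        V1Volume(
            name="checkpoints",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                claim_name="checkpoints-pvc",
            ),
        ),
    ],
    volume_mounts=[
        V1VolumeMount(name="checkpoints", mount_path="/mnt/checkpoints"),
    ],
)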

Checklist:

  • Docs included if any changes are user facing

astefanutti (Contributor Author) commented Feb 25, 2025

@andreyvelich @tenzen-y @Electronic-Waste could you please take a look?

coveralls commented Feb 25, 2025

Pull Request Test Coverage Report for Build 13517559347

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals:
  • Change from base Build 13314191840: 0.0%
  • Covered Lines: 85
  • Relevant Lines: 85

💛 - Coveralls

Electronic-Waste (Member) left a comment

@astefanutti Thanks for this!

/lgtm

andreyvelich (Member) left a comment

Thank you for this @astefanutti!

Comment on lines +357 to +358
volumes: Optional[List[models.V1Volume]] = None,
volume_mounts: Optional[List[models.V1VolumeMount]] = None,
andreyvelich (Member) commented Feb 25, 2025

How can we make this easier to configure from the ML engineer's perspective?

I know that @truc0 has implemented it in a way that accepts a PVC, ConfigMap, or Secret as a volume for a Trial: kubeflow/katib#2508. But it still uses Kubernetes APIs.

@hbelmiro @rimolive @HumairAK Do we know if we have some sort of Kubernetes volume support in KFP V2?
I noticed that it is not intended to be implemented in V2: kubeflow/pipelines#8570
In that case, how can users attach volumes to their component in a workflow?

Contributor left a comment

In KFP, users can access the underlying Kubernetes pod spec APIs, e.g. for mounting secrets, volumes, and ConfigMaps (example for volumes here).
This exposes Kubernetes APIs through the KFP SDK, so it begs the question of how friendly this is to the ML engineer persona, one who is not so Kubernetes-aware.

I'm not caught up on what use case this PR is looking to resolve, but in KFP we have a specific case of data passing between pipeline jobs. The storage of this data is abstracted via an object store, but we see that users sometimes want to use their own PVC for this instead, without having to manually configure these PVCs in the pipeline. The end result is that you just declare what inputs/outputs you want in your Python pipeline SDK, and under the hood we automatically use the PVC that you provided when you deployed KFP. This is a feature we'd like to introduce to KFP in the future; you can track it here.
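
For reference, a rough sketch of the kfp-kubernetes extension approach mentioned above (assuming the kfp-kubernetes package is installed; the component, pipeline, and PVC name are hypothetical):

from kfp import dsl, kubernetes  # "kubernetes" is provided by the kfp-kubernetes package

@dsl.component
def train() -> None:
    # Hypothetical task that reads/writes data under /mnt/data inside its pod.
    ...

@dsl.pipeline
def my_pipeline():
    task = train()
    # Mount an existing PVC into the task's pod at /mnt/data.
    kubernetes.mount_pvc(task, pvc_name="my-pvc", mount_path="/mnt/data")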

astefanutti (Contributor Author) left a comment

The use case is for any distributed training job to be able to persist checkpoints. For example, this is what the InstructLab fine-tuning step does, and at the moment that can only be done via the PyTorchJob API, not via the SDK.
With this PR, it'll also be possible to do it with the SDK.

Stated differently, without a way to pass a PVC to the PyTorchJob via the SDK, the SDK is rather useless for any real distributed training job.

Member left a comment

I agree with @astefanutti; I just wonder whether we should give ML engineers control over the full Volume and VolumeMount APIs.
Maybe specifying just the volume name is sufficient.
Maybe we can improve this in the V2 SDK, given the separation between TrainJob and TrainingRuntime.

andreyvelich (Member)

Thank you for this @astefanutti!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 3860d3d into kubeflow:release-1.9 on Feb 25, 2025
56 checks passed
astefanutti deleted the pr-14 branch on February 26, 2025 08:22