
Add volume and volume mounts arguments to TrainingClient.create_job API #2449

Merged 1 commit into kubeflow:release-1.9 on Feb 25, 2025

Conversation

astefanutti (Contributor)

What this PR does / why we need it:

This PR adds the volumes and volume_mounts arguments to the create_job API.

This is particularly needed to mount shared storage for distributed jobs.
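
For illustration, here is a minimal sketch of how the new arguments could be used. Only volumes and volume_mounts come from this PR; the job name, training function, PVC name, and mount path are hypothetical, and the remaining arguments follow the existing create_job API:

from kubeflow.training import TrainingClient
from kubernetes.client import (
    V1PersistentVolumeClaimVolumeSource,
    V1Volume,
    V1VolumeMount,
)

def train_func():
    # User-defined training function; checkpoints written under
    # /mnt/checkpoints land on the shared volume.
    ...

# Mount an existing PVC into every replica so checkpoints are shared
# across the distributed job (PVC and paths are hypothetical).
TrainingClient().create_job(
    name="pytorch-dist",
    train_func=train_func,
    num_workers=4,
    volumes=[
        V1Volume(
            name="checkpoints",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                claim_name="checkpoints-pvc",
            ),
        ),
    ],
    volume_mounts=[
        V1VolumeMount(name="checkpoints", mount_path="/mnt/checkpoints"),
    ],
)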

Checklist:

  • Docs included if any changes are user facing

astefanutti (Contributor Author) commented Feb 25, 2025

@andreyvelich @tenzen-y @Electronic-Waste could you please take a look?

coveralls commented Feb 25, 2025

Pull Request Test Coverage Report for Build 13517559347

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals:
  • Change from base Build 13314191840: 0.0%
  • Covered Lines: 85
  • Relevant Lines: 85

💛 - Coveralls

Electronic-Waste (Member) left a comment

@astefanutti Thanks for this!

/lgtm

andreyvelich (Member) left a comment

Thank you for this @astefanutti!

Comment on lines +357 to +358
volumes: Optional[List[models.V1Volume]] = None,
volume_mounts: Optional[List[models.V1VolumeMount]] = None,
andreyvelich (Member) commented Feb 25, 2025

How can we make this easier to configure from the ML engineer's perspective?

I know that @truc0 has implemented it in a way that accepts a PVC, ConfigMap, or Secret as a volume for a Trial: kubeflow/katib#2508. But it still uses Kubernetes APIs.

@hbelmiro @rimolive @HumairAK Do we know if we have some sort of Kubernetes volume support in KFP V2?
I noticed that it is not intended to be implemented in V2: kubeflow/pipelines#8570
In that case, how can users attach volumes to their component in a workflow?

Contributor left a comment

In KFP, users can access the underlying Kubernetes pod spec APIs, e.g. for mounting secrets, volumes, and ConfigMaps (example for volumes here).
This exposes Kubernetes APIs through the KFP SDK, so it begs the question of how friendly this is to the ML engineer persona, one who is not so Kubernetes-aware.

I'm not caught up on what use case this PR is looking to resolve, but in KFP we have a specific case of data passing between pipeline jobs. The storage of this data is abstracted via an object store, but we see that users sometimes want to use their own PVC for this instead, without having to manually configure these PVCs in the pipeline. The end result is that you just declare what inputs/outputs you want in your Python pipeline SDK, and under the hood we automatically use the PVC that you provided when you deployed KFP. This is a feature we'd like to introduce to KFP in the future; you can track it here.
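
For reference, a rough sketch of the kfp-kubernetes extension approach mentioned above (assuming the kfp-kubernetes package is installed; the component, pipeline, and PVC name are hypothetical):

from kfp import dsl, kubernetes  # "kubernetes" is provided by the kfp-kubernetes package

@dsl.component
def train() -> None:
    # Hypothetical task that reads/writes data under /mnt/data inside its pod.
    ...

@dsl.pipeline
def my_pipeline():
    task = train()
    # Mount an existing PVC into the task's pod at /mnt/data.
    kubernetes.mount_pvc(task, pvc_name="my-pvc", mount_path="/mnt/data")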

astefanutti (Contributor Author) left a comment

The use case is for any distributed training job to be able to persist checkpoints. For example, this is what the InstructLab fine-tuning step does, and at the moment that can only be done via the PyTorchJob API, not via the SDK.
With this PR, it'll also be possible to do it with the SDK.

Stated differently, without a way to pass a PVC to the PyTorchJob via the SDK, the SDK is rather useless for any real distributed training job.

Member left a comment

I agree with @astefanutti; I just wonder whether we should give ML engineers control over the full Volume and VolumeMount APIs.
Maybe specifying just the volume name is sufficient.
Maybe we can improve this in the V2 SDK, given the separation between TrainJob and TrainingRuntime.

andreyvelich (Member)

Thank you for this @astefanutti!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 3860d3d into kubeflow:release-1.9 on Feb 25, 2025
56 checks passed
astefanutti deleted the pr-14 branch on February 26, 2025 08:22