-
Notifications
You must be signed in to change notification settings - Fork 735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add volume and volume mounts arguments to TrainingClient.create_job API #2449
Conversation
@andreyvelich @tenzen-y @Electronic-Waste could you please take a look? |
Signed-off-by: Antonin Stefanutti <[email protected]>
Pull Request Test Coverage Report for Build 13517559347Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti Thanks for this!
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @astefanutti!
volumes: Optional[List[models.V1Volume]] = None, | ||
volume_mounts: Optional[List[models.V1VolumeMount]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How we can make it easier to configure from the ML engineer perspective ?
I know that @truc0 has implemented it in a way that we can accept PVC, ConfigMap, or Secret as volume for Trial: kubeflow/katib#2508. But it still uses Kubernetes APIs.
@hbelmiro @rimolive @HumairAK Do we know if we have some sort of Kubernetes volume support in KFP V2 ?
I noticed that it is not intended to be implemented in V2: kubeflow/pipelines#8570
In that case, how users can attach volumes to their component in a workflow ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @HumairAK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In KFP users can access underlying k8s pod spec apis like mounting secrets/volumes/configmaps etc. (example for volumes here)
This exposes k8s apis to the KFP sdk, so it begs the question how friendly this is to the ML engineer persona, one who is not so k8s aware.
I'm not caught up on what use case this PR is looking to resolve, but in KFP we have a specific case of data passing between pipeline jobs, the storage of this data is abstracted via object store, but we see that users sometimes want to use their PVC for this instead without having to manually configure these PVC's in the pipeline. The end result being, you just declare what inputs/outputs you want in your python pipeline sdk, and under the hood we automatically utilize the PVC that you provided when you deployed KFP. This is a feature we'd like to introduce to KFP in the future, you can track it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case if for any distributed training job to be able to persist checkpoints. For example, this is what the InstructLab fine-tuning step does, and that can only be done via the PyTorchJob API at the moment, not via the SDK.
With this PR, it'll also be possible to do it with the SDK.
Stated differently, without a way to pass a PVC to the PyTorchJob via the SDK, the SDK is rather useless for any real distributed training jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @astefanutti, I just think whether we should give MLE control to all Volume and VolumeMount APIs.
Maybe specifying the volumeName is Sufficient enough.
Maybe we can improve it in V2 SDK given the separation between TrainJob and TrainingRuntime.
Thank you for this @astefanutti! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, Electronic-Waste The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
This PR adds the
volumes
andvolume_mounts
arguments to thecreate_job
API.This is particularly needed to mount shared storage for distributed jobs.
Checklist: