KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382
base: master
Will review it later. /cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/release-team, saileshd1402. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
Thanks for creating this @Doris-xm! I left some initial comments for you.
Btw, I recommend learning more about the concepts behind Training V2. This will help you update the manifests overlay in training-operator and kubeflow/manifests :)
FYI: KubeCon 2024 NA Talk by Andrey and Yuki
/rerun-all
Basically LGTM. Thank you for your great contributions!
/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads
Thank you for this great contribution @willb!
/assign @kubeflow/wg-manifests-leads @kubeflow/release-team
Basically LGTM!
Thanks for your great contribution and your patience!
I think we should also add the namespace configuration here, so that these components are installed in the kubeflow-system namespace.
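For illustration, one way to do this is Kustomize's top-level `namespace` field, which is applied to every namespaced resource the overlay pulls in. This is only a sketch: the resource list is trimmed down, and the overlay in this PR may set the namespace differently.

```yaml
# Sketch of an overlay kustomization.yaml; the resource list is abbreviated.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Kustomize sets metadata.namespace to this value on all namespaced resources.
namespace: kubeflow-system
resources:
  - ../../base/manager
  - ../../base/rbac
  - ../../base/webhook
```

Note that this field does not create the Namespace object itself; the namespace has to exist already or be listed as a resource.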
Signed-off-by: Xinmin Du <[email protected]>
…ace: kubeflow-system Signed-off-by: Xinmin Du <[email protected]>
9518f7b to 1f1b0c2
LGTM! Thanks for updating this:)
/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @saileshd1402
Pull Request Test Coverage Report for Build 12742381626
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
# Conflicts:
#   manifests/v2/base/manager/kustomization.yaml
#   manifests/v2/base/rbac/kustomization.yaml
#   manifests/v2/base/webhook/kustomization.yaml
#   manifests/v2/overlays/only-manager/kustomization.yaml
#   manifests/v2/overlays/standalone/kustomization.yaml
Signed-off-by: Xinmin Du <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Electronic-Waste
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Basically LGTM. PTAL if you have time @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @astefanutti
Btw, can you mark the addressed comments as resolved, @Doris-xm? Some of them are outdated.
/lgtm
I think we can provide more input once you create the PR against kubeflow/manifests, including the integration tests, so that we can see how it behaves with the platform components, authorization, security, etc.
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
- ../../base/runtimes/pretraining
The runtimes must be installed after controller-manager is ready.
Are we OK with that?
I fear this is going to be "annoying" for every / a lot of downstream projects.
Maybe a "sub-optimal" solution would be to create the preset / in-tree runtimes before the webhooks, similar to what the default "legacy" sorting does.
But the webhooks will have to be forward compatible during upgrades.
Otherwise we move towards CEL based validation or ValidatingAdmissionPolicy, which may be more appropriate/suitable for (Cluster)TrainingRuntime resources.
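To make the ValidatingAdmissionPolicy option concrete, here is a rough sketch of what such a policy could look like. Everything in it is an assumption for illustration: the object names, the `trainer.kubeflow.org` API group, and the placeholder CEL rule are not taken from this PR, and real rules would have to mirror whatever the webhook validates today. The appeal is that the policy is evaluated inside the API server, so applying runtimes would no longer depend on the webhook Deployment being ready.

```yaml
# Hypothetical sketch: none of these names or rules come from this PR.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: clustertrainingruntime-validation # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["trainer.kubeflow.org"] # assumed API group
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["clustertrainingruntimes"]
  validations:
    # Placeholder rule; real checks would mirror the webhook's validation logic.
    - expression: "has(object.spec.template)"
      message: "spec.template must be set"
---
# A policy has no effect until a binding references it.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: clustertrainingruntime-validation # hypothetical name
spec:
  policyName: clustertrainingruntime-validation
  validationActions: ["Deny"]
```

The `admissionregistration.k8s.io/v1` version of these kinds is only available on newer clusters, so this would also raise the minimum supported Kubernetes version.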
Maybe a "sub-optimal" solution would be to create the preset / in-tree runtimes before the webhooks, similar to what the default "legacy" sorting does.
Unfortunately, even with sorting, Kustomize doesn't wait for resource probes to pass before deploying the next resources.
Ah no, I meant creating the runtimes before the webhooks, and making the fairly plausible assumption that the in-tree runtimes are "valid".
Oh, I see.
That might work. @tenzen-y @Electronic-Waste Any thoughts?
Currently, I'm trying to solve it by changing the order:
resources:
- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/runtimes/pretraining
- ../../base/webhook
- ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
- kubeflow-trainer-roles.yaml
I'm not sure if this will deploy as expected. I look forward to your further guidance.
It might be safer to set the sorting strategy to FIFO:
kind: Kustomization
sortOptions:
  order: fifo
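Putting the two suggestions together, the overlay's kustomization.yaml might look roughly like the sketch below. It combines the reordered resource list from the earlier comment with the FIFO sort option; the `apiVersion` line is the standard Kustomization header and is an addition here, and `sortOptions` requires a reasonably recent Kustomize release.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Emit resources in the order listed below instead of the legacy kind-based order.
sortOptions:
  order: fifo
resources:
  - ../../base/crds
  - ../../base/manager
  - ../../base/rbac
  - ../../base/runtimes/pretraining
  - ../../base/webhook
  - ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
  - kubeflow-trainer-roles.yaml
```

This only controls the order of the rendered output. It does not make anything wait for the controller-manager or webhook to become ready, so it relies on the assumption above that the in-tree runtimes are valid when they are created before the webhook configuration.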
manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-trainer-edit
I think we should also grant permission to read logs from the TrainJob's pods.
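As a sketch of what that could mean in RBAC terms: pod logs are served by the `pods/log` subresource in the core API group, so the role would need a rule along these lines. The two rules shown are assumptions for illustration, and the role's existing rules for the trainer resources are omitted.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-trainer-edit
rules:
  # Assumed rule: lets users find the pods created for a TrainJob.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  # Assumed rule: pods/log is the subresource behind `kubectl logs`.
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
```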
Signed-off-by: Xinmin Du <[email protected]>
New changes are detected. LGTM label has been removed.
Signed-off-by: Xinmin Du <[email protected]>
What this PR does / why we need it:
This PR adds the manifests overlay for Kubeflow Training V2, allowing it to be installed as part of the Kubeflow Platform.
Which issue(s) this PR fixes :
Fixes #2381