KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382
Changes from 7 commits
a46294f
963bfbc
1100986
a038ef2
1f1b0c2
6675d93
6509239
7a33c55
95788eb
32a3da6
@@ -0,0 +1,84 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-trainer-admin
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.authorization.kubeflow.org/aggregate-to-kubeflow-trainer-admin: "true"
rules: []

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-trainer-edit
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-trainer-admin: "true"
rules:
  - apiGroups:
      - kubeflow.org
    resources:
      - clustertrainingruntimes
      - trainingruntimes
      - trainjobs
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - kubeflow.org
    resources:
      - trainjobs/status
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
    verbs:
      - create
      - delete
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - get
      - list
      - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-trainer-view
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
rules:
  - apiGroups:
      - kubeflow.org
    resources:
      - clustertrainingruntimes
      - trainingruntimes
      - trainjobs
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - kubeflow.org
    resources:
      - trainjobs/status
    verbs:
      - get

Doris-xm marked the review conversation on the persistentvolumeclaims rule as resolved.

@@ -0,0 +1,23 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  - ../../base/crds
  - ../../base/manager
  - ../../base/rbac
  - ../../base/webhook
  - ../../base/runtimes/pretraining

Review comments on this line:

The runtimes must be installed after controller-manager is ready.

I fear this is going to be annoying for many downstream projects. A sub-optimal solution might be to create the preset (in-tree) runtimes before the webhooks, similar to what the default "legacy" sorting does. Otherwise we could move towards CEL-based validation or ValidatingAdmissionPolicy, which may be more suitable for (Cluster)TrainingRuntime resources.

Unfortunately, even with sorting, Kustomize doesn't wait for resource probes before deploying the next resources.

Ah no, I meant creating the runtimes before the webhooks, and making the fairly plausible assumption that the in-tree runtimes are "valid".

Oh, I see.

Currently, I am trying to solve it by modifying the order:

    resources:
      - ../../base/crds
      - ../../base/manager
      - ../../base/rbac
      - ../../base/runtimes/pretraining
      - ../../base/webhook
      - ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
      - kubeflow-trainer-roles.yaml

I'm not sure whether this will deploy as expected. I look forward to your further guidance.

It might be safer to set the sorting strategy to FIFO:

    kind: Kustomization
    sortOptions:
      order: fifo

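The ValidatingAdmissionPolicy alternative mentioned in the thread could look roughly like the sketch below. It is an illustration only and not part of this PR: the policy and binding names and the `has(object.spec)` expression are hypothetical placeholders, and it assumes a cluster that serves `admissionregistration.k8s.io/v1` (Kubernetes 1.30 or later). A policy like this is evaluated in-process by the API server, so it does not depend on the controller manager's webhook endpoint being ready.

```yaml
# Hypothetical sketch: CEL-based validation for the runtime resources.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: trainingruntime-basic-validation # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["kubeflow.org"]
        apiVersions: ["*"] # avoids assuming a specific API version
        operations: ["CREATE", "UPDATE"]
        resources: ["clustertrainingruntimes", "trainingruntimes"]
  validations:
    # Placeholder check; real rules would mirror the webhook's validation logic.
    - expression: "has(object.spec)"
      message: "spec must be set"
---
# A policy has no effect until a binding references it.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: trainingruntime-basic-validation-binding # hypothetical name
spec:
  policyName: trainingruntime-basic-validation
  validationActions: ["Deny"]
```
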
  - ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
  - kubeflow-trainer-roles.yaml

# Update the Kubeflow Trainer controller manager image tag.
images:
  - name: kubeflow/trainer-controller-manager
    newTag: latest

# Secret for the Kubeflow Training webhook.
secretGenerator:
  - name: kubeflow-trainer-webhook-cert
    namespace: kubeflow
    options:
      disableNameSuffixHash: true
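
Combining the two suggestions from the review thread on this file (list the runtimes before the webhook, and set the sort order to FIFO) might give a `resources` section like the sketch below. This is an assumption about how the suggestions fit together, not the final content of the PR. With `order: fifo`, Kustomize emits resources in the order they are listed instead of sorting by kind, so anything that others depend on, such as the CRDs, has to come first. As noted in the thread, this only controls output order; Kustomize does not wait for any resource to become ready.

```yaml
# Sketch only: reordered overlay with FIFO sorting, per the review thread.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
sortOptions:
  order: fifo # keep the listed order instead of the default kind-based sorting
resources:
  - ../../base/crds
  - ../../base/manager
  - ../../base/rbac
  - ../../base/runtimes/pretraining # listed before the webhook on purpose
  - ../../base/webhook
  - ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
  - kubeflow-trainer-roles.yaml
```
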
I think we should also have permission to read logs from the TrainJob's pods.
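
One way to grant this is a rule on the core `pods/log` subresource, sketched below as an addition to the `rules:` list of a trainer ClusterRole. Which roles should carry the rule, and whether read access to `pods` is added alongside it, are assumptions made here for illustration and not decisions taken in the PR.

```yaml
# Hypothetical addition to the `rules:` list of a trainer ClusterRole.
- apiGroups:
    - ""
  resources:
    - pods # locate the Pods that belong to a TrainJob
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - ""
  resources:
    - pods/log # read container logs
  verbs:
    - get
```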