Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
84 changes: 84 additions & 0 deletions manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-trainer-admin
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-trainer-admin: "true"
rules: []

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-trainer-edit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we also should have permission to read logs from TrainJob's pods.

labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-trainer-admin: "true"
rules:
- apiGroups:
- kubeflow.org
resources:
- clustertrainingruntimes
- trainingruntimes
- trainjobs
verbs:
- create
- delete
- get
- list
- patch
- update
- watch
- apiGroups:
- kubeflow.org
resources:
- trainjobs/status
verbs:
- get
- apiGroups:
- ""
resources:
- persistentvolumeclaims
verbs:
- create
- delete
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- get
- list
- watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-trainer-view
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
rules:
- apiGroups:
- kubeflow.org
resources:
- clustertrainingruntimes
- trainingruntimes
- trainjobs
verbs:
- get
- list
- watch
- apiGroups:
- kubeflow.org
resources:
- trainjobs/status
verbs:
- get
23 changes: 23 additions & 0 deletions manifests/overlays/kubeflow-platform/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
- ../../base/runtimes/pretraining
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtimes must be installed after controller-manager is ready.
Are we ok with that ?

Copy link
Contributor

@astefanutti astefanutti Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear this is going to be "annoying" for every / a lot of downstream projects.

Maybe a "sub-optimal" solution would be to create the preset / in-tree runtimes before the webhooks, similar to what the default "legacy" sorting does.
But the webhooks will have to be forward compatible during upgrades.

Otherwise we move towards CEL based validation or ValidatingAdmissionPolicy, which may be more appropriate/suitable for (Cluster)TrainingRuntime resources.

Copy link
Member

@andreyvelich andreyvelich Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a "sub-optimal" solution would be to create the preset / in-tree runtimes before the webhooks, similar to what the default "legacy" sorting does.

Unfortunately, even with sorting Kustomize doesn't wait for resource probs before deploying the next resources.

Copy link
Contributor

@astefanutti astefanutti Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no, I meant creating the runtimes before the webhooks, and make the fairly plausible assumption that the in-tree runtimes are "valid".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see.
That might work. @tenzen-y @Electronic-Waste Any thoughts ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I try to solve it by modifying the order:

resources:
  - ../../base/crds
  - ../../base/manager
  - ../../base/rbac
  - ../../base/runtimes/pretraining
  - ../../base/webhook
  - ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
  - kubeflow-trainer-roles.yaml

not sure if this will deploy as expected. Look forward to your further guidance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be safer to set the sorting strategy to FIFO:

kind: Kustomization
sortOptions:
  order: fifo

- ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
- kubeflow-trainer-roles.yaml

# Update the Kubeflow Trainer controller manager image tag.
images:
- name: kubeflow/trainer-controller-manager
newTag: latest

# Secret for the Kubeflow Training webhook.
secretGenerator:
- name: kubeflow-trainer-webhook-cert
namespace: kubeflow
options:
disableNameSuffixHash: true
Loading