-
Notifications
You must be signed in to change notification settings - Fork 724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement webhook validations for the PyTorchJob #2035
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
2396aeb
to
56c3669
Compare
Pull Request Test Coverage Report for Build 8634700799Details
💛 - Coveralls |
5812cb8
to
91dbff2
Compare
3ab0cb7
to
ffddf1b
Compare
trainingoperator.TFJobKind: scaffold, | ||
trainingoperator.MXJobKind: scaffold, | ||
trainingoperator.XGBoostJobKind: scaffold, | ||
trainingoperator.MPIJobKind: scaffold, | ||
trainingoperator.PaddleJobKind: scaffold, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to implement these webhooks in other PRs so that we can keep this PR minimal required.
ffddf1b
to
e02439b
Compare
/hold |
/assign @andreyvelich @johnugeorge |
e02439b
to
dab5c39
Compare
dab5c39
to
f2d8404
Compare
@andreyvelich @johnugeorge I addressed all comments. PTAL! |
apiVersion: v1 | ||
kind: Secret | ||
metadata: | ||
name: training-operator-webhook-secret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to keep it in the Training Operator install similar to Katib: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L38-L41 ?
In the future releases when we make the integration with Cert Manager for Kubeflow Install, we can remove this secret generator from Kubeflow Kustomize Overlay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to keep this secret since the cert-controller doesn't create any Secret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I understand that. My suggestion is instead of having separate folder: /internalcert
, update the standalone and Kubeflow Kustomize installs to generate that secret as follows:
secretGenerator:
- name: training-operator-webhook-secret
options:
disableNameSuffixHash: true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see.
That makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@tenzen-y Do you prefer to introduce Controller Runtime logger for webhook in separate PR based on our discussion here: #2035 (comment) ? |
@andreyvelich Yes I do since We need to remove many loggers for the replacements. |
b5e23d0
to
a05ab1a
Compare
@andreyvelich @johnugeorge I believe that this PR is ready for the merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tenzen-y!
Just a small comment from me.
/assign @johnugeorge
manifests/base/webhook/service.yaml
Outdated
@@ -8,12 +7,16 @@ metadata: | |||
prometheus.io/port: "8080" | |||
labels: | |||
app: training-operator | |||
name: training-operator | |||
name: training-operator-service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Do you need to move this service to the webhook and name it as training-operator-service
?
Why we can't just keep service name as it is in /manifests/base/service.yaml
and just add another 9443 port ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like what we do in Katib: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/service.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is a good catch!
At the some points, we needed to move the Service to the webhook directory to apply the kustomize patch, but currently, moving is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
a05ab1a
to
bff5b56
Compare
manifests/base/service.yaml
Outdated
@@ -8,12 +7,16 @@ metadata: | |||
prometheus.io/port: "8080" | |||
labels: | |||
app: training-operator | |||
name: training-operator | |||
name: training-operator-service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this renaming ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on this discussion, we changed this name. So, would you prefer to keep using the training-operator
?
I'm ok with either name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, I think I made a mistake there, tbh we should keep this name as training-operator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries :)
I reverted this change.
df00cc5
to
e452469
Compare
# TODO (tenzen-y): Once we support cert-manager, we need to remove this secret generation. | ||
# REF: https://github.com/kubeflow/training-operator/issues/2049 | ||
secretGenerator: | ||
- name: training-operator-webhook-secret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the secret name, would it be better to just name it: training-operator-webhook-cert
. Similar to Katib and not add the Kubernetes resource name (e.g. secret
) to the name ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you already mentioned here: #2035 (comment)
Secret name: training-operator-webhook-cert
I'll try to refine this name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Thanks @tenzen-y for this contribution |
Signed-off-by: Yuki Iwai <[email protected]>
e452469
to
3f15cfc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this @tenzen-y!
/lgtm
/assign @kubeflow/wg-training-leads
/hold cancel |
@andreyvelich @johnugeorge Thank you for the review! |
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
What this PR does / why we need it:
I implemented webhook validations for the PyTorchJob.
Additionally, I didn't add any additional validations. The traininig-operator has the same validations the same as before.
Note that as the first iteration, I didn't support cert-manager in this PR. Maybe we can support the release after the next release.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Part-of #1993
Checklist: