-
Notifications
You must be signed in to change notification settings - Fork 752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Adding validation webhook for v2 trainjob #2307
Conversation
892a40b
to
f1a06c4
Compare
ce983eb
to
736a759
Compare
Pull Request Test Coverage Report for Build 11784298214Details
💛 - Coveralls |
f85da83
to
ba32e68
Compare
ba32e68
to
20136ef
Compare
20136ef
to
0aa9ee0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking this, and moving this forward.
And Sorry for the delay.
Namespace: new.Namespace, | ||
Name: new.Spec.RuntimeRef.Name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you ever seen the isseus when we use the old object names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we get new
object here and not old
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am validating updated object instead of the existing one
@@ -140,3 +143,115 @@ func (j *JobSet) ReconcilerBuilders() []runtime.ReconcilerBuilder { | |||
}, | |||
} | |||
} | |||
|
|||
func (j *JobSet) Validate(oldObj, newObj *kubeflowv2.TrainJob, runtimeInfo *runtime.Info) (admission.Warnings, field.ErrorList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that there are some conflicts between @andreyvelich PR and this.
@akshaychitneni Could you consult with @andreyvelich, then which PRs should we merge into the main, first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased with @andreyvelich's changes
pkg/webhook.v2/setup.go
Outdated
@@ -31,7 +31,7 @@ func Setup(mgr ctrl.Manager, runtimes map[string]runtime.Runtime) (string, error | |||
return kubeflowv2.TrainingRuntimeKind, err | |||
} | |||
if err := setupWebhookForTrainJob(mgr, runtimes); err != nil { | |||
return "TrainJob", err | |||
return kubeflowv2.TrainJobKind, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch.
pkg/webhook.v2/trainjob_webhook.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! This is what I imagined architechture in my KubeflowJobPipeline framework design phase.
failedCtrlName, err := controllerv2.SetupControllers(mgr, runtimes) | ||
gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName) | ||
gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty()) | ||
if startControllers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you ever seen any issues like null pointer when we start the controllers for webhook testing, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have seen but we might not need to start the controllers just to validate create/update requests and leave to reconciler tests to cover reconciliation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, maybe we should do the same for Kueue/JobSet integration tests cc @kannon92 @ahg-g
@akshaychitneni Why we didn't remove controller start from here:
trainer/test/integration/framework/framework.go
Lines 105 to 115 in c1fc937
failedCtrlName, err := controller.SetupControllers(mgr, runtimes, ctrlpkg.Options{ | |
// controller-runtime v0.19+ validates controller names are unique, to make sure | |
// exported Prometheus metrics for each controller do not conflict. The current check | |
// relies on static state that's not compatible with testing execution model. | |
// See the following resources for more context: | |
// https://github.com/kubernetes-sigs/controller-runtime/pull/2902#issuecomment-2284194683 | |
// https://github.com/kubernetes-sigs/controller-runtime/issues/2994 | |
SkipNameValidation: ptr.To(true), | |
}) | |
gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName) | |
gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty()) |
0aa9ee0
to
1b675c5
Compare
a3ea261
to
4b4d76e
Compare
3737792
to
75caeeb
Compare
8668adf
to
a93176f
Compare
@tenzen-y @andreyvelich thanks for your help with reviewing this PR. I have addressed all comments. However, I see e2e with notebook is failing due to job not moving to complete state within wait time as I see 2 of 3 pods remain in running state. Is it a known issue? |
For some reason, some of the pods are restarting after the master node is complete:
I don't see same behaviour in other PRs: https://github.com/kubeflow/trainer/actions/runs/13859046765/job/38782615386?pr=2521 Did you try to run this Notebook locally on Kind cluster with your changes ? |
a93176f
to
94b0937
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this great contribution @akshaychitneni!
/lgtm
/assign @tenzen-y @astefanutti @Electronic-Waste
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this!
I added small comments.
Signed-off-by: Akshay Chitneni <[email protected]>
94b0937
to
aba3e4b
Compare
@tenzen-y addressed your comments. Please take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great contribution!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Akshay Chitneni <[email protected]> Co-authored-by: Akshay Chitneni <[email protected]>
Adds validation webhook for v2 trainjob.
Relates to #2209
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes # #2209
Checklist: