KEP-2170: Adding validation webhook for v2 trainjob #2307

Merged
merged 1 commit into from
Mar 16, 2025


akshaychitneni
Contributor

@akshaychitneni akshaychitneni commented Oct 24, 2024

Adds validation webhook for v2 trainjob.
Relates to #2209

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2209

Checklist:

  • Docs included if any changes are user facing

@akshaychitneni
Contributor Author

cc @tenzen-y @andreyvelich

@akshaychitneni akshaychitneni force-pushed the webhookv2 branch 5 times, most recently from 892a40b to f1a06c4 Compare October 25, 2024 16:36
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Oct 25, 2024
@akshaychitneni akshaychitneni force-pushed the webhookv2 branch 3 times, most recently from ce983eb to 736a759 Compare October 25, 2024 17:09
@coveralls

coveralls commented Oct 25, 2024

Pull Request Test Coverage Report for Build 11784298214

Details

  • 6 of 6 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11758410179: 0.0%
Covered Lines: 78
Relevant Lines: 78

💛 - Coveralls

Member

@tenzen-y tenzen-y left a comment

Thank you for taking this on and moving it forward.
And sorry for the delay.

Comment on lines 69 to 76
Namespace: new.Namespace,
Name: new.Spec.RuntimeRef.Name,
Member

Have you seen any issues when we use the old object names?

Member

Why do we get the new object here and not the old one?

Contributor Author

Here I am validating the updated object instead of the existing one.

@@ -140,3 +143,115 @@ func (j *JobSet) ReconcilerBuilders() []runtime.ReconcilerBuilder {
},
}
}

func (j *JobSet) Validate(oldObj, newObj *kubeflowv2.TrainJob, runtimeInfo *runtime.Info) (admission.Warnings, field.ErrorList) {
Member

It seems there are some conflicts between @andreyvelich's PR and this one.
@akshaychitneni Could you consult with @andreyvelich about which PR we should merge into main first?

Contributor Author

@akshaychitneni akshaychitneni Nov 6, 2024

I rebased onto @andreyvelich's changes.

@@ -31,7 +31,7 @@ func Setup(mgr ctrl.Manager, runtimes map[string]runtime.Runtime) (string, error
return kubeflowv2.TrainingRuntimeKind, err
}
if err := setupWebhookForTrainJob(mgr, runtimes); err != nil {
- return "TrainJob", err
+ return kubeflowv2.TrainJobKind, err
Member

Nice catch.

Member

Cool! This is the architecture I imagined during the design phase of my KubeflowJobPipeline framework.

failedCtrlName, err := controllerv2.SetupControllers(mgr, runtimes)
gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName)
gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty())
if startControllers {
Member

Have you seen any issues, such as null pointer errors, when we start the controllers for webhook testing?

Contributor Author

I think I have seen them, but we might not need to start the controllers just to validate create/update requests; we can leave reconciliation to the reconciler tests.

Member

That sounds reasonable. Thanks!
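The agreement in this thread can be sketched roughly as follows. This is an assumption-laden toy, not the project's test framework: `manager` and `setupTestManager` are hypothetical names, and registration is reduced to appending to slices. It only shows the shape of a `startControllers` switch that lets webhook tests skip the reconcilers.

```go
package main

import "fmt"

// manager is a hypothetical stand-in for a controller manager; it only
// records what was registered with it.
type manager struct {
	webhooks    []string
	controllers []string
}

// setupTestManager always registers the webhook and registers the
// controller only when startControllers is true. Webhook tests that just
// validate create/update requests can pass false and leave reconciliation
// to the reconciler tests.
func setupTestManager(startControllers bool) *manager {
	m := &manager{}
	m.webhooks = append(m.webhooks, "TrainJob")
	if startControllers {
		m.controllers = append(m.controllers, "TrainJob")
	}
	return m
}

func main() {
	webhookOnly := setupTestManager(false)
	fmt.Println(len(webhookOnly.webhooks), len(webhookOnly.controllers))
}
```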

Member

Interesting, maybe we should do the same for the Kueue/JobSet integration tests. cc @kannon92 @ahg-g

@akshaychitneni Why didn't we remove the controller start from here:

failedCtrlName, err := controller.SetupControllers(mgr, runtimes, ctrlpkg.Options{
// controller-runtime v0.19+ validates controller names are unique, to make sure
// exported Prometheus metrics for each controller do not conflict. The current check
// relies on static state that's not compatible with testing execution model.
// See the following resources for more context:
// https://github.com/kubernetes-sigs/controller-runtime/pull/2902#issuecomment-2284194683
// https://github.com/kubernetes-sigs/controller-runtime/issues/2994
SkipNameValidation: ptr.To(true),
})
gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName)
gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty())
?

@akshaychitneni akshaychitneni force-pushed the webhookv2 branch 2 times, most recently from 3737792 to 75caeeb Compare March 14, 2025 01:02
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 14, 2025
@akshaychitneni akshaychitneni force-pushed the webhookv2 branch 3 times, most recently from 8668adf to a93176f Compare March 14, 2025 17:00
@akshaychitneni
Contributor Author

@tenzen-y @andreyvelich thanks for your help reviewing this PR. I have addressed all comments. However, the e2e test with the notebook is failing because the job does not reach the complete state within the wait time: 2 of 3 pods remain in the Running state. Is this a known issue?

@andreyvelich
Member

For some reason, some of the pods are restarting after the master node is complete:

+ kubectl get pods
NAME                                  READY   STATUS      RESTARTS      AGE
t89ae0aa340f-trainer-node-0-0-mg98q   0/1     Completed   0             45s
t89ae0aa340f-trainer-node-0-1-fjwsr   1/1     Running     1 (13s ago)   45s
t89ae0aa340f-trainer-node-0-2-ffm9r   1/1     Running     1 (13s ago)   45s

I don't see the same behaviour in other PRs: https://github.com/kubeflow/trainer/actions/runs/13859046765/job/38782615386?pr=2521

Did you try running this Notebook locally on a Kind cluster with your changes?

Member

@andreyvelich andreyvelich left a comment

Thanks for this great contribution @akshaychitneni!
/lgtm
/assign @tenzen-y @astefanutti @Electronic-Waste

Member

@tenzen-y tenzen-y left a comment

Thank you for this!
I added a few small comments.

@akshaychitneni
Contributor Author

@tenzen-y I addressed your comments. Please take a look.

Member

@tenzen-y tenzen-y left a comment

Thank you for this great contribution!
/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Mar 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 250c116 into kubeflow:master Mar 16, 2025
15 checks passed
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Mar 16, 2025