
Conversation

@emerbe
Contributor

@emerbe emerbe commented Oct 13, 2025

…types of drivers

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR updates the DRA tests to use the v1 stable API, since DRA is available as stable in 1.34.
It also parametrizes the test logic so it is easier to run the test with different drivers.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 13, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 13, 2025
@k8s-ci-robot
Contributor

Hi @emerbe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2025
@emerbe
Contributor Author

emerbe commented Oct 13, 2025

/assign @mortent

@emerbe emerbe force-pushed the dra-update-to-stable-api branch 2 times, most recently from 71bf9f6 to 785b4f7 on October 13, 2025 09:23
@alaypatel07
Contributor

/cc @alaypatel07

I can help with reviews here

@emerbe
Contributor Author

emerbe commented Oct 14, 2025

/cc @alaypatel07

I can help with reviews here

Great, @alaypatel07! I was planning to ask you for a review after we finish the initial round with Morten.

Feel free to review it.

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 785b4f7 to dbe248b on October 14, 2025 07:37
@emerbe emerbe requested a review from mortent October 15, 2025 08:11
@emerbe
Contributor Author

emerbe commented Oct 17, 2025

Hello, @alaypatel07 could you please take a look?
I'd appreciate it.

return true, nil
}

func getReadyNodesCount(config *dependency.Config) (int, error) {
Contributor

Is there a reason why we need this? I think this check will be erroneous.

Just because a node is not ready does not necessarily imply the driver pod on that node is not running.

Contributor Author

I've changed it based on my tests.

The previous check often failed, especially at large scale, because ResourceSlices are not created for NotReady nodes.

As a result, the count returned by GetClientSets().GetClient().ResourceV1().ResourceSlices() was not equal to workerCount and the test never started.
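
For context, here is a minimal sketch of what a readiness-based check along these lines could look like, assuming client-go; the clientset parameter stands in for the dependency.Config accessor used by the actual test code, and the package and helper names are only illustrative:

package dratest

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getReadyNodesCount lists all nodes and counts those whose NodeReady
// condition is True; the ResourceSlice count would then be compared against
// this number instead of the raw worker count.
func getReadyNodesCount(ctx context.Context, client kubernetes.Interface) (int, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	ready := 0
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready, nil
}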

Contributor

Previous check often failed, especially on the big scale as resourceSlices have not been created for not ready nodes.

I understand the issue, but is this a reliable check for it?

When the node was NotReady, was there a DRA driver plugin pod running on it?

Instead of the node count, can we check whether the ResourceSlice count equals the driver plugin pod count?

Contributor Author

I believe both approaches would work.

Changed as you suggested, PTAL.
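
A rough sketch of that comparison, assuming client-go built against Kubernetes 1.34 (so ResourceV1() is available) and a hypothetical driver namespace and label selector; this is not the exact code in the PR:

package dratest

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resourceSlicesMatchDriverPods compares the number of published
// ResourceSlices (resource.k8s.io/v1) with the number of running DRA driver
// plugin pods. The namespace and label selector below are placeholders.
func resourceSlicesMatchDriverPods(ctx context.Context, client kubernetes.Interface) (bool, error) {
	slices, err := client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	pods, err := client.CoreV1().Pods("dra-example-driver").List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=dra-example-driver", // hypothetical selector
	})
	if err != nil {
		return false, err
	}
	running := 0
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return len(slices.Items) == running, nil
}

The waiting logic would then poll this predicate until it returns true (or a timeout is hit) before the measurement phase starts.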

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from dbe248b to 49e3932 on October 23, 2025 11:47
ttlSecondsAfterFinished: 300
# In tests involving a large number of sequentially created, short-lived jobs, the spin-up time may be significant.
# A TTL of 1 hour should be sufficient to retain the jobs long enough for measurement checks.
ttlSecondsAfterFinished: 3600 # 1 hour
Contributor

Which measurement depends on this config?

Contributor Author

The measurement that failed was WaitForFinishedJobs with job-type = short-lived.
It failed when I was running these tests at 5k-node scale.

My understanding of the problem is:

  • job.yaml sets ttlSecondsAfterFinished to 300 seconds, so Kubernetes automatically deletes a job 300s after it completes.
  • There are 10 jobs created sequentially.
  • I've checked, and it takes around 10 minutes for all of them to complete.
  • As the first jobs complete, their ttlSecondsAfterFinished timers start. Since this timer is shorter than the total time it takes for all jobs to be created and finish, the initial jobs are deleted before the final jobs are even created.

I suspect this creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are deleted while others are still pending or running. The measurement eventually times out and fails because it can't find the jobs it's looking for.

I've increased ttlSecondsAfterFinished to 3600 seconds, which made the test pass.

I could also parametrize this and default it to 300, but I didn't see any harm in increasing it.
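
To make the timing argument concrete, here is a small back-of-the-envelope sketch; the spin-up gap and per-job runtime are assumed values chosen to match the ~10-minute total described above, not measurements from the test:

package main

import (
	"fmt"
	"time"
)

// Illustrates the TTL race: under these assumptions a 300s TTL deletes the
// first job before the last job finishes, while a 3600s TTL outlives the run.
func main() {
	const (
		numJobs     = 10
		spinUp      = 60 * time.Second  // assumed gap between sequential job creations
		jobDuration = 60 * time.Second  // assumed per-job runtime
		ttlOld      = 300 * time.Second // original ttlSecondsAfterFinished
		ttlNew      = 3600 * time.Second
	)
	// Job i is created at i*spinUp and runs for jobDuration; a finished job is
	// garbage-collected ttl seconds after it completes.
	firstFinished := jobDuration
	lastFinished := (numJobs-1)*spinUp + jobDuration
	firstDeletedOld := firstFinished + ttlOld
	firstDeletedNew := firstFinished + ttlNew
	fmt.Printf("last job finishes at ~%v\n", lastFinished)
	fmt.Printf("ttl=300s:  first job deleted at ~%v (before last finishes: %v)\n",
		firstDeletedOld, firstDeletedOld < lastFinished)
	fmt.Printf("ttl=3600s: first job deleted at ~%v (before last finishes: %v)\n",
		firstDeletedNew, firstDeletedNew < lastFinished)
}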

Contributor

I suspect that creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished" because some jobs are being deleted while others are still in a pending or running state.

I think we should remove this change here and take it to the PR that modifies the cl2/testing/* files. I would be curious to see whether WaitForFinishedJobs can account for deleted jobs that have already completed. All it should care about is that the jobs which are still present are in a finished state.

Contributor Author

OK, moved to the second PR.

@alaypatel07
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
Contributor

@alaypatel07 alaypatel07 left a comment

Added a couple of comments: one about the assertion on the ResourceSlice count and a second about the TTL config change.

Other things look good to me, thanks @emerbe

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 49e3932 to 5e6b282 on October 24, 2025 17:32
@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 5e6b282 to 7673d03 on October 24, 2025 17:35
@alaypatel07
Contributor

Thanks @emerbe, the changes look good here.

/lgtm
/hold for #3629 since this PR has an urgent deadline

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07, emerbe, mortent
Once this PR has been reviewed and has the lgtm label, please assign marseel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alaypatel07
Contributor

/assign @mborsz

PTAL when you get a chance

