Update DRA testing to stable API version and prepare it to test more types of drivers #3641
Conversation
Hi @emerbe. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @mortent
Force-pushed 71bf9f6 to 785b4f7
/cc @alaypatel07 I can help with reviews here
Great, @alaypatel07! I was planning to ask you for a review after we finish the initial round with Morten. Feel free to review it.
Force-pushed 785b4f7 to dbe248b
Hello @alaypatel07, could you please take a look?
	return true, nil
}

func getReadyNodesCount(config *dependency.Config) (int, error) {
Is there a reason why we need this? I think this check will be erroneous.
Just because a node is not ready does not necessarily imply the driver pod on that node is not running.
I've changed it based on my tests.
The previous check often failed, especially at large scale, because ResourceSlices had not been created for NotReady nodes, so the count returned by GetClientSets().GetClient().ResourceV1().ResourceSlices() was not equal to workerCount and the test didn't start.
> The previous check often failed, especially at large scale, because ResourceSlices had not been created for NotReady nodes.

I understand the issue, but is this a reliable check for it?
When the node was NotReady, was there a DRA driver plugin pod running on it?
Instead of the node count, can we check whether the ResourceSlice count equals the driver plugin pod count?
Both ways would work, I believe.
Changed as you suggested, PTAL
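For reference, a rough sketch of the suggested check (not the code from this PR), assuming a client-go clientset with the GA resource.k8s.io/v1 client; the dra-example-driver namespace and label selector are placeholders for illustration:

```go
package dra

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// driverReady compares the number of published ResourceSlices with the number
// of running driver plugin pods, instead of relying on the ready-node count.
// The namespace and label selector below are made-up values for illustration.
func driverReady(ctx context.Context, client kubernetes.Interface) (bool, error) {
	slices, err := client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}

	pods, err := client.CoreV1().Pods("dra-example-driver").List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=dra-example-driver",
	})
	if err != nil {
		return false, err
	}

	running := 0
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodRunning {
			running++
		}
	}

	// Ready once every running plugin pod has published its ResourceSlice.
	return running > 0 && len(slices.Items) == running, nil
}
```

This assumes the driver publishes one ResourceSlice per node; a driver that publishes several slices per node would need a looser comparison.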
clusterloader2/pkg/dependency/dra/manifests/dra-example-driver/deviceclass.yaml (resolved)
Force-pushed dbe248b to 49e3932
clusterloader2/testing/dra/job.yaml (outdated)

-  ttlSecondsAfterFinished: 300
+  # In tests involving a large number of sequentially created, short-lived jobs, the spin-up time may be significant.
+  # A TTL of 1 hour should be sufficient to retain the jobs long enough for measurement checks.
+  ttlSecondsAfterFinished: 3600 # 1 hour
Which measurement depends on this config?
The measurement that failed was WaitForFinishedJobs with job-type = short-lived.
It failed when I was running those tests at 5k-node scale.
My understanding of the problem is:
- job.yaml has ttlSecondsAfterFinished set to 300 seconds, so 300s after a job completes, Kubernetes automatically deletes it.
- There are 10 jobs created sequentially.
- I've checked, and it takes around 10 minutes for all of them to complete.
- As the first jobs complete, the ttlSecondsAfterFinished timer starts. Since this timer is shorter than the total time it takes for all jobs to be created and finished, the initial jobs are deleted before the final jobs are even created (with a 300s TTL, the first jobs are gone roughly 5 minutes in, while the last of the 10 jobs only finishes around the 10-minute mark).

I suspect that creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running. The measurement eventually times out and fails because it can't find the jobs it's looking for.
I've modified the test to increase ttlSecondsAfterFinished to 3600 seconds, which made the test pass.
I can also parametrize this and default it to 300, but I didn't see any harm in increasing it.
> I suspect that creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running.

I think we should remove this change here and take it to the PR that modifies the cl2/testing/* files. I would be curious to see whether WaitForFinishedJobs can account for deleted jobs that have completed; all it should care about is that the jobs that are present are in a Finished state.
OK, moved to the second PR.
/ok-to-test
clusterloader2/pkg/dependency/dra/manifests/dra-example-driver/deviceclass.yaml (resolved)
Added a couple of comments: one about the assertion on the ResourceSlice count and a second about the TTL config change.
Other things look good to me, thanks @emerbe
Force-pushed 49e3932 to 5e6b282
Force-pushed 5e6b282 to 7673d03
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07, emerbe, mortent
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
/assign @mborsz
PTAL when you get a chance
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR updates DRA testing to use the stable v1 API, since it is available in Kubernetes 1.34.
It also modifies the test logic a bit so that it is parametrized, to simplify running the test with different drivers.
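For illustration only (not code from this PR), a minimal sketch of what the GA API looks like from Go, assuming the Kubernetes 1.34 client libraries; the class name and CEL expression are made-up examples, not values used by the test:

```go
package main

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleDeviceClass builds a DeviceClass using the GA resource.k8s.io/v1
// types; the object shape is the same as in the earlier beta packages.
// The name and CEL expression are placeholders, not values from this PR.
func exampleDeviceClass() *resourceapi.DeviceClass {
	return &resourceapi.DeviceClass{
		ObjectMeta: metav1.ObjectMeta{Name: "example-gpu"},
		Spec: resourceapi.DeviceClassSpec{
			Selectors: []resourceapi.DeviceSelector{
				{CEL: &resourceapi.CELDeviceSelector{
					Expression: `device.driver == "gpu.example.com"`,
				}},
			},
		},
	}
}

func main() {
	fmt.Println(exampleDeviceClass().Name)
}
```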
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: