Update DRA testing to stable API version and prepare it to test more types of drivers #3641
Conversation
Hi @emerbe. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @mortent
Force-pushed 71bf9f6 to 785b4f7
/cc @alaypatel07 I can help with reviews here
Great, @alaypatel07! I was planning to ask you for a review after we finish the initial round with Morten. Feel free to review it.
Force-pushed 785b4f7 to dbe248b
Hello @alaypatel07, could you please take a look?
	return true, nil
}

func getReadyNodesCount(config *dependency.Config) (int, error) {
Is there a reason why we need this? I think this check will be erroneous.
Just because a node is not ready does not necessarily imply the driver pod on that node is not running.
I've changed it based on my tests.
The previous check often failed, especially at large scale, because ResourceSlices had not been created for NotReady nodes, so the count returned by GetClientSets().GetClient().ResourceV1().ResourceSlices() was not equal to workerCount and the test didn't start.
> The previous check often failed, especially at large scale, because ResourceSlices had not been created for NotReady nodes.

I understand the issue, but is this a reliable check for it?
When the node was NotReady, was there a DRA driver plugin pod running on it?
Instead of the node count, can we check whether the ResourceSlice count equals the driver plugin pod count?
Both ways would work, I believe.
Changed as you suggested, PTAL
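For reference, a rough sketch of the suggested check (not the code from this PR), assuming a client-go clientset with the GA resource.k8s.io/v1 client; the dra-example-driver namespace and label selector are placeholders for illustration:

```go
package dra

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// driverReady compares the number of published ResourceSlices with the number
// of running driver plugin pods, instead of relying on the ready-node count.
// The namespace and label selector below are made-up values for illustration.
func driverReady(ctx context.Context, client kubernetes.Interface) (bool, error) {
	slices, err := client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}

	pods, err := client.CoreV1().Pods("dra-example-driver").List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=dra-example-driver",
	})
	if err != nil {
		return false, err
	}

	running := 0
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodRunning {
			running++
		}
	}

	// Ready once every running plugin pod has published its ResourceSlice.
	return running > 0 && len(slices.Items) == running, nil
}
```

This assumes the driver publishes one ResourceSlice per node; a driver that publishes several slices per node would need a looser comparison.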
clusterloader2/pkg/dependency/dra/manifests/dra-example-driver/deviceclass.yaml (resolved)
Force-pushed dbe248b to 49e3932
clusterloader2/testing/dra/job.yaml (outdated)

-  ttlSecondsAfterFinished: 300
+  # In tests involving a large number of sequentially created, short-lived jobs, the spin-up time may be significant.
+  # A TTL of 1 hour should be sufficient to retain the jobs long enough for measurement checks.
+  ttlSecondsAfterFinished: 3600 # 1 hour
Which measurement depends on this config?
The measurement that failed was WaitForFinishedJobs with job-type = short-lived.
It failed when I was running those tests at 5k-node scale.
My understanding of the problem is:
- job.yaml has ttlSecondsAfterFinished set to 300 seconds, so 300s after a job completes, Kubernetes automatically deletes it.
- There are 10 jobs created sequentially.
- I've checked, and it takes around 10 minutes for all of them to complete.
- As the first jobs complete, the ttlSecondsAfterFinished timer starts. Since this timer is shorter than the total time it takes for all jobs to be created and finished, the initial jobs are deleted before the final jobs are even created (with a 300s TTL, the first jobs are gone roughly 5 minutes in, while the last of the 10 jobs only finishes around the 10-minute mark).

I suspect that creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running. The measurement eventually times out and fails because it can't find the jobs it's looking for.
I've modified the test to increase ttlSecondsAfterFinished to 3600 seconds, which made the test pass.
I can also parametrize this and default it to 300, but I didn't see any harm in increasing it.
> I suspect that creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running.

I think we should remove this change here and take it to the PR that modifies the cl2/testing/* files. I would be curious to see whether WaitForFinishedJobs can account for deleted jobs that have completed; all it should care about is that the jobs that are present are in a Finished state.
OK, moved to the second PR.
/ok-to-test
clusterloader2/pkg/dependency/dra/manifests/dra-example-driver/deviceclass.yaml (resolved)
Added a couple of comments: one about the assertion on the ResourceSlice count and a second about the TTL config change.
Other things look good to me, thanks @emerbe
Force-pushed 49e3932 to 5e6b282
Force-pushed 5e6b282 to 7673d03
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07, emerbe, mortent
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
/assign @mborsz
PTAL when you get a chance
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR updates DRA testing to use the stable v1 API, since it is available in Kubernetes 1.34.
It also modifies the test logic a bit so that it is parametrized, to simplify running the test with different drivers.
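For illustration only (not code from this PR), a minimal sketch of what the GA API looks like from Go, assuming the Kubernetes 1.34 client libraries; the class name and CEL expression are made-up examples, not values used by the test:

```go
package main

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleDeviceClass builds a DeviceClass using the GA resource.k8s.io/v1
// types; the object shape is the same as in the earlier beta packages.
// The name and CEL expression are placeholders, not values from this PR.
func exampleDeviceClass() *resourceapi.DeviceClass {
	return &resourceapi.DeviceClass{
		ObjectMeta: metav1.ObjectMeta{Name: "example-gpu"},
		Spec: resourceapi.DeviceClassSpec{
			Selectors: []resourceapi.DeviceSelector{
				{CEL: &resourceapi.CELDeviceSelector{
					Expression: `device.driver == "gpu.example.com"`,
				}},
			},
		},
	}
}

func main() {
	fmt.Println(exampleDeviceClass().Name)
}
```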
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: