
Patches based implementation for DRA snapshot. #8090


Open

wants to merge 2 commits into master

Conversation

mtrqq
Contributor

@mtrqq mtrqq commented May 5, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR improves the performance of the DRA snapshot, which directly impacts scheduling simulation speed and the cluster-autoscaler decision-making process overall. It changes the snapshot's state management from a deep-copy-based approach to a patch-based one.
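
At a high level the idea looks roughly like the sketch below (simplified; the field and method names in the actual change may differ):

package snapshot

// patch is a single layer of changes applied on top of the layers below it.
type patch[K comparable, V any] struct {
	Modified map[K]V
	Deleted  map[K]bool
}

// patchSet stores a stack of patches instead of full deep copies of the state.
type patchSet[K comparable, V any] struct {
	patches []*patch[K, V]
}

// Fork pushes a new empty patch; every write after this point lands only in that layer,
// so Revert can simply drop the top layer and Commit can merge it into the one below.
func (ps *patchSet[K, V]) Fork() {
	ps.patches = append(ps.patches, &patch[K, V]{
		Modified: map[K]V{},
		Deleted:  map[K]bool{},
	})
}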

Which issue(s) this PR fixes:

Fixes #7681

Special notes for your reviewer:

This PR removes the original dynamicresources.Snapshot implementation and replaces it with the patch-based approach. We could keep the original implementation for safety and for the ability to switch store implementations in a running cluster-autoscaler, but that would require maintaining two implementations. I attempted to use the clone-based Snapshot as a baseline for the new changes, but it only resulted in complex code while yielding minimal benefit.

The change includes a benchmark test which uses an exaggerated scheduling scenario to compare the performance of the two implementations. The patch-based option is roughly 50x faster in overall runtime while allocating 40x less memory on the heap, primarily because Fork/Commit/Revert operations are used very heavily in the suite.

Here are a few profiling insights into the differences:

CPU Profile / Forking

CPU Profile / GC

Memory Profile / Allocated Space

Memory Profile / Allocated Objects

Grab a copy of profiling samples -> Profiles.zip

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler labels May 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 5, 2025
@mtrqq mtrqq marked this pull request as ready for review May 5, 2025 11:46
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 5, 2025 11:46
@mtrqq
Contributor Author

mtrqq commented May 5, 2025

/assign towca

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mtrqq
Once this PR has been reviewed and has the lgtm label, please ask for approval from towca. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
merged[key] = value
}

for key := range patch.Deleted {
Contributor

Is this meant to address map access race conditions between:

  1. enumerate through the set of patches and begin copying into a new, merged map
  2. one or more patches in the set begin a delete operation
func (p *patch[K, V]) Delete(key K) {
	p.Deleted[key] = true
	delete(p.Modified, key)
}
  1. ensure that our source patch wasn't copied from a state in between the 1st of the above 2 statements by double-checking the same key against any corresponding existence in the Deleted map

?

(If so have we considered the tradeoffs of inverting the order of operations in the Delete() method?)

Contributor Author

I would say that Snapshot is not thread-safe in general, but that's a fair call - I'll change the order of operations in the Delete() method so that it becomes consistent with the other data manipulation functions. If we need it to be truly suitable for concurrent usage (and based on the simulations code, we don't), we'd probably need to use sync.Map to actually store the data; in the current implementation we use bare maps just for the sake of the performance gain.

The reason we account for Deleted keys here is to handle situations like the following:

Prerequisite: Type in the example -> PatchSet[int, int]

Patch#1: Modified: {1: 1, 2: 2, 3: 3}, Deleted: {}
Patch#2: Modified: {}, Deleted: {1, 2}
Patch#3: Modified: {1: 5}, Deleted: {}

The result of the AsMap() call on the PatchSet holding these 3 patches should be: {1: 5, 3: 3}, because keys 1 and 2 are getting deleted in the second patch, but key 1 is getting reintroduced in Patch#3

If I misunderstood your comment - LMK
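
For reference, the scenario above maps onto a unit test along these lines (a sketch only; newPatchSet and Set are assumed names, not necessarily the ones in this PR):

package snapshot

import (
	"reflect"
	"testing"
)

// Sketch of a test for the delete-then-reintroduce scenario described above.
func TestAsMapDeleteThenReintroduce(t *testing.T) {
	ps := newPatchSet[int, int]()
	ps.Set(1, 1)
	ps.Set(2, 2)
	ps.Set(3, 3)

	ps.Fork()
	ps.Delete(1)
	ps.Delete(2)

	ps.Fork()
	ps.Set(1, 5)

	want := map[int]int{1: 5, 3: 3}
	if got := ps.AsMap(); !reflect.DeepEqual(got, want) {
		t.Errorf("AsMap() = %v, want %v", got, want)
	}
}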

Contributor

Yeah, we're aligned, do we have a UT case for that example scenario?

Collaborator

I'm not sure I get the difference between the order of operations in Delete(), we're not planning for any concurrency here for now, right?

(I do get the need for iterating over Deleted and deleting from merged here, but not sure how it relates to Jack's comment 😅)

Contributor

The reason to invert the order in Delete() (actually delete first, take an accounting of that deletion second) is so that we don't have to double-check that the data is really deleted before composing the merge from the data.
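
Concretely, the suggested inversion would look something like this (a sketch based on the Delete() snippet quoted above):

func (p *patch[K, V]) Delete(key K) {
	// Remove any pending modification first, then record the deletion.
	delete(p.Modified, key)
	p.Deleted[key] = true
}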

I agree with @towca's other comment that this is a great addition, and was really fun to review. I did have to try really hard to find something to criticize! 😀😂

Contributor Author

At least at this point of time I don't see Snapshot being used concurrently anywhere - and if it will be, I think DRA Snapshot would be the only blocker there, so let's keep it as is at least for now, WDYT?

I agree with @towca's other comment that this is a great addition, and was really fun to review. I did have to try really hard to find something to criticize! 😀😂

I can't remember the last time I got to implement something this technical and fun, so it's coming from both sides :P

Collaborator

At least at this point of time I don't see Snapshot being used concurrently anywhere - and if it will be, I think DRA Snapshot would be the only blocker there, so let's keep it as is at least for now, WDYT?

Yup, the current version looks good to me! I was just curious about the comment.

@mtrqq mtrqq force-pushed the dra-shapshot-patch branch 2 times, most recently from 1516669 to 58fe449 Compare May 10, 2025 07:17
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 10, 2025
@mtrqq mtrqq force-pushed the dra-shapshot-patch branch from 58fe449 to a1650f9 Compare May 10, 2025 07:21
@mtrqq mtrqq requested a review from jackfrancis May 10, 2025 07:36
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2025
Collaborator

@towca towca left a comment

Reviewed everything but the tests (I'll do that in a second pass). In general the PR looks very good, the only bigger comments are about the caching strategy.

Two general comments:

  1. The patchSet is a very elegant solution, I'm a big fan! (also yay generics, this would be a nightmare otherwise..)
  2. (for future reference) Breaking the PR down into smaller, meaningful commits would have seriously helped the review here. For example a structure like: change Snapshot to be a pointer (this just pollutes the diff) -> introduce patch&patchset -> add caching to patchset -> use patchset in Snapshot -> use the new dra.Snapshot in ClusterSnapshot implementations -> add benchmark.

@@ -22,18 +22,16 @@ import (
resourceapi "k8s.io/api/resource/v1beta1"
)

type snapshotClassLister Snapshot
type snapshotClassLister struct {
Collaborator

Why change this (here and for the other subtypes)? IMO the previous pattern was more readable - no need for redundant methods on Snapshot.

Contributor Author

I don't have a specific preference tbh, but the first-iteration implementation was way harder in terms of interaction with patchSet, and I wanted to keep all the interaction with it inside the Snapshot itself. While this introduces some duplication, it lets me think of these wrapper objects as just wrappers without any logic apart from Snapshot function calls.

If you want, I can change it back - right now it doesn't make much of a difference from my point of view.

Collaborator

I don't have a specific preference tbh, but first iteration implementation was way harder in terms of interaction with patchSet

Interesting, I'm really curious why?

If you want - I may change it back, right now it doesn't make a lot of difference from my point of view

I think right now it only introduces some code duplication for no clear reason. I'd prefer to revert back to the previous approach unless it's less extensible for the future or something (I still don't understand the limitations here).

// AsMap merges all patches into a single map representing the current effective state.
// It iterates through all patches from bottom to top, applying modifications and deletions.
// The cache is populated with the results during this process.
func (ps *patchSet[K, V]) AsMap() map[K]V {
Collaborator

Have you thought about caching the LIST responses? The DRA scheduler plugin only GETs claims, the rest is listed (deviceclasses - which CA doesn't affect so caching would be very effective, resourceclaims filtered to allocated devices, resourceslices). Not sure how frequently each of the operations are done though.

Contributor Author

Fair call. From the benchmarks conducted, the piece that takes most of the runtime in fetch operations is actually the Get operations for ResourceClaims and ResourceSlices made as part of WrapSchedulerNodeInfo, but adding caching for listing also makes a lot of sense.

Added a cacheInSync flag which denotes whether the cache within patchSet is up-to-date: if so, AsMap will simply take the data out of there, otherwise it will build the map from scratch.

I've been thinking about adding this to Snapshot itself, but it wouldn't be as easy as the few-line modification in patchSet, which also gives a slight performance gain to all the patchSets in the Snapshot.
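
A minimal sketch of that caching strategy, assuming the patchSet additionally carries cache map[K]V and cacheInSync bool fields (names taken from the comment above; the final code may differ):

// AsMap returns the current effective state, serving it from the cache when possible.
func (ps *patchSet[K, V]) AsMap() map[K]V {
	if ps.cacheInSync {
		return ps.cache
	}
	merged := make(map[K]V)
	for _, patch := range ps.patches {
		for key, value := range patch.Modified {
			merged[key] = value
		}
		for key := range patch.Deleted {
			delete(merged, key)
		}
	}
	ps.cache = merged
	ps.cacheInSync = true
	return merged
}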

Collaborator

Ah yeah, the CA side will definitely be getting a lot - makes sense you focused on that. The list caching is a great addition though, thanks!!

@mtrqq mtrqq force-pushed the dra-shapshot-patch branch 2 times, most recently from bceda7e to 11b7fd2 Compare May 20, 2025 09:07
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2025
mtrqq added 2 commits May 20, 2025 09:28
Instead of exposing a deep-cloning API, dynamicresources.Snapshot now exposes Fork/Commit/Revert methods mimicking the operations in ClusterSnapshot. Instead of storing full-scale copies of the DRA snapshot, we now store a single object with a list of patches inside, which allows very efficient Fork/Commit/Revert operations at the cost of some performance and memory allocation during fetch requests and in-place object modifications (ResourceClaims).
@mtrqq mtrqq force-pushed the dra-shapshot-patch branch from 11b7fd2 to c1148f4 Compare May 20, 2025 09:29
@mtrqq mtrqq requested a review from towca May 20, 2025 09:48
Collaborator

@towca towca left a comment

Some last comments to the implementation but generally LGTM! Took a look at the tests this time, got some more comments there unfortunately 😅

currentPatch := ps.patches[len(ps.patches)-1]
ps.patches = ps.patches[:len(ps.patches)-1]

for key := range currentPatch.Modified {
Collaborator

Thanks for the added section, it's really helpful! Could you also add a mention in the DeltaSnapshotStore comment that the complexities in the underlying DRA snapshot are potentially different (but still optimized for typical CA usage) and listed here?

I agree that additional map deletions in Revert() seem like a fair trade-off for not dropping the whole cache, especially if the benchmarks confirm this. With the addition of list caching this seems like a fairly optimal strategy overall, thank you!
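
For illustration, the trade-off discussed here could look roughly like the sketch below, again assuming cache/cacheInSync fields on patchSet; refreshCachedKey is a hypothetical helper, not the PR's exact code:

// Revert drops the topmost patch and repairs only the cache entries that patch touched,
// instead of invalidating the whole cache.
func (ps *patchSet[K, V]) Revert() {
	if len(ps.patches) <= 1 {
		return
	}
	currentPatch := ps.patches[len(ps.patches)-1]
	ps.patches = ps.patches[:len(ps.patches)-1]

	for key := range currentPatch.Modified {
		ps.refreshCachedKey(key)
	}
	for key := range currentPatch.Deleted {
		ps.refreshCachedKey(key)
	}
}

// refreshCachedKey recomputes the effective value for key from the remaining patches,
// walking from the top layer down, and updates or removes the cached entry accordingly.
func (ps *patchSet[K, V]) refreshCachedKey(key K) {
	for i := len(ps.patches) - 1; i >= 0; i-- {
		if value, found := ps.patches[i].Modified[key]; found {
			ps.cache[key] = value
			return
		}
		if ps.patches[i].Deleted[key] {
			break
		}
	}
	delete(ps.cache, key)
}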


previousPatch := ps.patches[len(ps.patches)-2]
mergedPatch := mergePatchesInPlace(previousPatch, currentPatch)
ps.patches = ps.patches[:len(ps.patches)-1]
ps.patches[len(ps.patches)-1] = mergedPatch
Collaborator

It is a no-op, but I wanted to be extra explicit about what's going on there.

Definitely get that, but IMO the helper function having InPlace in its name is plenty explicit already. At least for me, redundancy like this would actually make me dig deeper into the code. Reading this, I'd immediately be wondering why we need to update the map if the merge is supposed to be in place - and probably inspect mergePatchesInPlace to check if the name is misleading 😅
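
For context, a merge like the one referenced here could look roughly as follows (a sketch of the semantics, not necessarily the PR's exact implementation):

// mergePatchesInPlace folds the upper patch into the lower one and returns the lower
// patch, so Commit can collapse the top two layers without copying the whole state.
func mergePatchesInPlace[K comparable, V any](lower, upper *patch[K, V]) *patch[K, V] {
	for key, value := range upper.Modified {
		lower.Modified[key] = value
		delete(lower.Deleted, key)
	}
	for key := range upper.Deleted {
		delete(lower.Modified, key)
		lower.Deleted[key] = true
	}
	return lower
}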



for _, request := range claim.Spec.Devices.Requests {
for devicesRequired := request.Count; devicesRequired > 0; devicesRequired-- {

for sliceIndex < len(slices) && deviceIndex >= len(slices[sliceIndex].Spec.Devices) {
Collaborator

This is very hard to understand 😅 Why the for, shouldn't this be an if?

// - The number of snapshot operations (Fork, Commit, Revert) performed before/after scheduling.
//
// For each configuration and snapshot type, the benchmark performs the following steps:
// 1. Initializes a cluster snapshot with a predefined set of nodes, ResourceSlices, DeviceClasses, and pre-allocated ResourceClaims (both shared and potentially pod-owned).
Collaborator

This isn't quite the typical way CA uses the snapshot. The typical scale-up scenario I see would be:

  1. Initialize the snapshot with some initial state, define a number of pending Pods to process.
  2. Iterate through a configurable number of NodeGroups.
  3. For each NodeGroup: Fork(), iterate through a configurable number of Nodes, Revert().
  4. For each Node: add the Node to the snapshot, schedule the next N pending Pods to the Node.

And we could parameterize the number of NodeGroups, the number of pending Pods, the number of Nodes per Nodegroup, how many Pods fit on a single Node, how many claims referenced per Pod.

A typical scale-down scenario would be:

  1. Initialize the snapshot with some initial state.
  2. Iterate through a configurable number of Nodes.
  3. For each Node: Fork(); for every scheduled Pod, unschedule the Pod and schedule it on a different Node, remove the Node from the snapshot. Either Revert() or Commit() afterwards.

We could parameterize the number of Nodes, the number of Pods per Node, and the ratio between Revert and Commit.

I feel like testing these scenarios would allow us to measure the performance more accurately. WDYT?
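
A skeleton of the scale-up scenario sketched above could look something like this; the snapshot interface here is stripped down for illustration, and the real ClusterSnapshot API and method names differ:

package snapshot

import (
	"fmt"
	"testing"
)

// scaleUpSnapshot is a minimal stand-in for the snapshot operations the scenario needs.
type scaleUpSnapshot interface {
	Fork()
	Revert()
	AddNode(nodeName string) error
	SchedulePod(podName, nodeName string) error
}

func benchmarkScaleUp(b *testing.B, snap scaleUpSnapshot, nodeGroups, nodesPerGroup, podsPerNode int) {
	for i := 0; i < b.N; i++ {
		pod := 0
		// Iterate through node groups, simulating a scale-up attempt for each.
		for ng := 0; ng < nodeGroups; ng++ {
			snap.Fork()
			for n := 0; n < nodesPerGroup; n++ {
				nodeName := fmt.Sprintf("ng-%d-node-%d", ng, n)
				if err := snap.AddNode(nodeName); err != nil {
					b.Fatal(err)
				}
				// Schedule the next batch of pending pods onto the new node.
				for p := 0; p < podsPerNode; p++ {
					if err := snap.SchedulePod(fmt.Sprintf("pod-%d", pod), nodeName); err != nil {
						b.Fatal(err)
					}
					pod++
				}
			}
			snap.Revert()
		}
	}
}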

)

ownedClaim = drautils.TestClaimWithPodOwnership(pod, ownedClaim)
ownedClaim, satisfied := allocateResourceSlicesForClaim(ownedClaim, nodeName, nodeSlice)
Collaborator

I don't think we should be benchmarking on pre-allocated claims. In a typical scenario, at least most of the pod owned claims should be unallocated and require the DRA scheduler plugin to compute the allocation.


package snapshot

// patch represents a single layer of modifications (additions/updates)
Collaborator

I'm slightly worried about not having unit tests for patch and patchSet. The new Snapshot tests cover most of this logic but I'm not sure this is enough. @jackfrancis WDYT?

Contributor

This should definitely have UT given how fundamental it is. I think a good model is the two _test.go files here:

In fact (assuming no equivalent exists in apimachinery already) I would say we can matriculate this code to apimachinery at some point; it has general-purpose value.

deviceIndex = 0
}

if sliceIndex >= len(slices) {
Contributor

How could sliceIndex ever be greater than or equal to len(slices) if we only ever increment it (starting from zero) if sliceIndex is less than len(slices) (L97) ?

Or maybe I'm running into @towca's observation that this is a tricky nested boolean for loop inside two enumeration for loops inside a named for loop. :)

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot
4 participants