
Patches based implementation for DRA snapshot. #8090


Open

wants to merge 2 commits into master

Conversation

mtrqq
Contributor

@mtrqq mtrqq commented May 5, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR improves the performance of the DRA snapshot, which directly impacts scheduling simulation speed and the cluster-autoscaler decision-making process overall. It changes the snapshot's state management from a deep-copy-based approach to a patch-based one.
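
At a high level the idea looks roughly like the sketch below (simplified; the field and method names in the actual change may differ):

package snapshot

// patch is a single layer of changes applied on top of the layers below it.
type patch[K comparable, V any] struct {
	Modified map[K]V
	Deleted  map[K]bool
}

// patchSet stores a stack of patches instead of full deep copies of the state.
type patchSet[K comparable, V any] struct {
	patches []*patch[K, V]
}

// Fork pushes a new empty patch; every write after this point lands only in that layer,
// so Revert can simply drop the top layer and Commit can merge it into the one below.
func (ps *patchSet[K, V]) Fork() {
	ps.patches = append(ps.patches, &patch[K, V]{
		Modified: map[K]V{},
		Deleted:  map[K]bool{},
	})
}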

Which issue(s) this PR fixes:

Fixes #7681

Special notes for your reviewer:

This PR removes the original dynamicresources.Snapshot implementation and replaces it with the patch-based approach. We could keep the original implementation for safety and for the ability to switch store implementations in a running cluster-autoscaler, but that would require maintaining two implementations. I attempted to use the clone-based Snapshot as a baseline for the new changes, but it only resulted in complex code while yielding minimal benefit.

The change includes a benchmark test which uses an exaggerated scheduling scenario to compare the performance of the two implementations. The patch-based option is roughly 50x faster in overall runtime while allocating 40x less memory on the heap, primarily because Fork/Commit/Revert operations are used very heavily in the suite.

Here are a few profiling insights into the differences:

CPU Profile / Forking

CPU Profile / GC

Memory Profile / Allocated Space

Memory Profile / Allocated Objects

Grab a copy of profiling samples -> Profiles.zip

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler labels May 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 5, 2025
@mtrqq mtrqq marked this pull request as ready for review May 5, 2025 11:46
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 5, 2025 11:46
@mtrqq
Contributor Author

mtrqq commented May 5, 2025

/assign towca

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mtrqq
Once this PR has been reviewed and has the lgtm label, please ask for approval from towca. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
merged[key] = value
}

for key := range patch.Deleted {
Contributor

Is this meant to address map access race conditions between:

  1. enumerate through the set of patches and begin copying into a new, merged map
  2. one or more patches in the set begin a delete operation
func (p *patch[K, V]) Delete(key K) {
	p.Deleted[key] = true
	delete(p.Modified, key)
}
  1. ensure that our source patch wasn't copied from a state in between the 1st of the above 2 statements by double-checking the same key against any corresponding existence in the Deleted map

?

(If so have we considered the tradeoffs of inverting the order of operations in the Delete() method?)

Contributor Author

I would say that Snapshot is not thread-safe in general, but that's a fair call - I'll change the order of operations in the Delete() method so that it becomes consistent with the other data manipulation functions. If we need it to be truly suitable for concurrent usage (and based on the simulations code, we don't), we'd probably need to use sync.Map to actually store the data; in the current implementation we use bare maps just for the sake of the performance gain.

The reason we account for Deleted keys here is to handle situations like the following:

Prerequisite: Type in the example -> PatchSet[int, int]

Patch#1: Modified: {1: 1, 2: 2, 3: 3}, Deleted: {}
Patch#2: Modified: {}, Deleted: {1, 2}
Patch#3: Modified: {1: 5}, Deleted: {}

The result of the AsMap() call on the PatchSet holding these 3 patches should be: {1: 5, 3: 3}, because keys 1 and 2 are getting deleted in the second patch, but key 1 is getting reintroduced in Patch#3

If I misunderstood your comment - LMK
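
For reference, the scenario above maps onto a unit test along these lines (a sketch only; newPatchSet and Set are assumed names, not necessarily the ones in this PR):

package snapshot

import (
	"reflect"
	"testing"
)

// Sketch of a test for the delete-then-reintroduce scenario described above.
func TestAsMapDeleteThenReintroduce(t *testing.T) {
	ps := newPatchSet[int, int]()
	ps.Set(1, 1)
	ps.Set(2, 2)
	ps.Set(3, 3)

	ps.Fork()
	ps.Delete(1)
	ps.Delete(2)

	ps.Fork()
	ps.Set(1, 5)

	want := map[int]int{1: 5, 3: 3}
	if got := ps.AsMap(); !reflect.DeepEqual(got, want) {
		t.Errorf("AsMap() = %v, want %v", got, want)
	}
}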

Contributor

Yeah, we're aligned, do we have a UT case for that example scenario?

Collaborator

I'm not sure I get the difference between the order of operations in Delete(), we're not planning for any concurrency here for now, right?

(I do get the need for iterating over Deleted and deleting from merged here, but not sure how it relates to Jack's comment 😅)

Contributor

The reason to invert the order in Delete() (actually delete first, take an accounting of that deletion second) is so that we don't have to double-check that the data is really deleted before composing the merge from the data.
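
Concretely, the suggested inversion would look something like this (a sketch based on the Delete() snippet quoted above):

func (p *patch[K, V]) Delete(key K) {
	// Remove any pending modification first, then record the deletion.
	delete(p.Modified, key)
	p.Deleted[key] = true
}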

I agree with @towca's other comment that this is a great addition, and was really fun to review. I did have to try really hard to find something to criticize! 😀😂

Contributor Author

At least at this point of time I don't see Snapshot being used concurrently anywhere - and if it will be, I think DRA Snapshot would be the only blocker there, so let's keep it as is at least for now, WDYT?

I agree with @towca's other comment that this is a great addition, and was really fun to review. I did have to try really hard to find something to criticize! 😀😂

I can't remember the last time I got to implement something this technical and fun, so it's coming from both sides :P

Collaborator

At least at this point of time I don't see Snapshot being used concurrently anywhere - and if it will be, I think DRA Snapshot would be the only blocker there, so let's keep it as is at least for now, WDYT?

Yup, the current version looks good to me! I was just curious about the comment.

@mtrqq mtrqq force-pushed the dra-shapshot-patch branch 2 times, most recently from 1516669 to 58fe449 Compare May 10, 2025 07:17
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 10, 2025
@mtrqq mtrqq force-pushed the dra-shapshot-patch branch from 58fe449 to a1650f9 Compare May 10, 2025 07:21
@mtrqq mtrqq requested a review from jackfrancis May 10, 2025 07:36
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2025
Collaborator

@towca towca left a comment

Reviewed everything but the tests (I'll do that in a second pass). In general the PR looks very good, the only bigger comments are about the caching strategy.

Two general comments:

  1. The patchSet is a very elegant solution, I'm a big fan! (also yay generics, this would be a nightmare otherwise..)
  2. (for future reference) Breaking the PR down into smaller, meaningful commits would have seriously helped the review here. For example a structure like: change Snapshot to be a pointer (this just pollutes the diff) -> introduce patch&patchset -> add caching to patchset -> use patchset in Snapshot -> use the new dra.Snapshot in ClusterSnapshot implementations -> add benchmark.

@@ -22,18 +22,16 @@ import (
resourceapi "k8s.io/api/resource/v1beta1"
)

type snapshotClassLister Snapshot
type snapshotClassLister struct {
Collaborator

Why change this (here and for the other subtypes)? IMO the previous pattern was more readable - no need for redundant methods on Snapshot.

Contributor Author

I don't have a specific preference tbh, but the first-iteration implementation was way harder in terms of interaction with patchSet, and I wanted to keep all the interaction with it inside the Snapshot itself. While this introduces some duplication, it lets me think of these wrapper objects as just wrappers without any logic apart from Snapshot function calls.

If you want, I can change it back - right now it doesn't make much of a difference from my point of view.

Collaborator

I don't have a specific preference tbh, but first iteration implementation was way harder in terms of interaction with patchSet

Interesting, I'm really curious why?

If you want - I may change it back, right now it doesn't make a lot of difference from my point of view

I think right now it only introduces some code duplication for no clear reason. I'd prefer to revert back to the previous approach unless it's less extensible for the future or something (I still don't understand the limitations here).

// AsMap merges all patches into a single map representing the current effective state.
// It iterates through all patches from bottom to top, applying modifications and deletions.
// The cache is populated with the results during this process.
func (ps *patchSet[K, V]) AsMap() map[K]V {
Collaborator

Have you thought about caching the LIST responses? The DRA scheduler plugin only GETs claims, the rest is listed (deviceclasses - which CA doesn't affect so caching would be very effective, resourceclaims filtered to allocated devices, resourceslices). Not sure how frequently each of the operations are done though.

Contributor Author

Fair call. From the benchmarks conducted, the piece that takes most of the runtime in fetch operations is actually the Get operations for ResourceClaims and ResourceSlices made as part of WrapSchedulerNodeInfo, but adding caching for listing also makes a lot of sense.

Added a cacheInSync flag which denotes whether the cache within patchSet is up-to-date: if so, AsMap will simply take the data out of there, otherwise it will build the map from scratch.

I've been thinking about adding this to Snapshot itself, but it wouldn't be as easy as the few-line modification in patchSet, which also gives a slight performance gain to all the patchSets in the Snapshot.
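
A minimal sketch of that caching strategy, assuming the patchSet additionally carries cache map[K]V and cacheInSync bool fields (names taken from the comment above; the final code may differ):

// AsMap returns the current effective state, serving it from the cache when possible.
func (ps *patchSet[K, V]) AsMap() map[K]V {
	if ps.cacheInSync {
		return ps.cache
	}
	merged := make(map[K]V)
	for _, patch := range ps.patches {
		for key, value := range patch.Modified {
			merged[key] = value
		}
		for key := range patch.Deleted {
			delete(merged, key)
		}
	}
	ps.cache = merged
	ps.cacheInSync = true
	return merged
}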

Collaborator

Ah yeah, the CA side will definitely be getting a lot - makes sense you focused on that. The list caching is a great addition though, thanks!!

@mtrqq mtrqq force-pushed the dra-shapshot-patch branch 2 times, most recently from bceda7e to 11b7fd2 Compare May 20, 2025 09:07
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2025
mtrqq added 2 commits May 20, 2025 09:28
Instead of exposing a deep-cloning API, dynamicresources.Snapshot now exposes Fork/Commit/Revert methods mimicking the operations in ClusterSnapshot. Instead of storing full-scale copies of the DRA snapshot, we now store a single object with a list of patches inside, which allows very efficient Fork/Commit/Revert operations at the cost of some performance and memory allocation during fetch requests and in-place object modifications (ResourceClaims).
@mtrqq mtrqq force-pushed the dra-shapshot-patch branch from 11b7fd2 to c1148f4 Compare May 20, 2025 09:29
@mtrqq mtrqq requested a review from towca May 20, 2025 09:48
Collaborator

@towca towca left a comment

Some last comments to the implementation but generally LGTM! Took a look at the tests this time, got some more comments there unfortunately 😅

currentPatch := ps.patches[len(ps.patches)-1]
ps.patches = ps.patches[:len(ps.patches)-1]

for key := range currentPatch.Modified {
Collaborator

Thanks for the added section, it's really helpful! Could you also add a mention in the DeltaSnapshotStore comment that the complexities in the underlying DRA snapshot are potentially different (but still optimized for typical CA usage) and listed here?

I agree that additional map deletions in Revert() seem like a fair trade-off for not dropping the whole cache, especially if the benchmarks confirm this. With the addition of list caching this seems like a fairly optimal strategy overall, thank you!
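
For illustration, the trade-off discussed here could look roughly like the sketch below, again assuming cache/cacheInSync fields on patchSet; refreshCachedKey is a hypothetical helper, not the PR's exact code:

// Revert drops the topmost patch and repairs only the cache entries that patch touched,
// instead of invalidating the whole cache.
func (ps *patchSet[K, V]) Revert() {
	if len(ps.patches) <= 1 {
		return
	}
	currentPatch := ps.patches[len(ps.patches)-1]
	ps.patches = ps.patches[:len(ps.patches)-1]

	for key := range currentPatch.Modified {
		ps.refreshCachedKey(key)
	}
	for key := range currentPatch.Deleted {
		ps.refreshCachedKey(key)
	}
}

// refreshCachedKey recomputes the effective value for key from the remaining patches,
// walking from the top layer down, and updates or removes the cached entry accordingly.
func (ps *patchSet[K, V]) refreshCachedKey(key K) {
	for i := len(ps.patches) - 1; i >= 0; i-- {
		if value, found := ps.patches[i].Modified[key]; found {
			ps.cache[key] = value
			return
		}
		if ps.patches[i].Deleted[key] {
			break
		}
	}
	delete(ps.cache, key)
}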


previousPatch := ps.patches[len(ps.patches)-2]
mergedPatch := mergePatchesInPlace(previousPatch, currentPatch)
ps.patches = ps.patches[:len(ps.patches)-1]
ps.patches[len(ps.patches)-1] = mergedPatch
Collaborator

It is a no-op, but I wanted to be extra explicit about what's going on there.

Definitely get that, but IMO the helper function having InPlace in its name is plenty explicit already. At least for me, redundancy like this would actually make me dig deeper into the code. Reading this, I'd immediately be wondering why we need to update the map if the merge is supposed to be in place - and probably inspect mergePatchesInPlace to check if the name is misleading 😅
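
For context, a merge like the one referenced here could look roughly as follows (a sketch of the semantics, not necessarily the PR's exact implementation):

// mergePatchesInPlace folds the upper patch into the lower one and returns the lower
// patch, so Commit can collapse the top two layers without copying the whole state.
func mergePatchesInPlace[K comparable, V any](lower, upper *patch[K, V]) *patch[K, V] {
	for key, value := range upper.Modified {
		lower.Modified[key] = value
		delete(lower.Deleted, key)
	}
	for key := range upper.Deleted {
		delete(lower.Modified, key)
		lower.Deleted[key] = true
	}
	return lower
}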



for _, request := range claim.Spec.Devices.Requests {
for devicesRequired := request.Count; devicesRequired > 0; devicesRequired-- {

for sliceIndex < len(slices) && deviceIndex >= len(slices[sliceIndex].Spec.Devices) {
Collaborator

This is very hard to understand 😅 Why the for, shouldn't this be an if?

// - The number of snapshot operations (Fork, Commit, Revert) performed before/after scheduling.
//
// For each configuration and snapshot type, the benchmark performs the following steps:
// 1. Initializes a cluster snapshot with a predefined set of nodes, ResourceSlices, DeviceClasses, and pre-allocated ResourceClaims (both shared and potentially pod-owned).
Collaborator

This isn't quite the typical way CA uses the snapshot. The typical scale-up scenario I see would be:

  1. Initialize the snapshot with some initial state, define a number of pending Pods to process.
  2. Iterate through a configurable number of NodeGroups.
  3. For each NodeGroup: Fork(), iterate through a configurable number of Nodes, Revert().
  4. For each Node: add the Node to the snapshot, schedule the next N pending Pods to the Node.

And we could parameterize the number of NodeGroups, the number of pending Pods, the number of Nodes per Nodegroup, how many Pods fit on a single Node, how many claims referenced per Pod.

A typical scale-down scenario would be:

  1. Initialize the snapshot with some initial state.
  2. Iterate through a configurable number of Nodes.
  3. For each Node: Fork(); for every scheduled Pod, unschedule the Pod and schedule it on a different Node, remove the Node from the snapshot. Either Revert() or Commit() afterwards.

We could parameterize the number of Nodes, the number of Pods per Node, and the ratio between Revert and Commit.

I feel like testing these scenarios would allow us to measure the performance more accurately. WDYT?
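
A skeleton of the scale-up scenario sketched above could look something like this; the snapshot interface here is stripped down for illustration, and the real ClusterSnapshot API and method names differ:

package snapshot

import (
	"fmt"
	"testing"
)

// scaleUpSnapshot is a minimal stand-in for the snapshot operations the scenario needs.
type scaleUpSnapshot interface {
	Fork()
	Revert()
	AddNode(nodeName string) error
	SchedulePod(podName, nodeName string) error
}

func benchmarkScaleUp(b *testing.B, snap scaleUpSnapshot, nodeGroups, nodesPerGroup, podsPerNode int) {
	for i := 0; i < b.N; i++ {
		pod := 0
		// Iterate through node groups, simulating a scale-up attempt for each.
		for ng := 0; ng < nodeGroups; ng++ {
			snap.Fork()
			for n := 0; n < nodesPerGroup; n++ {
				nodeName := fmt.Sprintf("ng-%d-node-%d", ng, n)
				if err := snap.AddNode(nodeName); err != nil {
					b.Fatal(err)
				}
				// Schedule the next batch of pending pods onto the new node.
				for p := 0; p < podsPerNode; p++ {
					if err := snap.SchedulePod(fmt.Sprintf("pod-%d", pod), nodeName); err != nil {
						b.Fatal(err)
					}
					pod++
				}
			}
			snap.Revert()
		}
	}
}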

)

ownedClaim = drautils.TestClaimWithPodOwnership(pod, ownedClaim)
ownedClaim, satisfied := allocateResourceSlicesForClaim(ownedClaim, nodeName, nodeSlice)
Collaborator

I don't think we should be benchmarking on pre-allocated claims. In a typical scenario, at least most of the pod owned claims should be unallocated and require the DRA scheduler plugin to compute the allocation.


package snapshot

// patch represents a single layer of modifications (additions/updates)
Collaborator

I'm slightly worried about not having unit tests for patch and patchSet. The new Snapshot tests cover most of this logic but I'm not sure this is enough. @jackfrancis WDYT?

Contributor

This should definitely have UT given how fundamental it is. I think a good model is the two _test.go files here:

In fact (assuming no equivalent exists in apimachinery already) I would say we can matriculate this code to apimachinery at some point; it has general-purpose value.

deviceIndex = 0
}

if sliceIndex >= len(slices) {
Contributor

How could sliceIndex ever be greater than or equal to len(slices) if we only ever increment it (starting from zero) if sliceIndex is less than len(slices) (L97) ?

Or maybe I'm running into @towca's observation that this is a tricky nested boolean for loop inside two enumeration for loops inside a named for loop. :)

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot
4 participants