vpa-recommender can issue 0 as memory recommendation after restart #7726

Open
plkokanov opened this issue Jan 20, 2025 · 6 comments
Labels: area/vertical-pod-autoscaler, kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@plkokanov

Which component are you using?:

/area vertical-pod-autoscaler

What version of the component are you using?:

Component version: vpa-recommender 1.2.1

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version

Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5

What environment is this in?:

Not relevant.

What did you expect to happen?:

vpa-recommender to always issue correct memory recommendations.

What happened instead?:
After a restart, vpa-recommender can issue very low (equal to minAllowed) memory recommendations, resulting in constant OOMKills. Once such an incorrect recommendation occurs, vpa-recommender does not raise the memory recommendation again, and this can persist even across additional restarts of vpa-recommender. This seems to happen only when using VerticalPodAutoscalerCheckpoints to work with historical histogram data.

In more detail, vpa-recommender computes a memory recommendation of 0, which is then capped to the configured minimum allowed resources. The 0 memory recommendation comes from the histogram.Percentile(...) function returning 0. This happens because the histogram.IsEmpty() function returns true, even though there are buckets in the histogram with weights larger than epsilon (0.0001).
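For context, histogram.IsEmpty() only inspects the weight of the tracked minimum bucket. The following standalone Go sketch (a simplified stand-in for the real histogram type, not the upstream code) shows how a single stale sub-epsilon weight in that bucket makes the whole histogram look empty, so that Percentile() short-circuits to 0:

```go
package main

import "fmt"

const epsilon = 0.0001

// Simplified stand-in for the recommender's histogram: the emptiness check
// only looks at the weight of the cached minimum bucket.
type histogram struct {
	bucketWeight map[int]float64
	minBucket    int // not re-derived when weights shrink in place
}

func (h *histogram) IsEmpty() bool {
	return h.bucketWeight[h.minBucket] < epsilon
}

func main() {
	h := &histogram{
		// Weights as in the logged histogram below: bucket 1 is just under
		// epsilon, while buckets 2-4 carry almost all of the weight.
		bucketWeight: map[int]float64{
			1: 9.994727728962825e-05, 2: 1.5698, 3: 0.3796, 4: 0.1743,
		},
		minBucket: 1,
	}
	// Prints true, so Percentile() returns 0 and the recommendation is
	// capped to minAllowed.
	fmt.Println(h.IsEmpty())
}
```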

To investigate this, we used a forked version of vpa-recommender with increased logging verbosity that also displays data from the histogram's buckets before they are saved in a VPACheckpoint.
Here are the logs we got when this issue occurred:

{"log":"Computed \"memory\" resources for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" were 0 in percentile estimator of \"targetEstimator\"!","pid":"1","severity":"WARN","source":"estimator.go:133"}
2024-12-17 06:41:24	
{"log":"Computed \"memory\" resources for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" were 0 after applying margin in margin estimator of \"targetEstimator\", they were 0 before that","pid":"1","severity":"WARN","source":"estimator.go:194"}
2024-12-17 06:41:24	
{"log":"Computed \"memory\" resources for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" were below minimum in min resource estimator of \"targetEstimator\"! Computed 0, minimum is 10485760.","pid":"1","severity":"WARN","source":"estimator.go:212"}
2024-12-17 06:41:24	
{"log":"The memory histogram for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" is empty!","pid":"1","severity":"WARN","source":"estimator.go:240"}
2024-12-17 06:41:24	
{"log":"Here's the string representation of the memory histogram for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\": referenceTimestamp: 2024-12-18 00:00:00 +0000 UTC, halfLife: 24h0m0s; minBucket: 1, maxBucket: 38, totalWeight: 9.050249754647215; %-tile value: ; 0: 0; 5: 0; 10: 0; 15: 0; 20: 0; 25: 0; 30: 0; 35: 0; 40: 0; 45: 0; 50: 0; 55: 0; 60: 0; 65: 0; 70: 0; 75: 0; 80: 0; 85: 0; 90: 0; 95: 0; 100: 0; buckets value; 0: 0; 1: 9.994727728962825e-05; 2: 1.5697793785237693; 3: 0.37959975914600813; 4: 0.17430805159311166; 5: 0; 6: 0; 7: 0; 8: 0; 9: 0; 10: 0; 11: 0; 12: 0; 13: 0; 14: 0; 15: 0; 16: 0; 17: 0; 18: 0.19719597809243655; 19: 0; 20: 0; 21: 0; 22: 0.001599156436634052; 23: 0.5493102359837969; 24: 0.6782422236874174; 25: 9.994727728962825e-05; 26: 0.27465511799189846; 27: 0.9994727728962826; 28: 0.6677477595720064; 29: 0.027685395809227027; 30: 0.9786837392200398; 31: 1.21727424476889; 32: 0.6442601494089437; 33: 0.6688471796221923; 34: 0; 35: 0; 36: 0; 37: 0; 38: 0.021388717339980445; 39: 0; 40: 0; 41: 0; 42: 0; 43: 0; 44: 0; 45: 0; 46: 0; 47: 0; 48: 0; 49: 0; 50: 0; 51: 0; 52: 0; 53: 0; 54: 0; 55: 0; 56: 0; 57: 0; 58: 0; 59: 0; 60: 0; 61: 0; 62: 0; 63: 0; 64: 0; 65: 0; 66: 0; 67: 0; 68: 0; 69: 0; 70: 0; 71: 0; 72: 0; 73: 0; 74: 0; 75: 0; 76: 0; 77: 0; 78: 0; 79: 0; 80: 0; 81: 0; 82: 0; 83: 0; 84: 0; 85: 0; 86: 0; 87: 0; 88: 0; 89: 0; 90: 0; 91: 0; 92: 0; 93: 0; 94: 0; 95: 0; 96: 0; 97: 0; 98: 0; 99: 0; 100: 0; 101: 0; 102: 0; 103: 0; 104: 0; 105: 0; 106: 0; 107: 0; 108: 0; 109: 0; 110: 0; 111: 0; 112: 0; 113: 0; 114: 0; 115: 0; 116: 0; 117: 0; 118: 0; 119: 0; 120: 0; 121: 0; 122: 0; 123: 0; 124: 0; 125: 0; 126: 0; 127: 0; 128: 0; 129: 0; 130: 0; 131: 0; 132: 0; 133: 0; 134: 0; 135: 0; 136: 0; 137: 0; 138: 0; 139: 0; 140: 0; 141: 0; 142: 0; 143: 0; 144: 0; 145: 0; 146: 0; 147: 0; 148: 0; 149: 0; 150: 0; 151: 0; 152: 0; 153: 0; 154: 0; 155: 0; 156: 0; 157: 0; 158: 0; 159: 0; 160: 0; 161: 0; 162: 0; 163: 0; 164: 0; 165: 0; 166: 0; 167: 0; 168: 0; 169: 0; 170: 0; 171: 0; 172: 0; 173: 0; 174: 0; 175: 0","pid":"1","severity":"WARN","source":"estimator.go:243"}
2024-12-17 06:41:24	
{"log":"Here's the checkpoint/state for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\": {\"lastUpdateTime\":\"2024-12-17T06:41:24Z\",\"version\":\"v3\",\"cpuHistogram\":{\"referenceTimestamp\":\"2024-12-17T00:00:00Z\",\"bucketWeights\":{\"0\":10000,\"1\":637,\"10\":223,\"11\":202,\"12\":138,\"13\":77,\"14\":86,\"15\":79,\"16\":40,\"17\":42,\"18\":25,\"19\":27,\"2\":3816,\"20\":27,\"21\":27,\"22\":20,\"23\":16,\"24\":5,\"25\":4,\"26\":11,\"27\":13,\"28\":7,\"29\":5,\"3\":2836,\"30\":5,\"31\":1,\"32\":4,\"33\":4,\"34\":1,\"35\":1,\"36\":1,\"37\":4,\"38\":1,\"39\":1,\"4\":717,\"40\":1,\"5\":311,\"6\":183,\"7\":57,\"8\":146,\"9\":183},\"totalWeight\":487.1371144429181},\"memoryHistogram\":{\"referenceTimestamp\":\"2024-12-18T00:00:00Z\",\"bucketWeights\":{\"1\":1,\"18\":1256,\"2\":10000,\"22\":10,\"23\":3499,\"24\":4321,\"25\":1,\"26\":1750,\"27\":6367,\"28\":4254,\"29\":176,\"3\":2418,\"30\":6235,\"31\":7754,\"32\":4104,\"33\":4261,\"38\":136,\"4\":1110},\"totalWeight\":9.050249754647215},\"firstSampleStart\":\"2023-03-16T12:35:34Z\",\"lastSampleStart\":\"2024-12-17T06:40:46Z\",\"totalSamplesCount\":1623603}","pid":"1","severity":"WARN","source":"estimator.go:251"}
2024-12-17 06:42:12

You can see that in the memory histogram, the minimum bucket (bucket 1) has a weight of 9.994727728962825e-05, which is less than 0.0001, and hence histogram.IsEmpty() would return true when called.

Below I try to explain how this can happen:

The main cause of this issue seems to be the transformation of histogram weights from floating point numbers to integer values when saving them to VPACheckpoints, and back from integers to floats when loading them. A weight that was slightly above epsilon, e.g. 0.00013, can get rounded to 1 when saved in the VPACheckpoint. When loaded from the VPACheckpoint, the same weight can end up below epsilon (e.g. 9.994727728962825e-05 as in the example above), because weights are multiplied by different factors when saving and loading and some precision is lost.
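To make this concrete, here is a standalone sketch of the save/load round-trip (MaxCheckpointWeight = 10000 and epsilon = 0.0001 mirror the defaults; the bucket values are made up to reproduce the effect). A weight of 0.00013, which is above epsilon, rounds up to 1 on save and comes back below epsilon on load:

```go
package main

import (
	"fmt"
	"math"
)

const (
	maxCheckpointWeight = 10000 // checkpoint bucket weights are integers in [0, 10000]
	epsilon             = 0.0001
)

func main() {
	// In-memory weights before saving; 0.00013 is above epsilon.
	weights := []float64{0.00013, 0.9, 0.5, 0.3}
	maxWeight, totalWeight := 0.9, 0.0
	for _, w := range weights {
		totalWeight += w
	}

	// Saving: scale so the heaviest bucket maps to maxCheckpointWeight, then round.
	saveRatio := maxCheckpointWeight / maxWeight
	saved := make([]int64, len(weights))
	var sum int64
	for i, w := range weights {
		saved[i] = int64(math.Round(w * saveRatio)) // 0.00013 * 11111.1 ≈ 1.44 -> 1
		sum += saved[i]
	}

	// Loading: distribute totalWeight proportionally to the integer weights.
	loadRatio := totalWeight / float64(sum)
	for i, s := range saved {
		loaded := float64(s) * loadRatio
		fmt.Printf("bucket %d: %.5g -> %d -> %.5g (below epsilon: %v)\n",
			i, weights[i], s, loaded, loaded < epsilon)
	}
	// bucket 0: 0.00013 -> 1 -> ~9e-05, i.e. it crossed below epsilon.
}
```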

Note also that in our clusters we run vpa-recommender under VPA as well (managed by separate VPA components), which means it is restarted roughly every 12 hours to bring its resource requests closer to the recommendations. This restart frequency contributes to the issue.

Here are the steps that need to happen for the issue to occur:

  1. vpa-recommender is restarted
  2. Histograms from VPACheckpoints are loaded and stored in the ContainersInitialAggregateState variable
  3. New metrics are ingested and added to the aggregateStateMap and the aggregateContainerStates for the corresponding VPA
  4. During the recommendation computation, the container states from ContainersInitialAggregateState and aggregateContainerStates are merged. The referenceTimestamp of the histogram in aggregateContainerStates happens to be ahead of the referenceTimestamp in the initial state, so the referenceTimestamp has to be shifted forward and the weights that were loaded from the checkpoint are scaled down. Weights that drop below epsilon are "discarded" via the histogram.updateMinAndMaxBucket() function.
  5. A correct target recommendation is issued
  6. The histograms are saved in the VPACheckpoints with the weight of the min bucket being saved as 1
  7. vpa-recommender is restarted again
  8. Steps 2, 3 and 4 are repeated. The weight of the min bucket is loaded as 9.994727728962825e-05. However, this time the weights are not scaled during the merge, because the referenceTimestamps of the histograms match (the referenceTimestamp was already shifted forward during the previous restart, merge and save). This means that no scale-down, and hence no removal of weights below epsilon, has occurred.
  9. vpa-recommender issues an incorrect recommendation, as the weight of the min bucket is now less than epsilon.

In short: if there is no referenceTimestamp shift, and therefore no scale-down of weights, after they are loaded from a checkpoint, the histogram can be incorrectly detected as empty.
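As a reference for steps 4 and 8: when the reference timestamp of a decaying histogram moves forward, every weight is scaled down by the exponential decay factor; when the timestamps already match, the factor is 1 and nothing is pruned. A minimal sketch of that scaling (the helper name is illustrative), assuming the default 24h half-life seen in the logs above:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Decay half-life of the recommender's histograms (24h0m0s in the logs above).
const halfLife = 24 * time.Hour

// scaleForShift is an illustrative stand-in for the factor applied to every
// bucket weight when the referenceTimestamp is shifted forward by dt.
func scaleForShift(dt time.Duration) float64 {
	return math.Pow(0.5, dt.Hours()/halfLife.Hours())
}

func main() {
	// First restart (step 4): the reference timestamp shifts forward by 24h,
	// so a weight of 0.00018 becomes 0.00009 and is then discarded by
	// updateMinAndMaxBucket() as it is below epsilon.
	fmt.Println(0.00018 * scaleForShift(24*time.Hour)) // 9e-05

	// Second restart (step 8): the reference timestamps already match, the
	// factor is 1, and a sub-epsilon weight loaded from the checkpoint
	// survives - nothing re-derives minBucket.
	fmt.Println(scaleForShift(0)) // 1
}
```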

How to reproduce it (as minimally and precisely as possible):
This issue can only be reproduced if vpa-recommender is restarted twice within 24 hours (more precisely, twice within the decay half-life configured for the histograms).
I wrote this small main function which approximates what VPA does to reproduce the issue: https://gist.github.com/plkokanov/339ddfe33d05cdc028b273bbac5bdb0a

Anything else we need to know?:

So far we've come up with two ways that seem to fix the issue:

  1. Call histogram.updateMinAndMaxBucket() after loading a checkpoint. This seems to work, but there can be some additional precision loss: weights that contributed to the histogram's totalWeight before being saved to and loaded from the VPACheckpoint no longer do so (see the sketch after this list).
  2. Another way I've managed to fix this issue is to ensure that when referenceTimestamps are set, they are set to a time before the timestamp of the added samples. This way (if my calculations are correct), the weight of the most recent sample will never drop below 1 after weights are scaled, which seems to keep the ratios used when saving/loading weights to the checkpoint in a range where the issue doesn't occur.
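For illustration, here is a standalone sketch of what fix 1 amounts to, using the same simplified stand-in histogram as above (this updateMinAndMaxBucket is my own minimal version, not the upstream implementation):

```go
package main

import "fmt"

const epsilon = 0.0001

// Simplified stand-in histogram; minBucket/maxBucket are the cached indices
// that IsEmpty() and Percentile() rely on.
type histogram struct {
	bucketWeight []float64
	minBucket    int
	maxBucket    int
}

// updateMinAndMaxBucket rescans all buckets, skipping weights below epsilon.
// Fix 1 would run this right after LoadFromCheckpoint, so that sub-epsilon
// weights produced by the int -> float round-trip no longer mask the rest
// of the histogram.
func (h *histogram) updateMinAndMaxBucket() {
	h.minBucket, h.maxBucket = len(h.bucketWeight)-1, 0
	for i, w := range h.bucketWeight {
		if w >= epsilon {
			if i < h.minBucket {
				h.minBucket = i
			}
			if i > h.maxBucket {
				h.maxBucket = i
			}
		}
	}
}

func main() {
	// Bucket 1 holds the sub-epsilon weight produced by the round-trip.
	h := &histogram{
		bucketWeight: []float64{0, 9.994727728962825e-05, 1.5698, 0.3796, 0.1743},
		minBucket:    1,
	}
	h.updateMinAndMaxBucket()
	fmt.Println(h.minBucket, h.maxBucket) // 2 4: the stale bucket is skipped
}
```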

/cc @voelzmo and @ialidzhikov as they also took part in the investigation of this issue.

@plkokanov plkokanov added the kind/bug Categorizes issue or PR as related to a bug. label Jan 20, 2025
@adrianmoisey
Member

Would it be possible to write a failing e2e for this?

@raywainman
Contributor

Thanks a lot for the really fantastic write up.

I'm still digging through the code to try and make sense of this but one thing came to mind...

Shouldn't VPA be increasing the memory recommendation when it sees OOM events?

@adrianmoisey
Member

8. Steps 2, 3 and 4 are repeated. The weight of the min bucket is loaded as 9.994727728962825e-05. However, this time the weights are not scaled during the merge, because the referenceTimestamps of the histograms match (the referenceTimestamp was already shifted forward during the previous restart, merge and save). This means that no scale-down, and hence no removal of weights below epsilon, has occurred.

I'm still trying to wrap my head around the issue, so forgive me if this suggestion is wrong.
What if there was a way to force the loaded bucket timestamp to be shifted back? I assume we would need a way to ensure this only happens on startup, and no other time.

@adrianmoisey
Member

  1. Call histogram.updateMinAndMaxBucket() after loading a checkpoint. Seems to be working ok, but there might be some additional precision loss - weights that would contribute to the histogram's totalWeight before being saved and loaded from the VPACheckpoint, no longer do so.

Thinking about it a little more, besides the side effect, this option seems like the cleaner solution.

@raywainman
Contributor

  1. Call histogram.updateMinAndMaxBucket() after loading a checkpoint. Seems to be working ok, but there might be some additional precision loss - weights that would contribute to the histogram's totalWeight before being saved and loaded from the VPACheckpoint, no longer do so.

I'm trying to understand what you mean by "weights that would contribute to the histogram's totalWeight before being saved and loaded from the VPACheckpoint, no longer do so" - why are we losing precision here? Is this for any additional samples that may have come in between the time the checkpoint is loaded and when we hypothetically call updateMinAndMaxBucket()?

@plkokanov
Author

plkokanov commented Jan 29, 2025

Thanks for taking a look into this @adrianmoisey @raywainman!

Would it be possible to write a failing e2e for this?

Hmm, there should be a way to craft something like that, perhaps using a pre-configured histogram or checkpoint from the occurrences I shared in the issue. Let me try to come up with something this week.

Shouldn't VPA be increasing the memory recommendation when it sees OOM events?

That was our initial thought as well; however, the recommendation is still calculated from the merged histogram in step 8. So if the min bucket of that histogram has a weight below epsilon, the histogram.Percentile function still returns 0.

What if there was a way to force the loaded bucket timestamp to be shifted back? I assume we would need a way to ensure this only happens on startup, and no other time.

It could work. The main goal is to have the following sequence every time after histogram.LoadFromCheckpoint(...) is called: shift of the reference timestamp -> scaling of weights -> filtering of bucket weights lower than epsilon.

I'm trying to understand the what you mean by "weights that would contribute to the histogram's totalWeight before being saved and loaded from the VPACheckpoint, no longer do so", why are we losing precision here? Is this for any additional samples that may have come in between the time the checkpoint is loaded and when we hypothetically call updateMinAndMaxBucket()?

What I meant is that before the VPACheckpoint is saved, the histogram can contain multiple buckets with weights just slightly above epsilon, e.g. 0.00013. These weights contribute to the calculation done in the histogram.Percentile(...) function. In the cases where the issue occurs, these weights are multiplied by a factor that rounds them to 1 when saved to the VPACheckpoint. When loaded from the VPACheckpoint, they are multiplied by a different factor and can end up below epsilon, e.g. 0.00009. If we then call updateMinAndMaxBucket(), those weights are dropped. In the end, we lose a weight of 0.00013 that contributed to the percentile estimation before the save and load operations. However, this weight is anyway very close to epsilon, so I don't think it's too big of an issue.

In a perfect fix, the initial weight of 0.00013 would remain as 0.00013 after the save and load operations - but I'm not sure if this is possible with the current algorithm.
