[VPA] Usage of VPA helm chart >2.0.0 leads to missing recommendations #1296

Closed
2 tasks done
Pionerd opened this issue Aug 16, 2023 · 13 comments
Labels
bug (Something isn't working), needs more information (Additional info is needed to assess the issue), stale (Marked as stale by stalebot), triage (This bug needs triage)

Comments

@Pionerd
Contributor

Pionerd commented Aug 16, 2023

What happened?

We upgraded Goldilocks (tried 7.0.0, 7.1.0 and 7.1.1) and the VPA helm chart (2.2.0 and 2.3.0), after which we saw the recommendations intermittently drop to the minimum we set (e.g. 25 MB for memory), and sometimes even lower than that (8.33 MB).

[Screenshot: Grafana graph showing the recommendations dropping to the configured minimum]

Reverting the VPA chart to the latest 1.x.x (1.7.5) seems to undo this behaviour, even though the underlying VPA image version (0.13.0) did not change. We are a bit at a loss here, because the logs are not giving us much helpful information either.

We see a lot of log lines like the one below, but that is still the case after downgrading as well.

I0816 15:52:27.494178       1 request.go:533] Waited for 170.386822ms due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/autoscaling.k8s.io/v1/namespaces/
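As an aside, that log line comes from client-go's client-side rate limiter, which the recommender controls through its --kube-api-qps and --kube-api-burst flags (both appear in the flag dump further down in this thread). If the throttling itself ever needs addressing, a minimal sketch of raising the limits via the chart's extraArgs, following the same pattern as the values shared later and with purely illustrative numbers:

vpa:
  recommender:
    extraArgs:
      # illustrative values only; the flag dump in this thread shows qps 5 / burst 10
      kube-api-qps: 20
      kube-api-burst: 40

This only quiets the throttling messages, though; it is unrelated to the recommendation drops described above.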

Any ideas on how to move forward? We are posting here since the underlying VPA image version did not change.

What did you expect to happen?

Continuous recommendations being shown.

How can we reproduce this?

EKS 1.26

Version

VPA helm chart 2.3.0

Search

  • I did search for other open and closed issues before opening this.

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

No response

@Pionerd added the "bug" and "triage" labels on Aug 16, 2023
@sudermanjr
Member

How are you pulling these metrics into Grafana? Is it possible there's actually just an issue with the metrics reporting rather than the actual VPA recommendation itself? The changes from 1.7.5 to 2.x are almost entirely unrelated to the recommender deployment itself.

@sudermanjr
Member

Additionally, are you using long-term storage with prometheus to feed VPA?

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

We use kube-state-metrics to scrape the VPA recommendations. The values in the Grafana dashboard are the same as when checking using kubectl get vpa.
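For reference, the same status block that kube-state-metrics scrapes can be read straight off the VPA object; a minimal sketch (namespace and VPA name are placeholders):

kubectl get vpa <vpa-name> -n <namespace> \
  -o jsonpath='{.status.recommendation.containerRecommendations}'

This prints the containerRecommendations list (lowerBound, target, uncappedTarget and upperBound per container), which helps confirm whether a drop is in the VPA status itself or only in the metrics pipeline.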

I also cannot understand why this change would lead to this behaviour. Have you seen anything like this before?

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Additionally, are you using long-term storage with prometheus to feed VPA?

Yes, we use Thanos.

@sudermanjr
Member

sudermanjr commented Aug 16, 2023

The only time I've seen erratic recommendations is when I'm not using Prometheus data to feed the recommendations and I don't wait long enough for VPA to generate a good recommendation. Here's a cluster with 53 VPAs, using prometheus data, and the latest chart. (also using kube-state-metrics to poll the VPA data)

[Screenshot (Aug 16, 2023): recommendation graph from a cluster with 53 VPAs using prometheus data]

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Additional remark: we have multiple clients using our setup, and all of the EKS clients (but only the EKS clients) are suffering from this; the AKS customers are not, after the same upgrade.

@sudermanjr
Member

Maybe try turning the log level on the recommender up to 10?

@sudermanjr
Member

I just realized the cluster that I'm showing in that graph above uses the vpa 0.14.0 image. Perhaps there's a bugfix in that version. Worth trying.

It would help if you could share your exact values so I can try to reproduce the issue.

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Relevant parameters:

I0816 16:14:07.067381       1 flags.go:57] FLAG: --add-dir-header="false"
I0816 16:14:07.067486       1 flags.go:57] FLAG: --address=":8942"
I0816 16:14:07.067492       1 flags.go:57] FLAG: --alsologtostderr="false"
I0816 16:14:07.067495       1 flags.go:57] FLAG: --checkpoints-gc-interval="10m0s"
I0816 16:14:07.067499       1 flags.go:57] FLAG: --checkpoints-timeout="1m0s"
I0816 16:14:07.067504       1 flags.go:57] FLAG: --container-name-label="container"
I0816 16:14:07.067509       1 flags.go:57] FLAG: --container-namespace-label="namespace"
I0816 16:14:07.067514       1 flags.go:57] FLAG: --container-pod-name-label="pod"
I0816 16:14:07.067517       1 flags.go:57] FLAG: --cpu-histogram-decay-half-life="24h0m0s"
I0816 16:14:07.067522       1 flags.go:57] FLAG: --cpu-integer-post-processor-enabled="false"
I0816 16:14:07.067526       1 flags.go:57] FLAG: --history-length="8d"
I0816 16:14:07.067531       1 flags.go:57] FLAG: --history-resolution="1h"
I0816 16:14:07.067535       1 flags.go:57] FLAG: --kube-api-burst="10"
I0816 16:14:07.067541       1 flags.go:57] FLAG: --kube-api-qps="5"
I0816 16:14:07.067547       1 flags.go:57] FLAG: --kubeconfig=""
I0816 16:14:07.067552       1 flags.go:57] FLAG: --log-backtrace-at=":0"
I0816 16:14:07.067566       1 flags.go:57] FLAG: --log-dir=""
I0816 16:14:07.067571       1 flags.go:57] FLAG: --log-file=""
I0816 16:14:07.067575       1 flags.go:57] FLAG: --log-file-max-size="1800"
I0816 16:14:07.067579       1 flags.go:57] FLAG: --logtostderr="true"
I0816 16:14:07.067584       1 flags.go:57] FLAG: --memory-aggregation-interval="24h0m0s"
I0816 16:14:07.067589       1 flags.go:57] FLAG: --memory-aggregation-interval-count="8"
I0816 16:14:07.067593       1 flags.go:57] FLAG: --memory-histogram-decay-half-life="24h0m0s"
I0816 16:14:07.067597       1 flags.go:57] FLAG: --memory-saver="false"
I0816 16:14:07.067601       1 flags.go:57] FLAG: --metric-for-pod-labels="kube_pod_labels{job=\"kube-state-metrics\"}[8d]"
I0816 16:14:07.067605       1 flags.go:57] FLAG: --min-checkpoints="10"
I0816 16:14:07.067609       1 flags.go:57] FLAG: --one-output="false"
I0816 16:14:07.067613       1 flags.go:57] FLAG: --oom-bump-up-ratio="1.2"
I0816 16:14:07.067618       1 flags.go:57] FLAG: --oom-min-bump-up-bytes="1.048576e+08"
I0816 16:14:07.067623       1 flags.go:57] FLAG: --pod-label-prefix=""
I0816 16:14:07.067627       1 flags.go:57] FLAG: --pod-name-label="pod"
I0816 16:14:07.067631       1 flags.go:57] FLAG: --pod-namespace-label="namespace"
I0816 16:14:07.067635       1 flags.go:57] FLAG: --pod-recommendation-min-cpu-millicores="5"
I0816 16:14:07.067640       1 flags.go:57] FLAG: --pod-recommendation-min-memory-mb="25"
I0816 16:14:07.067645       1 flags.go:57] FLAG: --prometheus-address="http://thanos-query-frontend.prometheus-stack:9090"
I0816 16:14:07.067649       1 flags.go:57] FLAG: --prometheus-cadvisor-job-name="kubelet"
I0816 16:14:07.067653       1 flags.go:57] FLAG: --prometheus-query-timeout="5m"
I0816 16:14:07.067657       1 flags.go:57] FLAG: --recommendation-margin-fraction="0.15"
I0816 16:14:07.067662       1 flags.go:57] FLAG: --recommender-interval="1m0s"
I0816 16:14:07.067667       1 flags.go:57] FLAG: --recommender-name="default"
I0816 16:14:07.067671       1 flags.go:57] FLAG: --skip-headers="false"
I0816 16:14:07.067675       1 flags.go:57] FLAG: --skip-log-headers="false"
I0816 16:14:07.067679       1 flags.go:57] FLAG: --stderrthreshold="2"
I0816 16:14:07.067683       1 flags.go:57] FLAG: --storage="prometheus"
I0816 16:14:07.067686       1 flags.go:57] FLAG: --target-cpu-percentile="0.9"
I0816 16:14:07.067690       1 flags.go:57] FLAG: --v="10"
I0816 16:14:07.067693       1 flags.go:57] FLAG: --vmodule=""
I0816 16:14:07.067697       1 flags.go:57] FLAG: --vpa-object-namespace=""
I0816 16:14:07.067702       1 main.go:82] Vertical Pod Autoscaler 0.13.0 Recommender: 0xc00004d820

Full logs are in your mail :) so as not to leak any sensitive info here.

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Helm values are not much different:

vpa:
  recommender:
    extraArgs:
      storage: "prometheus"
      # The prometheus_server_endpoint should have the form http://<service-name>.<namespace-name>.svc:portnumber
      prometheus-address: "http://thanos-query-frontend.prometheus-stack:9090"
      prometheus-cadvisor-job-name: kubelet
      pod-label-prefix: ""
      pod-namespace-label: namespace
      pod-name-label: pod
      container-pod-name-label: pod
      container-name-label: container
      metric-for-pod-labels: kube_pod_labels{job="kube-state-metrics"}[8d]
      pod-recommendation-min-cpu-millicores: 5
      pod-recommendation-min-memory-mb: 25
      v: 10
  updater:
    enabled: false
  admissionController:
    enabled: false

@sudermanjr
Member

sudermanjr commented Aug 16, 2023

Aha. You're using uncappedTarget, which does not respect the limits set on the VPA or in the defaults.

kubernetes/autoscaler#2747 (comment)

Uncapped Target gives the recommendation before applying constraints specified in the VPA spec, such as min or max.

I would imagine that switching that metric to target would provide more consistent data (that's what my graph above uses).
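For anyone comparing the two in Grafana, kube-state-metrics exposes the recommendation fields as separate series (exact metric names depend on the kube-state-metrics version and how the VPA metrics are enabled); a sketch using the commonly seen names:

# raw recommendation, before the VPA's min/max constraints are applied
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_uncappedtarget{resource="memory"}

# recommendation after the constraints are applied
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target{resource="memory"}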

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

That was just the first graph shown by Grafana :) the graphs for Target look similar:
[Screenshot: Grafana graph of the Target recommendation showing the same drops]

@sudermanjr
Member

Well, now I'm at a loss. Perhaps the VPA folks can help explain why the recommendation status would oscillate so much. I personally haven't seen it do this in my various tests.

I'm guessing that the chart change itself has nothing to do with it, and that it's something triggered by the re-deploy of the VPA pods. But that's just a hunch.

@sudermanjr added the "needs more information" label on Aug 28, 2023
@github-actions bot added the "stale" label on Oct 28, 2023
@github-actions bot closed this as completed on Nov 4, 2023