[VPA] Usage of VPA helm chart >2.0.0 leads to missing recommendations #1296

Closed
2 tasks done
Pionerd opened this issue Aug 16, 2023 · 13 comments
Labels
bug (Something isn't working), needs more information (Additional info is needed to assess the issue), stale (Marked as stale by stalebot), triage (This bug needs triage)

Comments

@Pionerd
Contributor

Pionerd commented Aug 16, 2023

What happened?

We upgraded Goldilocks (tried 7.0.0, 7.1.0 and 7.1.1) and the VPA helm chart (2.2.0 and 2.3.0), after which we saw the recommendations intermittently drop to the minimum we set (e.g. 25 MB for memory), and sometimes even lower than that (8.33 MB).

[Screenshot: Grafana graph showing the recommendations dropping to the configured minimum]

Reverting the VPA chart to the latest 1.x.x (1.7.5) seems to undo this behaviour, even though the underlying VPA image version (0.13.0) did not change. We are a bit at a loss here, because the logs are not giving us much helpful information either.

We see a lot of log lines like the one below, but that is still the case after downgrading as well.

I0816 15:52:27.494178       1 request.go:533] Waited for 170.386822ms due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/autoscaling.k8s.io/v1/namespaces/
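As an aside, that log line comes from client-go's client-side rate limiter, which the recommender controls through its --kube-api-qps and --kube-api-burst flags (both appear in the flag dump further down in this thread). If the throttling itself ever needs addressing, a minimal sketch of raising the limits via the chart's extraArgs, following the same pattern as the values shared later and with purely illustrative numbers:

vpa:
  recommender:
    extraArgs:
      # illustrative values only; the flag dump in this thread shows qps 5 / burst 10
      kube-api-qps: 20
      kube-api-burst: 40

This only quiets the throttling messages, though; it is unrelated to the recommendation drops described above.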

Any ideas on how to move forward? We are posting here since the underlying VPA image version did not change.

What did you expect to happen?

Continuous recommendations being shown.

How can we reproduce this?

EKS 1.26

Version

VPA helm chart 2.3.0

Search

  • I did search for other open and closed issues before opening this.

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

No response

@Pionerd added the "bug" and "triage" labels on Aug 16, 2023
@sudermanjr
Member

How are you pulling these metrics into Grafana? Is it possible there's actually just an issue with the metrics reporting rather than the actual VPA recommendation itself? The changes from 1.7.5 to 2.x are almost entirely unrelated to the recommender deployment itself.

@sudermanjr
Member

Additionally, are you using long-term storage with prometheus to feed VPA?

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

We use kube-state-metrics to scrape the VPA recommendations. The values in the Grafana dashboard are the same as when checking using kubectl get vpa.
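For reference, the same status block that kube-state-metrics scrapes can be read straight off the VPA object; a minimal sketch (namespace and VPA name are placeholders):

kubectl get vpa <vpa-name> -n <namespace> \
  -o jsonpath='{.status.recommendation.containerRecommendations}'

This prints the containerRecommendations list (lowerBound, target, uncappedTarget and upperBound per container), which helps confirm whether a drop is in the VPA status itself or only in the metrics pipeline.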

I also cannot understand why this change would lead to this behaviour. Have you seen anything like this before?

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Additionally, are you using long-term storage with prometheus to feed VPA?

Yes, we use Thanos.

@sudermanjr
Member

sudermanjr commented Aug 16, 2023

The only time I've seen erratic recommendations is when I'm not using Prometheus data to feed the recommendations and I don't wait long enough for VPA to generate a good recommendation. Here's a cluster with 53 VPAs, using prometheus data, and the latest chart. (also using kube-state-metrics to poll the VPA data)

[Screenshot (Aug 16, 2023): recommendation graph from a cluster with 53 VPAs using prometheus data]

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Additional remark: we have multiple clients using our setup, and all of the EKS clients (but only the EKS clients) are suffering from this; the AKS customers are not, after the same upgrade.

@sudermanjr
Member

Maybe try turning the log level on the recommender up to 10?

@sudermanjr
Member

I just realized the cluster that I'm showing in that graph above uses the vpa 0.14.0 image. Perhaps there's a bugfix in that version. Worth trying.

It would help if you could share your exact values so I can try to reproduce the issue.

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Relevant parameters:

I0816 16:14:07.067381       1 flags.go:57] FLAG: --add-dir-header="false"
I0816 16:14:07.067486       1 flags.go:57] FLAG: --address=":8942"
I0816 16:14:07.067492       1 flags.go:57] FLAG: --alsologtostderr="false"
I0816 16:14:07.067495       1 flags.go:57] FLAG: --checkpoints-gc-interval="10m0s"
I0816 16:14:07.067499       1 flags.go:57] FLAG: --checkpoints-timeout="1m0s"
I0816 16:14:07.067504       1 flags.go:57] FLAG: --container-name-label="container"
I0816 16:14:07.067509       1 flags.go:57] FLAG: --container-namespace-label="namespace"
I0816 16:14:07.067514       1 flags.go:57] FLAG: --container-pod-name-label="pod"
I0816 16:14:07.067517       1 flags.go:57] FLAG: --cpu-histogram-decay-half-life="24h0m0s"
I0816 16:14:07.067522       1 flags.go:57] FLAG: --cpu-integer-post-processor-enabled="false"
I0816 16:14:07.067526       1 flags.go:57] FLAG: --history-length="8d"
I0816 16:14:07.067531       1 flags.go:57] FLAG: --history-resolution="1h"
I0816 16:14:07.067535       1 flags.go:57] FLAG: --kube-api-burst="10"
I0816 16:14:07.067541       1 flags.go:57] FLAG: --kube-api-qps="5"
I0816 16:14:07.067547       1 flags.go:57] FLAG: --kubeconfig=""
I0816 16:14:07.067552       1 flags.go:57] FLAG: --log-backtrace-at=":0"
I0816 16:14:07.067566       1 flags.go:57] FLAG: --log-dir=""
I0816 16:14:07.067571       1 flags.go:57] FLAG: --log-file=""
I0816 16:14:07.067575       1 flags.go:57] FLAG: --log-file-max-size="1800"
I0816 16:14:07.067579       1 flags.go:57] FLAG: --logtostderr="true"
I0816 16:14:07.067584       1 flags.go:57] FLAG: --memory-aggregation-interval="24h0m0s"
I0816 16:14:07.067589       1 flags.go:57] FLAG: --memory-aggregation-interval-count="8"
I0816 16:14:07.067593       1 flags.go:57] FLAG: --memory-histogram-decay-half-life="24h0m0s"
I0816 16:14:07.067597       1 flags.go:57] FLAG: --memory-saver="false"
I0816 16:14:07.067601       1 flags.go:57] FLAG: --metric-for-pod-labels="kube_pod_labels{job=\"kube-state-metrics\"}[8d]"
I0816 16:14:07.067605       1 flags.go:57] FLAG: --min-checkpoints="10"
I0816 16:14:07.067609       1 flags.go:57] FLAG: --one-output="false"
I0816 16:14:07.067613       1 flags.go:57] FLAG: --oom-bump-up-ratio="1.2"
I0816 16:14:07.067618       1 flags.go:57] FLAG: --oom-min-bump-up-bytes="1.048576e+08"
I0816 16:14:07.067623       1 flags.go:57] FLAG: --pod-label-prefix=""
I0816 16:14:07.067627       1 flags.go:57] FLAG: --pod-name-label="pod"
I0816 16:14:07.067631       1 flags.go:57] FLAG: --pod-namespace-label="namespace"
I0816 16:14:07.067635       1 flags.go:57] FLAG: --pod-recommendation-min-cpu-millicores="5"
I0816 16:14:07.067640       1 flags.go:57] FLAG: --pod-recommendation-min-memory-mb="25"
I0816 16:14:07.067645       1 flags.go:57] FLAG: --prometheus-address="http://thanos-query-frontend.prometheus-stack:9090"
I0816 16:14:07.067649       1 flags.go:57] FLAG: --prometheus-cadvisor-job-name="kubelet"
I0816 16:14:07.067653       1 flags.go:57] FLAG: --prometheus-query-timeout="5m"
I0816 16:14:07.067657       1 flags.go:57] FLAG: --recommendation-margin-fraction="0.15"
I0816 16:14:07.067662       1 flags.go:57] FLAG: --recommender-interval="1m0s"
I0816 16:14:07.067667       1 flags.go:57] FLAG: --recommender-name="default"
I0816 16:14:07.067671       1 flags.go:57] FLAG: --skip-headers="false"
I0816 16:14:07.067675       1 flags.go:57] FLAG: --skip-log-headers="false"
I0816 16:14:07.067679       1 flags.go:57] FLAG: --stderrthreshold="2"
I0816 16:14:07.067683       1 flags.go:57] FLAG: --storage="prometheus"
I0816 16:14:07.067686       1 flags.go:57] FLAG: --target-cpu-percentile="0.9"
I0816 16:14:07.067690       1 flags.go:57] FLAG: --v="10"
I0816 16:14:07.067693       1 flags.go:57] FLAG: --vmodule=""
I0816 16:14:07.067697       1 flags.go:57] FLAG: --vpa-object-namespace=""
I0816 16:14:07.067702       1 main.go:82] Vertical Pod Autoscaler 0.13.0 Recommender: 0xc00004d820

Full logs are in your mail :) so as not to leak any sensitive info here.

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

Helm values are not much different:

vpa:
  recommender:
    extraArgs:
      storage: "prometheus"
      # The prometheus_server_endpoint should have the form http://<service-name>.<namespace-name>.svc:portnumber
      prometheus-address: "http://thanos-query-frontend.prometheus-stack:9090"
      prometheus-cadvisor-job-name: kubelet
      pod-label-prefix: ""
      pod-namespace-label: namespace
      pod-name-label: pod
      container-pod-name-label: pod
      container-name-label: container
      metric-for-pod-labels: kube_pod_labels{job="kube-state-metrics"}[8d]
      pod-recommendation-min-cpu-millicores: 5
      pod-recommendation-min-memory-mb: 25
      v: 10
  updater:
    enabled: false
  admissionController:
    enabled: false

@sudermanjr
Member

sudermanjr commented Aug 16, 2023

Aha. You're using uncappedTarget, which does not respect the limits set on the VPA or in the defaults.

kubernetes/autoscaler#2747 (comment)

Uncapped Target gives the recommendation before applying constraints specified in the VPA spec, such as min or max.

I would imagine that switching that metric to target would provide more consistent data (that's what my graph above uses).
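For anyone comparing the two in Grafana, kube-state-metrics exposes the recommendation fields as separate series (exact metric names depend on the kube-state-metrics version and how the VPA metrics are enabled); a sketch using the commonly seen names:

# raw recommendation, before the VPA's min/max constraints are applied
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_uncappedtarget{resource="memory"}

# recommendation after the constraints are applied
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target{resource="memory"}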

@Pionerd
Contributor Author

Pionerd commented Aug 16, 2023

That was just the first graph shown by Grafana :) the graphs for Target look similar:
[Screenshot: Grafana graph of the Target recommendation showing the same drops]

@sudermanjr
Member

Well, now I'm at a loss. Perhaps the VPA folks can help explain why the recommendation status would oscillate so much. I personally haven't seen it do this in my various tests.

I'm guessing that the chart change itself has nothing to do with it, and that it's something triggered by the re-deploy of the VPA pods. But that's just a hunch.

@sudermanjr added the "needs more information" label on Aug 28, 2023
@github-actions bot added the "stale" label on Oct 28, 2023
@github-actions bot closed this as completed on Nov 4, 2023