agent: Add per-VM metric for desired CU(s) #1108

sharnoff · 2024-10-12T01:48:25Z

This commit adds a new per-VM metric: autoscaling_vm_desired_cu.

It's based on the same "desired CU" information exposed by the scaling event reporting, but it is updated continuously, whereas the event reporting is rate limited to avoid spamming our reporting.

The metric has the same base labels as the other per-VM metrics, with the addition of the "component" label, which is one of:

* total - the goal CU, after taking the maximum of the individual parts and rounding up to the next unit.
* cpu - goal CU size in order to fit the current CPU usage
* mem - goal CU size in order to fit the current memory usage (including some information derived from LFC, to make sure there's room for cache too)
* lfc - goal CU size in order to fit the estimated working set size

All of these values are also multiplied by the same Compute Unit factor as with the normal scaling event reporting, so that Neon's fractional compute units are exposed as such in the metrics, even as we use integer compute units in the autoscaler-agent.

Also note that all values except "total" are NOT rounded, and instead show the fractional amounts to allow better comparison.

KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a particular VM, the metrics will not be cleared, and instead will just cease to be updated. I figured this is a reasonable trade-off for simplicity.
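The limitation can be illustrated with a small stand-in for a labeled gauge. This is only a sketch of the behavior described above, not the PR's implementation: the `gauges` map, the `gaugeKey` type, and the `report` method are hypothetical, and a plain map is used instead of a real Prometheus gauge vector.

```go
package main

import "fmt"

// gaugeKey mimics the label set of a per-VM gauge series.
// Hypothetical type for illustration.
type gaugeKey struct {
	VM        string
	Component string
}

// gauges is a stand-in for a labeled gauge: one float value per label set.
type gauges map[gaugeKey]float64

// report updates the VM's series only while reporting is enabled.
// When reporting is disabled it returns early without deleting anything,
// so previously written series keep their last value.
func (g gauges) report(vm string, enabled bool, values map[string]float64) {
	if !enabled {
		return
	}
	for component, v := range values {
		g[gaugeKey{VM: vm, Component: component}] = v
	}
}

func main() {
	g := gauges{}

	// Reporting enabled: the series is created.
	g.report("vm-a", true, map[string]float64{"total": 0.75})

	// Reporting disabled at runtime: the new value is ignored, and the
	// old series is left in place rather than being cleared.
	g.report("vm-a", false, map[string]float64{"total": 1.5})

	fmt.Println(g[gaugeKey{VM: "vm-a", Component: "total"}]) // prints 0.75
}
```

Clearing the series on disable would require tracking the transition per VM, which is the extra complexity the trade-off avoids.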

Notes for review: Tested this locally with the following patch to vm-deploy.yaml:

diff --git a/vm-deploy.yaml b/vm-deploy.yaml
index 09588f60..dcf9f0f0 100644
--- a/vm-deploy.yaml
+++ b/vm-deploy.yaml
@@ -10,6 +10,8 @@ metadata:
     autoscaling.neon.tech/enabled: "true"
     # Set to "true" to continuously migrate the VM (TESTING ONLY)
     autoscaling.neon.tech/testing-only-always-migrate: "false"
+    autoscaling.neon.tech/report-desired-scaling: "true"
+    neon/endpoint-id: "foobar"
 spec:
   schedulerName: autoscale-scheduler
   enableSSH: true

AFAICT it works as intended, but metrics are sometimes tricky. I plan to test it on staging before merging.

~~Also note: This PR builds on #1107 and must not be merged before it.~~

github-actions · 2024-10-12T01:51:49Z

No changes to the coverage.

pkg/agent/prommetrics.go

This commit adds a new per-VM metric: autoscaling_vm_desired_cu.

It's based on the same "desired CU" information exposed by the scaling event reporting, but updated continuously instead of being rate limited to avoid spamming our reporting.

The metric has the same base labels as the other per-VM metrics, with the addition of the "reason" label, which is one of:

* "total" - the goal CU, after taking the maximum of the individual parts and rounding up to the next unit.
* "cpu" - goal CU size in order to fit the current CPU usage
* "mem" - goal CU size in order to fit the current memory usage, which includes some assessment
* "lfc" - goal CU size in order to fit the estimated working set size

All of these values are also multiplied by the same Compute Unit factor as with the normal scaling event reporting, so that Neon's fractional compute units are exposed as such in the metrics, even as we use integer compute units in the autoscaler-agent.

Also note that all values except "total" are NOT rounded, and instead show the fractional amounts to allow better comparison.

KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a particular VM, the metrics will not be cleared, and instead will just cease to be updated. I figured this is a reasonable trade-off for simplicity.

Omrigan

Only nits.

pkg/agent/prommetrics.go

pkg/agent/watch.go

sharnoff · 2025-02-10T22:28:33Z

Tested on staging, looks good :D

ref https://neondb.slack.com/archives/C03TN5G758R/p1739221210836759

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 693b601 to a3cf0fa Compare October 12, 2024 21:39

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from c9f1d75 to fe3b019 Compare October 12, 2024 21:39

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from a3cf0fa to 16c0917 Compare October 12, 2024 22:16

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from fe3b019 to 244473b Compare October 12, 2024 22:21

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 16c0917 to d2b4d45 Compare October 17, 2024 17:13

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 244473b to 8008093 Compare October 17, 2024 17:14

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from d2b4d45 to 8c60b7f Compare November 18, 2024 04:01

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 8008093 to 8639da0 Compare November 18, 2024 04:01

Omrigan self-requested a review December 16, 2024 17:14

Omrigan self-assigned this Dec 16, 2024

Omrigan requested review from mikhail-sakhnov and petuhovskiy December 16, 2024 17:15

Omrigan reviewed Dec 20, 2024

pkg/agent/prommetrics.go (resolved)

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 8639da0 to 4b94cae Compare December 27, 2024 18:41

sharnoff mentioned this pull request Dec 27, 2024

agent: Add scaling event reporting #1107

Merged

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from f4f9b64 to 4b94cae Compare January 23, 2025 18:25

Base automatically changed from sharnoff/scaling-event-reporting-2 to main January 24, 2025 21:59

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 4b94cae to ed0cf86 Compare January 24, 2025 22:06

fix test compilation errors

38c8ea5

Omrigan approved these changes Feb 10, 2025

pkg/agent/prommetrics.go (resolved)

pkg/agent/prommetrics.go (outdated, resolved)

pkg/agent/prommetrics.go (outdated, resolved)

pkg/agent/watch.go (resolved)

Omrigan assigned sharnoff and unassigned Omrigan Feb 10, 2025

sharnoff added 5 commits February 10, 2025 12:39

Merge branch 'main' into agent-desired-cu-metrics

10a67ae

standardize on "component"

4edc175

s/conversionFactor/cuMultiplier/

cc3057e

extract into updateActive, deleteActive

e9eb716

Merge branch 'main' into agent-desired-cu-metrics

392650f

sharnoff merged commit f9403d0 into main Feb 10, 2025
24 checks passed

sharnoff deleted the sharnoff/agent-desired-cu-metrics branch February 10, 2025 22:30
