RFC: move decision-making of desired VM size to VM monitor #8111
Conversation
The new protocol message allows the vm-monitor to directly specify the desired size of the VM. With that, the agent no longer needs the metrics; it simply tries to fulfill the vm-monitor's wish. This is the autoscaler agent implementation of the RFC I proposed here: neondatabase/neon#8111. To use the new API, see the corresponding VM monitor changes at: https://github.com/neondatabase/neon/tree/heikki/wip-autoscale-api
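For illustration, here is a rough sketch of what such a message could look like on the vm-monitor side, assuming serde is available for serialization. The name ScaleRequest comes from the discussion below; the field names and units are assumptions, not the actual protocol definition in the linked branches.

```rust
use serde::{Deserialize, Serialize};

/// Illustrative only: the vm-monitor tells the autoscaler-agent what
/// VM size it currently wants. Field names and units are assumptions;
/// the real message is defined in the draft branches linked above.
#[derive(Debug, Serialize, Deserialize)]
struct ScaleRequest {
    /// Desired number of vCPUs (hypothetical field).
    desired_vcpus: f64,
    /// Desired memory, in bytes (hypothetical field).
    desired_memory_bytes: u64,
}
```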
This is the VM monitor implementation of the RFC at #8111. I tried to keep the user-visible behavior unchanged from what we have today. Improving the autoscaling algorithm is a separate topic; the point of this work is just to move the algorithm from the autoscaler agent to the VM monitor. That lays the groundwork for improving it later, based on more metrics and signals inside the VM. Some notable changes:
- I removed all the cgroup-managing code. Instead of polling the cgroup memory threshold, this polls overall system memory usage.
- The scaling algorithm is based on a sliding window of load average and memory usage over the last minute (see the sketch after this list). I'm not sure how close that is to the algorithm used by the autoscaler agent; I couldn't find a description of exactly what algorithm is used there. I think this is close, but if not, it can be changed to match the agent's current algorithm more closely. I copied the LoadAverageFractionTarget and MemoryUsageFractionTarget settings from the autoscaler agent, with the defaults I found in the repo, but I'm not sure whether we use different settings in production.
- I also didn't fully understand the memory history logging in the VM monitor, which was used to trigger upscaling. There is only one memory scaling codepath now, based on the max over a 1-minute sliding window.
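A minimal sketch of that sliding-window sizing, under stated assumptions: the sample and config types are hypothetical, and the fraction-target values in the comments are placeholders rather than the defaults actually copied from the agent repo.

```rust
/// Targets named after the autoscaler-agent settings mentioned above;
/// the numeric defaults here are placeholders, not the real ones.
struct ScalingConfig {
    load_average_fraction_target: f64, // e.g. 0.9
    memory_usage_fraction_target: f64, // e.g. 0.75
}

/// One metrics sample; the monitor would collect these every few seconds.
struct Sample {
    load_avg_1min: f64,
    memory_used_bytes: u64,
}

/// Desired VM size in compute units, based on the max load average and
/// max memory usage seen over the last minute of samples: pick the
/// smallest size that keeps both metrics under their fraction targets.
fn desired_compute_units(
    window: &[Sample],
    cfg: &ScalingConfig,
    cpus_per_cu: f64,
    memory_bytes_per_cu: u64,
) -> f64 {
    let max_load = window.iter().map(|s| s.load_avg_1min).fold(0.0, f64::max);
    let max_mem = window.iter().map(|s| s.memory_used_bytes).max().unwrap_or(0);

    let cpu_goal = max_load / (cfg.load_average_fraction_target * cpus_per_cu);
    let mem_goal =
        max_mem as f64 / (cfg.memory_usage_fraction_target * memory_bytes_per_cu as f64);

    cpu_goal.max(mem_goal)
}
```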
+1 on the high-level idea that the workload should request the compute size, not an external observer.
I'm missing details on the ScaleRequest semantics. Is it a synchronous call? Is it just a "would be nice to have but until you give it to me, I will work with existing resources"? Is the response to the ScaleRequest an estimate for how long it's going to take until the upscaling is complete?
It's "would be nice to have but until you give it to me, I will work with existing resources". The agent doesn't send any response to the ScaleRequest. If the ScaleRequest results in upscaling or downscaling, however, the agent will send a DownScaleRequest or UpscaleNotification to the VM monitor, just like it does today when it decides to perform an upscale or downscale. I'm not sure that's the best protocol, but I think it's the path of least resistence, because it's very close to how the current UpscaleRequest mesage works. |
See also the draft implementations of this in the autoscaler agent and VM monitor changes linked above.