[clusteragent/autoscaling] Defer autoscaling stack startup until first DPA #50305
Conversation
Go Package Import Differences (baseline: 4bb5357)
🎯 Code Coverage: commit SHA 4a2acf7
Files inventory check summary: file check results against ancestor 4bb5357e for datadog-agent_7.80.0~devel.git.474.4a2acf7.pipeline.111694927-1_amd64.deb: no change detected.
Static quality checks

✅ Please find below the results from the static quality gates.

Successful checks: 17 successful checks with minimal change (< 2 KiB) in on-wire sizes (compressed).

Regression Detector Results

Baseline: 4bb5357

Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -1.53 | [-4.52, +1.45] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_metrics_logs | memory utilization | +0.93 | [+0.68, +1.19] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.87 | [+0.77, +0.97] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.46 | [+0.41, +0.50] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.28 | [+0.10, +0.46] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.16 | [-0.07, +0.40] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.14 | [+0.08, +0.20] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.12 | [+0.07, +0.18] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.02 | [-0.18, +0.22] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.47, +0.44] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.02 | [-0.54, +0.49] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.22, +0.18] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.03 | [-0.23, +0.17] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.03 | [-0.44, +0.37] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.05 | [-0.15, +0.05] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.05 | [-0.19, +0.09] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.11 | [-0.27, +0.05] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.14 | [-0.19, -0.09] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -0.15 | [-1.13, +0.82] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | -0.19 | [-0.34, -0.03] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.20 | [-0.25, -0.15] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.11 | [-1.31, -0.92] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -1.53 | [-4.52, +1.45] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 716 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 241.90MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 734 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 143.50MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 472.27MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 178.56MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 351.57 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 389.38MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide that a change in performance is a "regression" (a change worth investigating further) if all of the following criteria are true:

- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between the baseline and comparison variants.
- Its configuration does not mark it "erratic".
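The three criteria above can be expressed as a small predicate. The sketch below uses illustrative struct and field names, not the Regression Detector's actual schema:

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds the per-experiment statistics reported in the tables
// above. Field names are illustrative, not the detector's real schema.
type experiment struct {
	deltaMeanPct  float64 // estimated Δ mean %
	ciLow, ciHigh float64 // 90% confidence interval bounds for Δ mean %
	erratic       bool    // configuration marks the experiment as erratic
}

// isRegression applies the three criteria: effect size ≥ 5%, confidence
// interval excludes zero, and the experiment is not marked erratic.
func isRegression(e experiment) bool {
	const effectSizeTolerance = 5.0
	bigEnough := math.Abs(e.deltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && ciExcludesZero && !e.erratic
}

func main() {
	// docker_containers_cpu from the table: -1.53 [-4.52, +1.45]
	fmt.Println(isRegression(experiment{deltaMeanPct: -1.53, ciLow: -4.52, ciHigh: 1.45})) // false
	// a hypothetical large, significant change
	fmt.Println(isRegression(experiment{deltaMeanPct: -7.2, ciLow: -9.1, ciHigh: -5.3})) // true
}
```

For docker_containers_cpu, the effect size is below tolerance and the interval contains zero, so it is reported with ➖ rather than flagged.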
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
```go
w.patcherMutex.RLock()
p := w.patcher
w.patcherMutex.RUnlock()
if p == nil {
	return false, nil
}
```

Reviewer: minor: if I understand correctly, here we check whether the webhook is essentially active. We could move this check earlier into the WebhookFunc, so that we avoid serializing and deserializing the admission payload until the patcher is set.
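The suggested early-exit pattern, guarding the expensive decode behind the patcher check, can be sketched in isolation. All names below are illustrative stand-ins, not the PR's actual types or WebhookFunc wiring:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// PodPatcher is a stand-in for the real workload.PodPatcher interface.
type PodPatcher interface{ Patch(podName string) error }

// Webhook guards its patcher with an RWMutex, mirroring the snippet above.
type Webhook struct {
	mu      sync.RWMutex
	patcher PodPatcher
}

func (w *Webhook) getPatcher() PodPatcher {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return w.patcher
}

// handle decodes the admission payload only when a patcher is installed,
// skipping deserialization cost while autoscaling is dormant.
func (w *Webhook) handle(payload []byte) (mutated bool, err error) {
	p := w.getPatcher()
	if p == nil {
		// No patcher yet: webhook is effectively inactive, skip decoding.
		return false, nil
	}
	var req struct {
		Name string `json:"name"`
	}
	if err := json.Unmarshal(payload, &req); err != nil {
		return false, err
	}
	return true, p.Patch(req.Name)
}

type noopPatcher struct{}

func (noopPatcher) Patch(string) error { return nil }

func main() {
	w := &Webhook{}
	mutated, _ := w.handle([]byte(`{"name":"web-0"}`))
	fmt.Println(mutated) // false: no patcher installed yet

	w.mu.Lock()
	w.patcher = noopPatcher{}
	w.mu.Unlock()
	mutated, _ = w.handle([]byte(`{"name":"web-0"}`))
	fmt.Println(mutated) // true: payload decoded and patched
}
```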
```go
// SetPatcher installs the PodPatcher used to apply recommendations.
func (w *Webhook) SetPatcher(p workload.PodPatcher) {
```

Reviewer: Since you are doing a runtime check for the webhook enablement, I think you would need to set isEnabled=true unconditionally.

Author: I think the current check is correct. Although, in the future, if we flip autoscaling.workload.enabled to true by default, it would be equivalent to what you mentioned.
```go
gate *autoscalinggate.Gate,
) error {
	enable := func(_ any) { gate.Enable() }
	handlers := cache.ResourceEventHandlerFuncs{AddFunc: enable}
```

Reviewer: minor: it's worth checking whether the initial load calls the add function or the update function. Also, I guess we could shut down the informers after the gate is enabled.

Author: The initial list calls the handlers defined. We cannot shut down the informers, because the autoscaling controller needs them.
```go
// WaitForEnable blocks until Enable is called or ctx is cancelled. Returns true
// if Enable was called, false if ctx was cancelled first.
func (g *Gate) WaitForEnable(ctx context.Context) bool {
```

Reviewer: nit: alternatively, you could make it return the context error to simplify the call sites and error logging, from

```go
if !gate.WaitForEnable(ctx) || ctx.Err() != nil {
	return
}
```

to

```go
if err := gate.WaitForEnable(ctx); err != nil {
	log.Errorf("...%v", err)
	return
}
```

Author: I left this one as-is. The reason is that not all context cancellations are errors (for example, when the DCA is shutting down), so I don't think we should always log.
```go
for _, webhook := range webhooks {
	if aw, ok := webhook.(*admissionautoscaling.Webhook); ok {
		autoscalingWebhook = aw
		break
	}
}
```

Reviewer: nit: here you could introduce an admissionautoscaling.GetWebhook([]Webhook) *admissionautoscaling.Webhook helper to extract the autoscaling webhook and use it where it's needed:

```go
if config.GetBool("autoscaling.workload.enabled") {
	autoscalingWebhook := admissionautoscaling.GetWebhook(webhooks)
	...
}
```

Author: Extracting the helper to the admissionautoscaling package is not possible because it would create a circular dependency. I extracted the helper to the same file, to clean this part up a bit.
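The loop above is a linear scan with a type assertion. As a sketch, the same idea can be written once as a generic helper that lives next to the call site, avoiding the import cycle. The Webhook interface below is a minimal stand-in for the real admission interface, and findWebhook is a hypothetical name:

```go
package main

import "fmt"

// Webhook is a stand-in for the admission webhook interface.
type Webhook interface{ Name() string }

type autoscalingWebhook struct{}

func (autoscalingWebhook) Name() string { return "autoscaling" }

type otherWebhook struct{}

func (otherWebhook) Name() string { return "other" }

// findWebhook returns the first webhook of concrete type T, or false if
// none is present. The type assertion to the type parameter T mirrors the
// assertion to *admissionautoscaling.Webhook in the original loop.
func findWebhook[T Webhook](webhooks []Webhook) (T, bool) {
	for _, w := range webhooks {
		if t, ok := w.(T); ok {
			return t, true
		}
	}
	var zero T
	return zero, false
}

func main() {
	hooks := []Webhook{otherWebhook{}, autoscalingWebhook{}}
	if aw, ok := findWebhook[autoscalingWebhook](hooks); ok {
		fmt.Println(aw.Name()) // prints "autoscaling"
	}
}
```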
```go
gateWired := c.autoscalingGate != nil
needsGateSync := gateWired && c.config.GetBool("autoscaling.workload.enabled")

// Pod store handling
if shouldStartPodStoreEagerly(c.config, gateWired) {
	reflector, store := newPodStore(ctx, wlmetaStore, c.config, client)
	objectStores = append(objectStores, store)
	go reflector.Run(ctx.Done())
	if needsGateSync {
		go c.markPodCollectionSyncedWhenReady(ctx, store)
	}
} else if needsGateSync {
	// Defer starting the pod reflector until the first DatadogPodAutoscaler
	// or DatadogPodAutoscalerClusterProfile has been deployed.
	go c.startPodStoreOnGate(ctx, wlmetaStore, client, newPodStore)
}
```

Reviewer: I find the logic here hard to understand, because we use gateWired as (c.config.GetBool("autoscaling.workload.enabled") && c.autoscalingGate == nil) inside shouldStartPodStoreEagerly, but also needsGateSync (c.config.GetBool("autoscaling.workload.enabled") && c.autoscalingGate != nil) in both branches. I think it should be possible to refactor these conditions somehow to simplify it.

Reviewer: I am wondering if we can come up with something like an EnabledGate (enabled by default) in case c.autoscalingGate is nil, and then avoid branching at all:

```go
gate := c.autoscalingGate
if gate == nil {
	gate = autoscalinggate.NewEnabledGate()
}
...
// use gate, which could delay the pod store or create it right away
```

Author: I agree this part was a bit hard to follow. I've rewritten it a little to try to make it simpler to understand.
```go
} else if needsGateSync {
	// Defer starting the pod reflector until the first DatadogPodAutoscaler
	// or DatadogPodAutoscalerClusterProfile has been deployed.
	go c.startPodStoreOnGate(ctx, wlmetaStore, client, newPodStore)
```

Reviewer: Here we do not add the created store to objectStores, is that correct?

Author: Yes, it's correct. objectStores is only used for startup readiness. When starting the store lazily, we don't want to wait for it at startup.
@AlexanderYastrebov thanks for the review. I addressed your comments.
What does this PR do?
The goal of this PR is to allow enabling workload autoscaling without extra cost when it's not in use.
Right now, when autoscaling is enabled, the kubeapiserver workloadmeta collector starts a pod reflector among other things. In large clusters, this can use a lot of memory. This happens even if no DPA is deployed.
We want to avoid this memory usage when no DPAs are deployed.
This will let us enable workload autoscaling by default without extra cost when it's not in use. Users who want it will be able to create DPAs directly, without having to enable the option in the Cluster Agent.
This PR does not flip the autoscaling.workload.enabled default to true. That will be done in a separate PR so it can be reverted independently if needed.

Describe how you validated your changes
Unit tests plus tests on a local kind cluster.
For the kind tests, I used kwok to simulate a large number of pods so the memory impact of the pod reflector would be measurable.
First, I deployed with autoscaling disabled to get a memory baseline. Then I deployed with autoscaling enabled but no DPAs, and checked that memory usage stayed close to the baseline. Finally, I created a DPA and checked that memory usage went up as expected.