feat(autoscaling): CPURequestsRemoveLimitsMemoryRequestsAndLimits + cluster burstable default #49314
clamoriniere wants to merge 5 commits into
Files inventory check summary
File checks results against ancestor 37ca696e:
Results for datadog-agent_7.80.0~devel.git.429.bbaa1d6.pipeline.111425872-1_amd64.deb: No change detected

Regression Detector Results
Baseline: 37ca696
Optimization Goals: ✅ No significant changes detected

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.23 | [-0.71, +5.17] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.23 | [-0.71, +5.17] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.42 | [+0.32, +0.52] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.26 | [+0.02, +0.50] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.17 | [-0.80, +1.14] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.07 | [-0.12, +0.26] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.07 | [-0.19, +0.32] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.02 | [-0.50, +0.55] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.02 | [-0.42, +0.46] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.00 | [-0.41, +0.42] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.00 | [-0.20, +0.20] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.20, +0.20] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.00 | [-0.05, +0.05] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.01 | [-0.13, +0.12] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.02 | [-0.18, +0.14] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.22, +0.18] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.15 | [-0.20, -0.10] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.18 | [-0.27, -0.08] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.20 | [-0.36, -0.05] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.25 | [-0.30, -0.20] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.29 | [-0.35, -0.23] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.33 | [-0.37, -0.29] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.94 | [-2.21, -1.66] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 672 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 245.45MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 670 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 141.65MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 478.36MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 175.63MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 345.27 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 379.07MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we classify a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
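
The three criteria above can be read as a single predicate. The following is a minimal Go sketch of that reading, not the detector's actual code: the `Experiment` type, its field names, and `IsRegression` are hypothetical names introduced for illustration, and only the 5.00% tolerance and the "CI must not contain zero" rule come from the report.

```go
package main

// Experiment carries the per-experiment numbers shown in the tables above.
// The type and its field names are hypothetical, chosen for this sketch.
type Experiment struct {
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% confidence interval
	CIHigh       float64 // upper bound of the 90% confidence interval
	Erratic      bool    // true when the experiment's configuration marks it "erratic"
}

// effectSizeTolerance mirrors the "|Δ mean %| ≥ 5.00%" threshold in the report.
const effectSizeTolerance = 5.0

// IsRegression reports whether all three criteria hold at once.
func IsRegression(e Experiment) bool {
	magnitude := e.DeltaMeanPct
	if magnitude < 0 {
		magnitude = -magnitude
	}
	bigEnough := magnitude >= effectSizeTolerance
	// The interval excludes zero when both bounds sit on the same side of it.
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}
```

Under this reading, `tcp_syslog_to_blackhole` (-1.94, CI [-2.21, -1.66]) is not flagged: its interval excludes zero, but the effect size is below the 5.00% tolerance.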
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
This pull request has been automatically marked as stale because it has not had activity in the past 15 days. It will be closed in 30 days if no further activity occurs. If this pull request is still relevant, adding a comment or pushing new commits will keep it open. Also, you can always reopen the pull request if you missed the window. Thank you for your contributions!
…Limits controlled value

Bump datadog-operator/api to v0.0.0-20260414104914-c59fc90bbc2c, which introduces the new CPURequestsRemoveLimitsMemoryRequestsAndLimits enum value for container controlledValues.

When a container constraint sets this value, the autoscaler applies different strategies per resource:
- CPU: request recommendation applied, existing CPU limits actively removed from the live pod so the container can burst freely.
- Memory: both requests and limits are controlled (RequestsAndLimits semantics unchanged).

Changes:
- applyVerticalConstraints: strip CPU from recommendation limits when CPURequestsRemoveLimitsMemoryRequestsAndLimits is set so that the backend never pushes a new CPU limit.
- patchContainerResources: actively delete any pre-existing CPU limit from the live pod for the same controlled value.
- getContainerControlledValues: new helper resolving ControlledValues from spec constraints (specific name > wildcard).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
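
A rough sketch of the "specific name > wildcard" resolution and the CPU-limit stripping described in this commit, using simplified stand-in types rather than the operator API. `ContainerConstraint`, `resolveControlledValues`, `stripCPULimit`, the `"*"` wildcard spelling, and the use of `int64` quantities are all assumptions made for illustration; only the two enum value names come from the commit message.

```go
package main

// ControlledValues is a simplified stand-in for the operator's enum type.
type ControlledValues string

const (
	RequestsAndLimits                              ControlledValues = "RequestsAndLimits"
	CPURequestsRemoveLimitsMemoryRequestsAndLimits ControlledValues = "CPURequestsRemoveLimitsMemoryRequestsAndLimits"
)

// ContainerConstraint is a hypothetical, trimmed-down container constraint.
type ContainerConstraint struct {
	Name             string // container name, or wildcardName to match any container
	ControlledValues ControlledValues
}

// wildcardName is assumed to be "*"; the commit only says "wildcard".
const wildcardName = "*"

// resolveControlledValues returns the value from the constraint that names the
// container exactly, falling back to the first wildcard constraint if any.
func resolveControlledValues(constraints []ContainerConstraint, container string) (ControlledValues, bool) {
	var wildcard *ContainerConstraint
	for i := range constraints {
		c := &constraints[i]
		if c.Name == container {
			return c.ControlledValues, true // specific name always wins
		}
		if c.Name == wildcardName && wildcard == nil {
			wildcard = c
		}
	}
	if wildcard != nil {
		return wildcard.ControlledValues, true
	}
	return "", false
}

// stripCPULimit drops the CPU entry from recommended limits when the new
// controlled value is set, so no new CPU limit is ever pushed. Memory is untouched.
func stripCPULimit(limits map[string]int64, cv ControlledValues) {
	if cv == CPURequestsRemoveLimitsMemoryRequestsAndLimits {
		delete(limits, "cpu")
	}
}
```

Scanning the whole list before falling back keeps the result independent of constraint order, which seems to be the intent of "specific name > wildcard".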
…estsAndLimits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…val to pod patcher
Replace the `controlledValues` parameter threaded through `patchPod` /
`patchContainerResources` with a sentinel approach: `applyVerticalConstraints`
inserts `resource.MustParse("-1")` into `ContainerResources.Limits[cpu]` to
signal that any pre-existing CPU limit must be actively deleted from the live
pod. `patchContainerResources` detects the sentinel via `Cmp()` and deletes
the limit entry, keeping the function signatures clean.
The insertion is split into two phases: phase 1 deletes the CPU limit before
the clamping and `limits >= requests` invariant check; phase 2 inserts the
sentinel after the invariant check to prevent it from being overwritten.
`BuildStatus` is updated to call `ContainerResourcesForStatus()` (new helper
on `VerticalScalingValues`) which strips any negative-quantity limit entries
before writing to the DPA status, so the sentinel never leaks into the CRD.
Assisted-by: Claude:claude-sonnet-4-6
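
To make the sentinel flow in this commit easier to follow, here is a simplified Go sketch. It swaps `resource.Quantity` for plain `int64` values and uses hypothetical helper names (`markCPULimitForRemoval`, `patchLimits`, `limitsForStatus`), so it illustrates the idea only and does not reproduce `applyVerticalConstraints`, `patchContainerResources`, or `ContainerResourcesForStatus()`.

```go
package main

const cpu = "cpu"

// removeLimitSentinel plays the role of resource.MustParse("-1") in the commit:
// a negative limit means "delete this limit from the live pod".
const removeLimitSentinel int64 = -1

// markCPULimitForRemoval corresponds to phase 2: it runs after clamping and the
// limits >= requests check, so the sentinel cannot be overwritten by either.
func markCPULimitForRemoval(limits map[string]int64) map[string]int64 {
	if limits == nil {
		limits = map[string]int64{}
	}
	limits[cpu] = removeLimitSentinel
	return limits
}

// patchLimits applies recommended limits onto a live container's limits and
// reports whether anything changed. live must be a non-nil map.
func patchLimits(live, recommended map[string]int64) bool {
	changed := false
	for name, want := range recommended {
		if want < 0 { // sentinel: actively delete the existing limit
			if _, ok := live[name]; ok {
				delete(live, name)
				changed = true
			}
			continue
		}
		if got, ok := live[name]; !ok || got != want {
			live[name] = want
			changed = true
		}
	}
	return changed
}

// limitsForStatus returns a copy without negative entries, so the sentinel
// never reaches the status written back to the CRD.
func limitsForStatus(limits map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(limits))
	for name, v := range limits {
		if v >= 0 {
			out[name] = v
		}
	}
	return out
}
```

Encoding the removal in the data keeps the patcher's signature unchanged, at the cost of every consumer of the recommendation having to know that negative limits are not real values.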
…ernalBuilder

Introduce `autoscaling.workload.options.burstable` (default: true) as the lowest-priority fallback in the IsBurstable() priority chain:

spec.options.burstable > preview annotation > cluster config default

The config value is read once at startup in provider.go and injected into a *PodAutoscalerInternalBuilder, which is threaded through to all three PodAutoscalerInternal construction paths (NewFromKubernetes, NewFromSettings, NewFromProfile), replacing the former standalone constructor functions. This avoids per-reconcile config lookups and ensures a single shared builder instance across the controller, config retriever, and autoscaler syncer.

Also bump datadog-operator/api to v0.0.0-20260503193602-adf766128732.

Assisted-by: Claude:claude-sonnet-4-6
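
A minimal sketch of the builder idea from this commit: the cluster default is captured once and stamped onto every object the builder creates. The struct fields other than `clusterBurstableDefault`, the constructor `NewPodAutoscalerInternalBuilder`, and the single `New` method are hypothetical simplifications; the real builder exposes `NewFromKubernetes`, `NewFromSettings`, and `NewFromProfile`, whose signatures are not shown in this thread.

```go
package main

// PodAutoscalerInternal is a heavily trimmed stand-in for the real model type.
type PodAutoscalerInternal struct {
	name                    string
	specBurstable           *bool // spec.options.burstable; nil when unset
	annotationBurstable     *bool // preview annotation; nil when unset
	clusterBurstableDefault bool  // injected by the builder
}

// PodAutoscalerInternalBuilder holds configuration read once at startup.
type PodAutoscalerInternalBuilder struct {
	clusterBurstableDefault bool
}

// NewPodAutoscalerInternalBuilder is a hypothetical constructor; the caller is
// expected to read autoscaling.workload.options.burstable once and pass it in.
func NewPodAutoscalerInternalBuilder(clusterBurstableDefault bool) *PodAutoscalerInternalBuilder {
	return &PodAutoscalerInternalBuilder{clusterBurstableDefault: clusterBurstableDefault}
}

// New stands in for the three construction paths named in the commit.
func (b *PodAutoscalerInternalBuilder) New(name string, specBurstable, annotationBurstable *bool) PodAutoscalerInternal {
	return PodAutoscalerInternal{
		name:                    name,
		specBurstable:           specBurstable,
		annotationBurstable:     annotationBurstable,
		clusterBurstableDefault: b.clusterBurstableDefault,
	}
}

// IsBurstable follows the chain from the commit message:
// spec.options.burstable > preview annotation > cluster config default.
func (p PodAutoscalerInternal) IsBurstable() bool {
	if p.specBurstable != nil {
		return *p.specBurstable
	}
	if p.annotationBurstable != nil {
		return *p.annotationBurstable
	}
	return p.clusterBurstableDefault
}
```

Because the default lives on the object, `IsBurstable()` needs no config lookup at reconcile time, which matches the motivation stated in the commit.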
cd2fa28 to 6a083de
🎯 Code Coverage (details)
🔗 Commit SHA: bbaa1d6

Static quality checks
✅ Please find below the results from static quality gates
Successful checks: 29 successful checks with minimal change (< 2 KiB)
When `spec.options.burstable` and the preview annotation are both unset, the effective burstable mode now falls back to `false` for pods that are in Guaranteed QOS (requests == limits for all containers), rather than propagating the cluster-level default. This prevents the autoscaler from silently changing a pod's QOS class by removing its CPU limit when the user has not explicitly requested burstable mode.

Priority chain in IsBurstable():
1. spec.options.burstable (explicit, highest)
2. preview annotation (explicit via profile)
3. podsGuaranteedQOS=true → false (new: protects Guaranteed QOS)
4. cluster-level default (lowest)

Implementation:
- Add podsGuaranteedQOS bool field to PodAutoscalerInternal and a SetPodsGuaranteedQOS setter called by the vertical controller each sync cycle after fetching the current pods.
- Add isPodsGuaranteedQOS helper that returns true only when every pod for the workload has QOSClass == Guaranteed.
- Move the pod fetch before applyVerticalConstraints in sync() so the QOS state is available when IsBurstable() is first consulted.

Assisted-by: Claude:claude-sonnet-4-6
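
The four-level chain in this commit can be sketched as a pure function. Both function names below are hypothetical, QOS classes are passed as plain strings rather than pod objects, and returning `false` for an empty pod list is an assumption of this sketch: the commit only says the helper returns true "only when every pod" is Guaranteed.

```go
package main

// qosGuaranteed is the string value Kubernetes reports for the Guaranteed QOS class.
const qosGuaranteed = "Guaranteed"

// allPodsGuaranteed reports whether every listed pod is in Guaranteed QOS.
// Assumption: with no pods there is nothing to protect, so it returns false.
func allPodsGuaranteed(qosClasses []string) bool {
	if len(qosClasses) == 0 {
		return false
	}
	for _, q := range qosClasses {
		if q != qosGuaranteed {
			return false
		}
	}
	return true
}

// effectiveBurstable walks the priority chain from highest to lowest.
func effectiveBurstable(spec, annotation *bool, podsGuaranteedQOS, clusterDefault bool) bool {
	switch {
	case spec != nil: // 1. spec.options.burstable (explicit)
		return *spec
	case annotation != nil: // 2. preview annotation (explicit via profile)
		return *annotation
	case podsGuaranteedQOS: // 3. protect Guaranteed QOS pods
		return false
	default: // 4. cluster-level default
		return clusterDefault
	}
}
```

The Guaranteed QOS check sits below both explicit settings, so a user who opts in to burstable mode still gets the CPU limit removed; only the implicit cluster default is overridden.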
Summary
1. `CPURequestsRemoveLimitsMemoryRequestsAndLimits` controlled value

Implements the new `CPURequestsRemoveLimitsMemoryRequestsAndLimits` container `controlledValues` enum introduced in the datadog-operator (c59fc90).

When a container constraint sets this value, the autoscaler applies different strategies per resource:
- CPU: request recommendation applied; existing CPU limits actively removed from the live pod
- Memory: both requests and limits are controlled (`RequestsAndLimits` semantics)

Changes:
- `controller_vertical_helpers.go`: strip CPU limit from the recommendation in `applyVerticalConstraints` when `CPURequestsRemoveLimitsMemoryRequestsAndLimits` is set
- `pod_patcher.go`: add `getContainerControlledValues` helper; pass `controlledValues` down to `patchContainerResources`, which actively deletes any pre-existing CPU limit from the live pod

2. Cluster-level burstable default via `PodAutoscalerInternalBuilder`

Introduces `autoscaling.workload.options.burstable` (default: `true`) as the lowest-priority fallback in the `IsBurstable()` priority chain: `spec.options.burstable` > preview annotation > cluster config default.

The config value is read once at startup in `provider.go` and injected into a `*PodAutoscalerInternalBuilder`, which is threaded through to all three `PodAutoscalerInternal` construction paths (`NewFromKubernetes`, `NewFromSettings`, `NewFromProfile`), replacing the former standalone constructor functions. This avoids per-reconcile config lookups and ensures a single shared builder instance across the controller, config retriever, and autoscaler syncer.

Changes:
- `model/pod_autoscaler.go`: add `clusterBurstableDefault` field; introduce `PodAutoscalerInternalBuilder` with `NewFromKubernetes`, `NewFromSettings`, `NewFromProfile` methods
- `config/setup/common_settings.go`: register `autoscaling.workload.options.burstable` with default `true`
- `provider/provider.go`: create builder once from config, pass to all construction sites
- `controller.go`, `config_retriever.go`, `config_retriever_settings.go`, `profile/autoscaler_syncer.go`: accept and store `*PodAutoscalerInternalBuilder`

3. Dependency bump
- `go.mod`: bump `datadog-operator/api` to `v0.0.0-20260503193602-adf766128732`

Test plan
- `TestApplyVerticalConstraints_CPURequestsRemoveLimits`: CPU limit stripped from recommendation, memory limit preserved
- `TestPatchContainerResources`: CPU limit removed from pod, idempotent when already absent
- `TestPatchPod`: end-to-end CPU limit removal on a live pod container
- `TestIsBurstable`: priority chain: spec > annotation > cluster default
- `TestNewPodAutoscalerFromProfile` / `TestUpdateFromProfile`: builder wiring and annotation toggling
- `./pkg/clusteragent/autoscaling/workload/...`

🤖 PR description and code assisted by Claude:claude-sonnet-4-6