Add analyze-quality-gate-security-mean-fs-load skill #50375
preinlein wants to merge 1 commit into paul.reinlein/explain-lading
Files inventory check summary
File checks results against ancestor 8f67edd8:
Results for datadog-agent_7.80.0~devel.git.486.d1bab2e.pipeline.111593521-1_amd64.deb:
No change detected

Static quality checks
✅ Please find below the results from static quality gates
32 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: 8f67edd
Optimization Goals: ✅ No significant changes detected

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.05 | [-1.87, +3.96] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.05 | [-1.87, +3.96] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.78 | [+0.72, +0.85] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.67 | [+0.42, +0.92] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_security_idle | memory utilization | +0.26 | [+0.19, +0.33] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.25 | [+0.21, +0.29] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_security_no_fs_load | memory utilization | +0.24 | [+0.15, +0.34] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.22 | [+0.03, +0.42] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.21 | [+0.03, +0.39] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.20 | [+0.10, +0.30] | 1 | Logs |
| ➖ | quality_gate_security_mean_fs_load | memory utilization | +0.17 | [+0.13, +0.20] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.07 | [-0.36, +0.50] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.07 | [-0.08, +0.23] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.51, +0.53] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.01 | [-0.19, +0.22] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.08, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.03 | [-0.17, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.04 | [-0.26, +0.18] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.04 | [-0.44, +0.36] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.09 | [-0.33, +0.15] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.15 | [-0.31, +0.01] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.27 | [-0.32, -0.22] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.43 | [-0.48, -0.38] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -0.74 | [-1.73, +0.24] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.75 | [-0.96, -0.55] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.92 | [-1.04, -0.81] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 576 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 241.85MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 723 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 140.34MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 471.50MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 174.58MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 386.83 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 384.29MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_security_idle | cpu_usage | 10/10 | 24.93 ≤ 40 | bounds checks dashboard |
| ✅ | quality_gate_security_idle | memory_usage | 10/10 | 287.44MiB ≤ 330MiB | bounds checks dashboard |
| ✅ | quality_gate_security_mean_fs_load | cpu_usage | 10/10 | 55.12 ≤ 70 | bounds checks dashboard |
| ✅ | quality_gate_security_mean_fs_load | memory_usage | 10/10 | 265.17MiB ≤ 320MiB | bounds checks dashboard |
| ✅ | quality_gate_security_no_fs_load | cpu_usage | 10/10 | 20.94 ≤ 40 | bounds checks dashboard |
| ✅ | quality_gate_security_no_fs_load | memory_usage | 10/10 | 274.20MiB ≤ 320MiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
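The three criteria above can be sketched as a small predicate. This is only an illustration of the stated rule, not the Regression Detector's actual code: the `ExperimentResult` type, its field names, and `is_regression` are hypothetical names introduced here, and the 5.00% tolerance is taken from the explanation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    """One row of the results table (hypothetical structure for illustration)."""
    name: str
    delta_mean_pct: float   # the "Δ mean %" column
    ci_low: float           # lower bound of "Δ mean % CI"
    ci_high: float          # upper bound of "Δ mean % CI"
    erratic: bool = False   # whether the experiment's config marks it "erratic"


def is_regression(result: ExperimentResult, effect_size_tolerance: float = 5.0) -> bool:
    """Return True only if all three criteria from the explanation hold."""
    big_enough = abs(result.delta_mean_pct) >= effect_size_tolerance
    # The interval excludes zero when it lies entirely on one side of it.
    ci_excludes_zero = result.ci_low > 0.0 or result.ci_high < 0.0
    return big_enough and ci_excludes_zero and not result.erratic


# Two rows copied from the table above; neither meets the effect size tolerance.
rows = [
    ExperimentResult("docker_containers_cpu", 1.05, -1.87, 3.96),
    ExperimentResult("otlp_ingest_logs", -0.92, -1.04, -0.81),
]
for row in rows:
    print(row.name, is_regression(row))
```

Note that `otlp_ingest_logs` has an interval that excludes zero, yet it is not flagged because its |Δ mean %| is well under 5.00%; the criteria are conjunctive.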
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_idle, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_security_no_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_no_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_mean_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_mean_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e9a34d9f73
```
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load} by {job_id}.as_rate()' --from 1d --to now
```

For each returned series, read its `scope` (contains `job_id:<UUID>`) and its `pointlist`. Null-safe: drop points whose value is `None`. Per job, record the timestamp of the **last non-zero point**. The `job_id` with the most recent last-non-zero timestamp is "the latest job" for that experiment.
This part seems potentially scriptable in the future. I don't think it's worth blocking on writing a script for this operation -- I'd rather have the skill in the repo sooner -- but worth noting as a potential variance-reducing optimization.
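As a rough idea of what such a script might look like, here is a minimal sketch of the latest-job selection described in the quoted step. It assumes each series has already been parsed into a dict with a `scope` string and a `pointlist` of `[timestamp_ms, value]` pairs; that shape, and the helper names `last_nonzero_ts` and `latest_job`, are assumptions for illustration and are not taken from the `pup` tool or the skill itself.

```python
import re

# Pull the job_id tag value out of a comma-separated scope string.
JOB_ID_RE = re.compile(r"job_id:([^,\s]+)")


def last_nonzero_ts(pointlist):
    """Timestamp of the last non-zero point, ignoring None values; None if there is none."""
    kept = [ts for ts, value in pointlist if value is not None and value != 0]
    return max(kept) if kept else None


def latest_job(series_list):
    """Return (job_id, last_nonzero_ts) for the job with the most recent non-zero point."""
    best = None
    for series in series_list:
        match = JOB_ID_RE.search(series.get("scope", ""))
        if match is None:
            continue  # series without a job_id tag cannot be attributed to a job
        ts = last_nonzero_ts(series.get("pointlist", []))
        if ts is None:
            continue  # job never reported a non-zero value in this window
        if best is None or ts > best[1]:
            best = (match.group(1), ts)
    return best
```

Running the same function once per experiment (`mean_fs_load` and `no_fs_load`) would give the two job IDs that the rest of the procedure pins its analysis to.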
When reporting `capture_first_ts` / `capture_last_ts`, do not compute ISO strings by hand; use:

```bash
python3 -c "from datetime import datetime, timezone; print(datetime.fromtimestamp($(( MS / 1000 )), timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))"
```
+1; Python is probably more cross-platform than `date` from GNU Coreutils.
### 3c. Sanity checks

- If fewer than ~5 non-zero points come back for a latest job, the capture may be in flight or interrupted. Try the second-most-recent `job_id` and report both so the reviewer can judge.
- If the `mean_fs_load` and `no_fs_load` latest `job_id`s ran more than 24 h apart, flag the staleness; the background noise floor drifts over time.
The background noise floor drift is an interesting observation. I wonder if we can model that in the future.
- Never fall back to wider windows with coarser rollups (≥300 s intervals) as a substitute; rollup gap-fill deflates per-job means when a capture only partially occupies a bucket.
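The sanity checks quoted above could be expressed as a small validation helper. This is a sketch under stated assumptions, not part of the skill: points are assumed to be `[timestamp_ms, value]` pairs, the "~5" threshold is treated as exactly 5, the time a job "ran" is approximated by its last non-zero timestamp, and all names (`sanity_warnings`, the constants) are hypothetical.

```python
MIN_NONZERO_POINTS = 5                   # "fewer than ~5 non-zero points" (approximated as 5)
MAX_JOB_GAP_MS = 24 * 60 * 60 * 1000     # "more than 24 h apart"
COARSE_ROLLUP_INTERVAL_S = 300           # rollups at or above this interval are rejected


def nonzero_points(pointlist):
    """Keep only points whose value is neither None nor zero."""
    return [(ts, value) for ts, value in pointlist if value is not None and value != 0]


def sanity_warnings(mean_fs_points, no_fs_points, rollup_interval_s):
    """Return human-readable warnings for the latest mean_fs_load / no_fs_load captures."""
    warnings = []
    last_seen = {}
    for label, points in (("mean_fs_load", mean_fs_points), ("no_fs_load", no_fs_points)):
        kept = nonzero_points(points)
        if len(kept) < MIN_NONZERO_POINTS:
            warnings.append(
                f"{label}: only {len(kept)} non-zero points; capture may be in flight or interrupted"
            )
        if kept:
            last_seen[label] = max(ts for ts, _ in kept)
    if len(last_seen) == 2:
        gap_ms = abs(last_seen["mean_fs_load"] - last_seen["no_fs_load"])
        if gap_ms > MAX_JOB_GAP_MS:
            warnings.append("latest jobs ran more than 24 h apart; noise floor may have drifted")
    if rollup_interval_s >= COARSE_ROLLUP_INTERVAL_S:
        warnings.append(
            f"rollup interval {rollup_interval_s} s is too coarse; per-job means may be deflated"
        )
    return warnings
```

Returning warnings rather than raising keeps the decision with the reviewer, which matches the "report both so the reviewer can judge" wording of the first check.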
What does this PR do?
Adds a new Claude Code skill, `analyze-quality-gate-security-mean-fs-load`, that compares production CWS `perf_buffer.events.write` open rates against the `quality_gate_security_mean_fs_load` SMP experiment, using `quality_gate_security_no_fs_load` as the no-load baseline, and proposes lading config changes when a mismatch is found.

Motivation
The `quality_gate_security_mean_fs_load` lading config should reflect production filesystem workload so the SMP regression detector catches real-world CWS overhead. This skill formalizes the comparison procedure (pinning analysis to a specific SMP `job_id`, subtracting the no-load noise floor, and proposing concrete YAML changes) so that tuning decisions are traceable and reproducible across sessions.

Describe how you validated your changes
Skill is read-only documentation that drives `pup metrics query` and the existing `explain-lading-config` skill; no code paths are exercised. Validated by inspection of the procedure against the two regression cases under `test/regression/cases/quality_gate_security_{mean,no}_fs_load/`.

Additional Notes
@DataDog/single-machine-performance and @DataDog/agent-security.