Add CWS Quality Gates by preinlein · Pull Request #48139 · DataDog/datadog-agent

preinlein · 2026-03-20T17:34:21Z

What does this PR do?

Adds three SMP regression quality gates for CWS (Workload Protection) under test/regression/cases/:

quality_gate_security_idle — CWS on, shipped default policy, no lading generator. Baseline "turn it on and leave it alone" floor.
quality_gate_security_no_fs_load — CWS on, experiment-supplied default.policy, no generator. Isolates policy + approver overhead at zero application load.
quality_gate_security_mean_fs_load — CWS on, same experiment default.policy, file_tree generator sized to org2's per-host mean perf_buffer.events.write rate.

Each case enforces a memory and CPU bound.

Also adds two Claude skills used to author and maintain these gates:

.claude/skills/explain-lading-config/SKILL.md
.claude/skills/analyze-quality-gate-security-mean-fs-load-experiment/SKILL.md — compares lading config vs. SMP capture vs. production event_type:open for the mean-FS-load gate

Removes the file_tree experiment:

It was not a quality gate
It had no bounds checks
It lacked a readme
It tests functionality that overlaps with these quality gates
It wasn't really maintained (all updates are from SMP and not really related to what's being tested)

Motivation

CWS lacks quality-gate coverage in SMP. The three gates form a ladder — idle → no-load-with-policy → mean-FS-load — so a memory regression can be attributed to policy loading, approver overhead, or event-processing overhead rather than lumped together. The mean-FS-load rate (open_per_second: 41) reflects the org2 per-host weekly mean for perf_buffer.events.write so the gate tracks production rather than an arbitrary stressor.
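To make the ladder concrete, here is a minimal sketch of the attribution arithmetic. The numbers are the cpu_usage and memory_usage observed values reported by the bounds checks in the Regression Detector comment on this PR. The helper name `rung_deltas`, and the reading of each delta as "the cost of one rung", are illustrative assumptions and not part of SMP.

```python
# Observed values copied from the bounds-check table on this PR.
# Memory is in MiB; the CPU unit is whatever the cpu_usage bounds check reports.
LADDER = [
    ("quality_gate_security_idle", {"cpu_usage": 24.93, "memory_mib": 285.01}),
    ("quality_gate_security_no_fs_load", {"cpu_usage": 23.14, "memory_mib": 274.91}),
    ("quality_gate_security_mean_fs_load", {"cpu_usage": 53.96, "memory_mib": 269.44}),
]


def rung_deltas(ladder):
    """Return the change in each metric between consecutive rungs of the ladder."""
    deltas = []
    for (prev_name, prev), (name, cur) in zip(ladder, ladder[1:]):
        change = {metric: round(cur[metric] - prev[metric], 2) for metric in cur}
        deltas.append((f"{prev_name} -> {name}", change))
    return deltas


for step, change in rung_deltas(LADDER):
    print(step, change)
```

In this single run, nearly all of the CPU difference appears on the no-load to mean-load step, which is the kind of attribution the ladder is meant to allow.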

Describe how you validated your changes

Ran the three experiments via SMP and iterated on agent config & bounds
Used /analyze-quality-gate-security-mean-fs-load-experiment to confirm the lading-configured open rate, the SMP-captured open rate, and production per-host weekly open mean match within noise.
Ran the quality gates many times, using this notebook to verify what I was seeing.

The majority of the work involved understanding how things work: what the CWS SLOs are, what they measure, what that is/means in production, configuring the agent to properly exercise the load we're putting on it, etc.

Additional Notes

file_tree max_nodes: 500000 is sized to exceed open_per_second × run_duration so the tree never exhausts mid-capture (once all nodes exist, subsequent opens are O_RDONLY and rejected by CWS flag approvers).
activity_dump.enabled: false in both security-agent.yaml and system-probe.yaml — activity dump is being reworked and adds unpredictable kernel-event volume.
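The max_nodes sizing rule can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes each open creates exactly one new node until the tree is full; the function names and the 600-second run duration are hypothetical and only serve the example.

```python
def seconds_until_exhaustion(max_nodes: int, open_per_second: float) -> float:
    """Seconds of sustained load before every node in the tree has been created.

    Assumes each open creates one new node until max_nodes is reached.
    """
    if open_per_second <= 0:
        raise ValueError("open_per_second must be positive")
    return max_nodes / open_per_second


def tree_outlasts_run(max_nodes: int, open_per_second: float, run_duration_seconds: float) -> bool:
    """True when max_nodes exceeds open_per_second x run_duration."""
    return max_nodes > open_per_second * run_duration_seconds


headroom = seconds_until_exhaustion(500_000, 41)
print(f"{headroom:.0f} s (~{headroom / 3600:.1f} h) before the tree is exhausted")

# Hypothetical run duration; substitute the real capture length.
print(tree_outlasts_run(500_000, 41, run_duration_seconds=600))
```

With max_nodes: 500000 and open_per_second: 41, the tree lasts a little over three hours under this assumption, so any capture shorter than that never reaches the O_RDONLY regime.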

preinlein · 2026-03-20T17:36:15Z

+- **Compliance Config** (CSPM host benchmarks)
+- **SBOM Scanning** (host vulnerability management)
+
+## Owners


This is a new concept for QGs.

preinlein · 2026-03-20T17:36:55Z

+
+- Memory usage is below a threshold
+
+## Additional Information


This is a new concept for QGs.

preinlein · 2026-03-20T17:37:31Z

+
+The emitted metric from SMP should have a similar value to the production data we source.
+
+### Verifying the Experiment Configuration


This is a new concept for QGs. This will allow us to close the loop and maintain QGs in a way that remains representative of real usage.

preinlein · 2026-03-20T17:52:19Z

 /test/benchmarks/apm_scripts/                 @DataDog/agent-apm
 /test/regression/                             @DataDog/single-machine-performance
 /test/regression/cases/docker_containers*     @DataDog/single-machine-performance @DataDog/container-integrations
+/test/regression/cases/quality_gate_security_base* @DataDog/single-machine-performance @DataDog/agent-security


We would want to drop SMP as an owner eventually (maybe after approval?)

agent-platform-auto-pr · 2026-03-20T18:10:21Z

Files inventory check summary

File checks results against ancestor baafab8c:

Results for datadog-agent_7.80.0~devel.git.462.fe74a2b.pipeline.111329495-1_amd64.deb:

No change detected

agent-platform-auto-pr · 2026-03-20T18:17:45Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor baafab8
📊 Static Quality Gates Dashboard
🔗 SQG Job

32 successful checks with minimal change (< 2 KiB)

	Quality gate	Current Size
✅	agent_deb_amd64	740.960 MiB
✅	agent_deb_amd64_fips	699.144 MiB
✅	agent_heroku_amd64	309.076 MiB
✅	agent_rpm_amd64	740.943 MiB
✅	agent_rpm_amd64_fips	699.127 MiB
✅	agent_rpm_arm64	719.018 MiB
✅	agent_rpm_arm64_fips	680.293 MiB
✅	agent_suse_amd64	740.943 MiB
✅	agent_suse_amd64_fips	699.127 MiB
✅	agent_suse_arm64	719.018 MiB
✅	agent_suse_arm64_fips	680.293 MiB
✅	docker_agent_amd64	801.337 MiB
✅	docker_agent_arm64	804.239 MiB
✅	docker_agent_jmx_amd64	992.257 MiB
✅	docker_agent_jmx_arm64	983.938 MiB
✅	docker_cluster_agent_amd64	206.602 MiB
✅	docker_cluster_agent_arm64	220.634 MiB
✅	docker_cws_instrumentation_amd64	7.142 MiB
✅	docker_cws_instrumentation_arm64	6.689 MiB
✅	docker_host_profiler_amd64	301.106 MiB
✅	docker_host_profiler_arm64	312.618 MiB
✅	docker_dogstatsd_amd64	39.370 MiB
✅	docker_dogstatsd_arm64	37.565 MiB
✅	dogstatsd_deb_amd64	30.024 MiB
✅	dogstatsd_deb_arm64	28.169 MiB
✅	dogstatsd_rpm_amd64	30.024 MiB
✅	dogstatsd_suse_amd64	30.024 MiB
✅	iot_agent_deb_amd64	44.454 MiB
✅	iot_agent_deb_arm64	41.439 MiB
✅	iot_agent_deb_armhf	42.179 MiB
✅	iot_agent_rpm_amd64	44.455 MiB
✅	iot_agent_suse_amd64	44.455 MiB

On-wire sizes (compressed)

	Quality gate	Change	Size in MiB (prev → curr → max; neutral rows show curr → max)
✅	agent_deb_amd64	-14.85 KiB (0.01% reduction)	175.287 → 175.273 → 179.160
✅	agent_deb_amd64_fips	-15.8 KiB (0.01% reduction)	166.997 → 166.981 → 174.440
✅	agent_heroku_amd64	neutral	74.953 MiB → 80.310
✅	agent_rpm_amd64	-6.07 KiB (0.00% reduction)	177.300 → 177.294 → 182.080
✅	agent_rpm_amd64_fips	-18.58 KiB (0.01% reduction)	168.379 → 168.361 → 174.140
✅	agent_rpm_arm64	-25.2 KiB (0.02% reduction)	159.389 → 159.364 → 163.610
✅	agent_rpm_arm64_fips	-23.09 KiB (0.01% reduction)	151.730 → 151.707 → 156.850
✅	agent_suse_amd64	-6.07 KiB (0.00% reduction)	177.300 → 177.294 → 182.080
✅	agent_suse_amd64_fips	-18.58 KiB (0.01% reduction)	168.379 → 168.361 → 174.140
✅	agent_suse_arm64	-25.2 KiB (0.02% reduction)	159.389 → 159.364 → 163.610
✅	agent_suse_arm64_fips	-23.09 KiB (0.01% reduction)	151.730 → 151.707 → 156.850
✅	docker_agent_amd64	neutral	267.696 MiB → 272.990
✅	docker_agent_arm64	+2.13 KiB (0.00% increase)	254.717 → 254.719 → 261.470
✅	docker_agent_jmx_amd64	neutral	336.355 MiB → 341.610
✅	docker_agent_jmx_arm64	-2.65 KiB (0.00% reduction)	319.364 → 319.362 → 326.050
✅	docker_cluster_agent_amd64	neutral	72.415 MiB → 73.460
✅	docker_cluster_agent_arm64	neutral	67.867 MiB → 68.680
✅	docker_cws_instrumentation_amd64	neutral	2.999 MiB → 3.330
✅	docker_cws_instrumentation_arm64	neutral	2.729 MiB → 3.090
✅	docker_host_profiler_amd64	neutral	110.750 MiB → 125.600
✅	docker_host_profiler_arm64	neutral	105.078 MiB → 120.000
✅	docker_dogstatsd_amd64	neutral	15.237 MiB → 15.870
✅	docker_dogstatsd_arm64	neutral	14.555 MiB → 14.890
✅	dogstatsd_deb_amd64	neutral	7.943 MiB → 8.830
✅	dogstatsd_deb_arm64	neutral	6.827 MiB → 7.750
✅	dogstatsd_rpm_amd64	neutral	7.951 MiB → 8.840
✅	dogstatsd_suse_amd64	neutral	7.951 MiB → 8.840
✅	iot_agent_deb_amd64	neutral	11.703 MiB → 13.210
✅	iot_agent_deb_arm64	neutral	9.998 MiB → 11.620
✅	iot_agent_deb_armhf	neutral	10.204 MiB → 11.780
✅	iot_agent_rpm_amd64	-2.81 KiB (0.02% reduction)	11.719 → 11.716 → 13.230
✅	iot_agent_suse_amd64	-2.81 KiB (0.02% reduction)	11.719 → 11.716 → 13.230

cit-pr-commenter-54b7da · 2026-03-20T18:36:32Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 1d3e4760-58d5-4ba8-8347-593dca98d57b

Baseline: baafab8
Comparison: fe74a2b
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	-0.54	[-3.41, +2.34]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	tcp_syslog_to_blackhole	ingress throughput	+0.62	[+0.45, +0.80]	1	Logs
➖	ddot_logs	memory utilization	+0.49	[+0.43, +0.55]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.48	[+0.44, +0.52]	1	Logs bounds checks dashboard
➖	docker_containers_memory	memory utilization	+0.46	[+0.36, +0.56]	1	Logs
➖	quality_gate_security_no_fs_load	memory utilization	+0.38	[+0.28, +0.48]	1	Logs bounds checks dashboard
➖	otlp_ingest_logs	memory utilization	+0.37	[+0.27, +0.48]	1	Logs
➖	ddot_metrics	memory utilization	+0.29	[+0.09, +0.49]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	+0.18	[-0.00, +0.36]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	+0.15	[-0.00, +0.31]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.06	[-0.51, +0.62]	1	Logs
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	+0.03	[-0.20, +0.27]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.02	[-0.38, +0.43]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	+0.00	[-0.20, +0.21]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.20, +0.20]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.01	[-0.10, +0.09]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.03	[-0.16, +0.10]	1	Logs
➖	otlp_ingest_metrics	memory utilization	-0.04	[-0.20, +0.12]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.06	[-0.51, +0.39]	1	Logs
➖	quality_gate_security_idle	memory utilization	-0.08	[-0.15, -0.01]	1	Logs bounds checks dashboard
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	-0.13	[-0.18, -0.07]	1	Logs
➖	quality_gate_idle	memory utilization	-0.27	[-0.33, -0.22]	1	Logs bounds checks dashboard
➖	quality_gate_security_mean_fs_load	memory utilization	-0.29	[-0.32, -0.25]	1	Logs bounds checks dashboard
➖	quality_gate_metrics_logs	memory utilization	-0.40	[-0.65, -0.15]	1	Logs bounds checks dashboard
➖	docker_containers_cpu	% cpu utilization	-0.54	[-3.41, +2.34]	1	Logs
➖	quality_gate_logs	% cpu utilization	-1.72	[-2.68, -0.75]	1	Logs bounds checks dashboard

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	observed_value	links
✅	docker_containers_cpu	simple_check_run	10/10	697 ≥ 26
✅	docker_containers_memory	memory_usage	10/10	244.00MiB ≤ 370MiB
✅	docker_containers_memory	simple_check_run	10/10	727 ≥ 26
✅	file_to_blackhole_0ms_latency	memory_usage	10/10	0.16GiB ≤ 1.20GiB
✅	file_to_blackhole_0ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10	0.21GiB ≤ 1.20GiB
✅	file_to_blackhole_1000ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_100ms_latency	memory_usage	10/10	0.17GiB ≤ 1.20GiB
✅	file_to_blackhole_100ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_500ms_latency	memory_usage	10/10	0.19GiB ≤ 1.20GiB
✅	file_to_blackhole_500ms_latency	missed_bytes	10/10	0B = 0B
✅	quality_gate_idle	intake_connections	10/10	3 ≤ 4	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	143.42MiB ≤ 147MiB	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	3 ≤ 4	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	469.66MiB ≤ 495MiB	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	178.42MiB ≤ 195MiB	bounds checks dashboard
✅	quality_gate_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	346.97 ≤ 2000	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	369.79MiB ≤ 430MiB	bounds checks dashboard
✅	quality_gate_metrics_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_security_idle	cpu_usage	10/10	24.93 ≤ 40	bounds checks dashboard
✅	quality_gate_security_idle	memory_usage	10/10	285.01MiB ≤ 330MiB	bounds checks dashboard
❌	quality_gate_security_mean_fs_load	cpu_usage	0/10	53.96 > 40	bounds checks dashboard
✅	quality_gate_security_mean_fs_load	memory_usage	10/10	269.44MiB ≤ 320MiB	bounds checks dashboard
✅	quality_gate_security_no_fs_load	cpu_usage	10/10	23.14 ≤ 40	bounds checks dashboard
✅	quality_gate_security_no_fs_load	memory_usage	10/10	274.91MiB ≤ 320MiB	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
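The three criteria above can be written as a small predicate. This is a sketch of the rule as stated in this comment, not the Regression Detector's actual code; the `ExperimentResult` type and `is_regression` name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    name: str
    delta_mean_pct: float
    ci_low: float
    ci_high: float
    erratic: bool = False


def is_regression(result: ExperimentResult, effect_size_tolerance: float = 5.0) -> bool:
    """Apply the three stated criteria; all must hold."""
    big_enough = abs(result.delta_mean_pct) >= effect_size_tolerance
    ci_excludes_zero = not (result.ci_low <= 0.0 <= result.ci_high)
    return big_enough and ci_excludes_zero and not result.erratic


# Rows taken from the tables above.
logs = ExperimentResult("quality_gate_logs", -1.72, -2.68, -0.75)
docker_cpu = ExperimentResult("docker_containers_cpu", -0.54, -3.41, 2.34, erratic=True)

print(is_regression(logs))        # CI excludes zero, but the effect is below 5%
print(is_regression(docker_cpu))  # erratic, and the CI contains zero
```

Note that quality_gate_logs has a confidence interval that excludes zero yet is still not flagged, because its effect size is under the 5.00% tolerance.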

CI Pass/Fail Decision

❌ Failed. Some Quality Gates were violated.

quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_security_no_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_security_no_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_security_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_security_idle, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_security_mean_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_security_mean_fs_load, bounds check cpu_usage: 0/10 replicas passed. Failed 10 which is > 0. Gate FAILED.
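The per-gate decision shown in this list can be sketched as follows. The zero-failure threshold is inferred from the message "Failed 10 which is > 0"; the function names and the `allowed_failures` parameter are hypothetical and are not taken from the SMP implementation.

```python
def within_upper_bound(observed: float, bound: float) -> bool:
    """A single replicate passes an upper-bound check when observed <= bound."""
    return observed <= bound


def gate_passed(replicates_passed: int, replicates_total: int, allowed_failures: int = 0) -> bool:
    """A gate passes only when the number of failed replicates is within the allowance."""
    failed = replicates_total - replicates_passed
    return failed <= allowed_failures


# quality_gate_security_mean_fs_load, cpu_usage: observed 53.96 against a bound of 40.
print(within_upper_bound(53.96, 40))   # the reported mean exceeds the bound
print(gate_passed(0, 10))              # 0/10 replicates passed
print(gate_passed(10, 10))             # 10/10 replicates passed
```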

preinlein · 2026-04-21T18:51:49Z

This will likely need some iteration after merge. Right now this works well considering I'm manually triggering the Quality Gates.

Ideally, we use data from regression detector runs off of main (ideally with some kind of tag that we can filter on) and nothing from CI itself as we don't want PR data to influence things.

preinlein · 2026-04-21T18:52:48Z

 /test/benchmarks/apm_scripts/                 @DataDog/agent-apm
 /test/regression/                             @DataDog/single-machine-performance
 /test/regression/cases/docker_containers*     @DataDog/single-machine-performance @DataDog/container-integrations
+/test/regression/cases/quality_gate_security_* @DataDog/single-machine-performance @DataDog/agent-security


Let me know if you folks want exclusive ownership of the quality gates. I started with joint ownership but I'd love to be able to fully hand this over.
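For readers comparing this line with the earlier revision quoted in a previous review comment (`quality_gate_security_base*`), the sketch below shows which of the three gate directories each pattern would cover. CODEOWNERS uses gitignore-style matching, which Python's `fnmatchcase` only approximates, so treat this as an illustration rather than a faithful reimplementation.

```python
from fnmatch import fnmatchcase

# The three gate directories added by this PR.
CASES = [
    "/test/regression/cases/quality_gate_security_idle",
    "/test/regression/cases/quality_gate_security_no_fs_load",
    "/test/regression/cases/quality_gate_security_mean_fs_load",
]

OLD_PATTERN = "/test/regression/cases/quality_gate_security_base*"
NEW_PATTERN = "/test/regression/cases/quality_gate_security_*"


def matched(pattern, paths):
    """Return the paths covered by a glob pattern (approximation of CODEOWNERS rules)."""
    return [path for path in paths if fnmatchcase(path, pattern)]


print(matched(OLD_PATTERN, CASES))
print(matched(NEW_PATTERN, CASES))
```

Under this approximation the older pattern covers none of the three gates, while the current one covers all of them.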

preinlein · 2026-04-21T18:54:08Z

+      # Must exceed open_per_second × run_duration_seconds so that lading never
+      # exhausts the tree during a capture. Once all nodes exist on disk, opens
+      # become O_RDONLY and are rejected by CWS kernel-side flag approvers.
+      max_nodes: 500000


This is something I'm going to follow up on. We'll need to change how file system load gets generated so that there's a way to continuously generate unique files instead of using a cached set of data (I can elaborate if folks have questions).

Right now, this is a workaround that works just fine. Just not ideal.

preinlein · 2026-04-21T18:54:50Z

+      # become O_RDONLY and are rejected by CWS kernel-side flag approvers.
+      max_nodes: 500000
+      open_per_second: 41
+      rename_per_second: 1


Ideally I'd like to get this to 0 since opens are the majority of traffic. This is a limitation in lading's current implementation.

I'll be following up on this as well.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2226bd8fdd

goxberry · 2026-04-22T23:15:56Z

@@ -0,0 +1,87 @@
+---
+name: explain-lading-config


Should there be a skill in Lading that this skill delegates to? Is that possible to do across repositories, given that a prerequisite for this skill seems to be a clone of the Lading repository?

preinlein

Thanks for the review @goxberry 🙇

preinlein · 2026-04-23T11:57:36Z

@@ -0,0 +1,87 @@
+---
+name: explain-lading-config


Should there be a skill in Lading that this skill delegates to?
Yes. 100% agree.

Is that possible to do across repositories, given that a prerequisite for this skill seems to be a clone of the Lading repository?
Maybe. TBD.

All I know is that this has proven to be very useful for me and I'd like to figure out how to expose this more globally. Don't know how yet. I know that I don't want to copy-paste this skill in every repo.

fwiw I'd like to get the lading CLI to be a sub-command of the SMP CLI and if the SMP CLI is brew tappable, suddenly we can get "this" into a lot of people's hands.

I don't know if we should try to package skills or we should try to push as much functionality into the lading CLI itself. I think the latter, that way it's deterministic and users can wrap the CLI with skills themselves if the CLI exposes enough.

preinlein · 2026-04-23T17:38:05Z

I aggressively tested this by exercising all possible inputs, results are here: https://datadoghq.atlassian.net/wiki/x/uoMTigE

Test matrix: 11 cases covering the various invocation paths.

Happy paths
1. Exact name → quality_gate_idle
2. Substring match → security_mean
3. Glob with ≤4 matches → *dogstatsd* (3 hits → AskUserQuestion path)
4. Exact name of disabled experiment → ddot_traces (x-disabled-cases flag)
5. Direct file path (absolute)

Error / edge paths
6. No match, no near-match → totally_fake_xyz
7. Typo triggering "did you mean?" → quality_gate_idlee
8. Ambiguous with >4 matches → quality_gate (bulleted list path)
9. No arguments → full listing path (>4 → bulleted list)
10. Non-existent file path → /tmp/nonexistent-xyz.yaml
11. Existing non-lading file → /etc/hosts

I'll run them one at a time so we can observe each outcome...

Summary

All 11 invocation paths exercised successfully:

#	Case	Script exit	Skill behavior	Result
1	Exact name quality_gate_idle	0	Resolved + explained	PASS
2	Substring security_mean	0	Resolved unambiguously + explained	PASS
3	Glob *dogstatsd* (3 hits)	3	AskUserQuestion with 3 options	PASS
4	Disabled ddot_traces	0	Resolved + flagged x-disabled-cases (also caught a schema drift)	PASS
5	Direct absolute path	0	Passed through + explained	PASS
6	Bogus totally_fake_xyz	2	Not-found, no suggestions, stop	PASS
7	Typo quality_gate_idlee	2	"did you mean?" → AskUserQuestion	PASS
8	Broad quality_gate (7 hits)	3	Plain bulleted list, no AskUserQuestion	PASS
9	No argument (31 configs)	0	Plain bulleted list	PASS
10	Non-existent path	2	Direct-file branch "not found"	PASS
11	Non-lading file /etc/hosts	2	Non-lading guard fires	PASS
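The exit codes in the summary suggest a simple contract: 0 when a name resolves, 3 when it is ambiguous, 2 when nothing matches. The sketch below is a hypothetical Python rendering of that contract, not the skill's actual script. It covers only the name-based cases (1, 2, 3, 6, 7, 8) and leaves out file paths and the no-argument listing. The experiment names come from the Regression Detector tables on this PR.

```python
import difflib
from fnmatch import fnmatchcase

EXPERIMENTS = [
    "quality_gate_idle", "quality_gate_idle_all_features", "quality_gate_logs",
    "quality_gate_metrics_logs", "quality_gate_security_idle",
    "quality_gate_security_mean_fs_load", "quality_gate_security_no_fs_load",
    "uds_dogstatsd_to_api", "uds_dogstatsd_to_api_v3",
    "uds_dogstatsd_20mb_12k_contexts_20_senders",
]


def resolve(query, names):
    """Return (exit_code, candidates): 0 resolved, 2 not found, 3 ambiguous."""
    if query in names:
        # An exact name wins even when it is also a prefix of other names.
        return 0, [query]
    if any(ch in query for ch in "*?["):
        matches = [name for name in names if fnmatchcase(name, query)]
    else:
        matches = [name for name in names if query in name]
    if len(matches) == 1:
        return 0, matches
    if matches:
        return 3, matches
    # Nothing matched: offer near-misses for a "did you mean?" prompt, if any.
    return 2, difflib.get_close_matches(query, names)


for query in ["quality_gate_idle", "security_mean", "*dogstatsd*", "quality_gate_idlee"]:
    print(query, resolve(query, EXPERIMENTS))
```

Keeping the resolution step deterministic like this leaves only the explanation itself to the model, which makes the invocation paths straightforward to test.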

I think for this PR, that level of testing makes sense.

Outside of scope for this PR, I'd like to see skills in the repo go through automated evals (behavioral first, then also triggering evals if we want AIs to call these skills). I'll reach out to some people internally to see what I can find.

I'm going to bring it up with Agent DevX as well. I'd like to know the immediate term stance on how to review skills and the acceptance criteria.

Otherwise, it puts reviewers in a bind.

spikat · 2026-04-24T12:31:17Z

The system-probe.yaml is good. This one should only contain runtime_security_config.enabled=true (activity_dump and remote_config should only be present on the system-probe side).
Same for the other 2 experiments

Thanks for the callout, fixed!
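A check like the one below could catch this kind of drift in a security-agent.yaml once it has been parsed into a dictionary. It is only a sketch: the function names are hypothetical, the sample dictionaries are made up for the example, and nesting activity_dump under runtime_security_config is an assumption about the file layout.

```python
ALLOWED_KEY = "runtime_security_config.enabled"


def flatten(config, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in a nested dict."""
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def unexpected_security_agent_keys(config):
    """Return every dotted key other than runtime_security_config.enabled."""
    return sorted(path for path, _ in flatten(config) if path != ALLOWED_KEY)


# Hypothetical contents before and after the fix.
before = {"runtime_security_config": {"enabled": True, "activity_dump": {"enabled": False}}}
after = {"runtime_security_config": {"enabled": True}}

print(unexpected_security_agent_keys(before))
print(unexpected_security_agent_keys(after))
```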

GeorgeHahn · 2026-04-29T01:21:05Z

+Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load:
+
+- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events.
+- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load.


Is mean an interesting level of load? I would expect this to be a severely left skewed distribution, which mean will under-represent. Could we capture a higher percentile of the observed workload?

I had a look. To give an idea of how skewed that distribution is, some hosts top out above 400k write events per second.

I got there with this query: top(max:datadog.runtime_security.perf_buffer.events.write{event_type:open} by {host}.as_rate(), 100, 'max', 'desc')

More interesting is something along these lines: percentile(max:datadog.runtime_security.perf_buffer.events.write{event_type:open} by {host}.as_rate(), 'p95', { * })

Also, fwiw, I've been using this notebook to help visualize some of what we're looking at: https://app.datadoghq.com/notebook/13998267/cws-quality-gates?cell-eh89gz4d-from_ts=1776088673376&cell-eh89gz4d-refresh_mode=sliding&cell-eh89gz4d-to_ts=1776693473376&refresh_mode=paused&tpl_var_event_type=%2A&tpl_var_experiment=quality_gate_security_idle&utc_override=false&from_ts=1776710092745&to_ts=1776713692745

Is mean an interesting level of load?

I think it is. Being able to say "on average, this is the cost" is interesting to me. Having said that, being able to say "what's the expected cost at the 95th percentile" is also interesting. I could see us having both.

I also think mean is a lot easier to reason about and visualize in DataDog than doing a percentile of the maxes on hosts. What I would really like is for the underlying metric to be a distribution so we could use a percentile directly. Alas, it's not.

Personally, I'd like to revisit this later this year once I have a better understanding of COAT and other telemetry. I'd like to provide tooling that works for any Quality Gate. This is very much an intermediate step till then.

I'm going to push back a little bit on this ask for now. Folks from CWS were open to mean as an initial QG target. I'd like to get them started with mean and in the meeting next week with CWS, I'll make sure to encourage them to adapt the QGs according to the use cases they deem most important. I'll communicate that we'll/I'll be available to assist.
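To illustrate why the choice of statistic matters on a skewed fleet, here is a small sketch using Python's `statistics` module. The per-host rates are synthetic and invented purely for the example; only the 400k figure echoes the observation above that some hosts top out above 400k events per second. Nothing here is production data.

```python
import statistics

# Synthetic per-host open rates in events/s. NOT production data.
rates = [5] * 60 + [20] * 25 + [100] * 10 + [1000] * 4 + [400_000]


def summarize(values):
    """Return mean, median, p95, and max for a list of per-host rates."""
    ordered = sorted(values)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20)[-1]
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": p95,
        "max": ordered[-1],
    }


summary = summarize(rates)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On this made-up sample a single hot host pulls the mean far above the median, so no single number describes the fleet. That is the trade-off being weighed in this thread, and it is one reason a gate per statistic could be reasonable.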

Ishirui

I'm reviewing specifically the agent-devx-owned files, i.e. the Claude skills you are adding:

1. Could you split this into a different PR? This one is big enough as-is.
2. Could you add yourselves as CODEOWNERS for those skills? We in Agent DevX don't have much context on how these work (esp. w.r.t. lading) 😅
3. I think there was already a comment mentioning this, but it would be better imo to have these skills living in lading, with maybe a skill in the Agent that delegates to the lading skill.
4. Would it be possible to avoid using complex bash scripts here? Anything longer than a few lines should imo be in a more easily testable language, maybe an invoke task. I think there is also quite a bit of logic (e.g. resolving the git top-level) that should be put in a "library" (and might even already exist in tasks/libs, haven't checked) to avoid duplication.
5. In general, I think we should avoid using AI for skills, or even intermediate skills steps, that can be replaced by standard scripting. In the analysis skill for instance, steps 1 through 4 are purely deterministic and do not require an LLM IIUC.

Sorry for the big set of comments, but we're still figuring out our policies regarding new skills in the repo, and would rather not have to do expensive cleanup to get everything up to standard once they are fully determined 🙏

preinlein · 2026-05-04T13:27:01Z

Since everything you commented is about the skills, I'll move them out into their own PR and open a fresh PR with you folks. That will address 1, I'll address 2.

I'll take the rest offline.

preinlein · 2026-05-05T17:55:48Z

I've decided to split this PR into 3:

This is so that the quality gates are separated from the introduced skills. I'll be poking folks for reviews/approvals on those PRs.

dd-octo-sts Bot added the internal Identify a non-fork PR label Mar 20, 2026

preinlein changed the title ~~Add CWS Quality Gate for the base scenario~~ (DO NOT REVIEW YET) Add CWS Quality Gate for the base scenario Mar 20, 2026

dd-octo-sts Bot added the team/agent-devx label Mar 20, 2026

github-actions Bot added the medium review PR review might take time label Mar 20, 2026

preinlein commented Mar 20, 2026

preinlein force-pushed the paul.reinlein/cws-quality-gate branch 3 times, most recently from e5523fc to cbe02d9 Compare March 20, 2026 17:51

preinlein commented Mar 20, 2026

github-actions Bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Mar 20, 2026

dd-octo-sts Bot added the team/agent-security label Mar 20, 2026

dd-octo-sts Bot added the stale label Apr 4, 2026

preinlein removed the stale label Apr 7, 2026

DataDog deleted a comment from dd-octo-sts Bot Apr 17, 2026

preinlein force-pushed the paul.reinlein/cws-quality-gate branch from 16accf1 to 6bc92d5 Compare April 20, 2026 13:07

preinlein added the qa/skip-qa label Apr 20, 2026

preinlein changed the title ~~(DO NOT REVIEW YET) Add CWS Quality Gate for the base scenario~~ (DO NOT REVIEW) Add CWS Quality Gates Apr 20, 2026

preinlein added changelog/no-changelog No changelog entry needed qa/no-code-change No code change in Agent code requiring validation labels Apr 20, 2026

preinlein commented Apr 21, 2026

preinlein changed the title ~~(DO NOT REVIEW) Add CWS Quality Gates~~ Add CWS Quality Gates Apr 22, 2026

preinlein marked this pull request as ready for review April 22, 2026 19:22

preinlein requested review from a team as code owners April 22, 2026 19:22

chatgpt-codex-connector Bot reviewed Apr 22, 2026

Comment thread test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml Outdated

goxberry reviewed Apr 22, 2026

Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated

goxberry reviewed Apr 22, 2026

Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated

goxberry reviewed Apr 22, 2026

Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated

goxberry reviewed Apr 22, 2026

preinlein commented Apr 23, 2026

spikat reviewed Apr 24, 2026

blt approved these changes Apr 28, 2026

GeorgeHahn reviewed Apr 29, 2026

preinlein force-pushed the paul.reinlein/cws-quality-gate branch from e75da91 to 6b980eb Compare May 1, 2026 17:56

Ishirui reviewed May 4, 2026

Setup CWS Quality Gates

fe74a2b

preinlein force-pushed the paul.reinlein/cws-quality-gate branch from e57817b to fe74a2b Compare May 4, 2026 18:44

preinlein closed this May 5, 2026