Add CWS Quality Gates #48139

Closed
preinlein wants to merge 1 commit into main from paul.reinlein/cws-quality-gate

@preinlein preinlein commented Mar 20, 2026

What does this PR do?

Adds three SMP regression quality gates for CWS (Workload Protection) under test/regression/cases/:

  • quality_gate_security_idle — CWS on, shipped default policy, no lading generator. Baseline "turn it on and leave it alone" floor.
  • quality_gate_security_no_fs_load — CWS on, experiment-supplied default.policy, no generator. Isolates policy + approver overhead at zero application load.
  • quality_gate_security_mean_fs_load — CWS on, same experiment default.policy, file_tree generator sized to org2's per-host mean perf_buffer.events.write rate.

Each case enforces a memory bound and a CPU bound.

Also adds two Claude skills used to author and maintain these gates:

  • .claude/skills/explain-lading-config/SKILL.md
  • .claude/skills/analyze-quality-gate-security-mean-fs-load-experiment/SKILL.md — compares lading config vs. SMP capture vs. production event_type:open for the mean-FS-load gate

Removes the file_tree experiment:

  • It was not a quality gate
  • It had no bounds checks
  • It lacked a readme
  • Its coverage overlaps with what these quality gates test
  • It wasn't really maintained (all updates came from SMP and were not really related to what's being tested)

Motivation

CWS lacks quality-gate coverage in SMP. The three gates form a ladder — idle → no-load-with-policy → mean-FS-load — so a memory regression can be attributed to policy loading, approver overhead, or event-processing overhead rather than lumped together. The mean-FS-load rate (open_per_second: 41) reflects the org2 per-host weekly mean for perf_buffer.events.write so the gate tracks production rather than an arbitrary stressor.
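A minimal sketch of how the ladder could be read when one or more gates regress. The mapping from gate to suspected factor is my reading of the description above, not something SMP computes, and `first_suspect` is a hypothetical helper.

```python
from typing import Optional, Set

# Gates ordered from fewest to most moving parts; each rung adds one factor.
# The factor labels are an interpretation of the PR description (assumption).
LADDER = [
    ("quality_gate_security_idle", "CWS enabled with the shipped default policy"),
    ("quality_gate_security_no_fs_load", "experiment default.policy loading and approver overhead"),
    ("quality_gate_security_mean_fs_load", "event processing under file_tree load"),
]


def first_suspect(regressed: Set[str]) -> Optional[str]:
    """Return the factor introduced by the lowest rung that regressed."""
    for gate, factor in LADDER:
        if gate in regressed:
            return factor
    return None


print(first_suspect({"quality_gate_security_mean_fs_load"}))
```

If only the top rung regresses, the lower rungs rule out the baseline and the policy, which is the attribution the ladder is meant to give.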

Describe how you validated your changes

  • Ran the three experiments via SMP and iterated on agent config & bounds
  • Used /analyze-quality-gate-security-mean-fs-load-experiment to confirm the lading-configured open rate, the SMP-captured open rate, and production per-host weekly open mean match within noise.
  • Ran the quality gates many times, using this notebook to verify the results.

The majority of the work involved understanding how things work: what the CWS SLOs are, what they measure, what that is/means in production, configuring the agent to properly exercise the load we're putting on it, etc.

Additional Notes

  • file_tree max_nodes: 500000 is sized to exceed open_per_second × run_duration so the tree never exhausts mid-capture (once all nodes exist, subsequent opens are O_RDONLY and rejected by CWS flag approvers).
  • activity_dump.enabled: false in both security-agent.yaml and system-probe.yaml — activity dump is being reworked and adds unpredictable kernel-event volume.
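The max_nodes sizing can be sanity-checked with a little arithmetic. This is a sketch only: it assumes the worst case where every open consumes a fresh node, and `run_duration_seconds` is a placeholder because the capture length is not stated in this PR.

```python
def min_max_nodes(open_per_second: int, run_duration_seconds: int) -> int:
    """Smallest node count the tree cannot exhaust within one run,
    assuming every open creates a new node (worst case)."""
    return open_per_second * run_duration_seconds + 1


def seconds_until_exhausted(max_nodes: int, open_per_second: int) -> float:
    """How long the configured tree lasts at the configured open rate."""
    return max_nodes / open_per_second


# Values taken from the lading config in this PR.
headroom = seconds_until_exhausted(500_000, 41)
print(f"{headroom:.0f} s (~{headroom / 3600:.1f} h)")
```

Under that assumption, 500000 nodes at 41 opens per second last a bit over three hours, so the tree only runs out if a capture is longer than that.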

Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@dd-octo-sts dd-octo-sts Bot added the internal Identify a non-fork PR label Mar 20, 2026
@preinlein preinlein changed the title Add CWS Quality Gate for the base scenario (DO NOT REVIEW YET) Add CWS Quality Gate for the base scenario Mar 20, 2026
@github-actions github-actions Bot added the medium review PR review might take time label Mar 20, 2026
Comment thread .claude/skills/analyze-quality-gate-security-base-experiment/SKILL.md Outdated
Comment thread .claude/skills/explain-lading-config/SKILL.md
Comment thread test/regression/cases/quality_gate_security_base/datadog-agent/datadog.yaml Outdated
- **Compliance Config** (CSPM host benchmarks)
- **SBOM Scanning** (host vulnerability management)

## Owners
Contributor Author

This is a new concept for QGs.

Comment thread test/regression/cases/quality_gate_security_base/README.md Outdated

- Memory usage is below a threshold

## Additional Information
Contributor Author

This is a new concept for QGs.


The emitted metric from SMP should have a similar value to the production data we source.

### Verifying the Experiment Configuration
Contributor Author

This is a new concept for QGs. This will allow us to close the loop and maintain QGs in a way that remains representative of real usage.

Comment thread test/regression/cases/quality_gate_security_base/README.md Outdated
@preinlein preinlein force-pushed the paul.reinlein/cws-quality-gate branch 3 times, most recently from e5523fc to cbe02d9 Compare March 20, 2026 17:51
Comment thread .github/CODEOWNERS Outdated
/test/benchmarks/apm_scripts/ @DataDog/agent-apm
/test/regression/ @DataDog/single-machine-performance
/test/regression/cases/docker_containers* @DataDog/single-machine-performance @DataDog/container-integrations
/test/regression/cases/quality_gate_security_base* @DataDog/single-machine-performance @DataDog/agent-security
Contributor Author

We would want to drop SMP as an owner eventually (maybe after approval?)

@github-actions github-actions Bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Mar 20, 2026

agent-platform-auto-pr Bot commented Mar 20, 2026

Files inventory check summary

File checks results against ancestor baafab8c:

Results for datadog-agent_7.80.0~devel.git.462.fe74a2b.pipeline.111329495-1_amd64.deb:

No change detected

agent-platform-auto-pr Bot commented Mar 20, 2026

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor baafab8
📊 Static Quality Gates Dashboard
🔗 SQG Job

32 successful checks with minimal change (< 2 KiB)
| Quality gate | Current Size |
|---|---|
| agent_deb_amd64 | 740.960 MiB |
| agent_deb_amd64_fips | 699.144 MiB |
| agent_heroku_amd64 | 309.076 MiB |
| agent_rpm_amd64 | 740.943 MiB |
| agent_rpm_amd64_fips | 699.127 MiB |
| agent_rpm_arm64 | 719.018 MiB |
| agent_rpm_arm64_fips | 680.293 MiB |
| agent_suse_amd64 | 740.943 MiB |
| agent_suse_amd64_fips | 699.127 MiB |
| agent_suse_arm64 | 719.018 MiB |
| agent_suse_arm64_fips | 680.293 MiB |
| docker_agent_amd64 | 801.337 MiB |
| docker_agent_arm64 | 804.239 MiB |
| docker_agent_jmx_amd64 | 992.257 MiB |
| docker_agent_jmx_arm64 | 983.938 MiB |
| docker_cluster_agent_amd64 | 206.602 MiB |
| docker_cluster_agent_arm64 | 220.634 MiB |
| docker_cws_instrumentation_amd64 | 7.142 MiB |
| docker_cws_instrumentation_arm64 | 6.689 MiB |
| docker_host_profiler_amd64 | 301.106 MiB |
| docker_host_profiler_arm64 | 312.618 MiB |
| docker_dogstatsd_amd64 | 39.370 MiB |
| docker_dogstatsd_arm64 | 37.565 MiB |
| dogstatsd_deb_amd64 | 30.024 MiB |
| dogstatsd_deb_arm64 | 28.169 MiB |
| dogstatsd_rpm_amd64 | 30.024 MiB |
| dogstatsd_suse_amd64 | 30.024 MiB |
| iot_agent_deb_amd64 | 44.454 MiB |
| iot_agent_deb_arm64 | 41.439 MiB |
| iot_agent_deb_armhf | 42.179 MiB |
| iot_agent_rpm_amd64 | 44.455 MiB |
| iot_agent_suse_amd64 | 44.455 MiB |

On-wire sizes (compressed)

| Quality gate | Change | Size (prev → curr → max) |
|---|---|---|
| agent_deb_amd64 | -14.85 KiB (0.01% reduction) | 175.287 → 175.273 → 179.160 |
| agent_deb_amd64_fips | -15.8 KiB (0.01% reduction) | 166.997 → 166.981 → 174.440 |
| agent_heroku_amd64 | neutral | 74.953 MiB → 80.310 |
| agent_rpm_amd64 | -6.07 KiB (0.00% reduction) | 177.300 → 177.294 → 182.080 |
| agent_rpm_amd64_fips | -18.58 KiB (0.01% reduction) | 168.379 → 168.361 → 174.140 |
| agent_rpm_arm64 | -25.2 KiB (0.02% reduction) | 159.389 → 159.364 → 163.610 |
| agent_rpm_arm64_fips | -23.09 KiB (0.01% reduction) | 151.730 → 151.707 → 156.850 |
| agent_suse_amd64 | -6.07 KiB (0.00% reduction) | 177.300 → 177.294 → 182.080 |
| agent_suse_amd64_fips | -18.58 KiB (0.01% reduction) | 168.379 → 168.361 → 174.140 |
| agent_suse_arm64 | -25.2 KiB (0.02% reduction) | 159.389 → 159.364 → 163.610 |
| agent_suse_arm64_fips | -23.09 KiB (0.01% reduction) | 151.730 → 151.707 → 156.850 |
| docker_agent_amd64 | neutral | 267.696 MiB → 272.990 |
| docker_agent_arm64 | +2.13 KiB (0.00% increase) | 254.717 → 254.719 → 261.470 |
| docker_agent_jmx_amd64 | neutral | 336.355 MiB → 341.610 |
| docker_agent_jmx_arm64 | -2.65 KiB (0.00% reduction) | 319.364 → 319.362 → 326.050 |
| docker_cluster_agent_amd64 | neutral | 72.415 MiB → 73.460 |
| docker_cluster_agent_arm64 | neutral | 67.867 MiB → 68.680 |
| docker_cws_instrumentation_amd64 | neutral | 2.999 MiB → 3.330 |
| docker_cws_instrumentation_arm64 | neutral | 2.729 MiB → 3.090 |
| docker_host_profiler_amd64 | neutral | 110.750 MiB → 125.600 |
| docker_host_profiler_arm64 | neutral | 105.078 MiB → 120.000 |
| docker_dogstatsd_amd64 | neutral | 15.237 MiB → 15.870 |
| docker_dogstatsd_arm64 | neutral | 14.555 MiB → 14.890 |
| dogstatsd_deb_amd64 | neutral | 7.943 MiB → 8.830 |
| dogstatsd_deb_arm64 | neutral | 6.827 MiB → 7.750 |
| dogstatsd_rpm_amd64 | neutral | 7.951 MiB → 8.840 |
| dogstatsd_suse_amd64 | neutral | 7.951 MiB → 8.840 |
| iot_agent_deb_amd64 | neutral | 11.703 MiB → 13.210 |
| iot_agent_deb_arm64 | neutral | 9.998 MiB → 11.620 |
| iot_agent_deb_armhf | neutral | 10.204 MiB → 11.780 |
| iot_agent_rpm_amd64 | -2.81 KiB (0.02% reduction) | 11.719 → 11.716 → 13.230 |
| iot_agent_suse_amd64 | -2.81 KiB (0.02% reduction) | 11.719 → 11.716 → 13.230 |

cit-pr-commenter-54b7da Bot commented Mar 20, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 1d3e4760-58d5-4ba8-8347-593dca98d57b

Baseline: baafab8
Comparison: fe74a2b
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| | docker_containers_cpu | % cpu utilization | -0.54 | [-3.41, +2.34] | 1 | Logs |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| | tcp_syslog_to_blackhole | ingress throughput | +0.62 | [+0.45, +0.80] | 1 | Logs |
| | ddot_logs | memory utilization | +0.49 | [+0.43, +0.55] | 1 | Logs |
| | quality_gate_idle_all_features | memory utilization | +0.48 | [+0.44, +0.52] | 1 | Logs, bounds checks dashboard |
| | docker_containers_memory | memory utilization | +0.46 | [+0.36, +0.56] | 1 | Logs |
| | quality_gate_security_no_fs_load | memory utilization | +0.38 | [+0.28, +0.48] | 1 | Logs, bounds checks dashboard |
| | otlp_ingest_logs | memory utilization | +0.37 | [+0.27, +0.48] | 1 | Logs |
| | ddot_metrics | memory utilization | +0.29 | [+0.09, +0.49] | 1 | Logs |
| | ddot_metrics_sum_delta | memory utilization | +0.18 | [-0.00, +0.36] | 1 | Logs |
| | ddot_metrics_sum_cumulative | memory utilization | +0.15 | [-0.00, +0.31] | 1 | Logs |
| | file_to_blackhole_0ms_latency | egress throughput | +0.06 | [-0.51, +0.62] | 1 | Logs |
| | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.03 | [-0.20, +0.27] | 1 | Logs |
| | file_to_blackhole_500ms_latency | egress throughput | +0.02 | [-0.38, +0.43] | 1 | Logs |
| | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.21] | 1 | Logs |
| | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.20, +0.20] | 1 | Logs |
| | tcp_dd_logs_filter_exclude | ingress throughput | -0.01 | [-0.10, +0.09] | 1 | Logs |
| | file_to_blackhole_100ms_latency | egress throughput | -0.03 | [-0.16, +0.10] | 1 | Logs |
| | otlp_ingest_metrics | memory utilization | -0.04 | [-0.20, +0.12] | 1 | Logs |
| | file_to_blackhole_1000ms_latency | egress throughput | -0.06 | [-0.51, +0.39] | 1 | Logs |
| | quality_gate_security_idle | memory utilization | -0.08 | [-0.15, -0.01] | 1 | Logs, bounds checks dashboard |
| | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.13 | [-0.18, -0.07] | 1 | Logs |
| | quality_gate_idle | memory utilization | -0.27 | [-0.33, -0.22] | 1 | Logs, bounds checks dashboard |
| | quality_gate_security_mean_fs_load | memory utilization | -0.29 | [-0.32, -0.25] | 1 | Logs, bounds checks dashboard |
| | quality_gate_metrics_logs | memory utilization | -0.40 | [-0.65, -0.15] | 1 | Logs, bounds checks dashboard |
| | docker_containers_cpu | % cpu utilization | -0.54 | [-3.41, +2.34] | 1 | Logs |
| | quality_gate_logs | % cpu utilization | -1.72 | [-2.68, -0.75] | 1 | Logs, bounds checks dashboard |

Bounds Checks: ❌ Failed

| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| | docker_containers_cpu | simple_check_run | 10/10 | 697 ≥ 26 | |
| | docker_containers_memory | memory_usage | 10/10 | 244.00MiB ≤ 370MiB | |
| | docker_containers_memory | simple_check_run | 10/10 | 727 ≥ 26 | |
| | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| | quality_gate_idle | memory_usage | 10/10 | 143.42MiB ≤ 147MiB | bounds checks dashboard |
| | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| | quality_gate_idle_all_features | memory_usage | 10/10 | 469.66MiB ≤ 495MiB | bounds checks dashboard |
| | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| | quality_gate_logs | memory_usage | 10/10 | 178.42MiB ≤ 195MiB | bounds checks dashboard |
| | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| | quality_gate_metrics_logs | cpu_usage | 10/10 | 346.97 ≤ 2000 | bounds checks dashboard |
| | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| | quality_gate_metrics_logs | memory_usage | 10/10 | 369.79MiB ≤ 430MiB | bounds checks dashboard |
| | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| | quality_gate_security_idle | cpu_usage | 10/10 | 24.93 ≤ 40 | bounds checks dashboard |
| | quality_gate_security_idle | memory_usage | 10/10 | 285.01MiB ≤ 330MiB | bounds checks dashboard |
| | quality_gate_security_mean_fs_load | cpu_usage | 0/10 | 53.96 > 40 | bounds checks dashboard |
| | quality_gate_security_mean_fs_load | memory_usage | 10/10 | 269.44MiB ≤ 320MiB | bounds checks dashboard |
| | quality_gate_security_no_fs_load | cpu_usage | 10/10 | 23.14 ≤ 40 | bounds checks dashboard |
| | quality_gate_security_no_fs_load | memory_usage | 10/10 | 274.91MiB ≤ 320MiB | bounds checks dashboard |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
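The three criteria above can be written down as a small predicate. This is a sketch of the stated rule only, not the Regression Detector's actual code; the function name and signature are hypothetical.

```python
def is_regression(
    delta_mean_pct: float,
    ci_low: float,
    ci_high: float,
    erratic: bool = False,
    tolerance_pct: float = 5.0,
) -> bool:
    """Apply the three criteria: effect size, CI excludes zero, not erratic."""
    big_enough = abs(delta_mean_pct) >= tolerance_pct
    ci_excludes_zero = not (ci_low <= 0.0 <= ci_high)
    return big_enough and ci_excludes_zero and not erratic


# quality_gate_logs from the table: the CI excludes zero, but |-1.72| < 5.00.
print(is_regression(-1.72, -2.68, -0.75))  # False
```

This is why none of the rows in the table are flagged: several intervals exclude zero, but no |Δ mean %| reaches the 5.00% tolerance.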

CI Pass/Fail Decision

Failed. Some Quality Gates were violated.

  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_security_no_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_security_no_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_security_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_security_idle, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_security_mean_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_security_mean_fs_load, bounds check cpu_usage: 0/10 replicas passed. Failed 10 which is > 0. Gate FAILED.
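Read literally, the list above says a gate fails as soon as any replica fails, and the run fails as soon as any gate fails. A sketch of that rule follows; the zero-failure threshold is inferred from "Failed 10 which is > 0", and the names are hypothetical.

```python
from typing import Dict, Tuple


def gate_passed(replicates_passed: int, replicates_total: int) -> bool:
    """A gate passes only when no replica failed (inferred threshold: 0)."""
    return replicates_total - replicates_passed == 0


def ci_decision(results: Dict[Tuple[str, str], Tuple[int, int]]) -> str:
    """results maps (experiment, bounds_check_name) -> (passed, total)."""
    all_ok = all(gate_passed(p, t) for p, t in results.values())
    return "Passed" if all_ok else "Failed"


results = {
    ("quality_gate_security_mean_fs_load", "memory_usage"): (10, 10),
    ("quality_gate_security_mean_fs_load", "cpu_usage"): (0, 10),
}
print(ci_decision(results))  # Failed
```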

@dd-octo-sts dd-octo-sts Bot added the stale label Apr 4, 2026
@preinlein preinlein removed the stale label Apr 7, 2026
@DataDog DataDog deleted a comment from dd-octo-sts Bot Apr 17, 2026
@preinlein preinlein force-pushed the paul.reinlein/cws-quality-gate branch from 16accf1 to 6bc92d5 Compare April 20, 2026 13:07
@preinlein preinlein changed the title (DO NOT REVIEW YET) Add CWS Quality Gate for the base scenario (DO NOT REVIEW) Add CWS Quality Gates Apr 20, 2026
@preinlein preinlein added changelog/no-changelog No changelog entry needed qa/no-code-change No code change in Agent code requiring validation labels Apr 20, 2026
Contributor Author

This will likely need some iteration after merge. Right now this works well considering I'm manually triggering the Quality Gates.

Ideally, we use data from regression detector runs off of main (ideally with some kind of tag that we can filter on) and nothing from CI itself as we don't want PR data to influence things.

Comment thread .github/CODEOWNERS
/test/benchmarks/apm_scripts/ @DataDog/agent-apm
/test/regression/ @DataDog/single-machine-performance
/test/regression/cases/docker_containers* @DataDog/single-machine-performance @DataDog/container-integrations
/test/regression/cases/quality_gate_security_* @DataDog/single-machine-performance @DataDog/agent-security
Contributor Author

Let me know if you folks want exclusive ownership of the quality gates. I started with joint ownership but I'd love to be able to fully hand this over.

# Must exceed open_per_second × run_duration_seconds so that lading never
# exhausts the tree during a capture. Once all nodes exist on disk, opens
# become O_RDONLY and are rejected by CWS kernel-side flag approvers.
max_nodes: 500000
Contributor Author

This is something I'm going to follow up on. We'll need to change how file system load gets generated so that there's a way to continuously generate unique files instead of using a cached set of data (I can elaborate if folks have questions).

Right now, this is a workaround that works just fine. Just not ideal.

# become O_RDONLY and are rejected by CWS kernel-side flag approvers.
max_nodes: 500000
open_per_second: 41
rename_per_second: 1
Contributor Author

Ideally I'd like to get this to 0 since opens are the majority of traffic. This is a limitation in lading's current implementation.

This is something I'll be following up on as well.

@preinlein preinlein changed the title (DO NOT REVIEW) Add CWS Quality Gates Add CWS Quality Gates Apr 22, 2026
@preinlein preinlein marked this pull request as ready for review April 22, 2026 19:22
@preinlein preinlein requested review from a team as code owners April 22, 2026 19:22
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2226bd8fdd


Comment thread test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml Outdated
Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated
Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated
Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated
@@ -0,0 +1,87 @@
---
name: explain-lading-config
Contributor

Should there be a skill in Lading that this skill delegates to? Is that possible to do across repositories, given that a prerequisite for this skill seems to be a clone of the Lading repository?

Contributor Author

> Should there be a skill in Lading that this skill delegates to?

Yes. 100% agree.

> Is that possible to do across repositories, given that a prerequisite for this skill seems to be a clone of the Lading repository?

Maybe. TBD.

All I know is that this has proven to be very useful for me and I'd like to figure out how to expose this more globally. Don't know how yet. I know that I don't want to copy-paste this skill in every repo.

fwiw I'd like to get the lading CLI to be a sub-command of the SMP CLI and if the SMP CLI is brew tappable, suddenly we can get "this" into a lot of people's hands.

I don't know if we should try to package skills or we should try to push as much functionality into the lading CLI itself. I think the latter, that way it's deterministic and users can wrap the CLI with skills themselves if the CLI exposes enough.

@preinlein preinlein left a comment

Thanks for the review @goxberry 🙇

Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated
Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated
Comment thread .claude/skills/explain-lading-config/SKILL.md Outdated
Contributor Author

I aggressively tested this by exercising all possible inputs, results are here: https://datadoghq.atlassian.net/wiki/x/uoMTigE

test matrix — 11 cases covering the various invocation paths:
  Happy paths
  1. Exact name → quality_gate_idle
  2. Substring match → security_mean
  3. Glob with ≤4 matches → *dogstatsd* (3 hits → AskUserQuestion path)
  4. Exact name of disabled experiment → ddot_traces (x-disabled-cases flag)
  5. Direct file path (absolute)

  Error / edge paths
  6. No match, no near-match → totally_fake_xyz
  7. Typo triggering "did you mean?" → quality_gate_idlee
  8. Ambiguous with >4 matches → quality_gate (bulleted list path)
  9. No arguments → full listing path (>4 → bulleted list)
  10. Non-existent file path → /tmp/nonexistent-xyz.yaml
  11. Existing non-lading file → /etc/hosts

  I'll run them one at a time so we can observe each outcome...
Summary
  All 11 invocation paths exercised successfully:
| # | Case | Script exit | Skill behavior | Result |
|---|---|---|---|---|
| 1 | Exact name `quality_gate_idle` | 0 | Resolved + explained | PASS |
| 2 | Substring `security_mean` | 0 | Resolved unambiguously + explained | PASS |
| 3 | Glob `*dogstatsd*` (3 hits) | 3 | AskUserQuestion with 3 options | PASS |
| 4 | Disabled `ddot_traces` | 0 | Resolved + flagged x-disabled-cases (also caught a schema drift) | PASS |
| 5 | Direct absolute path | 0 | Passed through + explained | PASS |
| 6 | Bogus `totally_fake_xyz` | 2 | Not-found, no suggestions, stop | PASS |
| 7 | Typo `quality_gate_idlee` | 2 | "did you mean?" → AskUserQuestion | PASS |
| 8 | Broad `quality_gate` (7 hits) | 3 | Plain bulleted list, no AskUserQuestion | PASS |
| 9 | No argument (31 configs) | 0 | Plain bulleted list | PASS |
| 10 | Non-existent path | 2 | Direct-file branch "not found" | PASS |
| 11 | Non-lading file `/etc/hosts` | 2 | Non-lading guard fires | PASS |
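For readers who want the name-resolution cases in one place, here is a rough Python sketch of the exit-code behavior the summary describes (0 resolved or listed, 2 not found, 3 ambiguous). It is an illustration inferred from the test results, not the skill's actual script; `resolve` and the `NAMES` subset are hypothetical, and the file-path and "did you mean?" branches are left out.

```python
from fnmatch import fnmatch
from typing import List, Tuple

EXIT_OK, EXIT_NOT_FOUND, EXIT_AMBIGUOUS = 0, 2, 3

# A subset of experiment names that appear in this PR's regression report.
NAMES = [
    "quality_gate_idle", "quality_gate_idle_all_features", "quality_gate_logs",
    "quality_gate_metrics_logs", "quality_gate_security_idle",
    "quality_gate_security_no_fs_load", "quality_gate_security_mean_fs_load",
    "uds_dogstatsd_to_api", "uds_dogstatsd_to_api_v3",
    "uds_dogstatsd_20mb_12k_contexts_20_senders",
]


def resolve(query: str, names: List[str]) -> Tuple[int, List[str]]:
    """Return (exit_code, matches) for an experiment-name query."""
    if not query:
        return EXIT_OK, sorted(names)      # no argument: full listing
    if query in names:
        return EXIT_OK, [query]            # exact name wins over substrings
    if any(ch in query for ch in "*?["):
        hits = sorted(n for n in names if fnmatch(n, query))
    else:
        hits = sorted(n for n in names if query in n)
    if not hits:
        return EXIT_NOT_FOUND, []
    if len(hits) == 1:
        return EXIT_OK, hits
    return EXIT_AMBIGUOUS, hits            # the caller decides how to prompt


print(resolve("security_mean", NAMES))
```

Keeping the resolver deterministic and leaving the prompting choice (AskUserQuestion for 4 or fewer hits, a bulleted list otherwise) to the caller matches the split shown in the table.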

Contributor

I think for this PR, that level of testing makes sense.

Outside of scope for this PR, I'd like to see skills in the repo go through automated evals (behavioral first, then also triggering evals if we want AIs to call these skills). I'll reach out to some people internally to see what I can find.

Contributor Author

I'm going to bring it up with Agent DevX as well. I'd like to know the immediate term stance on how to review skills and the acceptance criteria.

Otherwise, it puts reviewers in a bind.

Contributor

The system-probe.yaml is good. This file should only contain runtime_security_config.enabled=true; activity_dump and remote_config should only be present on the system-probe side.
The same applies to the other two experiments.

Contributor Author

Thanks for the callout, fixed!

Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load:

- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events.
- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load.
Contributor

Is mean an interesting level of load? I would expect this to be a severely right-skewed (long-tailed) distribution, whose upper tail the mean will under-represent. Could we capture a higher percentile of the observed workload?

Contributor

I had a look. To give an idea of how skewed that distribution is, some hosts top out above 400k write events per second.

I got there with this query: top(max:datadog.runtime_security.perf_buffer.events.write{event_type:open} by {host}.as_rate(), 100, 'max', 'desc')

More interesting is something along these lines: percentile(max:datadog.runtime_security.perf_buffer.events.write{event_type:open} by {host}.as_rate(), 'p95', { * })
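To make the gap between the two statistics concrete, here is a small sketch on synthetic per-host rates. The numbers are made up for illustration and are not production data; only the long-tailed shape is meant to resemble what the queries above show.

```python
import statistics

# Synthetic per-host open rates (events/s): most hosts are quiet, a few are busy.
rates = [10.0] * 80 + [100.0] * 15 + [1000.0] * 5

mean = statistics.fmean(rates)
median = statistics.median(rates)
p95 = statistics.quantiles(rates, n=100)[94]  # 95th percentile cut point

print(f"mean={mean:.1f} median={median:.1f} p95={p95:.1f}")
```

On this sample the mean sits well above the median but far below the p95, so a gate sized to the mean and a gate sized to the p95 would exercise quite different loads.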

Contributor Author

Also, fwiw, I've been using this notebook to help visualize some of what we're looking at: https://app.datadoghq.com/notebook/13998267/cws-quality-gates?cell-eh89gz4d-from_ts=1776088673376&cell-eh89gz4d-refresh_mode=sliding&cell-eh89gz4d-to_ts=1776693473376&refresh_mode=paused&tpl_var_event_type=%2A&tpl_var_experiment=quality_gate_security_idle&utc_override=false&from_ts=1776710092745&to_ts=1776713692745

> Is mean an interesting level of load?

I think it is. Being able to say "on average, this is the cost" is interesting to me. Having said that, being able to say "what's the expected cost at the 95th percentile" is also interesting. I could see us having both.

I also think mean is a lot easier to reason about and visualize in Datadog than taking a percentile of the per-host maxes. What I would really like is for the underlying metric to be a distribution so we could use a percentile directly. Alas, it's not.
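
As a rough sketch of why those two approaches differ, the snippet below compares a percentile taken over per-host maxes with a percentile taken directly over every sample, which is closer to what a distribution metric would allow. The host names and rate samples are hypothetical, and `nearest_rank` is an illustrative helper rather than a Datadog API.

```python
import math


def nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


# Hypothetical rate samples (events/s) per host; most hosts show one short spike.
samples_by_host = {
    "host-a": [3, 4, 5, 90],
    "host-b": [10, 12, 300, 11],
    "host-c": [40, 42, 41, 39],
    "host-d": [1, 1, 2, 80],
}

# Two-step aggregation: reduce each host to its max, then take a percentile across hosts.
per_host_max = [max(rates) for rates in samples_by_host.values()]
p75_of_maxes = nearest_rank(per_host_max, 75)

# Direct aggregation: take the percentile over every individual sample.
all_samples = [rate for rates in samples_by_host.values() for rate in rates]
p75_of_samples = nearest_rank(all_samples, 75)

print(p75_of_maxes, p75_of_samples)
```

In this toy data the percentile of maxes is dominated by each host's brief spike, while the percentile over all samples reflects sustained load, which is one reason the two numbers are hard to compare.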

Personally, I'd like to revisit this later this year once I have a better understanding of COAT and other telemetry. I'd like to provide tooling that works for any Quality Gate. This is very much an intermediate step till then.


I'm going to push back a little bit on this ask for now. Folks from CWS were open to mean as an initial QG target. I'd like to get them started with mean and in the meeting next week with CWS, I'll make sure to encourage them to adapt the QGs according to the use cases they deem most important. I'll communicate that we'll/I'll be available to assist.

@preinlein preinlein force-pushed the paul.reinlein/cws-quality-gate branch from e75da91 to 6b980eb Compare May 1, 2026 17:56

Contributor

@Ishirui Ishirui left a comment

I'm reviewing specifically the agent-devx-owned files, i.e. the Claude skills you are adding:

  1. Could you split this into a different PR? This one is big enough as-is.
  2. Could you add yourselves as CODEOWNERS for those skills? We in Agent DevX don't have much context on how these work (esp. w.r.t. lading) 😅
  3. I think there was already a comment mentioning this, but it would be better imo to have these skills living in lading, with maybe a skill in the Agent that delegates to the lading skill.
  4. Would it be possible to avoid using complex bash scripts here? Anything longer than a few lines should imo be in a more easily testable language, maybe an invoke task. I think there is also quite a bit of logic (e.g. resolving the git top-level) that should be put in a "library" (and might even already exist in tasks/libs, haven't checked) to avoid duplication.
  5. In general, I think we should avoid using AI for skills, or even intermediate skill steps, that can be replaced by standard scripting. In the analysis skill, for instance, steps 1 through 4 are purely deterministic and do not require an LLM IIUC.

Sorry for the big set of comments, but we're still figuring out our policies regarding new skills in the repo, and would rather not have to do expensive cleanup to get everything up to standard once they are fully determined 🙏

@preinlein
Contributor Author

Since everything you commented on is about the skills, I'll move them out into their own PR and open a fresh PR with you folks. That will address 1; I'll address 2 as well.

I'll take the rest offline.

@preinlein preinlein force-pushed the paul.reinlein/cws-quality-gate branch from e57817b to fe74a2b Compare May 4, 2026 18:44
@preinlein
Contributor Author

I've decided to split this PR into 3:

This is so that the quality gates are separated from the introduced skills. I'll be poking folks for reviews/approvals on those PRs.

@preinlein preinlein closed this May 5, 2026

Labels

- changelog/no-changelog: No changelog entry needed
- internal: Identify a non-fork PR
- long review: PR is complex, plan time to review it
- qa/no-code-change: No code change in Agent code requiring validation
- qa/skip-qa
- team/agent-devx
- team/agent-security

6 participants