Setup CWS Quality Gates #50373
Merged: gh-worker-dd-mergequeue-cf854d merged 1 commit into `main` from `paul.reinlein/cws-quality-gates` (May 6, 2026)
2 changes: 0 additions & 2 deletions
test/regression/cases/file_tree/datadog-agent/system-probe.yaml
This file was deleted.
40 changes: 40 additions & 0 deletions
test/regression/cases/quality_gate_security_idle/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # Quality Gate CWS - Idle | ||
|
|
||
| ## Overview | ||
|
|
||
| This quality gate experiment measures the Datadog Agent's resource consumption | ||
| with Workload Protection just turned on — no custom policy, no lading-generated | ||
| filesystem workload. It establishes the floor that every CWS customer pays | ||
| before any tuning. | ||
|
|
||
| **The only enabled functionality is [workload protection](https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/).** | ||
|
|
||
| ## Owners | ||
|
|
||
| - **Teams**: @team-k9-cws-agent | ||
| - **Slack Channel**: [#security-and-compliance-agent](https://dd.enterprise.slack.com/archives/CTNVD37T3) | ||
|
|
||
| ## Scenario | ||
|
|
||
| Models a host that has just enabled CWS with no further configuration: | ||
|
|
||
| - No `runtime-security.d/default.policy` override — the agent runs with whatever | ||
| policies ship by default. | ||
| - `generator: []` in lading — no application-generated filesystem events. | ||
|
|
||
| The only events observed are background noise from default activity on the | ||
| host, filtered through the shipped approvers. | ||
|
|
||
| This is the baseline "turn it on and leave it alone" measurement. The sibling | ||
| gates `quality_gate_security_no_fs_load` and `quality_gate_security_mean_fs_load` | ||
| both layer the experiment's `default.policy` on top and isolate the effect of | ||
| lading-generated filesystem load. | ||
|
|
||
| ## Enforcements | ||
|
|
||
| - Memory usage is below a threshold | ||
| - Average CPU usage is below a threshold | ||
|
|
||
| ## Other Links | ||
|
|
||
| - [CWS Quality Gates Notebook](https://app.datadoghq.com/notebook/13998267/cws-quality-gate) |
13 changes: 2 additions & 11 deletions
...ases/file_tree/datadog-agent/datadog.yaml → ..._security_idle/datadog-agent/datadog.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,17 +1,8 @@ | ||
| auth_token_file_path: /tmp/agent-auth-token | ||
|
|
||
| dd_url: http://127.0.0.1:9091 | ||
|
|
||
| # Disable cloud detection. This stops the Agent from poking around the | ||
| # execution environment & network. This is particularly important if the target | ||
| # has network access. | ||
| cloud_provider_metadata: [] | ||
|
|
||
| logs_enabled: true | ||
|
|
||
| dd_url: http://127.0.0.1:9091 | ||
| telemetry: | ||
| enabled: true | ||
| checks: '*' | ||
| process_config: | ||
| process_dd_url: http://localhost:9093 | ||
| process_collection: | ||
| enabled: false |
4 changes: 4 additions & 0 deletions
test/regression/cases/quality_gate_security_idle/datadog-agent/security-agent.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| # Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ | ||
| # Only enable workload protection | ||
| runtime_security_config: | ||
| enabled: true |
10 changes: 10 additions & 0 deletions
test/regression/cases/quality_gate_security_idle/datadog-agent/system-probe.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| # Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ | ||
| # Only enable workload protection | ||
| runtime_security_config: | ||
| enabled: true | ||
| # # Activity dump is currently being reworked and when it is enabled, it causes a lot of kernel events | ||
| # # By disabling it, we get more predictable results from the generated load. | ||
| activity_dump: | ||
| enabled: false | ||
| remote_configuration: | ||
| enabled: false | ||
61 changes: 61 additions & 0 deletions
test/regression/cases/quality_gate_security_idle/experiment.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| optimization_goal: memory | ||
| erratic: false | ||
|
|
||
| target: | ||
| name: datadog-agent | ||
| cpu_allotment: 4 | ||
| # Set to 20% higher than the memory_usage check value | ||
| memory_allotment: 390 MiB | ||
|
|
||
| environment: | ||
| DD_API_KEY: a0000001 | ||
| DD_HOSTNAME: smp-regression | ||
|
|
||
| profiling_environment: | ||
| # internal profiling | ||
| DD_INTERNAL_PROFILING_ENABLED: true | ||
| DD_SYSTEM_PROBE_INTERNAL_PROFILING_ENABLED: true | ||
| DD_APM_INTERNAL_PROFILING_ENABLED: true | ||
| # run all the time | ||
| DD_SYSTEM_PROBE_INTERNAL_PROFILING_PERIOD: 1m | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_PERIOD: 1m | ||
| DD_INTERNAL_PROFILING_PERIOD: 1m | ||
| DD_SYSTEM_PROBE_INTERNAL_PROFILING_CPU_DURATION: 1m | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_CPU_DURATION: 1m | ||
| DD_INTERNAL_PROFILING_CPU_DURATION: 1m | ||
| # destination | ||
| DD_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket | ||
| DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket | ||
| # tags | ||
| DD_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_idle | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_idle | ||
| DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_idle | ||
|
|
||
| DD_INTERNAL_PROFILING_BLOCK_PROFILE_RATE: 10000 | ||
| DD_INTERNAL_PROFILING_DELTA_PROFILES: true | ||
| DD_INTERNAL_PROFILING_ENABLE_GOROUTINE_STACKTRACES: true | ||
| DD_INTERNAL_PROFILING_MUTEX_PROFILE_FRACTION: 10 | ||
|
|
||
| # ddprof options | ||
| DD_PROFILING_EXECUTION_TRACE_ENABLED: true | ||
| DD_PROFILING_EXECUTION_TRACE_PERIOD: 1m | ||
| DD_PROFILING_WAIT_PROFILE: true | ||
|
|
||
| checks: | ||
| - name: memory_usage | ||
| description: "Memory usage quality gate. This puts a bound on the total memory usage for CWS with no custom policy and no lading-generated filesystem load." | ||
| bounds: | ||
| series: total_pss_bytes | ||
| # When updating this, update the memory_allotment in the target section to 20% higher. | ||
| upper_bound: "330 MiB" | ||
|
|
||
| - name: cpu_usage | ||
| description: "CPU usage quality gate. This puts a bound on the total average collector millicore usage." | ||
| bounds: | ||
| series: avg(total_cpu_usage_millicores) | ||
| upper_bound: 40 | ||
|
|
||
| report_links: | ||
| - text: "bounds checks dashboard" | ||
| link: "https://app.datadoghq.com/dashboard/vz3-jd5-bdi?fromUser=true&refresh_mode=paused&tpl_var_experiment%5B0%5D={{ experiment }}&tpl_var_job_id%5B0%5D={{ job_id }}&view=spans&from_ts={{ start_time_ms }}&to_ts={{ end_time_ms }}&live=false" |
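
The comments in this file tie `memory_allotment` to the `memory_usage` bound with a 20% margin. As a rough sketch of that arithmetic, the Python below computes the allotment a strict 20% margin would give and the headroom the committed values actually provide. The helper names are hypothetical and are not part of SMP or the agent.

```python
def allotment_for(upper_bound_mib: int, margin_percent: int = 20) -> int:
    """Allotment (MiB) that sits margin_percent above the bound, rounded down."""
    return upper_bound_mib * (100 + margin_percent) // 100


def headroom_percent(upper_bound_mib: int, allotment_mib: int) -> float:
    """How far the allotment sits above the bound, as a percentage of the bound."""
    return (allotment_mib - upper_bound_mib) / upper_bound_mib * 100


if __name__ == "__main__":
    bound, allotment = 330, 390  # values from this experiment.yaml
    print(allotment_for(bound))                          # 396
    print(f"{headroom_percent(bound, allotment):.1f}%")  # 18.2%
```

With the committed values the headroom is about 18%, slightly under the 20% the comment describes. Whether that rounding is intentional is not stated in the PR.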
18 changes: 18 additions & 0 deletions
test/regression/cases/quality_gate_security_idle/lading/lading.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| generator: [] | ||
|
|
||
| blackhole: | ||
| # The datadog blackhole impersonates Datadog's V2 metrics intake. The agent's | ||
| # datadog.yaml sets `dd_url` to this address, so every statsd-emitted agent | ||
| # metric -- including `datadog.runtime_security.*` from security-agent and | ||
| # system-probe -- flows through the agent's normal forwarder and is recorded | ||
| # into SMP at its original payload timestamp. | ||
| # This is the same code path the agent uses in production. | ||
| - datadog: | ||
| v2: | ||
| binding_addr: "127.0.0.1:9091" | ||
|
|
||
| # target_metrics scrapes Prometheus/expvar endpoints on the target. CWS | ||
| # runtime_security metrics are statsd-only and are not exposed on those | ||
| # surfaces, so this is intentionally empty -- the datadog blackhole above | ||
| # captures them. | ||
| target_metrics: [] |
56 changes: 56 additions & 0 deletions
test/regression/cases/quality_gate_security_mean_fs_load/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| # Quality Gate CWS - Mean FS Load | ||
|
|
||
| ## Overview | ||
|
|
||
| This quality gate experiment tests the Datadog Agent's performance and resource | ||
| consumption with Workload Protection enabled under a production-representative | ||
| mean filesystem load. It validates that the agent can handle continuous file | ||
| tree operations while staying within defined memory bounds. | ||
|
|
||
| **The only enabled functionality is [workload protection](https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/).** | ||
|
|
||
| ## Owners | ||
|
|
||
| - **Teams**: @team-k9-cws-agent | ||
| - **Slack Channel**: [#security-and-compliance-agent](https://dd.enterprise.slack.com/archives/CTNVD37T3) | ||
|
|
||
| ## Scenario | ||
|
|
||
| Models the per-host average filesystem event rate as observed in internal production data. | ||
| The load generated produces file opens and renames with no explicit CWS rules triggering. | ||
|
|
||
| A sibling gate, `quality_gate_security_no_fs_load`, uses the same `default.policy` | ||
| but `generator: []` — it measures the same configuration with zero lading-generated | ||
| filesystem events. | ||
|
|
||
| ## Enforcements | ||
|
|
||
| - Memory usage is below a threshold | ||
| - Average CPU usage is below a threshold | ||
|
|
||
| ## Additional Information | ||
|
|
||
| The key metric that determines the load is `datadog.runtime_security.perf_buffer.events.write`. This represents the number of kernel events which are being seen. | ||
|
|
||
| SMP runs emit an equivalent metric called `single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write`. | ||
|
|
||
| `datadog.runtime_security.perf_buffer.events.write` | ||
| → Lading load | ||
| → SMP run | ||
| → `single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write` == `datadog.runtime_security.perf_buffer.events.write` | ||
|
|
||
| The emitted metric from SMP should have a similar value to the production data we source. | ||
|
|
||
| ### Verifying the Experiment Configuration | ||
|
|
||
| To check whether the lading config accurately models production, run: | ||
|
|
||
| ``` | ||
| /analyze-quality-gate-security-mean-fs-load | ||
| ``` | ||
|
|
||
| This compares three values: the lading-configured event rate, the SMP-captured metric, and the production per-host average for `perf_buffer.events.write`. | ||
|
|
||
| ## Other Links | ||
|
|
||
| - [CWS Quality Gates Notebook](https://app.datadoghq.com/notebook/13998267/cws-quality-gate) |
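
The verification step described in this README compares three rates for `perf_buffer.events.write`. A minimal sketch of such a comparison follows, assuming a simple relative tolerance. The function names, the tolerance and the numbers are all hypothetical, and none of them are taken from the `/analyze-quality-gate-security-mean-fs-load` command.

```python
def within_tolerance(reference: float, observed: float, rel_tol: float = 0.10) -> bool:
    """True when observed is within rel_tol of a positive reference value."""
    if reference <= 0:
        raise ValueError("reference rate must be positive")
    return abs(observed - reference) / reference <= rel_tol


def compare_rates(
    lading: float, smp: float, production: float, rel_tol: float = 0.10
) -> dict[str, bool]:
    """Check the lading-configured and SMP-captured rates against production."""
    return {
        "lading_vs_production": within_tolerance(production, lading, rel_tol),
        "smp_vs_production": within_tolerance(production, smp, rel_tol),
    }


if __name__ == "__main__":
    # Invented events-per-second figures, for illustration only.
    print(compare_rates(lading=1000.0, smp=950.0, production=1020.0))
```

Using production as the reference keeps both checks on the same scale, so a drift in either the lading config or the SMP capture shows up as a separate failing entry.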
8 changes: 8 additions & 0 deletions
test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/datadog.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| auth_token_file_path: /tmp/agent-auth-token | ||
|
|
||
| dd_url: http://127.0.0.1:9091 | ||
|
|
||
| # Disable cloud detection. This stops the Agent from poking around the | ||
| # execution environment & network. This is particularly important if the target | ||
| # has network access. | ||
| cloud_provider_metadata: [] |
4 changes: 4 additions & 0 deletions
.../cases/quality_gate_security_mean_fs_load/datadog-agent/runtime-security.d/default.policy
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| rules: | ||
| - id: lading_open_monitor | ||
| expression: >- | ||
| open.file.path =~ "/lading-data/*" && open.flags & (O_CREAT | O_RDWR | O_WRONLY) > 0 |
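
The rule in this policy file matches opens under `/lading-data/` whose flags include any of `O_CREAT`, `O_RDWR` or `O_WRONLY`. The bitmask part can be sketched in Python with the constants from the `os` module. This illustrates the flag test only: the path match is approximated with a prefix check, which is an assumption and may not reproduce the rule language's glob semantics exactly, and the expression is read here as `(flags & mask) > 0`.

```python
import os

# Flags the rule treats as "write-like" opens.
WRITE_MASK = os.O_CREAT | os.O_RDWR | os.O_WRONLY


def rule_would_match(path: str, flags: int) -> bool:
    """Approximate the lading_open_monitor expression for one open() call."""
    under_lading_data = path.startswith("/lading-data/")  # stand-in for the glob
    return under_lading_data and (flags & WRITE_MASK) > 0


if __name__ == "__main__":
    print(rule_would_match("/lading-data/a.log", os.O_WRONLY | os.O_CREAT))  # True
    print(rule_would_match("/lading-data/a.log", os.O_RDONLY))               # False
    print(rule_would_match("/etc/passwd", os.O_RDWR))                        # False
```

Because `O_RDONLY` is 0, a plain read-only open never satisfies the mask, so only creating or writing opens under the lading directory are candidates for the rule.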
4 changes: 4 additions & 0 deletions
test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/security-agent.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| # Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ | ||
| # Only enable workload protection | ||
| runtime_security_config: | ||
| enabled: true |
10 changes: 10 additions & 0 deletions
test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/system-probe.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| # Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ | ||
| # Only enable workload protection | ||
| runtime_security_config: | ||
| enabled: true | ||
| # # Activity dump is currently being reworked and when it is enabled, it causes a lot of kernel events | ||
| # # By disabling it, we get more predictable results from the generated load. | ||
| activity_dump: | ||
| enabled: false | ||
| remote_configuration: | ||
| enabled: false |
61 changes: 61 additions & 0 deletions
test/regression/cases/quality_gate_security_mean_fs_load/experiment.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| optimization_goal: memory | ||
| erratic: false | ||
|
|
||
| target: | ||
| name: datadog-agent | ||
| cpu_allotment: 4 | ||
| # Set to 20% higher than the memory_usage check value | ||
| memory_allotment: 380 MiB | ||
|
|
||
| environment: | ||
| DD_API_KEY: a0000001 | ||
| DD_HOSTNAME: smp-regression | ||
|
|
||
| profiling_environment: | ||
| # internal profiling | ||
| DD_INTERNAL_PROFILING_ENABLED: true | ||
| DD_SYSTEM_PROBE_INTERNAL_PROFILING_ENABLED: true | ||
| DD_APM_INTERNAL_PROFILING_ENABLED: true | ||
| # run all the time | ||
| DD_SYSTEM_PROBE_INTERNAL_PROFILING_PERIOD: 1m | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_PERIOD: 1m | ||
| DD_INTERNAL_PROFILING_PERIOD: 1m | ||
| DD_SYSTEM_PROBE_INTERNAL_PROFILING_CPU_DURATION: 1m | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_CPU_DURATION: 1m | ||
| DD_INTERNAL_PROFILING_CPU_DURATION: 1m | ||
| # destination | ||
| DD_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket | ||
| DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket | ||
| # tags | ||
| DD_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_mean_fs_load | ||
| DD_SECURITY_AGENT_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_mean_fs_load | ||
| DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_mean_fs_load | ||
|
|
||
| DD_INTERNAL_PROFILING_BLOCK_PROFILE_RATE: 10000 | ||
| DD_INTERNAL_PROFILING_DELTA_PROFILES: true | ||
| DD_INTERNAL_PROFILING_ENABLE_GOROUTINE_STACKTRACES: true | ||
| DD_INTERNAL_PROFILING_MUTEX_PROFILE_FRACTION: 10 | ||
|
|
||
| # ddprof options | ||
| DD_PROFILING_EXECUTION_TRACE_ENABLED: true | ||
| DD_PROFILING_EXECUTION_TRACE_PERIOD: 1m | ||
| DD_PROFILING_WAIT_PROFILE: true | ||
|
|
||
| checks: | ||
| - name: memory_usage | ||
| description: "Memory usage quality gate. This puts a bound on the total memory usage for CWS workloads." | ||
| bounds: | ||
| series: total_pss_bytes | ||
| # When updating this, update the memory_allotment in the target section to 20% higher. | ||
| upper_bound: "320 MiB" | ||
|
|
||
| - name: cpu_usage | ||
| description: "CPU usage quality gate. This puts a bound on the total average collector millicore usage." | ||
| bounds: | ||
| series: avg(total_cpu_usage_millicores) | ||
| upper_bound: 70 | ||
|
|
||
| report_links: | ||
| - text: "bounds checks dashboard" | ||
| link: "https://app.datadoghq.com/dashboard/vz3-jd5-bdi?fromUser=true&refresh_mode=paused&tpl_var_experiment%5B0%5D={{ experiment }}&tpl_var_job_id%5B0%5D={{ job_id }}&view=spans&from_ts={{ start_time_ms }}&to_ts={{ end_time_ms }}&live=false" |
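
Both checks in this file are upper bounds on a captured series, and the CPU check is applied to `avg(total_cpu_usage_millicores)`. The sketch below shows one plausible reading of such a check in Python. The sample values are invented for illustration, and the function is a hypothetical stand-in, not SMP's actual evaluator.

```python
from statistics import fmean


def passes_upper_bound(samples: list[float], upper_bound: float) -> bool:
    """Return True when the mean of the samples does not exceed the bound."""
    if not samples:
        raise ValueError("no samples captured")
    return fmean(samples) <= upper_bound


if __name__ == "__main__":
    # Hypothetical millicore samples; real values come from the SMP capture.
    cpu_millicores = [55.0, 62.5, 71.0, 58.5]
    print(passes_upper_bound(cpu_millicores, upper_bound=70))  # True
```

In this sketch a single sample above the bound (71.0 here) does not fail the check, because only the average is compared against `upper_bound`.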
`remote_configuration.enabled` is currently set at the top level of `system-probe.yaml`, but CWS remote-config gating is read from `runtime_security_config.remote_configuration.enabled` (and then combined with the global `remote_configuration.enabled`) in `pkg/security/config/config.go` (`isRemoteConfigEnabled`, lines 735-748). With this placement, the CWS-specific flag stays at its default `true`, so these quality gates still run with RC enabled, which can introduce remote policy/config traffic and non-deterministic overhead instead of the intended local-policy-only baseline. The same pattern appears in the other two new security quality-gate cases as well.
This goes against the guidance I was provided by CWS. We can investigate later - non blocker.
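
The placement concern raised in the review comment comes down to which key is looked up. The sketch below uses plain Python dictionaries in place of parsed YAML. `get_nested` is a hypothetical helper, and the default of `True` is taken from the review comment rather than verified against the agent source.

```python
from typing import Any


def get_nested(config: dict[str, Any], dotted_key: str, default: Any) -> Any:
    """Walk a dotted key through nested dicts, returning default if any part is missing."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


CWS_RC_KEY = "runtime_security_config.remote_configuration.enabled"

# Placement the review comment attributes to the committed system-probe.yaml:
# remote_configuration sits at the top level.
as_committed = {
    "runtime_security_config": {"enabled": True},
    "remote_configuration": {"enabled": False},
}

# Placement the reviewer describes: nested under runtime_security_config.
as_suggested = {
    "runtime_security_config": {
        "enabled": True,
        "remote_configuration": {"enabled": False},
    },
}

if __name__ == "__main__":
    print(get_nested(as_committed, CWS_RC_KEY, default=True))  # True
    print(get_nested(as_suggested, CWS_RC_KEY, default=True))  # False
```

The author's reply notes that the committed placement follows guidance from the CWS team, so this sketch only illustrates the reviewer's claim and does not settle which placement is correct.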