diff --git a/.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md b/.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md
new file mode 100644
index 000000000000..7668211e3807
--- /dev/null
+++ b/.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md
@@ -0,0 +1,170 @@
---
name: analyze-quality-gate-security-mean-fs-load
description: Compare production CWS metrics with SMP regression experiment metrics from quality_gate_security_mean_fs_load and quality_gate_security_no_fs_load
user_invocable: true
allowed-tools: Bash, Read, Skill
---

# analyze-quality-gate-security-mean-fs-load

Compare production CWS `perf_buffer.events.write` open rates against what the `quality_gate_security_mean_fs_load` lading config generates and what SMP captures, using `quality_gate_security_no_fs_load` as the no-load baseline.

Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load:

- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events.
- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load.

The difference between them isolates the generator's contribution, which should match the production workload above background noise.

The primary signal is `event_type:open` — it is the most common event type in production CWS workloads. Open events have high background noise from always-on VFS hooks (`hook_do_dentry_open`, `hook_vfs_open` under the `"*"` probe selector), but the `no_fs_load` experiment directly measures this noise floor, enabling clean subtraction.
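The subtraction described above is simple arithmetic, but it is worth showing explicitly; the rates below are hypothetical stand-ins, not measured values:

```shell
# Hypothetical open rates — stand-ins for real Step 3 measurements.
no_load_open=42.0    # no_fs_load: background noise floor
loaded_open=158.5    # mean_fs_load: noise floor + generator
awk -v a="$loaded_open" -v b="$no_load_open" \
  'BEGIN { printf "generator contribution: %.1f events/sec\n", a - b }'
# → generator contribution: 116.5 events/sec
```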
## Step 1: Invoke /explain-lading-config

Call the `explain-lading-config` skill on `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`. Record the expected open event rate: `open_per_second` events/sec (each open operation produces 1 open event). Note `rename_per_second` but do not include it in the primary comparison.

Note: `quality_gate_security_no_fs_load` has `generator: []` (no load generator). There is nothing to explain for it — its purpose is to measure the no-load baseline.

## Step 2: Verify pup auth

Run `pup auth status`. If expired, run `pup auth login`. If auth fails entirely, note that Datadog MCP tools are available as a fallback for read-only queries.

## Step 3: Identify the latest job for each experiment and query it exclusively

SMP runs multiple replicas per experiment; each is tagged with a unique `job_id`. Pin the analysis to one specific run by finding the latest `job_id` per experiment and filtering on it. This avoids time-window heuristics and the 300 s rollup gap-fill that deflates means when a short capture only partially occupies a bucket.

### 3a. List jobs grouped by `job_id`

Run grouped queries over the last 1 day (widen to 2d, then 7d only if nothing is returned).
These are **inventory-only** — do not use their values, only their `job_id` scopes and timestamps: + +```bash +# mean_fs_load — jobs +pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load} by {job_id}.as_rate()' --from 1d --to now + +# no_fs_load — jobs +pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load} by {job_id}.as_rate()' --from 1d --to now +``` + +For each returned series, read its `scope` (contains `job_id:`) and its `pointlist`. Null-safe: drop points whose value is `None`. Per job, record the timestamp of the **last non-zero point**. The `job_id` with the most recent last-non-zero timestamp is "the latest job" for that experiment. + +### 3b. Re-query each latest `job_id` exclusively + +Add `job_id:` to the tag filter. Pin the query window to the job itself — not a relative range like `--from 1h`. Using a relative range is non-deterministic across runs: it shifts with wall-clock time, and the API rollup may include or exclude an edge bucket depending on where "now" lands, producing different pointlists and different means for the same `job_id`. + +From Step 3a you have the first and last non-zero epoch-ms (`FIRST_MS`, `LAST_MS`) for each latest `job_id`. Derive a deterministic window: + +```bash +FROM_TS=$(( FIRST_MS / 1000 - 60 )) # 1 minute of padding before +TO_TS=$(( LAST_MS / 1000 + 60 )) # 1 minute of padding after +``` + +Pass these as absolute Unix seconds to `pup`. The window width should stay under ~1 h so the API still returns 20-second interval data. 
Run in parallel:

```bash
# mean_fs_load — latest job
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<JOB_ID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<JOB_ID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"

# no_fs_load — latest job
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<JOB_ID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<JOB_ID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
```

The `.as_rate()` modifier guarantees the result is in events/sec. Both this metric and the production metric are type `rate` (statsd_interval=10), so `.as_rate()` is a no-op today — but it documents intent and protects against metadata changes.

Compute `mean` over **non-zero, non-null** points only. Record the `job_id`, the first and last non-zero timestamps (capture window), the interval returned, the data points count, and the raw sum used for the mean.

When reporting `capture_first_ts` / `capture_last_ts`, do not compute ISO strings by hand — use the shell:

```bash
date -u -r $(( MS / 1000 )) +%Y-%m-%dT%H:%M:%SZ
```

Always print the raw epoch-ms alongside the ISO string so the reader can verify the conversion.
+ +Do not include min/max in the report. + +### 3c. Sanity checks + +- If fewer than ~5 non-zero points come back for a latest job, the capture may be in flight or interrupted. Try the second-most-recent `job_id` and report both so the reviewer can judge. +- If the `mean_fs_load` and `no_fs_load` latest `job_id`s ran more than 24 h apart, flag the staleness — background noise floor drifts over time. +- Never fall back to wider windows with coarser rollups (≥300 s intervals) as a substitute — rollup gap-fill deflates per-job means when a capture only partially occupies a bucket. + +### 3d. Anti-pattern (why `job_id` is authoritative) + +Do **not** identify runs by time heuristics (contiguous non-zero clusters, gap-detection, "last N minutes"). A single SMP run at 20 s resolution still contains transient zero buckets that masquerade as run boundaries once rolled up, and multiple replicas of the same experiment run concurrently, so any time-slice may contain overlapping jobs. The `job_id` tag is the only authoritative grouping for a single run. + +## Step 4: Query production metric (weekly mean) + +Query production data over the last 7 days to get the weekly mean. Run both queries in parallel: + +```bash +# Primary — open-only, weekly mean +pup metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{event_type:open}.as_rate()' --from 7d --to now + +# Context — all file activity, weekly mean +pup metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{category:file_activity}.as_rate()' --from 7d --to now +``` + +All queries use `.as_rate()` so every value in the comparison chain is in the same unit (events/sec): +- Lading config: `open_per_second` × 1 = events/sec +- SMP captured: `.as_rate()` → events/sec +- Production: `.as_rate()` → events/sec + +Compute the mean from the weekly pointlist values. Do not include min/max in the report. 
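The non-zero, non-null mean used in Steps 3b and 4 can be sketched as an awk filter. The two-column `epoch_ms value` input shape here is an assumption — adapt it to however the pointlist is actually dumped:

```shell
# Drop nulls ("None") and zero buckets, then report sum / n = mean.
mean_nonzero() {
  awk '$2 != "None" && $2 + 0 != 0 { sum += $2; n++ }
       END { if (n) printf "sum=%.1f / n=%d = %.2f\n", sum, n, sum / n }'
}
printf '1000 0\n2000 None\n3000 4\n4000 8\n' | mean_nonzero
# → sum=12.0 / n=2 = 6.00
```

Printing the sum and count alongside the mean matches the `sum / n = mean` reporting format used in the Step 6 report.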
+ +## Step 5: Compare and analyze + +The primary comparison uses `event_type:open` values only. + +All values must be in **events/sec** (guaranteed by `.as_rate()` on metric queries, and by the explain-lading-config breakdown for lading). + +**Primary table — open events/sec:** + +| Source | Value (events/sec) | Description | +|--------|---------------------|-------------| +| Lading config | from Step 1 | `open_per_second` × 1 | +| SMP — no_fs_load (open) | from Step 3 | Background open noise with custom policy, no generator | +| SMP — mean_fs_load (open) | from Step 3 | Open rate with custom policy and generator running | +| Generator contribution | computed | mean_fs_load - no_fs_load | +| Internal production data per-host avg (open, weekly) | from Step 4 | production `event_type:open` `.as_rate()` weekly mean | + +**Supplementary context** (not used for tuning decisions): + +| Source | Value (events/sec) | Description | +|--------|---------------------|-------------| +| SMP — no_fs_load (all file activity) | from Step 3 | no-load `category:file_activity` — background noise floor | +| SMP — mean_fs_load (all file activity) | from Step 3 | loaded `category:file_activity` | +| Internal production data per-host avg (all file activity, weekly) | from Step 4 | production `category:file_activity` weekly mean | + +Analysis: +- **No-load baseline check**: The `no_fs_load` open rate captures background noise from always-on VFS hooks while the custom policy is in effect. Record this value as the noise floor. +- **Generator contribution check**: `generator_contribution = mean_fs_load_open - no_fs_load_open` reflects the generator's observable effect on CWS. Note that this need not equal `open_per_second` — one lading syscall can produce multiple CWS events (and vice versa). +- **Production validity check**: Compare the SMP `mean_fs_load` open rate against the internal production data per-host weekly open average. If they diverge, flag the gap. 
- Note the `category:file_activity` totals for reference.

## Step 6: Output report

Print a markdown report answering "does the lading config reflect the production open workload?" with these sections:

1. **Lading config** — configured open rate (`open_per_second` × 1 = events/sec)
2. **SMP No-Load Baseline (no_fs_load, latest job `<job_id>`)** — open rate and file activity with CWS + custom policy but no generator, showing the job_id, capture window (first → last non-zero epoch-ms with ISO derived via `date -u -r`), interval, data points count, and the mean expressed as `sum=<sum> / n=<n> = <mean>` so the reader can reproduce the arithmetic without trusting the model
3. **SMP Loaded (mean_fs_load, latest job `<job_id>`)** — open rate and file activity with CWS + custom policy + generator, showing the job_id, capture window (epoch-ms + `date -u -r` derived ISO), interval, data points count, and the mean expressed as `sum=<sum> / n=<n> = <mean>`
4. **Generator Contribution** — delta between mean_fs_load and no_fs_load for open events (between the two latest jobs)
5. **Internal production data per-host avg (open, weekly)** — production open rate weekly mean

Followed by a supplementary note showing `category:file_activity` totals from both SMP experiments and production for reference.

Followed by a one-line assessment: match, mismatch, or insufficient data.

## Step 7: Propose lading config changes

If a mismatch was found, propose concrete changes to `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`:

1. **Target**: the production open weekly mean from Step 4.
2. **Direction of change**: adjust `open_per_second` up or down to push `mean_fs_load_open` toward the target. Do not assume a 1:1 mapping between lading syscalls and CWS events — the security-agent can observe multiple kernel events per lading operation (e.g. a single rename may surface more than two events) and may dedupe or filter others.
Treat the lading-to-CWS ratio as empirical, derived from the specific `job_id` used in Step 3b — cite the job_id alongside the ratio so the tuning decision is traceable.
3. **Show the diff** — print the exact YAML change (old value → new value) for `open_per_second`. Do not adjust `rename_per_second` — rename events are not the primary signal.
4. **Offer to apply** — ask whether to edit the lading config file directly.

If data was insufficient (e.g. the latest job had too few non-zero points), offer to use the second-most-recent `job_id` before proposing changes. Do not fall back to wider time windows with coarser rollups.
diff --git a/.claude/skills/explain-lading-config/SKILL.md b/.claude/skills/explain-lading-config/SKILL.md
new file mode 100644
index 000000000000..6f646e7d16cd
--- /dev/null
+++ b/.claude/skills/explain-lading-config/SKILL.md
@@ -0,0 +1,225 @@
---
name: explain-lading-config
description: Explains a lading.yaml config file from the regression test suite, using the lading Rust source as ground truth for field meanings and defaults.
user_invocable: true
argument-hint: "[experiment name]"
---

# explain-lading-config

Explain what a lading regression test config does, grounded in lading source code.

## Quick Start

```bash
# 1. Verify the lading checkout exists and is on a known branch
bash .claude/skills/explain-lading-config/scripts/validate-lading-checkout.sh

# 2. Resolve $ARGUMENTS to a lading.yaml path (exact/substring/glob/path)
bash .claude/skills/explain-lading-config/scripts/resolve-lading-config.sh "$ARGUMENTS"

# 3. Read the resolved file, then grep source structs in parallel:
grep -n 'pub struct Config\|pub enum Config\|pub enum\|fn default_\|impl Default for\|#\[serde(default' \
  ~/dd/lading/lading/src/generator/<type>.rs
```

Then explain with defaults resolved to concrete values (not function names).
Full workflow below.
## Step 1: Validate lading checkout

Run `.claude/skills/explain-lading-config/scripts/validate-lading-checkout.sh`.

- Exit 0: script prints the current branch on stdout. If it is not `main`, warn
  the user that explanations are grounded in a non-main branch, then continue.
- Exit non-zero: the script prints a suggested `git clone` command on stderr.
  Relay that to the user and stop.

Override the checkout location with `LADING_DIR` if needed.

## Step 2: Determine target file

Use `.claude/skills/explain-lading-config/scripts/resolve-lading-config.sh` to
avoid ad-hoc matching. The script handles path-like inputs, substring case
names, and shell globs (`*`, `?`), and it extracts the experiment name by
walking up to the directory above `lading/` — which supports both the common
`<case>/lading/lading.yaml` layout and split-mode
`<case>/<sub>/lading/lading.yaml` layouts (in the latter, the script
reports `<case>/<sub>` so it is not confused with the non-split layout).

**If `$ARGUMENTS` is provided:** run `resolve-lading-config.sh "$ARGUMENTS"`.
- Exit 0: stdout is the resolved absolute path; read it.
- Exit 3 (ambiguous): stderr lists candidates.
  - **≤ 4 candidates:** use `AskUserQuestion` to pick one, then read that
    path.
  - **> 4 candidates** (a broad substring like `i` can match 20+): do not
    try to force them into `AskUserQuestion`. Print the experiment names
    as a short bulleted list and ask the user to narrow the query and
    re-invoke `/explain-lading-config <name>`.
- Exit 2 (not found): stderr may include "did you mean?" suggestions — if
  present, offer the suggestions to the user via `AskUserQuestion` (up to
  4 options) or as a short list; if not, relay the error and stop.
- Exit 4 (wrong repo): the script is being run from outside the agent repo.
  Relay the error verbatim and stop — the user needs to `cd` into the repo.

**If the resolved path contains `/x-disabled-cases/`** (e.g.
`test/regression/x-disabled-cases/ddot_traces/lading/lading.yaml`), flag
this explicitly in the explanation — the experiment exists on disk but is
not currently executed by SMP. Otherwise a user may assume it's live.

**Reading very large configs:** multi-sender configs (e.g.
`uds_dogstatsd_20mb_12k_contexts_20_senders`, ~870 lines) are usually
block-copies of one template with a few fields varying (typically only
`seed`). Before a full `Read`, check size and duplication:

```bash
wc -l <file>                                 # scale check
grep -c '^  - ' <file>                       # top-level list entries
yq '.generator | length' <file> 2>/dev/null  # if yq is present
```

For highly-duplicated configs, `Read` only the first block (plus the
blackhole/target_metrics sections) and report the generator as
"N identical copies, seed differs" instead of walking every block.
Spot-check one later block to confirm uniformity.

**If `$ARGUMENTS` is omitted:** run `resolve-lading-config.sh` with no
argument. It emits `<name>\t<path>` lines for every discovered config.

- **≤ 4 configs:** offer them via `AskUserQuestion` and read the selected
  path.
- **> 4 configs** (the usual case — there are ~30 in this repo): do NOT
  stuff them into `AskUserQuestion`'s 4-option cap with ad-hoc categories.
  Print the experiment names as a plain bulleted list to the user and ask
  them to type the name (or re-invoke the skill with `/explain-lading-config
  <name>`). This costs one round-trip instead of two.

## Step 3: Read the lading codebase for context

Before explaining, read the relevant source files from the lading checkout
to understand config fields. Do NOT rely on embedded knowledge — always
read the source. Source paths below use `~/dd/lading/` for readability;
substitute `$LADING_DIR` (from Step 1) if the user overrode the location.
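The size/duplication probe from "Reading very large configs" above can be exercised end-to-end on a synthetic three-sender config (the generator type and structure here are illustrative, not a real case file):

```shell
# Build a stand-in multi-sender config: 3 near-identical generator blocks.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
generator:
  - unix_datagram:
      seed: [1, 1]
  - unix_datagram:
      seed: [2, 2]
  - unix_datagram:
      seed: [3, 3]
EOF
wc -l < "$cfg"            # scale check
grep -c '^  - ' "$cfg"    # → 3 top-level generator entries
```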
If an expected source file doesn't exist (lading may have renamed or
restructured), fall back to `grep -rln 'pub struct Config' ~/dd/lading/lading/src/generator/` (or `blackhole/`, `target_metrics/`) to locate the current file, then proceed as normal. Mention the rename in the
explanation so the user knows the skill's default paths are out of date.

First, parse the config to see which sections are populated (`generator`,
`blackhole`, `target_metrics`). Only read source files for sections that
actually exist. In particular: **if `generator: []`, skip the generator
source reads entirely** — there is nothing to ground.

1. If `generator` has entries: read `~/dd/lading/lading/src/generator.rs` to
   identify generator types, then for each type used read
   `~/dd/lading/lading/src/generator/<type>.rs` (config struct, field
   meanings, defaults).
2. If a payload variant is referenced (e.g. `dogstatsd`,
   `opentelemetry_metrics`), find its module. The variant-to-module
   mapping lives in `~/dd/lading/lading_payload/src/lib.rs` — grep for
   the variant's PascalCase enum name (e.g. `OpentelemetryMetrics`) and
   follow the `crate::…` path it points to. Common mappings:
   - `dogstatsd` → `lading_payload/src/dogstatsd.rs`
   - `opentelemetry_metrics` → `lading_payload/src/opentelemetry/metric.rs`
   - `opentelemetry_logs` → `lading_payload/src/opentelemetry/log.rs`
   - `datadog_logs` → `lading_payload/src/datadog_logs.rs`
   **Variant serialization forms:**
   - `variant: "syslog5424"` (plain string) — the enum variant carries no
     config fields (unit/empty struct). There are no knobs to explain; the
     module itself encodes all behaviour.
   - `variant: { opentelemetry_metrics: {} }` (mapping with empty body) —
     the variant has a `Config` struct and is using `Config::default()`.
     Follow `impl Default for Config` and any nested `Default` impls.
- `variant: { dogstatsd: { contexts: …, kind_weights: … } }` — explicit
     field overrides; report them alongside the defaults for any omitted
     sibling fields.
3. If blackholes are configured, read:
   - `~/dd/lading/lading/src/blackhole.rs` — blackhole enum
   - `~/dd/lading/lading/src/blackhole/<type>.rs` — per-blackhole config structs
4. If `target_metrics` has entries, read:
   - `~/dd/lading/lading/src/target_metrics/prometheus.rs` — `uri`, `metrics`, `tags`
   - `~/dd/lading/lading/src/target_metrics/expvar.rs` — `uri`, `vars`, `tags`
   (Other scrapers live alongside in `target_metrics/`.)

Read these files in parallel where possible.

### Reading strategy: grep before Read

Lading's source files can be hundreds of lines. To ground defaults without
reading whole files, use this invariant: every default in lading follows the
pattern `#[serde(default = "default_foo")]` → `fn default_foo() -> T { ... }`.
`Default` impls (for payload types like `KindWeights`, `MetricWeights`) are
adjacent to their struct definitions.

Efficient approach for a generator/blackhole/target_metrics type:

```bash
# Locate top-level Config + all named defaults + nested enum variants in one pass
grep -n 'pub struct Config\|pub enum Config\|pub enum\|fn default_\|impl Default for\|#\[serde(default' \
  ~/dd/lading/lading/src/generator/<type>.rs
```

Include `pub enum` — some generators' top-level `Config` is an enum
(e.g. `file_gen::Config` discriminates on `traditional` / `logrotate` /
`logrotate_fs`), and several structs hold nested enums (`http::Method`,
`blackhole::datadog::Variant`) that the YAML maps into with nested keys.

Then `Read` only the line ranges that matter (struct/enum body + default fns).
Reserve full-file reads for cases where the struct body references types
you still need to understand.
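The grep-then-read flow above — find the named default fn, then read only its body — can be demonstrated on a synthetic Rust module (the struct, field, and literal are made up for illustration; lading's real sources will differ):

```shell
# Stand-in for a lading generator source file.
src=$(mktemp)
cat > "$src" <<'EOF'
pub struct Config {
    #[serde(default = "default_open_per_second")]
    pub open_per_second: u32,
}

fn default_open_per_second() -> u32 {
    100
}
EOF
# Resolve the serde default to its concrete literal without a full read.
fn=$(grep -o 'default = "[a-z_]*"' "$src" | head -1 | cut -d'"' -f2)
sed -n "/fn $fn/,/^}/p" "$src"   # prints the fn body, exposing the literal 100
```

This is the "resolve the default to a concrete value, not a function name" rule from Step 4, mechanized.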
+ +## Step 4: Explain the config + +Using the lading source as ground truth, provide a structured explanation: + +### Generator summary + +For each generator entry, include the fields relevant to that generator type: +- Type and protocol/variant +- Target endpoint (`addr`, `path`, `target_uri`) — if network-based +- Throughput (`bytes_per_second`) — if network-based +- Parallel connections or sender count — if applicable +- Payload characteristics (contexts, tag counts, metric type weights, body sizes, `kind_weights`) — if applicable +- Operation rates (e.g. `open_per_second`, `rename_per_second`) — for filesystem generators +- Container churn rate = `number_of_containers / max_lifetime_seconds` — for container generators (report containers recycled per second, since that's what the agent sees) +- Default values for any omitted fields. **Always resolve the default to a + concrete value**, not just the function name — the user wants to know + what actually runs. Follow `#[serde(default = "default_foo")]` → the body + of `fn default_foo()`, or the `impl Default` block, and report the literal + (e.g. `block_cache_method: Fixed (via lading_payload::block::default_cache_method)`). + If the default is a nested struct with its own defaults, recurse one level; + cite further nested defaults by path rather than expanding the whole tree. +- Cache config (`maximum_prebuild_cache_size_bytes`, `block_cache_method`) — if applicable + +Skip fields that don't exist on the generator type. The Rust `Config` struct is authoritative for which fields exist. + +### Aggregate load + +Summarize total load across all generators. Pick the right unit for the +generator type — don't invent a bytes/s number for non-network load: +- **Network** (`http`, `tcp`, `udp`, `unix_*`, `grpc`, `splunk_hec`): sum + `bytes_per_second` across generators. +- **Filesystem** (`file_tree`, `file_gen`): report operation rates + (`open_per_second`, `rename_per_second`) or the load profile. 
- **Container** (`container`): report the churn rate
  (`number_of_containers / max_lifetime_seconds` containers recycled per
  second); throughput isn't meaningful.
- **Mixed**: report each dimension separately.

### Blackhole sinks

What endpoints absorb target output, any simulated latency.

### Target metrics

What telemetry is scraped from the target (if configured). Per scraper:
type (`prometheus`, `expvar`, …), URI, and any tags (e.g. `sub_agent`).
**Do not enumerate large var lists verbatim** — when `vars:` has many
entries (common for `expvar`), summarize by category (forwarder,
serializer, writers, `memstats/*`, etc.) and cite the line range for
follow-up. A config with no `generator` but heavy `target_metrics` is
typically an idle-baseline experiment measuring the agent's self-cost.

### Source references

Cite the specific lading source files read, with relative paths from `~/dd/lading/`, so the user can dig deeper.
diff --git a/.claude/skills/explain-lading-config/scripts/resolve-lading-config.sh b/.claude/skills/explain-lading-config/scripts/resolve-lading-config.sh
new file mode 100755
index 000000000000..41f445e4c9a3
--- /dev/null
+++ b/.claude/skills/explain-lading-config/scripts/resolve-lading-config.sh
@@ -0,0 +1,216 @@
#!/usr/bin/env bash
# Resolve the target lading.yaml for the explain-lading-config skill.
#
# Usage:
#   resolve-lading-config.sh [ARG]
#
# When ARG is omitted, emits one `<name>\t<path>` line per
# discovered lading.yaml (tab-separated). Callers list these to the user.
#
# When ARG is provided, resolves it to exactly one path and prints that path.
# Exits non-zero with candidate paths on stderr if the argument is ambiguous,
# or with "not found" if nothing matches.
#
# Accepted ARG forms:
#   - absolute or relative path to a lading.yaml
#   - path containing '/' — treated as a file path
#   - substring of a case/experiment name (plain substring, no shell glob
#     characters required)
#   - glob with '*' or '?' — matched against experiment names, not paths
#
# Experiment name derivation: the first directory above `lading/` in the
# config path. This handles both the common `<case>/lading/lading.yaml`
# layout and split-mode `<case>/<sub>/lading/lading.yaml` layouts — the
# reported name is `<case>` for the former and `<case>/<sub>` for the latter,
# with the `<case>/` prefix disambiguating split-mode rows in the listing.

set -euo pipefail

repo_root() {
  git rev-parse --show-toplevel 2>/dev/null || pwd
}

# Exit early with a clear error if we cannot locate the regression suite.
# Otherwise a user running this from the wrong directory (e.g. /tmp) would
# see a silent "no matches" for every query.
require_regression_dir() {
  local root
  root="$(repo_root)"
  if [[ ! -d "$root/test/regression" ]]; then
    cat >&2 <<EOF
cannot find test/regression under $root
run this script from inside the datadog-agent repo
EOF
    exit 4
  fi
}

find_configs() {
  find "$(repo_root)/test/regression" -type f -name lading.yaml -print0
}

# Standard: .../cases/<case>/lading/lading.yaml -> <case>
# Split:    .../cases/<case>/<sub>/lading/lading.yaml -> <case>/<sub>
display_name() {
  local path="$1"
  local lading_dir parent_of_lading case_dir
  lading_dir="$(dirname "$path")"            # .../<case>/lading or .../<case>/<sub>/lading
  parent_of_lading="$(dirname "$lading_dir")"
  case_dir="$(dirname "$parent_of_lading")"

  # Detect split-mode by checking whether the grandparent of lading.yaml
  # lives under a directory named `cases` (or `x-disabled-cases`). If so
  # it's the standard layout; otherwise it's split-mode and we include
  # one more segment of context.
  local case_parent_name
  case_parent_name="$(basename "$case_dir")"
  if [[ "$case_parent_name" == "cases" || "$case_parent_name" == "x-disabled-cases" ]]; then
    basename "$parent_of_lading"
  else
    printf '%s/%s\n' "$(basename "$case_dir")" "$(basename "$parent_of_lading")"
  fi
}

list_all() {
  local path
  while IFS= read -r -d '' path; do
    printf '%s\t%s\n' "$(display_name "$path")" "$path"
  done < <(find_configs) | sort
}

# Render a `<name>\t<path>` stream for humans — annotates disabled rows.
# The tab-separated layout is preserved (`<name>\t<path>\t(disabled)` for
# disabled rows), so tabular parsers still work on the first two fields.
+annotate_for_display() { + local name path + while IFS=$'\t' read -r name path; do + [[ -z "$name" ]] && continue + if [[ "$path" == */x-disabled-cases/* ]]; then + printf '%s\t%s\t%s\n' "$name" "$path" "(disabled)" + else + printf '%s\t%s\n' "$name" "$path" + fi + done +} + +# Emit up to three "did you mean?" suggestions on stderr. +# Scoring: count of tokens from the query (split on `_`, space, or `-`) that +# appear as substrings of the candidate name. Matching is case-insensitive +# because all experiment names are lowercase. Ties broken by shorter name. +suggest_near_matches() { + local query="$1" all="$2" + local lower_query + lower_query="$(printf '%s' "$query" | tr '[:upper:]' '[:lower:]')" + local IFS_=$IFS + # shellcheck disable=SC2206 + IFS=$' \t_-' read -ra tokens <<< "$lower_query" + IFS=$IFS_ + local scored="" name path score token + while IFS=$'\t' read -r name path; do + [[ -z "$name" ]] && continue + score=0 + for token in "${tokens[@]}"; do + [[ -n "$token" && "$name" == *"$token"* ]] && score=$((score + 1)) + done + if [[ "$score" -gt 0 ]]; then + scored+="$score"$'\t'"$name"$'\n' + fi + done <<< "$all" + if [[ -n "$scored" ]]; then + echo "did you mean?" >&2 + printf '%s' "$scored" | sort -k1,1rn -k2,2 | head -3 | cut -f2 | sed 's/^/ /' >&2 + fi +} + +resolve_one() { + local arg="$1" + + # 1) Direct path — resolve and return if the file exists AND looks like a + # lading config. We reject arbitrary existing files (e.g. /etc/hosts) + # to avoid the downstream explainer operating on something unrelated. + if [[ "$arg" == */* || "$arg" == *.yaml ]]; then + # Expand a leading `~/` by hand — [[ -f ]] does not tilde-expand a + # quoted literal, and `${arg#~/}` has subtle tilde-expansion behaviour + # in the pattern, so just slice off the first two characters. + case "$arg" in + "~/"*) arg="$HOME/${arg:2}" ;; + "~") arg="$HOME" ;; + esac + if [[ ! 
-f "$arg" ]]; then + echo "not found: $arg" >&2 + return 2 + fi + if [[ "$(basename "$arg")" != "lading.yaml" ]] \ + && ! head -200 "$arg" 2>/dev/null | grep -qE '^(generator|blackhole|target_metrics)\s*:'; then + echo "not a lading config: $arg" >&2 + echo "expected a file named 'lading.yaml' or one containing a top-level" >&2 + echo "'generator:', 'blackhole:', or 'target_metrics:' key" >&2 + return 2 + fi + printf '%s\n' "$(cd "$(dirname "$arg")" && pwd)/$(basename "$arg")" + return 0 + fi + + # 2) Match against experiment names. + # + # Matching precedence: + # a. Exact name match wins outright (even if the arg is a substring of + # other names — e.g. `uds_dogstatsd_to_api` should not be ambiguous + # just because `uds_dogstatsd_to_api_v3` exists). + # b. Else if the arg contains glob metachars, shell-glob against names. + # c. Else plain substring match (so `security_mean` matches + # `quality_gate_security_mean_fs_load`). + # Matching is case-insensitive — all experiment names are lowercase in + # the repo, so treating the arg as lowercase adds ergonomics (typing + # `QUALITY_GATE_IDLE` still works) without ambiguity. + local all exact matches path name lower_arg lower_name + all="$(list_all)" + exact="" + matches="" + lower_arg="$(printf '%s' "$arg" | tr '[:upper:]' '[:lower:]')" + while IFS=$'\t' read -r name path; do + lower_name="$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')" + if [[ "$lower_name" == "$lower_arg" ]]; then + exact="$name"$'\t'"$path"$'\n' + fi + if [[ "$lower_arg" == *[*?[]* ]]; then + # shellcheck disable=SC2053 + [[ "$lower_name" == $lower_arg ]] && matches="$matches$name"$'\t'"$path"$'\n' + else + [[ "$lower_name" == *"$lower_arg"* ]] && matches="$matches$name"$'\t'"$path"$'\n' + fi + done <<< "$all" + + if [[ -n "$exact" ]]; then + printf '%s' "$exact" | cut -f2 + return 0 + fi + + local count + count="$(printf '%s' "$matches" | grep -c . 
|| true)"
  if [[ "$count" -eq 0 ]]; then
    echo "no lading.yaml matches '$arg'" >&2
    suggest_near_matches "$arg" "$all" >&2
    return 2
  fi
  if [[ "$count" -gt 1 ]]; then
    echo "multiple matches for '$arg':" >&2
    printf '%s' "$matches" | annotate_for_display >&2
    return 3
  fi
  printf '%s' "$matches" | cut -f2
}

require_regression_dir

if [[ $# -eq 0 || -z "${1-}" ]]; then
  list_all | annotate_for_display
else
  resolve_one "$1"
fi
diff --git a/.claude/skills/explain-lading-config/scripts/validate-lading-checkout.sh b/.claude/skills/explain-lading-config/scripts/validate-lading-checkout.sh
new file mode 100755
index 000000000000..4eaff9e5317e
--- /dev/null
+++ b/.claude/skills/explain-lading-config/scripts/validate-lading-checkout.sh
@@ -0,0 +1,31 @@
#!/usr/bin/env bash
# Validate that ~/dd/lading is a usable lading checkout.
#
# Exits 0 with the checkout's current branch printed on stdout when usable.
# Exits 1 with a suggested `git clone` command on stderr when the checkout
# is missing or not a git repo.
#
# Callers should print the stdout line to the user so they know which branch
# explanations will be grounded in, and warn if the branch is not `main`.

set -euo pipefail

LADING_DIR="${LADING_DIR:-$HOME/dd/lading}"

if [[ ! -d "$LADING_DIR" ]]; then
  cat >&2 <<EOF
no lading checkout at $LADING_DIR — clone one with:
  git clone https://github.com/DataDog/lading "$LADING_DIR"
EOF
  exit 1
fi

if ! git -C "$LADING_DIR" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  echo "lading checkout at $LADING_DIR is not a git repo" >&2
  exit 1
fi

branch="$(git -C "$LADING_DIR" branch --show-current)"
echo "$branch"
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 1efaf23a0a3d..654cff49dac6 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -897,6 +897,7 @@
 /test/regression/                                  @DataDog/single-machine-performance
 /test/regression/ebpf                              @DataDog/single-machine-performance @DataDog/ebpf-platform
 /test/regression/cases/docker_containers*          @DataDog/single-machine-performance @DataDog/container-integrations
+/test/regression/cases/quality_gate_security_*     @DataDog/single-machine-performance @DataDog/agent-security
 /tools/                                            @DataDog/agent-devx
 /tools/host-profiler/                              @DataDog/profiling-full-host
diff --git a/test/regression/cases/file_tree/datadog-agent/system-probe.yaml b/test/regression/cases/file_tree/datadog-agent/system-probe.yaml
deleted file mode 100644
index 49517e76a319..000000000000
--- a/test/regression/cases/file_tree/datadog-agent/system-probe.yaml
+++ /dev/null
@@ -1,2 +0,0 @@
-runtime_security_config:
-  enabled: true
diff --git a/test/regression/cases/file_tree/experiment.yaml b/test/regression/cases/file_tree/experiment.yaml
deleted file mode 100644
index f8c48ce7e0a3..000000000000
--- a/test/regression/cases/file_tree/experiment.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-optimization_goal: memory
-erratic: false
-
-target:
-  name: datadog-agent
-  cpu_allotment: 4
-  memory_allotment: 2GiB
-
-  environment:
-    DD_API_KEY: a0000001
-    DD_HOSTNAME: smp-regression
-    DD_RUNTIME_SECURITY_CONFIG_ENABLED: true
-    DD_RUNTIME_SECURITY_CONFIG_NETWORK_ENABLED: true
-    DD_RUNTIME_SECURITY_CONFIG_REMOTE_CONFIGURATION_ENABLED: true
-
-  profiling_environment:
-    # internal profiling
-    DD_INTERNAL_PROFILING_ENABLED: true
-    DD_SYSTEM_PROBE_INTERNAL_PROFILING_ENABLED: true
-    DD_APM_INTERNAL_PROFILING_ENABLED: true
-    # run all the time
-    DD_SYSTEM_PROBE_INTERNAL_PROFILING_PERIOD: 1m
-
DD_INTERNAL_PROFILING_PERIOD: 1m - DD_SYSTEM_PROBE_INTERNAL_PROFILING_CPU_DURATION: 1m - DD_INTERNAL_PROFILING_CPU_DURATION: 1m - # destination - DD_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket - DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket - # tags - DD_INTERNAL_PROFILING_EXTRA_TAGS: experiment:file_tree - DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_EXTRA_TAGS: experiment:file_tree - - DD_INTERNAL_PROFILING_BLOCK_PROFILE_RATE: 10000 - DD_INTERNAL_PROFILING_DELTA_PROFILES: true - DD_INTERNAL_PROFILING_ENABLE_GOROUTINE_STACKTRACES: true - DD_INTERNAL_PROFILING_MUTEX_PROFILE_FRACTION: 10 - - # ddprof options - DD_PROFILING_EXECUTION_TRACE_ENABLED: true - DD_PROFILING_EXECUTION_TRACE_PERIOD: 1m - DD_PROFILING_WAIT_PROFILE: true diff --git a/test/regression/cases/file_tree/lading/lading.yaml b/test/regression/cases/file_tree/lading/lading.yaml deleted file mode 100644 index 8c67b7fdc405..000000000000 --- a/test/regression/cases/file_tree/lading/lading.yaml +++ /dev/null @@ -1,20 +0,0 @@ -generator: - - file_tree: - seed: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, - 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131] - root: /lading-data/ - rename_per_second: 10 - -blackhole: - - http: - binding_addr: "127.0.0.1:9091" - body_variant: "nothing" - - http: - binding_addr: "127.0.0.1:9093" - body_variant: "nothing" - -target_metrics: - - prometheus: # core agent telemetry - uri: "http://127.0.0.1:5000/telemetry" - tags: - sub_agent: "core" diff --git a/test/regression/cases/quality_gate_security_idle/README.md b/test/regression/cases/quality_gate_security_idle/README.md new file mode 100644 index 000000000000..d958ef9c3971 --- /dev/null +++ b/test/regression/cases/quality_gate_security_idle/README.md @@ -0,0 +1,40 @@ +# Quality Gate CWS - Idle + +## Overview + +This quality gate experiment measures the Datadog Agent's resource consumption +with Workload Protection just turned on — no custom policy, no 
lading-generated +filesystem workload. It establishes the floor that every CWS customer pays +before any tuning. + +**The only enabled functionality is [workload protection](https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/).** + +## Owners + +- **Teams**: @team-k9-cws-agent +- **Slack Channel**: [#security-and-compliance-agent](https://dd.enterprise.slack.com/archives/CTNVD37T3) + +## Scenario + +Models a host that has just enabled CWS with no further configuration: + +- No `runtime-security.d/default.policy` override — the agent runs with whatever + policies ship by default. +- `generator: []` in lading — no application-generated filesystem events. + +The only events observed are background noise from default activity on the +host, filtered through the shipped approvers. + +This is the baseline "turn it on and leave it alone" measurement. The sibling +gates `quality_gate_security_no_fs_load` and `quality_gate_security_mean_fs_load` +both layer the experiment's `default.policy` on top and isolate the effect of +lading-generated filesystem load. + +## Enforcements + +- Memory usage is below a threshold +- Average CPU usage is below a threshold + +## Other Links + +- [CWS Quality Gates Notebook](https://app.datadoghq.com/notebook/13998267/cws-quality-gate) diff --git a/test/regression/cases/file_tree/datadog-agent/datadog.yaml b/test/regression/cases/quality_gate_security_idle/datadog-agent/datadog.yaml similarity index 63% rename from test/regression/cases/file_tree/datadog-agent/datadog.yaml rename to test/regression/cases/quality_gate_security_idle/datadog-agent/datadog.yaml index 3144449d0523..26f2feec63c3 100644 --- a/test/regression/cases/file_tree/datadog-agent/datadog.yaml +++ b/test/regression/cases/quality_gate_security_idle/datadog-agent/datadog.yaml @@ -1,17 +1,8 @@ auth_token_file_path: /tmp/agent-auth-token +dd_url: http://127.0.0.1:9091 + # Disable cloud detection. 
This stops the Agent from poking around the # execution environment & network. This is particularly important if the target # has network access. cloud_provider_metadata: [] - -logs_enabled: true - -dd_url: http://127.0.0.1:9091 -telemetry: - enabled: true - checks: '*' -process_config: - process_dd_url: http://localhost:9093 - process_collection: - enabled: false diff --git a/test/regression/cases/quality_gate_security_idle/datadog-agent/security-agent.yaml b/test/regression/cases/quality_gate_security_idle/datadog-agent/security-agent.yaml new file mode 100644 index 000000000000..3e3fa1317468 --- /dev/null +++ b/test/regression/cases/quality_gate_security_idle/datadog-agent/security-agent.yaml @@ -0,0 +1,4 @@ +# Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ +# Only enable workload protection +runtime_security_config: + enabled: true diff --git a/test/regression/cases/quality_gate_security_idle/datadog-agent/system-probe.yaml b/test/regression/cases/quality_gate_security_idle/datadog-agent/system-probe.yaml new file mode 100644 index 000000000000..08af8f6b8b27 --- /dev/null +++ b/test/regression/cases/quality_gate_security_idle/datadog-agent/system-probe.yaml @@ -0,0 +1,10 @@ +# Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ +# Only enable workload protection +runtime_security_config: + enabled: true +# Activity dump is currently being reworked; when enabled, it generates a large number of kernel events. +# Disabling it gives more predictable results from the generated load.
+ activity_dump: + enabled: false +remote_configuration: + enabled: false diff --git a/test/regression/cases/quality_gate_security_idle/experiment.yaml b/test/regression/cases/quality_gate_security_idle/experiment.yaml new file mode 100644 index 000000000000..17882eeedee6 --- /dev/null +++ b/test/regression/cases/quality_gate_security_idle/experiment.yaml @@ -0,0 +1,61 @@ +optimization_goal: memory +erratic: false + +target: + name: datadog-agent + cpu_allotment: 4 + # Set to 20% higher than the memory_usage check value + memory_allotment: 390 MiB + + environment: + DD_API_KEY: a0000001 + DD_HOSTNAME: smp-regression + + profiling_environment: + # internal profiling + DD_INTERNAL_PROFILING_ENABLED: true + DD_SYSTEM_PROBE_INTERNAL_PROFILING_ENABLED: true + DD_APM_INTERNAL_PROFILING_ENABLED: true + # run all the time + DD_SYSTEM_PROBE_INTERNAL_PROFILING_PERIOD: 1m + DD_SECURITY_AGENT_INTERNAL_PROFILING_PERIOD: 1m + DD_INTERNAL_PROFILING_PERIOD: 1m + DD_SYSTEM_PROBE_INTERNAL_PROFILING_CPU_DURATION: 1m + DD_SECURITY_AGENT_INTERNAL_PROFILING_CPU_DURATION: 1m + DD_INTERNAL_PROFILING_CPU_DURATION: 1m + # destination + DD_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + DD_SECURITY_AGENT_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + # tags + DD_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_idle + DD_SECURITY_AGENT_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_idle + DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_idle + + DD_INTERNAL_PROFILING_BLOCK_PROFILE_RATE: 10000 + DD_INTERNAL_PROFILING_DELTA_PROFILES: true + DD_INTERNAL_PROFILING_ENABLE_GOROUTINE_STACKTRACES: true + DD_INTERNAL_PROFILING_MUTEX_PROFILE_FRACTION: 10 + + # ddprof options + DD_PROFILING_EXECUTION_TRACE_ENABLED: true + DD_PROFILING_EXECUTION_TRACE_PERIOD: 1m + DD_PROFILING_WAIT_PROFILE: true + +checks: + - name: memory_usage + description: 
"Memory usage quality gate. This puts a bound on the total memory usage for CWS with no custom policy and no lading-generated filesystem load." + bounds: + series: total_pss_bytes + # When updating this, update the memory_allotment in the target section to 20% higher. + upper_bound: "330 MiB" + + - name: cpu_usage + description: "CPU usage quality gate. This puts a bound on the total average collector millicore usage." + bounds: + series: avg(total_cpu_usage_millicores) + upper_bound: 40 + +report_links: + - text: "bounds checks dashboard" + link: "https://app.datadoghq.com/dashboard/vz3-jd5-bdi?fromUser=true&refresh_mode=paused&tpl_var_experiment%5B0%5D={{ experiment }}&tpl_var_job_id%5B0%5D={{ job_id }}&view=spans&from_ts={{ start_time_ms }}&to_ts={{ end_time_ms }}&live=false" diff --git a/test/regression/cases/quality_gate_security_idle/lading/lading.yaml b/test/regression/cases/quality_gate_security_idle/lading/lading.yaml new file mode 100644 index 000000000000..e86f519d147d --- /dev/null +++ b/test/regression/cases/quality_gate_security_idle/lading/lading.yaml @@ -0,0 +1,18 @@ +generator: [] + +blackhole: + # The datadog blackhole impersonates Datadog's V2 metrics intake. The agent's + # datadog.yaml sets `dd_url` to this address, so every statsd-emitted agent + # metric -- including `datadog.runtime_security.*` from security-agent and + # system-probe -- flows through the agent's normal forwarder and is recorded + # into SMP at its original payload timestamp. + # This is the same code path the agent uses in production. + - datadog: + v2: + binding_addr: "127.0.0.1:9091" + +# target_metrics scrapes Prometheus/expvar endpoints on the target. CWS +# runtime_security metrics are statsd-only and are not exposed on those +# surfaces, so this is intentionally empty -- the datadog blackhole above +# captures them. 
+target_metrics: [] diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/README.md b/test/regression/cases/quality_gate_security_mean_fs_load/README.md new file mode 100644 index 000000000000..2c802b7061f8 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/README.md @@ -0,0 +1,56 @@ +# Quality Gate CWS - Mean FS Load + +## Overview + +This quality gate experiment tests the Datadog Agent's performance and resource +consumption with Workload Protection enabled under a production-representative +mean filesystem load. It validates that the agent can handle continuous file +tree operations while staying within defined memory bounds. + +**The only enabled functionality is [workload protection](https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/).** + +## Owners + +- **Teams**: @team-k9-cws-agent +- **Slack Channel**: [#security-and-compliance-agent](https://dd.enterprise.slack.com/archives/CTNVD37T3) + +## Scenario + +Models the per-host average filesystem event rate as observed in internal production data. +The load generated produces file opens and renames with no explicit CWS rules triggering. + +A sibling gate, `quality_gate_security_no_fs_load`, uses the same `default.policy` +but `generator: []` — it measures the same configuration with zero lading-generated +filesystem events. + +## Enforcements + +- Memory usage is below a threshold +- Average CPU usage is below a threshold + +## Additional Information + +The key metric that determines the load is `datadog.runtime_security.perf_buffer.events.write`. This represents the number of kernel events which are being seen. + +SMP runs emit an equivalent metric called `single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write`. 
+ +`datadog.runtime_security.perf_buffer.events.write` +→ Lading load +→ SMP run +→ `single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write` ≈ `datadog.runtime_security.perf_buffer.events.write` + +The metric emitted from an SMP run should closely track the production data we source. + +### Verifying the Experiment Configuration + +To check whether the lading config accurately models production, run: + +``` +/analyze-quality-gate-security-mean-fs-load +``` + +This compares three values: the lading-configured event rate, the SMP-captured metric, and the production per-host average for `perf_buffer.events.write`. + +## Other Links + +- [CWS Quality Gates Notebook](https://app.datadoghq.com/notebook/13998267/cws-quality-gate) diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/datadog.yaml b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/datadog.yaml new file mode 100644 index 000000000000..26f2feec63c3 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/datadog.yaml @@ -0,0 +1,8 @@ +auth_token_file_path: /tmp/agent-auth-token + +dd_url: http://127.0.0.1:9091 + +# Disable cloud detection. This stops the Agent from poking around the # execution environment & network. This is particularly important if the target # has network access.
+cloud_provider_metadata: [] diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/runtime-security.d/default.policy b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/runtime-security.d/default.policy new file mode 100644 index 000000000000..179c64bbb571 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/runtime-security.d/default.policy @@ -0,0 +1,4 @@ +rules: + - id: lading_open_monitor + expression: >- + open.file.path =~ "/lading-data/*" && open.flags & (O_CREAT | O_RDWR | O_WRONLY) > 0 diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/security-agent.yaml b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/security-agent.yaml new file mode 100644 index 000000000000..3e3fa1317468 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/security-agent.yaml @@ -0,0 +1,4 @@ +# Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ +# Only enable workload protection +runtime_security_config: + enabled: true diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/system-probe.yaml b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/system-probe.yaml new file mode 100644 index 000000000000..08af8f6b8b27 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/datadog-agent/system-probe.yaml @@ -0,0 +1,10 @@ +# Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ +# Only enable workload protection +runtime_security_config: + enabled: true +# Activity dump is currently being reworked; when enabled, it generates a large number of kernel events. +# Disabling it gives more predictable results from the generated load.
+ activity_dump: + enabled: false +remote_configuration: + enabled: false diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/experiment.yaml b/test/regression/cases/quality_gate_security_mean_fs_load/experiment.yaml new file mode 100644 index 000000000000..5c0b0f1fa769 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/experiment.yaml @@ -0,0 +1,61 @@ +optimization_goal: memory +erratic: false + +target: + name: datadog-agent + cpu_allotment: 4 + # Set to 20% higher than the memory_usage check value + memory_allotment: 380 MiB + + environment: + DD_API_KEY: a0000001 + DD_HOSTNAME: smp-regression + + profiling_environment: + # internal profiling + DD_INTERNAL_PROFILING_ENABLED: true + DD_SYSTEM_PROBE_INTERNAL_PROFILING_ENABLED: true + DD_APM_INTERNAL_PROFILING_ENABLED: true + # run all the time + DD_SYSTEM_PROBE_INTERNAL_PROFILING_PERIOD: 1m + DD_SECURITY_AGENT_INTERNAL_PROFILING_PERIOD: 1m + DD_INTERNAL_PROFILING_PERIOD: 1m + DD_SYSTEM_PROBE_INTERNAL_PROFILING_CPU_DURATION: 1m + DD_SECURITY_AGENT_INTERNAL_PROFILING_CPU_DURATION: 1m + DD_INTERNAL_PROFILING_CPU_DURATION: 1m + # destination + DD_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + DD_SECURITY_AGENT_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + # tags + DD_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_mean_fs_load + DD_SECURITY_AGENT_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_mean_fs_load + DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_mean_fs_load + + DD_INTERNAL_PROFILING_BLOCK_PROFILE_RATE: 10000 + DD_INTERNAL_PROFILING_DELTA_PROFILES: true + DD_INTERNAL_PROFILING_ENABLE_GOROUTINE_STACKTRACES: true + DD_INTERNAL_PROFILING_MUTEX_PROFILE_FRACTION: 10 + + # ddprof options + DD_PROFILING_EXECUTION_TRACE_ENABLED: true + DD_PROFILING_EXECUTION_TRACE_PERIOD: 1m + DD_PROFILING_WAIT_PROFILE: true + 
+checks: + - name: memory_usage + description: "Memory usage quality gate. This puts a bound on the total memory usage for CWS workloads." + bounds: + series: total_pss_bytes + # When updating this, update the memory_allotment in the target section to 20% higher. + upper_bound: "320 MiB" + + - name: cpu_usage + description: "CPU usage quality gate. This puts a bound on the total average collector millicore usage." + bounds: + series: avg(total_cpu_usage_millicores) + upper_bound: 40 + +report_links: + - text: "bounds checks dashboard" + link: "https://app.datadoghq.com/dashboard/vz3-jd5-bdi?fromUser=true&refresh_mode=paused&tpl_var_experiment%5B0%5D={{ experiment }}&tpl_var_job_id%5B0%5D={{ job_id }}&view=spans&from_ts={{ start_time_ms }}&to_ts={{ end_time_ms }}&live=false" diff --git a/test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml b/test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml new file mode 100644 index 000000000000..4272dfa26844 --- /dev/null +++ b/test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml @@ -0,0 +1,28 @@ +generator: + - file_tree: + seed: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, + 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131] + root: /lading-data/ + # Must exceed open_per_second × run_duration_seconds so that lading never + # exhausts the tree during a capture. Once all nodes exist on disk, opens + # become O_RDONLY and are rejected by CWS kernel-side flag approvers. + max_nodes: 2000000 + open_per_second: 1230 + rename_per_second: 1 + +blackhole: + # The datadog blackhole impersonates Datadog's V2 metrics intake. The agent's + # datadog.yaml sets `dd_url` to this address, so every statsd-emitted agent + # metric -- including `datadog.runtime_security.*` from security-agent and + # system-probe -- flows through the agent's normal forwarder and is recorded + # into SMP at its original payload timestamp. 
+ # This is the same code path the agent uses in production. + - datadog: + v2: + binding_addr: "127.0.0.1:9091" + +# target_metrics scrapes Prometheus/expvar endpoints on the target. CWS +# runtime_security metrics are statsd-only and are not exposed on those +# surfaces, so this is intentionally empty -- the datadog blackhole above +# captures them. +target_metrics: [] diff --git a/test/regression/cases/quality_gate_security_no_fs_load/README.md b/test/regression/cases/quality_gate_security_no_fs_load/README.md new file mode 100644 index 000000000000..4ec25b61b938 --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/README.md @@ -0,0 +1,36 @@ +# Quality Gate CWS - No FS Load + +## Overview + +This quality gate experiment measures the Datadog Agent's resource consumption +with Workload Protection enabled and a CWS `default.policy` in effect, but with +no lading-generated filesystem workload. It isolates the overhead of the CWS +policy and approver pipeline under zero application-generated load. + +**The only enabled functionality is [workload protection](https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/).** + +## Owners + +- **Teams**: @team-k9-cws-agent +- **Slack Channel**: [#security-and-compliance-agent](https://dd.enterprise.slack.com/archives/CTNVD37T3) + +## Scenario + +Models a host with CWS enabled and the shipped `default.policy` overridden by +this experiment's policy, but with zero lading-generated filesystem events. The +only events observed are background noise from default activity on the host. + +This is the no-load counterpart to `quality_gate_security_mean_fs_load`: the two +share the same policy configuration and differ only in whether lading is +generating filesystem load. See `quality_gate_security_idle` for the +no-load-and-no-custom-policy baseline (what every CWS customer pays before any +tuning). 
+ +## Enforcements + +- Memory usage is below a threshold +- Average CPU usage is below a threshold + +## Other Links + +- [CWS Quality Gates Notebook](https://app.datadoghq.com/notebook/13998267/cws-quality-gate) diff --git a/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/datadog.yaml b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/datadog.yaml new file mode 100644 index 000000000000..26f2feec63c3 --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/datadog.yaml @@ -0,0 +1,8 @@ +auth_token_file_path: /tmp/agent-auth-token + +dd_url: http://127.0.0.1:9091 + +# Disable cloud detection. This stops the Agent from poking around the +# execution environment & network. This is particularly important if the target +# has network access. +cloud_provider_metadata: [] diff --git a/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/runtime-security.d/default.policy b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/runtime-security.d/default.policy new file mode 100644 index 000000000000..179c64bbb571 --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/runtime-security.d/default.policy @@ -0,0 +1,4 @@ +rules: + - id: lading_open_monitor + expression: >- + open.file.path =~ "/lading-data/*" && open.flags & (O_CREAT | O_RDWR | O_WRONLY) > 0 diff --git a/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/security-agent.yaml b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/security-agent.yaml new file mode 100644 index 000000000000..3e3fa1317468 --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/security-agent.yaml @@ -0,0 +1,4 @@ +# Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ +# Only enable workload protection +runtime_security_config: + enabled: true diff --git 
a/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/system-probe.yaml b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/system-probe.yaml new file mode 100644 index 000000000000..08af8f6b8b27 --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/datadog-agent/system-probe.yaml @@ -0,0 +1,10 @@ +# Per https://docs.datadoghq.com/security/workload_protection/setup/agent/linux/ +# Only enable workload protection +runtime_security_config: + enabled: true +# Activity dump is currently being reworked; when enabled, it generates a large number of kernel events. +# Disabling it gives more predictable results from the generated load. + activity_dump: + enabled: false +remote_configuration: + enabled: false diff --git a/test/regression/cases/quality_gate_security_no_fs_load/experiment.yaml b/test/regression/cases/quality_gate_security_no_fs_load/experiment.yaml new file mode 100644 index 000000000000..e0f012d37e36 --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/experiment.yaml @@ -0,0 +1,61 @@ +optimization_goal: memory +erratic: false + +target: + name: datadog-agent + cpu_allotment: 4 + # Set to 20% higher than the memory_usage check value + memory_allotment: 380 MiB + + environment: + DD_API_KEY: a0000001 + DD_HOSTNAME: smp-regression + + profiling_environment: + # internal profiling + DD_INTERNAL_PROFILING_ENABLED: true + DD_SYSTEM_PROBE_INTERNAL_PROFILING_ENABLED: true + DD_APM_INTERNAL_PROFILING_ENABLED: true + # run all the time + DD_SYSTEM_PROBE_INTERNAL_PROFILING_PERIOD: 1m + DD_SECURITY_AGENT_INTERNAL_PROFILING_PERIOD: 1m + DD_INTERNAL_PROFILING_PERIOD: 1m + DD_SYSTEM_PROBE_INTERNAL_PROFILING_CPU_DURATION: 1m + DD_SECURITY_AGENT_INTERNAL_PROFILING_CPU_DURATION: 1m + DD_INTERNAL_PROFILING_CPU_DURATION: 1m + # destination + DD_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + DD_SECURITY_AGENT_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket +
DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_UNIX_SOCKET: /smp-host/apm.socket + # tags + DD_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_no_fs_load + DD_SECURITY_AGENT_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_no_fs_load + DD_SYSTEM_PROBE_CONFIG_INTERNAL_PROFILING_EXTRA_TAGS: experiment:quality_gate_security_no_fs_load + + DD_INTERNAL_PROFILING_BLOCK_PROFILE_RATE: 10000 + DD_INTERNAL_PROFILING_DELTA_PROFILES: true + DD_INTERNAL_PROFILING_ENABLE_GOROUTINE_STACKTRACES: true + DD_INTERNAL_PROFILING_MUTEX_PROFILE_FRACTION: 10 + + # ddprof options + DD_PROFILING_EXECUTION_TRACE_ENABLED: true + DD_PROFILING_EXECUTION_TRACE_PERIOD: 1m + DD_PROFILING_WAIT_PROFILE: true + +checks: + - name: memory_usage + description: "Memory usage quality gate. This puts a bound on the total memory usage for CWS with the default.policy in effect but no lading-generated filesystem load." + bounds: + series: total_pss_bytes + # When updating this, update the memory_allotment in the target section to 20% higher. + upper_bound: "320 MiB" + + - name: cpu_usage + description: "CPU usage quality gate. This puts a bound on the total average collector millicore usage." + bounds: + series: avg(total_cpu_usage_millicores) + upper_bound: 40 + +report_links: + - text: "bounds checks dashboard" + link: "https://app.datadoghq.com/dashboard/vz3-jd5-bdi?fromUser=true&refresh_mode=paused&tpl_var_experiment%5B0%5D={{ experiment }}&tpl_var_job_id%5B0%5D={{ job_id }}&view=spans&from_ts={{ start_time_ms }}&to_ts={{ end_time_ms }}&live=false" diff --git a/test/regression/cases/quality_gate_security_no_fs_load/lading/lading.yaml b/test/regression/cases/quality_gate_security_no_fs_load/lading/lading.yaml new file mode 100644 index 000000000000..e86f519d147d --- /dev/null +++ b/test/regression/cases/quality_gate_security_no_fs_load/lading/lading.yaml @@ -0,0 +1,18 @@ +generator: [] + +blackhole: + # The datadog blackhole impersonates Datadog's V2 metrics intake. 
The agent's + # datadog.yaml sets `dd_url` to this address, so every statsd-emitted agent + # metric -- including `datadog.runtime_security.*` from security-agent and + # system-probe -- flows through the agent's normal forwarder and is recorded + # into SMP at its original payload timestamp. + # This is the same code path the agent uses in production. + - datadog: + v2: + binding_addr: "127.0.0.1:9091" + +# target_metrics scrapes Prometheus/expvar endpoints on the target. CWS +# runtime_security metrics are statsd-only and are not exposed on those +# surfaces, so this is intentionally empty -- the datadog blackhole above +# captures them. +target_metrics: []
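The `max_nodes` comment in the `mean_fs_load` lading config encodes a simple invariant: the tree must hold more unwritten nodes than a capture will consume, or opens degrade to `O_RDONLY` and stop matching the policy's write-flag approvers. A minimal sketch of that arithmetic, with the capture duration as an assumed parameter (the actual SMP run length is not stated in this diff):

```python
# Sanity-check the max_nodes headroom described in
# quality_gate_security_mean_fs_load/lading/lading.yaml.
# The capture duration below is an assumption for illustration only.

OPEN_PER_SECOND = 1230             # lading.yaml: open_per_second
MAX_NODES = 2_000_000              # lading.yaml: max_nodes
ASSUMED_CAPTURE_SECONDS = 20 * 60  # hypothetical 20-minute capture

opens_during_capture = OPEN_PER_SECOND * ASSUMED_CAPTURE_SECONDS
headroom = MAX_NODES - opens_during_capture

print(f"opens during capture: {opens_during_capture:,}")
print(f"node headroom:        {headroom:,}")

# If this fails, lading exhausts the tree mid-capture: every node already
# exists on disk, subsequent opens are O_RDONLY, and the CWS kernel-side
# flag approvers reject them, deflating the measured event rate.
assert headroom > 0
```

At 1,230 opens per second a 2,000,000-node tree lasts roughly 1,626 seconds (about 27 minutes), so any capture shorter than that keeps the open rate steady.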