diff --git a/.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md b/.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md new file mode 100644 index 000000000000..aea98de70ad1 --- /dev/null +++ b/.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md @@ -0,0 +1,170 @@ +--- +name: analyze-quality-gate-security-mean-fs-load +description: Compare production CWS metrics with SMP regression experiment metrics from quality_gate_security_mean_fs_load and quality_gate_security_no_fs_load +user_invocable: true +allowed-tools: Bash, Read, Skill +--- + +# analyze-quality-gate-security-mean-fs-load + +Compare production CWS `perf_buffer.events.write` open rates against what the `quality_gate_security_mean_fs_load` lading config generates and what SMP captures, using `quality_gate_security_no_fs_load` as the no-load baseline. + +Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load: + +- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events. +- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load. + +The difference between them isolates the generator's contribution, which should match production workload above background noise. + +The primary signal is `event_type:open` — it is the most common event type in production CWS workloads. Open events have high background noise from always-on VFS hooks (`hook_do_dentry_open`, `hook_vfs_open` under the `"*"` probe selector), but the `no_fs_load` experiment directly measures this noise floor, enabling clean subtraction. 
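The subtraction described above can be sketched as follows. This is a minimal illustration only: the function name `generator_contribution` and the sample rates are hypothetical, not values taken from any SMP run.

```python
def generator_contribution(loaded_open: float, noload_open: float) -> float:
    """Open events/sec attributable to the lading generator.

    loaded_open: mean open rate from quality_gate_security_mean_fs_load
    noload_open: mean open rate from quality_gate_security_no_fs_load
    Both inputs are assumed to already be in events/sec.
    """
    return loaded_open - noload_open


# Illustrative numbers only (not measured data).
print(generator_contribution(120.0, 20.0))
```

A negative or near-zero result would suggest the loaded run is indistinguishable from the noise floor and is worth flagging rather than interpreting.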
+ +## Step 1: Invoke /explain-lading-config + +Call the `explain-lading-config` skill on `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`. Record the expected open event rate: `open_per_second` events/sec (each open operation produces 1 open event). Note `rename_per_second` but do not include it in the primary comparison. + +Note: `quality_gate_security_no_fs_load` has `generator: []` (no load generator). There is nothing to explain for it — its purpose is to measure the no-load baseline. + +## Step 2: Verify pup auth + +Run `pup auth status`. If expired, run `pup auth login`. If auth fails entirely, note that Datadog MCP tools are available as a fallback for read-only queries. + +## Step 3: Identify the latest job for each experiment and query it exclusively + +SMP runs multiple replicas per experiment; each is tagged with a unique `job_id`. Pin the analysis to one specific run by finding the latest `job_id` per experiment and filtering on it. This avoids time-window heuristics and the 300 s rollup gap-fill that deflates means when a short capture only partially occupies a bucket. + +### 3a. List jobs grouped by `job_id` + +Run grouped queries over the last 1 day (widen to 2d, then 7d only if nothing is returned). 
These are **inventory-only** — do not use their values, only their `job_id` scopes and timestamps: + +```bash +# mean_fs_load — jobs +pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load} by {job_id}.as_rate()' --from 1d --to now + +# no_fs_load — jobs +pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load} by {job_id}.as_rate()' --from 1d --to now +``` + +For each returned series, read its `scope` (contains `job_id:`) and its `pointlist`. Null-safe: drop points whose value is `None`. Per job, record the timestamps of the **first and last non-zero points** (`FIRST_MS`, `LAST_MS`). The `job_id` with the most recent last-non-zero timestamp is "the latest job" for that experiment. + +### 3b. Re-query each latest `job_id` exclusively + +Add `job_id:` to the tag filter. Pin the query window to the job itself — not a relative range like `--from 1h`. Using a relative range is non-deterministic across runs: it shifts with wall-clock time, and the API rollup may include or exclude an edge bucket depending on where "now" lands, producing different pointlists and different means for the same `job_id`. + +From Step 3a you have the first and last non-zero epoch-ms (`FIRST_MS`, `LAST_MS`) for each latest `job_id`. Derive a deterministic window for each experiment (the queries below name them `LOADED_FROM_TS` / `LOADED_TO_TS` for `mean_fs_load` and `NOLOAD_FROM_TS` / `NOLOAD_TO_TS` for `no_fs_load`): + +```bash +FROM_TS=$(( FIRST_MS / 1000 - 60 )) # 1 minute of padding before +TO_TS=$(( LAST_MS / 1000 + 60 )) # 1 minute of padding after +``` + +Pass these as absolute Unix seconds to `pup`. The window width should stay under ~1 h so the API still returns 20-second interval data. 
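The job selection and window derivation described above could be sketched roughly as below. It assumes the grouped query result has been parsed into a list of dicts with a comma-separated `scope` string and a `pointlist` of `[epoch_ms, value]` pairs. That shape is an assumption about the `pup` output, and the helper names (`nonzero_span`, `job_id_from_scope`, `latest_job`, `query_window`) are hypothetical.

```python
def nonzero_span(pointlist):
    """Return (first_ms, last_ms) over non-null, non-zero points, or None."""
    stamps = [int(ts) for ts, value in pointlist
              if value is not None and value != 0]
    return (min(stamps), max(stamps)) if stamps else None


def job_id_from_scope(scope):
    """Extract the job_id tag value from a comma-separated scope string."""
    for tag in scope.split(","):
        tag = tag.strip()
        if tag.startswith("job_id:"):
            return tag.split(":", 1)[1]
    return None


def latest_job(series):
    """Pick the job whose last non-zero point is the most recent.

    Returns (job_id, first_ms, last_ms), or None if no job has data.
    """
    best = None
    for entry in series:
        job_id = job_id_from_scope(entry.get("scope", ""))
        span = nonzero_span(entry.get("pointlist", []))
        if job_id is None or span is None:
            continue
        if best is None or span[1] > best[2]:
            best = (job_id, span[0], span[1])
    return best


def query_window(first_ms, last_ms, pad_s=60):
    """Absolute Unix-second window with pad_s seconds of padding each side."""
    return first_ms // 1000 - pad_s, last_ms // 1000 + pad_s
```

Running `latest_job` once per experiment and feeding its timestamps to `query_window` yields one `(from, to)` pair per experiment, which keeps the Step 3b queries reproducible for a given `job_id`.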
+ +Run in parallel: + +```bash +# mean_fs_load — latest job +pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS" +pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS" + +# no_fs_load — latest job +pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS" +pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS" +``` + +The `.as_rate()` modifier guarantees the result is in events/sec. Both this metric and the production metric are type `rate` (statsd_interval=10), so `.as_rate()` is a no-op today — but it documents intent and protects against metadata changes. + +Compute `mean` over **non-zero, non-null** points only. Record the `job_id`, the first and last non-zero timestamps (capture window), the interval returned, the data points count, and the raw sum used for the mean. 
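As a sketch of the per-job arithmetic described above, the following filters a pointlist and records the values the report needs. It assumes `pointlist` is a list of `[epoch_ms, value]` pairs from a single-`job_id` query; `summarize_job` is a hypothetical helper name.

```python
def summarize_job(pointlist):
    """Mean over non-zero, non-null points, with the inputs needed to verify it.

    Returns None when no usable points remain after filtering.
    """
    kept = [(int(ts), value) for ts, value in pointlist
            if value is not None and value != 0]
    if not kept:
        return None
    total = sum(value for _, value in kept)
    stamps = [ts for ts, _ in kept]
    return {
        "first_ms": min(stamps),   # start of the capture window
        "last_ms": max(stamps),    # end of the capture window
        "n": len(kept),            # data points used
        "sum": total,              # raw sum used for the mean
        "mean": total / len(kept),
    }
```

Keeping `sum` and `n` next to `mean` lets a reviewer redo the division by hand instead of trusting the reported figure.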
+ +When reporting `capture_first_ts` / `capture_last_ts`, do not compute ISO strings by hand; use: + +```bash +python3 -c "from datetime import datetime, timezone; print(datetime.fromtimestamp($(( MS / 1000 )), timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))" +``` + +Always print the raw epoch-ms alongside the ISO string so the reader can verify the conversion. + +Do not include min/max in the report. + +### 3c. Sanity checks + +- If fewer than ~5 non-zero points come back for a latest job, the capture may be in flight or interrupted. Try the second-most-recent `job_id` and report both so the reviewer can judge. +- If the `mean_fs_load` and `no_fs_load` latest `job_id`s ran more than 24 h apart, flag the staleness — background noise floor drifts over time. +- Never fall back to wider windows with coarser rollups (≥300 s intervals) as a substitute — rollup gap-fill deflates per-job means when a capture only partially occupies a bucket. + +### 3d. Anti-pattern (why `job_id` is authoritative) + +Do **not** identify runs by time heuristics (contiguous non-zero clusters, gap-detection, "last N minutes"). A single SMP run at 20 s resolution still contains transient zero buckets that masquerade as run boundaries once rolled up, and multiple replicas of the same experiment run concurrently, so any time-slice may contain overlapping jobs. The `job_id` tag is the only authoritative grouping for a single run. + +## Step 4: Query production metric (weekly mean) + +Query production data over the last 7 days to get the weekly mean. 
Run both queries in parallel: + +```bash +# Primary — open-only, weekly mean +pup --read-only metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{event_type:open}.as_rate()' --from 7d --to now + +# Context — all file activity, weekly mean +pup --read-only metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{category:file_activity}.as_rate()' --from 7d --to now +``` + +All queries use `.as_rate()` so every value in the comparison chain is in the same unit (events/sec): +- Lading config: `open_per_second` × 1 = events/sec +- SMP captured: `.as_rate()` → events/sec +- Production: `.as_rate()` → events/sec + +Compute the mean from the weekly pointlist values. Do not include min/max in the report. + +## Step 5: Compare and analyze + +The primary comparison uses `event_type:open` values only. + +All values must be in **events/sec** (guaranteed by `.as_rate()` on metric queries, and by the explain-lading-config breakdown for lading). + +**Primary table — open events/sec:** + +| Source | Value (events/sec) | Description | +|--------|---------------------|-------------| +| Lading config | from Step 1 | `open_per_second` × 1 | +| SMP — no_fs_load (open) | from Step 3 | Background open noise with custom policy, no generator | +| SMP — mean_fs_load (open) | from Step 3 | Open rate with custom policy and generator running | +| Generator contribution | computed | mean_fs_load - no_fs_load | +| Internal production data per-host avg (open, weekly) | from Step 4 | production `event_type:open` `.as_rate()` weekly mean | + +**Supplementary context** (not used for tuning decisions): + +| Source | Value (events/sec) | Description | +|--------|---------------------|-------------| +| SMP — no_fs_load (all file activity) | from Step 3 | no-load `category:file_activity` — background noise floor | +| SMP — mean_fs_load (all file activity) | from Step 3 | loaded `category:file_activity` | +| Internal production data per-host avg (all file 
activity, weekly) | from Step 4 | production `category:file_activity` weekly mean | + +Analysis: +- **No-load baseline check**: The `no_fs_load` open rate captures background noise from always-on VFS hooks while the custom policy is in effect. Record this value as the noise floor. +- **Generator contribution check**: `generator_contribution = mean_fs_load_open - no_fs_load_open` reflects the generator's observable effect on CWS. Note that this need not equal `open_per_second` — one lading syscall can produce multiple CWS events (and vice versa). +- **Production validity check**: Compare the SMP `mean_fs_load` open rate against the internal production data per-host weekly open average. If they diverge, flag the gap. +- Note the `category:file_activity` totals for reference. + +## Step 6: Output report + +Print a markdown report answering "does the lading config reflect the production open workload?" with these sections: + +1. **Lading config** — configured open rate (`open_per_second` × 1 = events/sec) +2. **SMP No-Load Baseline (no_fs_load, latest job ``)** — open rate and file activity with CWS + custom policy but no generator, showing the job_id, capture window (first → last non-zero epoch-ms with ISO derived via the `python3` one-liner from Step 3b), interval, data points count, and the mean expressed as `sum= / n= = ` so the reader can reproduce the arithmetic without trusting the model +3. **SMP Loaded (mean_fs_load, latest job ``)** — open rate and file activity with CWS + custom policy + generator, showing the job_id, capture window (epoch-ms + ISO derived via the `python3` one-liner from Step 3b), interval, data points count, and the mean expressed as `sum= / n= = ` +4. **Generator Contribution** — delta between mean_fs_load and no_fs_load for open events (between the two latest jobs) +5. **Internal production data per-host avg (open, weekly)** — production open rate weekly mean + +Followed by a supplementary note showing `category:file_activity` totals from both SMP experiments and production for reference. 
+ +Followed by a one-line assessment: match, mismatch, or insufficient data. + +## Step 7: Propose lading config changes + +If a mismatch was found, propose concrete changes to `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`: + +1. **Target**: the production open weekly mean from Step 4. +2. **Direction of change**: adjust `open_per_second` up or down to push `mean_fs_load_open` toward the target. Do not assume a 1:1 mapping between lading syscalls and CWS events — the security-agent can observe multiple kernel events per lading operation (e.g. a single rename may surface more than two events) and may dedupe or filter others. Treat the lading-to-CWS ratio as empirical, derived from the specific `job_id` used in Step 3b — cite the job_id alongside the ratio so the tuning decision is traceable. +3. **Show the diff** — print the exact YAML change (old value → new value) for `open_per_second`. Do not adjust `rename_per_second` — rename events are not the primary signal. +4. **Offer to apply** — ask whether to edit the lading config file directly. + +If data was insufficient (e.g. the latest job had too few non-zero points), offer to use the second-most-recent `job_id` before proposing changes. Do not fall back to wider time windows with coarser rollups. diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index cdb5b0fc36ec..2f50d25ba33b 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -941,4 +941,5 @@ /q_branch/ @DataDog/q-branch # AI-related files -/.claude/skills/explain-lading-config @DataDog/single-machine-performance +/.claude/skills/explain-lading-config @DataDog/single-machine-performance +/.claude/skills/analyze-quality-gate-security-mean-fs-load @DataDog/single-machine-performance @DataDog/agent-security