**`.claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md`** (+170)

**Contributor (Author):**

This will likely need some iteration after merge. Right now it works well, considering I'm manually triggering the Quality Gates.

Ideally, we'd use data from regression detector runs off of main (with some kind of tag we can filter on) and nothing from CI itself, as we don't want PR data to influence things.

---
name: analyze-quality-gate-security-mean-fs-load
description: Compare production CWS metrics with SMP regression experiment metrics from quality_gate_security_mean_fs_load and quality_gate_security_no_fs_load
user_invocable: true
allowed-tools: Bash, Read, Skill
---

# analyze-quality-gate-security-mean-fs-load

Compare production CWS `perf_buffer.events.write` open rates against what the `quality_gate_security_mean_fs_load` lading config generates and what SMP captures, using `quality_gate_security_no_fs_load` as the no-load baseline.

Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load:

- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events.
- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load.

**Contributor:**
Is mean an interesting level of load? I would expect this to be a severely right-skewed distribution, which the mean will under-represent. Could we capture a higher percentile of the observed workload?

**Contributor:**

I had a look. To give an idea of how skewed that distribution is, some hosts top out above 400k write events per second.

I got there with this query: `top(max:datadog.runtime_security.perf_buffer.events.write{event_type:open} by {host}.as_rate(), 100, 'max', 'desc')`

More interesting is something along these lines: `percentile(max:datadog.runtime_security.perf_buffer.events.write{event_type:open} by {host}.as_rate(), 'p95', { * })`

**Contributor (Author):**

Also, fwiw, I've been using this notebook to help visualize some of what we're looking at: https://app.datadoghq.com/notebook/13998267/cws-quality-gates?cell-eh89gz4d-from_ts=1776088673376&cell-eh89gz4d-refresh_mode=sliding&cell-eh89gz4d-to_ts=1776693473376&refresh_mode=paused&tpl_var_event_type=%2A&tpl_var_experiment=quality_gate_security_idle&utc_override=false&from_ts=1776710092745&to_ts=1776713692745

> Is mean an interesting level of load?

I think it is. Being able to say "on average, this is the cost" is interesting to me. Having said that, being able to say "what's the expected cost at the 95th percentile?" is also interesting. I could see us having both.

I also think the mean is a lot easier to reason about and visualize in Datadog than a percentile of per-host maxes. What I would really like is for the underlying metric to be a distribution so we could use a percentile directly. Alas, it's not.

Personally, I'd like to revisit this later this year once I have a better understanding of COAT and other telemetry. I'd like to provide tooling that works for any Quality Gate; this is very much an intermediate step until then.

I'm going to push back a little on this ask for now. Folks from CWS were open to mean as an initial QG target. I'd like to get them started with mean, and in next week's meeting with CWS I'll encourage them to adapt the QGs to the use cases they deem most important, and communicate that I'll be available to assist.


The difference between the two experiments isolates the generator's contribution, which should match the production workload above background noise.

The primary signal is `event_type:open` — it is the most common event type in production CWS workloads. Open events have high background noise from always-on VFS hooks (`hook_do_dentry_open`, `hook_vfs_open` under the `"*"` probe selector), but the `no_fs_load` experiment directly measures this noise floor, enabling clean subtraction.

## Step 1: Invoke /explain-lading-config

Call the `explain-lading-config` skill on `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`. Record the expected open event rate: `open_per_second` events/sec (each open operation produces 1 open event). Note `rename_per_second` but do not include it in the primary comparison.

Note: `quality_gate_security_no_fs_load` has `generator: []` (no load generator). There is nothing to explain for it — its purpose is to measure the no-load baseline.

## Step 2: Verify pup auth

Run `pup auth status`. If expired, run `pup auth login`. If auth fails entirely, note that Datadog MCP tools are available as a fallback for read-only queries.
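
A minimal sketch of this gate, assuming `pup auth status` exits non-zero when the session is expired or missing:

```bash
# Re-authenticate only when the current session is invalid.
if ! pup auth status >/dev/null 2>&1; then
  pup auth login \
    || echo "pup auth failed; fall back to Datadog MCP tools for read-only queries"
fi
```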

## Step 3: Identify the latest job for each experiment and query it exclusively

SMP runs multiple replicas per experiment; each is tagged with a unique `job_id`. Pin the analysis to one specific run by finding the latest `job_id` per experiment and filtering on it. This avoids time-window heuristics and the 300 s rollup gap-fill that deflates means when a short capture only partially occupies a bucket.

### 3a. List jobs grouped by `job_id`

Run grouped queries over the last 1 day (widen to 2d, then 7d only if nothing is returned). These are **inventory-only** — do not use their values, only their `job_id` scopes and timestamps:

```bash
# mean_fs_load — jobs
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load} by {job_id}.as_rate()' --from 1d --to now

# no_fs_load — jobs
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load} by {job_id}.as_rate()' --from 1d --to now
```

For each returned series, read its `scope` (contains `job_id:<UUID>`) and its `pointlist`. Null-safe: drop points whose value is `None`. Per job, record the timestamp of the **last non-zero point**. The `job_id` with the most recent last-non-zero timestamp is "the latest job" for that experiment.
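
A sketch of that selection, assuming `pup` emits the standard Datadog query JSON shape (`{"series": [{"scope": "...", "pointlist": [[ts_ms, value], ...]}]}`); the output shape is an assumption, not documented `pup` behavior:

```bash
# Per series: pull the job_id out of the scope, keep timestamps of non-zero,
# non-null points, and print "<last_nonzero_ms> <first_nonzero_ms> <job_id>".
# After the descending numeric sort, the top line is the latest job.
pup metrics query --query "$QUERY" --from 1d --to now | jq -r '
  .series[]
  | (.scope | capture("job_id:(?<id>[0-9a-f-]+)").id) as $job
  | [.pointlist[] | select(.[1] != null and .[1] != 0) | .[0]] as $ts
  | select(($ts | length) > 0)
  | "\($ts | max) \($ts | min) \($job)"
' | sort -rn | head -1
```

The first two fields double as `LAST_MS` and `FIRST_MS` for the window derivation in step 3b.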

### 3b. Re-query each latest `job_id` exclusively

Add `job_id:<UUID>` to the tag filter. Pin the query window to the job itself — not a relative range like `--from 1h`. Using a relative range is non-deterministic across runs: it shifts with wall-clock time, and the API rollup may include or exclude an edge bucket depending on where "now" lands, producing different pointlists and different means for the same `job_id`.

From Step 3a you have the first and last non-zero epoch-ms (`FIRST_MS`, `LAST_MS`) for each latest `job_id`. Derive a deterministic window:

```bash
FROM_TS=$(( FIRST_MS / 1000 - 60 )) # 1 minute of padding before
TO_TS=$(( LAST_MS / 1000 + 60 )) # 1 minute of padding after
```

Pass these as absolute Unix seconds to `pup`. The window width should stay under ~1 h so the API still returns 20-second interval data.

Run in parallel:

```bash
# mean_fs_load — latest job
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<LOADED_UUID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<LOADED_UUID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"

# no_fs_load — latest job
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<NOLOAD_UUID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
pup metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<NOLOAD_UUID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
```

The `.as_rate()` modifier guarantees the result is in events/sec. Both this metric and the production metric are type `rate` (statsd_interval=10), so `.as_rate()` is a no-op today — but it documents intent and protects against metadata changes.

Compute `mean` over **non-zero, non-null** points only. Record the `job_id`, the first and last non-zero timestamps (capture window), the interval returned, the data points count, and the raw sum used for the mean.
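
A sketch of that computation, under the same assumed response shape as step 3a; it prints the raw sum and count so the arithmetic in the report is reproducible:

```bash
# Mean over non-zero, non-null points (assumes at least one qualifying point).
pup metrics query --query "$QUERY" --from "$FROM_TS" --to "$TO_TS" | jq -r '
  [.series[].pointlist[] | select(.[1] != null and .[1] != 0) | .[1]]
  | "sum=\(add) n=\(length) mean=\(add / length)"
'
```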

When reporting `capture_first_ts` / `capture_last_ts`, do not compute ISO strings by hand — use the shell:

```bash
date -u -r $(( MS / 1000 )) +%Y-%m-%dT%H:%M:%SZ
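# The -r form above is BSD/macOS; on GNU coreutils use:
#   date -u -d "@$(( MS / 1000 ))" +%Y-%m-%dT%H:%M:%SZ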
```

Always print the raw epoch-ms alongside the ISO string so the reader can verify the conversion.

Do not include min/max in the report.

### 3c. Sanity checks

- If fewer than ~5 non-zero points come back for a latest job, the capture may be in flight or interrupted. Try the second-most-recent `job_id` and report both so the reviewer can judge.
- If the `mean_fs_load` and `no_fs_load` latest `job_id`s ran more than 24 h apart, flag the staleness — background noise floor drifts over time (a check sketch follows this list).
- Never fall back to wider windows with coarser rollups (≥300 s intervals) as a substitute — rollup gap-fill deflates per-job means when a capture only partially occupies a bucket.
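
A minimal sketch of the staleness check, assuming `LOADED_LAST_MS` and `NOLOAD_LAST_MS` hold each latest job's last non-zero epoch-ms from step 3a:

```bash
# Flag the pair as stale when the two latest jobs ran more than 24 h apart.
DELTA_MS=$(( LOADED_LAST_MS - NOLOAD_LAST_MS ))
(( DELTA_MS < 0 )) && DELTA_MS=$(( -DELTA_MS ))
if (( DELTA_MS > 86400000 )); then  # 24 h in ms
  echo "warning: latest jobs are >24 h apart; the background noise floor may have drifted"
fi
```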

### 3d. Anti-pattern (why `job_id` is authoritative)

Do **not** identify runs by time heuristics (contiguous non-zero clusters, gap-detection, "last N minutes"). A single SMP run at 20 s resolution still contains transient zero buckets that masquerade as run boundaries once rolled up, and multiple replicas of the same experiment run concurrently, so any time-slice may contain overlapping jobs. The `job_id` tag is the only authoritative grouping for a single run.

## Step 4: Query production metric (weekly mean)

Query production data over the last 7 days to get the weekly mean. Run both queries in parallel:

```bash
# Primary — open-only, weekly mean
pup metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{event_type:open}.as_rate()' --from 7d --to now

# Context — all file activity, weekly mean
pup metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{category:file_activity}.as_rate()' --from 7d --to now
```

All queries use `.as_rate()` so every value in the comparison chain is in the same unit (events/sec):
- Lading config: `open_per_second` × 1 = events/sec
- SMP captured: `.as_rate()` → events/sec
- Production: `.as_rate()` → events/sec

Compute the mean from the weekly pointlist values. Do not include min/max in the report.

## Step 5: Compare and analyze

The primary comparison uses `event_type:open` values only.

All values must be in **events/sec** (guaranteed by `.as_rate()` on metric queries, and by the explain-lading-config breakdown for lading).

**Primary table — open events/sec:**

| Source | Value (events/sec) | Description |
|--------|---------------------|-------------|
| Lading config | from Step 1 | `open_per_second` × 1 |
| SMP — no_fs_load (open) | from Step 3 | Background open noise with custom policy, no generator |
| SMP — mean_fs_load (open) | from Step 3 | Open rate with custom policy and generator running |
| Generator contribution | computed | mean_fs_load - no_fs_load |
| Internal production data per-host avg (open, weekly) | from Step 4 | production `event_type:open` `.as_rate()` weekly mean |

**Supplementary context** (not used for tuning decisions):

| Source | Value (events/sec) | Description |
|--------|---------------------|-------------|
| SMP — no_fs_load (all file activity) | from Step 3 | no-load `category:file_activity` — background noise floor |
| SMP — mean_fs_load (all file activity) | from Step 3 | loaded `category:file_activity` |
| Internal production data per-host avg (all file activity, weekly) | from Step 4 | production `category:file_activity` weekly mean |

Analysis:
- **No-load baseline check**: The `no_fs_load` open rate captures background noise from always-on VFS hooks while the custom policy is in effect. Record this value as the noise floor.
- **Generator contribution check**: `generator_contribution = mean_fs_load_open - no_fs_load_open` reflects the generator's observable effect on CWS. Note that this need not equal `open_per_second` — one lading syscall can produce multiple CWS events (and vice versa). A sketch of this arithmetic follows the list.
- **Production validity check**: Compare the SMP `mean_fs_load` open rate against the internal production data per-host weekly open average. If they diverge, flag the gap.
- Note the `category:file_activity` totals for reference.
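
A sketch of the two derived numbers, assuming `LOADED_OPEN_MEAN`, `NOLOAD_OPEN_MEAN`, and `PROD_OPEN_MEAN` hold the open-event means from Steps 3 and 4:

```bash
# Generator contribution (events/sec) and SMP-vs-production divergence (%).
GEN_CONTRIB=$(echo "$LOADED_OPEN_MEAN - $NOLOAD_OPEN_MEAN" | bc -l)
DIVERGENCE_PCT=$(echo "100 * ($LOADED_OPEN_MEAN - $PROD_OPEN_MEAN) / $PROD_OPEN_MEAN" | bc -l)
echo "generator contribution: $GEN_CONTRIB events/sec"
echo "SMP loaded vs production: $DIVERGENCE_PCT%"
```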

## Step 6: Output report

Print a markdown report answering "does the lading config reflect the production open workload?" with these sections:

1. **Lading config** — configured open rate (`open_per_second` × 1 = events/sec)
2. **SMP No-Load Baseline (no_fs_load, latest job `<NOLOAD_UUID>`)** — open rate and file activity with CWS + custom policy but no generator, showing the job_id, capture window (first → last non-zero epoch-ms with ISO derived via `date -u -r`), interval, data points count, and the mean expressed as `sum=<S> / n=<N> = <mean>` so the reader can reproduce the arithmetic without trusting the model
3. **SMP Loaded (mean_fs_load, latest job `<LOADED_UUID>`)** — open rate and file activity with CWS + custom policy + generator, showing the job_id, capture window (epoch-ms + `date -u -r` derived ISO), interval, data points count, and the mean expressed as `sum=<S> / n=<N> = <mean>`
4. **Generator Contribution** — delta between mean_fs_load and no_fs_load for open events (between the two latest jobs)
5. **Internal production data per-host avg (open, weekly)** — production open rate weekly mean

Followed by a supplementary note showing `category:file_activity` totals from both SMP experiments and production for reference.

Followed by a one-line assessment: match, mismatch, or insufficient data.

## Step 7: Propose lading config changes

If a mismatch was found, propose concrete changes to `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`:

1. **Target**: the production open weekly mean from Step 4.
2. **Direction of change**: adjust `open_per_second` up or down to push `mean_fs_load_open` toward the target. Do not assume a 1:1 mapping between lading syscalls and CWS events — the security-agent can observe multiple kernel events per lading operation (e.g. a single rename may surface more than two events) and may dedupe or filter others. Treat the lading-to-CWS ratio as empirical, derived from the specific `job_id` used in Step 3b — cite the job_id alongside the ratio so the tuning decision is traceable. A scaling sketch follows this list.
3. **Show the diff** — print the exact YAML change (old value → new value) for `open_per_second`. Do not adjust `rename_per_second` — rename events are not the primary signal.
4. **Offer to apply** — ask whether to edit the lading config file directly.
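
A sketch of that scaling arithmetic, assuming an approximately linear lading-to-CWS relationship (which item 2 warns is only empirical); `CURRENT_OPS` is the current `open_per_second`, and the three means come from Steps 3 and 4:

```bash
# Target generator contribution = production mean minus the measured noise floor.
# Observed contribution per configured op = (loaded - no-load) / CURRENT_OPS.
NEW_OPS=$(echo "$CURRENT_OPS * ($PROD_OPEN_MEAN - $NOLOAD_OPEN_MEAN) / ($LOADED_OPEN_MEAN - $NOLOAD_OPEN_MEAN)" | bc -l)
echo "open_per_second: $CURRENT_OPS -> $NEW_OPS (ratio from job <LOADED_UUID>)"
```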

If data was insufficient (e.g. the latest job had too few non-zero points), offer to use the second-most-recent `job_id` before proposing changes. Do not fall back to wider time windows with coarser rollups.