170 changes: 170 additions & 0 deletions .claude/skills/analyze-quality-gate-security-mean-fs-load/SKILL.md
---
name: analyze-quality-gate-security-mean-fs-load
description: Compare production CWS metrics with SMP regression experiment metrics from quality_gate_security_mean_fs_load and quality_gate_security_no_fs_load
user_invocable: true
allowed-tools: Bash, Read, Skill
---

# analyze-quality-gate-security-mean-fs-load

Compare production CWS `perf_buffer.events.write` open rates against what the `quality_gate_security_mean_fs_load` lading config generates and what SMP captures, using `quality_gate_security_no_fs_load` as the no-load baseline.

Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load:

- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events.
- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load.

The difference between them isolates the generator's contribution, which should match the production workload's rate above the background noise.

The primary signal is `event_type:open` — it is the most common event type in production CWS workloads. Open events have high background noise from always-on VFS hooks (`hook_do_dentry_open`, `hook_vfs_open` under the `"*"` probe selector), but the `no_fs_load` experiment directly measures this noise floor, enabling clean subtraction.

## Step 1: Invoke /explain-lading-config

Call the `explain-lading-config` skill on `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`. Record the expected open event rate: `open_per_second` events/sec (each open operation produces 1 open event). Note `rename_per_second` but do not include it in the primary comparison.

Note: `quality_gate_security_no_fs_load` has `generator: []` (no load generator). There is nothing to explain for it — its purpose is to measure the no-load baseline.

## Step 2: Verify pup auth

Run `pup auth status`. If expired, run `pup auth login`. If auth fails entirely, note that Datadog MCP tools are available as a fallback for read-only queries.

## Step 3: Identify the latest job for each experiment and query it exclusively

SMP runs multiple replicas per experiment; each is tagged with a unique `job_id`. Pin the analysis to one specific run by finding the latest `job_id` per experiment and filtering on it. This avoids time-window heuristics and the 300 s rollup gap-fill that deflates means when a short capture only partially occupies a bucket.

### 3a. List jobs grouped by `job_id`

Run grouped queries over the last 1 day (widen to 2d, then 7d only if nothing is returned). These are **inventory-only** — do not use their values, only their `job_id` scopes and timestamps:

```bash
# mean_fs_load — jobs
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load} by {job_id}.as_rate()' --from 1d --to now

# no_fs_load — jobs
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load} by {job_id}.as_rate()' --from 1d --to now
```

For each returned series, read its `scope` (contains `job_id:<UUID>`) and its `pointlist`. Null-safe: drop points whose value is `None`. Per job, record the timestamp of the **last non-zero point**. The `job_id` with the most recent last-non-zero timestamp is "the latest job" for that experiment.
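The selection rule above can be sketched in Python, assuming `pup` can emit JSON with a `series` array whose entries carry a `scope` string and a `pointlist` of `[timestamp_ms, value]` pairs (this output shape is an assumption, not a documented contract):

```python
import re

def latest_job(series):
    """Return (job_id, first_nonzero_ms, last_nonzero_ms) for the job whose
    last non-zero point is the most recent, or None if no job qualifies.
    `series` is the assumed pup shape:
    [{"scope": "...,job_id:<UUID>,...", "pointlist": [[ts_ms, value], ...]}, ...]
    """
    best = None
    for s in series:
        m = re.search(r"job_id:([0-9a-fA-F-]+)", s["scope"])
        if not m:
            continue
        # Null-safe: drop None values, then keep only non-zero points.
        nonzero = [ts for ts, v in s["pointlist"] if v is not None and v != 0]
        if not nonzero:
            continue
        candidate = (m.group(1), min(nonzero), max(nonzero))
        # Keep the job whose last non-zero timestamp is the most recent.
        if best is None or candidate[2] > best[2]:
            best = candidate
    return best
```

The returned first/last non-zero timestamps double as `FIRST_MS` / `LAST_MS` for the window derivation in Step 3b.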
> **Review comment (Contributor):** This part seems potentially scriptable in the future. I don't think it's worth blocking on writing a script for this operation -- I'd rather have the skill in the repo sooner -- but worth noting as a potential variance-reducing optimization.

### 3b. Re-query each latest `job_id` exclusively

Add `job_id:<UUID>` to the tag filter. Pin the query window to the job itself — not a relative range like `--from 1h`. Using a relative range is non-deterministic across runs: it shifts with wall-clock time, and the API rollup may include or exclude an edge bucket depending on where "now" lands, producing different pointlists and different means for the same `job_id`.

From Step 3a you have the first and last non-zero epoch-ms (`FIRST_MS`, `LAST_MS`) for each latest `job_id`. Derive a deterministic window:

```bash
FROM_TS=$(( FIRST_MS / 1000 - 60 )) # 1 minute of padding before
TO_TS=$(( LAST_MS / 1000 + 60 )) # 1 minute of padding after
```

Pass these as absolute Unix seconds to `pup`. The window width should stay under ~1 h so the API still returns 20-second interval data.

Run in parallel:

```bash
# mean_fs_load — latest job
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<LOADED_UUID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<LOADED_UUID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"

# no_fs_load — latest job
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<NOLOAD_UUID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<NOLOAD_UUID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
```

The `.as_rate()` modifier guarantees the result is in events/sec. Both this metric and the production metric are type `rate` (statsd_interval=10), so `.as_rate()` is a no-op today — but it documents intent and protects against metadata changes.

Compute `mean` over **non-zero, non-null** points only. Record the `job_id`, the first and last non-zero timestamps (capture window), the interval returned, the data points count, and the raw sum used for the mean.
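A minimal sketch of that mean computation (the `pointlist` shape from Step 3a is an assumption):

```python
def mean_report(pointlist):
    """Mean over non-zero, non-null points only, reported with the raw sum
    and point count so the arithmetic in the final report is reproducible."""
    values = [v for _, v in pointlist if v is not None and v != 0]
    total = sum(values)
    n = len(values)
    return {"sum": total, "n": n, "mean": total / n if n else None}
```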

When reporting `capture_first_ts` / `capture_last_ts`, do not compute ISO strings by hand; use:

```bash
python3 -c "from datetime import datetime, timezone; print(datetime.fromtimestamp($(( MS / 1000 )), timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))"
```

> **Review comment (Contributor):** +1; Python is probably more cross-platform than `date` from GNU Coreutils.

Always print the raw epoch-ms alongside the ISO string so the reader can verify the conversion.

Do not include min/max in the report.

### 3c. Sanity checks

- If fewer than ~5 non-zero points come back for a latest job, the capture may be in flight or interrupted. Try the second-most-recent `job_id` and report both so the reviewer can judge.
- If the `mean_fs_load` and `no_fs_load` latest `job_id`s ran more than 24 h apart, flag the staleness — background noise floor drifts over time.
> **Review comment (Contributor):** The background noise floor drift is an interesting observation. I wonder if we can model that in the future.

- Never fall back to wider windows with coarser rollups (≥300 s intervals) as a substitute — rollup gap-fill deflates per-job means when a capture only partially occupies a bucket.

> **Review comment (Contributor):** Good call-out here.

### 3d. Anti-pattern (why `job_id` is authoritative)

Do **not** identify runs by time heuristics (contiguous non-zero clusters, gap-detection, "last N minutes"). A single SMP run at 20 s resolution still contains transient zero buckets that masquerade as run boundaries once rolled up, and multiple replicas of the same experiment run concurrently, so any time-slice may contain overlapping jobs. The `job_id` tag is the only authoritative grouping for a single run.

## Step 4: Query production metric (weekly mean)

Query production data over the last 7 days to get the weekly mean. Run both queries in parallel:

```bash
# Primary — open-only, weekly mean
pup --read-only metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{event_type:open}.as_rate()' --from 7d --to now

# Context — all file activity, weekly mean
pup --read-only metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{category:file_activity}.as_rate()' --from 7d --to now
```

All queries use `.as_rate()` so every value in the comparison chain is in the same unit (events/sec):
- Lading config: `open_per_second` × 1 = events/sec
- SMP captured: `.as_rate()` → events/sec
- Production: `.as_rate()` → events/sec

Compute the mean from the weekly pointlist values. Do not include min/max in the report.

## Step 5: Compare and analyze

The primary comparison uses `event_type:open` values only.

All values must be in **events/sec** (guaranteed by `.as_rate()` on metric queries, and by the explain-lading-config breakdown for lading).

**Primary table — open events/sec:**

| Source | Value (events/sec) | Description |
|--------|---------------------|-------------|
| Lading config | from Step 1 | `open_per_second` × 1 |
| SMP — no_fs_load (open) | from Step 3 | Background open noise with custom policy, no generator |
| SMP — mean_fs_load (open) | from Step 3 | Open rate with custom policy and generator running |
| Generator contribution | computed | mean_fs_load - no_fs_load |
| Internal production data per-host avg (open, weekly) | from Step 4 | production `event_type:open` `.as_rate()` weekly mean |

**Supplementary context** (not used for tuning decisions):

| Source | Value (events/sec) | Description |
|--------|---------------------|-------------|
| SMP — no_fs_load (all file activity) | from Step 3 | no-load `category:file_activity` — background noise floor |
| SMP — mean_fs_load (all file activity) | from Step 3 | loaded `category:file_activity` |
| Internal production data per-host avg (all file activity, weekly) | from Step 4 | production `category:file_activity` weekly mean |

Analysis:
- **No-load baseline check**: The `no_fs_load` open rate captures background noise from always-on VFS hooks while the custom policy is in effect. Record this value as the noise floor.
- **Generator contribution check**: `generator_contribution = mean_fs_load_open - no_fs_load_open` reflects the generator's observable effect on CWS. Note that this need not equal `open_per_second` — one lading syscall can produce multiple CWS events (and vice versa).
- **Production validity check**: Compare the SMP `mean_fs_load` open rate against the internal production data per-host weekly open average. If they diverge, flag the gap.
- **Context**: note the `category:file_activity` totals for reference.
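The analysis arithmetic, with purely hypothetical numbers for illustration:

```python
# All inputs below are hypothetical placeholders, not measured values.
no_fs_load_open = 12.0    # SMP noise floor (no_fs_load), events/sec
mean_fs_load_open = 52.0  # SMP loaded rate (mean_fs_load), events/sec
production_open = 48.0    # production weekly mean, events/sec

# Generator contribution: loaded rate minus the no-load noise floor.
generator_contribution = mean_fs_load_open - no_fs_load_open

# Production validity: relative gap between loaded SMP rate and production.
production_gap_pct = abs(mean_fs_load_open - production_open) / production_open * 100

print(f"generator contribution: {generator_contribution:.1f} events/sec")
print(f"gap vs production: {production_gap_pct:.1f}%")
```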

## Step 6: Output report

Print a markdown report answering "does the lading config reflect the production open workload?" with these sections:

1. **Lading config** — configured open rate (`open_per_second` × 1 = events/sec)
2. **SMP No-Load Baseline (no_fs_load, latest job `<NOLOAD_UUID>`)** — open rate and file activity with CWS + custom policy but no generator, showing the job_id, capture window (first → last non-zero epoch-ms with ISO derived via the Python one-liner from Step 3b), interval, data points count, and the mean expressed as `sum=<S> / n=<N> = <mean>` so the reader can reproduce the arithmetic without trusting the model
3. **SMP Loaded (mean_fs_load, latest job `<LOADED_UUID>`)** — open rate and file activity with CWS + custom policy + generator, showing the job_id, capture window (epoch-ms + Python-derived ISO), interval, data points count, and the mean expressed as `sum=<S> / n=<N> = <mean>`
4. **Generator Contribution** — delta between mean_fs_load and no_fs_load for open events (between the two latest jobs)
5. **Internal production data per-host avg (open, weekly)** — production open rate weekly mean

Followed by a supplementary note showing `category:file_activity` totals from both SMP experiments and production for reference.

Followed by a one-line assessment: match, mismatch, or insufficient data.

## Step 7: Propose lading config changes

If a mismatch was found, propose concrete changes to `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`:

1. **Target**: the production open weekly mean from Step 4.
2. **Direction of change**: adjust `open_per_second` up or down to push `mean_fs_load_open` toward the target. Do not assume a 1:1 mapping between lading syscalls and CWS events — the security-agent can observe multiple kernel events per lading operation (e.g. a single rename may surface more than two events) and may dedupe or filter others. Treat the lading-to-CWS ratio as empirical, derived from the specific `job_id` used in Step 3b — cite the job_id alongside the ratio so the tuning decision is traceable.
3. **Show the diff** — print the exact YAML change (old value → new value) for `open_per_second`. Do not adjust `rename_per_second` — rename events are not the primary signal.
4. **Offer to apply** — ask whether to edit the lading config file directly.
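The tuning arithmetic can be sketched as follows (all numbers are hypothetical; the empirical ratio must be derived from the specific `job_id` used in Step 3b):

```python
# Hypothetical inputs for illustration only.
configured_open_per_second = 30.0  # current value in lading.yaml
generator_contribution = 40.0      # mean_fs_load_open - no_fs_load_open, from Step 5
no_fs_load_open = 12.0             # noise floor, from Step 3
production_target = 48.0           # production open weekly mean, from Step 4

# Empirical CWS-events-per-lading-open ratio; do not assume it is 1.0.
ratio = generator_contribution / configured_open_per_second

# The generator must contribute the production rate above the noise floor.
needed_contribution = production_target - no_fs_load_open
proposed_open_per_second = needed_contribution / ratio

print(f"open_per_second: {configured_open_per_second} -> {proposed_open_per_second:.1f}")
```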

If data was insufficient (e.g. the latest job had too few non-zero points), offer to use the second-most-recent `job_id` before proposing changes. Do not fall back to wider time windows with coarser rollups.
3 changes: 2 additions & 1 deletion .github/CODEOWNERS
@@ -941,4 +941,5 @@
/q_branch/ @DataDog/q-branch

# AI-related files
/.claude/skills/explain-lading-config @DataDog/single-machine-performance
/.claude/skills/analyze-quality-gate-security-mean-fs-load @DataDog/single-machine-performance @DataDog/agent-security