Add analyze-quality-gate-security-mean-fs-load skill #50375
---
name: analyze-quality-gate-security-mean-fs-load
description: Compare production CWS metrics with SMP regression experiment metrics from quality_gate_security_mean_fs_load and quality_gate_security_no_fs_load
user_invocable: true
allowed-tools: Bash, Read, Skill
---

# analyze-quality-gate-security-mean-fs-load

Compare production CWS `perf_buffer.events.write` open rates against what the `quality_gate_security_mean_fs_load` lading config generates and what SMP captures, using `quality_gate_security_no_fs_load` as the no-load baseline.

Two SMP experiments form a pair. Both run with the same custom `default.policy`; the axis that distinguishes them is whether lading generates filesystem load:

- **`quality_gate_security_no_fs_load`** — CWS enabled, custom `default.policy`, `generator: []`. Measures the floor for this policy: background event noise and policy-loaded memory footprint with no application-generated filesystem events.
- **`quality_gate_security_mean_fs_load`** — CWS enabled, custom `default.policy`, `file_tree` generator. Measures overhead under a production-representative mean filesystem load.

The difference between them isolates the generator's contribution, which should match the production workload above background noise.

The primary signal is `event_type:open` — it is the most common event type in production CWS workloads. Open events have high background noise from always-on VFS hooks (`hook_do_dentry_open`, `hook_vfs_open` under the `"*"` probe selector), but the `no_fs_load` experiment directly measures this noise floor, enabling clean subtraction.

## Step 1: Invoke /explain-lading-config

Call the `explain-lading-config` skill on `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`. Record the expected open event rate: `open_per_second` events/sec (each open operation produces 1 open event). Note `rename_per_second` but do not include it in the primary comparison.

Note: `quality_gate_security_no_fs_load` has `generator: []` (no load generator). There is nothing to explain for it — its purpose is to measure the no-load baseline.

## Step 2: Verify pup auth

Run `pup auth status`. If expired, run `pup auth login`. If auth fails entirely, note that Datadog MCP tools are available as a fallback for read-only queries.

## Step 3: Identify the latest job for each experiment and query it exclusively

SMP runs multiple replicas per experiment; each is tagged with a unique `job_id`. Pin the analysis to one specific run by finding the latest `job_id` per experiment and filtering on it. This avoids time-window heuristics and the 300 s rollup gap-fill that deflates means when a short capture only partially occupies a bucket.

### 3a. List jobs grouped by `job_id`

Run grouped queries over the last 1 day (widen to 2d, then 7d only if nothing is returned). These are **inventory-only** — do not use their values, only their `job_id` scopes and timestamps:

```bash
# mean_fs_load — jobs
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load} by {job_id}.as_rate()' --from 1d --to now

# no_fs_load — jobs
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load} by {job_id}.as_rate()' --from 1d --to now
```

For each returned series, read its `scope` (which contains `job_id:<UUID>`) and its `pointlist`. Be null-safe: drop points whose value is `None`. Per job, record the timestamp of the **last non-zero point**. The `job_id` with the most recent last-non-zero timestamp is "the latest job" for that experiment.
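
The selection rule above can be sketched in Python. This is a minimal illustration, not actual API output: the `series` shape (`job_id` mapped to a `[epoch_ms, value]` pointlist) and the job IDs are invented for the example.

```python
def latest_job(series):
    """Pick "the latest job" from grouped query results.

    `series` maps job_id -> pointlist of [epoch_ms, value] pairs; a value
    may be None (gap) or 0.0 (empty bucket). The winner is the job whose
    last non-zero point has the most recent timestamp.
    """
    last_nonzero = {}
    for job_id, points in series.items():
        ts = [t for t, v in points if v is not None and v != 0]
        if ts:  # jobs with no non-zero data never win
            last_nonzero[job_id] = max(ts)
    return max(last_nonzero, key=last_nonzero.get)

# Illustrative data only — real scopes carry job_id:<UUID> tags.
jobs = {
    "job-a": [[1000, 5.0], [2000, 0.0], [3000, None]],
    "job-b": [[2000, 4.0], [4000, 6.0]],
}
print(latest_job(jobs))  # job-b (last non-zero point at t=4000)
```

Note that `job-a`'s trailing `None` and `0.0` points are ignored, so its last non-zero timestamp is 1000, not 3000.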

### 3b. Re-query each latest `job_id` exclusively

Add `job_id:<UUID>` to the tag filter. Pin the query window to the job itself — not a relative range like `--from 1h`. Using a relative range is non-deterministic across runs: it shifts with wall-clock time, and the API rollup may include or exclude an edge bucket depending on where "now" lands, producing different pointlists and different means for the same `job_id`.

From Step 3a you have the first and last non-zero epoch-ms (`FIRST_MS`, `LAST_MS`) for each latest `job_id`. Derive a deterministic window:

```bash
FROM_TS=$(( FIRST_MS / 1000 - 60 ))  # 1 minute of padding before
TO_TS=$(( LAST_MS / 1000 + 60 ))     # 1 minute of padding after
```

Pass these as absolute Unix seconds to `pup`. The window width should stay under ~1 h so the API still returns 20-second interval data.

Run in parallel:

```bash
# mean_fs_load — latest job
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<LOADED_UUID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_mean_fs_load,job_id:<LOADED_UUID>}.as_rate()' --from "$LOADED_FROM_TS" --to "$LOADED_TO_TS"

# no_fs_load — latest job
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{event_type:open,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<NOLOAD_UUID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
pup --read-only metrics query --query 'avg:single_machine_performance.regression_detector.capture.datadog.runtime_security.perf_buffer.events.write{category:file_activity,variant:comparison,experiment:quality_gate_security_no_fs_load,job_id:<NOLOAD_UUID>}.as_rate()' --from "$NOLOAD_FROM_TS" --to "$NOLOAD_TO_TS"
```

The `.as_rate()` modifier guarantees the result is in events/sec. Both this metric and the production metric are type `rate` (`statsd_interval=10`), so `.as_rate()` is a no-op today — but it documents intent and protects against metadata changes.

Compute the mean over **non-zero, non-null** points only. Record the `job_id`, the first and last non-zero timestamps (the capture window), the interval returned, the data point count, and the raw sum used for the mean.
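
A minimal sketch of that computation, assuming the `[epoch_ms, value]` pointlist shape returned by the metrics API (the sample values are invented):

```python
def job_mean(pointlist):
    """Mean over the non-zero, non-null points of a Datadog pointlist.

    Returns (mean, raw_sum, n) so the report can show the
    `sum=<S> / n=<N> = <mean>` arithmetic verbatim.
    """
    vals = [v for _, v in pointlist if v is not None and v != 0]
    if not vals:
        return None, 0.0, 0
    return sum(vals) / len(vals), sum(vals), len(vals)

# Two gap-fill artifacts (None, 0.0) and three real 20 s buckets.
mean, s, n = job_mean([[0, None], [20, 0.0], [40, 120.0], [60, 118.0], [80, 121.0]])
print(f"sum={s} / n={n} = {mean:.2f}")  # sum=359.0 / n=3 = 119.67
```

Excluding the `None` and `0.0` points is what keeps rollup gap-fill from deflating the mean.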

When reporting `capture_first_ts` / `capture_last_ts`, do not compute ISO strings by hand; use:

```bash
python3 -c "from datetime import datetime, timezone; print(datetime.fromtimestamp($(( MS / 1000 )), timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))"
```

> **Contributor:** +1; Python is probably more cross-platform than

Always print the raw epoch-ms alongside the ISO string so the reader can verify the conversion.
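
One way to keep the two in lockstep is a small helper; the function name is hypothetical, and the sample timestamp is arbitrary:

```python
from datetime import datetime, timezone

def show_ts(ms: int) -> str:
    """Render a capture timestamp as raw epoch-ms plus the derived ISO string."""
    iso = datetime.fromtimestamp(ms // 1000, timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{ms} ({iso})"

print(show_ts(1700000000000))  # 1700000000000 (2023-11-14T22:13:20Z)
```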

Do not include min/max in the report.

### 3c. Sanity checks

- If fewer than ~5 non-zero points come back for a latest job, the capture may be in flight or interrupted. Try the second-most-recent `job_id` and report both so the reviewer can judge.
- If the `mean_fs_load` and `no_fs_load` latest `job_id`s ran more than 24 h apart, flag the staleness — background noise floor drifts over time.

> **Contributor:** The background noise floor drift is an interesting observation. I wonder if we can model that in the future.

- Never fall back to wider windows with coarser rollups (≥300 s intervals) as a substitute — rollup gap-fill deflates per-job means when a capture only partially occupies a bucket.

> **Contributor:** Good call-out here.

### 3d. Anti-pattern (why `job_id` is authoritative)

Do **not** identify runs by time heuristics (contiguous non-zero clusters, gap detection, "last N minutes"). A single SMP run at 20 s resolution still contains transient zero buckets that masquerade as run boundaries once rolled up, and multiple replicas of the same experiment run concurrently, so any time slice may contain overlapping jobs. The `job_id` tag is the only authoritative grouping for a single run.

## Step 4: Query production metric (weekly mean)

Query production data over the last 7 days to get the weekly mean. Run both queries in parallel:

```bash
# Primary — open-only, weekly mean
pup --read-only metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{event_type:open}.as_rate()' --from 7d --to now

# Context — all file activity, weekly mean
pup --read-only metrics query --query 'avg:datadog.runtime_security.perf_buffer.events.write{category:file_activity}.as_rate()' --from 7d --to now
```

All queries use `.as_rate()` so every value in the comparison chain is in the same unit (events/sec):

- Lading config: `open_per_second` × 1 = events/sec
- SMP captured: `.as_rate()` → events/sec
- Production: `.as_rate()` → events/sec

Compute the mean from the weekly pointlist values. Do not include min/max in the report.

## Step 5: Compare and analyze

The primary comparison uses `event_type:open` values only.

All values must be in **events/sec** (guaranteed by `.as_rate()` on metric queries, and by the explain-lading-config breakdown for lading).

**Primary table — open events/sec:**

| Source | Value (events/sec) | Description |
|--------|---------------------|-------------|
| Lading config | from Step 1 | `open_per_second` × 1 |
| SMP — no_fs_load (open) | from Step 3 | Background open noise with custom policy, no generator |
| SMP — mean_fs_load (open) | from Step 3 | Open rate with custom policy and generator running |
| Generator contribution | computed | mean_fs_load - no_fs_load |
| Internal production data per-host avg (open, weekly) | from Step 4 | production `event_type:open` `.as_rate()` weekly mean |

**Supplementary context** (not used for tuning decisions):

| Source | Value (events/sec) | Description |
|--------|---------------------|-------------|
| SMP — no_fs_load (all file activity) | from Step 3 | no-load `category:file_activity` — background noise floor |
| SMP — mean_fs_load (all file activity) | from Step 3 | loaded `category:file_activity` |
| Internal production data per-host avg (all file activity, weekly) | from Step 4 | production `category:file_activity` weekly mean |

Analysis:

- **No-load baseline check**: the `no_fs_load` open rate captures background noise from always-on VFS hooks while the custom policy is in effect. Record this value as the noise floor.
- **Generator contribution check**: `generator_contribution = mean_fs_load_open - no_fs_load_open` reflects the generator's observable effect on CWS. Note that this need not equal `open_per_second` — one lading syscall can produce multiple CWS events (and vice versa).
- **Production validity check**: compare the SMP `mean_fs_load` open rate against the internal production data per-host weekly open average. If they diverge, flag the gap.
- Note the `category:file_activity` totals for reference.
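
For concreteness, the Step 5 arithmetic with invented numbers; none of these values come from real captures:

```python
# Invented example values, all in events/sec (.as_rate() guarantees the unit).
no_fs_load_open = 40.0     # noise floor: always-on VFS hooks, no generator
mean_fs_load_open = 520.0  # noise floor + generator
prod_open_weekly = 475.0   # production per-host weekly mean

generator_contribution = mean_fs_load_open - no_fs_load_open  # 480.0
gap = mean_fs_load_open - prod_open_weekly                    # 45.0

print(f"generator contribution: {generator_contribution} events/sec")
print(f"SMP vs production gap: {gap:+.1f} events/sec ({100 * gap / prod_open_weekly:+.1f}%)")
```

With these numbers the loaded experiment overshoots production by about 9.5%, which would be flagged in the report.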

## Step 6: Output report

Print a markdown report answering "does the lading config reflect the production open workload?" with these sections:

1. **Lading config** — configured open rate (`open_per_second` × 1 = events/sec)
2. **SMP No-Load Baseline (no_fs_load, latest job `<NOLOAD_UUID>`)** — open rate and file activity with CWS + custom policy but no generator, showing the job_id, capture window (first → last non-zero epoch-ms with ISO derived via the `python3` one-liner from Step 3), interval, data point count, and the mean expressed as `sum=<S> / n=<N> = <mean>` so the reader can reproduce the arithmetic without trusting the model
3. **SMP Loaded (mean_fs_load, latest job `<LOADED_UUID>`)** — open rate and file activity with CWS + custom policy + generator, showing the job_id, capture window (epoch-ms + derived ISO), interval, data point count, and the mean expressed as `sum=<S> / n=<N> = <mean>`
4. **Generator Contribution** — delta between mean_fs_load and no_fs_load for open events (between the two latest jobs)
5. **Internal production data per-host avg (open, weekly)** — production open rate weekly mean

Follow with a supplementary note showing `category:file_activity` totals from both SMP experiments and production for reference, then a one-line assessment: match, mismatch, or insufficient data.

## Step 7: Propose lading config changes

If a mismatch was found, propose concrete changes to `test/regression/cases/quality_gate_security_mean_fs_load/lading/lading.yaml`:

1. **Target**: the production open weekly mean from Step 4.
2. **Direction of change**: adjust `open_per_second` up or down to push `mean_fs_load_open` toward the target. Do not assume a 1:1 mapping between lading syscalls and CWS events — the security-agent can observe multiple kernel events per lading operation (e.g. a single rename may surface more than two events) and may dedupe or filter others. Treat the lading-to-CWS ratio as empirical, derived from the specific `job_id` used in Step 3b — cite the job_id alongside the ratio so the tuning decision is traceable.
3. **Show the diff** — print the exact YAML change (old value → new value) for `open_per_second`. Do not adjust `rename_per_second` — rename events are not the primary signal.
4. **Offer to apply** — ask whether to edit the lading config file directly.
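
Assuming the empirical ratio scales linearly with `open_per_second`, the retuning arithmetic could be sketched as follows; the function name and every number are hypothetical:

```python
def propose_open_per_second(current_ops, no_load_open, loaded_open, prod_target):
    """Scale open_per_second so the loaded open rate lands on the production
    target, using the empirically observed lading-to-CWS ratio rather than
    an assumed 1:1 mapping."""
    ratio = (loaded_open - no_load_open) / current_ops  # CWS open events per lading op
    return (prod_target - no_load_open) / ratio

# Invented numbers: 500 configured ops/sec produced 480 events/sec above
# the 40 events/sec noise floor; the production weekly mean is 475 events/sec.
new = propose_open_per_second(current_ops=500, no_load_open=40.0,
                              loaded_open=520.0, prod_target=475.0)
print(f"open_per_second: 500 -> {new:.0f}")  # open_per_second: 500 -> 453
```

Subtracting the noise floor on both sides is what keeps the proposal from chasing background events the generator cannot influence.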

If data was insufficient (e.g. the latest job had too few non-zero points), offer to use the second-most-recent `job_id` before proposing changes. Do not fall back to wider time windows with coarser rollups.

> **Contributor:** This part seems potentially scriptable in the future. I don't think it's worth blocking on writing a script for this operation -- I'd rather have the skill in the repo sooner -- but worth noting as a potential variance-reducing optimization.