diff --git a/.github/workflows/ci-failure-scan.md b/.github/workflows/ci-failure-scan.md index 8fa3343e25d1b0..610b90e7262849 100644 --- a/.github/workflows/ci-failure-scan.md +++ b/.github/workflows/ci-failure-scan.md @@ -184,33 +184,84 @@ Same-run KBE + PR is not possible: gh-aw strict mode forbids `issues: write` on The agent must accept this constraint and produce KBEs in run N, then companion PRs in run N+1. The 12-hour cadence makes this acceptable: the KBE alone unblocks PR CI immediately (the moment the safe-outputs job processes it, ~1 min after the agent finishes), and the muting PR follows within 12h. -For each actionable failure, in this order: +For each actionable failure, walk through all six checks below before deciding the action — multiple can fire at once, and any one is reason to stop. -1. **Search for existing artifacts** before creating anything new: - - `search_issues` for an open KBE: `is:issue is:open label:"Known Build Error" in:body ""`. Try variations on the signature (full `[FAIL]` line, assertion text, exception type + test name). - - `search_pull_requests` for an open muting PR that already silences this test: `is:pr is:open in:title "" "[ci-scan]"` and `is:pr is:open "" ActiveIssue`. - - `search_pull_requests` for an open small-fix PR: `is:pr is:open in:title "" "[ci-scan]"`. - - If a KBE + muting PR already cover this failure, **skip** — record it in the coverage tally as `→ already-covered: KBE # + PR #` and move on. Do not duplicate. -2. **No existing KBE → file one via safe-outputs `create_issue`**. The only labels permitted on KBE issues are `Known Build Error` and `blocking-clean-ci` (see "Outputs: title and labels" below). Title prefix: `[ci-scan] `. Body: the KBE format described in "Known Build Error issue" below. The safe-outputs handler will create the issue ~1 minute after the agent finishes; the issue number is not available to the agent during this run. -3. 
**Existing KBE found AND failure still occurring AND no muting PR exists yet → open the muting PR via safe-outputs `create_pull_request`** with the existing KBE issue number hardcoded in the diff: `[ActiveIssue("https://github.com/dotnet/runtime/issues/", ...)]` for unit tests, `true` (with an inline `` comment) for stress-incompatible JIT csproj families. PR title prefix `[ci-scan] `; **the PR body MUST include a top-level "Linked KBE" line of the form `Linked KBE: #` so the link is unambiguous and machine-readable**, in addition to the prose "Linked KBE" section. This PR must change **only test annotations / csproj test-config flags** — no product code, no diagnosis, no logic. Aim for ≤ 5 lines of diff. -4. **(Optional, alongside step 3) Open a small-fix PR via safe-outputs `create_pull_request`** if the failure satisfies the "small product fix opportunity" criteria above. Separate PR, separate branch, separate diff. PR body must (a) cite the failing test as evidence, (b) explain the root cause, (c) state explicitly why the fix is safe, (d) include `Linked KBE: #` as a top-level line, and (e) note "If this lands before #, that PR can be closed". Do not bundle the fix into the muting PR — keep them separate so a maintainer can take one without the other. +#### Step 1 — Look for an existing KBE. -Caps: safe-outputs `create_issue` max 5/run, `create_pull_request` max 10/run. When a cap is hit, fall back to "skipped: cap reached" rather than silently dropping signatures — subsequent runs will pick them up. +Search `is:issue is:open label:"Known Build Error" in:body ""`. Try variations: the full `[FAIL]` line, the assertion text, the exception class plus the test name. On a hit, record the issue number as `existing-kbe` and continue — finding a KBE doesn't end the walk, it changes the final action. -After every run, you should be able to answer YES to **whichever of these applies to each failure**: +#### Step 2 — Look for an area-team tracker without the KBE label. 
-- **First-encounter failure (no existing KBE):** "Did I file the KBE?" Muting/fix PRs are deferred to the next run — they cannot reference an issue number that doesn't exist yet at agent runtime. -- **Existing KBE, no muting PR yet:** "Did I open the muting PR (and, if criteria met, the small-fix PR)?" -- **Existing KBE + existing muting PR:** "Did I confirm both, skip silently, and record `→ already-covered: KBE # + PR #` in the coverage tally?" +Some teams track recurring failures in plain issues. Search `is:issue is:open in:title ""` together with `in:body ""`. On a hit, record it as `linked-tracker` — but **do not** treat the tracker as a substitute for a KBE. Build Analysis only matches against issues that carry the `Known Build Error` label and a valid JSON body, so a plain tracker won't unblock PR CI on its own. File a new KBE for Build Analysis to match against, and cross-link the tracker (`Tracking: dotnet/runtime#`) inside the KBE body and the muting PR body. -If the answer is NO for any failure, you have not done the job. +#### Step 3 — Look for an existing muting PR. + +Search `is:pr is:open in:title "" "[ci-scan]"` and `is:pr is:open "" ActiveIssue`. On a hit, record `→ existing-PR #` (muting) and stop. + +#### Step 4 — Look for an in-flight fix PR by anyone. + +Search broadly — not only `[ci-scan]` PRs — by test name, file path, and assembly: `is:pr is:open ""`, `is:pr is:open ""`, `is:pr is:open "" in:title`. For each candidate, fetch the PR body; if it claims to fix this failure (or links the same KBE), stop and record `→ existing-PR #` (in-flight fix). + +#### Step 5 — Verify every issue number you're about to write actually exists. + +For every `` you plan to embed in source (`[ActiveIssue("...issues/")]`, `Linked KBE: #`, the inline `` comment), call the github tool `issue_read` with method `get` (`{"method": "get", "owner": "dotnet", "repo": "runtime", "issue_number": }`) and confirm it returns an open issue. 
If it doesn't, stop — a dead-link annotation in source requires a follow-up PR to remove. + +#### Step 6 — Confirm muting is welcome on this issue. + +Read the candidate KBE / tracker's body and its most recent area-owner comment. Skip muting (record `→ skipped: do-not-mute on issue #` and stop) if any of the following holds: + +- The body or a recent comment from an area owner explicitly says not to mute, disable, or skip — e.g. "please don't disable these tests", "do not mute", "keep failing", "investigation in progress". +- The issue carries a label semantically equivalent to such a do-not-mute request (verify the label exists in `dotnet/runtime` before relying on it; do not invent labels). +- The most recent area-owner comment (within the last 14 days) actively opposes muting on procedural grounds — e.g. requesting a fix-forward, awaiting a JIT/GC repro. + +When in doubt, skip muting and let the next run revisit; over-muting against an active investigation is the failure mode this step exists to prevent. + +#### What action to take + +- **Step 1 found nothing** → file a new KBE via safe-outputs `create_issue` with the body template below, title prefix `[ci-scan] `, and only the labels `Known Build Error` and `blocking-clean-ci`. If Step 2 found a tracker, cross-link it as `Tracking: dotnet/runtime#` in the KBE body. The issue number isn't visible during this run, so the muting PR is deferred to the next run. +- **Step 1 found a valid KBE, AND steps 3–6 are clean** → open the muting PR via safe-outputs `create_pull_request`. Diff ≤ 5 lines, only test annotations or csproj flags. The body must include `Linked KBE: #` as a top-level line plus the four-question verification block below. If Step 2 also found a tracker, cite it as `Tracking: dotnet/runtime#` alongside `Linked KBE`. +- **Plus, if the failure satisfies the "small product fix opportunity" criteria above** → open a separate fix PR on its own branch. 
Body cites (a) the failing test as evidence, (b) the root cause, (c) why the fix is safe, (d) `Linked KBE: #`, and (e) "If this lands before #, that PR can be closed." Kept separate so a maintainer can take one without the other. + +#### Before you link a KBE, verify it actually matches + +Test-name overlap alone is not enough — common wrong-link patterns include reusing a KBE filed against a different architecture, or one about an exception class for a failure that's actually a work-item timeout. Answer all four before writing `Linked KBE: #` or `[ActiveIssue("...issues/")]`: + +1. Does the candidate KBE describe the **same test (or test family)** as the current `[FAIL]` line? +2. Does its `ErrorMessage` / quoted exception text describe the **same failure signature** (exception class, assertion message)? +3. Is the failing **OS** in the set the KBE says it impacts? +4. Is the failing **architecture** in the set the KBE says it impacts? + +If any answer is no, file a fresh KBE this run and defer the muting PR. Embed the four answers in the PR body's "Reasoning" section; PRs missing this, or with an unaddressed mismatch, will be closed. + +Optional fifth check when the candidate KBE is older than ~14 days: confirm Build Analysis is still actually matching it. The hit count appears in the issue body and is rewritten by Build Analysis on every match — a stale, never-edited body is a hint the signature went bad. `gh api graphql` over `userContentEdits` on the issue gives the edit timeline. + +#### Caps and end-of-run check + +Per-run caps: `create_issue` max 5, `create_pull_request` max 10. On cap, record `→ skipped: cap reached` — the next run picks them up. + +Before stopping, confirm each failure is handled: + +- **No existing KBE** → KBE filed? +- **Existing KBE, no muting PR yet** → muting PR opened (and the optional fix PR if criteria are met)? +- **Existing KBE plus existing muting PR or in-flight fix** → `→ existing-PR #` recorded? 
+ +If the answer is no for any failure, the run is incomplete. ### Per-failure-class rules The two-pass flow above applies to all classes below. "KBE + muting PR" means: KBE in the run that first encounters the failure, muting PR in the next run that finds the KBE already exists. - **Recurring failure with a stable error signature** (≥ 2 occurrences on `main` in the scanned window) → KBE (run N) + muting PR (run N+1) + fix PR (optional, run N+1, only if criteria met). -- **Per-test platform / configuration incompatibility** (e.g., test fails only under `jitstress=2`, `gcstress=0xC`, on a single mobile arch, on browser, on NativeAOT) → KBE (run N) + muting PR (run N+1). Allowed muting PR mechanisms: +- **Per-test platform / configuration incompatibility** (e.g., test fails only under `jitstress=2`, `gcstress=0xC`, on a single mobile arch, on browser, on NativeAOT) → KBE (run N) + muting PR (run N+1). The muting PR's skip condition MUST be **as narrow as the observed failure scope** — only the OS / arch / config combinations that actually fail. + + | Observed failure scope | ❌ Bad (too broad) | ✅ Good (matches scope) | + |---|---|---| + | Only `linux-arm` fails | `[SkipOnPlatform(TestPlatforms.AnyUnix, ...)]` or muting on all NativeAOT | `true` | + | Only NativeAOT on a single arch | `true` (all arches) | `true` | + | Only one stress mode | `true` (all stress modes) | Add stress-mode predicate, e.g. gate via the existing `GCStressIncompatible` only for the failing variant | + + In the PR's "Reasoning" section, list the exact set of failing legs (definition + queue + stress mode) that justifies the chosen condition, so a reviewer can verify scope matches evidence. +- Allowed muting PR mechanisms: - `[SkipOnPlatform(TestPlatforms., "")]` for platform-specific failures. - `[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.))]` narrowed via existing helpers. 
- `[ActiveIssue("https://github.com/dotnet/runtime/issues/", TestPlatforms.)]` referencing the KBE. @@ -321,9 +372,32 @@ File one when **all** of the following hold: - No fix PR is currently open (verify via `search_pull_requests`). - The failure is **not** a build break or an infrastructure failure — only test failures or hangs are eligible for a KBE. Build breaks and infra failures (for example dead-letter, device-lost, or agent-disconnect issues) must use a regular tracking issue. -Required structure (Build Analysis is strict — match the headings exactly, and use **exactly three backticks** for the JSON code fence; never four. The opening and closing fence must be the same length, otherwise the fence is broken and Build Analysis silently skips the issue): +Required structure: match the headings exactly. The literal body MUST look like one of the two templates below — pick exactly one (literal substring is the default; regex only if no single literal line is specific enough). Do not emit both blocks. The outer fence in this prompt uses `~~~` (tildes) only so the inner ` ``` ` fences stay literal; in the issue you emit, do **not** use tildes anywhere — emit only the inner content between (but not including) the `~~~` lines for the template you chose. Walk the "Verify the body before submitting" checks below before committing to the issue body. + +**Template A — literal substring match (default).** Pick this when the failure log contains a stable, specific assertion or exception message line. + +~~~ +## Build Information +Build: +Build error leg or test failing: - +Pull request: +## Error Message + + + +```json +{ + "ErrorMessage": "", + "BuildRetry": false, + "ExcludeConsoleLog": false +} ``` +~~~ + +**Template B — regex match.** Pick this only when no single literal line is specific enough. Anchored, prefer `[^\\n]*` over `.*`, no catastrophic backtracking. 
The JSON value itself must be a single-line string with no real newlines, but you can match across log lines via the regex `\n` escape inside that string or via the array form. + +~~~ ## Build Information Build: Build error leg or test failing: - @@ -331,21 +405,73 @@ Pull request: + -(open three backticks, then `json` on the same line) +```json { - "ErrorMessage": "", - "ErrorPattern": "", + "ErrorPattern": "", "BuildRetry": false, "ExcludeConsoleLog": false } -(close three backticks) ``` +~~~ + +#### Verify the body before submitting + +Build Analysis is strict: a malformed JSON block or an over-broad signature means the issue is silently skipped or matches every passing run. Walk these checks; fix and re-check on any failure. Canonical upstream reference (worth reading in full before filing your first KBE): [`dotnet/arcade-skills/.../kbe-issue-creation.md`](https://github.com/dotnet/arcade-skills/blob/main/plugins/dotnet-dnceng/skills/ci-analysis/references/kbe-issue-creation.md). + +1. **The body contains a fenced JSON block.** Without it Build Analysis has nothing to parse. Prose `**Error Message:**` / `**Stack Trace:**` sections don't count. +2. **Exactly one fenced JSON block.** Multiple skeletons yield zero matches. +3. **The opening fence is exactly three backticks followed by `json`**, lowercase, with nothing else on the line. Four backticks, missing lang tag, or trailing whitespace causes the parser to skip the issue. +4. **The closing fence is exactly three backticks**, same length as the open. +5. **Exactly one of `ErrorMessage` or `ErrorPattern` is present and non-empty.** Populating both is undefined behavior — Build Analysis may apply only one and you don't control which. Do not leave the unused field as `""` either; delete it. Empty signatures match nothing. +6. 
**The signature is not a bare identifier.** A fully-qualified test name, a stack-frame line, or a bare exception type all appear in `[PASS]` and `[SKIP]` lines for the same test, so the signature would match every passing run going forward. This applies to BOTH `ErrorMessage` and `ErrorPattern` — a regex like `TestMethodName` or `Some\\.Class\\.TestMethod` is just as broken as the literal. +7. **Negative-match before submitting.** If you have the failing log on disk (Helix work-item console, AzDO step log), run a smoke test against it — eyeballing the signature catches roughly nothing. Build Analysis's `ErrorMessage` matcher is `String.Contains` ordinal case-sensitive, which `grep -F` reproduces exactly: + + ```bash + grep -Fc "" failure.log # > 0 = matches the failure + grep -F "" failure.log | grep -E '^\[(PASS|SKIP)\]' # MUST be empty + ``` + + For `ErrorPattern`, use `grep -E` — different regex flavor than .NET's `NonBacktracking`, but close enough to flag over-broad patterns: + + ```bash + grep -Ec '' failure.log + grep -E '' failure.log | grep -E '^\[(PASS|SKIP)\]' # MUST be empty + ``` + + If the second command in either pair prints anything, the signature also matches `[PASS]` / `[SKIP]` lines for this test and will mute future passing runs. Narrow it. Also mentally check whether the signature would match (a) other tests in the same assembly, or (b) build-time output (Crossgen2, ilasm, MSBuild). The canonical validator is [`Test-KnownIssuePattern.ps1`](https://github.com/dotnet/arcade-skills/blob/main/plugins/dotnet-dnceng/skills/ci-analysis/scripts/Test-KnownIssuePattern.ps1) (uses the exact regex flavor, emits a validated JSON block); pwsh isn't in this workflow's tool allowlist today, so the `grep -F` / `grep -E` smoke test above is the in-band substitute. +8. 
**Single-line, no escapes.** Build Analysis runs `String.Contains` (case-sensitive, ordinal) for `ErrorMessage` and `Regex` with `Singleline | IgnoreCase | NonBacktracking` and a 50ms-per-line timeout for `ErrorPattern`. Newlines, ANSI escapes (`\u001b[`), and time-prefixes (`[12:34:56.789]`) are not stripped from log lines before matching. Use the array form (below) for multi-line; use `[^\\n]*` instead of `.*` in regexes. +9. **JSON escaping is correct.** Inside the JSON string value: `"` → `\"`, `\` → `\\`, real newlines → `\n`. For regex patterns this means **double escape**: a literal dot is `\\.` in JSON (the JSON parser consumes one backslash, leaving `\.` for the regex engine). A `\d` you actually want regex to see has to be written `\\d` in JSON. GitHub's issue Preview tab will flag invalid JSON — use it. + +##### Multi-line signatures (array form) + +Both `ErrorMessage` and `ErrorPattern` accept an **array of strings**: each element matches a separate log line, in order, and lines may appear between matched elements. Use this when no single line on its own is unique enough — e.g., the test name on one line and the assertion text two lines down. + +```json +{ + "ErrorMessage": [ + "System.Net.Http.Tests.HttpClientHandlerTest.GetAsync_UnknownHost_Throws", + "System.Net.Http.HttpRequestException : Name or service not known" + ] +} +``` + +Rules: each element matches one line (the elements are NOT concatenated and matched as a single multi-line string). All elements must match in order. Don't mix `ErrorMessage` and `ErrorPattern` in the same array. Don't pad the array with generic tokens like `exitcode: 139` or `Crash` — they add no specificity and risk false negatives if the log format changes. + +#### Signature examples — Bad → Good + +`ErrorMessage` is matched as an exact literal substring — `...` in the value is matched as three literal dots, not "anything". Use the array form (above) when you need to span variable text between two anchors. 
The "Good" column below shows the form to use; values shown as plain strings go in `ErrorMessage`, values prefixed `ErrorPattern:` go in `ErrorPattern`, and values shown as a JSON array go in `ErrorMessage` array form. -The pseudo-instructions `(open three backticks, then ...)` and `(close three backticks)` above are **placeholders** in this prompt because nesting fenced code blocks in the prompt itself is fragile; in the actual issue body emit literal ```` ``` `` (three backticks) on each side of the JSON object. Verify the open and close fences both consist of exactly three backticks before submitting. If you are uncertain, count them. +| ❌ Bad | Why bad | ✅ Good | +|---|---|---| +| `"Some.Test.Class.TestMethodName"` | bare test name; matches `[PASS]` lines for the same test | array: `["Some.Test.Class.TestMethodName", "System.Net.Sockets.SocketException : Try again"]` | +| `"SomeTests.Prefix_"` (trailing `_`) | truncated prefix; trailing `_`/`*`/`.` is literal not glob | `ErrorPattern: "^SomeTests\\.Prefix_[A-Za-z]+\\b[^\\n]*Xunit\\.Sdk\\."` | +| `"Some.Type.Method"` (bare type/method) | matches stack scans of unrelated tests | `ErrorPattern: "^System\\.NullReferenceException\\b[^\\n]*\\n\\s+at Some\\.Type\\.Method\\b"` | +| `"BadImageFormatException"` | bare exception type; matches infra hiccups too | `"System.BadImageFormatException: Could not load file or assembly 'System.Private.CoreLib'"` | +| `"Operation timed out"` | matches transient network failures everywhere | array: `["xharness exec android test", "Operation timed out after 3600s"]` paired with `BuildRetry: false` | -Choose `ErrorMessage` (substring) by default. Use `ErrorPattern` only when a regex is genuinely needed and confirm it has no catastrophic backtracking. Set `BuildRetry: true` **only** for confirmed infra/queue-side flakes (dead-letter, device-lost, agent disconnect) where retrying is safe. +Choose `ErrorMessage` (literal substring) by default. 
Use `ErrorPattern` only when no single literal line is specific enough — and confirm the regex is anchored and has no catastrophic backtracking. **Populate exactly one of the two fields per JSON block; never both.** Pattern length doesn't matter; specificity does — don't shorten a unique multi-line signature into a pithy one-liner. Set `BuildRetry: true` **only** for confirmed infra/queue-side flakes (dead-letter, device-lost, agent disconnect) where retrying is safe. ### Signature specificity (mandatory) @@ -357,6 +483,8 @@ The `ErrorMessage` / `ErrorPattern` MUST uniquely identify **this specific failu - A generic tool name + failure verb: `Crossgen2 failed`, `ilasm failed`, `dotnet build failed`, `xharness exited`. - A bare exception type with no message: `BadImageFormatException`, `NullReferenceException`, `Fatal error. Invalid Program`, `Assertion failed`. - A bare `[FAIL]` line with only the test class name and no exception/assertion text. +- A bare fully-qualified test name (e.g. `"ErrorMessage": "Namespace.Class.TestName"`) without the assertion/exception text that follows it on the next line of the log. The test name alone matches every future regression of that test, including unrelated ones, and Build Analysis will mute legitimate new failures. +- A truncated test-name prefix ending in an underscore, dot, or wildcard glyph (e.g. `"SomeClass.SomeMethod_"`, `"Foo.Bar."`, `"Connect_*"`). `ErrorMessage` is a literal `String.Contains` match, not a glob — a trailing `_` or `*` is treated as a literal character and either over-matches every test whose name contains the prefix or never matches at all. If you need to cover multiple related test methods, instead set `ErrorPattern` to a properly anchored regex (e.g. `"SomeClass\\.SomeMethod_[A-Za-z]+ "`), or pick the exception/assertion message that is common to all of them. - Common infra strings: `Connection reset`, `Operation timed out`, `Resource temporarily unavailable`, `No space left on device`. 
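The never-acceptable patterns above are mechanical enough to pre-check before filing. A minimal sketch — the helper name `lint_signature` and its heuristics are this prompt's illustration, not part of Build Analysis — that rejects the most common bad signatures before they reach a KBE body:

```shell
#!/usr/bin/env bash
# lint_signature: reject candidate signatures that Build Analysis would
# mis-match. Prints FAIL plus a reason and returns 1 on a bad signature;
# prints OK and returns 0 otherwise. Heuristics mirror the bullet list above.
lint_signature() {
  local sig="$1"

  # Empty signatures match nothing.
  if [ -z "$sig" ]; then echo "FAIL: empty signature"; return 1; fi

  # Bare fully-qualified test/type name: dotted identifiers with no
  # assertion or exception text after them.
  if printf '%s' "$sig" | grep -Eq '^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)+$'; then
    echo "FAIL: bare fully-qualified name"; return 1
  fi

  # Truncated prefix: a trailing _ . or * is a literal character to
  # String.Contains, not a glob.
  case "$sig" in
    *_ | *. | *'*') echo "FAIL: truncated prefix"; return 1 ;;
  esac

  # Common infra strings standing alone match transient failures everywhere.
  local infra
  for infra in "Connection reset" "Operation timed out" \
               "Resource temporarily unavailable" "No space left on device"; do
    if [ "$sig" = "$infra" ]; then echo "FAIL: bare infra string"; return 1; fi
  done

  echo "OK"
}
```

Anything this passes still needs the `grep -F` negative-match against the real failure log — no static check can prove a signature is unique to the failing run.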
**Prefer** signatures built from the most specific stable token in the log. In order of preference: @@ -393,16 +521,10 @@ These look like permission errors but are physical: ## Coverage discipline (avoid arbitrary selection) -Failure selection must be **systematic, not opportunistic**. Process pipelines in the order listed in the "Pipelines to scan" table. For each pipeline: - -1. List every failed signature in the latest scanned build (sorted by occurrence count in the window, descending). -2. For each signature, decide and record one of: `→ filed-issue #aw_`, `→ filed-PR #aw_`, `→ existing-issue #`, `→ existing-PR #`, `→ skipped: `. A skipped signature MUST have a reason (e.g., "build canceled, not a test failure", "less than 2 occurrences and not blocking", "owned by area-Infrastructure rota and already triaged"). -3. Keep a per-pipeline tally on disk under `/tmp/gh-aw/agent/coverage/.txt`. At the end, print a summary table to the agent log: `pipeline | total-signatures | issues-filed | prs-filed | reused-existing | skipped-with-reason`. - -Caps still apply (10 PRs / 5 issues / run); when the cap is hit, fall back to "skipped: cap reached" rather than dropping signatures silently. Subsequent runs will pick them up. - -Do not jump between pipelines mid-investigation. Finish all classifications for pipeline N before moving to pipeline N+1. +Process every failed signature in every pipeline — do not cherry-pick the obvious ones and skip the rest. Walk pipelines in the order listed in the "Pipelines to scan" table; finish all classifications for pipeline N before moving to pipeline N+1. -## Submit +For each pipeline: -Search existing issues and PRs (`search_issues`, `search_pull_requests`) before creating anything new — never duplicate. Cross-check against issues filed by the existing JIT failure-tracking bot (e.g. open issues authored by `JulieLeeMSFT` for JIT pipelines) and reference rather than re-file them. 
When using `search_pull_requests`, filter to `is:merged OR review:approved` so the integrity filter does not silently drop low-trust results. If an issue already tracks the failure, **prefer opening a PR that references it via `[ActiveIssue("https://github.com/dotnet/runtime/issues/")]`** rather than filing another issue. If `search_issues` returns no matches, proceed to file the issue. +1. List every failed signature in the latest scanned build, sorted by occurrence count in the window (descending). +2. For each signature, run the six-step walk in "Two-pass KBE → PR flow" and record the outcome (`→ filed-issue #aw_`, `→ filed-PR #aw_`, `→ existing-issue #`, `→ existing-PR #`, or `→ skipped: `). A skipped signature MUST have a reason (e.g., "build canceled, not a test failure", "less than 2 occurrences and not blocking", "owned by area-Infrastructure rota and already triaged"). +3. Keep a per-pipeline tally on disk under `/tmp/gh-aw/agent/coverage/.txt`. At the end of the run, print a summary table to the agent log: `pipeline | total-signatures | issues-filed | prs-filed | reused-existing | skipped-with-reason`.
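The per-pipeline tally and end-of-run summary table above can be produced mechanically. A sketch under stated assumptions — it assumes each `/tmp/gh-aw/agent/coverage/.txt` file holds one recorded outcome marker per line, and the helper name `summarize_coverage` is illustrative, not part of gh-aw:

```shell
# summarize_coverage: roll per-pipeline tally files into the end-of-run
# summary table. Assumes one recorded outcome per line, e.g.
#   -> filed-issue #aw_...   -> existing-PR #...   -> skipped: cap reached
summarize_coverage() {
  local dir="${1:-/tmp/gh-aw/agent/coverage}"
  echo "pipeline | total-signatures | issues-filed | prs-filed | reused-existing | skipped-with-reason"
  local f name total issues prs reused skipped
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue                  # no tally files written yet
    name=$(basename "$f" .txt)
    total=$(grep -c '' "$f")                 # one line per classified signature
    issues=$(grep -c 'filed-issue' "$f" || true)
    prs=$(grep -c 'filed-PR' "$f" || true)
    reused=$(grep -c -e 'existing-issue' -e 'existing-PR' -e 'already-covered' "$f" || true)
    skipped=$(grep -c 'skipped:' "$f" || true)
    echo "$name | $total | $issues | $prs | $reused | $skipped"
  done
}
```

The `|| true` guards matter: `grep -c` still prints `0` on zero matches but exits non-zero, which would abort a `set -e` agent script.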