feat(harness): add structured failure taxonomy (#1024)#1069
Conversation
Introduce a deterministic failure taxonomy as a narrow foundation for run harness metadata without changing tool behavior.\n\nConstraint: Open PRs already cover ledgers, lifecycle bus, and progress notifications, so this PR stays limited to classifier primitives and tests.\nRejected: Implementing automatic recovery in this PR | that would widen scope beyond issue #1024 and risk conflicting with future run-harness work.\nConfidence: high\nScope-risk: narrow\nDirective: Keep future recovery policy opt-in and consume these categories as metadata, not as implicit behavior changes.\nTested: npm test -- --runInBand tests/failure/classifier.test.ts; npm run build:src; npx eslint src/failure --ext .ts\nNot-tested: Live browser metadata attachment; deferred to follow-up run/evidence harness issues.
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8f710d86c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'CONNECTION_LOST', | ||
| confidence: 0.95, | ||
| reason: 'CDP/browser transport connection was lost', | ||
| test: ({ text }) => /\b(not connected to chrome|call connect\(\) first|websocket.*closed|websocket is not open|browser has disconnected|browser disconnected|cdpsession connection closed|connection closed|session closed|protocol error|execution context was destroyed|cannot find context with specified id|inspected target navigated or closed|puppeteer\.connect\(\) timed out|session initialization timed out)\b/i.test(text), |
There was a problem hiding this comment.
Narrow connection-loss regex to avoid stale-ref misclassification
The CONNECTION_LOST rule currently matches any message containing protocol error, which causes common stale-reference failures like Protocol error (DOM.resolveNode): No node with given id found to be tagged as connection loss too. Because STALE_REF and CONNECTION_LOST both use confidence 0.95, the tie-break in classifyFailure makes primaryFailureCategory pick CONNECTION_LOST, so callers get the wrong primary category and may apply the wrong remediation path for a routine detached-node case.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e027e22100
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'CONNECTION_LOST', | ||
| confidence: 0.95, | ||
| reason: 'CDP/browser transport connection was lost', | ||
| test: ({ text }) => /\b(not connected to chrome|call connect\(\) first|websocket.*closed|websocket is not open|browser has disconnected|browser disconnected|cdpsession connection closed|connection closed|session closed|protocol error.*(?:connection|closed|disconnected|target closed)|execution context was destroyed|cannot find context with specified id|inspected target navigated or closed|puppeteer\.connect\(\) timed out|session initialization timed out)\b/i.test(text), |
There was a problem hiding this comment.
Remove navigation context errors from connection-loss rule
The CONNECTION_LOST matcher currently treats execution context was destroyed and cannot find context with specified id as transport disconnects, but these errors commonly occur during normal page navigations while the browser session is still alive. Because this rule has top confidence, primaryFailureCategory will mislabel navigation races as connection loss, which can drive the wrong recovery path (reconnect/restart instead of navigation synchronization).
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| export function primaryFailureCategory(input: ClassifyFailureInput = {}): FailureClassification { | ||
| return classifyFailure(input)[0] ?? { category: 'UNKNOWN', confidence: 0.5, reason: 'No failure classifier rule matched' }; |
There was a problem hiding this comment.
Honor fallbackToUnknown in primary category helper
primaryFailureCategory always returns UNKNOWN when no rule matches, even if fallbackToUnknown is explicitly set to false. This breaks the contract implied by ClassifyFailureInput and makes it impossible for callers using the primary helper to distinguish “no match” from an explicit unknown classification.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 71aa172cd9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| export function primaryFailureCategory(input: ClassifyFailureInput = {}): FailureClassification { | ||
| return classifyFailure(input)[0] ?? { category: 'UNKNOWN', confidence: 0.5, reason: 'No failure classifier rule matched' }; |
There was a problem hiding this comment.
Honor fallbackToUnknown in primary category helper
primaryFailureCategory accepts ClassifyFailureInput, but when fallbackToUnknown is false and no rules match, this helper still synthesizes UNKNOWN via the ?? fallback. That makes the documented opt-out ineffective for callers that use the primary helper (for example, pipelines that need to distinguish “no match” from an explicit unknown), which can skew downstream failure metrics and control flow.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e37641d25f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'ELEMENT_NOT_FOUND', | ||
| confidence: 0.88, | ||
| reason: 'Requested selector/ref/semantic element could not be found', | ||
| test: ({ text }) => /\b(element not found|no elements? found|no matching element|selector.+not found|selector.+failed|queryselectorall.*(?:0|zero)|could not find|no good match found|no clickable elements found)\b/i.test(text), |
There was a problem hiding this comment.
Restrict ELEMENT_NOT_FOUND matcher to DOM-specific phrases
The ELEMENT_NOT_FOUND rule currently matches the generic phrase could not find, which causes unrelated runtime failures to be misclassified as missing DOM elements (for example Puppeteer launch errors like “Could not find expected browser (chrome) locally”). In those cases primaryFailureCategory will return ELEMENT_NOT_FOUND (0.88) instead of a more accurate bucket (or UNKNOWN), which can mislead downstream triage/recovery logic that relies on these categories.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e37641d25f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'CONNECTION_LOST', | ||
| confidence: 0.95, | ||
| reason: 'CDP/browser transport connection was lost', | ||
| test: ({ text }) => /\b(not connected to chrome|call connect\(\) first|websocket.*closed|websocket is not open|browser has disconnected|browser disconnected|cdpsession connection closed|connection closed|session closed|protocol error.*(?:connection|closed|disconnected|target closed)|inspected target navigated or closed|puppeteer\.connect\(\) timed out|session initialization timed out)\b/i.test(text), |
There was a problem hiding this comment.
Remove navigation-detach text from connection-loss rule
The CONNECTION_LOST regex currently matches inspected target navigated or closed, which is emitted for normal target navigation detach events even when the browser transport is still healthy. In that scenario, primaryFailureCategory will report CONNECTION_LOST (0.95) because TAB_UNHEALTHY does not match this text, which can drive reconnect/restart handling instead of navigation synchronization.
Useful? React with 👍 / 👎.
Codex review found navigation-detach and generic launch errors could be classified as connection loss or missing elements, which would steer harness recovery toward the wrong path. Constraint: Keep the taxonomy metadata-only and deterministic without adding recovery execution or new dependencies. Rejected: Broad catch-all regexes for transport and element failures | they hide normal navigation churn and non-DOM runtime failures. Confidence: high Scope-risk: narrow Directive: Prefer narrow category phrases and explicit regression examples before adding new classifier matches. Tested: npm test -- --runInBand tests/failure/classifier.test.ts; npm run build:src; npx eslint src/failure --ext .ts; npm run lint:tier; git diff --check Not-tested: Full CI matrix locally. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1969efed6b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| reason: 'Page or failure indicates missing/expired authentication or credentials', | ||
| test: (input) => { | ||
| const combined = `${input.text} ${input.currentUrl} ${input.pageTitle}`; | ||
| return AUTH_CONTEXT.test(combined) || /\b(401|unauthorized|forbidden|please sign in|session expired)\b/i.test(combined); |
There was a problem hiding this comment.
Narrow AUTH_REQUIRED matching for generic "forbidden" errors
The AUTH_REQUIRED rule currently matches any message containing forbidden, which causes plain 403 Forbidden failures to be labeled as authentication issues even when they come from WAF or policy blocks. In practice, many bot/firewall responses are just 403 Forbidden without explicit WAF keywords, so this rule will systematically misclassify them and distort downstream failure analytics/recovery decisions. Consider requiring explicit auth context for forbidden (or treating bare 403/forbidden as CAPTCHA_OR_WAF/UNKNOWN).
Useful? React with 👍 / 👎.
Codex review found that bare forbidden responses were being classified as authentication failures even when they may be WAF or policy blocks, and CI also exposed an over-budget interact tool description. Constraint: Keep the failure taxonomy deterministic and metadata-only while preserving explicit login/session auth matches. Rejected: Treating every forbidden response as CAPTCHA_OR_WAF | without WAF context, UNKNOWN is safer than inventing a specific recovery route. Confidence: high Scope-risk: narrow Directive: Add explicit context before mapping generic HTTP denial text into recovery categories. Tested: npm test -- --runInBand tests/tool-descriptions.test.ts tests/failure/classifier.test.ts; npm run build:src; npx eslint src/failure src/tools/interact.ts --ext .ts; npm run lint:tier; git diff --check Not-tested: Full CI matrix locally. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review Latest commits address the ELEMENT_NOT_FOUND, navigation-detach, and forbidden/auth false-positive feedback, plus the tool-description CI contract. Please review the current head. |
PR #1069 was conflict-marked against develop, so the branch needed the current base contracts before it could be evaluated for merge. Constraint: Preserve the structured failure taxonomy behavior and take develop's already-shortened interact wording to avoid reopening the CI description-budget failure. Rejected: Rebasing the public branch | a merge commit preserves existing review context and avoids rewriting pushed PR history. Confidence: medium Scope-risk: moderate Directive: Treat this commit as base synchronization; future behavior changes should stay in failure taxonomy files. Tested: git diff --check Not-tested: Full validation pending after merge. Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b58bb8b220
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'ELEMENT_NOT_FOUND', | ||
| confidence: 0.88, | ||
| reason: 'Requested selector/ref/semantic element could not be found', | ||
| test: ({ text }) => /\b(element not found|no elements? found|no matching element|selector.+not found|selector.+failed|queryselectorall.*(?:0|zero)|could not find (?:an? )?(?:element|selector|ref|button|link|input|field|node)|no good match found|no clickable elements found)\b/i.test(text), |
There was a problem hiding this comment.
Require zero-only match for querySelectorAll not-found errors
Tighten the ELEMENT_NOT_FOUND regex so queryselectorall only matches explicit zero results. As written, queryselectorall.*(?:0|zero) also matches successful counts like 10/20, which will misclassify valid element discovery as ELEMENT_NOT_FOUND and can trigger the wrong recovery path when routing on the primary category.
Useful? React with 👍 / 👎.
| export function primaryFailureCategory(input?: ClassifyFailureInput): FailureClassification; | ||
| export function primaryFailureCategory(input: ClassifyFailureInput = {}): FailureClassification | undefined { |
There was a problem hiding this comment.
Return optional type when unknown fallback is disabled
The overloads advertise that primaryFailureCategory returns a FailureClassification for general ClassifyFailureInput, but the implementation returns undefined when fallbackToUnknown is false and no rule matches. This is a type-safety hole for callers passing a value typed as ClassifyFailureInput (not literal false), because TypeScript will permit direct .category access that can fail at runtime.
Useful? React with 👍 / 👎.
CI showed the Cursor progressive-disclosure contract still expected the previous Tier 1 tool count after develop added additional Tier 1 tools. Constraint: Keep the verification focused on the current Tier 1 surface while preserving explicit tool-name assertions. Rejected: Removing the count guard | it still catches accidental tier leakage when paired with the explicit expand_tools and registration checks. Confidence: high Scope-risk: narrow Directive: Update this count whenever intentional Tier 1 tools are added or demoted. Tested: npx jest --config jest.ci.config.js --runInBand tests/cross-env/cursor-verification.test.ts tests/failure/classifier.test.ts tests/tool-descriptions.test.ts; npm run build; npm run lint; npm run lint:tier; git diff --check Not-tested: Full CI matrix after this commit. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05cf2392f1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'CONNECTION_LOST', | ||
| confidence: 0.95, | ||
| reason: 'CDP/browser transport connection was lost', | ||
| test: ({ text }) => /\b(not connected to chrome|call connect\(\) first|websocket.*closed|websocket is not open|browser has disconnected|browser disconnected|cdpsession connection closed|connection closed|session closed|protocol error.*(?:connection|disconnected)|puppeteer\.connect\(\) timed out|session initialization timed out)\b/i.test(text), |
There was a problem hiding this comment.
Remove generic
session closed from connection-loss matcher
The CONNECTION_LOST regex treats any session closed text as a transport disconnect, but Puppeteer frequently emits messages like Protocol error (...): Session closed. Most likely the page has been closed when only the target tab dies. In that case this rule wins with higher confidence than TAB_UNHEALTHY, so primaryFailureCategory reports a browser connection outage and can trigger reconnect/restart logic instead of tab-level recovery.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05cf2392f1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| export function primaryFailureCategory(input: ClassifyFailureInput & { fallbackToUnknown: false }): FailureClassification | undefined; | ||
| export function primaryFailureCategory(input?: ClassifyFailureInput): FailureClassification; |
There was a problem hiding this comment.
Return undefined in overload when fallback may be false
Update the general overload to allow undefined because this function can return undefined whenever fallbackToUnknown is false at runtime, even if that value is not a literal false in the call site. As written, callers passing a computed ClassifyFailureInput get the FailureClassification return type and may dereference .category unsafely, causing runtime crashes when no rule matches.
Useful? React with 👍 / 👎.
Windows CI can check out JSON fixtures with CRLF line endings while JSON.stringify emits LF, causing the console-capture regression guard to fail without a response-shape change. Constraint: Keep the v1.11.0 fixture content unchanged and preserve the regression comparison across Linux, macOS, and Windows. Rejected: Regenerating the fixture | that would hide a checkout-format issue and make the expected shape harder to audit. Confidence: high Scope-risk: narrow Directive: Normalize platform line endings at fixture-read boundaries before byte-style JSON comparisons. Tested: npx jest --config jest.ci.config.js --runInBand tests/tools/console-capture-regression.test.ts tests/cross-env/cursor-verification.test.ts tests/failure/classifier.test.ts tests/tool-descriptions.test.ts; npm run build; npm run lint; npm run lint:tier; git diff --check Not-tested: Full GitHub Actions matrix after this commit. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87a5abd96a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| category: 'CONNECTION_LOST', | ||
| confidence: 0.95, | ||
| reason: 'CDP/browser transport connection was lost', | ||
| test: ({ text }) => /\b(not connected to chrome|call connect\(\) first|websocket.*closed|websocket is not open|browser has disconnected|browser disconnected|cdpsession connection closed|connection closed|session closed|protocol error.*(?:connection|disconnected)|puppeteer\.connect\(\) timed out|session initialization timed out)\b/i.test(text), |
There was a problem hiding this comment.
Exclude tab-closure errors from CONNECTION_LOST
The CONNECTION_LOST rule currently matches any error containing session closed, which captures common Puppeteer tab-close failures like Protocol error (...): Session closed. Most likely the page has been closed. even when the browser transport is still healthy. Because TAB_UNHEALTHY does not match that phrasing and has lower confidence anyway, primaryFailureCategory will mislabel tab-level failures as connection loss, which will drive the wrong remediation path (reconnect browser instead of recovering/replacing the tab).
Useful? React with 👍 / 👎.
Merge latest develop into PR #1069 and keep the shared cursor/console fixture expectations aligned after the server-to-client primitive landed. Constraint: PR #962 updated the same cross-environment and console fixture tests, leaving this PR blocked by fixture-only conflicts.\nRejected: Rewriting feature code during conflict refresh | the merge conflict was limited to shared CI fixtures.\nConfidence: high\nScope-risk: narrow\nDirective: Keep conflict refreshes limited to shared fixture expectations unless feature code conflicts require design review.\nTested: npx jest tests/tools/console-capture-regression.test.ts tests/cross-env/cursor-verification.test.ts --runInBand --forceExit (cross-env suite may skip locally by platform guard); git diff --check for touched fixture files.\nNot-tested: Full GitHub Actions matrix after this refresh.\nCo-authored-by: OmX <omx@oh-my-codex.dev>
Summary
src/failuretaxonomy/classifier foundation for issue Run harness: introduce structured failure taxonomy and classifier #1024.Direction / duplicate check
src/failure.Tests
npm test -- --runInBand tests/failure/classifier.test.tsnpm run build:srcnpx eslint src/failure --ext .tsLive OpenChrome verification
This PR is intentionally a classifier foundation and does not yet attach categories to live tool responses. Merge-after validation for #1024 should be completed by follow-up harness wiring PRs by triggering:
STALE_REFELEMENT_NOT_FOUNDNAVIGATION_TIMEOUTTAB_UNHEALTHYorCONNECTION_LOSTThe classifier unit tests encode those cases now, so future Run/evidence PRs can assert live metadata attachment without redefining categories.
Closes #1024.