feat(recovery): score trajectory outcomes deterministically (#1019) by shaun0927 · Pull Request #1078 · shaun0927/openchrome

shaun0927 · 2026-05-12T15:08:07Z

Progress / Review status

Auto-refreshed 2026-05-13 — owner comments cleaned up to reduce review noise.

Field	Value
Branch	`feat/1019-recovery-reward-scorer` → `feat/1017-recovery-trajectory-ledger`
Draft	no
CI	—
Mergeable	❌ CONFLICTING
Review decision	—
Codex (latest)	💡 suggestions posted
Other reviewers (latest)	chatgpt-codex-connector: commented
Head	`57cee81` — Merge latest recovery base again
Commits	25

_{Owner comment cleanup: 7 issue + 0 inline review comments deleted. Outstanding feedback from automated/external reviewers above is unchanged.}

Summary\n- Adds a pure deterministic RecoveryRewardScorer for progress/failure/blocking/no-progress outcomes.\n- Records reward scores on passive recovery trajectory nodes from #1017.\n- Keeps scoring telemetry-only: no LLM calls and no automatic recovery execution.\n\nCloses #1019.\n\nStacked on #1073 / #1017 because it consumes the recovery trajectory boundary.\n\n## Validation\n- npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts\n- npm run build\n\n## Live OpenChrome verification\n- Not run in this PR session; deterministic scorer unit coverage is included and #1073 adds the fixture route for follow-up live verification.\n\n## Safety\n- No LLM value calls.\n- No tool behavior changes beyond passive reward telemetry.\n- Destructive ungated outcomes receive hard negative classification.

Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.

Add a pure recovery reward scorer so trajectory nodes can carry bounded progress/failure scores without introducing LLM value calls or execution changes.\n\nConstraint: Recovery scoring must be deterministic and safe for harness telemetry hot paths.\nRejected: LLM-based value scoring | too expensive and too risky for production browser automation.\nConfidence: high\nScope-risk: narrow\nDirective: Treat scores as advisory telemetry until a later safety-gated ranking/search PR consumes them.\nTested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full live OpenChrome scoring transcript.

qodo-code-review · 2026-05-12T15:09:08Z

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

gemini-code-assist · 2026-05-12T15:09:09Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cc42051d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-12T15:12:19Z


+      const score = scoreFromToolResult({
+        toolName,
+        isError: resultStatus === 'error' || resultStatus === 'aborted',


Include tool-level isError when computing reward classification

The scorer currently sets isError from resultStatus === 'error' || 'aborted', which misses the common path where a tool returns isError: true but is recorded as resultStatus: 'no_progress' (see success path call site). In those cases, explicit tool failures get scored as neutral/no-progress unless the text happens to match a hardcoded signal, which will skew telemetry and any downstream ranking based on reward. Pass through result?.isError (or equivalent) so deterministic scoring consistently treats tool-declared errors as failures.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-12T15:12:19Z

+        resultText: summarizeResult(result),
+        errorText: error,
+        repeatedFailureCount: previous?.result === 'error' ? 1 : 0,
+        repeatedNoProgressCount: resultStatus === 'no_progress' ? 1 : 0,


Avoid marking first no-progress attempt as repeated

repeatedNoProgressCount is set to 1 for every no_progress result, even when there is no prior no-progress attempt. In scoreRecoveryOutcome, any positive repeated count triggers an extra penalty, so first-time no-progress outcomes are incorrectly penalized as repeats. This biases reward downward and makes repeated/no-repeat states indistinguishable; compute this from actual prior consecutive no-progress events instead of hardcoding 1.

Useful? React with 👍 / 👎.

Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.

Constraint: Codex P1 found passive recovery scoring ignored MCP results that returned isError=true on the nominal success path.\nRejected: Treating every no_progress result as a repeat | that penalizes first observations and hides actual repeated stalls.\nConfidence: high\nScope-risk: narrow\nDirective: Keep recovery reward telemetry passive and evidence-derived; do not let scoring alter tool behavior.\nTested: npm ci; npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand; npm run build; git diff --check\nNot-tested: Full CI matrix locally.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>

Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix

chatgpt-codex-connector · 2026-05-13T00:47:06Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8329d8118

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T00:49:32Z

+        resultStatus === 'success' &&
+        previous !== undefined &&
+        previous.result === 'error' &&
+        previous.toolName !== toolName;


Restrict recovered tagging to same tab trajectory

recordRecoveryTrajectory currently marks any successful call as recovered when the immediately prior completed call in the same session was an error from a different tool, without checking tab or trajectory context. In interleaved multi-tab sessions, an unrelated success (for example on tab B) right after an error on tab A will be mislabeled as a recovery, which pollutes recovery telemetry and can skew any reward-based ranking built on this data.

Useful? React with 👍 / 👎.

Resolve PR #1078's stacked-base conflict by taking the updated trajectory ledger from feat/1017 and reapplying the reward scorer wiring, including tool-level isError handling and prior no-progress counting. Constraint: GitHub marked PR #1078 as CONFLICTING against feat/1017-recovery-trajectory-ledger after review fixes were pushed. Rejected: Force-rebasing the stacked PR | preserving merge context avoids invalidating prior review history more than necessary. Confidence: high Scope-risk: moderate Directive: Keep recovery telemetry best-effort; scorer changes must not alter tool execution behavior. Tested: npm ci Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.

Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix

The schema-lint pipeline reads the introspection JSON through a pipe, so exiting immediately after one large stdout write can truncate manifests near the pipe buffer boundary. Chunking the output and waiting for drain preserves the no-Chrome introspection contract while keeping normal serve startup untouched. Constraint: CI invokes node dist/index.js serve --introspect-tools-list | node scripts/lint-tool-schemas.mjs -. Rejected: Increasing schema lint tolerance | it would hide a transport-level truncation bug instead of preserving valid JSON. Confidence: high Scope-risk: narrow Directive: Keep introspection free of Chrome/CDP startup and preserve JSON-only stdout. Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas. Not-tested: Full GitHub Actions matrix before push.

The latest develop branch adds crawl_start/status/cancel to the introspection manifest, and required fields must use the documented REQUIRED prefix for schema lint to pass. Updating the descriptions preserves tool behavior while keeping PR merge refs green. Constraint: PR checks run against the pull/merge ref, so every branch must absorb the latest develop schema-lint contract. Rejected: Updating the baseline | these are new tools and the required-field descriptions can satisfy the contract directly. Confidence: high Scope-risk: narrow Directive: New required input fields should start descriptions with REQUIRED instead of growing the lint baseline. Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas. Not-tested: Full GitHub Actions matrix before push.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56b72b76c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:08:33Z

+      const score = scoreFromToolResult({
+        toolName,
+        isError: resultStatus === 'error' || resultStatus === 'aborted' || result?.isError === true,
+        resultText: summarizeResult(result),
+        errorText: error,


Feed progress evidence into reward scoring

This call only passes toolName, error flags, and text, but scoreRecoveryOutcome marks progress only when evidence fields like urlChanged, domChanged, networkChanged, dataItemsExtracted, or freshRefsDiscovered are present. As wired here, successful non-observation tools (e.g. click, navigate) cannot reach the progress branch and are typically scored as no_progress with a negative reward, which will systematically bias recovery telemetry and downstream ranking/training data.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T01:08:33Z

+      const priorNoProgressCount = resultStatus === 'no_progress'
+        ? this.countConsecutiveNoProgress(this.recoveryLedger.readRecent(8, sessionId), toolName, tabId)


Avoid synchronous ledger reads on no-progress path

Computing priorNoProgressCount calls readRecent(...), which performs a synchronous fs.readFileSync and JSON parse of the ledger file on the request path. Under repeated tool failures/no-progress loops, this adds blocking disk I/O to each completion and can increase tool latency or reduce throughput; this should use in-memory state (or async I/O) instead of sync file reads in recordRecoveryTrajectory.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T01:09:49Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Update PR #1078 onto the latest feat/1017 base while preserving deterministic reward scorer wiring, tool-level isError handling, and prior no-progress counting. Constraint: feat/1017-recovery-trajectory-ledger advanced after the previous conflict-resolution push and GitHub marked PR #1078 conflicting again. Rejected: Accepting the base version of recovery wiring alone | it would drop PR #1019 scoring semantics and reintroduce the reviewed issue. Confidence: high Scope-risk: moderate Directive: When merging stacked recovery branches, verify scoreFromToolResult remains wired in MCPServer. Tested: npm ci Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector · 2026-05-13T01:15:08Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3959048ff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:19:01Z

+    for (let i = nodes.length - 1; i >= 0; i--) {
+      const node = nodes[i];
+      if (node.toolName !== toolName) break;
+      if (tabId !== undefined && node.tabId !== tabId) break;


Scope no-progress streak to matching tab context

countConsecutiveNoProgress only enforces tabId equality when the current call has a tabId, so calls where tabId is omitted (for example navigate creating a new tab) will count prior no_progress nodes for the same tool from any tab. This can incorrectly apply repeated no-progress penalties to unrelated trajectories and skew reward telemetry; the streak logic should keep the same context keying used by getLastNode(sessionId, tabId) instead of mixing tab-scoped and unscoped attempts.

Useful? React with 👍 / 👎.

Feed deterministic progress signals into recovery reward scoring and avoid the synchronous ledger read for repeated no-progress by using the in-memory previous trajectory node. Constraint: Latest Codex review on PR #1078 flagged missing progress evidence as P1 and synchronous readRecent use as P2. Rejected: Scoring successful actions from text alone | navigation/action/extraction tools expose enough deterministic context to classify progress without LLM calls. Confidence: high Scope-risk: narrow Directive: Recovery reward scoring must stay deterministic and must not add blocking I/O to tool completion. Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector · 2026-05-13T01:26:33Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6ad22cfc9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:30:37Z

+    if (toolName === 'extract_data') {
+      const content = result?.content;
+      return { dataItemsExtracted: Array.isArray(content) && content.length > 0 ? content.length : 1, observationOnly: false };


Derive extract_data item count from payload, not content blocks

extract_data success responses are wrapped as a single text block (content: [{ type: 'text', text: JSON.stringify(...) }]) in both single and multiple modes, so using content.length here almost always yields 1 regardless of how many records were actually extracted. That means dataItemsExtracted is effectively hardcoded to 1, which in scoreRecoveryOutcome stays below the progress threshold and frequently classifies successful extractions as no_progress with a negative reward. Parse the JSON payload fields like count/fieldsFound instead of counting content blocks.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T01:30:37Z

+    if (toolName === 'request_intercept' || toolName === 'network_capture') {
+      return { networkChanged: true, observationOnly: false };


Match network capture tool names when tagging network progress

This condition checks for network_capture, but the registered tools are network_capture_lite and network_capture_full (see src/tools/network-capture-*.ts). As a result, successful network capture calls never set networkChanged evidence and are systematically under-scored as no-progress/observation. Include the actual tool names in this branch so recovery rewards reflect those outcomes.

Useful? React with 👍 / 👎.

Derive extract_data progress from JSON payload fields instead of content-block count, and tag the registered network_capture_lite/full tool names as network progress evidence. Constraint: Latest Codex review on PR #1078 found extract_data rewards were effectively hardcoded and network capture tool names did not match registered tools. Rejected: Counting MCP content blocks as extracted items | extract_data wraps records inside one JSON text payload. Confidence: high Scope-risk: narrow Directive: Recovery scoring evidence must reflect tool payload contracts, not MCP envelope shape. Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.

Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix

The schema-lint pipeline reads the introspection JSON through a pipe, so exiting immediately after one large stdout write can truncate manifests near the pipe buffer boundary. Chunking the output and waiting for drain preserves the no-Chrome introspection contract while keeping normal serve startup untouched. Constraint: CI invokes node dist/index.js serve --introspect-tools-list | node scripts/lint-tool-schemas.mjs -. Rejected: Increasing schema lint tolerance | it would hide a transport-level truncation bug instead of preserving valid JSON. Confidence: high Scope-risk: narrow Directive: Keep introspection free of Chrome/CDP startup and preserve JSON-only stdout. Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas. Not-tested: Full GitHub Actions matrix before push.

Recovery telemetry is persisted to disk, so argument summarization must recognize common camelCase credential names such as accessToken, sessionId, and authHeader instead of relying only on separator-delimited keys. Constraint: Codex review flagged this as P1 secret exposure in trajectory ledger persistence. Rejected: Hashing all short strings | it would remove useful non-sensitive telemetry and reduce ledger diagnostic value. Confidence: high Scope-risk: narrow Directive: Keep recovery ledger summaries useful, but prefer false-positive redaction over credential leakage. Tested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts. Not-tested: Full GitHub Actions matrix before push.

Recovery trajectory summaries are persisted before the response envelope leaves the server, so they must receive the same secret-redacted MCP result that the client receives. Constraint: Codex review flagged raw result summaries in recovery ledger persistence as P1 secret exposure. Rejected: Depending on summarizeResult pattern redaction | literal secret substitutions require the process secret store redactor. Confidence: high Scope-risk: narrow Directive: Persisted telemetry should consume redacted result/error text, never pre-redaction handler output. Tested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts. Not-tested: Full GitHub Actions matrix before push.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68910253ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T02:15:12Z

+      const priorNoProgressCount =
+        resultStatus === 'no_progress' &&
+        previousTrajectory?.toolName === toolName &&
+        previousTrajectory?.resultStatus === 'no_progress'
+          ? 1
+          : 0;


Track no-progress streaks for successful observation retries

The repeatedNoProgressCount signal is effectively never applied to the scoring paths that use it. Here it is only set when the current resultStatus is 'no_progress', but in scoreRecoveryOutcome repeated no-progress penalties are evaluated in the non-error branch, while 'no_progress' attempts from MCP are treated as errors (isError) and return earlier as 'failure'. This means repeated successful-but-stalling observation calls (the case this penalty is meant to capture) are always scored as first-time attempts, so stalling trajectories are systematically under-penalized.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T02:15:12Z

+        isError: resultStatus === 'error' || resultStatus === 'aborted' || result?.isError === true,
+        resultText: summarizeResult(result),
+        errorText: error,
+        repeatedFailureCount: previousTrajectory?.resultStatus === 'error' ? 1 : 0,


Count repeated tool-result failures in repeat penalty input

The repeat-failure input only increments when the previous node status is 'error', but many tool failures are recorded as 'no_progress' when result.isError is true. Because those attempts are still scored as failure (isError: true), repeated tool-result failures usually never receive the intended repeat penalty, flattening negative reward signals for persistent failures and reducing ranking quality for recovery trajectories.

Useful? React with 👍 / 👎.

Resolve the latest feat/1017 base drift while preserving reward scoring evidence and the newer secret-redaction improvements in recovery telemetry. Constraint: GitHub marked PR #1078 dirty after the stacked recovery base advanced again. Rejected: Taking the base recovery wiring alone | it would drop scoring evidence needed by PR #1019. Confidence: high Scope-risk: moderate Directive: Merging stacked recovery branches must preserve both secret redaction and deterministic reward evidence. Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector · 2026-05-13T02:17:59Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

…ward-scorer

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fac3b0a9a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T10:35:13Z

  if (typeof value === 'string') {
-    const redacted = redactSecretString(redactSensitiveText(value));
-    if (LARGE_VALUE_KEY_RE.test(key) || redacted.length > 200) return hashValue(redacted);
-    return redacted;
+    if (LARGE_VALUE_KEY_RE.test(key) || value.length > 200) return hashValue(value);
+    return value;


Restore secret redaction for summarized arg strings

sanitizeObject now returns short string values verbatim, so secrets embedded in non-sensitive keys (for example free-form note, url, or query text) are written unredacted into recovery telemetry. This is a security regression because summarizeArgs feeds these values directly into persisted ledger nodes, and the removed redaction path no longer masks inline tokens/passwords unless the key name itself matches the sensitive-key regex.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T10:35:13Z

      const nodes = content.trim().length === 0
        ? []
-        : parseRecoveryNodes(content.trim().split('\n'));
+        : content.trim().split('\n').map((line) => JSON.parse(line) as RecoveryTrajectoryNode);


Parse recovery JSONL defensively per line

This parses every persisted line with a bare JSON.parse inside map, so a single torn/truncated JSONL record causes the whole read to throw and fall back to pending in-memory nodes only. In practice that drops valid persisted history from readRecent after any partial append or file corruption, which can misclassify trajectories and erase prior context until new writes rebuild memory state.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T10:35:13Z

 /** Lifecycle tools that must work even when the CDP connection is broken (e.g., after
 *  sleep/wake). Skip session initialization so recovery handlers can always run. */
-const SKIP_SESSION_INIT_TOOLS = new Set(['oc_stop', 'oc_reap_orphans', 'oc_profile_status', 'oc_session_snapshot', 'oc_session_resume', 'oc_journal', 'oc_run_start', 'oc_run_status', 'oc_run_events', 'oc_run_finish']);
+const SKIP_SESSION_INIT_TOOLS = new Set(['oc_stop', 'oc_reap_orphans', 'oc_profile_status', 'oc_session_snapshot', 'oc_session_resume', 'oc_journal']);


Keep oc_run lifecycle tools outside session-init gate

Removing oc_run_start, oc_run_status, oc_run_events, and oc_run_finish from SKIP_SESSION_INIT_TOOLS forces these local run-ledger tools through CDP reconnect/session initialization, even though they do not require a browser session. When Chrome is disconnected or reconnecting, these calls can now fail with reconnection/session-init errors, making run diagnostics unavailable in the exact failure scenarios they are intended to support.

Useful? React with 👍 / 👎.

# Conflicts: # src/mcp-server.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ed8ee9546

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T10:57:03Z

+    const callId = this.activityTracker!.startCall(toolName, sessionId || 'default', toolArgs, requestId);
+    getDashboardState().recordToolStart(sessionId || 'default', toolName, toolArgs, callId);


Redact act variables before telemetry writes

This now sends raw toolArgs into activity/dashboard/journal/recording paths, but the prior redactActVariablesForTelemetry guard was removed. For act calls, variables often contain user-provided credentials or tokens under arbitrary keys, so key-based redaction in downstream stores will not reliably catch them and those values can be persisted to local logs/JSONL. Please restore tool-level redaction for act.variables before these telemetry sinks receive args.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T10:57:03Z

-    getDashboardState().recordToolStart(sessionId || 'default', toolName, telemetryToolArgs, callId);
+    const callId = this.activityTracker!.startCall(toolName, sessionId || 'default', toolArgs, requestId);
+    getDashboardState().recordToolStart(sessionId || 'default', toolName, toolArgs, callId);
    const toolStartTime = Date.now();


Preserve run-harness tool event logging

The tool-call path no longer extracts run_id or appends tool_call_started / tool_call_finished events to the run store, so with run harness enabled a run captures only lifecycle events from oc_run_* and loses per-tool execution history. That breaks oc_run_events for the main diagnostics/benchmark use case where callers start a run and then execute normal tools with the same run_id.

Useful? React with 👍 / 👎.

shaun0927 · 2026-05-13T15:14:08Z

Deferred from this review pass: at 9,119 additions / 882 deletions across 108 files, this PR is too large to safely batch-review and merge in a single session. The stack also bottoms out here (no parent PR), so any consolidation into develop would have to retarget this branch first. Recommend opening a dedicated review pass for the trajectory scorer changes, validated against the recovery-reward-scorer test suite, before merging. (No Codex P0/P1/P2 outstanding per current snapshot.)

Constraint: PR #1078 review flagged persisted telemetry corruption and inline secret leakage in summarized args. Rejected: Dropping all persisted history on one bad JSONL line | It erases valid trajectory context after torn writes. Confidence: high Scope-risk: narrow Directive: Recovery telemetry remains best-effort but must preserve valid records and redact secrets regardless of key name. Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts --runInBand; npm run build Not-tested: Full CI matrix pending on GitHub

Constraint: PR #1078 was retargeted from its already-merged stacked base to develop and needed current telemetry, hint, and automation paths preserved. Rejected: Keeping the stale mcp-server conflict side | It would drop current develop hint, journal, and telemetry-redaction semantics. Confidence: medium Scope-risk: moderate Directive: Preserve reward scoring as additive passive telemetry; do not alter browser action behavior. Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts --runInBand; npm run build Not-tested: Full CI pending after push

Constraint: PR #1078 was dirty against develop after run-harness, task-ledger, policy, and metrics changes landed. Rejected: Reworking the reward scorer on an obsolete base | stale merge state obscured real CI and Codex review signals. Confidence: high Scope-risk: moderate Directive: Preserve develop's run-harness event logging and automation fallback paths while keeping recovery scoring changes layered on top. Tested: conflict resolved; post-merge recovery tests and build will run before push. Not-tested: CI after push is pending. Co-authored-by: OmX <omx@oh-my-codex.dev>

Constraint: Codex review found act variables could still flow into recovery trajectory telemetry despite the activity and recording paths redacting them. Rejected: Relying on generic trajectory argument summarization | the variables envelope is user-provided and not inherently keyed as a secret. Confidence: high Scope-risk: narrow Directive: Use the same telemetryToolArgs object for recovery trajectory writes that other telemetry paths use. Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts tests/run-harness/store.test.ts tests/run-harness/tools-budget.test.ts --runInBand; npm run build; git diff --check Not-tested: Live act tool invocation with variables under the full MCP server. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector · 2026-05-13T18:50:45Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

shaun0927 · 2026-05-13T20:33:41Z

Reviewed for merge readiness in this dedicated pass. This PR adds deterministic recovery reward scoring for trajectory outcomes and related ledger/harness integration so recovery candidates can be ranked from observed execution evidence rather than ad hoc judgment. Earlier bot/reviewer noise is either quota-related or from older reviewed commits; I did not find an unresolved current-head blocker after revalidating the branch.\n\nImprovements/validation performed during this review pass: the branch was checked against the recovery reward scorer, trajectory ledger, run-harness store, and tools-budget coverage, then build, lint, dependency-tier lint, tool-schema lint, and diff checks. The GitHub Actions matrix is now fully green across Ubuntu, macOS, and Windows on Node 18/20/22. Although the PR is large, its scope is internally coherent around deterministic recovery scoring and the current checks cover the main integration seams, so it is safe to merge.

Merged the stacked recovery candidate-ranking branch onto develop after the recovery scorer base landed, preserving advisory candidate ranking, bounded read-only recovery, policy learning, and current plan verification contracts. Constraint: PR #1088 was stacked on the now-merged #1078 branch and had to be retargeted to develop with conflicts resolved. Rejected: Dropping bounded recovery or current plan-verification evidence | either choice would regress already-reviewed stack behavior. Confidence: high Scope-risk: broad Directive: Keep recovery candidate ranking advisory and keep bounded recovery limited to safe read-only candidates. Tested: npm test -- --runTestsByPath tests/recovery/candidate-ranker.test.ts tests/recovery/policy-learner.test.ts tests/hints/hint-engine.test.ts tests/orchestration/plan-executor-bounded-recovery.test.ts tests/orchestration/plan-cache.test.ts tests/run-harness/tools-budget.test.ts --runInBand; npm run build; npm run lint -- --quiet; npm run lint:tier; npm run lint:tool-schemas; node scripts/lint-tools-capabilities.js; git diff --check Not-tested: Full repository test suite and live browser recovery campaign. Co-authored-by: OmX <omx@oh-my-codex.dev>

shaun0927 added 2 commits May 13, 2026 00:01

chatgpt-codex-connector Bot reviewed May 12, 2026

View reviewed changes

shaun0927 mentioned this pull request May 12, 2026

feat(hints): rank safe recovery candidates (#1018) #1088

Closed

shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from 796924a to 27ea971 Compare May 12, 2026 16:31

shaun0927 and others added 5 commits May 13, 2026 09:27

fix(recovery): compare progress against last trajectory status

e077359

fix(recovery): queue trajectory ledger persistence

c421a61

shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from 5e9af6b to b1402e3 Compare May 13, 2026 00:44

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 and others added 7 commits May 13, 2026 09:58

fix(recovery): compare progress against last trajectory status

f1c591a

fix(recovery): queue trajectory ledger persistence

3bd0d72

shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from 2093ff0 to 3dd32a7 Compare May 13, 2026 01:08

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 and others added 8 commits May 13, 2026 11:07

fix(recovery): compare progress against last trajectory status

d73568a

fix(recovery): queue trajectory ledger persistence

e25765a

shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from c5a6624 to 1b85b91 Compare May 13, 2026 02:14

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 mentioned this pull request May 13, 2026

feat(recovery): deterministic evidence reward scorer for progress and recovery #1019

Closed

9 tasks

shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from c1b18b7 to 0a68816 Compare May 13, 2026 09:21

Merge feat/1017-recovery-trajectory-ledger into feat/1019-recovery-re…

fac3b0a

…ward-scorer

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

Merge develop into feat/1019-recovery-reward-scorer

2ed8ee9

# Conflicts: # src/mcp-server.ts

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 changed the base branch from feat/1017-recovery-trajectory-ledger to develop May 13, 2026 17:35

shaun0927 and others added 3 commits May 14, 2026 02:38

shaun0927 merged commit 0a8c547 into develop May 13, 2026
9 checks passed

shaun0927 deleted the feat/1019-recovery-reward-scorer branch May 14, 2026 13:21

		const priorNoProgressCount = resultStatus === 'no_progress'
		? this.countConsecutiveNoProgress(this.recoveryLedger.readRecent(8, sessionId), toolName, tabId)

		if (toolName === 'request_intercept' \|\| toolName === 'network_capture') {
		return { networkChanged: true, observationOnly: false };

		const callId = this.activityTracker!.startCall(toolName, sessionId \|\| 'default', toolArgs, requestId);
		getDashboardState().recordToolStart(sessionId \|\| 'default', toolName, toolArgs, callId);

Conversation

shaun0927 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress / Review status

Uh oh!

qodo-code-review Bot commented May 12, 2026

Uh oh!

gemini-code-assist Bot commented May 12, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

shaun0927 commented May 12, 2026 •

edited

Loading