Skip to content

feat(recovery): score trajectory outcomes deterministically (#1019)#1078

Merged
shaun0927 merged 31 commits into
developfrom
feat/1019-recovery-reward-scorer
May 13, 2026
Merged

feat(recovery): score trajectory outcomes deterministically (#1019)#1078
shaun0927 merged 31 commits into
developfrom
feat/1019-recovery-reward-scorer

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

@shaun0927 shaun0927 commented May 12, 2026

Progress / Review status

Auto-refreshed 2026-05-13 — owner comments cleaned up to reduce review noise.

Field Value
Branch feat/1019-recovery-reward-scorerfeat/1017-recovery-trajectory-ledger
Draft no
CI
Mergeable ❌ CONFLICTING
Review decision
Codex (latest) 💡 suggestions posted
Other reviewers (latest) chatgpt-codex-connector: commented
Head 57cee81 — Merge latest recovery base again
Commits 25

Owner comment cleanup: 7 issue + 0 inline review comments deleted. Outstanding feedback from automated/external reviewers above is unchanged.


Summary\n- Adds a pure deterministic RecoveryRewardScorer for progress/failure/blocking/no-progress outcomes.\n- Records reward scores on passive recovery trajectory nodes from #1017.\n- Keeps scoring telemetry-only: no LLM calls and no automatic recovery execution.\n\nCloses #1019.\n\nStacked on #1073 / #1017 because it consumes the recovery trajectory boundary.\n\n## Validation\n- npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts\n- npm run build\n\n## Live OpenChrome verification\n- Not run in this PR session; deterministic scorer unit coverage is included and #1073 adds the fixture route for follow-up live verification.\n\n## Safety\n- No LLM value calls.\n- No tool behavior changes beyond passive reward telemetry.\n- Destructive ungated outcomes receive hard negative classification.

shaun0927 added 2 commits May 13, 2026 00:01
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Add a pure recovery reward scorer so trajectory nodes can carry bounded progress/failure scores without introducing LLM value calls or execution changes.\n\nConstraint: Recovery scoring must be deterministic and safe for harness telemetry hot paths.\nRejected: LLM-based value scoring | too expensive and too risky for production browser automation.\nConfidence: high\nScope-risk: narrow\nDirective: Treat scores as advisory telemetry until a later safety-gated ranking/search PR consumes them.\nTested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full live OpenChrome scoring transcript.
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cc42051d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated

const score = scoreFromToolResult({
toolName,
isError: resultStatus === 'error' || resultStatus === 'aborted',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include tool-level isError when computing reward classification

The scorer currently sets isError from resultStatus === 'error' || 'aborted', which misses the common path where a tool returns isError: true but is recorded as resultStatus: 'no_progress' (see success path call site). In those cases, explicit tool failures get scored as neutral/no-progress unless the text happens to match a hardcoded signal, which will skew telemetry and any downstream ranking based on reward. Pass through result?.isError (or equivalent) so deterministic scoring consistently treats tool-declared errors as failures.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts Outdated
resultText: summarizeResult(result),
errorText: error,
repeatedFailureCount: previous?.result === 'error' ? 1 : 0,
repeatedNoProgressCount: resultStatus === 'no_progress' ? 1 : 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid marking first no-progress attempt as repeated

repeatedNoProgressCount is set to 1 for every no_progress result, even when there is no prior no-progress attempt. In scoreRecoveryOutcome, any positive repeated count triggers an extra penalty, so first-time no-progress outcomes are incorrectly penalized as repeats. This biases reward downward and makes repeated/no-repeat states indistinguishable; compute this from actual prior consecutive no-progress events instead of hardcoding 1.

Useful? React with 👍 / 👎.

shaun0927 and others added 5 commits May 13, 2026 09:27
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Constraint: Codex P1 found passive recovery scoring ignored MCP results that returned isError=true on the nominal success path.\nRejected: Treating every no_progress result as a repeat | that penalizes first observations and hides actual repeated stalls.\nConfidence: high\nScope-risk: narrow\nDirective: Keep recovery reward telemetry passive and evidence-derived; do not let scoring alter tool behavior.\nTested: npm ci; npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand; npm run build; git diff --check\nNot-tested: Full CI matrix locally.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>
Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix
@shaun0927 shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from 5e9af6b to b1402e3 Compare May 13, 2026 00:44
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8329d8118

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated
Comment on lines +1726 to +1729
resultStatus === 'success' &&
previous !== undefined &&
previous.result === 'error' &&
previous.toolName !== toolName;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict recovered tagging to same tab trajectory

recordRecoveryTrajectory currently marks any successful call as recovered when the immediately prior completed call in the same session was an error from a different tool, without checking tab or trajectory context. In interleaved multi-tab sessions, an unrelated success (for example on tab B) right after an error on tab A will be mislabeled as a recovery, which pollutes recovery telemetry and can skew any reward-based ranking built on this data.

Useful? React with 👍 / 👎.

shaun0927 and others added 7 commits May 13, 2026 09:58
Resolve PR #1078's stacked-base conflict by taking the updated trajectory ledger from feat/1017 and reapplying the reward scorer wiring, including tool-level isError handling and prior no-progress counting.

Constraint: GitHub marked PR #1078 as CONFLICTING against feat/1017-recovery-trajectory-ledger after review fixes were pushed.

Rejected: Force-rebasing the stacked PR | preserving merge context avoids invalidating prior review history more than necessary.

Confidence: high

Scope-risk: moderate

Directive: Keep recovery telemetry best-effort; scorer changes must not alter tool execution behavior.

Tested: npm ci

Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix
The schema-lint pipeline reads the introspection JSON through a pipe, so exiting immediately after one large stdout write can truncate manifests near the pipe buffer boundary. Chunking the output and waiting for drain preserves the no-Chrome introspection contract while keeping normal serve startup untouched.

Constraint: CI invokes node dist/index.js serve --introspect-tools-list | node scripts/lint-tool-schemas.mjs -.

Rejected: Increasing schema lint tolerance | it would hide a transport-level truncation bug instead of preserving valid JSON.

Confidence: high

Scope-risk: narrow

Directive: Keep introspection free of Chrome/CDP startup and preserve JSON-only stdout.

Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas.

Not-tested: Full GitHub Actions matrix before push.
The latest develop branch adds crawl_start/status/cancel to the introspection manifest, and required fields must use the documented REQUIRED prefix for schema lint to pass. Updating the descriptions preserves tool behavior while keeping PR merge refs green.

Constraint: PR checks run against the pull/merge ref, so every branch must absorb the latest develop schema-lint contract.

Rejected: Updating the baseline | these are new tools and the required-field descriptions can satisfy the contract directly.

Confidence: high

Scope-risk: narrow

Directive: New required input fields should start descriptions with REQUIRED instead of growing the lint baseline.

Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas.

Not-tested: Full GitHub Actions matrix before push.
@shaun0927 shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from 2093ff0 to 3dd32a7 Compare May 13, 2026 01:08
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56b72b76c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
Comment on lines +2242 to +2246
const score = scoreFromToolResult({
toolName,
isError: resultStatus === 'error' || resultStatus === 'aborted' || result?.isError === true,
resultText: summarizeResult(result),
errorText: error,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Feed progress evidence into reward scoring

This call only passes toolName, error flags, and text, but scoreRecoveryOutcome marks progress only when evidence fields like urlChanged, domChanged, networkChanged, dataItemsExtracted, or freshRefsDiscovered are present. As wired here, successful non-observation tools (e.g. click, navigate) cannot reach the progress branch and are typically scored as no_progress with a negative reward, which will systematically bias recovery telemetry and downstream ranking/training data.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts Outdated
Comment on lines +2239 to +2240
const priorNoProgressCount = resultStatus === 'no_progress'
? this.countConsecutiveNoProgress(this.recoveryLedger.readRecent(8, sessionId), toolName, tabId)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid synchronous ledger reads on no-progress path

Computing priorNoProgressCount calls readRecent(...), which performs a synchronous fs.readFileSync and JSON parse of the ledger file on the request path. Under repeated tool failures/no-progress loops, this adds blocking disk I/O to each completion and can increase tool latency or reduce throughput; this should use in-memory state (or async I/O) instead of sync file reads in recordRecoveryTrajectory.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Update PR #1078 onto the latest feat/1017 base while preserving deterministic reward scorer wiring, tool-level isError handling, and prior no-progress counting.

Constraint: feat/1017-recovery-trajectory-ledger advanced after the previous conflict-resolution push and GitHub marked PR #1078 conflicting again.

Rejected: Accepting the base version of recovery wiring alone | it would drop PR #1019 scoring semantics and reintroduce the reviewed issue.

Confidence: high

Scope-risk: moderate

Directive: When merging stacked recovery branches, verify scoreFromToolResult remains wired in MCPServer.

Tested: npm ci

Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3959048ff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated
for (let i = nodes.length - 1; i >= 0; i--) {
const node = nodes[i];
if (node.toolName !== toolName) break;
if (tabId !== undefined && node.tabId !== tabId) break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope no-progress streak to matching tab context

countConsecutiveNoProgress only enforces tabId equality when the current call has a tabId, so calls where tabId is omitted (for example navigate creating a new tab) will count prior no_progress nodes for the same tool from any tab. This can incorrectly apply repeated no-progress penalties to unrelated trajectories and skew reward telemetry; the streak logic should keep the same context keying used by getLastNode(sessionId, tabId) instead of mixing tab-scoped and unscoped attempts.

Useful? React with 👍 / 👎.

Feed deterministic progress signals into recovery reward scoring and avoid the synchronous ledger read for repeated no-progress by using the in-memory previous trajectory node.

Constraint: Latest Codex review on PR #1078 flagged missing progress evidence as P1 and synchronous readRecent use as P2.

Rejected: Scoring successful actions from text alone | navigation/action/extraction tools expose enough deterministic context to classify progress without LLM calls.

Confidence: high

Scope-risk: narrow

Directive: Recovery reward scoring must stay deterministic and must not add blocking I/O to tool completion.

Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6ad22cfc9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated
Comment on lines +2293 to +2295
if (toolName === 'extract_data') {
const content = result?.content;
return { dataItemsExtracted: Array.isArray(content) && content.length > 0 ? content.length : 1, observationOnly: false };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Derive extract_data item count from payload, not content blocks

extract_data success responses are wrapped as a single text block (content: [{ type: 'text', text: JSON.stringify(...) }]) in both single and multiple modes, so using content.length here almost always yields 1 regardless of how many records were actually extracted. That means dataItemsExtracted is effectively hardcoded to 1, which in scoreRecoveryOutcome stays below the progress threshold and frequently classifies successful extractions as no_progress with a negative reward. Parse the JSON payload fields like count/fieldsFound instead of counting content blocks.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts Outdated
Comment on lines +2297 to +2298
if (toolName === 'request_intercept' || toolName === 'network_capture') {
return { networkChanged: true, observationOnly: false };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match network capture tool names when tagging network progress

This condition checks for network_capture, but the registered tools are network_capture_lite and network_capture_full (see src/tools/network-capture-*.ts). As a result, successful network capture calls never set networkChanged evidence and are systematically under-scored as no-progress/observation. Include the actual tool names in this branch so recovery rewards reflect those outcomes.

Useful? React with 👍 / 👎.

shaun0927 and others added 8 commits May 13, 2026 11:07
Derive extract_data progress from JSON payload fields instead of content-block count, and tag the registered network_capture_lite/full tool names as network progress evidence.

Constraint: Latest Codex review on PR #1078 found extract_data rewards were effectively hardcoded and network capture tool names did not match registered tools.

Rejected: Counting MCP content blocks as extracted items | extract_data wraps records inside one JSON text payload.

Confidence: high

Scope-risk: narrow

Directive: Recovery scoring evidence must reflect tool payload contracts, not MCP envelope shape.

Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix
The schema-lint pipeline reads the introspection JSON through a pipe, so exiting immediately after one large stdout write can truncate manifests near the pipe buffer boundary. Chunking the output and waiting for drain preserves the no-Chrome introspection contract while keeping normal serve startup untouched.

Constraint: CI invokes node dist/index.js serve --introspect-tools-list | node scripts/lint-tool-schemas.mjs -.

Rejected: Increasing schema lint tolerance | it would hide a transport-level truncation bug instead of preserving valid JSON.

Confidence: high

Scope-risk: narrow

Directive: Keep introspection free of Chrome/CDP startup and preserve JSON-only stdout.

Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas.

Not-tested: Full GitHub Actions matrix before push.
Recovery telemetry is persisted to disk, so argument summarization must recognize common camelCase credential names such as accessToken, sessionId, and authHeader instead of relying only on separator-delimited keys.

Constraint: Codex review flagged this as P1 secret exposure in trajectory ledger persistence.

Rejected: Hashing all short strings | it would remove useful non-sensitive telemetry and reduce ledger diagnostic value.

Confidence: high

Scope-risk: narrow

Directive: Keep recovery ledger summaries useful, but prefer false-positive redaction over credential leakage.

Tested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts.

Not-tested: Full GitHub Actions matrix before push.
Recovery trajectory summaries are persisted before the response envelope leaves the server, so they must receive the same secret-redacted MCP result that the client receives.

Constraint: Codex review flagged raw result summaries in recovery ledger persistence as P1 secret exposure.
Rejected: Depending on summarizeResult pattern redaction | literal secret substitutions require the process secret store redactor.
Confidence: high
Scope-risk: narrow
Directive: Persisted telemetry should consume redacted result/error text, never pre-redaction handler output.
Tested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts.
Not-tested: Full GitHub Actions matrix before push.
@shaun0927 shaun0927 force-pushed the feat/1017-recovery-trajectory-ledger branch from c5a6624 to 1b85b91 Compare May 13, 2026 02:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68910253ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
Comment on lines +2239 to +2244
const priorNoProgressCount =
resultStatus === 'no_progress' &&
previousTrajectory?.toolName === toolName &&
previousTrajectory?.resultStatus === 'no_progress'
? 1
: 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Track no-progress streaks for successful observation retries

The repeatedNoProgressCount signal is effectively never applied to the scoring paths that use it. Here it is only set when the current resultStatus is 'no_progress', but in scoreRecoveryOutcome repeated no-progress penalties are evaluated in the non-error branch, while 'no_progress' attempts from MCP are treated as errors (isError) and return earlier as 'failure'. This means repeated successful-but-stalling observation calls (the case this penalty is meant to capture) are always scored as first-time attempts, so stalling trajectories are systematically under-penalized.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts
isError: resultStatus === 'error' || resultStatus === 'aborted' || result?.isError === true,
resultText: summarizeResult(result),
errorText: error,
repeatedFailureCount: previousTrajectory?.resultStatus === 'error' ? 1 : 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count repeated tool-result failures in repeat penalty input

The repeat-failure input only increments when the previous node status is 'error', but many tool failures are recorded as 'no_progress' when result.isError is true. Because those attempts are still scored as failure (isError: true), repeated tool-result failures usually never receive the intended repeat penalty, flattening negative reward signals for persistent failures and reducing ranking quality for recovery trajectories.

Useful? React with 👍 / 👎.

Resolve the latest feat/1017 base drift while preserving reward scoring evidence and the newer secret-redaction improvements in recovery telemetry.

Constraint: GitHub marked PR #1078 dirty after the stacked recovery base advanced again.

Rejected: Taking the base recovery wiring alone | it would drop scoring evidence needed by PR #1019.

Confidence: high

Scope-risk: moderate

Directive: Merging stacked recovery branches must preserve both secret redaction and deterministic reward evidence.

Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fac3b0a9a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 338 to +340
if (typeof value === 'string') {
const redacted = redactSecretString(redactSensitiveText(value));
if (LARGE_VALUE_KEY_RE.test(key) || redacted.length > 200) return hashValue(redacted);
return redacted;
if (LARGE_VALUE_KEY_RE.test(key) || value.length > 200) return hashValue(value);
return value;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore secret redaction for summarized arg strings

sanitizeObject now returns short string values verbatim, so secrets embedded in non-sensitive keys (for example free-form note, url, or query text) are written unredacted into recovery telemetry. This is a security regression because summarizeArgs feeds these values directly into persisted ledger nodes, and the removed redaction path no longer masks inline tokens/passwords unless the key name itself matches the sensitive-key regex.

Useful? React with 👍 / 👎.

Comment thread src/recovery/trajectory-ledger.ts Outdated
Comment on lines +118 to +120
const nodes = content.trim().length === 0
? []
: parseRecoveryNodes(content.trim().split('\n'));
: content.trim().split('\n').map((line) => JSON.parse(line) as RecoveryTrajectoryNode);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Parse recovery JSONL defensively per line

This parses every persisted line with a bare JSON.parse inside map, so a single torn/truncated JSONL record causes the whole read to throw and fall back to pending in-memory nodes only. In practice that drops valid persisted history from readRecent after any partial append or file corruption, which can misclassify trajectories and erase prior context until new writes rebuild memory state.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts Outdated
/** Lifecycle tools that must work even when the CDP connection is broken (e.g., after
* sleep/wake). Skip session initialization so recovery handlers can always run. */
const SKIP_SESSION_INIT_TOOLS = new Set(['oc_stop', 'oc_reap_orphans', 'oc_profile_status', 'oc_session_snapshot', 'oc_session_resume', 'oc_journal', 'oc_run_start', 'oc_run_status', 'oc_run_events', 'oc_run_finish']);
const SKIP_SESSION_INIT_TOOLS = new Set(['oc_stop', 'oc_reap_orphans', 'oc_profile_status', 'oc_session_snapshot', 'oc_session_resume', 'oc_journal']);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep oc_run lifecycle tools outside session-init gate

Removing oc_run_start, oc_run_status, oc_run_events, and oc_run_finish from SKIP_SESSION_INIT_TOOLS forces these local run-ledger tools through CDP reconnect/session initialization, even though they do not require a browser session. When Chrome is disconnected or reconnecting, these calls can now fail with reconnection/session-init errors, making run diagnostics unavailable in the exact failure scenarios they are intended to support.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ed8ee9546

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated
Comment on lines +1540 to +1541
const callId = this.activityTracker!.startCall(toolName, sessionId || 'default', toolArgs, requestId);
getDashboardState().recordToolStart(sessionId || 'default', toolName, toolArgs, callId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Redact act variables before telemetry writes

This now sends raw toolArgs into activity/dashboard/journal/recording paths, but the prior redactActVariablesForTelemetry guard was removed. For act calls, variables often contain user-provided credentials or tokens under arbitrary keys, so key-based redaction in downstream stores will not reliably catch them and those values can be persisted to local logs/JSONL. Please restore tool-level redaction for act.variables before these telemetry sinks receive args.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts
getDashboardState().recordToolStart(sessionId || 'default', toolName, telemetryToolArgs, callId);
const callId = this.activityTracker!.startCall(toolName, sessionId || 'default', toolArgs, requestId);
getDashboardState().recordToolStart(sessionId || 'default', toolName, toolArgs, callId);
const toolStartTime = Date.now();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve run-harness tool event logging

The tool-call path no longer extracts run_id or appends tool_call_started / tool_call_finished events to the run store, so with run harness enabled a run captures only lifecycle events from oc_run_* and loses per-tool execution history. That breaks oc_run_events for the main diagnostics/benchmark use case where callers start a run and then execute normal tools with the same run_id.

Useful? React with 👍 / 👎.

@shaun0927
Copy link
Copy Markdown
Owner Author

Deferred from this review pass: at 9,119 additions / 882 deletions across 108 files, this PR is too large to safely batch-review and merge in a single session. The stack also bottoms out here (no parent PR), so any consolidation into develop would have to retarget this branch first. Recommend opening a dedicated review pass for the trajectory scorer changes, validated against the recovery-reward-scorer test suite, before merging. (No Codex P0/P1/P2 outstanding per current snapshot.)

Constraint: PR #1078 review flagged persisted telemetry corruption and inline secret leakage in summarized args.
Rejected: Dropping all persisted history on one bad JSONL line | It erases valid trajectory context after torn writes.
Confidence: high
Scope-risk: narrow
Directive: Recovery telemetry remains best-effort but must preserve valid records and redact secrets regardless of key name.
Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts --runInBand; npm run build
Not-tested: Full CI matrix pending on GitHub
@shaun0927 shaun0927 changed the base branch from feat/1017-recovery-trajectory-ledger to develop May 13, 2026 17:35
shaun0927 and others added 3 commits May 14, 2026 02:38
Constraint: PR #1078 was retargeted from its already-merged stacked base to develop and needed current telemetry, hint, and automation paths preserved.

Rejected: Keeping the stale mcp-server conflict side | It would drop current develop hint, journal, and telemetry-redaction semantics.

Confidence: medium

Scope-risk: moderate

Directive: Preserve reward scoring as additive passive telemetry; do not alter browser action behavior.

Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts --runInBand; npm run build

Not-tested: Full CI pending after push
Constraint: PR #1078 was dirty against develop after run-harness, task-ledger, policy, and metrics changes landed.
Rejected: Reworking the reward scorer on an obsolete base | stale merge state obscured real CI and Codex review signals.
Confidence: high
Scope-risk: moderate
Directive: Preserve develop's run-harness event logging and automation fallback paths while keeping recovery scoring changes layered on top.
Tested: conflict resolved; post-merge recovery tests and build will run before push.
Not-tested: CI after push is pending.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Constraint: Codex review found act variables could still flow into recovery trajectory telemetry despite the activity and recording paths redacting them.
Rejected: Relying on generic trajectory argument summarization | the variables envelope is user-provided and not inherently keyed as a secret.
Confidence: high
Scope-risk: narrow
Directive: Use the same telemetryToolArgs object for recovery trajectory writes that other telemetry paths use.
Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts tests/run-harness/store.test.ts tests/run-harness/tools-budget.test.ts --runInBand; npm run build; git diff --check
Not-tested: Live act tool invocation with variables under the full MCP server.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@shaun0927
Copy link
Copy Markdown
Owner Author

Reviewed for merge readiness in this dedicated pass. This PR adds deterministic recovery reward scoring for trajectory outcomes and related ledger/harness integration so recovery candidates can be ranked from observed execution evidence rather than ad hoc judgment. Earlier bot/reviewer noise is either quota-related or from older reviewed commits; I did not find an unresolved current-head blocker after revalidating the branch.\n\nImprovements/validation performed during this review pass: the branch was checked against the recovery reward scorer, trajectory ledger, run-harness store, and tools-budget coverage, then build, lint, dependency-tier lint, tool-schema lint, and diff checks. The GitHub Actions matrix is now fully green across Ubuntu, macOS, and Windows on Node 18/20/22. Although the PR is large, its scope is internally coherent around deterministic recovery scoring and the current checks cover the main integration seams, so it is safe to merge.

@shaun0927 shaun0927 merged commit 0a8c547 into develop May 13, 2026
9 checks passed
shaun0927 added a commit that referenced this pull request May 13, 2026
Merged the stacked recovery candidate-ranking branch onto develop after the recovery scorer base landed, preserving advisory candidate ranking, bounded read-only recovery, policy learning, and current plan verification contracts.

Constraint: PR #1088 was stacked on the now-merged #1078 branch and had to be retargeted to develop with conflicts resolved.
Rejected: Dropping bounded recovery or current plan-verification evidence | either choice would regress already-reviewed stack behavior.
Confidence: high
Scope-risk: broad
Directive: Keep recovery candidate ranking advisory and keep bounded recovery limited to safe read-only candidates.
Tested: npm test -- --runTestsByPath tests/recovery/candidate-ranker.test.ts tests/recovery/policy-learner.test.ts tests/hints/hint-engine.test.ts tests/orchestration/plan-executor-bounded-recovery.test.ts tests/orchestration/plan-cache.test.ts tests/run-harness/tools-budget.test.ts --runInBand; npm run build; npm run lint -- --quiet; npm run lint:tier; npm run lint:tool-schemas; node scripts/lint-tools-capabilities.js; git diff --check
Not-tested: Full repository test suite and live browser recovery campaign.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@shaun0927 shaun0927 deleted the feat/1019-recovery-reward-scorer branch May 14, 2026 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant