feat(recovery): score trajectory outcomes deterministically (#1019)#1078
Conversation
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Add a pure recovery reward scorer so trajectory nodes can carry bounded progress/failure scores without introducing LLM value calls or execution changes.\n\nConstraint: Recovery scoring must be deterministic and safe for harness telemetry hot paths.\nRejected: LLM-based value scoring | too expensive and too risky for production browser automation.\nConfidence: high\nScope-risk: narrow\nDirective: Treat scores as advisory telemetry until a later safety-gated ranking/search PR consumes them.\nTested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full live OpenChrome scoring transcript.
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1cc42051d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| const score = scoreFromToolResult({ | ||
| toolName, | ||
| isError: resultStatus === 'error' || resultStatus === 'aborted', |
There was a problem hiding this comment.
Include tool-level isError when computing reward classification
The scorer currently sets isError from resultStatus === 'error' || 'aborted', which misses the common path where a tool returns isError: true but is recorded as resultStatus: 'no_progress' (see success path call site). In those cases, explicit tool failures get scored as neutral/no-progress unless the text happens to match a hardcoded signal, which will skew telemetry and any downstream ranking based on reward. Pass through result?.isError (or equivalent) so deterministic scoring consistently treats tool-declared errors as failures.
Useful? React with 👍 / 👎.
| resultText: summarizeResult(result), | ||
| errorText: error, | ||
| repeatedFailureCount: previous?.result === 'error' ? 1 : 0, | ||
| repeatedNoProgressCount: resultStatus === 'no_progress' ? 1 : 0, |
There was a problem hiding this comment.
Avoid marking first no-progress attempt as repeated
repeatedNoProgressCount is set to 1 for every no_progress result, even when there is no prior no-progress attempt. In scoreRecoveryOutcome, any positive repeated count triggers an extra penalty, so first-time no-progress outcomes are incorrectly penalized as repeats. This biases reward downward and makes repeated/no-repeat states indistinguishable; compute this from actual prior consecutive no-progress events instead of hardcoding 1.
Useful? React with 👍 / 👎.
796924a to
27ea971
Compare
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Constraint: Codex P1 found passive recovery scoring ignored MCP results that returned isError=true on the nominal success path.\nRejected: Treating every no_progress result as a repeat | that penalizes first observations and hides actual repeated stalls.\nConfidence: high\nScope-risk: narrow\nDirective: Keep recovery reward telemetry passive and evidence-derived; do not let scoring alter tool behavior.\nTested: npm ci; npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand; npm run build; git diff --check\nNot-tested: Full CI matrix locally.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>
Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix
5e9af6b to
b1402e3
Compare
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8329d8118
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| resultStatus === 'success' && | ||
| previous !== undefined && | ||
| previous.result === 'error' && | ||
| previous.toolName !== toolName; |
There was a problem hiding this comment.
Restrict recovered tagging to same tab trajectory
recordRecoveryTrajectory currently marks any successful call as recovered when the immediately prior completed call in the same session was an error from a different tool, without checking tab or trajectory context. In interleaved multi-tab sessions, an unrelated success (for example on tab B) right after an error on tab A will be mislabeled as a recovery, which pollutes recovery telemetry and can skew any reward-based ranking built on this data.
Useful? React with 👍 / 👎.
Resolve PR #1078's stacked-base conflict by taking the updated trajectory ledger from feat/1017 and reapplying the reward scorer wiring, including tool-level isError handling and prior no-progress counting. Constraint: GitHub marked PR #1078 as CONFLICTING against feat/1017-recovery-trajectory-ledger after review fixes were pushed. Rejected: Force-rebasing the stacked PR | preserving merge context avoids invalidating prior review history more than necessary. Confidence: high Scope-risk: moderate Directive: Keep recovery telemetry best-effort; scorer changes must not alter tool execution behavior. Tested: npm ci Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix
The schema-lint pipeline reads the introspection JSON through a pipe, so exiting immediately after one large stdout write can truncate manifests near the pipe buffer boundary. Chunking the output and waiting for drain preserves the no-Chrome introspection contract while keeping normal serve startup untouched. Constraint: CI invokes node dist/index.js serve --introspect-tools-list | node scripts/lint-tool-schemas.mjs -. Rejected: Increasing schema lint tolerance | it would hide a transport-level truncation bug instead of preserving valid JSON. Confidence: high Scope-risk: narrow Directive: Keep introspection free of Chrome/CDP startup and preserve JSON-only stdout. Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas. Not-tested: Full GitHub Actions matrix before push.
The latest develop branch adds crawl_start/status/cancel to the introspection manifest, and required fields must use the documented REQUIRED prefix for schema lint to pass. Updating the descriptions preserves tool behavior while keeping PR merge refs green. Constraint: PR checks run against the pull/merge ref, so every branch must absorb the latest develop schema-lint contract. Rejected: Updating the baseline | these are new tools and the required-field descriptions can satisfy the contract directly. Confidence: high Scope-risk: narrow Directive: New required input fields should start descriptions with REQUIRED instead of growing the lint baseline. Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas. Not-tested: Full GitHub Actions matrix before push.
2093ff0 to
3dd32a7
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56b72b76c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const score = scoreFromToolResult({ | ||
| toolName, | ||
| isError: resultStatus === 'error' || resultStatus === 'aborted' || result?.isError === true, | ||
| resultText: summarizeResult(result), | ||
| errorText: error, |
There was a problem hiding this comment.
Feed progress evidence into reward scoring
This call only passes toolName, error flags, and text, but scoreRecoveryOutcome marks progress only when evidence fields like urlChanged, domChanged, networkChanged, dataItemsExtracted, or freshRefsDiscovered are present. As wired here, successful non-observation tools (e.g. click, navigate) cannot reach the progress branch and are typically scored as no_progress with a negative reward, which will systematically bias recovery telemetry and downstream ranking/training data.
Useful? React with 👍 / 👎.
| const priorNoProgressCount = resultStatus === 'no_progress' | ||
| ? this.countConsecutiveNoProgress(this.recoveryLedger.readRecent(8, sessionId), toolName, tabId) |
There was a problem hiding this comment.
Avoid synchronous ledger reads on no-progress path
Computing priorNoProgressCount calls readRecent(...), which performs a synchronous fs.readFileSync and JSON parse of the ledger file on the request path. Under repeated tool failures/no-progress loops, this adds blocking disk I/O to each completion and can increase tool latency or reduce throughput; this should use in-memory state (or async I/O) instead of sync file reads in recordRecoveryTrajectory.
Useful? React with 👍 / 👎.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Update PR #1078 onto the latest feat/1017 base while preserving deterministic reward scorer wiring, tool-level isError handling, and prior no-progress counting. Constraint: feat/1017-recovery-trajectory-ledger advanced after the previous conflict-resolution push and GitHub marked PR #1078 conflicting again. Rejected: Accepting the base version of recovery wiring alone | it would drop PR #1019 scoring semantics and reintroduce the reviewed issue. Confidence: high Scope-risk: moderate Directive: When merging stacked recovery branches, verify scoreFromToolResult remains wired in MCPServer. Tested: npm ci Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3959048ff
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (let i = nodes.length - 1; i >= 0; i--) { | ||
| const node = nodes[i]; | ||
| if (node.toolName !== toolName) break; | ||
| if (tabId !== undefined && node.tabId !== tabId) break; |
There was a problem hiding this comment.
Scope no-progress streak to matching tab context
countConsecutiveNoProgress only enforces tabId equality when the current call has a tabId, so calls where tabId is omitted (for example navigate creating a new tab) will count prior no_progress nodes for the same tool from any tab. This can incorrectly apply repeated no-progress penalties to unrelated trajectories and skew reward telemetry; the streak logic should keep the same context keying used by getLastNode(sessionId, tabId) instead of mixing tab-scoped and unscoped attempts.
Useful? React with 👍 / 👎.
Feed deterministic progress signals into recovery reward scoring and avoid the synchronous ledger read for repeated no-progress by using the in-memory previous trajectory node. Constraint: Latest Codex review on PR #1078 flagged missing progress evidence as P1 and synchronous readRecent use as P2. Rejected: Scoring successful actions from text alone | navigation/action/extraction tools expose enough deterministic context to classify progress without LLM calls. Confidence: high Scope-risk: narrow Directive: Recovery reward scoring must stay deterministic and must not add blocking I/O to tool completion. Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6ad22cfc9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (toolName === 'extract_data') { | ||
| const content = result?.content; | ||
| return { dataItemsExtracted: Array.isArray(content) && content.length > 0 ? content.length : 1, observationOnly: false }; |
There was a problem hiding this comment.
Derive extract_data item count from payload, not content blocks
extract_data success responses are wrapped as a single text block (content: [{ type: 'text', text: JSON.stringify(...) }]) in both single and multiple modes, so using content.length here almost always yields 1 regardless of how many records were actually extracted. That means dataItemsExtracted is effectively hardcoded to 1, which in scoreRecoveryOutcome stays below the progress threshold and frequently classifies successful extractions as no_progress with a negative reward. Parse the JSON payload fields like count/fieldsFound instead of counting content blocks.
Useful? React with 👍 / 👎.
| if (toolName === 'request_intercept' || toolName === 'network_capture') { | ||
| return { networkChanged: true, observationOnly: false }; |
There was a problem hiding this comment.
Match network capture tool names when tagging network progress
This condition checks for network_capture, but the registered tools are network_capture_lite and network_capture_full (see src/tools/network-capture-*.ts). As a result, successful network capture calls never set networkChanged evidence and are systematically under-scored as no-progress/observation. Include the actual tool names in this branch so recovery rewards reflect those outcomes.
Useful? React with 👍 / 👎.
Derive extract_data progress from JSON payload fields instead of content-block count, and tag the registered network_capture_lite/full tool names as network progress evidence. Constraint: Latest Codex review on PR #1078 found extract_data rewards were effectively hardcoded and network capture tool names did not match registered tools. Rejected: Counting MCP content blocks as extracted items | extract_data wraps records inside one JSON text payload. Confidence: high Scope-risk: narrow Directive: Recovery scoring evidence must reflect tool payload contracts, not MCP envelope shape. Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>
Add a passive, bounded trajectory ledger so failed and recovered tool-call attempts survive as redacted harness telemetry while preserving existing execution semantics.\n\nConstraint: LATS-inspired adoption must avoid live-browser tree search and destructive auto-replay.\nRejected: Full MCTS/browser branching | unsafe for real authenticated Chrome sessions and out of scope for #1017.\nConfidence: high\nScope-risk: narrow\nDirective: Keep this ledger passive; future ranking/search work must remain safety-gated and should consume these nodes rather than replay them.\nTested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts; npm run build\nNot-tested: Full browser E2E transcript; fixture route was added for follow-up live verification.
Constraint: Codex review found no-progress outcomes were not treated as recoverable/stuck and multi-tab sessions could be cross-labeled.\nRejected: Reading the ledger from disk on every tool call | that would keep passive telemetry on the request hot path and reintroduce blocking I/O risk.\nConfidence: high\nScope-risk: narrow\nDirective: Recovery telemetry must remain best-effort, non-blocking, and scoped to the same session/tab context.\nTested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts\nNot-tested: full CI matrix
The schema-lint pipeline reads the introspection JSON through a pipe, so exiting immediately after one large stdout write can truncate manifests near the pipe buffer boundary. Chunking the output and waiting for drain preserves the no-Chrome introspection contract while keeping normal serve startup untouched. Constraint: CI invokes node dist/index.js serve --introspect-tools-list | node scripts/lint-tool-schemas.mjs -. Rejected: Increasing schema lint tolerance | it would hide a transport-level truncation bug instead of preserving valid JSON. Confidence: high Scope-risk: narrow Directive: Keep introspection free of Chrome/CDP startup and preserve JSON-only stdout. Tested: npx tsc -p tsconfig.json --pretty false; npm run lint:tool-schemas. Not-tested: Full GitHub Actions matrix before push.
Recovery telemetry is persisted to disk, so argument summarization must recognize common camelCase credential names such as accessToken, sessionId, and authHeader instead of relying only on separator-delimited keys. Constraint: Codex review flagged this as P1 secret exposure in trajectory ledger persistence. Rejected: Hashing all short strings | it would remove useful non-sensitive telemetry and reduce ledger diagnostic value. Confidence: high Scope-risk: narrow Directive: Keep recovery ledger summaries useful, but prefer false-positive redaction over credential leakage. Tested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts. Not-tested: Full GitHub Actions matrix before push.
Recovery trajectory summaries are persisted before the response envelope leaves the server, so they must receive the same secret-redacted MCP result that the client receives. Constraint: Codex review flagged raw result summaries in recovery ledger persistence as P1 secret exposure. Rejected: Depending on summarizeResult pattern redaction | literal secret substitutions require the process secret store redactor. Confidence: high Scope-risk: narrow Directive: Persisted telemetry should consume redacted result/error text, never pre-redaction handler output. Tested: npx tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/recovery/trajectory-ledger.test.ts. Not-tested: Full GitHub Actions matrix before push.
c5a6624 to
1b85b91
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68910253ae
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const priorNoProgressCount = | ||
| resultStatus === 'no_progress' && | ||
| previousTrajectory?.toolName === toolName && | ||
| previousTrajectory?.resultStatus === 'no_progress' | ||
| ? 1 | ||
| : 0; |
There was a problem hiding this comment.
Track no-progress streaks for successful observation retries
The repeatedNoProgressCount signal is effectively never applied to the scoring paths that use it. Here it is only set when the current resultStatus is 'no_progress', but in scoreRecoveryOutcome repeated no-progress penalties are evaluated in the non-error branch, while 'no_progress' attempts from MCP are treated as errors (isError) and return earlier as 'failure'. This means repeated successful-but-stalling observation calls (the case this penalty is meant to capture) are always scored as first-time attempts, so stalling trajectories are systematically under-penalized.
Useful? React with 👍 / 👎.
| isError: resultStatus === 'error' || resultStatus === 'aborted' || result?.isError === true, | ||
| resultText: summarizeResult(result), | ||
| errorText: error, | ||
| repeatedFailureCount: previousTrajectory?.resultStatus === 'error' ? 1 : 0, |
There was a problem hiding this comment.
Count repeated tool-result failures in repeat penalty input
The repeat-failure input only increments when the previous node status is 'error', but many tool failures are recorded as 'no_progress' when result.isError is true. Because those attempts are still scored as failure (isError: true), repeated tool-result failures usually never receive the intended repeat penalty, flattening negative reward signals for persistent failures and reducing ranking quality for recovery trajectories.
Useful? React with 👍 / 👎.
Resolve the latest feat/1017 base drift while preserving reward scoring evidence and the newer secret-redaction improvements in recovery telemetry. Constraint: GitHub marked PR #1078 dirty after the stacked recovery base advanced again. Rejected: Taking the base recovery wiring alone | it would drop scoring evidence needed by PR #1019. Confidence: high Scope-risk: moderate Directive: Merging stacked recovery branches must preserve both secret redaction and deterministic reward evidence. Tested: npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts --runInBand Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
c1b18b7 to
0a68816
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fac3b0a9a2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (typeof value === 'string') { | ||
| const redacted = redactSecretString(redactSensitiveText(value)); | ||
| if (LARGE_VALUE_KEY_RE.test(key) || redacted.length > 200) return hashValue(redacted); | ||
| return redacted; | ||
| if (LARGE_VALUE_KEY_RE.test(key) || value.length > 200) return hashValue(value); | ||
| return value; |
There was a problem hiding this comment.
Restore secret redaction for summarized arg strings
sanitizeObject now returns short string values verbatim, so secrets embedded in non-sensitive keys (for example free-form note, url, or query text) are written unredacted into recovery telemetry. This is a security regression because summarizeArgs feeds these values directly into persisted ledger nodes, and the removed redaction path no longer masks inline tokens/passwords unless the key name itself matches the sensitive-key regex.
Useful? React with 👍 / 👎.
| const nodes = content.trim().length === 0 | ||
| ? [] | ||
| : parseRecoveryNodes(content.trim().split('\n')); | ||
| : content.trim().split('\n').map((line) => JSON.parse(line) as RecoveryTrajectoryNode); |
There was a problem hiding this comment.
Parse recovery JSONL defensively per line
This parses every persisted line with a bare JSON.parse inside map, so a single torn/truncated JSONL record causes the whole read to throw and fall back to pending in-memory nodes only. In practice that drops valid persisted history from readRecent after any partial append or file corruption, which can misclassify trajectories and erase prior context until new writes rebuild memory state.
Useful? React with 👍 / 👎.
| /** Lifecycle tools that must work even when the CDP connection is broken (e.g., after | ||
| * sleep/wake). Skip session initialization so recovery handlers can always run. */ | ||
| const SKIP_SESSION_INIT_TOOLS = new Set(['oc_stop', 'oc_reap_orphans', 'oc_profile_status', 'oc_session_snapshot', 'oc_session_resume', 'oc_journal', 'oc_run_start', 'oc_run_status', 'oc_run_events', 'oc_run_finish']); | ||
| const SKIP_SESSION_INIT_TOOLS = new Set(['oc_stop', 'oc_reap_orphans', 'oc_profile_status', 'oc_session_snapshot', 'oc_session_resume', 'oc_journal']); |
There was a problem hiding this comment.
Keep oc_run lifecycle tools outside session-init gate
Removing oc_run_start, oc_run_status, oc_run_events, and oc_run_finish from SKIP_SESSION_INIT_TOOLS forces these local run-ledger tools through CDP reconnect/session initialization, even though they do not require a browser session. When Chrome is disconnected or reconnecting, these calls can now fail with reconnection/session-init errors, making run diagnostics unavailable in the exact failure scenarios they are intended to support.
Useful? React with 👍 / 👎.
# Conflicts: # src/mcp-server.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ed8ee9546
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const callId = this.activityTracker!.startCall(toolName, sessionId || 'default', toolArgs, requestId); | ||
| getDashboardState().recordToolStart(sessionId || 'default', toolName, toolArgs, callId); |
There was a problem hiding this comment.
Redact act variables before telemetry writes
This now sends raw toolArgs into activity/dashboard/journal/recording paths, but the prior redactActVariablesForTelemetry guard was removed. For act calls, variables often contain user-provided credentials or tokens under arbitrary keys, so key-based redaction in downstream stores will not reliably catch them and those values can be persisted to local logs/JSONL. Please restore tool-level redaction for act.variables before these telemetry sinks receive args.
Useful? React with 👍 / 👎.
| getDashboardState().recordToolStart(sessionId || 'default', toolName, telemetryToolArgs, callId); | ||
| const callId = this.activityTracker!.startCall(toolName, sessionId || 'default', toolArgs, requestId); | ||
| getDashboardState().recordToolStart(sessionId || 'default', toolName, toolArgs, callId); | ||
| const toolStartTime = Date.now(); |
There was a problem hiding this comment.
Preserve run-harness tool event logging
The tool-call path no longer extracts run_id or appends tool_call_started / tool_call_finished events to the run store, so with run harness enabled a run captures only lifecycle events from oc_run_* and loses per-tool execution history. That breaks oc_run_events for the main diagnostics/benchmark use case where callers start a run and then execute normal tools with the same run_id.
Useful? React with 👍 / 👎.
|
Deferred from this review pass: at 9,119 additions / 882 deletions across 108 files, this PR is too large to safely batch-review and merge in a single session. The stack also bottoms out here (no parent PR), so any consolidation into |
Constraint: PR #1078 review flagged persisted telemetry corruption and inline secret leakage in summarized args. Rejected: Dropping all persisted history on one bad JSONL line | It erases valid trajectory context after torn writes. Confidence: high Scope-risk: narrow Directive: Recovery telemetry remains best-effort but must preserve valid records and redact secrets regardless of key name. Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts --runInBand; npm run build Not-tested: Full CI matrix pending on GitHub
Constraint: PR #1078 was retargeted from its already-merged stacked base to develop and needed current telemetry, hint, and automation paths preserved. Rejected: Keeping the stale mcp-server conflict side | It would drop current develop hint, journal, and telemetry-redaction semantics. Confidence: medium Scope-risk: moderate Directive: Preserve reward scoring as additive passive telemetry; do not alter browser action behavior. Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts --runInBand; npm run build Not-tested: Full CI pending after push
Constraint: PR #1078 was dirty against develop after run-harness, task-ledger, policy, and metrics changes landed. Rejected: Reworking the reward scorer on an obsolete base | stale merge state obscured real CI and Codex review signals. Confidence: high Scope-risk: moderate Directive: Preserve develop's run-harness event logging and automation fallback paths while keeping recovery scoring changes layered on top. Tested: conflict resolved; post-merge recovery tests and build will run before push. Not-tested: CI after push is pending. Co-authored-by: OmX <omx@oh-my-codex.dev>
Constraint: Codex review found act variables could still flow into recovery trajectory telemetry despite the activity and recording paths redacting them. Rejected: Relying on generic trajectory argument summarization | the variables envelope is user-provided and not inherently keyed as a secret. Confidence: high Scope-risk: narrow Directive: Use the same telemetryToolArgs object for recovery trajectory writes that other telemetry paths use. Tested: npm test -- --runTestsByPath tests/recovery/trajectory-ledger.test.ts tests/recovery/reward-scorer.test.ts tests/run-harness/store.test.ts tests/run-harness/tools-budget.test.ts --runInBand; npm run build; git diff --check Not-tested: Live act tool invocation with variables under the full MCP server. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Reviewed for merge readiness in this dedicated pass. This PR adds deterministic recovery reward scoring for trajectory outcomes and related ledger/harness integration so recovery candidates can be ranked from observed execution evidence rather than ad hoc judgment. Earlier bot/reviewer noise is either quota-related or from older reviewed commits; I did not find an unresolved current-head blocker after revalidating the branch.\n\nImprovements/validation performed during this review pass: the branch was checked against the recovery reward scorer, trajectory ledger, run-harness store, and tools-budget coverage, then build, lint, dependency-tier lint, tool-schema lint, and diff checks. The GitHub Actions matrix is now fully green across Ubuntu, macOS, and Windows on Node 18/20/22. Although the PR is large, its scope is internally coherent around deterministic recovery scoring and the current checks cover the main integration seams, so it is safe to merge. |
Merged the stacked recovery candidate-ranking branch onto develop after the recovery scorer base landed, preserving advisory candidate ranking, bounded read-only recovery, policy learning, and current plan verification contracts. Constraint: PR #1088 was stacked on the now-merged #1078 branch and had to be retargeted to develop with conflicts resolved. Rejected: Dropping bounded recovery or current plan-verification evidence | either choice would regress already-reviewed stack behavior. Confidence: high Scope-risk: broad Directive: Keep recovery candidate ranking advisory and keep bounded recovery limited to safe read-only candidates. Tested: npm test -- --runTestsByPath tests/recovery/candidate-ranker.test.ts tests/recovery/policy-learner.test.ts tests/hints/hint-engine.test.ts tests/orchestration/plan-executor-bounded-recovery.test.ts tests/orchestration/plan-cache.test.ts tests/run-harness/tools-budget.test.ts --runInBand; npm run build; npm run lint -- --quiet; npm run lint:tier; npm run lint:tool-schemas; node scripts/lint-tools-capabilities.js; git diff --check Not-tested: Full repository test suite and live browser recovery campaign. Co-authored-by: OmX <omx@oh-my-codex.dev>
Progress / Review status
Auto-refreshed 2026-05-13 — owner comments cleaned up to reduce review noise.
feat/1019-recovery-reward-scorer→feat/1017-recovery-trajectory-ledger57cee81— Merge latest recovery base againOwner comment cleanup: 7 issue + 0 inline review comments deleted. Outstanding feedback from automated/external reviewers above is unchanged.
Summary\n- Adds a pure deterministic RecoveryRewardScorer for progress/failure/blocking/no-progress outcomes.\n- Records reward scores on passive recovery trajectory nodes from #1017.\n- Keeps scoring telemetry-only: no LLM calls and no automatic recovery execution.\n\nCloses #1019.\n\nStacked on #1073 / #1017 because it consumes the recovery trajectory boundary.\n\n## Validation\n- npm test -- --runTestsByPath tests/recovery/reward-scorer.test.ts tests/recovery/trajectory-ledger.test.ts\n- npm run build\n\n## Live OpenChrome verification\n- Not run in this PR session; deterministic scorer unit coverage is included and #1073 adds the fixture route for follow-up live verification.\n\n## Safety\n- No LLM value calls.\n- No tool behavior changes beyond passive reward telemetry.\n- Destructive ungated outcomes receive hard negative classification.