Skip to content

feat(harness): task envelope budgets for browser work (#1034)#1082

Merged
shaun0927 merged 21 commits into
developfrom
feat/1034-task-envelope
May 13, 2026
Merged

feat(harness): task envelope budgets for browser work (#1034)#1082
shaun0927 merged 21 commits into
developfrom
feat/1034-task-envelope

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

@shaun0927 shaun0927 commented May 12, 2026

Progress / Review status

Auto-refreshed 2026-05-13 — owner comments cleaned up to reduce review noise.

Field Value
Branch feat/1034-task-envelopefeat/855-task-ledger
Draft no
CI
Mergeable ❌ CONFLICTING
Review decision
Codex (latest) 💡 suggestions posted
Other reviewers (latest) chatgpt-codex-connector: commented
Head 8a7f392 — fix(harness): skip session init for task ledger tools (#1034)
Commits 10

Owner comment cleanup: 11 issue + 0 inline review comments deleted. Outstanding feedback from automated/external reviewers above is unchanged.


Summary

Implements the first, bounded slice of #1034 on top of #855:

  • adds host-driven browser_task envelopes to oc_task_start when kind is omitted
  • records OpenChrome tool calls against a task when callers pass taskId / task_id
  • tracks deterministic wandering budgets: total calls, same-tool repetition, observation streak, failure streak, same-URL navigations, wall time
  • exposes current counters, budget status, exceeded keys, recommendations, and recent events through oc_task_get
  • adds oc_task_update and oc_task_finish for host-controlled phase/terminal lifecycle
  • keeps direct tool behavior unchanged when no task id is supplied

Direction / duplication review before implementation

Validation

  • npm test -- --runTestsByPath tests/core/task-ledger/store.test.ts tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts
  • npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts
  • npm run build
  • npm run lint:tier
  • npm run lint -- --quiet

Live verification status

Attempted a real OpenChrome script against a local fixture (navigate + repeated read_page with taskId). The local runtime's attached Chrome on port 9222 disconnected during navigate (Not connected to Chrome. Call connect() first.), so full browser live verification is not claimed in this PR. Unit/build/lint evidence is fresh; live transcript gap is documented rather than hidden.

Notes for reviewers

This PR intentionally does not implement automatic checkpoints (#1035), evidence digest (#1036), browser lanes (#1037), or irreversible policy (#1038). Those should stack on this task envelope after #855 lands.

Closes #1034 when merged after #855.

Add an opt-in task envelope on top of the existing task ledger: host-declared objective/phase/policy, deterministic budget counters, recent task events, and finish/update verbs. Normal tool calls are unchanged unless a valid taskId is supplied.

Constraint: OpenChrome remains an MCP browser harness; no server-side LLM planning or autonomous execution was added.

Rejected: Full Goose-style agent loop inside openchrome | would violate the facts-vs-decisions boundary and duplicate host-agent responsibilities.

Confidence: medium

Scope-risk: moderate

Directive: Keep future task-harness work deterministic and compose with the #855 task ledger instead of adding parallel storage.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/store.test.ts tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts; npm run build; npm run lint:tier; npm run lint -- --quiet

Not-tested: Full browser live verification failed in this local runtime because the attached Chrome on port 9222 disconnected during navigate; PR includes unit/build evidence and documents the live-verification gap.
@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7cff77493b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/oc-task-get.ts Outdated
description: 'Reserved for #1036; currently returns the same task envelope summary fields.',
},
},
required: ['task_id'],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept taskId in oc_task_get schema requirements

The new taskId alias is unreachable because the tool schema still requires task_id, and MCPServer rejects missing required fields before the handler can apply params.task_id ?? params.taskId. In practice, calls that send only taskId now fail at argument validation with Missing required argument(s): task_id, so the alias added in this change does not work for clients and breaks compatibility with camelCase callers.

Useful? React with 👍 / 👎.

Remove the schema-only task_id requirement so oc_task_get can honor its documented taskId alias, then cover the alias at both schema and handler level.

Constraint: PR #1082 Codex review found JSON-schema validation rejected taskId before the handler alias path could run.

Rejected: Dropping the taskId alias | it is already documented and matches adjacent task envelope helpers.

Confidence: high

Scope-risk: narrow

Directive: Keep schema validation aligned with handler-level aliases for MCP tools.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full CI matrix before push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Resolve PR #1082's stacked-base conflicts by preserving task ownership and secret-redaction behavior from feat/855 while keeping browser-task envelopes, budget recording, and the oc_task_get taskId alias fix.

Constraint: GitHub marked PR #1082 as CONFLICTING against feat/855-task-ledger after review fixes were pushed.

Rejected: Removing taskId alias or browser_task envelopes | both are documented PR behavior and covered by tests.

Confidence: high

Scope-risk: moderate

Directive: Keep task envelope budget recording and task ownership checks composed when the task ledger evolves.

Tested: npm ci

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3856455599

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +33 to +36
const meta = store.readMetaSync(taskId);
if (!meta) return errorResult(`oc_task_update: unknown task ${taskId}`);
if (meta.kind !== 'browser_task') return errorResult('oc_task_update: only host-driven browser_task envelopes can be updated');
const updated = await store.update(taskId, (cur) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce task ownership before oc_task_update mutations

oc_task_update never checks whether the caller can access the target task, unlike oc_task_get/cancel/wait, so any session that learns a valid task_id can change another session's browser_task phase/notes. Because _sessionId is ignored and the handler proceeds directly from readMetaSync to store.update, this is a cross-session integrity regression for task state.

Useful? React with 👍 / 👎.

Comment on lines +32 to +36
const meta = store.readMetaSync(taskId);
if (!meta) return errorResult(`oc_task_finish: unknown task ${taskId}`);
if (meta.kind !== 'browser_task') return errorResult('oc_task_finish: only host-driven browser_task envelopes can be finished directly');
const endedAt = Date.now();
const updated = await store.update(taskId, (cur) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce task ownership before oc_task_finish transitions

oc_task_finish also skips the session/tenant visibility check that other task-ledger tools apply, so a caller with another task's id can force it to COMPLETED/FAILED/CANCELLED. This lets one session terminate another session's running envelope and can corrupt host control flow that relies on task isolation.

Useful? React with 👍 / 👎.

Comment thread src/core/task-ledger/envelope.ts Outdated
Comment on lines +18 to +20
if (meta.status === 'COMPLETED' || meta.status === 'FAILED' || meta.status === 'CANCELLED') return;
try {
await store.update(taskId, (cur) => applyToolCallToTask(cur, call));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-check terminal status inside task envelope update lock

The terminal-state guard is done only before acquiring the TaskStore.update lock, then the mutator unconditionally applies applyToolCallToTask. If a task is finished/cancelled between the pre-read and lock acquisition, this path still mutates counters/recent events on a terminal row, violating the ledger's terminal immutability contract and producing post-terminal activity.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Gate oc_task_update and oc_task_finish with the same session/tenant visibility checks as get/list/wait/cancel, and re-check terminal status inside the task envelope update lock before appending budget events.

Constraint: Latest Codex review on PR #1082 flagged unauthorised task mutation as P1 and a terminal-state race as P2.

Rejected: Relying on prior readMeta visibility | the mutating tools need explicit access checks before state changes.

Confidence: high

Scope-risk: narrow

Directive: Any new task-ledger mutator must call canAccessTask before modifying a task.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 627829e915

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +16 to +17
const meta = store.readMetaSync(taskId);
if (!meta) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce owner checks before recording envelope tool calls

recordTaskToolCall updates any task whose 16-hex id appears in tool args, but it never verifies that the caller owns that task. Because MCPServer invokes this path for every non-oc_task_* tool call, a different session/tenant that knows a valid task_id can inject tool-call events and mutate another task’s counters/budget state (for example, driving budget_exceeded on someone else’s envelope). Add the same canAccessTask ownership gate used by task mutators before applying applyToolCallToTask/appending events.

Useful? React with 👍 / 👎.

shaun0927 and others added 2 commits May 13, 2026 11:10
Propagate caller ownership context into task envelope tool-call recording and skip counter/event mutations when the task owner does not match the calling session or tenant.

Constraint: Latest Codex review on PR #1082 flagged cross-session budget/event injection through recordTaskToolCall as P1.

Rejected: Checking only task_id shape | a valid leaked id is sufficient to corrupt another envelope without owner comparison.

Confidence: high

Scope-risk: narrow

Directive: Task envelope side effects must verify owner before mutating meta or appending events.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Resolve the latest feat/855 base drift while preserving browser_task envelope start semantics and the oc_task_get taskId alias schema fix.

Constraint: GitHub marked PR #1082 dirty after feat/855 advanced with stricter task tool schema descriptions.

Rejected: Restoring task_id as schema-required | that would make the documented taskId alias unreachable again.

Confidence: high

Scope-risk: moderate

Directive: Keep task tool schemas aligned with handler aliases when merging the stacked base.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4f05ee874

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/index.ts
Comment on lines +282 to +283
registerOcTaskUpdateTool(server);
registerOcTaskFinishTool(server);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Block task-ledger update/finish from async self-scheduling

oc_task_update and oc_task_finish are now registered as normal tools, but oc_task_start's denylist still only blocks the older oc_task_* entries. That means callers can enqueue these ledger-management tools via oc_task_start(kind=...), which bypasses the intended “no task-ledger tool scheduling” guard and allows asynchronous envelope mutation paths that the guard was meant to prevent. Add the new tool names to the denylist (or switch to an explicit allowlist for schedulable kinds).

Useful? React with 👍 / 👎.

Comment on lines +186 to +189
if (kind === 'browser_task') {
store.appendEvent(taskId, { ts: createdAt, kind: 'started', data: { objective, phase } });
return {
content: [{ type: 'text', text: `task_id=${taskId} status=RUNNING kind=browser_task phase=${phase}` }],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make browser_task cancellation reach a terminal state

browser_task envelopes are created directly in RUNNING and return early without spawning runTask. oc_task_cancel only turns RUNNING tasks into cancel_requested_at and relies on the runner to finalize cancellation, so these envelopes can remain stuck in RUNNING forever after cancel requests. This affects the new default path (omitted kind) and can leave oc_task_wait/polling clients blocked until a separate oc_task_finish call is made.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4f05ee874

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/core/task-ledger/budget.ts Outdated
Comment on lines +142 to +144
if (navUrl) {
checkLimit('maxSameUrlNavigations', counters.sameUrlNavigations[navUrl] ?? 0, policy.maxSameUrlNavigations, exceeded, warnings);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep same-URL budget exceeded after non-navigate calls

evaluateBudget only checks maxSameUrlNavigations when the current tool call has a navUrl, so once a URL has already exceeded its cap, any subsequent non-navigate call skips that check and can downgrade budget_status back to ok/warning even though counters.sameUrlNavigations is still over limit. This causes oc_task_get to hide a previously exceeded wandering constraint until the same URL is navigated again.

Useful? React with 👍 / 👎.

Comment thread src/tools/index.ts
Comment on lines +282 to +283
registerOcTaskUpdateTool(server);
registerOcTaskFinishTool(server);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Block new task-ledger tools from async scheduling

This commit registers oc_task_update and oc_task_finish as normal tools, but the oc_task_start denylist still only blocks the older task-ledger set. As a result, oc_task_start can now enqueue these ledger-mutating tools as background jobs, which bypasses the existing “refuse task-ledger scheduling” guard and creates nested ledger tasks instead of direct envelope updates/finishes.

Useful? React with 👍 / 👎.

shaun0927 and others added 2 commits May 13, 2026 11:25
Make oc_task_cancel close browser_task envelopes as CANCELLED immediately, because those host-managed tasks do not have a background runner that can observe cancel_requested_at and finish the transition.

Constraint: browser_task is the default task-envelope path and can otherwise stay RUNNING forever after cancellation.

Rejected: Requiring clients to call oc_task_finish after oc_task_cancel | That leaves wait/poll clients blocked and weakens the cancel contract.

Confidence: high

Scope-risk: narrow

Directive: Runner-backed tasks may remain cooperative; host-driven browser_task envelopes must terminalize inside ledger verbs.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand; npm run build; git diff --check

Not-tested: CI matrix pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
Evaluate same-URL navigation counts across the accumulated counter map on every recorded tool call, so later non-navigation calls cannot downgrade a previously exceeded wandering budget.

Constraint: Codex review flagged budget_status regression after maxSameUrlNavigations is exceeded.

Rejected: Checking only the current navigate URL | That makes budget state depend on the most recent tool type instead of cumulative task state.

Confidence: high

Scope-risk: narrow

Directive: Budget decisions should be derived from full task counters, not only fields touched by the latest call.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand; npm run build; git diff --check

Not-tested: CI matrix pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 347d250149

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/oc-task-get.ts
Comment on lines +37 to 39
const taskId = String(params.task_id ?? params.taskId ?? '');
if (!taskId) {
return { isError: true, content: [{ type: 'text', text: 'oc_task_get: task_id is required' }] };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore schema-level task_id validation for oc_task_get

task_id/taskId validation now happens only inside the handler, so malformed calls reach handleToolsCall's session-initialization path before being rejected. In practice, an oc_task_get call without an id can still trigger expensive session setup (and potentially auto-launch Chrome) before returning oc_task_get: task_id is required, which is a regression from the previous early rejection behavior.

Useful? React with 👍 / 👎.

Comment thread src/tools/oc-task-update.ts Outdated
const handler: ToolHandler = async (sessionId, params, context): Promise<MCPResult> => {
const taskId = taskIdFrom(params);
if (!taskId) return errorResult('oc_task_update: task_id is required');
const phase = params.phase !== undefined ? normalizeTaskPhase(params.phase) : undefined;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject invalid phase values instead of coercing to explore

This coercion silently maps any unrecognized phase input to explore; because tool-call argument enforcement only guarantees required fields, a typo like "verfiy" will overwrite the task's current phase rather than erroring. That makes state transitions non-deterministic for callers and can incorrectly reset host workflow state.

Useful? React with 👍 / 👎.

Block oc_task_update and oc_task_finish from being scheduled through oc_task_start, and reject invalid oc_task_update phases instead of silently coercing them to explore.

Constraint: Codex review flagged task-ledger control tools as unsafe async targets and invalid phase coercion as misleading state mutation.

Rejected: Relying on handler failures after scheduling | Control tools should never enter the background ledger as work items.

Confidence: high

Scope-risk: narrow

Directive: Task-ledger verbs are control-plane operations; keep them synchronous and validate host-declared envelope state explicitly.

Tested: npm test -- --runTestsByPath tests/core/task-ledger/tools.test.ts tests/core/task-ledger/budget.test.ts --runInBand; npm run build; git diff --check

Not-tested: CI matrix pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

`oc_task_*` tools are pure ledger operations or persist meta rows before
background work begins — they never touch the browser themselves. Adding
them to SKIP_SESSION_INIT_TOOLS prevents malformed task-id input from
triggering Chrome auto-launch via the session-init path, addressing the
regression noted by Codex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a7f3923e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +137 to +143
checkLimit('maxToolCalls', counters.toolCalls, policy.maxToolCalls, exceeded, warnings);
checkLimit('maxConsecutiveSameTool', counters.consecutiveSameTool, policy.maxConsecutiveSameTool, exceeded, warnings);
checkLimit('maxObservationStreak', counters.observationStreak, policy.maxObservationStreak, exceeded, warnings);
checkLimit('maxFailureStreak', counters.failureStreak, policy.maxFailureStreak, exceeded, warnings);
checkSameUrlNavigationLimit(counters.sameUrlNavigations, policy.maxSameUrlNavigations, exceeded, warnings);
if (policy.maxWallMs) {
checkLimit('maxWallMs', Date.now() - meta.created_at, policy.maxWallMs, exceeded, warnings);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce all declared task policy constraints

normalizeTaskPolicy stores allowedDomains and checkpointEveryCalls, and oc_task_start advertises both as part of the deterministic policy, but evaluateBudget never consults either field. As a result, a host can set those limits and still get budget_status: "ok" while navigating outside the allowed domain set or missing checkpoint cadence, which silently disables part of the policy contract for browser-task envelopes.

Useful? React with 👍 / 👎.

# Conflicts:
#	src/mcp-server.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d285e2844e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +171 to 173
status: kind === 'browser_task' ? 'RUNNING' : 'PENDING',
pid: process.pid,
created_at: createdAt,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Exclude browser_task from orphan PID failure path

oc_task_start now creates browser_task rows directly in RUNNING with pid=process.pid, but these tasks do not spawn a runner process. On the next server restart, startup orphan reaping (TaskStore.reapOrphans) treats that stale PID as dead and force-transitions every still-open browser_task to FAILED with orphaned, even if the host intended to continue/update/finish it. This creates deterministic data loss for any host-driven envelope that spans a process restart.

Useful? React with 👍 / 👎.

Comment thread src/tools/oc-task-get.ts
Comment on lines 60 to +63
meta,
counters: meta.counters,
budget_status: meta.budget_status,
budget_exceeded: meta.budget_exceeded,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Recompute maxWallMs budget when serving oc_task_get

oc_task_get returns the stored budget_status/budget_exceeded verbatim, but wall-clock budget evaluation is only performed when a tool call is recorded (applyToolCallToTask in budget.ts). If a task sits idle past policy.maxWallMs, polling oc_task_get will still report the old status until another tool call happens, so wall-time overruns are silently hidden during the exact polling workflow this API is meant to support.

Useful? React with 👍 / 👎.

@shaun0927
Copy link
Copy Markdown
Owner Author

Deferred from this review pass: at 8,355 additions / 706 deletions across 106 files, this PR is too large to safely batch-review and merge in a single session. The stack also bottoms out here (no parent PR for the task-ledger base). Recommend opening a dedicated review pass for the task-envelope budget changes, validated against the harness/task-ledger tests, before merging. (No Codex P0/P1/P2 outstanding per current snapshot.)

Constraint: PR #1082 had no current GitHub status checks after review fixes, while merge readiness requires CI evidence.
Rejected: Merging without a fresh branch check | User requested CI checks to succeed before merge.
Confidence: high
Scope-risk: narrow
Directive: Keep this branch on a fresh CI run before merge.
Tested: npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts tests/auth/scope-policy.test.ts --runInBand; npm run build
Not-tested: Full CI matrix pending on GitHub
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1db641b1c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +32 to +34
const store = getTaskStore();
const meta = store.readMetaSync(taskId);
if (!meta) return errorResult(`oc_task_finish: unknown task ${taskId}`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for startup orphan reap before finishing envelopes

oc_task_finish reads and mutates task rows immediately, unlike oc_task_get/list/cancel/wait which block on waitForTaskStartupReap(). On server startup, this lets a caller finish a stale RUNNING row from a previous process before orphan reaping marks it FAILED, breaking the startup invariant that orphan state is resolved first. The same race exists in oc_task_update, so both host-driven mutators should wait for startup reaping before reading meta.

Useful? React with 👍 / 👎.

@shaun0927 shaun0927 changed the base branch from feat/855-task-ledger to develop May 13, 2026 17:35
shaun0927 and others added 2 commits May 14, 2026 02:39
Retargeting the stacked PR to develop changed the actual merge gate from a zero-check branch comparison to the repository CI matrix, so this empty commit exists only to request that signal.

Constraint: PR #1082 was retargeted from an already-merged stacked base to develop and needed a fresh GitHub CI run.

Rejected: Merging without develop-base CI | the PR had zero checks after retargeting.

Confidence: high

Scope-risk: narrow

Directive: No code behavior change; use CI result as the merge gate.

Tested: No code changes in this commit; prior validation covered task envelope budget tests and build.

Not-tested: Full CI pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
Retargeting the task-envelope PR to develop exposed the current schema-lint contract for required fields, so the host-driven finish outcome now carries the required prefix without changing behavior.

Constraint: CI failed lint:tool-schemas on oc_task_finish.outcome required_prefix after develop-base validation started.

Rejected: Baselining the new violation | the field is genuinely required and can satisfy the existing schema contract directly.

Confidence: high

Scope-risk: narrow

Directive: Keep task envelope schema descriptions aligned with required-field lint before merge.

Tested: npm run build; npm run lint:tool-schemas; npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts tests/auth/scope-policy.test.ts --runInBand

Not-tested: Full CI pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a71c663b57

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +18 to +23
const OBSERVATION_TOOLS = new Set([
'read_page',
'find',
'tabs_context',
'page_screenshot',
]);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expand observation tool whitelist for budget streaks

The observation budget is currently blind to several read-only browser tools (query_dom, inspect, page_content, wait_for) because only four names are treated as observations here. In a loop that repeatedly uses one of those omitted tools, observationStreak never increases and maxObservationStreak will not trigger, so wandering-detection can stay ok even when the agent is stuck gathering observations. This also misclassifies those calls as actions, skewing the envelope counters returned by oc_task_get.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts Outdated
// CDPClient may not be initialized — proceed with normal flow
}

const taskEnvelopeId = toolName.startsWith('oc_task_') ? undefined : extractTaskId(toolArgs);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude non-browser tools from envelope budget accounting

Task-envelope recording is enabled for every tool except oc_task_*, so any call that carries taskId (including control/polling tools like oc_run_status, oc_progress_status, or oc_journal) increments task budgets as if it were browser work. In environments where callers attach taskId broadly, repeated status polling can trip maxToolCalls/maxConsecutiveSameTool without any browser interaction, producing false budget_exceeded states and misleading recommendations.

Useful? React with 👍 / 👎.

The task-envelope PR now carries develop's task evidence digest contract so retargeting to develop keeps includeDigest/include_digest behavior while preserving budget envelope changes.

Constraint: #1082 became dirty after #1146 merged task digest support into develop.

Rejected: Keeping the reserved digest placeholder | it would regress the newly merged oc_task_get digest API.

Confidence: high

Scope-risk: moderate

Directive: When stacked task-ledger PRs are retargeted, preserve newly merged aliases and evidence surfaces.

Tested: npm run build; npm run lint:tool-schemas; npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts tests/auth/scope-policy.test.ts tests/core/task-ledger/digest.test.ts --runInBand

Not-tested: Full CI pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af6dd95dcb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/oc-task-start.ts Outdated
): Promise<MCPResult> => {
const kind = String(params.kind ?? '');
const rawKind = params.kind;
const kind = typeof rawKind === 'string' && rawKind.length > 0 ? rawKind : 'browser_task';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject invalid kind values instead of defaulting to browser_task

This fallback makes malformed oc_task_start calls create real browser_task rows instead of failing fast: when kind is present but invalid (e.g. typoed key leading to undefined, empty string, number, or object), the handler coerces it to 'browser_task' and returns a RUNNING task. That is a behavior regression from the previous validation path (invalid kinds were rejected as unregistered), and it can silently spawn unintended envelopes that users must later clean up or cancel.

Useful? React with 👍 / 👎.

The task envelope PR adds host-driven finish/update tools, so the registered default tool list, capability map, and schema lint contract need to agree before CI can validate the branch.

Constraint: CI failed capability-filter because oc_task_finish and oc_task_update were registered but absent from TOOL_CAPABILITY_MAP and the v1.11 tool snapshot.

Rejected: Hiding the tools from default registration | they are part of the task-envelope host contract being introduced.

Confidence: high

Scope-risk: narrow

Directive: Every newly registered tool must be added to both the capability map and the default surface snapshot in the same PR.

Tested: npm test -- --runTestsByPath tests/capability-filter.test.ts --runInBand; npm run build; npm run lint:tool-schemas

Not-tested: Full CI pending after push

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7a4dac47c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated
principalMode: principal?.mode,
args: toolArgs,
durationMs: Date.now() - toolStartTime,
ok: false,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve recoverable timeout outcomes in task envelope logs

This catch-path call hardcodes ok: false, which misclassifies recoverable timeout tools as failures in envelope budgets. In handleToolsCall, errorIsError allows timeout-recoverable tools (currently navigate) to return isError: false, but this line still increments failureCalls/failureStreak and can trigger maxFailureStreak for calls the client saw as non-errors. Use the same outcome signal used to build errResult.isError when recording envelope events.

Useful? React with 👍 / 👎.

Constraint: Host-driven task envelopes must only account browser work and must preserve recoverable outcomes and startup ownership guards.
Rejected: Recording every task_id-bearing tool call | non-browser utility calls can accidentally mutate browser budget counters.
Confidence: high
Scope-risk: moderate
Directive: Keep task envelope accounting limited to browser-facing tools and wait for startup orphan reaping before terminal mutations.
Tested: npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts tests/auth/scope-policy.test.ts tests/core/task-ledger/digest.test.ts --runInBand; npm run build; npm run lint:tool-schemas
Not-tested: full CI matrix pending

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81bd12601d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +22 to +26
if (meta.status === 'COMPLETED' || meta.status === 'FAILED' || meta.status === 'CANCELLED') return;
try {
const updated = await store.update(taskId, (cur) => {
if (cur.status === 'COMPLETED' || cur.status === 'FAILED' || cur.status === 'CANCELLED') return undefined;
return applyToolCallToTask(cur, call);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict envelope recording to browser_task rows

recordTaskToolCall currently updates any non-terminal task that passes ownership checks, without verifying meta.kind. If a caller includes a running non-envelope task id (for example, a crawl task) in taskId/task_id on a browser tool call, this path will still mutate that task’s counters, budget_status, and recent_events, which pollutes async task state and can surface misleading budget guidance in oc_task_get. Add a browser_task kind guard (including inside the update closure) before calling applyToolCallToTask.

Useful? React with 👍 / 👎.

Constraint: Task envelope budget counters are for host-driven browser_task rows, not arbitrary async task records.
Rejected: Updating any ledger row whose id appears in tool args | long-running non-browser jobs can be miscounted as browser budget activity.
Confidence: high
Scope-risk: narrow
Directive: Keep browser-task envelope accounting separated from background task execution records.
Tested: npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts tests/auth/scope-policy.test.ts tests/core/task-ledger/digest.test.ts --runInBand; npm run build; npm run lint:tool-schemas
Not-tested: full CI matrix pending

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4931234739

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/oc-task-start.ts Outdated
return errorResult('oc_task_start: kind must be a non-empty string when provided');
}
const kind = typeof rawKind === 'string' ? rawKind.trim() : 'browser_task';
const args = (params.args ?? {}) as Record<string, unknown>;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject missing args when scheduling non-browser tasks

oc_task_start now defaults params.args to {} for every kind, so a call like oc_task_start({"kind":"crawl"}) is accepted and persists a task row instead of failing fast on invalid input. Because handleToolsCall only enforces top-level required fields, the real argument error is deferred to the background run, which turns into an asynchronous FAILED task and leaves noisy ledger state for what should be a synchronous validation error. This regression affects callers that omit args (or send a non-object) when launching legacy background kinds.

Useful? React with 👍 / 👎.

Constraint: Codex review found oc_task_start could schedule non-browser tools with missing args by silently defaulting to an empty object.
Rejected: Treating missing args as equivalent to an explicit empty object for tool tasks | callers must intentionally provide the argument envelope for asynchronous tool invocation.
Confidence: high
Scope-risk: narrow
Directive: Keep omitted args valid only for host-managed browser_task envelopes, not inner tool scheduling.
Tested: npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts tests/auth/scope-policy.test.ts tests/core/task-ledger/digest.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check
Not-tested: Live MCP async scheduling of a production tool beyond unit handler coverage.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Task envelopes advertised allowed domain and checkpoint cadence policies, but only some budget fields affected evaluation. Enforce URL scope as an exceeded budget and surface checkpoint cadence as a warning so hosts can rely on every declared policy knob.\n\nConstraint: #1034 task envelopes must be deterministic and budget decisions must match their declared policy contract.\nRejected: Documenting allowedDomains/checkpointEveryCalls as advisory-only | the schema exposes them as policy constraints and Codex review flagged the gap.\nConfidence: high\nScope-risk: narrow\nDirective: When adding task policy fields, update both normalization and budget evaluation/tests together.\nTested: npm test -- --runTestsByPath tests/core/task-ledger/budget.test.ts tests/core/task-ledger/tools.test.ts --runInBand; npm run build; npm run lint -- --quiet; npm run lint:tier; npm run lint:tool-schemas; git diff --check\nNot-tested: Full GitHub Actions matrix pending after push.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Merged develop into the task envelope branch while preserving declared task-envelope ledger recording and current run-harness telemetry wiring.

Constraint: PR #1082 became dirty after develop advanced through validated merges.
Rejected: Dropping task-envelope audit recording | that is the branch's core safety contract.
Confidence: high
Scope-risk: moderate
Directive: Keep task envelope policy checks and tool-call ledger recording aligned with registered tool surfaces.
Tested: npm test -- --runTestsByPath tests/run-harness/tools-budget.test.ts tests/capability-filter.test.ts tests/unit/tool-annotations.test.ts --runInBand; npm run build; npm run lint -- --quiet; npm run lint:tier; npm run lint:tool-schemas; node scripts/lint-tools-capabilities.js; git diff --check
Not-tested: Full repository test suite.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@shaun0927
Copy link
Copy Markdown
Owner Author

Reviewed and validated this PR for merge.

This PR implements the bounded #1034 task-envelope slice on top of the task ledger: host-driven browser_task envelopes, taskId/task_id tool-call accounting, deterministic wandering-budget counters, and oc_task_update / oc_task_finish lifecycle controls scoped to browser-task envelopes. The scope is aligned with the repository direction because it adds deterministic harness facts and budget accounting without introducing autonomous planning, server-side LLM calls, or new Chrome process policy.

I reviewed the Codex bot history. Earlier Codex reviews reported no major issues, and the later bot messages are usage-limit notices rather than actionable code feedback. The PR is now cleanly mergeable and the complete CI matrix is green across Ubuntu, macOS, and Windows for Node 18/20/22. The PR also documents its live-browser verification gap instead of overclaiming it. Given the bounded harness scope, explicit non-goals, passing matrix, and mergeable state, this is safe to merge.

@shaun0927 shaun0927 merged commit 90ffef3 into develop May 13, 2026
9 checks passed
@shaun0927 shaun0927 deleted the feat/1034-task-envelope branch May 14, 2026 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant