test(benchmark): standardize agent-loop performance matrix by shaun0927 · Pull Request #1079 · shaun0927/openchrome

shaun0927 · 2026-05-12T15:09:05Z

Progress / Review status

Auto-refreshed 2026-05-13 — owner comments cleaned up to reduce review noise.

Field	Value
Branch	`test/970-benchmark-matrix` → `develop`
Draft	no
CI	✅ all 9 checks passing
Mergeable	✅ MERGEABLE
Review decision	—
Codex (latest)	💡 suggestions posted
Other reviewers (latest)	chatgpt-codex-connector: commented
Head	`b0a9866` — Merge develop into test/970-benchmark-matrix
Commits	15

_{Owner comment cleanup: 11 issue + 0 inline review comments deleted. Outstanding feedback from automated/external reviewers above is unchanged.}

Summary

Adds a standardized benchmark matrix for cold start, DOM/AX read_page, DOM delta, interactive discovery, click/fill actions, screenshots, agent-loop, and 1/5/20 parallel-tab smoke scenarios.
Extends benchmark stats with avg/p50/min/max/stddev latency, response chars, estimated output tokens, screenshot bytes, and Node RSS.
Adds --category, --runs, and --json support to the existing benchmark runner so optimization PRs can produce comparable JSON.

Closes #970

Fit / non-overlap review

Fits OpenChrome's optimization direction by making speed/token/memory comparisons repeatable before further tuning.
Does not overlap with open PR feat(core): mutation-aware snapshot cache for read_page / find / query_dom (closes #879) #929 cache, feat(core): formalize read_page refs as canonical interaction entrypoint (#831) #948 canonical refs, or feat(core): 2-stage fetch with output handles for large-output tools (closes #887) #938 output handles; this PR is measurement-only and does not alter browser behavior.
Keeps real Chrome/full benchmark usage manual/advisory to avoid flaky CI hard gates.

Tests

./node_modules/.bin/jest tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand
./node_modules/.bin/tsc -p tsconfig.json --noEmit
./node_modules/.bin/ts-node tests/benchmark/run.ts --category agent-loop --runs 1 --json

OpenChrome real-validation checklist

After merge, validate with OpenChrome itself:

Build OpenChrome: npm run build.
Run smoke JSON: npm run benchmark -- --category agent-loop --runs 3 --json.
Confirm the JSON task includes samples/runs, meanWallTimeMs, p50WallTimeMs, minWallTimeMs, maxWallTimeMs, stddevWallTimeMs, meanResponseChars, meanEstimatedOutputTokens, and meanNodeRssBytes.
Run a full/manual scenario where feasible: npx ts-node tests/benchmark/run-parallel.ts --mode real --category batch-js --runs 1.
Paste the generated JSON/Markdown summary into future optimization PRs for before/after comparison.

Add a reusable benchmark matrix covering read_page, delta, screenshot, action, cold-start, agent-loop, and parallel-tab smoke scenarios, plus JSON stats for latency distribution, payload size, token estimate, screenshot bytes, and Node RSS. Constraint: CI/default paths remain deterministic stub runs; real Chrome benchmarking stays manual/advisory. Rejected: Hard CI performance gates | environment variance would make current benchmark results flaky. Confidence: high Scope-risk: moderate Directive: Add new optimization PR scenarios through the matrix helper so before/after reports stay comparable. Tested: ./node_modules/.bin/jest tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand Tested: ./node_modules/.bin/tsc -p tsconfig.json --noEmit Tested: ./node_modules/.bin/ts-node tests/benchmark/run.ts --category agent-loop --runs 1 --json Not-tested: Full real Chrome benchmark matrix; left as manual PR checklist because it can be environment-dependent.

gemini-code-assist · 2026-05-12T15:09:57Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

qodo-code-review · 2026-05-12T15:09:57Z

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d36b0cc95c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-12T15:15:35Z

+        { tool: 'act', args: { tabId: 'tab1', action: 'click', ref: 'ref_1' } },
+        { tool: 'act', args: { tabId: 'tab1', action: 'fill', ref: 'ref_2', text: 'benchmark' } },


Send instruction to act steps in matrix scenarios

The matrix defines act steps with { action, ref, text }, but the act tool contract requires instruction (and tabId) as its input (src/tools/act.ts, required: ['tabId', 'instruction']). As written, these calls will return tool errors in real mode, so the agent-loop benchmark path advertised in this commit measures error responses instead of real click/fill latency.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-12T15:15:35Z

+          const result = await adapter.callTool(step.tool, step.args);
+          measureCall(result, step.args, counters);


Fail matrix runs when a tool call returns isError

Each step result is consumed only for metrics, and the run is returned as success: true even when MCP returns an error payload. OpenChromeRealAdapter.callTool explicitly maps server/tool failures to { isError: true } (tests/benchmark/adapters/openchrome-real-adapter.ts), so ignoring that flag silently marks broken scenarios as successful and corrupts benchmark comparisons.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-12T15:15:35Z

+      name: 'cold-start-first-tab',
+      category: 'cold-start',
+      description: 'Session/server init to first usable tab',
+      steps: [{ tool: 'open_tabs', args: {} }],


Use a registered tool name for cold-start tab creation

The cold-start scenario calls open_tabs, but there is no registered MCP tool with that name in the repo tool registry (tab creation is registered as tabs_create in src/tools/tabs-create.ts). In real mode this scenario will always produce an error response, so the standardized cold-start metric cannot execute as intended.

Useful? React with 👍 / 👎.

Constraint: Codex P1 found matrix scenarios used non-contract tool arguments and still reported success on isError responses.\nRejected: Leaving real-mode failures as metrics-only samples | that corrupts benchmark comparisons and hides broken scenarios.\nConfidence: high\nScope-risk: narrow\nDirective: Matrix scenarios must use registered OpenChrome tool contracts and fail when the MCP layer reports isError.\nTested: npm ci; npx jest tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; git diff --check\nNot-tested: Real Chrome benchmark run.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector · 2026-05-13T00:47:06Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10cedd8e04

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T00:51:22Z

+      name: 'warm-read-page-dom',
+      category: 'read-page',
+      description: 'Warm DOM read_page latency and payload size',
+      steps: [{ tool: 'read_page', args: { tabId: 'tab1', mode: 'dom' } }],


Create and reuse real tab IDs in matrix scenarios

The matrix steps hard-code tabId: 'tab1' (and tab2...tab20) instead of using the tabId returned by tabs_create, so real-mode runs will fail once they hit read_page/act/page_screenshot on non-existent tabs. tabs_create returns runtime-generated target IDs (not stable tabN aliases), and read_page explicitly errors when the tab is not found, so most scenarios in this matrix will benchmark tool errors rather than the intended workflows.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T00:51:22Z


 async function main(): Promise<void> {
  const ciMode = process.argv.includes('--ci');
+  const jsonMode = process.argv.includes('--json') || ciMode;


Keep --json mode output strictly JSON

Enabling --json now routes into jsonMode, but this path still emits non-JSON text (progress logs before serialization and status text after it), which makes stdout invalid for direct JSON parsing despite the branch being documented as machine-readable output. This breaks automation that expects --json to produce only JSON bytes.

Useful? React with 👍 / 👎.

Resolve the benchmark CLI/test conflicts by preserving develop's injectable CLI output path while keeping this PR's category and run-count options for the standardized matrix. Constraint: GitHub marked PR #1079 as CONFLICTING against develop after review fixes were pushed. Rejected: Rebasing away branch history | a merge commit preserves the existing PR review context and only resolves base drift. Confidence: high Scope-risk: moderate Directive: Keep benchmark CLI stdout JSON-only in CI while adding future CLI filters. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand Tested: npm ci Tested: npm run build Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9bae99613

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:09:23Z

+  if (category) {
+    for (const task of createMatrixTasks({ category })) {
+      runner.addTask(task);


Reject unknown benchmark categories

When --category is provided, createMatrixTasks can return an empty list for typos or unsupported values, but main still runs and exits successfully with empty reports. In CI this produces a false-green benchmark run (no scenarios executed, no regression possible), which can silently corrupt before/after performance comparisons.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T01:09:23Z

+        for (const step of scenario.steps) {
+          const result = await adapter.callTool(step.tool, step.args);


Execute parallel-tab scenarios concurrently

The matrix defines parallel-tabs-* scenarios, but task execution awaits each step in a single for loop, so all read_page calls run serially. This measures cumulative sequential latency instead of parallel tab behavior, making the reported 1/5/20 tab results non-comparable for concurrency regressions.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9bae99613

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:14:34Z

+  if (category) {
+    for (const task of createMatrixTasks({ category })) {
+      runner.addTask(task);
+    }


Reject unknown benchmark categories before running

When --category does not match any matrix scenario, this branch adds no tasks and the command still runs both adapters, emitting an empty report (all-zero summary) and a passing regression result. In practice, a typo or stale category name can silently look like a successful benchmark run and corrupt CI/perf comparisons. Please fail fast if createMatrixTasks({ category }) returns an empty list, with a clear error and non-zero exit.

Useful? React with 👍 / 👎.

Fix the develop-merge CI path for PR #1079 by awaiting the introspection stdout write before exit and aligning crawl tool required-field descriptions with the schema lint baseline. Constraint: Remote CI failed in lint:tool-schemas on the PR merge commit before running the benchmark matrix tests. Rejected: Ignoring this as base noise | required checks validate the actual merge commit, so the PR must carry the compatibility fix. Confidence: high Scope-risk: narrow Directive: Large machine-readable stdout output used in pipes must be flushed before process.exit. Tested: npm run build Tested: npm run lint:tool-schemas Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

Create benchmark tabs through tabs_create, substitute the returned tab ids into scenario steps, reject unknown categories, and check isError before accepting step results. Parallel-tab scenarios now dispatch their read steps concurrently after tab setup. Constraint: Latest Codex review on PR #1079 flagged hard-coded tab ids, empty category selections, and weak isError handling as P1 risks. Rejected: Keeping synthetic tab1 placeholders in real-mode runs | real adapters require server-issued tab ids to exercise the intended path. Confidence: high Scope-risk: moderate Directive: Benchmark scenarios may use placeholders in definitions, but runtime execution must resolve them to concrete tool outputs. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand Tested: npm run build Tested: npm run lint:tool-schemas Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e2fceebb9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:23:26Z

+  if (category) {
+    for (const task of createMatrixTasks({ category })) {
+      runner.addTask(task);
+    }


Reject unknown benchmark categories

When --category does not match any scenario, createMatrixTasks({ category }) returns an empty array and this loop adds zero tasks, but the CLI still runs and exits successfully with an empty report. That silently turns typos or stale category names into false-green benchmark runs and can corrupt CI/perf comparisons because no workload actually executed. Please fail fast when no tasks are selected for a provided category.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T01:30:05Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99eaf25ae6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T01:31:34Z

+        { tool: 'act', args: { tabId: 'tab1', instruction: 'click ref_1' } },
+        { tool: 'act', args: { tabId: 'tab1', instruction: 'type benchmark into ref_2' } },


Replace invalid act ref targets in matrix scenarios

These scenarios hardcode act instructions like click ref_1 / type ... ref_2, but the matrix runner creates fresh tabs at about:blank and never performs any ref-id resolution before calling act. In act, targets are resolved via AX text matching (resolveElementsByAXTree(normalizeQuery(query))) and unresolved targets are treated as hard failures (isError: true), so action (and similarly agent-loop) will fail consistently in real runs instead of measuring latency. This makes the standardized matrix unreliable for production benchmarking.

Useful? React with 👍 / 👎.

Load action scenarios with a tiny local data-page fixture and target controls by accessible text instead of synthetic ref ids, so real-mode act calls benchmark the intended action path rather than deterministic unresolved-ref errors. Constraint: Latest Codex review on PR #1079 flagged hard-coded ref_1/ref_2 act instructions as P1 because fresh tabs never resolve those refs. Rejected: Keeping about:blank for action scenarios | act needs concrete controls to measure useful latency. Confidence: high Scope-risk: narrow Directive: Benchmark action fixtures must create their own resolvable DOM/AX targets before invoking act. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand Tested: npm run build Tested: npm run lint:tool-schemas Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector · 2026-05-13T02:12:06Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector

💡 Codex Review

openchrome/tests/benchmark/run.ts

Lines 44 to 45 in 06b4d58

    
           const ciMode = args.includes('--ci'); 
        
           const modeIndex = args.indexOf('--mode');

Parse and honor the --json CLI flag

The new CLI path never reads --json, so main falls back to interactive mode unless --ci is set. That means ts-node tests/benchmark/run.ts --category agent-loop --runs 1 --json (the command documented in this commit) still writes progress text and a formatted report instead of machine-readable JSON, breaking automation that expects JSON output for baseline comparisons.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T02:13:20Z

+        for (const placeholder of tabPlaceholders(scenario)) {
+          const args = { url: setupUrlForScenario(scenario) };
+          const result = await adapter.callTool('tabs_create', args);


Close matrix-created tabs after each benchmark run

Each matrix run creates one or more tabs via tabs_create for placeholders, but those tabs are never closed before returning. Because BenchmarkRunner.run keeps one adapter/session alive across all tasks and repetitions, tabs accumulate across runs (especially parallel-tabs-20), which increases memory usage and can skew later latency measurements, making matrix results non-comparable within the same benchmark invocation.

Useful? React with 👍 / 👎.

Resolve develop drift in the introspection path by keeping flushed JSON output and returning from the CLI action without process.exit. Constraint: GitHub marked PR #1079 dirty after develop advanced. Rejected: Reintroducing immediate process.exit | that previously truncated piped tool-list JSON in CI. Confidence: high Scope-risk: narrow Directive: Introspection output must be fully written before the command terminates. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand Tested: npm run build Tested: npm run lint:tool-schemas Tested: git diff --check Not-tested: Full remote CI matrix pending after push. Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1771784387

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T02:24:32Z

+          measureCall(result, args, counters);
+          const payload = responsePayloadSize(result);


Exclude tab setup from warm scenario measurements

These counters are updated while creating placeholder tabs, so scenarios like warm-read-page-*, interactive, action, and screenshot include tabs_create overhead in toolCallCount, payload sizes, and wallTimeMs. That makes the reported numbers inconsistent with the scenario definitions (which list only the measured step(s)) and contaminates warm-path comparisons with setup cost that should be isolated to cold-start.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T02:24:32Z

+        if (scenario.category === 'parallel-tabs') {
+          await Promise.all(scenario.steps.map(runStep));


Wait for parallel reads to settle before returning failure

Using Promise.all here causes fail-fast behavior: if one read_page rejects, the task returns immediately via catch while the other tab reads keep running in the background. In benchmark loops this can overlap with the next run on the same adapter and also drops late results from counters, producing inconsistent metrics specifically when a subset of parallel calls fails.

Useful? React with 👍 / 👎.

Replace synthetic tabId literals in scenario definitions with logical tab aliases so matrix execution can only pass tab IDs returned by tabs_create. Add a CLI guard test for unknown categories so typos fail before adapters run. Constraint: Codex review flagged hard-coded tab IDs and silent empty category runs as P1 merge blockers. Rejected: Keeping tab1 placeholders in args | They remained indistinguishable from real tab IDs to reviewers and future maintainers. Confidence: high Scope-risk: narrow Directive: Matrix scenarios should describe logical tab ownership; createMatrixTask owns concrete tab ID injection. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check Not-tested: real browser benchmark run Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc16b858c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T02:30:15Z

  const mode = modeIndex !== -1 && modeIndex + 1 < args.length
    ? args[modeIndex + 1]
    : 'stub';


Honor --json flag for benchmark CLI output

The CLI never parses --json, so non-CI runs still go through the interactive text-report path. As a result, commands like tests/benchmark/run.ts --category agent-loop --runs 1 --json produce formatted output/progress instead of machine-readable JSON, which breaks automation that depends on the new matrix results being exportable without --ci.

Useful? React with 👍 / 👎.

Treat --json as a machine-readable benchmark mode, keep progress on stderr, and require tabs_create to return a concrete tabId before matrix steps run. The stub adapter now mirrors that tab creation contract for deterministic local JSON runs. Constraint: Codex review flagged the documented --json benchmark command as still producing interactive output. Rejected: Falling back to logical tab aliases when tabs_create omits tabId | That can hide broken adapter contracts and reintroduce synthetic tab IDs. Confidence: high Scope-risk: narrow Directive: Benchmark automation must fail fast on invalid selectors and adapter contract gaps instead of producing false-green reports. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check Not-tested: real browser benchmark run Co-authored-by: OmX <omx@oh-my-codex.dev>

Keep setup and cleanup outside warm scenario measurements, close every matrix-created tab, wait for all parallel reads to settle before reporting failure, and preserve JSON-mode benchmark output. Constraint: Codex review flagged warm setup contamination, leaked benchmark tabs, parallel early-return failures, and missing --json handling. Rejected: Measuring tab setup in warm scenarios | It obscures read/action payload and latency comparisons the matrix is meant to isolate. Confidence: high Scope-risk: narrow Directive: Matrix task metrics should represent only scenario steps; setup/cleanup should enforce isolation without affecting counters. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check Not-tested: real browser benchmark run Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff30862206

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T03:04:49Z

+          const result = await adapter.callTool(step.tool, args);
+          if (result.isError) {
+            const text = result.content?.find((item) => typeof item.text === 'string')?.text;
+            throw new Error(`Benchmark step failed: ${step.tool}${text ? ` — ${text.slice(0, 160)}` : ''}`);
+          }
+          measureCall(result, args, counters);


Record tabs created in-step so cleanup closes them

When a scenario step itself calls tabs_create (for example cold-start-first-tab), this code executes the call but never adds the returned tabId to createdTabIds, so closeCreatedTabs() cannot close that tab at the end of the run. In real mode this leaks one Chrome tab per run, which distorts memory/latency metrics and can eventually affect benchmark stability. Capture tabId from successful tabs_create step results here and append it to createdTabIds before returning.

Useful? React with 👍 / 👎.

Track tab IDs returned by measured tabs_create steps so cold-start and other in-step tab creation scenarios are cleaned up just like setup-created aliases. Constraint: Codex review flagged real-mode tab leaks from tabs_create scenario steps as a P1 benchmark stability issue. Rejected: Only closing alias setup tabs | It leaves cold-start measurements leaking one Chrome tab per run. Confidence: high Scope-risk: narrow Directive: Every successful tabs_create in matrix tasks must be paired with best-effort tabs_close cleanup outside measured counters. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check Not-tested: real browser benchmark run Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a7ba0e6ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T03:14:52Z

+    : undefined;
+
+  const runner = new BenchmarkRunner({
+    runsPerTask: Number.isFinite(parsedRuns) && parsedRuns! > 0 ? parsedRuns : (ciMode ? 3 : 5),


Parse --runs as an integer before configuring runner

The new --runs parsing accepts any finite positive number, but BenchmarkRunner executes with for (let i = 0; i < runsPerTask; i++), so fractional values silently run an unexpected number of iterations (e.g. --runs 1.5 executes 2 runs). This skews benchmark results and makes CLI behavior inconsistent with the parallel runner (which uses parseInt), so --runs should be normalized to a positive integer or rejected when non-integer.

Useful? React with 👍 / 👎.

Constraint: Codex P2 found Number() accepted fractional --runs values which silently produced unexpected iteration counts (e.g. --runs 1.5 → 2 runs). Rejected: Allowing parseInt of "1.5" to be 1 silently | hides user intent and produces inconsistent results vs the parallel runner. Confidence: high Scope-risk: narrow Directive: CLI numeric options must reject ambiguous input. Tested: npm run build; npx jest --runInBand tests/benchmark/benchmark-runner.test.ts Not-tested: full CI matrix

Constraint: macos-22 CI on PR #1079 failed because workflow log lines ([WorkflowEn...]) interleaved with stdout caused extractJsonArray to anchor on the first bracket and feed non-JSON text to JSON.parse. Rejected: Suppressing workflow logs in the CLI | they are intentional diagnostics, not test concerns. Confidence: high Scope-risk: narrow Directive: In-process CLI test helpers must anchor on JSON-array-shape boundaries. Tested: npx jest --runInBand tests/cli/admin-keys.test.ts Not-tested: full CI matrix

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0a9866d9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T09:37:47Z

+          }
+        }
+
+        await closeCreatedTabs(adapter, createdTabIds);


Stop latency timer before benchmark cleanup

Move tab cleanup out of the measured interval for matrix scenarios. In this success path, closeCreatedTabs runs before wallTimeMs is computed, so every reported latency includes tabs_close overhead (and that overhead grows with tab count, e.g. parallel-tabs-20). This makes the published timings inconsistent with scenario descriptions like “read_page latency” or “first usable tab”, and can distort cross-scenario and before/after comparisons.

Useful? React with 👍 / 👎.

chatgpt-codex-connector Bot reviewed May 12, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 and others added 2 commits May 13, 2026 10:18

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 and others added 2 commits May 13, 2026 11:57

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 mentioned this pull request May 13, 2026

Standardize OpenChrome benchmark matrix for latency, payload, and memory #970

Closed

9 tasks

shaun0927 added 2 commits May 13, 2026 18:18

shaun0927 added 2 commits May 13, 2026 18:27

Merge develop into test/970-benchmark-matrix

6134e9a

Merge develop into test/970-benchmark-matrix

b0a9866

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

shaun0927 merged commit 42ee6de into develop May 13, 2026
9 checks passed

		{ tool: 'act', args: { tabId: 'tab1', action: 'click', ref: 'ref_1' } },
		{ tool: 'act', args: { tabId: 'tab1', action: 'fill', ref: 'ref_2', text: 'benchmark' } },

		const result = await adapter.callTool(step.tool, step.args);
		measureCall(result, step.args, counters);

		for (const step of scenario.steps) {
		const result = await adapter.callTool(step.tool, step.args);

		{ tool: 'act', args: { tabId: 'tab1', instruction: 'click ref_1' } },
		{ tool: 'act', args: { tabId: 'tab1', instruction: 'type benchmark into ref_2' } },

	const ciMode = args.includes('--ci');
	const modeIndex = args.indexOf('--mode');

		measureCall(result, args, counters);
		const payload = responsePayloadSize(result);

		if (scenario.category === 'parallel-tabs') {
		await Promise.all(scenario.steps.map(runStep));

Conversation

shaun0927 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress / Review status

Summary

Fit / non-overlap review

Tests

OpenChrome real-validation checklist

Uh oh!

gemini-code-assist Bot commented May 12, 2026

Uh oh!

qodo-code-review Bot commented May 12, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

shaun0927 commented May 12, 2026 •

edited

Loading