Skip to content

test(benchmark): standardize agent-loop performance matrix#1079

Merged
shaun0927 merged 15 commits into
developfrom
test/970-benchmark-matrix
May 13, 2026
Merged

test(benchmark): standardize agent-loop performance matrix#1079
shaun0927 merged 15 commits into
developfrom
test/970-benchmark-matrix

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

@shaun0927 shaun0927 commented May 12, 2026

Progress / Review status

Auto-refreshed 2026-05-13 — owner comments cleaned up to reduce review noise.

Field Value
Branch test/970-benchmark-matrixdevelop
Draft no
CI ✅ all 9 checks passing
Mergeable ✅ MERGEABLE
Review decision
Codex (latest) 💡 suggestions posted
Other reviewers (latest) chatgpt-codex-connector: commented
Head b0a9866 — Merge develop into test/970-benchmark-matrix
Commits 15

Owner comment cleanup: 11 issue + 0 inline review comments deleted. Outstanding feedback from automated/external reviewers above is unchanged.


Summary

  • Adds a standardized benchmark matrix for cold start, DOM/AX read_page, DOM delta, interactive discovery, click/fill actions, screenshots, agent-loop, and 1/5/20 parallel-tab smoke scenarios.
  • Extends benchmark stats with avg/p50/min/max/stddev latency, response chars, estimated output tokens, screenshot bytes, and Node RSS.
  • Adds --category, --runs, and --json support to the existing benchmark runner so optimization PRs can produce comparable JSON.

Closes #970

Fit / non-overlap review

Tests

  • ./node_modules/.bin/jest tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand
  • ./node_modules/.bin/tsc -p tsconfig.json --noEmit
  • ./node_modules/.bin/ts-node tests/benchmark/run.ts --category agent-loop --runs 1 --json

OpenChrome real-validation checklist

After merge, validate with OpenChrome itself:

  1. Build OpenChrome: npm run build.
  2. Run smoke JSON: npm run benchmark -- --category agent-loop --runs 3 --json.
  3. Confirm the JSON task includes samples/runs, meanWallTimeMs, p50WallTimeMs, minWallTimeMs, maxWallTimeMs, stddevWallTimeMs, meanResponseChars, meanEstimatedOutputTokens, and meanNodeRssBytes.
  4. Run a full/manual scenario where feasible: npx ts-node tests/benchmark/run-parallel.ts --mode real --category batch-js --runs 1.
  5. Paste the generated JSON/Markdown summary into future optimization PRs for before/after comparison.

Add a reusable benchmark matrix covering read_page, delta, screenshot, action, cold-start, agent-loop, and parallel-tab smoke scenarios, plus JSON stats for latency distribution, payload size, token estimate, screenshot bytes, and Node RSS.

Constraint: CI/default paths remain deterministic stub runs; real Chrome benchmarking stays manual/advisory.

Rejected: Hard CI performance gates | environment variance would make current benchmark results flaky.

Confidence: high

Scope-risk: moderate

Directive: Add new optimization PR scenarios through the matrix helper so before/after reports stay comparable.

Tested: ./node_modules/.bin/jest tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand

Tested: ./node_modules/.bin/tsc -p tsconfig.json --noEmit

Tested: ./node_modules/.bin/ts-node tests/benchmark/run.ts --category agent-loop --runs 1 --json

Not-tested: Full real Chrome benchmark matrix; left as manual PR checklist because it can be environment-dependent.
@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d36b0cc95c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts Outdated
Comment on lines +86 to +87
{ tool: 'act', args: { tabId: 'tab1', action: 'click', ref: 'ref_1' } },
{ tool: 'act', args: { tabId: 'tab1', action: 'fill', ref: 'ref_2', text: 'benchmark' } },
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Send instruction to act steps in matrix scenarios

The matrix defines act steps with { action, ref, text }, but the act tool contract requires instruction (and tabId) as its input (src/tools/act.ts, required: ['tabId', 'instruction']). As written, these calls will return tool errors in real mode, so the agent-loop benchmark path advertised in this commit measures error responses instead of real click/fill latency.

Useful? React with 👍 / 👎.

Comment thread tests/benchmark/matrix.ts Outdated
Comment on lines +138 to +139
const result = await adapter.callTool(step.tool, step.args);
measureCall(result, step.args, counters);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail matrix runs when a tool call returns isError

Each step result is consumed only for metrics, and the run is returned as success: true even when MCP returns an error payload. OpenChromeRealAdapter.callTool explicitly maps server/tool failures to { isError: true } (tests/benchmark/adapters/openchrome-real-adapter.ts), so ignoring that flag silently marks broken scenarios as successful and corrupts benchmark comparisons.

Useful? React with 👍 / 👎.

Comment thread tests/benchmark/matrix.ts Outdated
name: 'cold-start-first-tab',
category: 'cold-start',
description: 'Session/server init to first usable tab',
steps: [{ tool: 'open_tabs', args: {} }],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use a registered tool name for cold-start tab creation

The cold-start scenario calls open_tabs, but there is no registered MCP tool with that name in the repo tool registry (tab creation is registered as tabs_create in src/tools/tabs-create.ts). In real mode this scenario will always produce an error response, so the standardized cold-start metric cannot execute as intended.

Useful? React with 👍 / 👎.

Constraint: Codex P1 found matrix scenarios used non-contract tool arguments and still reported success on isError responses.\nRejected: Leaving real-mode failures as metrics-only samples | that corrupts benchmark comparisons and hides broken scenarios.\nConfidence: high\nScope-risk: narrow\nDirective: Matrix scenarios must use registered OpenChrome tool contracts and fail when the MCP layer reports isError.\nTested: npm ci; npx jest tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; git diff --check\nNot-tested: Real Chrome benchmark run.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10cedd8e04

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts Outdated
name: 'warm-read-page-dom',
category: 'read-page',
description: 'Warm DOM read_page latency and payload size',
steps: [{ tool: 'read_page', args: { tabId: 'tab1', mode: 'dom' } }],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Create and reuse real tab IDs in matrix scenarios

The matrix steps hard-code tabId: 'tab1' (and tab2...tab20) instead of using the tabId returned by tabs_create, so real-mode runs will fail once they hit read_page/act/page_screenshot on non-existent tabs. tabs_create returns runtime-generated target IDs (not stable tabN aliases), and read_page explicitly errors when the tab is not found, so most scenarios in this matrix will benchmark tool errors rather than the intended workflows.

Useful? React with 👍 / 👎.

Comment thread tests/benchmark/run.ts Outdated

async function main(): Promise<void> {
const ciMode = process.argv.includes('--ci');
const jsonMode = process.argv.includes('--json') || ciMode;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep --json mode output strictly JSON

Enabling --json now routes into jsonMode, but this path still emits non-JSON text (progress logs before serialization and status text after it), which makes stdout invalid for direct JSON parsing despite the branch being documented as machine-readable output. This breaks automation that expects --json to produce only JSON bytes.

Useful? React with 👍 / 👎.

Resolve the benchmark CLI/test conflicts by preserving develop's injectable CLI output path while keeping this PR's category and run-count options for the standardized matrix.

Constraint: GitHub marked PR #1079 as CONFLICTING against develop after review fixes were pushed.

Rejected: Rebasing away branch history | a merge commit preserves the existing PR review context and only resolves base drift.

Confidence: high

Scope-risk: moderate

Directive: Keep benchmark CLI stdout JSON-only in CI while adding future CLI filters.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand

Tested: npm ci

Tested: npm run build

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9bae99613

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/run.ts
Comment on lines +63 to +65
if (category) {
for (const task of createMatrixTasks({ category })) {
runner.addTask(task);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject unknown benchmark categories

When --category is provided, createMatrixTasks can return an empty list for typos or unsupported values, but main still runs and exits successfully with empty reports. In CI this produces a false-green benchmark run (no scenarios executed, no regression possible), which can silently corrupt before/after performance comparisons.

Useful? React with 👍 / 👎.

Comment thread tests/benchmark/matrix.ts Outdated
Comment on lines +137 to +138
for (const step of scenario.steps) {
const result = await adapter.callTool(step.tool, step.args);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Execute parallel-tab scenarios concurrently

The matrix defines parallel-tabs-* scenarios, but task execution awaits each step in a single for loop, so all read_page calls run serially. This measures cumulative sequential latency instead of parallel tab behavior, making the reported 1/5/20 tab results non-comparable for concurrency regressions.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9bae99613

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/run.ts
Comment on lines +63 to +66
if (category) {
for (const task of createMatrixTasks({ category })) {
runner.addTask(task);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject unknown benchmark categories before running

When --category does not match any matrix scenario, this branch adds no tasks and the command still runs both adapters, emitting an empty report (all-zero summary) and a passing regression result. In practice, a typo or stale category name can silently look like a successful benchmark run and corrupt CI/perf comparisons. Please fail fast if createMatrixTasks({ category }) returns an empty list, with a clear error and non-zero exit.

Useful? React with 👍 / 👎.

shaun0927 and others added 2 commits May 13, 2026 10:18
Fix the develop-merge CI path for PR #1079 by awaiting the introspection stdout write before exit and aligning crawl tool required-field descriptions with the schema lint baseline.

Constraint: Remote CI failed in lint:tool-schemas on the PR merge commit before running the benchmark matrix tests.

Rejected: Ignoring this as base noise | required checks validate the actual merge commit, so the PR must carry the compatibility fix.

Confidence: high

Scope-risk: narrow

Directive: Large machine-readable stdout output used in pipes must be flushed before process.exit.

Tested: npm run build

Tested: npm run lint:tool-schemas

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Create benchmark tabs through tabs_create, substitute the returned tab ids into scenario steps, reject unknown categories, and check isError before accepting step results. Parallel-tab scenarios now dispatch their read steps concurrently after tab setup.

Constraint: Latest Codex review on PR #1079 flagged hard-coded tab ids, empty category selections, and weak isError handling as P1 risks.

Rejected: Keeping synthetic tab1 placeholders in real-mode runs | real adapters require server-issued tab ids to exercise the intended path.

Confidence: high

Scope-risk: moderate

Directive: Benchmark scenarios may use placeholders in definitions, but runtime execution must resolve them to concrete tool outputs.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand

Tested: npm run build

Tested: npm run lint:tool-schemas

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e2fceebb9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/run.ts
Comment on lines +63 to +66
if (category) {
for (const task of createMatrixTasks({ category })) {
runner.addTask(task);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject unknown benchmark categories

When --category does not match any scenario, createMatrixTasks({ category }) returns an empty array and this loop adds zero tasks, but the CLI still runs and exits successfully with an empty report. That silently turns typos or stale category names into false-green benchmark runs and can corrupt CI/perf comparisons because no workload actually executed. Please fail fast when no tasks are selected for a provided category.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99eaf25ae6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts Outdated
Comment on lines +107 to +108
{ tool: 'act', args: { tabId: 'tab1', instruction: 'click ref_1' } },
{ tool: 'act', args: { tabId: 'tab1', instruction: 'type benchmark into ref_2' } },
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Replace invalid act ref targets in matrix scenarios

These scenarios hardcode act instructions like click ref_1 / type ... ref_2, but the matrix runner creates fresh tabs at about:blank and never performs any ref-id resolution before calling act. In act, targets are resolved via AX text matching (resolveElementsByAXTree(normalizeQuery(query))) and unresolved targets are treated as hard failures (isError: true), so action (and similarly agent-loop) will fail consistently in real runs instead of measuring latency. This makes the standardized matrix unreliable for production benchmarking.

Useful? React with 👍 / 👎.

Load action scenarios with a tiny local data-page fixture and target controls by accessible text instead of synthetic ref ids, so real-mode act calls benchmark the intended action path rather than deterministic unresolved-ref errors.

Constraint: Latest Codex review on PR #1079 flagged hard-coded ref_1/ref_2 act instructions as P1 because fresh tabs never resolve those refs.

Rejected: Keeping about:blank for action scenarios | act needs concrete controls to measure useful latency.

Confidence: high

Scope-risk: narrow

Directive: Benchmark action fixtures must create their own resolvable DOM/AX targets before invoking act.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand

Tested: npm run build

Tested: npm run lint:tool-schemas

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

const ciMode = args.includes('--ci');
const modeIndex = args.indexOf('--mode');

P1 Badge Parse and honor the --json CLI flag

The new CLI path never reads --json, so main falls back to interactive mode unless --ci is set. That means ts-node tests/benchmark/run.ts --category agent-loop --runs 1 --json (the command documented in this commit) still writes progress text and a formatted report instead of machine-readable JSON, breaking automation that expects JSON output for baseline comparisons.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts
Comment on lines +165 to +167
for (const placeholder of tabPlaceholders(scenario)) {
const args = { url: setupUrlForScenario(scenario) };
const result = await adapter.callTool('tabs_create', args);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Close matrix-created tabs after each benchmark run

Each matrix run creates one or more tabs via tabs_create for placeholders, but those tabs are never closed before returning. Because BenchmarkRunner.run keeps one adapter/session alive across all tasks and repetitions, tabs accumulate across runs (especially parallel-tabs-20), which increases memory usage and can skew later latency measurements, making matrix results non-comparable within the same benchmark invocation.

Useful? React with 👍 / 👎.

Resolve develop drift in the introspection path by keeping flushed JSON output and returning from the CLI action without process.exit.

Constraint: GitHub marked PR #1079 dirty after develop advanced.

Rejected: Reintroducing immediate process.exit | that previously truncated piped tool-list JSON in CI.

Confidence: high

Scope-risk: narrow

Directive: Introspection output must be fully written before the command terminates.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand

Tested: npm run build

Tested: npm run lint:tool-schemas

Tested: git diff --check

Not-tested: Full remote CI matrix pending after push.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1771784387

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts Outdated
Comment on lines +172 to +173
measureCall(result, args, counters);
const payload = responsePayloadSize(result);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude tab setup from warm scenario measurements

These counters are updated while creating placeholder tabs, so scenarios like warm-read-page-*, interactive, action, and screenshot include tabs_create overhead in toolCallCount, payload sizes, and wallTimeMs. That makes the reported numbers inconsistent with the scenario definitions (which list only the measured step(s)) and contaminates warm-path comparisons with setup cost that should be isolated to cold-start.

Useful? React with 👍 / 👎.

Comment thread tests/benchmark/matrix.ts Outdated
Comment on lines +195 to +196
if (scenario.category === 'parallel-tabs') {
await Promise.all(scenario.steps.map(runStep));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for parallel reads to settle before returning failure

Using Promise.all here causes fail-fast behavior: if one read_page rejects, the task returns immediately via catch while the other tab reads keep running in the background. In benchmark loops this can overlap with the next run on the same adapter and also drops late results from counters, producing inconsistent metrics specifically when a subset of parallel calls fails.

Useful? React with 👍 / 👎.

Replace synthetic tabId literals in scenario definitions with logical tab aliases so matrix execution can only pass tab IDs returned by tabs_create. Add a CLI guard test for unknown categories so typos fail before adapters run.

Constraint: Codex review flagged hard-coded tab IDs and silent empty category runs as P1 merge blockers.

Rejected: Keeping tab1 placeholders in args | They remained indistinguishable from real tab IDs to reviewers and future maintainers.

Confidence: high

Scope-risk: narrow

Directive: Matrix scenarios should describe logical tab ownership; createMatrixTask owns concrete tab ID injection.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check

Not-tested: real browser benchmark run

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc16b858c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/run.ts
Comment on lines 46 to 48
const mode = modeIndex !== -1 && modeIndex + 1 < args.length
? args[modeIndex + 1]
: 'stub';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor --json flag for benchmark CLI output

The CLI never parses --json, so non-CI runs still go through the interactive text-report path. As a result, commands like tests/benchmark/run.ts --category agent-loop --runs 1 --json produce formatted output/progress instead of machine-readable JSON, which breaks automation that depends on the new matrix results being exportable without --ci.

Useful? React with 👍 / 👎.

shaun0927 and others added 2 commits May 13, 2026 11:57
Treat --json as a machine-readable benchmark mode, keep progress on stderr, and require tabs_create to return a concrete tabId before matrix steps run. The stub adapter now mirrors that tab creation contract for deterministic local JSON runs.

Constraint: Codex review flagged the documented --json benchmark command as still producing interactive output.

Rejected: Falling back to logical tab aliases when tabs_create omits tabId | That can hide broken adapter contracts and reintroduce synthetic tab IDs.

Confidence: high

Scope-risk: narrow

Directive: Benchmark automation must fail fast on invalid selectors and adapter contract gaps instead of producing false-green reports.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check

Not-tested: real browser benchmark run

Co-authored-by: OmX <omx@oh-my-codex.dev>
Keep setup and cleanup outside warm scenario measurements, close every matrix-created tab, wait for all parallel reads to settle before reporting failure, and preserve JSON-mode benchmark output.

Constraint: Codex review flagged warm setup contamination, leaked benchmark tabs, parallel early-return failures, and missing --json handling.

Rejected: Measuring tab setup in warm scenarios | It obscures read/action payload and latency comparisons the matrix is meant to isolate.

Confidence: high

Scope-risk: narrow

Directive: Matrix task metrics should represent only scenario steps; setup/cleanup should enforce isolation without affecting counters.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check

Not-tested: real browser benchmark run

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff30862206

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts
Comment on lines +190 to +195
const result = await adapter.callTool(step.tool, args);
if (result.isError) {
const text = result.content?.find((item) => typeof item.text === 'string')?.text;
throw new Error(`Benchmark step failed: ${step.tool}${text ? ` — ${text.slice(0, 160)}` : ''}`);
}
measureCall(result, args, counters);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Record tabs created in-step so cleanup closes them

When a scenario step itself calls tabs_create (for example cold-start-first-tab), this code executes the call but never adds the returned tabId to createdTabIds, so closeCreatedTabs() cannot close that tab at the end of the run. In real mode this leaks one Chrome tab per run, which distorts memory/latency metrics and can eventually affect benchmark stability. Capture tabId from successful tabs_create step results here and append it to createdTabIds before returning.

Useful? React with 👍 / 👎.

Track tab IDs returned by measured tabs_create steps so cold-start and other in-step tab creation scenarios are cleaned up just like setup-created aliases.

Constraint: Codex review flagged real-mode tab leaks from tabs_create scenario steps as a P1 benchmark stability issue.

Rejected: Only closing alias setup tabs | It leaves cold-start measurements leaking one Chrome tab per run.

Confidence: high

Scope-risk: narrow

Directive: Every successful tabs_create in matrix tasks must be paired with best-effort tabs_close cleanup outside measured counters.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-runner.test.ts tests/benchmark/matrix.test.ts --runInBand; npm run build; npm run lint:tool-schemas; git diff --check

Not-tested: real browser benchmark run

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a7ba0e6ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/run.ts Outdated
: undefined;

const runner = new BenchmarkRunner({
runsPerTask: Number.isFinite(parsedRuns) && parsedRuns! > 0 ? parsedRuns : (ciMode ? 3 : 5),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse --runs as an integer before configuring runner

The new --runs parsing accepts any finite positive number, but BenchmarkRunner executes with for (let i = 0; i < runsPerTask; i++), so fractional values silently run an unexpected number of iterations (e.g. --runs 1.5 executes 2 runs). This skews benchmark results and makes CLI behavior inconsistent with the parallel runner (which uses parseInt), so --runs should be normalized to a positive integer or rejected when non-integer.

Useful? React with 👍 / 👎.

shaun0927 added 2 commits May 13, 2026 18:18
Constraint: Codex P2 found Number() accepted fractional --runs values which
silently produced unexpected iteration counts (e.g. --runs 1.5 → 2 runs).

Rejected: Allowing parseInt of "1.5" to be 1 silently | hides user intent
and produces inconsistent results vs the parallel runner.

Confidence: high
Scope-risk: narrow
Directive: CLI numeric options must reject ambiguous input.

Tested: npm run build; npx jest --runInBand tests/benchmark/benchmark-runner.test.ts
Not-tested: full CI matrix
Constraint: macos-22 CI on PR #1079 failed because workflow log lines
([WorkflowEn...]) interleaved with stdout caused extractJsonArray to
anchor on the first bracket and feed non-JSON text to JSON.parse.

Rejected: Suppressing workflow logs in the CLI | they are intentional
diagnostics, not test concerns.

Confidence: high
Scope-risk: narrow
Directive: In-process CLI test helpers must anchor on JSON-array-shape boundaries.

Tested: npx jest --runInBand tests/cli/admin-keys.test.ts
Not-tested: full CI matrix
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0a9866d9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/benchmark/matrix.ts
}
}

await closeCreatedTabs(adapter, createdTabIds);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop latency timer before benchmark cleanup

Move tab cleanup out of the measured interval for matrix scenarios. In this success path, closeCreatedTabs runs before wallTimeMs is computed, so every reported latency includes tabs_close overhead (and that overhead grows with tab count, e.g. parallel-tabs-20). This makes the published timings inconsistent with scenario descriptions like “read_page latency” or “first usable tab”, and can distort cross-scenario and before/after comparisons.

Useful? React with 👍 / 👎.

@shaun0927 shaun0927 merged commit 42ee6de into develop May 13, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant