Skip to content

feat(observability): measure response tokens and cache status#1063

Open
shaun0927 wants to merge 8 commits into
developfrom
feat/996-token-observability
Open

feat(observability): measure response tokens and cache status#1063
shaun0927 wants to merge 8 commits into
developfrom
feat/996-token-observability

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Summary

  • adds bounded-cardinality output-size and estimated-token metrics for final MCP tool responses
  • records compression savings and cache status metrics when tools expose those fields
  • adds metric registration tests for the new observability surface

Closes #996.

Alignment / scope review

This stays aligned with OpenChrome's direction: no server-side LLM, no tokenizer dependency, no billing integration, and no persisted response bodies. It measures only payload sizes/status after response shaping and redaction.

Verification

  • /Users/jh0927/openchrome/node_modules/.bin/tsc -p tsconfig.json --pretty false
  • npx jest --config jest.config.js --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts

Post-merge smoke

Run the issue's OpenChrome MCP verification script/manual flow against /metrics to confirm openchrome_tool_output_bytes, openchrome_tool_estimated_tokens, and cache-status counters appear with bounded labels.

@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d20405b6f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts Outdated
const metrics = getMetricsCollector();
const bytes = estimateResultPayloadBytes(result);
metrics.observe('openchrome_tool_output_bytes', withTenantLabel({ tool: toolName }), bytes);
metrics.observe('openchrome_tool_estimated_tokens', withTenantLabel({ tool: toolName }), estimateOutputTokensFromChars(bytes));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compute token estimate from character count

recordToolOutputObservability passes the UTF-8 byte length into estimateOutputTokensFromChars, but that heuristic is documented and named as chars/4. For non-ASCII responses (e.g., CJK or emoji-heavy outputs), byte length is larger than character count, so openchrome_tool_estimated_tokens is systematically inflated and no longer comparable across languages.

Useful? React with 👍 / 👎.

Comment thread src/mcp-server.ts Outdated
const status = typeof obj.status === 'string' ? obj.status : typeof obj.cacheStatus === 'string' ? obj.cacheStatus : null;
if (!status) return null;
const keyVersion = obj.keyVersion ?? obj.version ?? 'unknown';
return { status: status.toUpperCase(), keyVersion: String(keyVersion) };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize cache key version metric label

extractCacheStatus forwards keyVersion as an arbitrary string, and that value is later emitted as the key_version label in openchrome_cache_status_total without any bounding/normalization. If a tool reports per-request versions (hashes/timestamps), this introduces unbounded label cardinality and can degrade Prometheus memory/query performance.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6abe02838

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
if (cache) {
metrics.inc('openchrome_cache_status_total', withTenantLabel({
tool: toolName,
status: cache.status,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Constrain cache status labels before emitting metrics

recordToolOutputObservability emits openchrome_cache_status_total with status taken directly from tool output via extractCacheStatus (only uppercased), so any tool that includes request-specific detail in cache status (for example reason strings or embedded IDs) will create unbounded label cardinality and degrade Prometheus memory/query performance. Please normalize this to a small allowlist (e.g., HIT/MISS/BYPASS/ERROR) or map unknown values to a fixed bucket before incrementing the counter.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d8bb37afb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
if (raw && typeof raw === 'object') {
const obj = raw as Record<string, unknown>;
const status = typeof obj.status === 'string' ? obj.status : typeof obj.cacheStatus === 'string' ? obj.cacheStatus : null;
if (!status) return null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Continue cache-status fallback instead of returning early

When raw is an object but does not contain a string status/cacheStatus, this early return null prevents the later structuredContent.cacheStatus fallback from running. In mixed-shape results (for example, cache: { keyVersion: "v2" } plus structuredContent.cacheStatus: "miss"), openchrome_cache_status_total is silently dropped even though a valid cache status is present.

Useful? React with 👍 / 👎.

shaun0927 and others added 2 commits May 13, 2026 01:40
Constraint: OpenChrome must expose token/output savings without adding tokenizer or billing dependencies.
Rejected: Provider-specific tokenizers | they add dependency and billing semantics outside this MCP server boundary.
Confidence: medium
Scope-risk: narrow
Directive: Keep metric labels bounded; never add URL, selector, instruction, or domain labels for these metrics.
Tested: /Users/jh0927/openchrome/node_modules/.bin/tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts
Not-tested: Live /metrics smoke against a running OpenChrome instance
@shaun0927 shaun0927 force-pushed the feat/996-token-observability branch from 5d8bb37 to 94f4324 Compare May 12, 2026 16:40
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94f4324d1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
if (typeof originalChars === 'number' && typeof compressedChars === 'number' && originalChars > compressedChars) {
metrics.observe(
'openchrome_tool_compression_saved_bytes',
withTenantLabel({ tool: toolName, mode: level }),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize compression mode labels before emitting metrics

recordToolOutputObservability forwards _compression.level directly into the mode label for openchrome_tool_compression_saved_bytes. Because MCPResult is free-form, any tool can return request-specific strings in _compression.level, which creates unbounded label cardinality and can degrade Prometheus memory/query performance once compression stats are emitted. Please map this label to a small allowlist (and bucket unknowns) before calling observe.

Useful? React with 👍 / 👎.

@shaun0927
Copy link
Copy Markdown
Owner Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94f4324d1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
],
isError: true,
};
this.recordToolOutputObservability(toolName, forbiddenResult);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bound tool labels before recording denied-call output metrics

recordToolOutputObservability is called on the scope-denied path before verifying that toolName is a registered tool, so a read-only API key can send arbitrary params.name values and create unbounded new time series in openchrome_tool_output_bytes / openchrome_tool_estimated_tokens (and potentially cache-status metrics). This introduces a cardinality-amplification vector from user input and can degrade Prometheus memory/query performance; normalize unknown names to a fixed bucket (e.g. unknown) or only emit these metrics after successful tool lookup.

Useful? React with 👍 / 👎.

Merges the latest develop test-infra repairs so PR #1063 is evaluated against the branch it targets without changing the bounded metrics feature itself.

Constraint: Remote CI failed on stale develop test expectations unrelated to token observability.

Rejected: Editing unrelated doctor/tab tests in this PR | the fixes already landed on develop and should be consumed as base updates.

Confidence: medium

Scope-risk: narrow

Directive: Keep output observability labels bounded; do not add URL, selector, instruction, or domain labels.

Tested: pending rerun after dependency install; remote CI required for full matrix.

Not-tested: Full CI matrix locally.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5f9adab7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
Comment on lines +1855 to +1858
const originalChars = (compression as Record<string, unknown>).originalChars;
const compressedChars = (compression as Record<string, unknown>).compressedChars;
const level = String((compression as Record<string, unknown>).level ?? 'unknown');
if (typeof originalChars === 'number' && typeof compressedChars === 'number' && originalChars > compressedChars) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve compression stats before observing saved bytes

recordToolOutputObservability expects _compression.originalChars and _compression.compressedChars, but the success path currently rewrites _compression to { level, verbosity } before this method runs when compression.trackSavings is enabled. In that configuration, any tool-provided compression stats are discarded, so openchrome_tool_compression_saved_bytes never records even when a tool returns valid savings data. Please merge/retain existing _compression fields instead of replacing the object.

Useful? React with 👍 / 👎.

Records original and delta response sizes on real read_page DOM delta responses so compression savings metrics are populated from actual tool output instead of inert placeholder metadata.

Constraint: PR #1063 promised compression savings observability without adding tokenizer or billing dependencies.

Rejected: Provider-specific token accounting | this remains a bounded payload-size heuristic within the MCP server boundary.

Confidence: high

Scope-risk: narrow

Directive: Keep output metric labels bounded and never include URL, selector, instruction, or domain labels.

Tested: npm test -- --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts; ./node_modules/.bin/tsc -p tsconfig.json --pretty false

Not-tested: Live /metrics smoke against a running OpenChrome instance.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d45a2def9a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/read-page.ts
Comment on lines +551 to +552
originalChars: outputText.length,
compressedChars: compressedText.length,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count full uncompressed response when reporting delta savings

_compression.originalChars is set from outputText.length, but the non-delta DOM response for the same call is outputText + domPaginationSection. When includePagination is left at its default (true) and pagination metadata is non-empty, this undercounts the baseline and can make originalChars <= compressedChars, so openchrome_tool_compression_saved_bytes is underreported or dropped even when delta compression actually saved bytes.

Useful? React with 👍 / 👎.

shaun0927 and others added 2 commits May 13, 2026 04:18
Cleans up module mocks used by the read_page compression-savings regression test so full-suite CI workers do not inherit mocked session or DOM modules.

Constraint: CI runs many Jest files in shared workers, so test-local doMock calls must not leak beyond this suite.

Rejected: Removing the real read_page delta-path test | it is needed to prove the compression-savings metric is populated.

Confidence: high

Scope-risk: narrow

Directive: Keep observability tests hermetic when mocking runtime modules.

Tested: npm test -- --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts

Not-tested: Full Jest suite locally after this test-isolation fix.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53347f80b3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/mcp-server.ts
metrics.observe(
'openchrome_tool_compression_saved_bytes',
withTenantLabel({ tool: toolName, mode: level }),
originalChars - compressedChars,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Convert compression savings metric to true byte deltas

openchrome_tool_compression_saved_bytes is recorded from originalChars - compressedChars, but those _compression fields are character counts (for example read_page sets them from .length). For non-ASCII output (CJK, emoji, etc.), character deltas can differ substantially from UTF-8 byte deltas, so this metric reports incorrect units and can mislead dashboards/alerts that depend on byte savings. Compute the delta from UTF-8 byte lengths (or rename the metric to chars) before observing it.

Useful? React with 👍 / 👎.

Constraint: PR #1063 inherits the expanded Tier 1 tool surface from develop.
Rejected: Loosening the cursor verification assertion | it would stop catching accidental tool-surface drift.
Confidence: high
Scope-risk: narrow
Directive: Keep the exact count aligned whenever Tier 1 tool registration intentionally changes.
Tested: CI=true npx jest --config jest.ci.config.js --runInBand tests/cross-env/cursor-verification.test.ts
Not-tested: Full CI matrix pending on GitHub runners.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@shaun0927
Copy link
Copy Markdown
Owner Author

@codex review

Constraint: PR #1063 must pass Windows CI where Git can check out JSON fixtures with CRLF line endings.
Rejected: Changing the v1.11.0 fixture payload | the response shape did not change.
Confidence: high
Scope-risk: narrow
Directive: Preserve strict JSON shape assertions while normalizing platform line endings at comparison time.
Tested: CI=true npx jest --config jest.ci.config.js --runInBand tests/tools/console-capture-regression.test.ts tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts
Not-tested: Full CI matrix pending on GitHub runners.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@shaun0927
Copy link
Copy Markdown
Owner Author

@codex review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant