shaun0927 · shaun0927 · May 13, 2026 · May 13, 2026
diff --git a/docs/roadmap/2026-oss-library-comparison.md b/docs/roadmap/2026-oss-library-comparison.md
@@ -0,0 +1,113 @@
+# 2026 OSS library comparison for OpenSafari stability, memory, Flutter QA, search, and login
+
+_Last reviewed: 2026-05-14 KST. Source set: official project docs/repos where available._
+
+## Scope and current OpenSafari baseline
+
+OpenSafari is an MCP server focused on macOS and the iOS Simulator. It covers iOS Safari, the WebKit Remote Debugging Protocol, native Accessibility/SimulatorKit input, and Flutter VM Service inspection. The repository already contains surfaces relevant to each area compared below:
+
+- Browser/native automation: `src/webkit/*`, `src/native/*`, `src/tools/app-*`, `src/tools/wait-for.ts`, `src/tools/app-wait-for.ts`.
+- Reliability: `src/reliability/*`, `src/watchdog/*`, private API sentinels, headless smoke workflow, simulator/proxy readiness work.
+- Memory: `src/metrics/memory-tracker.ts`, `src/metrics/heap-snapshot-diff.ts`, `tests/soak/*`, `src/tools/flutter-memory-profile.ts`.
+- Flutter QA: `src/tools/qa-flutter-*`, `src/flutter/vm-service-client.ts`, `src/tools/flutter-*`.
+- Login/auth persistence: `src/auth/manager.ts`, `src/tools/auth.ts`, existing issue #699.
+- Search: no product-facing web-search engine. OpenSafari searches/queries pages and native trees; general web search is outside the core automation runtime.
+
+The safe strategy is therefore not to import large competitors wholesale. OpenSafari should copy proven patterns that reduce flakiness, increase diagnostic evidence, and expose memory/Flutter validation without changing default runtime semantics.
+
+## Comparative analysis
+
+### Browser and mobile automation frameworks
+
+| Library | Strengths OpenSafari can learn from | Weaknesses / mismatch vs OpenSafari | Safe OpenSafari application |
+|---|---|---|---|
+| Appium 2 | Mature mobile-web/native/hybrid abstraction, capability-driven sessions, explicit command timeout (`newCommandTimeout`), broad ecosystem for iOS Safari and Flutter drivers. | Heavy server/driver stack; WebDriver indirection can add latency and hides private Simulator/WebKit details that OpenSafari intentionally controls directly. | Adopt capability/contract style for OpenSafari live validation issues and session health docs; avoid runtime dependency. |
+| WebdriverIO | Auto-waiting around interactable elements, timeout taxonomy, Appium integration, protocol abstraction across WebDriver/BiDi/mobile. | Its model assumes WebDriver sessions; OpenSafari already has direct MCP tools and native bridges. | Improve native `app_wait_for`/action diagnostics with stability windows and timeout metadata; no dependency. |
+| Playwright | Auto-wait, trace viewer, retry-on-failure trace capture, action-level snapshots, console/network correlation. | Playwright's desktop WebKit build is not iOS Safari in the Simulator, and its trace format is large and runner-specific. | Add OpenSafari-native lightweight action trace artifacts for failed/long live validations. |
+| Puppeteer | Direct protocol control, useful CDP tracing/perf patterns, low-level browser primitives. | Chrome/CDP-centric; not applicable to iOS Safari/WebKit Remote Debugging Protocol without translation. | Keep direct protocol philosophy; do not adopt as dependency. |
+| Selenium / WebDriver BiDi | Standardization trajectory for bidirectional browser automation and logs/events. | Safari/iOS support still mediated by drivers; less direct than OpenSafari's target. | Track BiDi vocabulary for future event naming, but do not re-platform. |
+
+**Conclusion:** The mandatory improvement is Playwright/WebdriverIO-inspired diagnostics and auto-wait metadata, not Appium/Selenium/Puppeteer adoption.
+
+### Observability and memory tooling
+
+| Library | Strengths OpenSafari can learn from | Weaknesses / mismatch | Safe OpenSafari application |
+|---|---|---|---|
+| OpenTelemetry JS | Standard traces/metrics/logs vocabulary; spans map well to MCP tool calls, simulator boot, proxy readiness, WebKit commands. | SDK/exporter dependencies can be non-trivial and introduce startup/config complexity. | Define an OpenTelemetry-compatible trace schema in docs and JSON artifacts first; optional exporter later. |
+| Sentry | Error grouping, performance traces, crash reporting across Node and Flutter. | External SaaS/self-hosted dependency, privacy concerns, credentials/config burden. | Keep Sentry as optional downstream consumer of structured logs; do not embed. |
+| memlab | Three-snapshot leak detection, class-level heap reasoning, Node/browser snapshot assertions. | Puppeteer/Chromium orientation for browser scenarios; full leak graph analysis can be heavy. | Extend existing heap snapshot diff/memory soak docs with OpenSafari scenario budgets and class-delta thresholds. |
+| Clinic.js | Fast local Node profiling for event loop/flame/heap. | Dev-time tool, not runtime feature. | Document as optional triage command for memory/latency regressions. |
+| Node heap snapshots | Built-in and dependency-free; good for CI artifacts. | Snapshot creation can pause process and double memory temporarily. | Keep behind explicit soak/live validation only; never default-on hot path. |
+| autocannon | Simple HTTP benchmark for transports. | Only applies to HTTP/SSE transport, not stdio/local MCP or simulator latency. | Optional benchmark recipe for HTTP MCP transport; not mandatory now. |
+
+**Conclusion:** Mandatory improvement is a dependency-free OpenSafari memory/trace validation contract that uses existing metrics/heap-snapshot surfaces and avoids default runtime overhead.
+
+### Flutter stability tooling
+
+| Library | Strengths OpenSafari can learn from | Weaknesses / mismatch | Safe OpenSafari application |
+|---|---|---|---|
+| Flutter DevTools Memory | Allocation timeseries, diff snapshots, GC-aware leak workflows. | GUI/manual; release builds lack VM Service. | Make OpenSafari's `flutter_allocation_profile` leak workflow explicit and thresholded. |
+| leak_tracker | Test-time leak assertions around object lifecycle. | Dart package inside target app; OpenSafari cannot require apps to include it. | Provide external VM Service budget checks; recommend leak_tracker only as app-side complement. |
+| Patrol | Flutter-first E2E plus native automation; good at native permission/dialog flows. | Requires app/test harness; not a generic MCP runtime dependency. | Mirror the pattern: combine Flutter VM Service + native AX assertions in recipes. |
+| Maestro | Semantics-tree, black-box flows, simple YAML, device-level interactions. | Separate runner and DSL; would duplicate OpenSafari orchestration. | Strengthen semantics-first QA and live validation scripts; avoid separate DSL dependency. |
+| Appium Flutter Driver | Flutter widget selectors via Appium ecosystem. | Heavy WebDriver/Appium stack; requires app instrumentation. | Keep Flutter VM Service APIs; do not route through Appium. |
+
+**Conclusion:** Mandatory improvement is a Flutter memory budget/live validation recipe and small helper semantics, not importing Patrol/Maestro/Appium.
+
+### Fast web search engines
+
+| Library | Strengths | Weaknesses / mismatch | OpenSafari action |
+|---|---|---|---|
+| Typesense / Meilisearch | Fast typo-tolerant indexing, search-as-you-type. | Product search engine, not browser automation core. Adds service dependency. | Out of scope for runtime. Could inspire local artifact search later, but not mandatory. |
+| SearXNG | Privacy-preserving metasearch. | Running external metasearch is unrelated to iOS Safari automation. | Do not adopt. |
+| Tantivy / Quickwit | Fast indexing/log search. | Rust/service integration heavy. | Only consider if log volume outgrows simple JSON artifacts. Not mandatory. |
+
+**Conclusion:** Fast web search does not fit the OpenSafari core. The aligned substitute is searchable local trace/report artifacts, not a search engine dependency.
+
+### Fast login / authentication libraries
+
+| Library | Strengths | Weaknesses / mismatch | Safe OpenSafari application |
+|---|---|---|---|
+| SimpleWebAuthn | Clear passkey/WebAuthn ceremony model. | OpenSafari is not an RP/auth server; simulator passkey UX may require system prompts/keychain state. | Add passkey/login validation guidance and prompt-handling recipes; no dependency. |
+| Auth.js / Better Auth | Developer-friendly OAuth/session patterns. | Web app auth frameworks, not OpenSafari runtime concerns. | Use only as examples in docs for test-app login flows. |
+| Keycloak / ZITADEL / Logto / Ory / SuperTokens | Mature IAM/SSO options. | Heavy infra; adopting them would be out of scope and brittle for a browser automation MCP server. | Do not adopt. Existing #699 covers auth profile persistence; avoid duplicate work. |
+
+**Conclusion:** The mandatory login work is already represented by #699. New work should add non-duplicative validation guidance for login/passkey prompt automation only if it directly supports OpenSafari verification.
+
+## Mandatory improvement candidates
+
+After comparing the libraries to repository direction and existing issues, only the following are mandatory now:
+
+1. **Lightweight OpenSafari action trace artifacts** inspired by Playwright traces and OpenTelemetry spans.
+   - Why: failures in simulator/WebKit/native flows need correlated command timing, timeout, context, screenshots/log references, and recovery hints.
+   - Risk control: JSON artifact only; no default behavior change; no dependency.
+
+2. **Stable native wait diagnostics** inspired by WebdriverIO/Playwright auto-wait.
+   - Why: `app_wait_for` currently reports timeout/query/polls but not last observed candidates, stability windows, or why a visible/enabled condition failed.
+   - Risk control: backward-compatible optional parameters and richer JSON response.
+
+3. **Flutter memory budget live validation recipe/tooling** inspired by Flutter DevTools Memory, leak_tracker, and memlab.
+   - Why: OpenSafari already exposes allocation profiles and heap snapshots, but merge/post-merge validation needs a repeatable budget contract.
+   - Risk control: VM Service only, debug/profile builds only, optional scripts/docs; no app package dependency.
+
+4. **Passkey/login prompt live-validation guidance** inspired by SimpleWebAuthn/Auth.js, but scoped to OpenSafari automation.
+   - Why: fast login is central to app QA, but OpenSafari should not become an auth framework. The required work is a validation recipe using existing auth profile, native alert handling, app/webview tools, and explicit decisions.
+   - Risk control: documentation/issue contract unless a gap is found; do not duplicate #699.
+
+Not mandatory now: Appium/WebDriver re-platforming, Sentry/OpenTelemetry SDK embedding, Typesense/Meilisearch/SearXNG/Tantivy/Quickwit services, Auth.js/Keycloak/ZITADEL/Ory/SuperTokens runtime integrations.
+
+## Sources
+
+- Appium introduction/platform support/capabilities: https://appium.io/docs/en/latest/ and https://appium.github.io/appium.io/docs/en/about-appium/intro/
+- WebdriverIO auto-wait/timeouts/protocol docs: https://webdriver.io/docs/autowait/ and https://webdriver.io/docs/timeouts
+- Playwright trace viewer and browser docs: https://playwright.dev/docs/trace-viewer-intro and https://playwright.dev/docs/browsers
+- Puppeteer docs: https://developer.chrome.com/docs/puppeteer
+- Selenium WebDriver BiDi docs: https://www.selenium.dev/documentation/webdriver/bidi/
+- OpenTelemetry JS docs: https://opentelemetry.io/docs/languages/js/
+- memlab docs: https://facebook.github.io/memlab/docs/intro
+- Flutter DevTools Memory docs: https://docs.flutter.dev/tools/devtools/memory
+- Maestro Flutter/how-it-works docs: https://docs.maestro.dev/get-started/supported-platform/flutter and https://docs.maestro.dev/get-started/how-maestro-works
+- Typesense docs: https://typesense.org/docs/
+- Meilisearch docs: https://www.meilisearch.com/docs/
+- SimpleWebAuthn docs: https://simplewebauthn.dev/docs/
+- Auth.js docs: https://authjs.dev/
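Candidate 2 in the roadmap asks `app_wait_for` to report last observed candidates, stability windows, and the reason a condition failed. The sketch below shows one way that bookkeeping could look as a pure tracker fed by a poll loop. Every name here (`createStableWaitTracker`, `WaitDiagnostics`, `StableWaitOptions`, the `reason` strings) is hypothetical and is not part of OpenSafari or this diff.

```typescript
// Hypothetical sketch: these names are illustrative and do not exist in OpenSafari.
interface WaitDiagnostics<T> {
  state: 'pending' | 'stable' | 'timeout';
  polls: number;
  elapsedMs: number;
  stableForMs: number;
  lastObserved: T | null;
  reason: string;
}

interface StableWaitOptions<T> {
  startedAtMs: number;
  timeoutMs: number;
  stabilityWindowMs: number;
  matches: (candidate: T) => boolean;
}

function createStableWaitTracker<T>(options: StableWaitOptions<T>) {
  let polls = 0;
  let matchingSinceMs: number | null = null;
  let lastObserved: T | null = null;

  return {
    // Feed one poll result (null when the query found nothing) taken at nowMs.
    observe(candidate: T | null, nowMs: number): WaitDiagnostics<T> {
      polls += 1;
      let reason: string;
      if (candidate === null) {
        matchingSinceMs = null;
        reason = 'no candidate matched the query';
      } else {
        lastObserved = candidate;
        if (options.matches(candidate)) {
          // Keep the first timestamp at which the condition started holding.
          if (matchingSinceMs === null) matchingSinceMs = nowMs;
          reason = 'condition met, waiting for stability window';
        } else {
          // Any failing poll resets the stability window.
          matchingSinceMs = null;
          reason = 'candidate found but condition failed';
        }
      }

      const elapsedMs = nowMs - options.startedAtMs;
      const stableForMs = matchingSinceMs === null ? 0 : nowMs - matchingSinceMs;
      let state: WaitDiagnostics<T>['state'] = 'pending';
      if (matchingSinceMs !== null && stableForMs >= options.stabilityWindowMs) {
        state = 'stable';
        reason = 'condition held for the stability window';
      } else if (elapsedMs >= options.timeoutMs) {
        state = 'timeout';
      }
      return { state, polls, elapsedMs, stableForMs, lastObserved, reason };
    },
  };
}
```

A poll loop would call `observe` after each query and stop on `stable` or `timeout`. On timeout the returned object already carries the last candidate and the failure reason, which is the extra evidence the roadmap item describes.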
diff --git a/src/observability/action-trace.ts b/src/observability/action-trace.ts
@@ -0,0 +1,135 @@
+import { promises as fs } from 'fs';
+import * as path from 'path';
+
+export type ActionTraceStatus = 'passed' | 'failed' | 'timeout' | 'skipped';
+export type ActionTraceContext = 'webkit' | 'native' | 'flutter' | 'simulator' | 'orchestration' | 'unknown';
+
+export interface ActionTraceArtifact {
+  kind: 'screenshot' | 'console' | 'network' | 'crash' | 'log' | 'other';
+  path: string;
+}
+
+export interface ActionTraceEventInput {
+  action: string;
+  status: ActionTraceStatus;
+  context?: ActionTraceContext;
+  deviceId?: string;
+  startedAtMs: number;
+  endedAtMs: number;
+  timeoutMs?: number;
+  retryCount?: number;
+  error?: string;
+  metadata?: Record<string, unknown>;
+  artifacts?: ActionTraceArtifact[];
+}
+
+export interface ActionTraceEvent extends ActionTraceEventInput {
+  durationMs: number;
+}
+
+export interface ActionTraceDocument {
+  version: 1;
+  runId: string;
+  createdAt: string;
+  events: ActionTraceEvent[];
+}
+
+const MAX_STRING_LENGTH = 500;
+const MAX_METADATA_KEYS = 30;
+const SECRET_KEY_PATTERN = /(authorization|cookie|password|secret|token|credential|api[-_]?key)/i;
+
+export class ActionTraceRecorder {
+  private readonly events: ActionTraceEvent[] = [];
+  private readonly createdAt = new Date().toISOString();
+
+  constructor(private readonly runId: string) {}
+
+  record(input: ActionTraceEventInput): void {
+    this.events.push(normalizeEvent(input));
+  }
+
+  toJSON(): ActionTraceDocument {
+    return {
+      version: 1,
+      runId: this.runId,
+      createdAt: this.createdAt,
+      events: [...this.events],
+    };
+  }
+
+  async write(filePath: string): Promise<void> {
+    await writeActionTrace(filePath, this.toJSON());
+  }
+}
+
+export async function writeActionTrace(
+  filePath: string,
+  document: ActionTraceDocument,
+): Promise<void> {
+  await fs.mkdir(path.dirname(filePath), { recursive: true });
+  await fs.writeFile(filePath, JSON.stringify(document, null, 2) + '\n', 'utf8');
+}
+
+export function normalizeEvent(input: ActionTraceEventInput): ActionTraceEvent {
+  const startedAtMs = finiteOrZero(input.startedAtMs);
+  const endedAtMs = Math.max(startedAtMs, finiteOrZero(input.endedAtMs));
+  return {
+    action: truncate(input.action || 'unknown'),
+    status: input.status,
+    context: input.context ?? 'unknown',
+    ...(input.deviceId ? { deviceId: truncate(input.deviceId) } : {}),
+    startedAtMs,
+    endedAtMs,
+    durationMs: endedAtMs - startedAtMs,
+    ...(typeof input.timeoutMs === 'number' ? { timeoutMs: Math.max(0, finiteOrZero(input.timeoutMs)) } : {}),
+    ...(typeof input.retryCount === 'number' ? { retryCount: Math.max(0, Math.floor(finiteOrZero(input.retryCount))) } : {}),
+    ...(input.error ? { error: truncate(input.error) } : {}),
+    ...(input.metadata ? { metadata: sanitizeMetadata(input.metadata) } : {}),
+    ...(input.artifacts ? { artifacts: input.artifacts.slice(0, 20).map(sanitizeArtifact) } : {}),
+  };
+}
+
+export function sanitizeMetadata(metadata: Record<string, unknown>): Record<string, unknown> {
+  const out: Record<string, unknown> = {};
+  for (const key of Object.keys(metadata).slice(0, MAX_METADATA_KEYS)) {
+    if (SECRET_KEY_PATTERN.test(key)) {
+      out[key] = '[REDACTED]';
+      continue;
+    }
+    out[key] = sanitizeValue(metadata[key]);
+  }
+  return out;
+}
+
+function sanitizeValue(value: unknown): unknown {
+  if (typeof value === 'string') return truncate(value);
+  if (typeof value === 'number' || typeof value === 'boolean' || value === null) return value;
+  if (Array.isArray(value)) return value.slice(0, 20).map(sanitizeValue);
+  if (typeof value === 'object' && value !== null) {
+    const out: Record<string, unknown> = {};
+    for (const key of Object.keys(value as Record<string, unknown>).slice(0, MAX_METADATA_KEYS)) {
+      out[key] = SECRET_KEY_PATTERN.test(key)
+        ? '[REDACTED]'
+        : sanitizeValue((value as Record<string, unknown>)[key]);
+    }
+    return out;
+  }
+  return truncate(String(value));
+}
+
+function sanitizeArtifact(artifact: ActionTraceArtifact): ActionTraceArtifact {
+  return {
+    kind: artifact.kind,
+    path: truncate(artifact.path),
+  };
+}
+
+function finiteOrZero(value: number): number {
+  return Number.isFinite(value) ? value : 0;
+}
+
+function truncate(value: string): string {
+  return value.length > MAX_STRING_LENGTH
+    ? `${value.slice(0, MAX_STRING_LENGTH)}…`
+    : value;
+}
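One thing the sanitizer above does not bound is nesting depth: `sanitizeValue` recurses into arrays and objects without a cap, so a cyclic `metadata` value would recurse until the stack overflows. A possible hardening is sketched below. It is a standalone sketch, not part of the diff. `sanitizeBounded`, `MAX_DEPTH`, `MAX_KEYS`, `MAX_ITEMS`, and the `'[TRUNCATED]'` marker are assumptions made for illustration, while the key pattern and the string, key, and item limits mirror the values in the file.

```typescript
// Sketch only: sanitizeBounded, MAX_DEPTH and '[TRUNCATED]' are assumptions, not part of the diff.
const SECRET_KEY_PATTERN = /(authorization|cookie|password|secret|token|credential|api[-_]?key)/i;
const MAX_STRING_LENGTH = 500;
const MAX_KEYS = 30;
const MAX_ITEMS = 20;
const MAX_DEPTH = 4;

function truncate(value: string): string {
  return value.length > MAX_STRING_LENGTH
    ? `${value.slice(0, MAX_STRING_LENGTH)}…`
    : value;
}

function sanitizeBounded(value: unknown, depth = 0): unknown {
  if (typeof value === 'string') return truncate(value);
  // NaN and Infinity would serialize as null anyway; make that explicit.
  if (typeof value === 'number') return Number.isFinite(value) ? value : null;
  if (typeof value === 'boolean' || value === null) return value;
  // undefined, functions, bigints and symbols become bounded strings.
  if (typeof value !== 'object') return truncate(String(value));
  // The depth cap also terminates cyclic structures.
  if (depth >= MAX_DEPTH) return '[TRUNCATED]';
  if (Array.isArray(value)) {
    return value.slice(0, MAX_ITEMS).map((item) => sanitizeBounded(item, depth + 1));
  }
  const record = value as Record<string, unknown>;
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(record).slice(0, MAX_KEYS)) {
    out[key] = SECRET_KEY_PATTERN.test(key)
      ? '[REDACTED]'
      : sanitizeBounded(record[key], depth + 1);
  }
  return out;
}
```

With the depth cap in place the output is always a finite tree, so `JSON.stringify` in `writeActionTrace` cannot hit a circular-structure error on sanitized metadata.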
diff --git a/src/orchestration/scenario-runner.ts b/src/orchestration/scenario-runner.ts
@@ -1,8 +1,11 @@
 import { SimulatorPool, PooledSimulator } from '../simulator/pool';
+import { ActionTraceRecorder } from '../observability/action-trace';
 
 export interface TestScenario {
   name: string;
   steps: TestStep[];
+  /** Optional JSON trace artifact path for action-level live-validation evidence. */
+  tracePath?: string;
 }
 
 export interface TestStep {
@@ -44,18 +47,36 @@ export class ScenarioRunner {
   async run(scenario: TestScenario): Promise<ScenarioResult> {
     const startTime = Date.now();
     const stepResults: StepResult[] = [];
+    const trace = scenario.tracePath ? new ActionTraceRecorder(scenario.name) : null;
     let allPassed = true;
 
     for (let i = 0; i < scenario.steps.length; i++) {
       const step = scenario.steps[i];
       const result = await this.executeStep(i, step);
+      for (const device of result.devices) {
+        trace?.record({
+          action: `${step.action}:${i}`,
+          status: device.passed ? 'passed' : 'failed',
+          context: step.action === 'navigate' || step.action === 'assert' ? 'webkit' : 'orchestration',
+          deviceId: device.deviceId,
+          startedAtMs: Math.max(startTime, Date.now() - device.timing),
+          endedAtMs: Date.now(),
+          timeoutMs: step.timeout,
+          error: device.error,
+          metadata: { device: device.device, result: device.result },
+        });
+      }
       stepResults.push(result);
       if (!result.passed) {
         allPassed = false;
         // Continue executing remaining steps even on failure
       }
     }
 
+    if (trace && scenario.tracePath) {
+      await trace.write(scenario.tracePath);
+    }
+
     const duration = Date.now() - startTime;
     const passedSteps = stepResults.filter(s => s.passed).length;
     const totalDevices = this.pool.getAll().length;
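The trace timestamps in this hunk are back-computed: the step is recorded as ending "now" and starting `device.timing` milliseconds earlier, with both values clamped so they never precede the run start. The arithmetic can be checked in isolation with a small helper. `deriveStepWindow` and `StepWindow` are hypothetical names used only for this sketch, and it assumes `device.timing` is a duration in milliseconds.

```typescript
// Illustrative helper; deriveStepWindow is a hypothetical name, not part of the diff.
interface StepWindow {
  startedAtMs: number;
  endedAtMs: number;
  durationMs: number;
}

function deriveStepWindow(runStartMs: number, nowMs: number, timingMs: number): StepWindow {
  // The end is "now", clamped so it never precedes the run start.
  const endedAtMs = Math.max(runStartMs, nowMs);
  // The start is back-computed from the reported timing, clamped to the run start.
  const startedAtMs = Math.max(runStartMs, nowMs - timingMs);
  return { startedAtMs, endedAtMs, durationMs: Math.max(0, endedAtMs - startedAtMs) };
}

// Run started at 1000, step observed at 1600 after taking 250 ms:
// startedAtMs = 1350, endedAtMs = 1600, durationMs = 250.
const typical = deriveStepWindow(1000, 1600, 250);

// A reported timing longer than the whole run is clamped to the run start:
// startedAtMs = 1000, endedAtMs = 1600, durationMs = 600.
const clamped = deriveStepWindow(1000, 1600, 900);
```

Because every device in a step is recorded after `executeStep` returns, the recorded end time is an upper bound rather than the device's own completion time; the duration is the more trustworthy field.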

diff --git a/tests/unit/action-trace.test.ts b/tests/unit/action-trace.test.ts
@@ -0,0 +1,50 @@
+import { mkdtempSync, readFileSync } from 'fs';
+import { tmpdir } from 'os';
+import * as path from 'path';
+import { ActionTraceRecorder, normalizeEvent, sanitizeMetadata } from '../../src/observability/action-trace';
+
+describe('action trace artifacts', () => {
+  it('normalizes timing and duration', () => {
+    const event = normalizeEvent({
+      action: 'tap',
+      status: 'passed',
+      startedAtMs: 100,
+      endedAtMs: 350,
+    });
+    expect(event.durationMs).toBe(250);
+    expect(event.context).toBe('unknown');
+  });
+
+  it('redacts secret-like metadata keys recursively', () => {
+    expect(sanitizeMetadata({
+      authorization: 'Bearer secret',
+      nested: { password: 'pw', ok: 'value' },
+    })).toEqual({
+      authorization: '[REDACTED]',
+      nested: { password: '[REDACTED]', ok: 'value' },
+    });
+  });
+
+  it('writes bounded JSON trace documents', async () => {
+    const dir = mkdtempSync(path.join(tmpdir(), 'opensafari-trace-'));
+    const tracePath = path.join(dir, 'trace.json');
+    const recorder = new ActionTraceRecorder('run-1');
+    recorder.record({
+      action: 'navigate',
+      status: 'timeout',
+      context: 'webkit',
+      deviceId: 'device-1',
+      startedAtMs: 0,
+      endedAtMs: 10,
+      timeoutMs: 10,
+      error: 'timed out',
+      metadata: { token: 'secret', url: 'https://example.com' },
+    });
+
+    await recorder.write(tracePath);
+    const parsed = JSON.parse(readFileSync(tracePath, 'utf8'));
+    expect(parsed.version).toBe(1);
+    expect(parsed.events).toHaveLength(1);
+    expect(parsed.events[0].metadata.token).toBe('[REDACTED]');
+  });
+});
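The last test parses the written file back into a plain object. A downstream consumer could do the same to triage a run, which is the "searchable local trace artifact" idea from the roadmap doc. The sketch below is one possible shape for such a consumer. `summarizeTrace` and `TraceSummary` are hypothetical, and the event and document types are re-declared locally with only the fields the summary needs, so the block stands alone.

```typescript
// Hypothetical consumer; summarizeTrace and TraceSummary are not part of the diff.
type TraceStatus = 'passed' | 'failed' | 'timeout' | 'skipped';

interface TraceEvent {
  action: string;
  status: TraceStatus;
  durationMs: number;
  deviceId?: string;
  error?: string;
}

interface TraceDocument {
  version: 1;
  runId: string;
  createdAt: string;
  events: TraceEvent[];
}

interface TraceSummary {
  runId: string;
  total: number;
  byStatus: Record<TraceStatus, number>;
  slowest: TraceEvent | null;
  problems: TraceEvent[];
}

function summarizeTrace(document: TraceDocument): TraceSummary {
  const byStatus: Record<TraceStatus, number> = { passed: 0, failed: 0, timeout: 0, skipped: 0 };
  let slowest: TraceEvent | null = null;
  const problems: TraceEvent[] = [];
  for (const event of document.events) {
    byStatus[event.status] += 1;
    // Track the single slowest event as a first triage hint.
    if (slowest === null || event.durationMs > slowest.durationMs) slowest = event;
    // Failed and timed-out events are the ones worth opening first.
    if (event.status === 'failed' || event.status === 'timeout') problems.push(event);
  }
  return { runId: document.runId, total: document.events.length, byStatus, slowest, problems };
}
```

Feeding it the result of `JSON.parse` on a trace file yields per-status counts plus the events to inspect, without any search service or extra dependency.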