
Avoid busy-spin waits in zombie cleanup registry lock #754

@shaun0927


Why OpenSafari should address this

src/reliability/zombie-cleanup.ts uses a synchronous filesystem lock to guard the shared simulator registry. When another process holds the lock, acquireLock() waits out each LOCK_RETRY_MS retry interval in a tight busy-spin loop, which keeps a CPU core fully busy for the whole wait. In multi-session or CI runs, repeated contention wastes CPU in the main Node process exactly when OpenSafari is trying to boot, clean, or protect simulators.
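
For reference, a busy-spin wait of the kind described above generally looks like the sketch below. This is a hypothetical illustration, not the actual code in src/reliability/zombie-cleanup.ts; `busySpinWait` is an invented name. The loop re-reads the clock until the deadline passes and never yields, so on an otherwise idle machine the CPU time it reports is close to its wall-clock time.

```typescript
// Hypothetical illustration of a busy-spin wait; NOT the actual code in
// src/reliability/zombie-cleanup.ts.
function busySpinWait(ms: number): number {
  const deadline = Date.now() + ms;
  let iterations = 0;
  // Re-reads the clock until the deadline passes and never yields,
  // so the thread stays on-CPU for the entire interval.
  while (Date.now() < deadline) {
    iterations++;
  }
  return iterations;
}

// Rough measurement: wall-clock time vs. CPU time for one 50 ms spin.
const start = Date.now();
const cpuBefore = process.cpuUsage();
const spins = busySpinWait(50);
const cpuUsed = process.cpuUsage(cpuBefore);
const elapsedMs = Date.now() - start;
const cpuMs = (cpuUsed.user + cpuUsed.system) / 1000;
console.log(`${spins} spins: ${elapsedMs} ms wall, ~${cpuMs.toFixed(1)} ms CPU`);
```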

This is directionally aligned with OpenSafari because simulator lifecycle reliability is a core user-facing feature. The fix should reduce CPU burn under contention without changing registry semantics.

Risk / user impact

  • Severity: medium performance/reliability risk.
  • User impact: slow or CPU-heavy runs during cleanup/boot contention can make automation appear hung or flaky.
  • The code path is safety-sensitive because it protects active simulators from zombie cleanup, so the first fix must preserve existing lock behavior.

How to implement

  • Replace the busy-spin retry wait with a blocking sleep primitive that does not burn CPU, or an equivalent bounded backoff helper.
  • Preserve the current stale-lock timeout, retry interval, and failure behavior.
  • Keep registry APIs synchronous unless a larger async migration is explicitly planned.
  • Add unit coverage for the retry wait helper where practical.
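
A minimal sketch of the dependency-free option, assuming the lock is a file created exclusively with `fs.openSync(path, "wx")` and that the current constants are the 50 ms retry interval and 5 s timeout stated in this issue. `sleepSync`, `acquireLockSync`, `releaseLockSync`, and the temp-directory lock path are hypothetical names; throwing on timeout is an assumption (keep whatever failure behavior the module has today), and the module's existing stale-lock handling is not shown and should be carried over unchanged. `Atomics.wait` on an `Int32Array` backed by a `SharedArrayBuffer` parks the thread in the kernel until the timeout elapses. Node.js permits this on the main thread (browsers generally do not). The event loop stays blocked during the sleep, just as it does during the busy-spin, which is consistent with keeping the registry APIs synchronous.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Values stated in the issue: 50 ms retry cadence, 5 s lock timeout.
const LOCK_RETRY_MS = 50;
const LOCK_TIMEOUT_MS = 5_000;

// One shared cell that is never notified, so Atomics.wait always times out.
const sleepCell = new Int32Array(new SharedArrayBuffer(4));

/** Block the current thread for `ms` milliseconds without spinning the CPU. */
function sleepSync(ms: number): void {
  if (ms <= 0) return;
  // Waits while sleepCell[0] === 0 (always true) until the timeout elapses.
  Atomics.wait(sleepCell, 0, 0, ms);
}

/**
 * Hypothetical acquisition loop: create the lock file exclusively, retrying
 * every LOCK_RETRY_MS until timeoutMs has passed. Returns the open fd.
 */
function acquireLockSync(lockPath: string, timeoutMs = LOCK_TIMEOUT_MS): number {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      // "wx" fails with EEXIST if another process already holds the lock.
      const fd = fs.openSync(lockPath, "wx");
      fs.writeSync(fd, String(process.pid));
      return fd;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
    const remaining = deadline - Date.now();
    if (remaining <= 0) {
      throw new Error(`Timed out after ${timeoutMs} ms waiting for ${lockPath}`);
    }
    // Sleep in the kernel between retries instead of busy-spinning.
    sleepSync(Math.min(LOCK_RETRY_MS, remaining));
  }
}

function releaseLockSync(lockPath: string, fd: number): void {
  fs.closeSync(fd);
  fs.rmSync(lockPath, { force: true });
}

// Usage: acquire, hold, and release a lock in a temporary directory.
const lockPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "registry-")), "registry.lock");
const fd = acquireLockSync(lockPath);
// ... read/modify/write the registry while holding the lock ...
releaseLockSync(lockPath, fd);
```

Only the wait primitive changes: the retry interval, the deadline check, and the point at which acquisition gives up are the same as in a spin-based loop.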

Decisions needed before implementation

  1. Whether to use Atomics.wait for a dependency-free synchronous sleep now, or to migrate the registry lock to async APIs later.
  2. Whether future work should add jitter/backoff; the first PR should preserve the current 50ms retry cadence.
  3. Whether the lock timeout should remain 5s; the first PR should not alter it.
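
If follow-up work does adopt jitter/backoff (decision 2), one common option is capped exponential backoff with "full jitter". The sketch below is hypothetical: `retryDelayMs` and the 400 ms cap are illustrative choices, not values from the codebase, and adopting this would change the 50 ms cadence, so it belongs in a later PR rather than the first one.

```typescript
/**
 * Hypothetical follow-up option (not for the first PR): capped exponential
 * backoff with full jitter. The ceiling starts at baseMs and doubles per
 * attempt up to capMs; the delay is drawn uniformly from [0, ceiling).
 * `random` is injectable so the schedule can be unit-tested deterministically.
 */
function retryDelayMs(
  attempt: number,
  baseMs = 50,
  capMs = 400,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// With random() fixed at 0.5, the ceilings are 50, 100, 200, 400, 400 ms,
// so the delays are half of each: [25, 50, 100, 200, 200].
const schedule = [0, 1, 2, 3, 4].map((a) => retryDelayMs(a, 50, 400, () => 0.5));
console.log(schedule.join(", "));
```

In a retry loop, the computed delay would take the place of the fixed LOCK_RETRY_MS passed to the sleep call, while the overall timeout check would still bound the total wait.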

Success criteria

  • Lock retry no longer uses a JavaScript busy-spin loop.
  • Registry lock semantics and timeout remain unchanged.
  • No new dependency is added.
  • Targeted tests/lint/build and CI pass.
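
To check the first criterion in a unit test, one option is to compare wall-clock time with process CPU time from `process.cpuUsage()`: a busy-spin uses roughly as much CPU as it waits, while a kernel-level sleep uses almost none. The sketch below is a standalone script using `node:assert`; `sleepSync` is a hypothetical `Atomics.wait`-based helper defined inline so the check is self-contained. The repository's test runner is not named in this issue, and the one-quarter CPU threshold is an arbitrary, deliberately loose bound.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical helper under test: sleeps via Atomics.wait on a cell that is
// never notified, so the call always returns after the timeout.
const sleepCell = new Int32Array(new SharedArrayBuffer(4));
function sleepSync(ms: number): void {
  if (ms <= 0) return;
  Atomics.wait(sleepCell, 0, 0, ms);
}

/** Run `fn` and return its wall-clock time and CPU time (user + system), in ms. */
function measure(fn: () => void): { wallMs: number; cpuMs: number } {
  const startWall = Date.now();
  const startCpu = process.cpuUsage();
  fn();
  const cpu = process.cpuUsage(startCpu);
  return { wallMs: Date.now() - startWall, cpuMs: (cpu.user + cpu.system) / 1000 };
}

// 1. The helper actually waits for about the requested interval.
const { wallMs, cpuMs } = measure(() => sleepSync(200));
assert.ok(wallMs >= 190, `expected a ~200 ms wait, got ${wallMs} ms`);

// 2. It does not burn a core: a busy-spin would use roughly wallMs of CPU.
assert.ok(cpuMs < wallMs / 4, `sleep used ${cpuMs.toFixed(1)} ms of CPU`);

// 3. Zero or negative durations return immediately.
assert.ok(measure(() => sleepSync(0)).wallMs < 50);

console.log(`slept ${wallMs} ms using ${cpuMs.toFixed(2)} ms of CPU`);
```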

Post-merge OpenSafari live validation

  • Start two OpenSafari processes or cleanup routines that contend for the registry lock.
  • Confirm the waiting process does not consume a full CPU core while waiting.
  • Confirm active registered simulators are still protected from cleanup.
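
A rough local approximation of the first two checks, assuming the lock is a file acquired with exclusive create. A child Node process stands in for the competing OpenSafari process: the parent creates the lock file on the child's behalf, and the child deletes it after about 1 s. The temp-file path, the 1 s hold time, and the inline retry loop are hypothetical; this does not replace running two real OpenSafari processes, and it does not cover the third check (protection of registered simulators).

```typescript
import { spawn } from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LOCK_RETRY_MS = 50;
const sleepCell = new Int32Array(new SharedArrayBuffer(4));
const sleepSync = (ms: number): void => {
  if (ms > 0) Atomics.wait(sleepCell, 0, 0, ms);
};

// Hypothetical stand-in for the registry lock file.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), "contention-"));
const lockPath = path.join(lockDir, "registry.lock");

// "Other process": the lock is created for it up front, and it removes the
// file after ~1 s. It runs independently, so it can release the lock while
// this thread is blocked in Atomics.wait.
fs.writeFileSync(lockPath, "held-by-child");
const releaseScript = `setTimeout(() => require("fs").rmSync(${JSON.stringify(lockPath)}, { force: true }), 1000)`;
const holder = spawn(process.execPath, ["-e", releaseScript], { stdio: "ignore" });

// Waiting side: same retry cadence, sleeping instead of spinning.
const startWall = Date.now();
const startCpu = process.cpuUsage();
let fd = -1;
while (fd < 0) {
  try {
    fd = fs.openSync(lockPath, "wx");
  } catch (err) {
    if ((err as { code?: string }).code !== "EEXIST") throw err;
    if (Date.now() - startWall > 5_000) throw new Error("lock was never released");
    sleepSync(LOCK_RETRY_MS);
  }
}
const waitedMs = Date.now() - startWall;
const cpu = process.cpuUsage(startCpu);
const cpuMs = (cpu.user + cpu.system) / 1000;

fs.closeSync(fd);
fs.rmSync(lockDir, { recursive: true, force: true });
holder.unref();

const corePct = (100 * cpuMs) / waitedMs;
console.log(`waited ${waitedMs} ms for the lock using ${cpuMs.toFixed(1)} ms CPU (${corePct.toFixed(1)}% of one core)`);
```

A busy-spin waiter would report close to 100% of one core here; a sleeping waiter should report a small fraction of that.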

Direction/necessity review

  • Aligned: yes, improves simulator lifecycle reliability without changing feature behavior.
  • Necessary: yes, busy-spin contention is avoidable and harmful in long-running automation processes.
  • Minimal first PR: dependency-free sync sleep helper only; no registry redesign.

Metadata


  • Assignees: No one assigned
  • Labels: automation-roadmap (OpenSafari automation roadmap work items), performance (Performance optimization), reliability (Reliability and stability)
  • Projects: No projects
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests
