fix(server): don't treat a recycled PID as a running daemon #1305
chosen1hyj wants to merge 1 commit into
acquirePidLock and isLocked only checked whether the lock file's PID was
alive via process.kill(pid, 0). After an unclean daemon shutdown the stale
paseo.pid is left behind; once the OS recycles that PID to an unrelated
process the liveness check still passes, so the daemon refuses to start
("Another Paseo daemon is already running") and never binds its port —
every client (desktop and CLI) hangs on "connecting".
Verify process identity by comparing the live process's start time
(ps -o lstart) against the lock's startedAt. If they diverge beyond a
60s tolerance the PID was reused and the lock is stale, so we clear it and
start. Falls back to the previous conservative behaviour when the start
time can't be determined (e.g. Windows).
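The decision rule in this commit message can be sketched as a pure function. This is a minimal illustration, not the PR's actual code: `PidLock`, `lockOwnerStillRunning`, and `START_TIME_TOLERANCE_MS` are hypothetical names, and treating an unparseable `startedAt` conservatively is an assumption.

```typescript
// Hypothetical names throughout; the PR's real helpers are not reproduced here.
interface PidLock {
  pid: number;
  startedAt: string; // ISO timestamp stored in the lock file
}

const START_TIME_TOLERANCE_MS = 60_000; // the 60s tolerance from the commit message

/**
 * Decide whether a lock still belongs to the process that wrote it.
 * pidIsAlive:  result of a liveness probe such as process.kill(pid, 0)
 * liveStartMs: OS-reported start time of the live process, or null if unknown
 */
function lockOwnerStillRunning(
  lock: PidLock,
  pidIsAlive: boolean,
  liveStartMs: number | null,
): boolean {
  if (!pidIsAlive) return false; // nobody holds the PID, so the lock is stale
  if (liveStartMs === null) return true; // cannot verify identity: stay conservative
  const lockStartMs = Date.parse(lock.startedAt);
  if (Number.isNaN(lockStartMs)) return true; // assumption: unreadable timestamp is also conservative
  // Same process only if both timestamps agree within the tolerance.
  return Math.abs(liveStartMs - lockStartMs) <= START_TIME_TOLERANCE_MS;
}

// A lock written in 2020 whose PID now belongs to a process started in 2025.
const recycledLock: PidLock = { pid: 4242, startedAt: "2020-01-01T00:00:00.000Z" };
const unrelatedStartMs = Date.parse("2025-01-01T00:00:00.000Z");
const recycledVerdict = lockOwnerStillRunning(recycledLock, true, unrelatedStartMs); // false: lock is stale
```

Keeping the rule free of I/O means the probe and the start-time lookup can be stubbed in tests, which is one way to exercise the recycled-PID case without spawning processes.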
| Filename | Overview |
|---|---|
| packages/server/src/server/pid-lock.ts | Adds getProcessStartTimeMs (via ps -o lstart) and isLockProcessAlive to distinguish a recycled PID from the original daemon. Logic restructure in acquirePidLock and isLocked is correct; minor semantic mismatch: startedAt records lock-write time rather than actual process-start time, making the 60s tolerance the load-bearing safety net. |
| packages/server/src/server/pid-lock.test.ts | Adds two targeted staleness tests: one reproducing the recycled-PID bug with a year-2020 startedAt, one confirming a live daemon is still rejected using an independently-derived OS start time. Both test through the public module interface. |
Comments Outside Diff (1)
- packages/server/src/server/pid-lock.ts, line 142-144: The lock's `startedAt` is set to `new Date().toISOString()`, the moment the lock is written, but `isLockProcessAlive` compares it against the process's actual OS start time from `ps lstart`. These are different timestamps: the OS start time is always earlier, by however long it took the daemon to reach `acquirePidLock`. The 60-second tolerance window exists entirely to cover this gap. A daemon whose startup exceeds 60 seconds (loading plugins, cold disk, etc.) would have `|liveStartMs - lockStartMs| > 60_000`, causing `isLockProcessAlive` to return `false` and allowing a second instance to steal the lock. Recording the actual OS start time of the locking process eliminates the gap and lets the tolerance shrink to just a few seconds for `lstart`'s second-granularity rounding.
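The arithmetic behind this comment can be worked through with made-up numbers. In the sketch below, the 90-second startup, the 2-second reduced tolerance, and the name `withinTolerance` are all illustrative assumptions, not values or code from the PR.

```typescript
const TOLERANCE_MS = 60_000; // tolerance used by the PR

// True when two timestamps are close enough to be treated as the same process.
function withinTolerance(
  liveStartMs: number,
  lockStartMs: number,
  toleranceMs: number = TOLERANCE_MS,
): boolean {
  return Math.abs(liveStartMs - lockStartMs) <= toleranceMs;
}

const osStartMs = Date.parse("2025-01-01T12:00:00Z"); // illustrative OS start time
const slowStartupMs = 90_000; // hypothetical: daemon takes 90s to reach the lock

// Current scheme: startedAt is the lock-write time, 90s after the OS start time.
const lockWriteMs = osStartMs + slowStartupMs;
const currentSchemeOk = withinTolerance(osStartMs, lockWriteMs); // false: live daemon looks stale

// Suggested scheme: startedAt records the OS start time itself, so only
// lstart's rounding remains and a much smaller tolerance suffices.
const suggestedSchemeOk = withinTolerance(osStartMs, osStartMs, 2_000); // true
```

With the lock-write timestamp, any startup longer than the tolerance makes a healthy daemon indistinguishable from a recycled PID; with the OS start time recorded, startup duration drops out of the comparison entirely.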
Reviews (1): Last reviewed commit: "fix(server): don't treat a recycled PID ..."
Closes #1304.
Problem
The PID lock's liveness check is identity-blind: `isPidRunning()` (used by both `acquirePidLock()` and `isLocked()` in `packages/server/src/server/pid-lock.ts`) only does `process.kill(pid, 0)`, which tells you a process with that PID exists, not that it's still the daemon. After an unclean shutdown the stale `~/.paseo/paseo.pid` survives; once the OS recycles that PID to an unrelated process, the daemon refuses to start (`Another Paseo daemon is already running`), exits 1, never binds 6767, and every client hangs on "connecting". Full evidence in #1304.
Fix
Verify process identity before trusting the lock: compare the live process's start time (`ps -o lstart`, the portable keyword on both macOS and Linux; `LC_ALL=C` for a parseable timestamp) against the lock's `startedAt`. If they diverge beyond a 60s tolerance, the PID was reused, so clear the stale lock and start. `lstart` is second-granularity and the lock is written a beat after the daemon starts, so a genuine daemon's two timestamps sit within a few seconds, while a recycled PID differs by the daemon's whole lifetime. When the start time can't be determined (e.g. `ps` unavailable on Windows), it falls back to the previous conservative behaviour, so there's no regression there.
Testing
- `pid-lock.test.ts`: one reproduces the recycled-PID bug (fails on `main`, passes with the fix), one confirms a genuinely-live daemon is still rejected (guards against over-correcting).
- `oxfmt`, `oxlint`, and `tsgo` typecheck across the whole workspace all pass (via the repo's lefthook pre-commit hook).
- No protocol/schema changes.
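For reference, the Fix section depends on turning `lstart` output into a millisecond timestamp. The sketch below shows one way that parsing could look, assuming the C-locale layout `Mon Jan  6 10:15:30 2025` (as printed by, for example, `LC_ALL=C ps -o lstart= -p <pid>`) and assuming the value is in the machine's local time zone. `parseLstart` is a hypothetical helper, not the PR's `getProcessStartTimeMs`.

```typescript
const MONTHS = [
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
];

/**
 * Parse a C-locale lstart timestamp such as "Mon Jan  6 10:15:30 2025".
 * Interprets the fields as local time and returns epoch milliseconds,
 * or null when the text does not match the expected layout.
 */
function parseLstart(text: string): number | null {
  const match =
    /^\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+(\d{4})$/.exec(
      text.trim(),
    );
  if (match === null) return null;
  const [, mon = "", day = "", hh = "", mm = "", ss = "", year = ""] = match;
  const monthIndex = MONTHS.indexOf(mon);
  if (monthIndex < 0) return null; // unknown month name: treat as unparseable
  return new Date(
    Number(year),
    monthIndex,
    Number(day),
    Number(hh),
    Number(mm),
    Number(ss),
  ).getTime();
}
```

Returning `null` on any mismatch lines up with the conservative fallback described above: a caller that cannot determine the start time keeps treating the lock as held.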
I actually hit this myself — both the desktop app and the CLI were stuck on "connecting" indefinitely. Digging into the logs, the daemon had exited uncleanly the night before and left ~/.paseo/paseo.pid behind, and the OS had since recycled that PID to an unrelated printtool process, so the lock check thought a daemon was still running. Deleting ~/.paseo/paseo.pid and relaunching fixed it immediately, which confirmed the root cause. I tested this change locally: the new test reproduces the recycled-PID case (fails on main, passes with the fix), the guard test confirms a genuinely-live daemon is still rejected, and typecheck/lint/format all pass via the pre-commit hook. Happy to adjust the approach (tolerance value, or verifying identity a different way) if you'd prefer.