Skip to content

fix(server): disable SO_REUSEADDR to prevent silent port sharing (#3289)#3446

Closed
rodboev wants to merge 3 commits into
nesquena:masterfrom
rodboev:pr/disable-port-reuse
Closed

fix(server): disable SO_REUSEADDR to prevent silent port sharing (#3289)#3446
rodboev wants to merge 3 commits into
nesquena:masterfrom
rodboev:pr/disable-port-reuse

Conversation

@rodboev
Copy link
Copy Markdown
Contributor

@rodboev rodboev commented Jun 2, 2026

Thinking Path

  • On Windows, SO_REUSEADDR allows multiple processes to bind the same port simultaneously; requests route unpredictably between them. On macOS, Bug: duplicate local WebUI start can fight launchd-managed 8787 instance #3289 documented the same symptom between a manually started instance and a launchd-managed one.
  • The original approach, setting allow_reuse_address = False, catches the double-bind but breaks legitimate fast restart: after any accepted connection, the socket enters TIME_WAIT for ~60s on Linux/macOS, so ctl.sh restart and the os.execv self-update path in api/updates.py both fail with EADDRINUSE.
  • The reworked approach separates the two concerns: a live-listener probe before bind detects actively-serving instances (connect + GET /health), while SO_REUSEADDR stays enabled for TIME_WAIT rebind. On Windows specifically, SO_EXCLUSIVEADDRUSE replaces the SO_REUSEADDR suppression to prevent port hijacking at the kernel level.
  • A dying instance, one that's past serve_forever(), accepts TCP connections via kernel backlog but never processes HTTP requests. The probe times out and startup proceeds. That's what makes it safe for ctl.sh restart during the brief overlap between stop and start.
  • Added httpd.server_close() to the shutdown finally block so the listening socket is released immediately rather than lingering through cleanup.

What Changed

  • server.py: Removed allow_reuse_address = False. Added _abort_if_already_serving(host, port), a pre-bind probe that connects to the port, sends GET /health, and aborts startup only if a live HTTP response comes back. Added server_bind() override on QuietHTTPServer that sets SO_EXCLUSIVEADDRUSE on Windows. Added httpd.server_close() to the shutdown path.
  • tests/test_server_port_exclusivity.py (new, replaces test_server_no_reuse_address.py): Probe detection (live server → SystemExit), probe passthrough (free port, unresponsive socket, wildcard host normalization), and SO_EXCLUSIVEADDRUSE assertion on Windows.

Why It Matters

A second webui instance silently sharing the same port causes unpredictable request routing, split session state, and confusing behavior that is difficult to diagnose. The fix detects live instances before bind without breaking the fast-restart paths that ctl.sh and the self-update mechanism depend on.

Verification

  • python -m pytest tests/test_server_port_exclusivity.py -v --timeout=60
  • Manual: start the webui, then attempt to start a second instance on the same port; confirm it prints [!!] FATAL: Another server is already responding on ... and exits.
  • Manual: ctl.sh restart completes without EADDRINUSE on Linux/macOS.

Risks / Follow-ups

  • Probe timeout adds ~2s worst-case to cold start when something is listening on the port but not responding to HTTP. In practice this only happens during the brief ctl.sh restart overlap, and the 2s is concurrent with the old instance's shutdown cleanup.
  • ctl.sh already has its own pidfile-based guard (_current_pid in start_cmd). The Python-level probe catches cases that bypass ctl.sh: direct python server.py invocations, Windows Task Scheduler, and the os.execv self-update path.
  • SO_EXCLUSIVEADDRUSE is Windows-only. On Linux/macOS the probe is the sole guard, which is sufficient since POSIX SO_REUSEADDR only affects TIME_WAIT, not active-listener sharing.

Model Used

Claude Opus 4.6 via Claude Code CLI

@nesquena-hermes nesquena-hermes added hold changes-requested Maintainer left detailed feedback requesting changes; PR is waiting on author to address labels Jun 3, 2026
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Holding this for a redesign — putting it on hold + changes-requested.

A release-gate review (Codex) flagged that allow_reuse_address = False breaks legitimate fast restart. Verified on Linux: after even one accepted HTTP request, a shutdown() / server_close() followed by an immediate rebind fails with EADDRINUSE for the TIME_WAIT window (~60s). Since ctl.sh restart stops then immediately starts, a normal restart can fail — and that fail-fast-on-restart behavior is a worse user experience than the silent port-sharing bug (#3289) it's trying to fix.

The goal (stop two instances silently sharing port 8787 on Windows/macOS) is real and worth fixing, but globally disabling SO_REUSEADDR is the wrong lever. Suggested direction:

  • Keep fast-restart reuse (allow_reuse_address = True, i.e. remove the override), and
  • Add a separate duplicate-live-listener guard before bind: probe whether something is already actively serving on the port (a quick connect + health check), and refuse to start only when a live instance is detected — rather than letting SO_REUSEADDR semantics decide.
  • On Windows specifically, if needed, use an exclusive-bind strategy (SO_EXCLUSIVEADDRUSE) rather than relying on the absence of SO_REUSEADDR, since Windows reuse semantics differ from POSIX.

That preserves the crash/restart recovery path while still catching the genuine double-start case. Happy to look again once it's reworked along those lines.

@rodboev
Copy link
Copy Markdown
Contributor Author

rodboev commented Jun 3, 2026

Verified the restart regression. Traced it through ctl.sh restart (stop_cmd → start_cmd, ~5s gap) and the os.execv self-update path in api/updates.py:1076. Both rebind immediately after the prior socket closes, and with allow_reuse_address = False the TIME_WAIT window from any accepted connection would brick the restart for ~60s. Worse than the original bug.

Reworked in 9dbe92f along the lines you suggested. Three changes instead of the one-liner:

  1. Dropped allow_reuse_address = False, restoring the inherited True so fast restart keeps working on all platforms.

  2. Added a live-listener probe (_abort_if_already_serving, called in main() before QuietHTTPServer instantiation). It does a TCP connect + GET /health with a 2s timeout. A live instance responds and the startup aborts with a clear error. A dying instance, one that's past serve_forever() with the socket still lingering in kernel backlog, accepts the TCP connection but never processes the HTTP request. The probe times out and startup proceeds. That distinction is what lets ctl.sh restart work even in the brief overlap window. Connection refused (nothing listening, or TIME_WAIT) also falls through cleanly.

  3. SO_EXCLUSIVEADDRUSE on Windows in a server_bind() override. Windows SO_REUSEADDR semantics are the original problem: it silently allows two processes to share a port, unlike POSIX where it only affects TIME_WAIT rebind. SO_EXCLUSIVEADDRUSE prevents that at the kernel level, catching the race window the probe can't (two instances starting simultaneously). Also sets allow_reuse_address = False on the instance before super().server_bind() to avoid the SO_REUSEADDR/SO_EXCLUSIVEADDRUSE conflict that Windows raises.

Also added httpd.server_close() at the top of the finally block in main(). Without it, the listening socket stayed in LISTEN state through the entire cleanup phase, which made the probe's "dying instance" detection slower than it needed to be.

Tests in tests/test_server_port_exclusivity.py: probe detects a live server (SystemExit), allows startup on a free port, allows startup on an unresponsive socket (the dying-instance case), normalizes wildcard hosts, and verifies SO_EXCLUSIVEADDRUSE is set on Windows.

@nesquena-hermes nesquena-hermes removed hold changes-requested Maintainer left detailed feedback requesting changes; PR is waiting on author to address labels Jun 3, 2026
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Shipped in v0.51.234 (Release HB) via release PR #3488 — thank you @rodboev! 🙏

Unheld and merged this round. This PR was held earlier in the sweep because its original form (globally disabling SO_REUSEADDR) broke legitimate fast restart — ctl.sh restart and the os.execv self-update path rebind immediately and would hit the TIME_WAIT window for ~60s. You reworked it exactly along the suggested lines:

  1. Dropped the global allow_reuse_address = False (POSIX fast restart preserved)
  2. Added the _abort_if_already_serving live-listener probe (TCP connect + GET /health, 2s timeout) — a live instance aborts startup, a dying socket lingering in the kernel backlog times out and lets startup proceed
  3. SO_EXCLUSIVEADDRUSE in a server_bind() override for true Windows exclusive binding

Both pre-release reviewers (Codex regression + Opus correctness/security) confirmed the probe doesn't false-positive on a legitimate restart and the Windows path is correct. Authorship preserved via Co-authored-by and credited in the CHANGELOG. Closes #3289.

Closing as merged-via-release-stage (the change is in master, recommitted on the stage branch rather than via this branch directly).

nesquena-hermes added a commit that referenced this pull request Jun 3, 2026
## Release v0.51.236 — Release HD (stage-q7)

First Phase 3 (deep-review) release — picked by the 3-factor framework (contributor × impact × mitigated-risk): high-impact (#1952 native Windows support), backend-only (no screenshots), well-mitigated risk (POSIX path provably unchanged), from a contributor active this session (@rodboev, #3446/#3486 shipped earlier today).

### Added
| PR | Author | Fix |
|----|--------|-----|
| #1952 | @rodboev | Native Windows support for `bootstrap.py` + the embedded terminal: POSIX-only `fcntl`/`termios`/`select` guarded behind `_TERMINAL_SUPPORTED`; terminal entry points raise `NotImplementedError`/no-op on Windows; bootstrap Windows block → warning; auto-install errors clearly on native Windows (WSL unaffected); foreground uses `Popen`+exit on Windows instead of `os.execv`. **POSIX behavior unchanged on every path.** |

### Absorbed on the way in (fix-it-ourselves, reviewed fresh)
- `subprocess.CREATE_NEW_PROCESS_GROUP` → `getattr(subprocess, ..., 0)` — the constant is Windows-only, so a win32-simulating test `AttributeError`'d on Linux. Mirrors the `SO_EXCLUSIVEADDRUSE` getattr guard.
- Fixed 2 over-reaching tests in `test_windows_native_support.py` — one was launching a **real installer subprocess** via an unstubbed `subprocess.run` (now stubbed; harness 2.8s vs 80s); removed unused imports.
- Updated `test_onboarding_static.py` — it asserted the OLD "Native Windows is not supported" hard-block string this PR intentionally replaces; now asserts the new experimental-warning + auto-install guard.
- Help-text accuracy: `--foreground` help now describes the Windows Popen path (Opus nit).

### Gate results
- **Full pytest suite**: 7478 passed, 9 skipped, 3 xpassed, **0 failed**
- **ruff forward gate**: CLEAN
- **browser-smoke gate**: CLEAN (gate hardened mid-release to auto-detect the cached chromium revision)
- **Codex (regression)**: SAFE TO SHIP (simulated `sys.platform=win32`, verified POSIX modules not imported + all terminal guards complete + POSIX foreground still uses execv)
- **Opus (correctness)**: SAFE TO SHIP (POSIX path provably unchanged, all fcntl/termios/select guarded, Popen+exit correct; noted inherent-Windows trade-offs that aren't PR bugs)

Note: the Windows *runtime* path can't be executed on the Linux CI box; it was reviewed statically by both reviewers + the contributor's 209-line test (win32 simulated via monkeypatch). Linux/POSIX no-regression is fully verified.

Closes #1952.

Co-authored-by: rodboev <rodboev@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants