feat(runner): queue-on-running so back-to-back handoffs don't race#24
Conversation
Background ---------- An agent that calls `request_handoff(target_role='<self>')` to schedule its own next step hits a race: Plane delivers the comment-webhook to conductor ~milliseconds before the previous subprocess has cleared `_active`. The old webhook path called `runner.spawn()` directly, which raised `SessionAlreadyRunningError` for the in-flight triple and recorded `skipped: already running` — the chain died there. This blocked the "step-execution discipline" we want to write into the coder / tester / SA prompts (one step per invocation + self-handoff to avoid 14-step token burn and hangs). A prompt rule that depends on a racy spawn would be a lie. Change ------ - New `Runner.request_spawn(...)`: spawn-or-queue. If the `(slug, nick, issue)` triple is already in flight, the request is stored in `_pending` keyed by the same triple (last-write-wins — multiple handoffs targeting the same triple coalesce, since the next agent reads state from Plane, not from queued trigger metadata). - `_supervise` flushes the queued spawn from its `finally` block, on a fresh task so it doesn't pin the dying supervisor's log fp / transport. - Direct `spawn()` keeps raising — webhook callers move to `request_spawn`, long-running callers (none today) can still opt in to the strict path. - Capacity cap stays a hard limit: queued spawns don't count toward the cap, and `request_spawn` re-raises `CapacityFullError` rather than queueing — same semantics as before for that case. - `wait_idle` now re-snapshots `_tasks` until the set drains, because flush tasks are added AFTER their parent `_supervise` is already done; a single `gather(*_tasks)` would miss them and leak the flushed log fp. - `_flush_pending` swallows errors with structured logs (the original webhook returned 200 to Plane long ago — there's no caller to surface to). Lost-race → `SessionAlreadyRunning` → info-log only. Tests ----- - `request_spawn` fires immediately on a free triple. - Second request for the same triple while active → `status: queued`, `pending_count == 1`. - Third request for the same triple → coalesces, still `pending_count == 1`. - Independent triples run in parallel (no false queueing across keys). - Queued spawn flushes after the active session for the same key exits. - Capacity full still raises (request_spawn does not bypass the cap). - Webhook stub updated to mock `request_spawn`. New webhook test covers the queued path returning `skipped: queued`. Old "skipped: already running" path retired (no caller for it). Out of scope ------------ Prompt patches for step-execution discipline (coders / testers / SA) ship in a follow-up PR on claude-sdlc-agents — they depend on this landing first. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Walkthrough
ChangesSpawn-or-queue concurrency feature
Sequence Diagram(s)sequenceDiagram
rect rgba(173, 216, 230, 0.5)
Note over Webhook,Runner: Duplicate spawn request while active
end
Webhook->>Runner: request_spawn(workspace, nickname, issue_uuid)
Runner-->>Webhook: {status: "spawned", log_path: ...}
Webhook->>Runner: request_spawn(workspace, nickname, issue_uuid)
Runner-->>Webhook: {status: "queued"}
Note over Runner: Active session process exits
Runner->>Runner: _supervise removes active state
Runner->>Runner: _flush_pending scheduled as new task
Runner->>Runner: spawn(pending args) — deferred execution
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/plane_conductor/webhook.py (1)
326-349:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRestore
SessionAlreadyRunningErrormapping toskippedin webhook spawn handling.This path no longer catches
SessionAlreadyRunningError, so that case can return 500 instead of a skipped response.💡 Suggested patch
try: spawn_result = await runner.request_spawn( workspace=workspace, plane=plane, nickname=nickname, issue_uuid=issue_uuid, triggered_by_email=email, ) if spawn_result.get("status") == "queued": log.info( "spawn_deferred_until_active_clears", workspace=slug, nickname=nickname, issue=str(issue_uuid), ) skipped.append({"nickname": nickname, "reason": "queued"}) else: spawned.append(nickname) + except SessionAlreadyRunningError: + log.info( + "already_running", + workspace=slug, + nickname=nickname, + issue=str(issue_uuid), + ) + skipped.append({"nickname": nickname, "reason": "queued"}) except CapacityFullError: log.warning("capacity_full", workspace=slug, nickname=nickname) skipped.append({"nickname": nickname, "reason": "capacity full"}) except AgentSpawnError as exc: log.error("spawn_failed", workspace=slug, nickname=nickname, error=str(exc)) skipped.append({"nickname": nickname, "reason": "spawn failed"})Also re-add
SessionAlreadyRunningErrorto this module’s exceptions import list.As per coding guidelines,
src/plane_conductor/webhook.py: “webhook must surface SessionAlreadyRunningError asskipped, not 500.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/plane_conductor/webhook.py` around lines 326 - 349, The webhook spawn handling is missing an exception handler for SessionAlreadyRunningError which should be treated as a skipped item rather than an error. Add an except block for SessionAlreadyRunningError after the AgentSpawnError handler in the try-except block surrounding the runner.request_spawn() call, logging a warning with appropriate context (workspace slug, nickname) and appending to the skipped list with reason "session already running". Additionally, ensure SessionAlreadyRunningError is imported at the top of the module in the exceptions import list.Source: Coding guidelines
🧹 Nitpick comments (1)
tests/test_webhook.py (1)
496-511: ⚡ Quick winAdd a regression test for
SessionAlreadyRunningError→skippedmapping.This new queued-status test is good, but it no longer guards the explicit exception contract. Please add a companion case where
request_spawn()raisesSessionAlreadyRunningErrorand assert HTTP 200 with askippedentry.💡 Suggested test shape
+def test_webhook_skips_when_runner_raises_already_running( + settings: Settings, workspace_config: WorkspaceConfig +) -> None: + from plane_conductor.exceptions import SessionAlreadyRunningError + + class DuplicateRunner: + async def request_spawn(self, **kwargs: Any) -> dict[str, Any]: + raise SessionAlreadyRunningError("already running") + + plane = StubPlane({SARK: {"email": "sark@example.io"}}) + client = TestClient(_app(settings, workspace_config, plane, DuplicateRunner())) + resp = _send(client, settings, workspace_config, _comment_body(SARK)) + assert resp.status_code == 200 + assert resp.json()["skipped"] == [{"nickname": "sark", "reason": "queued"}]As per coding guidelines,
src/plane_conductor/webhook.py: “webhook must surface SessionAlreadyRunningError asskipped, not 500.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_webhook.py` around lines 496 - 511, Add a new regression test function alongside test_webhook_queues_when_runner_reports_already_running that explicitly tests the SessionAlreadyRunningError exception path. Create a QueueingRunner class that raises SessionAlreadyRunningError from its request_spawn method (instead of returning a status dict), then send the webhook request and assert HTTP 200 status with a skipped entry in the response JSON containing the appropriate reason for the exception. This ensures the webhook properly handles the SessionAlreadyRunningError exception and surfaces it as a skipped entry rather than returning a 500 error.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/plane_conductor/runner.py`:
- Around line 416-457: The _flush_pending method's docstring promises errors are
never raised, but the current exception handling only catches
SessionAlreadyRunningError, CapacityFullError, and AgentSpawnError. Other
exceptions like PermissionError or OSError from create_subprocess_exec could
escape. Add a catch-all exception handler after the existing except clauses that
catches Exception, logs the error with the same structured logging format
(including workspace.workspace_slug, nickname, issue, and error details), and
ensures the method never re-raises, maintaining the guarantee that all errors
are logged and swallowed.
- Around line 606-610: The final drain loop (where _tasks is awaited) can hang
indefinitely because _flush_pending calls spawn() which creates new supervisor
tasks that weren't killed during the grace period expiration. To fix this,
either add a shutdown flag that prevents spawn() from creating new processes
during the drain phase, or add a timeout parameter to the asyncio.gather call in
the final drain loop to ensure it completes within a reasonable timeframe
regardless of newly spawned tasks. The key is to prevent new tasks from being
indefinitely awaited after the known processes have been killed.
---
Outside diff comments:
In `@src/plane_conductor/webhook.py`:
- Around line 326-349: The webhook spawn handling is missing an exception
handler for SessionAlreadyRunningError which should be treated as a skipped item
rather than an error. Add an except block for SessionAlreadyRunningError after
the AgentSpawnError handler in the try-except block surrounding the
runner.request_spawn() call, logging a warning with appropriate context
(workspace slug, nickname) and appending to the skipped list with reason
"session already running". Additionally, ensure SessionAlreadyRunningError is
imported at the top of the module in the exceptions import list.
---
Nitpick comments:
In `@tests/test_webhook.py`:
- Around line 496-511: Add a new regression test function alongside
test_webhook_queues_when_runner_reports_already_running that explicitly tests
the SessionAlreadyRunningError exception path. Create a QueueingRunner class
that raises SessionAlreadyRunningError from its request_spawn method (instead of
returning a status dict), then send the webhook request and assert HTTP 200
status with a skipped entry in the response JSON containing the appropriate
reason for the exception. This ensures the webhook properly handles the
SessionAlreadyRunningError exception and surfaces it as a skipped entry rather
than returning a 500 error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: da834cce-ceaf-4448-961d-7a0dc317b2e3
📒 Files selected for processing (4)
src/plane_conductor/runner.pysrc/plane_conductor/webhook.pytests/test_runner.pytests/test_webhook.py
…ndling CI on Python 3.11 broke on collection because starlette 1.x's TestClient imports `httpx2` and emits StarletteDeprecationWarning when falling back to httpx — strict-mode pytest (`filterwarnings = ["error"]`) turns the warning into a collection error. Locally the .venv still had starlette 1.0.0 cached; reproduced by upgrading to 1.3.1 and confirmed the same failure. Adding `httpx2` to dev-deps lets TestClient pick its preferred backend and the warning never fires. Once fastapi pulls in httpx2 by default this entry can drop out — leaving a comment. Also addresses two CodeRabbit findings from PR #24 review: - `_flush_pending`: add a catch-all `Exception` arm. The flush task runs detached from any webhook caller, so an uncaught OSError / PermissionError / unexpected subprocess failure would only surface in asyncio's unraisable hook. Log it structurally instead and keep the contract that the flush never escapes. - `wait_idle` post-kill drain: bound the final loop with a 5s deadline. A flushed spawn that itself blocks (broken plane API, hung child) could otherwise pin shutdown indefinitely. Timeout → warning + break. Skipped: CodeRabbit asked to re-add a `SessionAlreadyRunningError` catch in the webhook and a matching test. `request_spawn` contractually does NOT raise that error (it queues), so the catch would be dead code and the test would assert a path that can't be reached. Left a thread reply explaining. Tests: 224 passed, 7 skipped (e2e). ruff + ruff-format + mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pyproject.toml`:
- Around line 48-51: The dependency pinned in the test requirements is
incorrect. Replace the package name from httpx2 to httpx in the test
dependencies section. The comment references Starlette 1.0's TestClient backend
switch, but Starlette actually uses the standard httpx library, not httpx2.
Update "httpx2>=0.1" to use the correct httpx package to properly address the
deprecation warning workaround as intended.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 03d9cf5b-14f9-425c-b96d-b10bfd698508
📒 Files selected for processing (2)
pyproject.tomlsrc/plane_conductor/runner.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/plane_conductor/runner.py
| # starlette 1.0 switched its TestClient backend to httpx2; the deprecation | ||
| # warning hits strict-mode pytest at collection. Pinning httpx2 here lets | ||
| # TestClient pick it up without us touching production deps. | ||
| "httpx2>=0.1", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if httpx2 exists on PyPI
echo "=== Checking httpx2 on PyPI ==="
curl -s https://pypi.org/pypi/httpx2/json | jq -r '.info.version // "PACKAGE NOT FOUND"'
echo -e "\n=== Checking httpx (the standard library) on PyPI ==="
curl -s https://pypi.org/pypi/httpx/json | jq -r '.info.version'
echo -e "\n=== Checking current Starlette dependency ==="
rg -n "starlette" pyproject.tomlRepository: volodchenkov/plane-conductor
Length of output: 301
🌐 Web query:
What HTTP client backend does Starlette TestClient use in version 1.0 and later?
💡 Result:
In Starlette version 1.0 and later, the TestClient uses the httpx library as its underlying HTTP client backend [1][2][3]. Prior to this version, Starlette's TestClient originally used the requests library. The transition from requests to httpx was implemented in Starlette 0.21.0 [4][5], ensuring that by the time Starlette reached its 1.0 stable release in March 2026, the TestClient was fully powered by httpx [4][5]. The TestClient provides an interface compatible with standard httpx sessions while adding specialized support for testing ASGI applications and WebSocket connections [2][3][6].
Citations:
- 1: https://pypi.org/project/starlette/1.0.1/
- 2: https://starlette.dev/testclient/
- 3: https://www.starlette.io/testclient/
- 4: https://github.com/encode/starlette/blob/master/docs/release-notes.md
- 5: https://starlette.dev/release-notes/
- 6: Differences between starlette.testclient.TestClient and httpx.AsyncClient Kludex/starlette#1877
Replace httpx2 with httpx—Starlette's TestClient backend is httpx, not httpx2.
Although httpx2 is a real package on PyPI, Starlette's TestClient uses the standard httpx library. The comment incorrectly states that Starlette 1.0 switched to httpx2; it actually uses httpx. Pinning the wrong package will not address the deprecation warning workaround as intended.
Diff
- # starlette 1.0 switched its TestClient backend to httpx2; the deprecation
+ # starlette 1.0 switched its TestClient backend to httpx; the deprecation
# warning hits strict-mode pytest at collection. Pinning httpx here lets
# TestClient pick it up without us touching production deps.
- "httpx2>=0.1",
+ "httpx>=0.23.0",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pyproject.toml` around lines 48 - 51, The dependency pinned in the test
requirements is incorrect. Replace the package name from httpx2 to httpx in the
test dependencies section. The comment references Starlette 1.0's TestClient
backend switch, but Starlette actually uses the standard httpx library, not
httpx2. Update "httpx2>=0.1" to use the correct httpx package to properly
address the deprecation warning workaround as intended.
Response to CodeRabbit findingsAccepted (commit d091884):
Declined as false positive:
Unrelated CI fix in the same commit: starlette 1.x's |
Summary
Runner.request_spawn(...): spawn-or-queue. If the(slug, nick, issue)triple is already in flight, the request is stored in_pending(last-write-wins) and fired automatically when the active supervisor exits.request_spawn— oldSessionAlreadyRunningError → skipped: already runningpath retired and replaced withstatus: queued → skipped: queued.wait_idlere-snapshots_tasksuntil drained so flush tasks scheduled mid-shutdown don't leak open log fps.Why
An agent that calls
request_handoff(target_role='<self>')to schedule its own next step hits a race: Plane delivers the comment-webhook to conductor ~milliseconds before the previous subprocess has cleared_active. The old direct-spawn path silently dropped the re-entry as a duplicate and the chain died.This unblocks planned step-execution discipline prompt patches on
claude-sdlc-agents(coders/testers/SA: one step per invocation + self-handoff → no 14-step token burn or hangs). A prompt rule that depends on a racy spawn would be a lie, so this lands first.Semantics
request_spawnre-raisesCapacityFullErrorrather than queueing.spawn(): unchanged; still raises on conflict. Webhook moves torequest_spawn. No external callers ofspawntoday.Test plan
request_spawnfires immediately on a free triplestatus: queued,pending_count == 1pending_count == 1skipped: queuedOut of scope
Prompt patches for step-execution discipline ship in a follow-up PR on
claude-sdlc-agentsafter this lands.🤖 Generated with Claude Code
Summary by CodeRabbit