feat(web): plan approval UI and Home plan toggle#673
Draft
bleleve wants to merge 19 commits into
Draft
Conversation
…amble)
Adds the backend for plan-mode: a human-in-the-loop planning gate where
the agent persists a markdown plan that must be approved before any
code-changing turn runs. This PR is headless — no UI or bot triggers
yet (those follow as separate PRs in the stack).
Changes:
- New `plans` table in the SessionDO SQLite (monotonic versions per
session); `PlanService` + `plans.handler` covering save / get / list
/ approve / reject.
- New endpoints under `/sessions/:id/plan{,/approve,/reject}` and
`/sessions/:id/plans` (list).
- `SessionMessageQueue` reads `getCurrentPlan()` and attaches
`resumeContext.currentPlan` to dispatched `PromptCommand`s. The
planning-turn gate is terminal-status aware (approved/rejected exits
plan mode without flipping `plan_mode` so the history bubble stays
visible).
- Sandbox `bridge.py` builds a restate-and-confirm preamble from
`resumeContext.currentPlan` and prepends it to the user content
before forwarding to OpenCode.
- New `wrapUntrustedContent` helper in `@open-inspect/shared` wraps
plan + user content in XML tags from the sandbox runtime (security:
isolates plan-as-context from prompt-injection inside user messages).
- Shared `model-defaults.ts` ships the `fetchModelDefaults` helper —
used by bots in subsequent PRs to source the default model + default
plan model from a single place (the control plane). Falls back to
env vars + shared constants until the `/model-preferences` endpoint
is extended (next PR in the stack).
- `MODEL_ALIAS_MAP` and `DEFAULT_PLAN_MODEL` added to shared models.
Tests:
- Unit tests for `PlanService`, `plans.handler`, plan-mode behavior in
`MessageQueue`, plan persistence in the repository.
- Sandbox-runtime tests for `bridge.py` preamble + XML wrapping.
- All test fixtures updated to include the new required `plan_*`
fields on `SessionRow`.
Verification: `npm run typecheck` (workspace), `npm run lint`, `npm test`
(shared 183/183, control-plane 1148/1148), `pytest` (sandbox-runtime
plan-related tests 41/41) — all green on this branch based on upstream/main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…eMurray#671 Must-fix: - bridge.py: escape XML special chars in plan content before interpolating into <saved_plan> and <previous_plan> wrappers (1.1) — prevents a hostile `</saved_plan>` in the plan body from breaking out of the wrapper and injecting instructions outside it. - bridge.py: replace the cumulative token buffer's append with overwrite semantics (1.2) — OpenCode token events carry the FULL accumulated text since the response start (not incremental deltas), so appending produced corrupted plan bodies with compounded prefixes at end-of-turn. - plans.handler.ts: readApprovalBody now throws InvalidApprovalBodyError on JSON parse failure instead of silently coercing the body to {} (1.3) — errorResponseForApproval maps it to HTTP 400 so malformed client requests surface instead of leaking through with partial parameters. Should-fix: - plan.service.ts: dedup guard now requires non-null messageId (1.4) — null is the "no message context" marker, so two identical-body events with null messageId are legitimately distinct saves and must each bump the version. - session/types.ts: import PlanSource from @open-inspect/shared (1.5) instead of redeclaring it locally — prevents contract drift between the control plane row type and the shared API surface. - durable-object.ts: extract DEFAULT_SANDBOX_INACTIVITY_TIMEOUT_MS module constant (1.6); also cleans up a stray /** doc-comment opener left over from an earlier opencode-config strip. - bridge.py: extract PLAN_SAVE_TIMEOUT_SECONDS module constant (1.7). - prompt-safety.ts: <user_content> tag neutralization is now case- insensitive and whitespace-tolerant (1.8) — catches `<USER_CONTENT>`, `< user_content >`, `</ user_content >`, attribute-bearing tags, etc. Nitpicks: - session-lifecycle.handler.ts: remove redundant getValidModelOrDefault after validation; add log.warn when planMode=true but planModel is invalid and we fall back to DEFAULT_PLAN_MODEL (1.9). - router.ts: type forwardPlanApproval's internalPath parameter with SessionInternalPath instead of bypassing via `as never` (1.10). - control-plane/src/types.ts: reference DEFAULT_SANDBOX_INACTIVITY_TIMEOUT_MS in the SANDBOX_INACTIVITY_TIMEOUT_MS env-var comment (1.11). - docs/PLAN_MODE.md: add `text` language tag to architecture diagram fence to satisfy MD040 (1.12). - packages/control-plane/README.md: clarify Plan Mode behavior — prompts continue to queue and dispatch as *planning turns* until approval/rejection/amendment (not "the message queue is gated") (1.13). Test coverage gaps: - repository.test.ts: new case for upsertSession({ planMode: true, planModel }) asserting plan_mode=1 and plan_model are persisted (1.14). - plan.service.test.ts: new cases for null-messageId dedup (skipped guard) and MAX_PLAN_CONTENT_BYTES boundary (accept-at-limit, throw- over-limit) (1.15). Targeted regression tests: - test_bridge_resume_context.py: assertions that resume + planning preambles escape XML specials in the plan body and that token-buffer overwrite semantics produce a single non-duplicated snapshot. - plans.handler.test.ts: assertions that malformed-JSON bodies on approve/reject return HTTP 400 with code "invalid_body". Verification: npm run build -w @open-inspect/shared && npm run lint && npm run typecheck && npm test (shared 16 files, control-plane 65 files all green) && pytest (41 + 4 new regression cases green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#671 Three items flagged on the first fix commit (a2cad10): 🟠 Major (plans.handler.ts) — Don't clear reasoning effort when no model override is provided. Previously the approve handler validated the user's implementationReasoningEffort against `""` when no model override was set; validateReasoningEffort returned null, which then propagated to approvePlanAndFlush. The service treats null as an explicit "clear", so an approval request that sent reasoning effort without a model silently wiped the session's persisted reasoning_effort. New semantics: - undefined (omitted) → no change - explicit null → forwarded as null (intentional clear) - string + model → validated against the model - string, no model → return 400 with code "invalid_reasoning_effort" Two new regression cases in plans.handler.test.ts. ⚡ Quick win (durable-object.ts) — NaN-guard the inactivity timeout parse. parseInt(env_value, 10) returns NaN when the env var is set to a non-numeric string (e.g. "abc" or "5 minutes"); the lifecycle manager would then schedule alarms with NaN ms, which is undefined behavior. New parseSandboxInactivityTimeoutMs() helper checks Number.isFinite and value > 0 before accepting the env value, falling back to DEFAULT_SANDBOX_INACTIVITY_TIMEOUT_MS otherwise. ⚡ Style (durable-object.ts) — Drop the "// 15 minutes" trailing comment on DEFAULT_SANDBOX_INACTIVITY_TIMEOUT_MS. Per repo guideline, don't restate literal timeout values in comments — the constant name is the source of truth. Verification: npm run typecheck, npm run lint, npm test -w @open-inspect/control-plane (65/65 files, all green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
462f0b7 to
1b6b6d4
Compare
CodeRabbit follow-up review on ColeMurray#672 (the only thread that didn't auto- resolve after the previous fix pushes). The issue technically lives in ColeMurray#671's plans.handler.ts but CodeRabbit only flagged it now after seeing the post-fix state. `JSON.parse("null")` returns null, `JSON.parse("[1,2]")` returns an array, `JSON.parse("42")` returns a number. All three are syntactically valid JSON but none of them is the object payload approvePlan/rejectPlan expect. The previous code would parse successfully then crash later when dereferencing `body.implementationModel` on `null` (TypeError) or silently accept arrays and primitives as if they were valid bodies. Reject these early with HTTP 400 (`code: "invalid_body"`) via the existing InvalidApprovalBodyError path. Two new regression cases cover JSON-null and JSON-array bodies. Verification: npm run typecheck && npm test -w @open-inspect/control-plane (65/65 files, 2 new test cases green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1b6b6d4 to
7bc1db1
Compare
…leMurray#671 Two new items flagged after the previous fix pushes: 🟠 plans.handler.ts — Reject invalid implementationReasoningEffort. validateReasoningEffort() returns null for unsupported values, but null is also our explicit "clear the persisted effort" sentinel. A request with a typo (e.g. "hgih" instead of "high") was previously accepted and silently cleared the session's reasoning_effort. Now we distinguish the two cases: explicit-null body stays null (intentional clear), but a string that fails validation returns 400 with code "invalid_reasoning_effort" instead of being coerced. 🟡 bridge.py — Use quoteattr for the version XML attribute value. xml_escape() does NOT escape `"`, so a version string containing a quote could break the attribute boundary of <saved_plan version=...> and <previous_plan version=...>. Switching to xml.sax.saxutils. quoteattr (which handles the surrounding quotes itself) closes that edge case. Body content keeps xml_escape since it's not in an attribute context. Verification: npm run typecheck && npm test -w @open-inspect/control-plane (66/66 files green) && pytest tests/test_bridge_resume_context.py (19 cases green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce27b0e to
ed14599
Compare
When a plan is approved, the control-plane now enqueues a synthetic "Implement the approved plan vN..." prompt server-side and flushes the queue, so every client (web + Slack/Linear/GitHub bots) starts the implementation turn through the same code path. Previously only the web client did this (via a follow-up WebSocket prompt after the approve call). Approvals coming from bots stalled at "Execution complete" right after the plan was generated — Build cost stayed at $0 — because nothing was kicking off the build turn. The synthetic prompt is authored by a stable per-session "system" participant (created lazily on first dispatch) and uses the new "system" MessageSource so the UI and event log can distinguish it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
enqueuePromptFromApi already flushes the message queue at the end of its own path, so the implementation-prompt enqueue done in onDispatchImplementationPrompt picks up both the synthetic prompt and any user messages that piled up during awaiting_approval. The separate onPlanApproved callback that fired a second processMessageQueue was a no-op (queue already busy) and the comment claiming it flushed the synthetic prompt was misleading. Remove the dep and the wiring. Addresses reef review on ColeMurray#65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ed14599 to
8a6f1c6
Compare
5 tasks
…D1Tables Addresses CodeRabbit review on ColeMurray#672. Control-plane integration tests run under isolatedStorage=false (workers-sdk SQLite WAL cleanup bug), so D1 state can leak across files when each suite forgets to clean. Match the pattern used by other integration suites and reset D1 in beforeEach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8a6f1c6 to
d874f27
Compare
PLAN_SAVE_TIMEOUT_SECONDS sat between the third-party imports and the relative imports, tripping E402 (module level import not at top of file) and I001 (unsorted xml.sax.saxutils import). Move the constant below all imports and let ruff split the xml import. Caught by upstream PR ColeMurray#671 CI: Lint & Format (Python - sandbox-runtime). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d874f27 to
7a8a459
Compare
…ove/reject When a plan verdict (approve/reject) is set from a different channel than the one that posted the "Plan awaiting" message — e.g. the user approves a Slack-triggered plan from the web UI — the originating bot has no signal to update its UI. Its modal/webhook handler never fired, so the awaiting message sits stale with clickable buttons that 4xx on click. Add `notifyPlanStatus` to CallbackNotificationService, hooked from plans.handler after a successful approve/reject. Routing mirrors `notifyComplete`: lookup the plan's triggering message, read its callback_context + source, POST a signed payload to the bot's `/callbacks/plan-status` endpoint via `getBinding(source)`. Cross-channel guard: skip when the approver's source (parsed from the `prefix:id` shape of approverAuthorId) equals the message source. The bot's own modal/webhook already handled that case (d1d7edc). Automation-sourced messages have no user surface — skipped too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ve/unarchive When the user archives a session from the web UI, the originating Slack/Linear thread gets no signal. The user has to check the web app to know the session is closed. Add `notifySessionLifecycle` to CallbackNotificationService, hooked from the archive and unarchive handlers via a new `notifySessionLifecycle` dep on SessionLifecycleHandlerDeps. Routing mirrors notifyPlanStatus but uses the most recent message with a callback_context (via a new `getLatestCallbackEnvelope` repository helper) since archive is a session-level event not bound to a specific message. Symmetric: both archive and unarchive fire the callback. The bots discriminate via an `event: "archived" | "unarchived"` field in the payload. No-op for sessions with no bot origin (web-only) — the latest-envelope lookup returns null. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7a8a459 to
c8f7ece
Compare
…llbacks
Pass `approverDisplayName` / `actorDisplayName` end-to-end so the
originating Slack/Linear notifications can render the actor's real
name (e.g. "John Doe (via web)") instead of "someone in web". Adds
the field to `notifyPlanStatus` / `notifySessionLifecycle` payloads,
threads it through the plans + session-lifecycle handler deps, and
forwards it via router's archive/unarchive handlers (which previously
re-serialized the body as `{ userId }` only).
Companion changes on the bot branches surface the field in the
rendered messages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the existing `model_preferences` table with `default_model` and
`default_plan_model` columns. Surfaces them via a new section in
Settings → Models so the deployment-wide defaults are configurable from
the UI instead of being scattered across Terraform env vars on each
worker.
Why: with plan-mode (preceding PR in stack), every deployment now
needs two default models — one for build turns, one for plan turns.
The control-plane is the natural source of truth (already serves
`enabledModels` via `/model-preferences`). The bots and the web UI
read from this single place instead of each worker carrying its own
`DEFAULT_MODEL` env var that must be kept in sync.
Changes:
- D1 migration `0021_add_default_models_to_model_preferences.sql`
adds two nullable columns via `ALTER TABLE ADD COLUMN`. Existing
deployments keep working because the read path falls back to the
env var, then to the shared library constant.
- `db/model-preferences.ts` extended to read/write the new fields
atomically across the three-field tuple.
- `GET /model-preferences` now returns `{ enabledModels, defaultModel,
defaultPlanModel }`; `PUT` validates `defaultModel ∈ enabledModels`.
- Web Settings → Models gets a new "Default Models" section with two
combobox pickers. Disabling a model that's the current default is
blocked inline.
- Terraform: production workers gain `DEFAULT_MODEL` (control-plane,
which didn't have one) and `DEFAULT_PLAN_MODEL` (all four). These
remain fallbacks — the DB value wins when set.
- `docs/GETTING_STARTED.md` gets a "Configure Default Models" subsection.
Verification: `npm run typecheck`, `npm run lint`, `npm test`
(shared 183/183, control-plane 1162/1162, web 259/259) — all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Must-fix: - model-defaults.ts: per-field fallback chain (2.1) — defaultModel and defaultPlanModel now resolve independently as `DB > env > shared constant`. The previous all-or-nothing gate required BOTH fields in the CP response; when only one was present, both fields fell back together to env/constants, discarding the partial DB value. Should-fix: - models-settings.tsx: setDirty only fires on an actual state change (2.2). Previously toggling a model off would set dirty even when the toggle was blocked by the default-model guard, enabling Save on a no-op. The fix tracks a `changed` flag inside the updater and only calls setDirty when the underlying Set actually mutated. - models-settings.tsx: SWR data sync moved from render phase into useEffect (2.3). The previous in-render setState would surface a React warning under StrictMode and risked an unnecessary re-render. - terraform/environments/production/workers-{linear,slack}.tf: align DEFAULT_PLAN_MODEL to "anthropic/claude-opus-4-6" (2.4) to match the format used by workers-control-plane.tf and workers-github.tf. The bare "claude-opus-4-6" string was not a fully-qualified model ID and would be rejected by isValidModel. Targeted regression test: - model-defaults.test.ts: two new cases verifying that when the CP response contains only defaultModel (resp. defaultPlanModel), the other field independently falls back to env without the present field being discarded. Verification: npm run build -w @open-inspect/shared && npm run lint && npm run typecheck && npm test (shared 16/16, control-plane 66/66, web 32/32 — all green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… GET /model-preferences CodeRabbit follow-up on ColeMurray#672 (only remaining unresolved thread on the PR after the previous fixes auto-resolved): when no DB preferences exist, the fallback chain could return defaultModel / defaultPlanModel sourced from env vars or shared constants that weren't members of the returned enabledModels set. That broke the same invariant enforced by setPreferences() — a subsequent PUT with the unchanged returned tuple would fail validation, leaving the Settings page unable to re-save. New reconcileDefaultsWithEnabled() helper substitutes the first enabled model for any default that isn't in enabledModels (the same fallback the web Settings UI applies when a stored default becomes disabled). Applied at all three GET sites: no-DB-binding fallback, happy path, and store-throws fallback. New regression test asserts that env-var defaults outside the enabled set are reconciled. The existing test capturing the old (broken) behavior was rewritten to use enabledModels that contain the env-var values — the more realistic configuration where both are aligned. Verification: npm run typecheck && npm test -w @open-inspect/control-plane (66/66 files green, 1 new regression case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit follow-up on ColeMurray#672: env vars like `DEFAULT_MODEL` / `DEFAULT_PLAN_MODEL` are sometimes configured with bare model names (e.g. `claude-sonnet-4-6`) rather than the fully-qualified form (`anthropic/claude-sonnet-4-6`). The CP `/model-preferences` endpoint already returns normalized IDs, but the env-var fallback path previously surfaced the raw env value verbatim — so during a CP outage callers could see a mix of qualified and bare IDs depending on how the operator configured the worker. New `normalizeFallback()` helper passes the candidate through `isValidModel` + `normalizeModelId`. Invalid candidates return undefined so the chain falls through to the next link (env → shared constant); valid bare or qualified IDs get normalized to the qualified form before being returned. Three existing tests rewritten to assert the normalized output (they previously captured the bug). Two new regression cases: - Bare env vars get normalized. - Invalid env vars are ignored and the chain falls through to the shared constant. Verification: npm run build -w @open-inspect/shared, npm test -w @open-inspect/shared (16/16 files green) && npm test -w @open-inspect/control-plane (66/66 files green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces the plan-mode HITL gate end-to-end in the dashboard:
- New <PlanApprovalBanner /> component sits at the bottom of an
awaiting-approval session, with an Approve / Reject pair plus an
inline "Build with" model picker that defaults to the deployment's
configured `defaultModel`.
- Session timeline renders every plan version chronologically as a
plan bubble. The latest version inherits the session's
`planApprovalStatus` (awaiting / approved / rejected); older versions
are "superseded" (struck through). Plan bubbles interleave with the
existing event groups by timestamp.
- Home page Plan toggle. When ON, the model selector auto-swaps to
the deployment's `defaultPlanModel` and the agent label flips from
"build agent" to "plan agent". The auto-swap stops as soon as the
user manually picks a model.
- Sidebar metadata block surfaces plan mode + plan model on a session.
- New API proxy routes under `app/api/sessions/[id]/plan{,/approve,/reject}`
and `app/api/sessions/[id]/plans` forward to the control plane.
`app/api/sessions/route.ts` now passes through `planMode` / `planModel`
at session creation time.
- `use-session-socket.ts` handles `plan_saved` / `plan_status` events
and exposes a `plans` array + the current `planApprovalStatus` on
the hook's return value. `sendPrompt` gains an optional `planMode`
argument.
- HammerIcon added to the shared icons module (used for the "build"
agent label in the sidebar).
Stack: A → B → { **C** | D | E | F → G }
Depends on: A (plan backend), B (defaultModel / defaultPlanModel API).
Parallel to: D (linear), E (github), F (slack).
Verification: `npm run typecheck`, `npm run lint`, `npm test`
(web 259/259, control-plane 1162/1162) — all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The control-plane now dispatches the implementation prompt server-side after plan approval, so the web client no longer needs its post-approve WebSocket follow-up. Removing it avoids a double-dispatch on web. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…notifications Web's session.user.id is typically the GitHub numeric id (from next-auth) — not human-readable. Pass `session.user.name` alongside as `approverDisplayName` / `actorDisplayName` so the originating Slack/Linear thread can render "<name> (via web)" instead of "someone in web" when the user takes an action from the web UI. Affects: plan approve/reject, archive/unarchive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c8f7ece to
cb2fc99
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Surfaces the plan-mode HITL gate end-to-end in the dashboard.
What's in this PR
<PlanApprovalBanner />component sits at the bottom of an awaiting-approval session, with an Approve / Reject pair plus an inline "Build with" model picker that defaults to the deployment'sdefaultModel.planApprovalStatus(awaiting / approved / rejected); older versions are "superseded" (struck through). Plan bubbles interleave with the existing event groups by timestamp.defaultPlanModeland the agent label flips from "build agent" to "plan agent". The auto-swap stops as soon as the user manually picks a model.app/api/sessions/[id]/plan{,/approve,/reject}andapp/api/sessions/[id]/plansforward to the control plane.app/api/sessions/route.tsnow passes throughplanMode/planModelat session creation time.use-session-socket.tshandlesplan_saved/plan_statusevents and exposes aplansarray + the currentplanApprovalStatuson the hook's return value.sendPromptgains an optionalplanModeargument.Stack
Test plan
npm run typecheck(workspace) — greennpm run lint— greennpm test -w @open-inspect/web— 259/259npm test -w @open-inspect/control-plane— 1162/1162🤖 Generated with Claude Code