Skip to content

fix(runtime): make cancelStream() owner-aware and close its SSE source#3345

Closed
franksong2702 wants to merge 6 commits into
nesquena:masterfrom
franksong2702:franksong2702/cancel-stream-owner-guard
Closed

fix(runtime): make cancelStream() owner-aware and close its SSE source#3345
franksong2702 wants to merge 6 commits into
nesquena:masterfrom
franksong2702:franksong2702/cancel-stream-owner-guard

Conversation

@franksong2702
Copy link
Copy Markdown
Contributor

Closes #3344

Thinking Path

  • Hermes WebUI is intentionally a thin Python + vanilla-JS app with
    no build step; the cancel/stop path is one of the few places where
    the browser talks back to the runtime and also has to keep its local
    state in sync with the worker.
  • Today the user-facing cancelStream() in static/boot.js reads
    S.activeStreamId, fires /api/chat/cancel, and then unconditionally
    clears S.activeStreamId, calls setBusy(false), and resets the
    composer status. It does not look at the response, does not call
    closeLiveStream(sid, streamId), and does not guard the clear against
    the original streamId.
  • The sibling function cancelSessionStream() (used by the sidebar)
    already does owner-guarded cleanup and explicitly closes the SSE
    source. The user-facing path was the odd one out.
  • This PR brings cancelStream() in line with cancelSessionStream()
    while staying purely frontend. The backend
    (api/streaming.py:6486 cancel_stream()) already returns
    {ok, cancelled, stream_id}; the only thing missing is a frontend
    that reads it and acts on it.
  • Net effect: the UI no longer lies about being idle while the worker
    is still running, and the old SSE source cannot leak tokens into a
    subsequent turn's transcript.

What Changed

  • static/boot.jscancelStream() rewritten:
    • Snapshot (sid, streamId) at entry.
    • Read r.json() into respBody; ignore non-JSON bodies.
    • Call closeLiveStream(sid, streamId) for the snapshot pair
      (mirrors cancelSessionStream()).
    • Clear local busy state only when S.activeStreamId is still the
      snapshot value (owner guard). If a new turn has rotated
      S.activeStreamId during the cancel fetch, the new turn's state
      is left alone.
    • When respBody.cancelled === false, show a short toast
      Stream is no longer active (2 s, generic — the response shape
      cannot distinguish reasons).
    • Added a 3-line block comment above the function pointing to the
      run-state consistency RFC.
  • tests/test_cancel_stream_owner_guard.py — new file. 13 tests:
    • 7 structural (regex on the function body) asserting the
      snapshot / read-response / closeLiveStream / owner-guard /
      cancelled-false-toast / network-error-doesn't-throw / no-op-on-empty
      invariants are present in source.
    • 6 runtime tests that actually eval the function with mocked
      S, fetch, closeLiveStream, setBusy, setComposerStatus,
      showToast via a node --input-type=module subprocess, and
      verify:
      • T1 no active stream: no-op (no fetch, no close, no busy).
      • T2 happy path (cancelled:true): state cleared AND old SSE
        closed.
      • T3 cancelled:false: toast surfaced AND old SSE closed AND
        local state cleared.
      • T4 network error: no throw, owner guard still applies, old
        SSE still closed.
      • T5 owner guard: S.activeStreamId rotates to a new turn
        during fetch — old SSE is still closed for the original
        (sid, streamId), but the new turn's S.activeStreamId and
        busy state are NOT cleared.
      • T6 cross-check: T2/T3/T4 all clear local state; T5 diverges.
    • The new test file is intentionally named
      test_cancel_stream_owner_guard.py; if reviewers prefer the
      test_issueNNNN_* convention, a follow-up rename is fine.
  • CHANGELOG.md[Unreleased] / Fixed entry describing the
    three behaviours added (close SSE, owner guard, cancelled-false
    toast).

Why It Matters

  • Reliability: After clicking Stop, the UI used to flip to idle
    even if the cancel was silently skipped on the backend (e.g. a
    new turn had already started in the same session). The new
    turn would then run with no busy state, so the user had no
    way to tell the worker was still doing work.
  • Cleanliness: The old SSE EventSource was left in
    LIVE_STREAMS[sid]. A subsequent turn in the same session
    could see its metering / token events for one tick, then
    flip back, then have a stale close fire later. Closing the
    source on the cancel path makes the contract between
    cancelStream() and the SSE registry symmetric.
  • Degraded feedback: When the backend says cancelled:false
    (already finalised / stale writeback / session rotated), the
    user now sees a short toast. They used to see nothing.

Verification

Run on the clean worktree based on origin/master@1fcd81e3:

  • node --check static/boot.js — passes.
  • pytest tests/test_cancel_stream_owner_guard.py -v
    13 passed (7 structural + 6 runtime).
  • pytest tests/test_sprint36.py tests/test_1466_sidebar_cancel_clarify.py tests/test_issue1298_cancel_and_activity.py tests/test_cancelled_turn_status.py tests/test_real_steer.py tests/test_issue2157_sessions_list_stale_stream_state.py -q — 60 passed, 0 failed.
  • pytest tests/ -q --timeout=30 (full suite, excluding two
    pre-existing failures unrelated to this PR):
    • tests/test_issue1144_session_time_sync.py::test_message_footer_timestamp_uses_server_tz
      — pre-existing locale failure on this macOS Python build
      (3月29日 10:00 instead of 10:00 AM); reproduces on
      unchanged master with no diff in boot.js. Not caused by
      this PR.
    • tests/test_ctl_script.py::test_start_can_ignore_repo_dotenv_for_authoritative_test_env
      — pre-existing test-ordering flake; passes in isolation,
      also fails on unchanged master when run after
      test_cancel_stream_owner_guard.py in the full suite.
    • All other 7,041 tests pass, including 65 skipped and 3
      xpassed that are unrelated.
  • Manual reproduction: see the issue's "Reproduction" section.
    Not captured as a video because the fix is a control-plane
    change; the only user-visible delta is a 2 s toast on the rare
    cancelled:false path. No before/after images are attached
    because the happy-path UI is byte-identical to before; the
    only visual change is the rare toast (and, in the
    near-impossible race where the cancel was silently dropped,
    the second turn's spinner no longer disappears for ~200 ms).

Contract Routing

Risks / Follow-ups

  • Pre-existing test fragility: tests/test_sprint36.py
    reads a 400-character window from
    async function cancelStream() to find the catch block.
    This PR happens to keep the function compact enough that
    the outer catch still falls inside that window, but a
    follow-up that adds more than ~5 lines of preamble at the
    top of the function will break that test. Bumping the
    window to 1200 (and the matching brace-finding) is a
    one-line follow-up; not bundled here to keep the PR scope
    minimal.
  • Backend silent-skip still untyped: The
    /api/chat/cancel response only exposes cancelled:bool.
    When it is false, the toast says "no longer active"
    generically; we cannot tell the user whether the stream
    was already finalised, the session rotated, or the
    writeback was stale. A follow-up that adds a status /
    reason field would let the UI be more specific. Not in
    scope for this PR.
  • The LIVE_STREAMS cross-session leak (P0 in the
    streaming-audit skill)
    is unrelated and is being
    addressed in a separate track.
  • No security-sensitive behaviour change. The function
    only adds frontend bookkeeping; it does not change auth,
    path handling, uploads, streaming-protocol parsing, or
    environment handling.

Model Used

  • Provider: custom (Hermes self-hosted M3)
  • Model: MiniMax-M3
  • Mode: agentic, with terminal + file + search tools, no
    separate image-generation / browser tools used in this
    PR.
  • The change was implemented by the agent in a clean git
    worktree based on origin/master@1fcd81e3, validated
    against the existing test suite, and the PR description
    was drafted by the same agent. Final review is human.

@franksong2702 franksong2702 force-pushed the franksong2702/cancel-stream-owner-guard branch from 20af25d to 5ba0d7a Compare June 4, 2026 02:39
@franksong2702 franksong2702 marked this pull request as ready for review June 4, 2026 02:41
@franksong2702 franksong2702 force-pushed the franksong2702/cancel-stream-owner-guard branch from 5ba0d7a to dd37d4c Compare June 4, 2026 02:44
cancelStream() in static/boot.js cleared S.activeStreamId, setBusy(false),
and the composer status unconditionally after issuing /api/chat/cancel.
It did not call closeLiveStream(sid, streamId), did not read the cancel
response, and did not guard the local-state clear against the original
streamId, so a new turn that started while the cancel request was in
flight could be wiped.

This change brings cancelStream() in line with cancelSessionStream():

  - snapshot (sid, streamId) at entry
  - read r.json() into respBody (ignore non-JSON)
  - closeLiveStream(sid, streamId) for the snapshot pair, even on
    cancelled:false and on network error
  - only clear local busy state when S.activeStreamId still matches
    the snapshot (owner guard)
  - short toast 'Stream is no longer active' on cancelled:false

No backend or contract change; this is defensive conformance with
docs/rfcs/webui-run-state-consistency-contract.md (Invariants #2 and
@franksong2702 franksong2702 force-pushed the franksong2702/cancel-stream-owner-guard branch from dd37d4c to 7409247 Compare June 4, 2026 02:45
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 4, 2026

Greptile Summary

This PR fixes cancelStream() in static/boot.js to be owner-aware and SSE-lifecycle-correct, addressing a bug where the UI would report idle even when the backend worker was still running and where the old SSE EventSource could leak events into subsequent turns.

  • static/boot.js: cancelStream() now snapshots (sid, streamId) at entry, reads the cancel response, and delegates cleanup to the SSE cancel event on the happy path (cancelled:true), clearing local state directly only when cancelled:false. Stale/rotated streams get their SSE closed via closeLiveStream, but the active owned stream is intentionally left open so the backend terminal event can settle the transcript before busy state is cleared.
  • static/messages.js: Adds _bailOutOfTerminalEventsFromStaleStream() guards to all terminal SSE event handlers and owner guards in _clearOwnerInflightState, _restoreSettledSession, and _handleStreamError — tightly coupled to boot.js not clearing S.activeStreamId on cancelled:true.
  • tests/test_cancel_stream_owner_guard.py: 13 new tests (7 structural regex + 6 runtime via node --input-type=module) covering no-op, happy path, cancelled:false toast, network-error no-throw, and owner-rotation scenarios.

Confidence Score: 5/5

Safe to merge — the functional logic is sound, boot.js and messages.js changes are a coordinated pair that correctly preserves the SSE cleanup chain, and the test suite covers the key behavioral paths.

The implementation correctly handles all five scenarios (no-op, happy cancel, cancelled:false, network error, owner rotation). The SSE error fallback in _handleStreamError provides a safety net if the cancel event does not arrive. No data loss or state corruption paths were found. The issues flagged are documentation and coupling nits that do not affect runtime correctness.

The implicit coupling between cancelStream() in boot.js not clearing S.activeStreamId and the _clearOwnerInflightState() guard in messages.js is worth noting for future maintainers — these two files must stay in sync.

Important Files Changed

Filename Overview
static/boot.js cancelStream() rewritten to defer local cleanup to the SSE cancel event on the happy path (cancelled:true), and clear state directly only when cancelled:false; logic is sound but the inversion is non-obvious
static/messages.js Adds _bailOutOfTerminalEventsFromStaleStream() guard to terminal SSE events and owner guards in _clearOwnerInflightState/_restoreSettledSession/_handleStreamError; tightly coupled to boot.js not clearing S.activeStreamId on cancelled:true
tests/test_cancel_stream_owner_guard.py 13 tests (7 structural regex + 6 runtime via node subprocess) covering key behavioral paths; some structural assertions are weak (bare 'catch' presence check)
CHANGELOG.md New Unreleased Fixed entry accurately describes the keep-SSE-open behavior, but the PR description's What Changed bullets about closeLiveStream for the snapshot pair are inaccurate

Sequence Diagram

sequenceDiagram
    participant User
    participant cancelStream as cancelStream() [boot.js]
    participant Backend as /api/chat/cancel
    participant SSE as SSE EventSource
    participant handlers as attachLiveStream handlers [messages.js]

    User->>cancelStream: clicks Stop
    note over cancelStream: snapshot sid + streamId
    cancelStream->>Backend: "fetch /api/chat/cancel?stream_id=X"
    Backend-->>cancelStream: cancelled:true
    note over cancelStream: do nothing locally — S.activeStreamId stays set
    Backend->>SSE: emits cancel terminal event
    SSE->>handlers: cancel event fires
    note over handlers: _bailOutOfTerminalEventsFromStaleStream false — proceed
    handlers->>handlers: _clearOwnerInflightState clears INFLIGHT
    handlers->>handlers: "S.activeStreamId=null, setBusy(false)"
    alt SSE dies before cancel event
        SSE->>handlers: error event fires
        handlers->>handlers: _handleStreamError — cleanup + Connection interrupted
    end
    alt Owner rotated during fetch
        note over cancelStream: S.activeStreamId !== streamId
        cancelStream->>handlers: closeLiveStream(sid, streamId)
    end
    alt cancelled:false
        Backend-->>cancelStream: cancelled:false
        note over cancelStream: S.activeStreamId=null, setBusy(false), showToast
    end
Loading

Reviews (4): Last reviewed commit: "Let active cancel settle over SSE" | Re-trigger Greptile

Comment thread static/boot.js Outdated
Comment thread static/boot.js Outdated
@franksong2702
Copy link
Copy Markdown
Contributor Author

Updated with merge commit 4718fd7 to bring the branch onto latest master and resolve conflicts. Also clarified the conditional EventSource close comment and replaced the silent cancel request catch with debug logging. Local verification: git diff --check; node --check static/boot.js static/commands.js static/i18n.js static/panels.js static/terminal.js; pytest tests/test_sprint36.py -q (11 passed).

@franksong2702
Copy link
Copy Markdown
Contributor Author

Follow-up commit 3427f2c fixes the CI regression from the previous debug-log change: cancelStream() now uses console.warn instead of console.debug so Node runtime tests keep JSON on stdout. Local verification: git diff --check; node --check static/boot.js; pytest tests/test_cancel_stream_owner_guard.py tests/test_sprint36.py -q (24 passed).

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 4, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

@nesquena-hermes nesquena-hermes added hold changes-requested Maintainer left detailed feedback requesting changes; PR is waiting on author to address labels Jun 4, 2026
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks @franksong2702 — the owner-guard intent is right and the diagnosis (cancelStream() was the odd one out vs cancelSessionStream()) is correct. Our pre-merge review (Codex regression gate, which I verified against the code) found two things; one is a quick fix, the other needs a rethink, so holding:

1. Owner guard has a null-window clobber (small — but pin it)

if(!S.activeStreamId || S.activeStreamId===streamId) clears on a null owner too. A queued/interrupt follow-up turn sets busy in send() but leaves S.activeStreamId null until /api/chat/start returns; during that window a stale cancel continuation hits !S.activeStreamId and calls setBusy(false) over the new in-flight turn. Fix: gate the clear on exact ownership only — if(S.activeStreamId===streamId). (Verified the clobber with a direct repro.)

2. Closing the SSE on the ACTIVE session skips the cancel settle/render (the bigger one)

This is the part that needs a rethink before merge. closeLiveStream(sid, streamId) (messages.js:775) does live.source.close() and sets INFLIGHT[sid].reattach=true — it's built for session-switch teardown of a session you're no longer viewing (the stream survives server-side, marked for reattach). But the user-facing Stop is on the session you are viewing, where you want the server's SSE cancel event (messages.js:~2382-2422) to fire — that handler is what clears INFLIGHT/persisted inflight, clears approval/clarify UI, fetches the settled cancelled transcript, renders the "Task cancelled" message, and refreshes the sidebar.

So mirroring cancelSessionStream() (a not-viewing-this-session teardown) onto the active-session Stop applies the wrong cleanup: it closes the SSE before the cancel event arrives, leaves the inflight in a reattach-pending state instead of cleared, and the settled cancellation transcript never renders (until a manual reload).

Suggested direction

For the active session (S.session?.session_id===sid / S.activeStreamId===streamId), don't client-close the SSE — let the backend cancel flow through the existing cancel SSE handler that already does the full terminal cleanup + render. Reserve closeLiveStream() for the not-viewing case (which is what cancelSessionStream() legitimately handles). Or extract a shared "settle cancelled turn" helper and call it instead of a bare close. Either way, please verify live in-browser: click Stop on an active streaming turn → the "Task cancelled" message renders, inflight state clears, no reload needed, and a new turn started mid-cancel keeps its busy state — a source-contract test won't catch this.

Marking hold + changes-requested. I applied the #1 owner-only guard locally while reviewing (so that part's settled); the SSE-settle rework in #2 is the blocker. Happy to re-review once the active-vs-not-viewing distinction is handled + live-verified.

@franksong2702
Copy link
Copy Markdown
Contributor Author

Pushed follow-up f74eb09a addressing the active-session Stop blocker and bringing the branch onto current master.

What changed:

  • Fixed the null-window owner guard: local state now only clears for the captured active stream when the backend reports cancelled:false; a newer in-flight turn is not clobbered.
  • For an accepted active-session Stop, cancelStream() now keeps the current SSE open and preserves S.activeStreamId so the backend cancel event can run the existing terminal cleanup/render path and show the settled "Task cancelled" transcript without a reload.
  • closeLiveStream(sid, streamId) is now reserved for the stale-owner path where the active stream id has already rotated before the cancel request resolves.
  • Added/updated runtime regression coverage for accepted active Stop, cancelled:false, network error, and owner-rotation behavior.
  • Merged current origin/master and resolved the conflict fallout; PR is now mergeable again.

Verification:

  • python -m pytest tests/test_cancel_interrupt.py tests/test_inflight_stream_reuse.py tests/test_cancelled_turn_status.py tests/test_cancel_stream_owner_guard.py tests/test_sprint36.py -q -> 53 passed
  • node --check static/boot.js static/messages.js static/ui.js static/sessions.js
  • git diff --check
  • GitHub Actions: 11/11 passed on f74eb09a

AI assistance: Codex coordinated the fix with a sub-agent, found and corrected the active SSE settle regression during review, then reran focused verification before pushing.

@nesquena-hermes nesquena-hermes removed hold changes-requested Maintainer left detailed feedback requesting changes; PR is waiting on author to address labels Jun 5, 2026
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Shipped in v0.51.265 (Release IG) — thank you @franksong2702! 🙏 The owner-aware + terminal-settle rework resolved both findings from the earlier holds (null-window clobber + active-session SSE-settle). Codex confirmed the backend reliably emits the terminal cancel event the new design relies on. Closing as merged-via-release-stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cancelStream() should respect active stream ownership and close the SSE source

2 participants