Skip to content

Fix early-cancel live stream race#3476

Closed
franksong2702 wants to merge 1 commit into
nesquena:masterfrom
franksong2702:franksong2702/8787-master-plus-3401
Closed

Fix early-cancel live stream race#3476
franksong2702 wants to merge 1 commit into
nesquena:masterfrom
franksong2702:franksong2702/8787-master-plus-3401

Conversation

@franksong2702

@franksong2702 franksong2702 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Closes #3475

Thinking Path

  • Live turn cancellation should stop the worker, not just the browser transport.
  • The observed bug was an early-cancel startup race: the SSE could detach before the worker was fully reflected in the live stream registry.
  • The fix stays narrow: reconcile cancel against the active-run registry/session cache and report run-journal activity from the live registry.
  • This is distinct from the double-streaming transport bug in Redesign live-to-final assistant replies #3401; it is a state-consistency fix for cancel/recovery.

What Changed

  • api/streaming.py: cancel_stream() now falls back to ACTIVE_RUNS and the session agent cache when STREAMS has already detached, so the worker still receives interrupt("Cancelled by user") and the session is cleaned up.
  • api/routes.py: /api/session now reports run-journal active state from the live active-run registry instead of treating any persisted active_stream_id as proof that the worker is still alive.
  • tests/test_cancel_interrupt.py: adds a regression for canceling after STREAMS has been removed but the worker is still registered as active.
  • tests/test_run_journal_routes.py: locks the session payload invariant for the runtime journal active flag.
  • CHANGELOG.md: records the user-visible behavior fix.

Why It Matters

  • Immediate cancel after a message send is a real user action, not an edge case to ignore.
  • Before this fix, the UI could keep showing a running spinner while the session page was blank, because the browser stream and worker state had split.
  • After this fix, cancel is reconciled against the real live worker state, so the session settles to a cancelled state instead of an inconsistent half-running one.

Verification

  • ./.venv/bin/python -m pytest -q tests/test_cancel_interrupt.py tests/test_run_journal_routes.py
  • git diff --check
  • Live repro was the session startup/cancel race on 8787 where cancel was pressed immediately after send.

Risks / Follow-ups

  • This is intentionally scoped to the startup-race cancel path and session-state reconciliation.
  • It does not change the broader live-stream transport ownership model or the Redesign live-to-final assistant replies #3401 double-streaming fix.
  • If a future runtime path introduces a new source of live worker state, it should be threaded into the same reconciliation path instead of reintroducing a split-brain cancel state.

Contract Routing
Task type: bugfix for live-stream cancellation / session-state reconciliation
Touched areas: api/streaming.py, api/routes.py, tests/test_cancel_interrupt.py, tests/test_run_journal_routes.py, CHANGELOG.md
Relevant public docs:

  • AGENTS.md
  • CONTRIBUTING.md
  • docs/CONTRACTS.md
  • docs/rfcs/webui-run-state-consistency-contract.md
    Scope boundaries: startup-race cancel/state reconciliation only; no live-stream redesign, no double-streaming transport change
    Evidence needed before claiming done: targeted regression tests, diff check, and the user-visible cancel path on the running session

Model Used

  • OpenAI Codex (GPT-5)
  • Used local repo inspection, gh, and targeted pytest to validate the fix.

@nesquena-hermes

Copy link
Copy Markdown
Collaborator

Reviewed the cancel-fix commit (cd23dfa3) against origin/master, plus the live-config and routes context on both sides.

Summary

The actual fix is good and correctly scoped to #3475. The previous cancel_stream() opened with if stream_id not in streams: return False, so an early Stop click that lands after the browser SSE has detached from STREAMS — but while the worker is still registered — returned False and never interrupted the agent or cleared session state. That's exactly the split-brain symptom in #3475: sidebar shows streaming, page renders blank, worker keeps running. The new code falls back to ACTIVE_RUNS to recover session_id, then pulls the live agent from SESSION_AGENT_CACHE to deliver interrupt("Cancelled by user") even when STREAMS is already gone.

Code reference

The core change in api/streaming.py — the STREAMS-miss path no longer bails:

    with streams_lock:
        stream_present = stream_id in streams
        if stream_present:
            q = streams.get(stream_id)
        else:
            with _live_config.ACTIVE_RUNS_LOCK:
                active_run_entry = dict((_live_config.ACTIVE_RUNS or {}).get(stream_id) or {})
            if not active_run_entry:
                return False
            active_run_session_id = str(active_run_entry.get("session_id") or "").strip() or None

and the agent recovery:

    agent = agent_instances.get(stream_id)
    if agent is None and active_run_session_id:
        with _live_config.SESSION_AGENT_CACHE_LOCK:
            cached = _live_config.SESSION_AGENT_CACHE.get(active_run_session_id)
        if cached and _cached_agent_matches_session(cached[0], active_run_session_id):
            agent = cached[0]

The _cached_agent_matches_session guard (streaming.py:4082) is the right call here — it returns True only when the cached agent's identity is None or matches the session id, so you can't cross-wire an interrupt to a different session's worker that happens to be cached. The routes.py half (journal_active = bool(original_stream_id in active_stream_ids)) is also correct: deriving live state from the active-run registry instead of trusting a persisted active_stream_id is what keeps the journal from reporting a dead worker as running.

Verification

I confirmed ACTIVE_RUNS, ACTIVE_RUNS_LOCK, SESSION_AGENT_CACHE, _cached_agent_matches_session, and _active_stream_ids all already exist on origin/master (config.py:4803, models.py:336, streaming.py:3987). And the routes block this commit edits is present verbatim on master (routes.py:4998 + 5167-5169). So cd23dfa3 is logically independent of any other in-flight work — the new test_cancel_falls_back_to_active_run_registry exercises exactly the STREAMS-detached path and asserts interrupt is called once plus pending_*/active_stream_id are cleared.

One blocker before merge

The branch (franksong2702/8787-master-plus-3401) is stacked on #3401 (still open), so the diff against master is 6077 insertions across 54 files — not the 5-file change the description lists. git log origin/master..HEAD shows:

cd23dfa3 Fix early-cancel live stream race
f2114b0b Merge PR #3401 into 8787 runtime
a000b20f Redesign live-to-final assistant replies

As-is this can't merge without also pulling #3401's redesign. Since the cancel fix doesn't depend on #3401 (verified above), I'd recommend rebasing cd23dfa3 directly onto origin/master so this PR carries only the 5-file cancel change and can be reviewed/merged on its own merits. Otherwise reviewers have to evaluate the entire #3401 redesign to land a narrow bugfix.

@franksong2702 franksong2702 force-pushed the franksong2702/8787-master-plus-3401 branch from cd23dfa to 079cc9d Compare June 3, 2026 12:20
@franksong2702

Copy link
Copy Markdown
Contributor Author

Rebased this PR onto a clean origin/master base and force-pushed the head branch.

Current state:

  • single commit: 079cc9d5 Fix early-cancel live stream race
  • diff is now limited to the intended 5 files (CHANGELOG.md, api/routes.py, api/streaming.py, tests/test_cancel_interrupt.py, tests/test_run_journal_routes.py)
  • no Redesign live-to-final assistant replies #3401 live-to-final redesign commits are included anymore
  • GitHub checks are green (lint, browser-smoke, and all Python shards)
  • PR reports mergeable=MERGEABLE, mergeStateStatus=CLEAN

@nesquena-hermes

Copy link
Copy Markdown
Collaborator

Confirmed the rebase. I re-fetched pull/3476/head and git log origin/master..HEAD now shows a single commit (079cc9d5 Fix early-cancel live stream race), and the diff against origin/master is the intended 5 files — 175 insertions, not the 6077-across-54-files stack I flagged earlier:

 CHANGELOG.md                     |   7 ++
 api/routes.py                    |  14 ++-
 api/streaming.py                 | 182 ++++++++++++++++--------------
 tests/test_cancel_interrupt.py   |  47 +++++++-
 tests/test_run_journal_routes.py |   2 +

No #3401 live-to-final commits remain, so the narrow cancel fix can now be reviewed and landed on its own merits.

What survived the rebase (verified against origin/master)

The core cancel_stream() change is intact — the STREAMS-miss path no longer bails, it falls back to ACTIVE_RUNS to recover the session id and pulls the live agent from SESSION_AGENT_CACHE (guarded by _cached_agent_matches_session, streaming.py:3987) before delivering interrupt("Cancelled by user"). That's the fix for the split-brain symptom in #3475.

The routes.py half is also preserved, and slightly cleaner than the description suggested:

active_stream_ids = _active_stream_ids()
try:
    compact_session = s.compact(
        include_runtime=True,
        active_stream_ids=active_stream_ids,
    )
except TypeError:
    compact_session = s.compact()
...
journal_active = bool(original_stream_id in active_stream_ids)

I confirmed the symbols this leans on all exist on origin/master: _active_stream_ids (models.py:336) and compact(include_runtime=False, active_stream_ids=None) (models.py:836) — so include_runtime=True is supported natively and the TypeError fallback is just belt-and-suspenders for older callers. Deriving journal_active from the live active-run registry instead of the persisted active_stream_id is the right call: it stops the journal from reporting a dead worker as running.

State

mergeable=MERGEABLE, mergeStateStatus=CLEAN, and all 11 checks green (lint, browser-smoke, 9 Python shards across 3.11/3.12/3.13). The blocker I raised is resolved — no further changes requested from my side. Leaving the final merge call to the maintainer.

@nesquena-hermes

Copy link
Copy Markdown
Collaborator

Shipped in v0.51.237 (Release HE) via release PR #3492 — thank you @franksong2702! 🙏

Your early-cancel reconciliation fix (fall back to ACTIVE_RUNS + the session agent cache when STREAMS has detached, so an immediate Stop still interrupts the worker) is now on master and tagged. This was the deepest review of the sweep — picked as a high-impact, well-scoped fix from a regular contributor.

Two adjustments absorbed on the way in (both surfaced by the Codex regression gate, fixed + regression-tested):

  1. The refactor moved agent.interrupt() ahead of the partial-text/reasoning/tool-call snapshot, and that snapshot was no longer under streams_lock — so the worker's finally (which pops STREAM_PARTIAL_TEXT/STREAM_REASONING_TEXT/STREAM_LIVE_TOOL_CALLS under STREAMS_LOCK) could clear those buffers before cancel persisted them, silently dropping a cancelled turn's already-streamed text. Fixed by snapshotting all worker-poppable buffers under streams_lock before any interrupt.
  2. That snapshot initially only covered the STREAMS-present path; the detached ACTIVE_RUNS-only path (the case this PR adds) still lost text. Fixed by hoisting the snapshot to run unconditionally inside the lock.

Both are pinned by regression tests verified to fail against the buggy versions (test_cancel_preserves_partial_text_when_interrupt_pops_buffers + test_cancel_preserves_partial_text_on_detached_active_run_path). Codex re-reviewed → SAFE TO SHIP; the deadlock concern was cleared (no path takes ACTIVE_RUNS_LOCK then STREAMS_LOCK).

Authorship preserved via Co-authored-by and credited in the CHANGELOG. Closes #3475. Closing as merged-via-release-stage (recommitted on the stage branch after rebase onto current master).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Early cancel during turn startup can leave a split running state

2 participants