
Fix stream_end recovery race for live assistant settle#3885

Closed
franksong2702 wants to merge 6 commits into nesquena:master from franksong2702:franksong2702/fix-stream-end-recovery-race

Conversation

@franksong2702

@franksong2702 franksong2702 commented Jun 9, 2026

Contributor

Thinking Path

  • Issue #3877 ("Streaming content flickers (disappears and reappears) during mid-stream DOM rebuilds in long sessions") describes a visible streaming flicker class where active live assistant DOM can disappear and later reappear after a rebuild or terminal settle.
  • This PR scopes the fix to the stream_end terminal-settle race: stream_end can arrive while live assistant text, Worklog, Thinking, or tool DOM is still present and before the persisted session is safe to use as the canonical final transcript.
  • The invariant is: while the current session still has live stream UI or the persisted session is still active/pending, stream_end must not blindly replace or clear the live DOM.
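The invariant in the last bullet can be read as a single predicate. A minimal sketch, assuming a hypothetical helper name (`mustDeferStreamEndCleanup`) and hypothetical field names (`hasLiveScene`, `persistedActive`, `persistedPending`); none of these identifiers come from static/messages.js:

```javascript
// Hypothetical helper, not part of static/messages.js.
// Returns true when stream_end must NOT blindly replace or clear the live DOM.
function mustDeferStreamEndCleanup({ hasLiveScene, persistedActive, persistedPending }) {
  // Either condition alone is enough to block a blind teardown:
  // live stream UI is still on screen, or the persisted session is
  // not yet safe to use as the canonical final transcript.
  return Boolean(hasLiveScene || persistedActive || persistedPending);
}
```

Only when this returns false would a direct replace from the persisted session be safe under the stated invariant.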

What Changed

  • Added deferred stream_end recovery state in attachLiveStream():
    • _pendingStreamEndRecovery
    • _streamEndRecoveryTimer
    • _streamEndRecoveryAttempts
    • _clearStreamEndRecovery()
  • Added live-scene detection so empty-text but active UI states still gate recovery:
    • live assistant rows
    • live reasoning text
    • Worklog shell
    • live tool cards
    • active Thinking cards
  • Added _runStreamEndRecovery(source) to retry settled-session restore while the session is still active/pending, then centralize the terminal cleanup path.
  • Extended _restoreSettledSession(source, {status:true}) so callers can distinguish restored, active, missing, stale, and error while preserving the existing public helper signature.
  • Cleared pending recovery on terminal paths: done, stream_end, apperror, cancel, and stream error handling.
  • Added regression coverage in tests/test_stream_end_recovery_gating.py and updated the existing terminal cleanup ownership test for the status-aware restore call.
  • Updated CHANGELOG.md.
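The live-scene detection listed above can be sketched as a pure function over a snapshot of the UI. This is an illustration only: the real helper inspects the DOM, and the snapshot shape and field names below are assumptions made for the example.

```javascript
// Hypothetical snapshot-based model of live-scene detection.
// The real check reads the DOM; these field names are illustrative.
function liveScenePresent(scene) {
  return Boolean(
    scene.liveAssistantRows > 0 ||
    (scene.liveReasoningText || '').trim() !== '' ||
    scene.worklogShell ||
    scene.liveToolCards > 0 ||
    scene.activeThinkingCards > 0
  );
}

// No assistant text at all, but a Worklog shell is still on screen.
const emptyTextButActive = {
  liveAssistantRows: 0,
  liveReasoningText: '',
  worklogShell: true,
  liveToolCards: 0,
  activeThinkingCards: 0,
};
console.log(liveScenePresent(emptyTextButActive)); // true
```

The point of the example is the empty-text case: any one of the five signals is enough to gate recovery, so a scene with no assistant text still counts as live.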

Why It Matters

  • State layer: browser live stream/SSE observation plus live UI scene/cache, converging to the persisted session transcript.
  • This prevents the frontend from treating stream_end as permission to clear or replace active live assistant DOM before the backend session has actually settled.
  • The user-visible effect is avoiding a blank assistant region or a "switch away and back to recover" artifact at terminal stream cleanup.

Verification

  • pytest tests/test_stream_end_recovery_gating.py -q
    • 6 passed
  • pytest tests/test_1694_terminal_cleanup_ownership.py::test_stream_end_without_done_restores_settled_session_before_closing tests/test_1694_terminal_cleanup_ownership.py::test_settled_restore_and_error_close_only_the_event_source_owner tests/test_stream_end_recovery_gating.py -q
    • 8 passed
  • node --check static/messages.js
  • npm run lint:runtime
  • git diff --check
  • Browser/runtime verification on local dev runtime:
    • Sent a short live-stream prompt using qwen3.6-plus.
    • Final assistant DOM retained text after completion; no blank assistant region observed.
    • /api/chat/stream/status?stream_id=bc55af7cab4346478015e42f87421db7 returned active=false, journal.last_event=stream_end, journal.terminal=true, journal.terminal_state=completed.
    • Journal tail showed done -> metering -> title_status -> title -> stream_end.
    • Switching to another conversation and back still showed 5 messages and final assistant paragraph OK; content did not depend on switching back to recover.
    • Browser console reported 0 errors.
  • GitHub Actions on PR #3885 ("Fix stream_end recovery race for live assistant settle"):
    • browser-smoke, lint, and all Python 3.11/3.12/3.13 test shards passed.
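The status check from the browser verification could be scripted along these lines. A rough sketch, assuming the payload has exactly the fields quoted above (`active`, `journal.last_event`, `journal.terminal`, `journal.terminal_state`); the function name is hypothetical and the full response schema is not shown in this PR.

```javascript
// Hypothetical check over the /api/chat/stream/status payload fields quoted above.
// Anything beyond these four fields is unknown and deliberately ignored.
function streamJournalSettled(status) {
  if (status == null) return false;
  const journal = status.journal || {};
  return (
    status.active === false &&
    journal.terminal === true &&
    journal.last_event === 'stream_end' &&
    journal.terminal_state === 'completed'
  );
}
```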

Risks / Follow-ups

Model Used

  • GPT-5 Codex, with local shell verification and Playwright browser validation.

Refs #3877

@franksong2702 franksong2702 force-pushed the franksong2702/fix-stream-end-recovery-race branch from 3b5bfb3 to fc22e3d Compare June 9, 2026 17:21
@greptile-apps

greptile-apps Bot commented Jun 9, 2026


Greptile Summary

This PR introduces a deferred stream_end recovery mechanism to prevent live assistant DOM from being cleared before the backend session has fully settled. When stream_end arrives while active live content is still on screen, cleanup is now deferred and retried (up to 10 × 200 ms) until the persisted session leaves its active/pending state, at which point a centralized _finalizeStreamEndFallback helper performs the full terminal cleanup.

  • New recovery state machine: _scheduleStreamEndRecovery / _runStreamEndRecovery / _finalizeStreamEndFallback replace the previous single-shot _restoreSettledSession call; _clearStreamEndRecovery() is inserted into all terminal event paths (done, apperror, cancel, stream error) to prevent abandoned timers.
  • Richer _restoreSettledSession return type: the function now returns 'restored' | 'active' | 'missing' | 'stale' | 'error' when called with {status:true}, allowing callers to distinguish a still-live backend session from a genuinely missing or failed one without breaking existing callers that use the boolean form.
  • Static regression tests: test_stream_end_recovery_gating.py guards six source-level invariants about the new helper chain; existing terminal-cleanup-ownership tests are updated for the new signature.
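The dual return shape described in the second bullet can be illustrated as follows. This is a sketch under assumptions: `shapeRestoreResult` is a hypothetical name, and mapping the boolean form to "true only when restored" is inferred from the discussion rather than confirmed by the diff.

```javascript
// Status strings quoted in the summary above.
const RESTORE_STATUSES = ['restored', 'active', 'missing', 'stale', 'error'];

// Hypothetical helper: shapes one internal status into either return form.
function shapeRestoreResult(status, options = null) {
  if (!RESTORE_STATUSES.includes(status)) {
    throw new Error(`unknown restore status: ${status}`);
  }
  // Status-aware callers opt in with {status:true} and get the string.
  if (options && options.status) return status;
  // Legacy callers keep a boolean (assumed: true only on a successful restore).
  return status === 'restored';
}
```

Keeping the options argument optional is what lets older call sites stay unchanged while the new stream_end path branches on the richer result.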

Confidence Score: 3/5

Safe for the common path, but the new _finalizeStreamEndFallback helper can clobber a concurrently-started stream's active ID and live UI when reached via the stale status branch.

_finalizeStreamEndFallback is missing the stream-ownership guard that _handleStreamError already carries. When _restoreSettledSession returns 'stale' — meaning a newer stream has taken over the same session — _finalizeStreamEndFallback's if(_isActiveSession()) branch still runs, setting S.activeStreamId=null and calling clearLiveToolCards()/renderMessages(), erasing the new stream's live content. The window is narrow (requires a new message within ~180 ms of stream_end arriving with live content), but the session-level state cleared is shared across closures, so the damage affects the new stream rather than just the finalized old one.

static/messages.js — specifically _finalizeStreamEndFallback and the direct stream_end handler's fallback path to it.

Important Files Changed

| Filename | Overview |
|---|---|
| static/messages.js | Core change: adds deferred stream_end recovery via _scheduleStreamEndRecovery/_runStreamEndRecovery/_finalizeStreamEndFallback. _finalizeStreamEndFallback lacks the stream-ownership guard (S.activeStreamId===streamId) present in _handleStreamError before clearing global session state, causing a potential clobber of a concurrently-started new stream in the stale path. |
| tests/test_stream_end_recovery_gating.py | New regression tests for the stream_end recovery gating logic; all six tests guard source-level invariants via static analysis of messages.js. Tests verify helper structure, retry logic, scene detection, and state clearing on terminal events. |
| tests/test_1694_terminal_cleanup_ownership.py | Updated to accept both the new _restoreSettledSession(source,{status:true}) signature and the refactored _finalizeStreamEndFallback delegation pattern while preserving ownership semantics checks. |
| tests/test_issue2863_session_index_prime.py | Made background-thread assertion conditional to tolerate fast runners where the rebuild thread completes before the assertion runs; preserves the stronger index-content assertion. |
| CHANGELOG.md | Adds unreleased entry describing the stream_end terminal-settle race fix; entry is accurate and appropriately scoped. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[stream_end event] --> B[_clearStreamEndRecovery]
    B --> C{_bailOutOfTerminalEventsFromStaleStream?}
    C -->|yes| Z[return]
    C -->|no| D{activeStreamId===streamId AND _liveStreamEndScenePresent?}
    D -->|yes| E[_scheduleStreamEndRecovery 180ms]
    E --> F[_runStreamEndRecovery fires]
    F --> G{_streamFinalized OR _terminalStateReached?}
    G -->|yes| H[_clearStreamEndRecovery return]
    G -->|no| I[_restoreSettledSession status=true]
    I --> J{status?}
    J -->|restored| K[_clearStreamEndRecovery return]
    J -->|active and attempts lt 10| L[_scheduleStreamEndRecovery 200ms]
    L --> F
    J -->|stale or missing or error or exhausted| M[_finalizeStreamEndFallback]
    D -->|no| N[_restoreSettledSession status=true]
    N --> O{status?}
    O -->|restored| Z
    O -->|active| E
    O -->|stale or missing or error| M
    M --> P[Set _streamFinalized _terminalStateReached]
    P --> Q[_clearOwnerInflightState _clearApprovalForOwner _clearClarifyForOwner]
    Q --> R{_isActiveSession?}
    R -->|yes| S2[S.activeStreamId=null clearLiveToolCards renderMessages]
    R -->|no| T[renderSessionList]
    S2 --> T
    T --> U[_setActivePaneIdleIfOwner _closeSource]
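The status branch in the flowchart (nodes J and O) reduces to a small decision function. A minimal sketch, assuming a hypothetical `nextRecoveryStep` name and treating the attempt counter as "attempts already made"; the real code may count differently.

```javascript
const MAX_RECOVERY_ATTEMPTS = 10; // ceiling taken from the flowchart ("attempts lt 10")

// Hypothetical pure version of the flowchart's status branch.
// Returns 'done' (restore succeeded), 'retry' (schedule another attempt),
// or 'finalize' (fall through to the terminal fallback).
function nextRecoveryStep(status, attempts) {
  if (status === 'restored') return 'done';
  if (status === 'active' && attempts < MAX_RECOVERY_ATTEMPTS) return 'retry';
  // stale, missing, error, or active with the attempt budget exhausted
  return 'finalize';
}
```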

Reviews (2): Last reviewed commit: "Stabilize session index rebuild test"

@nesquena-hermes

Collaborator

Pulled the branch into a read-only worktree and read the full stream_end handler (static/messages.js:3320-3358), the new _runStreamEndRecovery / _scheduleStreamEndRecovery / _liveStreamEndScenePresent helpers (1591-1644), the status-aware _restoreSettledSession (3703-3777), and the trio definitions (_clearApprovalForOwner/_clearClarifyForOwner/_clearOwnerInflightState, 1491-1506), plus the master baseline.

The core design is sound. The deferral + bounded retry is well-built: _runStreamEndRecovery re-checks _streamFinalized || _terminalStateReached || !_pendingStreamEndRecovery at the top of every attempt (1620), so a real done that fires mid-await can't be clobbered — and _restoreSettledSession itself short-circuits to 'restored' when _streamFinalized flipped during its network round-trip (3713). The 10-attempt ceiling at 1630 bounds the timer chain. Scene detection at 1607-1614 correctly treats empty-text-but-active states (Worklog shell, live tool cards, active thinking) as "don't tear down yet," which is the actual #3877 symptom.
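The deferral, the per-attempt re-check, and the attempt ceiling described above can be modelled in isolation. This is a simplified sketch, not the code from static/messages.js: `createStreamEndRecovery` and its parameters are hypothetical, `restore()` is synchronous here whereas the real restore is a network round-trip, and the scheduler is injected so the example runs without real timers.

```javascript
// Hypothetical, simplified model of deferred stream_end recovery.
function createStreamEndRecovery({ restore, finalize, schedule, maxAttempts = 10 }) {
  const state = { pending: false, attempts: 0, finalized: false };

  function clear() { state.pending = false; state.attempts = 0; }

  function run() {
    // Re-checked on every attempt, so a terminal event that landed
    // while the retry was queued wins over the deferred recovery.
    if (state.finalized || !state.pending) { clear(); return; }
    state.attempts += 1;
    const status = restore();
    if (status === 'restored') { clear(); return; }
    if (status === 'active' && state.attempts < maxAttempts) { schedule(run, 200); return; }
    clear();
    state.finalized = true;
    finalize(); // give-up path: bounded, so the timer chain always ends
  }

  return {
    start() { state.pending = true; state.attempts = 0; schedule(run, 180); },
    markDone() { state.finalized = true; clear(); }, // e.g. a real `done` event
  };
}

// Drive it with a manual queue standing in for setTimeout.
const queue = [];
let restoreCalls = 0;
let finalizeCalls = 0;
const recovery = createStreamEndRecovery({
  restore: () => { restoreCalls += 1; return 'active'; }, // session never settles
  finalize: () => { finalizeCalls += 1; },
  schedule: (fn) => queue.push(fn),
});
recovery.start();
while (queue.length) queue.shift()();
console.log(restoreCalls, finalizeCalls); // 10 1
```

With a session that never leaves `active`, the model makes exactly ten restore attempts and then finalizes once, which is the bounded behavior the review describes.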

On the trio-cleanup concern

I'd reframe Greptile's note: it is not a regression. The master stream_end fallback never called _clearOwnerInflightState / _clearApprovalForOwner / _clearClarifyForOwner either — and on master, when the persisted session was still active_stream_id/pending, _restoreSettledSession returned false and fell straight into the same trio-less inline fallback. So the "session won't converge" terminal path behaves the same as before w.r.t. those three.

That said, the concern lands harder here than it did on master, for a specific reason: _runStreamEndRecovery is reached only when _liveStreamEndScenePresent() was true (3331), i.e. there's live UI — and that live UI can include an owner-held approval or clarify card. When recovery exhausts its 10 attempts because persistence never settles, the give-up path at 1635-1643 closes the source and tears down the live DOM but leaves any approval/clarify prompt on screen with no automatic clear:

// _runStreamEndRecovery terminal give-up, 1635-1643
_clearStreamEndRecovery();
_terminalStateReached=true;
if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
_streamFinalized=true;
_cancelAnimationFramePendingStreamRender();
_streamFadeCleanupReduceMotionListener();
_smdEndParser();
if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
_closeSource(source);

Compare the done (3140-3147), apperror (3479-3481), cancel (3621-3623) and the _restoreSettledSession success path (3794-3797) — all four call the trio. The two stream_end terminal give-up paths are the only finalizers that don't.

Recommendation

The inline stream_end fallback (3350-3356) and the _runStreamEndRecovery give-up (1636-1643) are byte-for-byte the same 7-line finalize sequence duplicated. Lift them into one helper and add the trio there so all terminal finalizers converge:

function _finalizeStreamEndTerminal(source){
  _clearStreamEndRecovery();
  _terminalStateReached=true;
  if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
  _streamFinalized=true;
  _cancelAnimationFramePendingStreamRender();
  _streamFadeCleanupReduceMotionListener();
  _smdEndParser();
  if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
  _clearOwnerInflightState();
  _clearApprovalForOwner();
  _clearClarifyForOwner('terminal');
  _closeSource(source);
}

Both sites then become a single call. Since _clearApprovalForOwner/_clearClarifyForOwner are already owner-gated (_approvalBelongsToOwner() / _clarifyBelongsToOwner() at 1492/1499), this is safe for non-owner streams and can't clear another tab's prompt.
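Why owner gating makes the shared helper safe can be shown with a small model. A sketch only, assuming a hypothetical shared prompt slot tagged with the owning stream ID; the real _approvalBelongsToOwner() check may use different state.

```javascript
// Hypothetical model: one shared approval slot, tagged with the stream that owns it.
function makeOwnerScopedClear(shared, streamId) {
  const approvalBelongsToOwner = () =>
    shared.approval !== null && shared.approval.streamId === streamId;

  return function clearApprovalForOwner() {
    if (!approvalBelongsToOwner()) return false; // not ours: leave it on screen
    shared.approval = null;
    return true;
  };
}

const shared = { approval: { streamId: 'stream-b', prompt: 'Run command?' } };
const clearForA = makeOwnerScopedClear(shared, 'stream-a');
const clearForB = makeOwnerScopedClear(shared, 'stream-b');

console.log(clearForA()); // false: stream-a does not own the prompt
console.log(clearForB()); // true: the owner clears its own prompt
```

Because the clear is a no-op for non-owners, calling it unconditionally from every terminal finalizer does not risk removing a prompt that belongs to a different stream.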

Test note

tests/test_stream_end_recovery_gating.py is entirely source-text assertions (e.g. line 80 greps the helper body for if(_streamFinalized || _terminalStateReached || !_pendingStreamEndRecovery)). Consistent with the repo's pattern for timing-sensitive races, and fine as an invariant pin — but it won't catch the stuck-prompt case above because nothing asserts the trio in the terminal path. If you add the shared helper, a one-line assert that _clearApprovalForOwner() appears in its body would lock the fix in. The manual browser verification in the PR body is solid; the gap is only the auto-give-up-with-pending-approval scenario, which is hard to reach manually.
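The kind of source-text pin suggested here can be illustrated as below. The repository's tests are written in Python; this JavaScript version only demonstrates the idea, and `extractFunctionBody` is a hypothetical helper. The brace matching is deliberately naive: it assumes no braces inside strings, comments, or the parameter list.

```javascript
// Hypothetical helper: returns the text between the outer braces of
// `function <name>(...) { ... }`, or null if not found or unbalanced.
function extractFunctionBody(source, name) {
  const start = source.indexOf(`function ${name}(`);
  if (start === -1) return null;
  const open = source.indexOf('{', start);
  if (open === -1) return null;
  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === '{') depth++;
    else if (source[i] === '}' && --depth === 0) return source.slice(open + 1, i);
  }
  return null;
}

// Stand-in source text; a real test would read static/messages.js from disk.
const sampleSource =
  'function _finalizeStreamEndTerminal(source){ if(x){y();} _clearApprovalForOwner(); _closeSource(source); }';
const body = extractFunctionBody(sampleSource, '_finalizeStreamEndTerminal');

// Pin both presence and ordering: the prompt is cleared before the source closes.
const clearedBeforeClose =
  body !== null &&
  body.includes('_clearApprovalForOwner()') &&
  body.indexOf('_clearApprovalForOwner()') < body.indexOf('_closeSource(');
console.log(clearedBeforeClose); // true
```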

Net: ship-worthy after consolidating the two duplicated terminal finalizers and adding the trio to the shared path. Nice scoping on keeping this to the stream_end race rather than trying to solve all of #3877.

@nesquena-hermes

Collaborator

Re-reviewed the six new commits (c2e8c6b..d99a34f) against my earlier note. This pushes the PR over the line: the consolidation I asked for is in, and it's done correctly.

What the new commits did

The two duplicated 7-line terminal finalizers are now a single helper, _finalizeStreamEndFallback(source) at static/messages.js:1619, and both terminal stream_end paths route through it:

  • the inline stream_end give-up after _restoreSettledSession reports non-restored/non-active (3362)
  • the bounded-retry exhaustion path inside _runStreamEndRecovery (1656)

The helper carries the owner-trio that was previously missing from those two paths:

// static/messages.js:1619-1640  (_finalizeStreamEndFallback)
_clearStreamEndRecovery();
if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
_terminalStateReached=true;
_streamFinalized=true;
_cancelAnimationFramePendingStreamRender();
_streamFadeCleanupReduceMotionListener();
_smdEndParser();
if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
_clearOwnerInflightState();
_clearApprovalForOwner();
_clearClarifyForOwner('terminal');
if(_isActiveSession()){ S.activeStreamId=null; clearLiveToolCards(); if(!assistantText)removeThinking(); renderMessages({preserveScroll:true}); }
renderSessionList();
_setActivePaneIdleIfOwner();
_closeSource(source);

That closes the exact gap I flagged: the auto-give-up-with-pending-approval scenario (10 retries exhausted while an owner-held approval/clarify card is on screen) now clears the prompt instead of leaving it stranded. And it's safe for non-owner streams because the trio is owner-gated — _approvalBelongsToOwner()/_clarifyBelongsToOwner() (1488/1491) and _clearOwnerInflightState's S.activeStreamId!==streamId early-return (1504). So a finalize on a background/foreign stream can't clear another tab's prompt. This now matches the done (3153-3158), apperror, and cancel finalizers — all terminal paths converge on the same owner-aware teardown.

Test lock is in

tests/test_stream_end_recovery_gating.py::test_stream_end_fallback_helper_clears_owner_state_before_closing asserts the trio + renderMessages({preserveScroll:true}) + _setActivePaneIdleIfOwner() + _closeSource inside _finalizeStreamEndFallback — exactly the one-line guard I suggested, so the fix can't silently regress. The other gating tests still pin the deferral, the 10-attempt ceiling, and the empty-text scene detection.

Test-file fixups are benign

The test_issue856_* and test_session_todo marker edits change _restoreSettledSession(source) to _restoreSettledSession(source (a prefix match) to accommodate the new options=null signature; this is not a behavior change, just keeping the source-text anchors valid. And test_issue2863_session_index_prime.py now tolerates thread is None when a fast runner clears _SESSION_INDEX_REBUILD_THREAD before the assertion observes it; the real invariants (first-scan correctness + rebuilt index) are still asserted. That tracks the #3894 follow-up rather than weakening coverage.

Net: this addresses the review fully. Nothing further blocking from me. Nice tight scoping — still correctly limited to the stream_end race rather than trying to swallow all of #3877.

Comment thread static/messages.js
Comment on lines +1619 to +1638
function _finalizeStreamEndFallback(source){
_clearStreamEndRecovery();
if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
_terminalStateReached=true;
_streamFinalized=true;
_cancelAnimationFramePendingStreamRender();
_streamFadeCleanupReduceMotionListener();
_smdEndParser();
if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
_clearOwnerInflightState();
_clearApprovalForOwner();
_clearClarifyForOwner('terminal');
if(_isActiveSession()){
S.activeStreamId=null;
clearLiveToolCards();if(!assistantText)removeThinking();
renderMessages({preserveScroll:true});
}
renderSessionList();
_setActivePaneIdleIfOwner();
_closeSource(source);


P1 _finalizeStreamEndFallback missing stream-ownership guard before clearing global session state

When _restoreSettledSession returns 'stale' (i.e. _isActiveSession() is true but S.activeStreamId !== streamId), a newer stream has already taken over the same session. _finalizeStreamEndFallback is called regardless, and its if(_isActiveSession()) branch unconditionally executes S.activeStreamId=null, clearLiveToolCards(), and renderMessages() — clobbering the live UI of that new stream.

The concrete failure path: stream_end arrives while live content is present → 180 ms deferred recovery is scheduled → within that window the user submits a new message on the same session (S.activeStreamId becomes the new stream's ID) → recovery fires → _restoreSettledSession sees S.activeStreamId !== streamId and returns 'stale' → _finalizeStreamEndFallback nullifies S.activeStreamId and re-renders from stale persisted state, erasing the new stream's live content.

Compare with _handleStreamError (line 3786), which has an explicit early-return guard for exactly this condition: if(_isActiveSession() && S.activeStreamId!==streamId){ _closeSource(source); return; }. The same guard — or at minimum changing if(_isActiveSession()) to if(_isActiveSession() && S.activeStreamId===streamId) — is needed in _finalizeStreamEndFallback before the S.activeStreamId=null / clearLiveToolCards() / renderMessages() block. The same gap exists in the non-deferred direct stream_end path, which can also reach _finalizeStreamEndFallback after an async API roundtrip during which S.activeStreamId may have advanced.
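The clobber and the proposed guard can be reproduced with a small model of the shared state. A sketch under assumptions: `makeFinalizer`, `sessionId`, and `liveToolCards` are hypothetical stand-ins, and only `S.activeStreamId` is a name taken from the discussion.

```javascript
// Hypothetical model: each stream closure gets its own finalizer over shared state S.
function makeFinalizer(S, sessionId, streamId, { ownershipGuard }) {
  return function finalize() {
    const isActiveSession = S.sessionId === sessionId;
    const ownsStream = S.activeStreamId === streamId;
    // Without the guard, being on the active session is enough to clear
    // shared state, even if a newer stream now owns it.
    if (isActiveSession && (!ownershipGuard || ownsStream)) {
      S.activeStreamId = null;
      S.liveToolCards = [];
    }
  };
}

// A newer stream has already taken over the same session.
const unguardedState = { sessionId: 's1', activeStreamId: 'new', liveToolCards: ['card'] };
makeFinalizer(unguardedState, 's1', 'old', { ownershipGuard: false })();
console.log(unguardedState.activeStreamId); // null: the new stream's ID was clobbered

const guardedState = { sessionId: 's1', activeStreamId: 'new', liveToolCards: ['card'] };
makeFinalizer(guardedState, 's1', 'old', { ownershipGuard: true })();
console.log(guardedState.activeStreamId); // new: the stale finalizer left it alone
```

The guarded variant corresponds to the minimal change suggested above, narrowing the condition to the active session and the owning stream together.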

nesquena-hermes added a commit that referenced this pull request Jun 10, 2026
…#3892 #3898 #3885 #3882 #3868) (#3902)

* stage v0.51.347: render/stream cluster (#3892 #3898 #3885 #3882 #3868) + 2 Opus SHOULD-FIX

* stage v0.51.347: trim #3885 error-guard comment to fit diagnostic-test window

* Stamp v0.51.347 — Release LK (streaming & render reliability cluster)

* Remove stray uv.lock accidentally staged (not part of any cluster PR)

---------

Co-authored-by: nesquena-hermes <[email protected]>
@nesquena-hermes

Collaborator

Shipped in v0.51.347 (Release LK — streaming & render reliability cluster) 🎉 — live now.

Merged via the combined release PR #3902 alongside #3892, #3898, #3885, #3882, and #3868 (all in the live-to-final streaming/render family). Each was applied onto fresh master, gated through Codex (SAFE TO SHIP) + Opus (ship, no MUST-FIX) + the full suite (8532 passing), with two Opus SHOULD-FIX folded in before merge. Authorship preserved via co-authored commit + CHANGELOG credit.

Thanks for the fix! 🙏



2 participants