Fix stream_end recovery race for live assistant settle#3885
Fix stream_end recovery race for live assistant settle#3885franksong2702 wants to merge 6 commits into
Conversation
3b5bfb3 to
fc22e3d
Compare
|
| Filename | Overview |
|---|---|
| static/messages.js | Core change: adds deferred stream_end recovery via _scheduleStreamEndRecovery/_runStreamEndRecovery/_finalizeStreamEndFallback. _finalizeStreamEndFallback lacks the stream-ownership guard (S.activeStreamId===streamId) present in _handleStreamError before clearing global session state, causing a potential clobber of a concurrently-started new stream in the stale path. |
| tests/test_stream_end_recovery_gating.py | New regression tests for the stream_end recovery gating logic; all six tests guard source-level invariants via static analysis of messages.js. Tests verify helper structure, retry logic, scene detection, and state clearing on terminal events. |
| tests/test_1694_terminal_cleanup_ownership.py | Updated to accept both the new _restoreSettledSession(source,{status:true}) signature and the refactored _finalizeStreamEndFallback delegation pattern while preserving ownership semantics checks. |
| tests/test_issue2863_session_index_prime.py | Made background-thread assertion conditional to tolerate fast runners where the rebuild thread completes before the assertion runs; preserves the stronger index-content assertion. |
| CHANGELOG.md | Adds unreleased entry describing the stream_end terminal-settle race fix; entry is accurate and appropriately scoped. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[stream_end event] --> B[_clearStreamEndRecovery]
B --> C{_bailOutOfTerminalEventsFromStaleStream?}
C -->|yes| Z[return]
C -->|no| D{activeStreamId===streamId AND _liveStreamEndScenePresent?}
D -->|yes| E[_scheduleStreamEndRecovery 180ms]
E --> F[_runStreamEndRecovery fires]
F --> G{_streamFinalized OR _terminalStateReached?}
G -->|yes| H[_clearStreamEndRecovery return]
G -->|no| I[_restoreSettledSession status=true]
I --> J{status?}
J -->|restored| K[_clearStreamEndRecovery return]
J -->|active and attempts lt 10| L[_scheduleStreamEndRecovery 200ms]
L --> F
J -->|stale or missing or error or exhausted| M[_finalizeStreamEndFallback]
D -->|no| N[_restoreSettledSession status=true]
N --> O{status?}
O -->|restored| Z
O -->|active| E
O -->|stale or missing or error| M
M --> P[Set _streamFinalized _terminalStateReached]
P --> Q[_clearOwnerInflightState _clearApprovalForOwner _clearClarifyForOwner]
Q --> R{_isActiveSession?}
R -->|yes| S2[S.activeStreamId=null clearLiveToolCards renderMessages]
R -->|no| T[renderSessionList]
S2 --> T
T --> U[_setActivePaneIdleIfOwner _closeSource]
Reviews (2): Last reviewed commit: "Stabilize session index rebuild test" | Re-trigger Greptile
|
Pulled the branch into a read-only worktree and read the full The core design is sound. The deferral + bounded retry is well-built: On the trio-cleanup concernI'd reframe Greptile's note: it is not a regression. The master That said, the concern lands harder here than it did on master, for a specific reason: // _runStreamEndRecovery terminal give-up, 1635-1643
_clearStreamEndRecovery();
_terminalStateReached=true;
if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
_streamFinalized=true;
_cancelAnimationFramePendingStreamRender();
_streamFadeCleanupReduceMotionListener();
_smdEndParser();
if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
_closeSource(source);Compare the RecommendationThe inline function _finalizeStreamEndTerminal(source){
_clearStreamEndRecovery();
_terminalStateReached=true;
if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
_streamFinalized=true;
_cancelAnimationFramePendingStreamRender();
_streamFadeCleanupReduceMotionListener();
_smdEndParser();
if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
_clearOwnerInflightState();
_clearApprovalForOwner();
_clearClarifyForOwner('terminal');
_closeSource(source);
}Both sites then become a single call. Since Test note
Net: ship-worthy after consolidating the two duplicated terminal finalizers and adding the trio to the shared path. Nice scoping on keeping this to the |
a188e97 to
58db7df
Compare
|
Re-reviewed the six new commits (c2e8c6b → d99a34f) against my earlier note. This pushes the PR over the line — the consolidation I asked for is in, and it's done correctly. What the new commits didThe two duplicated 7-line terminal finalizers are now a single helper,
The helper carries the owner-trio that was previously missing from those two paths: // static/messages.js:1619-1640 (_finalizeStreamEndFallback)
_clearStreamEndRecovery();
if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;}
_terminalStateReached=true;
_streamFinalized=true;
_cancelAnimationFramePendingStreamRender();
_streamFadeCleanupReduceMotionListener();
_smdEndParser();
if(typeof finalizeThinkingCard==='function') finalizeThinkingCard();
_clearOwnerInflightState();
_clearApprovalForOwner();
_clearClarifyForOwner('terminal');
if(_isActiveSession()){ S.activeStreamId=null; clearLiveToolCards(); if(!assistantText)removeThinking(); renderMessages({preserveScroll:true}); }
renderSessionList();
_setActivePaneIdleIfOwner();
_closeSource(source);That closes the exact gap I flagged: the auto-give-up-with-pending-approval scenario (10 retries exhausted while an owner-held approval/clarify card is on screen) now clears the prompt instead of leaving it stranded. And it's safe for non-owner streams because the trio is owner-gated — Test lock is in
Test-file fixups are benignThe Net: this addresses the review fully. Nothing further blocking from me. Nice tight scoping — still correctly limited to the |
| function _finalizeStreamEndFallback(source){ | ||
| _clearStreamEndRecovery(); | ||
| if(_persistTimer){clearTimeout(_persistTimer);_persistTimer=null;} | ||
| _terminalStateReached=true; | ||
| _streamFinalized=true; | ||
| _cancelAnimationFramePendingStreamRender(); | ||
| _streamFadeCleanupReduceMotionListener(); | ||
| _smdEndParser(); | ||
| if(typeof finalizeThinkingCard==='function') finalizeThinkingCard(); | ||
| _clearOwnerInflightState(); | ||
| _clearApprovalForOwner(); | ||
| _clearClarifyForOwner('terminal'); | ||
| if(_isActiveSession()){ | ||
| S.activeStreamId=null; | ||
| clearLiveToolCards();if(!assistantText)removeThinking(); | ||
| renderMessages({preserveScroll:true}); | ||
| } | ||
| renderSessionList(); | ||
| _setActivePaneIdleIfOwner(); | ||
| _closeSource(source); |
There was a problem hiding this comment.
_finalizeStreamEndFallback missing stream-ownership guard before clearing global session state
When _restoreSettledSession returns 'stale' (i.e. _isActiveSession() is true but S.activeStreamId !== streamId), a newer stream has already taken over the same session. _finalizeStreamEndFallback is called regardless, and its if(_isActiveSession()) branch unconditionally executes S.activeStreamId=null, clearLiveToolCards(), and renderMessages() — clobbering the live UI of that new stream.
The concrete failure path: stream_end arrives while live content is present → 180 ms deferred recovery is scheduled → within that window the user submits a new message on the same session (S.activeStreamId becomes the new stream's ID) → recovery fires → _restoreSettledSession sees S.activeStreamId !== streamId and returns 'stale' → _finalizeStreamEndFallback nullifies S.activeStreamId and re-renders from stale persisted state, erasing the new stream's live content.
Compare with _handleStreamError (line 3786), which has an explicit early-return guard for exactly this condition: if(_isActiveSession() && S.activeStreamId!==streamId){ _closeSource(source); return; }. The same guard — or at minimum changing if(_isActiveSession()) to if(_isActiveSession() && S.activeStreamId===streamId) — is needed in _finalizeStreamEndFallback before the S.activeStreamId=null / clearLiveToolCards() / renderMessages() block. The same gap exists in the non-deferred direct stream_end path, which can also reach _finalizeStreamEndFallback after an async API roundtrip during which S.activeStreamId may have advanced.
…#3892 #3898 #3885 #3882 #3868) (#3902) * stage v0.51.347: render/stream cluster (#3892 #3898 #3885 #3882 #3868) + 2 Opus SHOULD-FIX * stage v0.51.347: trim #3885 error-guard comment to fit diagnostic-test window * Stamp v0.51.347 — Release LK (streaming & render reliability cluster) * Remove stray uv.lock accidentally staged (not part of any cluster PR) --------- Co-authored-by: nesquena-hermes <[email protected]>
|
Shipped in v0.51.347 (Release LK — streaming & render reliability cluster) 🎉 — live now. Merged via the combined release PR #3902 alongside #3892, #3898, #3885, #3882, and #3868 (all in the live-to-final streaming/render family). Each was applied onto fresh master, gated through Codex (SAFE TO SHIP) + Opus (ship, no MUST-FIX) + the full suite (8532 passing), with two Opus SHOULD-FIX folded in before merge. Authorship preserved via co-authored commit + CHANGELOG credit. Thanks for the fix! 🙏 |
Thinking Path
stream_endterminal-settle race:stream_endcan arrive while live assistant text, Worklog, Thinking, or tool DOM is still present and before the persisted session is safe to use as the canonical final transcript.stream_endmust not blindly replace or clear the live DOM.What Changed
stream_endrecovery state inattachLiveStream():_pendingStreamEndRecovery_streamEndRecoveryTimer_streamEndRecoveryAttempts_clearStreamEndRecovery()_runStreamEndRecovery(source)to retry settled-session restore while the session is still active/pending, then centralize the terminal cleanup path._restoreSettledSession(source, {status:true})so callers can distinguishrestored,active,missing,stale, anderrorwhile preserving the existing public helper signature.done,stream_end,apperror,cancel, and stream error handling.tests/test_stream_end_recovery_gating.pyand updated the existing terminal cleanup ownership test for the status-aware restore call.CHANGELOG.md.Why It Matters
stream_endas permission to clear or replace active live assistant DOM before the backend session has actually settled.Verification
pytest tests/test_stream_end_recovery_gating.py -q6 passedpytest tests/test_1694_terminal_cleanup_ownership.py::test_stream_end_without_done_restores_settled_session_before_closing tests/test_1694_terminal_cleanup_ownership.py::test_settled_restore_and_error_close_only_the_event_source_owner tests/test_stream_end_recovery_gating.py -q8 passednode --check static/messages.jsnpm run lint:runtimegit diff --checkqwen3.6-plus./api/chat/stream/status?stream_id=bc55af7cab4346478015e42f87421db7returnedactive=false,journal.last_event=stream_end,journal.terminal=true,journal.terminal_state=completed.done -> metering -> title_status -> title -> stream_end.OK; content did not depend on switching back to recover.browser-smoke,lint, and all Python 3.11/3.12/3.13 test shards passed.Risks / Follow-ups
stream_endrecovery fix. It does not attempt to solve every mid-streamrenderMessages()rebuild path described in Streaming content flickers (disappears and reappears) during mid-stream DOM rebuilds in long sessions #3877.Model Used
Refs #3877