fix(orchestrator): defer cache commit to prevent state pollution on rollback#121
fix(orchestrator): defer cache commit to prevent state pollution on rollback#121eviaaaaa wants to merge 2 commits intoMiroMindAI:mainfrom
Conversation
…on in parallel calls - Change _record_query to buffer successful queries instead of immediate commit - Commit buffer only if the entire turn completes without triggering a rollback - Fixes the 'death spiral' issue where a partial failure in parallel tool calling leaves dirty cache records that cause subsequent retries to incorrectly fail with 'Duplicate query detected'
e96e352 to
b11fbd5
Compare
Vanint
left a comment
There was a problem hiding this comment.
Nice catch on the "death spiral" bug — the analysis is solid and the deferred commit approach is the right fix. A few things to address:
Must Fix
Missed _record_query in the sub-agent call path (run_main_agent line 958)
The _record_query call after run_sub_agent is still committed immediately:
line 958 — not deferred
await self._record_query(cache_name, tool_name, arguments)
If a subsequent tool call in the same turn triggers a rollback, this record becomes dirty — the exact same bug this PR is fixing. This should be changed to successful_queries_buffer.append((cache_name, tool_name, arguments)) for consistency.
Suggestions
-
Consider adding a brief comment at the successful_queries_buffer = [] initialization explaining why buffering is needed (e.g., # Buffer query records to avoid cache pollution if this turn gets rolled back). Future readers won't have the PR context.
-
Re: the "Amnesia" pattern you raised — good observation. The message_history.pop() without error context injection (lines 553, 1027) is worth looking into. Feel free to include a fix in this PR if the scope stays manageable.
Minor
_record_query is async but only mutates an in-memory dict — doesn't need to be awaitable. Not introduced by this PR and not blocking, just noting for future cleanup.
Overall this is a clean, well-scoped fix. Please address the line 958 miss and this is good to merge.
I double-checked the current HEAD (b11fbd5), and the deferred commit logic is actually already in place for the sub-agent path in run_main_agent. While that immediate _record_query(...) call did exist in the pre-fix version, it’s already been swapped out for successful_queries_buffer.append(...) at orchestrator.py:967. The buffered records are then committed at orchestrator.py:1105 once the turn finishes successfully. So, valid catch for the old snapshot, but it looks like it's already handled in this branch.I did go ahead and add the comments explaining the buffering logic at the initialization points (orchestrator.py:483 and orchestrator.py:915) as suggested. |
Summary
This PR addresses a critical flaw in the agent's core defensive middleware, specifically resolving the "Death Spiral" bug that occurs during multi-tool concurrency.
What's Changed
orchestrator.py): Restructured theused_queriescaching mechanism. Replaced immediate cache commitments with asuccessful_queries_buffer.Scope
Restricted strictly to the
orchestrator.pystate management loops, focusing on making the tool cache commitment atomic and transaction-safe.Validation & Testing
Note / Question for Reviewers
While debugging the concurrency fallback, I noticed a systemic behavior with the
should_rollback_resultlogic. Currently, triggering a rollback simply executesmessage_history.pop()to clear the bad prompt and loops again. This effectively creates "Amnesia" without injecting any failure context (e.g.,[System Feedback: Tool failed with Error X]).MAX_CONSECUTIVE_ROLLBACKSlimit), which feels inefficient.