
fix(orchestrator): defer cache commit to prevent state pollution on rollback#121

Open
eviaaaaa wants to merge 2 commits into MiroMindAI:main from eviaaaaa:fix/rollback-state-corruption

Conversation


@eviaaaaa eviaaaaa commented Mar 3, 2026

Summary

This PR addresses a critical flaw in the agent's core defensive middleware, specifically resolving the "Death Spiral" bug that occurs during multi-tool concurrency.

What's Changed

  • Deferred Tool Cache Commit (orchestrator.py): Restructured the used_queries caching mechanism, replacing immediate cache commits with a successful_queries_buffer.
  • The Fix: Valid parallel tool results are now committed only if the entire conversation turn completes successfully. This prevents partial turn rollbacks from leaving "dirty" query records that falsely trigger the "Duplicate Query" breaker on subsequent LLM retries.
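For readers without the diff handy, the buffering pattern looks roughly like this (a minimal sketch; the class shape and the `commit_turn`/`rollback_turn` names are illustrative, not the actual orchestrator API):

```python
class Orchestrator:
    """Sketch of the deferred-commit pattern for the query cache."""

    def __init__(self):
        self.used_queries = set()            # committed cache, survives across turns
        self.successful_queries_buffer = []  # records staged during the current turn

    def record_query(self, cache_name, tool_name, arguments):
        # Stage the record instead of committing it immediately.
        self.successful_queries_buffer.append((cache_name, tool_name, arguments))

    def commit_turn(self):
        # Called only once the whole turn has completed without a rollback.
        self.used_queries.update(self.successful_queries_buffer)
        self.successful_queries_buffer.clear()

    def rollback_turn(self):
        # Discard staged records so the retry doesn't hit a false
        # "Duplicate query detected" from a partially-failed turn.
        self.successful_queries_buffer.clear()
```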

Scope

Strictly limited to the orchestrator.py state management loops, focused on making the tool cache commit atomic and transaction-safe.

Validation & Testing

  • Concurrency Stabilization: Simulated high-concurrency partial-failure queries (e.g., executing a valid Google Search alongside an empty/broken search). Confirmed that the dirty cache is successfully discarded when the turn rolls back, allowing clean and error-free LLM retries.

Note / Question for Reviewers

While debugging the concurrency fallback, I noticed a systemic behavior with the should_rollback_result logic. Currently, triggering a rollback simply executes message_history.pop() to clear the bad prompt and loops again. This effectively creates "Amnesia" without injecting any failure context (e.g., [System Feedback: Tool failed with Error X]).

  • Was it a deliberate design choice to fully omit the failure explanation to save tokens?
  • In my tests, this lack of feedback often forces the LLM to blindly repeat the exact same flawed search query up to 4 times (hitting the MAX_CONSECUTIVE_ROLLBACKS limit), which feels inefficient.
  • Should we consider temporarily injecting error context before allowing the LLM to retry?
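To make the third option concrete, a hypothetical rollback helper could replace the bare `message_history.pop()` with a short failure note before the retry (the function name and message format here are made up for this sketch, not part of the codebase):

```python
def rollback_with_feedback(message_history, error):
    """Drop the failed exchange, then inject a brief system note so the
    LLM can adjust its next attempt instead of blindly repeating it."""
    message_history.pop()  # discard the bad assistant/tool turn, as today
    message_history.append({
        "role": "user",
        "content": (
            f"[System Feedback: previous tool call failed with {error}; "
            "please try a different query]"
        ),
    })
    return message_history
```

The extra message costs a few tokens per rollback, but if it avoids even one of the up-to-4 blind retries against MAX_CONSECUTIVE_ROLLBACKS, it likely pays for itself.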

…on in parallel calls

- Change _record_query to buffer successful queries instead of immediate commit
- Commit buffer only if the entire turn completes without triggering a rollback
- Fixes the 'death spiral' issue where a partial failure in parallel tool calling leaves dirty cache records that cause subsequent retries to incorrectly fail with 'Duplicate query detected'
eviaaaaa force-pushed the fix/rollback-state-corruption branch from e96e352 to b11fbd5 on March 4, 2026 at 00:40
@Vanint (Member) left a comment


Nice catch on the "death spiral" bug — the analysis is solid and the deferred commit approach is the right fix. A few things to address:

Must Fix

Missed _record_query in the sub-agent call path (run_main_agent line 958)

The _record_query call after run_sub_agent is still committed immediately:

```python
# line 958 — not deferred
await self._record_query(cache_name, tool_name, arguments)
```
If a subsequent tool call in the same turn triggers a rollback, this record becomes dirty — the exact same bug this PR is fixing. This should be changed to successful_queries_buffer.append((cache_name, tool_name, arguments)) for consistency.

Suggestions

  1. Consider adding a brief comment at the successful_queries_buffer = [] initialization explaining why buffering is needed (e.g., # Buffer query records to avoid cache pollution if this turn gets rolled back). Future readers won't have the PR context.

  2. Re: the "Amnesia" pattern you raised — good observation. The message_history.pop() without error context injection (lines 553, 1027) is worth looking into. Feel free to include a fix in this PR if the scope stays manageable.

Minor

_record_query is async but only mutates an in-memory dict, so it doesn't need to be awaitable. Not introduced by this PR and not blocking, just noting for future cleanup.
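For that eventual cleanup, a plain synchronous method would suffice; something like this hypothetical simplified shape (class and method names are illustrative):

```python
class QueryCache:
    """In-memory record of queries already issued per cache; no I/O,
    so no coroutine is needed."""

    def __init__(self):
        self._seen = {}

    def record(self, cache_name, tool_name, arguments):
        # A synchronous dict mutation; callers drop the `await`.
        self._seen.setdefault(cache_name, []).append((tool_name, arguments))
```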

Overall this is a clean, well-scoped fix. Please address the line 958 miss and this is good to merge.

@eviaaaaa (Author)

> Must Fix
>
> Missed _record_query in the sub-agent call path (run_main_agent line 958)
>
> The _record_query call after run_sub_agent is still committed immediately:
>
> ```python
> # line 958 — not deferred
> await self._record_query(cache_name, tool_name, arguments)
> ```
>
> If a subsequent tool call in the same turn triggers a rollback, this record becomes dirty — the exact same bug this PR is fixing. This should be changed to successful_queries_buffer.append((cache_name, tool_name, arguments)) for consistency.

I double-checked the current HEAD (b11fbd5), and the deferred commit logic is actually already in place for the sub-agent path in run_main_agent.

While that immediate _record_query(...) call did exist in the pre-fix version, it’s already been swapped out for successful_queries_buffer.append(...) at orchestrator.py:967. The buffered records are then committed at orchestrator.py:1105 once the turn finishes successfully.

So, valid catch for the old snapshot, but it looks like it's already handled in this branch.

I did go ahead and add the comments explaining the buffering logic at the initialization points (orchestrator.py:483 and orchestrator.py:915) as suggested.

@eviaaaaa eviaaaaa requested a review from Vanint April 10, 2026 10:37
