
fix(orchestrator): defer cache commit to prevent state pollution on rollback#121

Open
eviaaaaa wants to merge 2 commits into MiroMindAI:main from eviaaaaa:fix/rollback-state-corruption

Conversation


@eviaaaaa eviaaaaa commented Mar 3, 2026

Summary

This PR addresses a critical flaw in the agent's core defensive middleware, specifically resolving the "Death Spiral" bug that occurs during multi-tool concurrency.

What's Changed

  • Deferred Tool Cache Commit (orchestrator.py): Restructured the used_queries caching mechanism, replacing immediate cache commits with a successful_queries_buffer.
  • The Fix: Valid parallel tool results are now committed only if the entire conversation turn completes successfully. This prevents partial turn rollbacks from leaving "dirty" query records that falsely trigger the "Duplicate Query" breaker on subsequent LLM retries.
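For readers without the diff handy, the buffering pattern looks roughly like this (a minimal sketch; the class shape and the `commit_turn`/`rollback_turn` names are illustrative, not the actual orchestrator API):

```python
class Orchestrator:
    """Sketch of the deferred-commit pattern for the query cache."""

    def __init__(self):
        self.used_queries = set()            # committed cache, survives across turns
        self.successful_queries_buffer = []  # records staged during the current turn

    def record_query(self, cache_name, tool_name, arguments):
        # Stage the record instead of committing it immediately.
        self.successful_queries_buffer.append((cache_name, tool_name, arguments))

    def commit_turn(self):
        # Called only once the whole turn has completed without a rollback.
        self.used_queries.update(self.successful_queries_buffer)
        self.successful_queries_buffer.clear()

    def rollback_turn(self):
        # Discard staged records so the retry doesn't hit a false
        # "Duplicate query detected" from a partially-failed turn.
        self.successful_queries_buffer.clear()
```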

Scope

Strictly limited to the orchestrator.py state management loops, focused on making the tool cache commit atomic and transaction-safe.

Validation & Testing

  • Concurrency Stabilization: Simulated high-concurrency partial-failure queries (e.g., executing a valid Google Search alongside an empty/broken search). Confirmed that the dirty cache is successfully discarded when the turn rolls back, allowing clean and error-free LLM retries.

Note / Question for Reviewers

While debugging the concurrency fallback, I noticed a systemic behavior with the should_rollback_result logic. Currently, triggering a rollback simply executes message_history.pop() to clear the bad prompt and loops again. This effectively creates "Amnesia" without injecting any failure context (e.g., [System Feedback: Tool failed with Error X]).

  • Was it a deliberate design choice to fully omit the failure explanation to save tokens?
  • In my tests, this lack of feedback often forces the LLM to blindly repeat the exact same flawed search query up to 4 times (hitting the MAX_CONSECUTIVE_ROLLBACKS limit), which feels inefficient.
  • Should we consider temporarily injecting error context before allowing the LLM to retry?
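To make the third option concrete, a hypothetical rollback helper could replace the bare `message_history.pop()` with a short failure note before the retry (the function name and message format here are made up for this sketch, not part of the codebase):

```python
def rollback_with_feedback(message_history, error):
    """Drop the failed exchange, then inject a brief system note so the
    LLM can adjust its next attempt instead of blindly repeating it."""
    message_history.pop()  # discard the bad assistant/tool turn, as today
    message_history.append({
        "role": "user",
        "content": (
            f"[System Feedback: previous tool call failed with {error}; "
            "please try a different query]"
        ),
    })
    return message_history
```

The extra message costs a few tokens per rollback, but if it avoids even one of the up-to-4 blind retries against MAX_CONSECUTIVE_ROLLBACKS, it likely pays for itself.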

…on in parallel calls

- Change _record_query to buffer successful queries instead of immediate commit
- Commit buffer only if the entire turn completes without triggering a rollback
- Fixes the 'death spiral' issue where a partial failure in parallel tool calling leaves dirty cache records that cause subsequent retries to incorrectly fail with 'Duplicate query detected'
eviaaaaa force-pushed the fix/rollback-state-corruption branch from e96e352 to b11fbd5 on March 4, 2026 at 00:40
@Vanint (Member) left a comment


Nice catch on the "death spiral" bug — the analysis is solid and the deferred commit approach is the right fix. A few things to address:

Must Fix

Missed _record_query in the sub-agent call path (run_main_agent line 958)

The _record_query call after run_sub_agent is still committed immediately:

```python
# line 958 — not deferred
await self._record_query(cache_name, tool_name, arguments)
```
If a subsequent tool call in the same turn triggers a rollback, this record becomes dirty — the exact same bug this PR is fixing. This should be changed to successful_queries_buffer.append((cache_name, tool_name, arguments)) for consistency.

Suggestions

  1. Consider adding a brief comment at the successful_queries_buffer = [] initialization explaining why buffering is needed (e.g., # Buffer query records to avoid cache pollution if this turn gets rolled back). Future readers won't have the PR context.

  2. Re: the "Amnesia" pattern you raised — good observation. The message_history.pop() without error context injection (lines 553, 1027) is worth looking into. Feel free to include a fix in this PR if the scope stays manageable.

Minor

_record_query is async but only mutates an in-memory dict, so it doesn't need to be awaitable. Not introduced by this PR and not blocking, just noting for future cleanup.
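For that eventual cleanup, a plain synchronous method would suffice; something like this hypothetical simplified shape (class and method names are illustrative):

```python
class QueryCache:
    """In-memory record of queries already issued per cache; no I/O,
    so no coroutine is needed."""

    def __init__(self):
        self._seen = {}

    def record(self, cache_name, tool_name, arguments):
        # A synchronous dict mutation; callers drop the `await`.
        self._seen.setdefault(cache_name, []).append((tool_name, arguments))
```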

Overall this is a clean, well-scoped fix. Please address the line 958 miss and this is good to merge.

@eviaaaaa (Author)

> Must Fix
>
> Missed _record_query in the sub-agent call path (run_main_agent line 958)
>
> The _record_query call after run_sub_agent is still committed immediately:
>
> ```python
> # line 958 — not deferred
> await self._record_query(cache_name, tool_name, arguments)
> ```
>
> If a subsequent tool call in the same turn triggers a rollback, this record becomes dirty — the exact same bug this PR is fixing. This should be changed to successful_queries_buffer.append((cache_name, tool_name, arguments)) for consistency.

I double-checked the current HEAD (b11fbd5), and the deferred commit logic is actually already in place for the sub-agent path in run_main_agent.

While that immediate _record_query(...) call did exist in the pre-fix version, it’s already been swapped out for successful_queries_buffer.append(...) at orchestrator.py:967. The buffered records are then committed at orchestrator.py:1105 once the turn finishes successfully.

So, valid catch for the old snapshot, but it looks like it's already handled in this branch.

I did go ahead and add the comments explaining the buffering logic at the initialization points (orchestrator.py:483 and orchestrator.py:915) as suggested.

@eviaaaaa eviaaaaa requested a review from Vanint April 10, 2026 10:37
