Multi-agent review loop: repeated false positives, no cross-iteration memory, no branch scoping #137
Replies: 4 comments 3 replies
most of these points are based on incorrect assumptions about how ralphex works. let me go through them.

1. "no cross-iteration memory" - the external review loop already passes Claude's response from the previous iteration into the next codex/custom review prompt, including explanations of dismissed findings. see
2. "cascading fix chains" - this is how iterative review works by design. fix → re-review → catch downstream issues → fix again. 4 iterations to converge is the system working correctly, not a problem.
3. "no distinction between branch changes and pre-existing code" - both review prompts explicitly run
4. "external review lacks context" - the plan file (
5. "no severity threshold" - the second review prompt already says "Focus only on critical and major issues. Ignore style/minor issues." iteration limits are also already configurable via

all prompts in ralphex are fully configurable - agents are also customizable in

none of the points here are actual bugs or missing features - the described functionality either already exists or is achievable through prompt customization. moving this to Discussions Q&A.
The cross-iteration memory gap is real. I hit the same thing running 9 agents in a shared environment — agent reviews from round N had no way to reference what was flagged in round N-1, so the same false positive triggered every cycle. What fixed it for me was a simple append-only log per branch that each review step reads before starting. Not elegant but it broke the loop. For branch scoping I ended up giving each agent its own workspace directory so file-level conflicts disappeared.
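For the curious, a minimal sketch of the append-only per-branch log described above (all names and the JSONL layout are my own, not ralphex's): each review round appends its verdicts, and the next round reads the whole file before starting.

```python
import json
from pathlib import Path


def log_path(workspace: Path, branch: str) -> Path:
    # One append-only JSONL file per branch (hypothetical layout).
    return workspace / "review-logs" / f"{branch.replace('/', '_')}.jsonl"


def record_finding(workspace: Path, branch: str, round_num: int,
                   finding: str, verdict: str) -> None:
    # Append one finding + verdict; never rewrite earlier rounds.
    path = log_path(workspace, branch)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps({"round": round_num, "finding": finding,
                            "verdict": verdict}) + "\n")


def prior_context(workspace: Path, branch: str) -> str:
    # Read the full log before a review step so round N sees rounds 1..N-1.
    path = log_path(workspace, branch)
    if not path.exists():
        return ""
    entries = [json.loads(line) for line in path.read_text().splitlines()]
    return "\n".join(f"round {e['round']}: {e['finding']} -> {e['verdict']}"
                     for e in entries)
```

The returned string gets prepended to the next round's review prompt, so a finding dismissed in round 1 arrives pre-labeled in round 2.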
ralphex already has this - there's an append-only progress file per branch/plan that accumulates all output across iterations. each review step gets the full log via

in practice agents do see and reference prior findings. when one flags a false positive and the other disagrees, they go back and forth across iterations - I've seen them directly mention the "opponent's opinion" from previous rounds. eventually they converge, either fixing it or agreeing to leave it. not sure I follow your specific case though - could you share an example of what the repeated false positive loop looks like in your setup?
@svivekiyer this is a strong example of context failing at the coordination layer rather than the model layer: dismissed findings, branch scope, and deployment constraints existed somewhere, but the next review turn still didn't reliably use them. I'm researching these failures across coding-agent workflows. If you're willing, I'd value one short postmortem here: https://www.agentsneedcontext.com/agent-failure-postmortem The most useful thing would be one concrete review loop that repeated, what context should have carried forward, where it lived, and what extra review time or trust cost it created. No pitch, just research.
Summary
After using ralphex's multi-agent review loop (codex/custom external review + Claude quality/implementation agents) on a real branch with ~700 lines of changes, I noticed several areas where the review iteration workflow could be improved.
The multi-agent parallel approach (quality + implementation agents) is genuinely strong — it caught real bugs that a single-pass review would miss (indentation bugs, binary file handling, missing timeouts). The issues below are about the iteration loop and agent coordination, not the core review quality.
1. No cross-iteration memory for dismissed findings
Problem: A finding was flagged as CRITICAL by both agents in review iteration 3. Claude evaluated the code carefully, traced the control flow across two separate try blocks, and dismissed it as a false positive with a detailed explanation. In iteration 4, both agents flagged the exact same pattern as CRITICAL again. Claude had to re-verify and re-dismiss from scratch.
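One way to carry those dismissals forward is a context block assembled from prior rounds and prepended to the next review prompt. A minimal sketch (the function name and field names are hypothetical, not ralphex's):

```python
def dismissed_findings_block(dismissed: list[dict]) -> str:
    # Render previously-dismissed findings (with reasoning) so the next
    # review prompt can include them and agents can skip re-flagging.
    if not dismissed:
        return ""
    lines = ["## Previously dismissed findings (do not re-flag):"]
    for d in dismissed:
        lines.append(
            f"- [{d['severity']}] {d['summary']}: "
            f"dismissed because {d['reason']}"
        )
    return "\n".join(lines)
```

The cost is a few hundred extra prompt tokens per iteration; the saving is not re-tracing control flow across two try blocks every round.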
Suggestion: Pass a summary of previously-dismissed findings (with reasoning) to subsequent iterations so agents don't re-flag the same thing. Something like a `dismissed_findings` context block that the review prompt includes.

2. Cascading fix chains across iterations
Problem: The review found a real bug (mount check not filtering unmounted paths). Claude fixed it. Next iteration found that the fix introduced a new issue (empty paths list not guarded). Claude fixed that. Next iteration found the docstring didn't match the updated behavior.
Each iteration introduces changes that the next iteration catches, creating a cascading chain (4 iterations to converge). This is correct behavior in the sense that each fix is valid, but it suggests the initial fix could be more holistic if the agent were prompted to consider downstream effects of its changes.
Suggestion: After fixing an issue, prompt Claude to also check: "Does this fix introduce any new edge cases or require updates elsewhere (guards, docstrings, callers)?" This could reduce the number of iterations needed to converge.
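Concretely, this could be a fixed follow-up appended to the fix step's prompt. A sketch (the wording and helper name are illustrative, not an existing ralphex prompt):

```python
# Hypothetical follow-up appended after each fix, so one iteration
# catches downstream effects instead of deferring them to the next review.
FOLLOWUP_CHECK = (
    "You just changed {files}. Before finishing, check whether the fix:\n"
    "1. introduces new edge cases (empty inputs, error paths),\n"
    "2. requires guard clauses or updates elsewhere (callers, config),\n"
    "3. invalidates docstrings or comments describing the old behavior.\n"
    "List anything that needs a follow-up change, or reply NONE."
)


def followup_prompt(changed_files: list[str]) -> str:
    return FOLLOWUP_CHECK.format(files=", ".join(changed_files))
```

In the cascading example above, items 2 and 3 are exactly the empty-paths guard and stale docstring that cost two extra iterations.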
3. No distinction between branch changes and pre-existing code
Problem: Agents repeatedly flagged pre-existing patterns that existed long before the branch: timezone usage (`datetime.now()` vs UTC), signal handler ordering, fallback values in `.get()` calls, `sys.path.insert` patterns. These are valid observations in a general code review but are not related to the branch under review. Claude had to manually dismiss each one every iteration.

Suggestion: Provide the agents with the branch diff scope and instruct them to focus on code changed or introduced by the branch. Pre-existing issues could be flagged separately (e.g., "pre-existing, not a regression") and not re-flagged in subsequent iterations.
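The diff scope is cheap to compute from git and to apply as a filter. A sketch assuming findings carry a `file` field (hypothetical structure; `main` as the base branch is an assumption):

```python
import subprocess


def branch_changed_files(base: str = "main") -> set[str]:
    # Files touched by the branch relative to its merge base with `base`;
    # the three-dot form diffs against the merge base, not the base tip.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line}


def split_findings(findings: list[dict], changed: set[str]):
    # In-scope findings touch branch files; the rest are pre-existing
    # observations to report once as "pre-existing, not a regression".
    in_scope = [f for f in findings if f["file"] in changed]
    pre_existing = [f for f in findings if f["file"] not in changed]
    return in_scope, pre_existing
```

File-level scoping is coarse (a branch can touch one function in a large legacy file), but it already removes whole-file noise like the `sys.path.insert` flags above; hunk-level scoping via `git diff -U0` would be the finer-grained version.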
4. External review lacks operational/deployment context
Problem: The external reviewer flagged
sys.exit(0)on lock contention as a bug and changed it tosys.exit(75). In the actual deployment, cron jobs and monitoring depend on exit code 0 meaning "nothing to do, no error." Exit 75 would cause cron failure emails and monitoring alerts on a normal, expected condition (overlapping scan schedules).Similarly, it flagged a theoretical lock name collision (two different paths sharing the same trailing directory name) that would never occur in the actual deployment where each volume runs on a dedicated server with unique paths.
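For this deployment, such a context file might read as follows (hypothetical path and content, written from the two examples above):

```markdown
# Deployment context (included in review prompts)

- Exit code 0 on lock contention is intentional: cron and monitoring
  treat 0 as "nothing to do, no error". Do not switch to sysexits codes.
- Each volume runs on a dedicated server with unique paths, so lock-name
  collisions between different paths cannot occur here.
- Overlapping scan schedules are expected; silently skipping a run is
  normal behavior, not a missed execution.
```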
Suggestion: Allow a project-level context file (e.g., `.ralphex/context.md`) that describes deployment constraints, operational expectations, and known design decisions. The review prompt could include this context so agents don't "fix" things that are intentional.

5. No severity threshold for stopping iterations
Problem: The review loop keeps iterating until zero findings. But in later iterations, the only findings were LOW severity (docstring mismatch, style preferences, theoretical edge cases). There's no way to say "remaining items are optional, stop iterating."
Suggestion: Add a configurable severity threshold (e.g., `min_severity = high`) that tells the loop to stop when only LOW/MEDIUM findings remain. This would prevent unnecessary iterations on cosmetic issues.