Multi-agent review loop: repeated false positives, no cross-iteration memory, no branch scoping #137
Replies: 4 comments 3 replies
most of these points are based on incorrect assumptions about how ralphex works. let me go through them.

1. "no cross-iteration memory" - the external review loop already passes Claude's response from the previous iteration into the next codex/custom review prompt, including explanations of dismissed findings. see
2. "cascading fix chains" - this is how iterative review works by design. fix → re-review → catch downstream issues → fix again. 4 iterations to converge is the system working correctly, not a problem.
3. "no distinction between branch changes and pre-existing code" - both review prompts explicitly run
4. "external review lacks context" - the plan file (
5. "no severity threshold" - the second review prompt already says "Focus only on critical and major issues. Ignore style/minor issues." iteration limits are also already configurable via

all prompts in ralphex are fully configurable - agents are also customizable in

none of the points here are actual bugs or missing features - the described functionality either already exists or is achievable through prompt customization. moving this to Discussions Q&A.
The cross-iteration memory gap is real. I hit the same thing running 9 agents in a shared environment — agent reviews from round N had no way to reference what was flagged in round N-1, so the same false positive triggered every cycle. What fixed it for me was a simple append-only log per branch that each review step reads before starting. Not elegant but it broke the loop. For branch scoping I ended up giving each agent its own workspace directory so file-level conflicts disappeared.
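For the curious, a minimal sketch of the append-only per-branch log described above (all names and the JSONL layout are my own, not ralphex's): each review round appends its verdicts, and the next round reads the whole file before starting.

```python
import json
from pathlib import Path


def log_path(workspace: Path, branch: str) -> Path:
    # One append-only JSONL file per branch (hypothetical layout).
    return workspace / "review-logs" / f"{branch.replace('/', '_')}.jsonl"


def record_finding(workspace: Path, branch: str, round_num: int,
                   finding: str, verdict: str) -> None:
    # Append one finding + verdict; never rewrite earlier rounds.
    path = log_path(workspace, branch)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps({"round": round_num, "finding": finding,
                            "verdict": verdict}) + "\n")


def prior_context(workspace: Path, branch: str) -> str:
    # Read the full log before a review step so round N sees rounds 1..N-1.
    path = log_path(workspace, branch)
    if not path.exists():
        return ""
    entries = [json.loads(line) for line in path.read_text().splitlines()]
    return "\n".join(f"round {e['round']}: {e['finding']} -> {e['verdict']}"
                     for e in entries)
```

The returned string gets prepended to the next round's review prompt, so a finding dismissed in round 1 arrives pre-labeled in round 2.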
ralphex already has this - there's an append-only progress file per branch/plan that accumulates all output across iterations. each review step gets the full log via

in practice agents do see and reference prior findings. when one flags a false positive and the other disagrees, they go back and forth across iterations - I've seen them directly mention the "opponent's opinion" from previous rounds. eventually they converge, either fixing it or agreeing to leave it. not sure I follow your specific case though - could you share an example of what the repeated false positive loop looks like in your setup?
@svivekiyer this is a strong example of context failing at the coordination layer rather than the model layer: dismissed findings, branch scope, and deployment constraints existed somewhere, but the next review turn still didn't reliably use them. I'm researching these failures across coding-agent workflows. If you're willing, I'd value one short postmortem here: https://www.agentsneedcontext.com/agent-failure-postmortem The most useful thing would be one concrete review loop that repeated, what context should have carried forward, where it lived, and what extra review time or trust cost it created. No pitch, just research.
Summary
After using ralphex's multi-agent review loop (codex/custom external review + Claude quality/implementation agents) on a real branch with ~700 lines of changes, I noticed several areas where the review iteration workflow could be improved.
The multi-agent parallel approach (quality + implementation agents) is genuinely strong — it caught real bugs that a single-pass review would miss (indentation bugs, binary file handling, missing timeouts). The issues below are about the iteration loop and agent coordination, not the core review quality.
1. No cross-iteration memory for dismissed findings
Problem: A finding was flagged as CRITICAL by both agents in review iteration 3. Claude evaluated the code carefully, traced the control flow across two separate try blocks, and dismissed it as a false positive with a detailed explanation. In iteration 4, both agents flagged the exact same pattern as CRITICAL again. Claude had to re-verify and re-dismiss from scratch.
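One way to carry those dismissals forward is a context block assembled from prior rounds and prepended to the next review prompt. A minimal sketch (the function name and field names are hypothetical, not ralphex's):

```python
def dismissed_findings_block(dismissed: list[dict]) -> str:
    # Render previously-dismissed findings (with reasoning) so the next
    # review prompt can include them and agents can skip re-flagging.
    if not dismissed:
        return ""
    lines = ["## Previously dismissed findings (do not re-flag):"]
    for d in dismissed:
        lines.append(
            f"- [{d['severity']}] {d['summary']}: "
            f"dismissed because {d['reason']}"
        )
    return "\n".join(lines)
```

The cost is a few hundred extra prompt tokens per iteration; the saving is not re-tracing control flow across two try blocks every round.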
Suggestion: Pass a summary of previously-dismissed findings (with reasoning) to subsequent iterations so agents don't re-flag the same thing. Something like a `dismissed_findings` context block that the review prompt includes.

2. Cascading fix chains across iterations
Problem: The review found a real bug (mount check not filtering unmounted paths). Claude fixed it. Next iteration found that the fix introduced a new issue (empty paths list not guarded). Claude fixed that. Next iteration found the docstring didn't match the updated behavior.
Each iteration introduces changes that the next iteration catches, creating a cascading chain (4 iterations to converge). This is correct behavior in the sense that each fix is valid, but it suggests the initial fix could be more holistic if the agent were prompted to consider downstream effects of its changes.
Suggestion: After fixing an issue, prompt Claude to also check: "Does this fix introduce any new edge cases or require updates elsewhere (guards, docstrings, callers)?" This could reduce the number of iterations needed to converge.
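Concretely, this could be a fixed follow-up appended to the fix step's prompt. A sketch (the wording and helper name are illustrative, not an existing ralphex prompt):

```python
# Hypothetical follow-up appended after each fix, so one iteration
# catches downstream effects instead of deferring them to the next review.
FOLLOWUP_CHECK = (
    "You just changed {files}. Before finishing, check whether the fix:\n"
    "1. introduces new edge cases (empty inputs, error paths),\n"
    "2. requires guard clauses or updates elsewhere (callers, config),\n"
    "3. invalidates docstrings or comments describing the old behavior.\n"
    "List anything that needs a follow-up change, or reply NONE."
)


def followup_prompt(changed_files: list[str]) -> str:
    return FOLLOWUP_CHECK.format(files=", ".join(changed_files))
```

In the cascading example above, items 2 and 3 are exactly the empty-paths guard and stale docstring that cost two extra iterations.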
3. No distinction between branch changes and pre-existing code
Problem: Agents repeatedly flagged pre-existing patterns that existed long before the branch: timezone usage (`datetime.now()` vs UTC), signal handler ordering, fallback values in `.get()` calls, `sys.path.insert` patterns. These are valid observations in a general code review but are not related to the branch under review. Claude had to manually dismiss each one every iteration.

Suggestion: Provide the agents with the branch diff scope and instruct them to focus on code changed or introduced by the branch. Pre-existing issues could be flagged separately (e.g., "pre-existing, not a regression") and not re-flagged in subsequent iterations.
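The diff scope is cheap to compute from git and to apply as a filter. A sketch assuming findings carry a `file` field (hypothetical structure; `main` as the base branch is an assumption):

```python
import subprocess


def branch_changed_files(base: str = "main") -> set[str]:
    # Files touched by the branch relative to its merge base with `base`;
    # the three-dot form diffs against the merge base, not the base tip.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line}


def split_findings(findings: list[dict], changed: set[str]):
    # In-scope findings touch branch files; the rest are pre-existing
    # observations to report once as "pre-existing, not a regression".
    in_scope = [f for f in findings if f["file"] in changed]
    pre_existing = [f for f in findings if f["file"] not in changed]
    return in_scope, pre_existing
```

File-level scoping is coarse (a branch can touch one function in a large legacy file), but it already removes whole-file noise like the `sys.path.insert` flags above; hunk-level scoping via `git diff -U0` would be the finer-grained version.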
4. External review lacks operational/deployment context
Problem: The external reviewer flagged
sys.exit(0)on lock contention as a bug and changed it tosys.exit(75). In the actual deployment, cron jobs and monitoring depend on exit code 0 meaning "nothing to do, no error." Exit 75 would cause cron failure emails and monitoring alerts on a normal, expected condition (overlapping scan schedules).Similarly, it flagged a theoretical lock name collision (two different paths sharing the same trailing directory name) that would never occur in the actual deployment where each volume runs on a dedicated server with unique paths.
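For this deployment, such a context file might read as follows (hypothetical path and content, written from the two examples above):

```markdown
# Deployment context (included in review prompts)

- Exit code 0 on lock contention is intentional: cron and monitoring
  treat 0 as "nothing to do, no error". Do not switch to sysexits codes.
- Each volume runs on a dedicated server with unique paths, so lock-name
  collisions between different paths cannot occur here.
- Overlapping scan schedules are expected; silently skipping a run is
  normal behavior, not a missed execution.
```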
Suggestion: Allow a project-level context file (e.g., `.ralphex/context.md`) that describes deployment constraints, operational expectations, and known design decisions. The review prompt could include this context so agents don't "fix" things that are intentional.

5. No severity threshold for stopping iterations
Problem: The review loop keeps iterating until zero findings. But in later iterations, the only findings were LOW severity (docstring mismatch, style preferences, theoretical edge cases). There's no way to say "remaining items are optional, stop iterating."
Suggestion: Add a configurable severity threshold (e.g., `min_severity = high`) that tells the loop to stop when only LOW/MEDIUM findings remain. This would prevent unnecessary iterations on cosmetic issues.