67 'so close' failures: agent fixes bug but breaks other tests #127
Open
Description
Problem
In our SWE-bench Verified evaluation, 67 tasks (13.4%) follow the same pattern: the agent correctly identifies and fixes the reported bug, but the fix introduces regressions that break other tests. The agent either doesn't run the full test suite before submitting, or runs tests but doesn't iterate on the failures.
Data
- 67 tasks where the patch addresses the correct code location but fails SWE-bench's test harness
- Common pattern: agent edits the right file, fixes the reported behavior, but doesn't verify against the broader test suite
- The agent often runs only a targeted test (e.g., the specific test file mentioned in the issue) rather than the full relevant test suite
Root Cause
The MCP server's instructions don't provide guidance on test-running workflows. The agent doesn't know:
- How to run the project's test suite
- Which test commands are appropriate for the repo
- That it should run a broader test suite (not just the specific failing test) before submitting
Impact
Recovering even half of these "so close" tasks through better test guidance would add about 33 resolves, raising the resolve rate from 42.4% to roughly 49%.
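The estimate can be reproduced with a few lines of arithmetic. This is a sketch only: the task total is not stated in the issue and is inferred here from 67 tasks being 13.4% of the set, and `projected_resolve_rate` is a hypothetical helper name.

```python
# Figures from the issue. The total task count is an inference, not a stated value.
SO_CLOSE_TASKS = 67
SO_CLOSE_SHARE = 0.134
CURRENT_RESOLVE_RATE = 0.424


def projected_resolve_rate(recovered_fraction: float) -> tuple[int, float]:
    """Return (extra resolves, new resolve rate) if a fraction of 'so close' tasks is recovered."""
    total_tasks = round(SO_CLOSE_TASKS / SO_CLOSE_SHARE)
    current_resolved = round(CURRENT_RESOLVE_RATE * total_tasks)
    # Round down, since a task cannot be half-resolved.
    recovered = int(SO_CLOSE_TASKS * recovered_fraction)
    return recovered, (current_resolved + recovered) / total_tasks


recovered, rate = projected_resolve_rate(0.5)
print(f"+{recovered} resolves -> {rate:.1%}")  # +33 resolves -> 49.0%
```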
Recommended Fixes
- Include per-repo test runner commands in the MCP server instructions or tool responses (e.g., "For django: python -m pytest tests/ -x", "For sympy: python -m pytest sympy/core/tests/")
- Add test-running guidance to server instructions: "After making changes, always run the relevant test suite to verify no regressions. Run broader tests, not just the specific test case from the issue."
- Return test commands in symbol_context responses: When the agent looks up a symbol, include a hint like "Test this module with: pytest path/to/tests/"
- Post-edit workflow hint: After detecting that the agent has edited files, remind it to run tests before submitting
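As a rough illustration of the first and last items, the server could keep a small lookup table of test commands and build a reminder once it sees edited files. Everything here is hypothetical: `TEST_COMMANDS`, `build_test_hint`, and `post_edit_reminder` are illustrative names and not part of the existing server, and the two commands are copied from the examples above without being verified against those repositories.

```python
from typing import Optional

# Hypothetical per-repo table; entries are the examples quoted in this issue.
TEST_COMMANDS = {
    "django": "python -m pytest tests/ -x",
    "sympy": "python -m pytest sympy/core/tests/",
}

REGRESSION_GUIDANCE = (
    "After making changes, always run the relevant test suite to verify no "
    "regressions. Run broader tests, not just the specific test case from the issue."
)


def build_test_hint(repo: str) -> Optional[str]:
    """Return a test-command hint for a known repo, or None if the repo is unknown."""
    command = TEST_COMMANDS.get(repo.lower())
    if command is None:
        return None
    return f"Test this repo with: {command}"


def post_edit_reminder(repo: str, edited_files: list[str]) -> Optional[str]:
    """Build a reminder to attach to a tool response once files have been edited."""
    if not edited_files:
        return None  # Nothing was edited, so there is nothing to remind about.
    parts = [f"You edited {len(edited_files)} file(s). {REGRESSION_GUIDANCE}"]
    hint = build_test_hint(repo)
    if hint:
        parts.append(hint)
    return " ".join(parts)


print(post_edit_reminder("django", ["django/db/models/query.py"]))
```

For a repo missing from the table, the reminder still carries the general guidance and only omits the specific command, so the agent is nudged toward running tests even without per-repo data.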
Labels
enhancement, swe-bench