67 'so close' failures: agent fixes bug but breaks other tests #127

## Problem

In our SWE-bench Verified evaluation, 67 of 500 tasks (13.4%) follow a pattern: the agent correctly identifies and fixes the reported bug, but the fix introduces regressions that break other tests. In these runs the agent either doesn't run the full test suite before submitting, or runs tests but doesn't iterate on the failures.

### Data

- 67 tasks where the patch addresses the correct code location but fails SWE-bench's test harness
- Common pattern: agent edits the right file, fixes the reported behavior, but doesn't verify against the broader test suite
- The agent often runs only a targeted test (e.g., the specific test file mentioned in the issue) rather than the full relevant test suite
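One way to picture "broader than the targeted test" is to widen a single test file to the directory that contains it. The sketch below is only an illustration of that idea: the helper name `broaden_test_target` and the parent-directory heuristic are assumptions, not something the agent or the MCP server currently does, and the example path is illustrative.

```python
from pathlib import PurePosixPath


def broaden_test_target(targeted_test: str) -> str:
    """Widen a single test file (or pytest node id) to its containing directory.

    Hypothetical heuristic: the "relevant" suite is assumed to be the
    directory that holds the targeted test file.
    """
    # Drop a pytest node id suffix such as "::test_name", if present.
    file_part = targeted_test.split("::", 1)[0]
    parent = PurePosixPath(file_part).parent
    # A bare file name has parent ".", meaning the current directory.
    return "." if str(parent) == "." else f"{parent}/"


# Illustrative path only; any repo-relative test path works the same way.
print(broaden_test_target("sympy/core/tests/test_basic.py::test_structure"))
```

A real implementation would likely need repo-specific rules, since some projects keep tests far from the code they cover.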

### Root Cause

The MCP server's instructions don't provide guidance on test-running workflows. The agent doesn't know:
1. How to run the project's test suite
2. Which test commands are appropriate for the repo
3. That it should run a broader test suite (not just the specific failing test) before submitting

### Impact

If even half of these "so close" tasks were recovered through better test guidance, that would be about 33 additional resolves (42.4% → ~49%).
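The arithmetic behind that estimate can be checked with a few lines. This is a back-of-the-envelope sketch: the 500-task total is inferred from 67 tasks being 13.4% of the set, and the variable names are made up for illustration.

```python
TOTAL_TASKS = 500      # inferred: 67 / 0.134 = 500
SO_CLOSE = 67          # tasks in the "so close" bucket
BASELINE_RATE = 0.424  # current resolve rate from this issue

baseline_resolved = round(BASELINE_RATE * TOTAL_TASKS)  # 212 tasks
recovered = SO_CLOSE // 2                               # half of 67, rounded down
projected_rate = (baseline_resolved + recovered) / TOTAL_TASKS

print(f"{baseline_resolved} + {recovered} = {baseline_resolved + recovered} resolved")
print(f"projected resolve rate: {projected_rate:.1%}")
```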

### Recommended Fixes

1. **Include per-repo test runner commands** in the MCP server instructions or tool responses (e.g., "For django: `python tests/runtests.py --failfast`", since Django's own suite is driven by `tests/runtests.py` rather than pytest; "For sympy: `python -m pytest sympy/core/tests/`")
2. **Add test-running guidance** to server instructions: "After making changes, always run the relevant test suite to verify no regressions. Run broader tests, not just the specific test case from the issue."
3. **Return test commands in symbol_context responses**: When the agent looks up a symbol, include a hint like "Test this module with: `pytest path/to/tests/`"
4. **Post-edit workflow hint**: After detecting that the agent has edited files, remind it to run tests before submitting
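Fixes 1, 2, and 4 could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the server's actual implementation: `TEST_COMMANDS`, `build_test_hint`, and `post_edit_reminder` are hypothetical names, the table holds only the sympy example from this issue, and how the server would detect edits or test runs is left out.

```python
from typing import Optional

# Hypothetical per-repo table. The sympy entry is the example from this
# issue; real entries would need to be verified against each repo.
TEST_COMMANDS: dict[str, str] = {
    "sympy": "python -m pytest sympy/core/tests/",
}

# Guidance text proposed in fix 2.
GENERIC_GUIDANCE = (
    "After making changes, always run the relevant test suite to verify no "
    "regressions. Run broader tests, not just the specific test case from "
    "the issue."
)


def build_test_hint(repo: str) -> str:
    """Return the generic guidance, plus a repo-specific command if known."""
    command = TEST_COMMANDS.get(repo)
    if command is None:
        return GENERIC_GUIDANCE
    return f"{GENERIC_GUIDANCE} Test command for {repo}: `{command}`"


def post_edit_reminder(
    edited_files: list[str], tests_run_since_edit: bool, repo: str
) -> Optional[str]:
    """Return a reminder only when files were edited and no tests ran since."""
    if edited_files and not tests_run_since_edit:
        return (
            f"You edited {len(edited_files)} file(s) and have not run tests "
            f"since. {build_test_hint(repo)}"
        )
    return None


print(post_edit_reminder(["sympy/core/add.py"], False, "sympy"))
```

Keeping the generic guidance as the fallback means repos without a verified command still get the "run broader tests" instruction instead of nothing.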

## Labels

enhancement, swe-bench
