67 'so close' failures: agent fixes bug but breaks other tests #127

@greynewell

Problem

In our SWE-bench Verified evaluation, 67 tasks (13.4%) follow a pattern where the agent correctly identifies and fixes the reported bug, but the fix introduces regressions that break other tests. In these runs the agent either doesn't run the full test suite before submitting, or runs tests but doesn't iterate on the failures.

Data

  • 67 tasks where the patch addresses the correct code location but fails SWE-bench's test harness
  • Common pattern: agent edits the right file, fixes the reported behavior, but doesn't verify against the broader test suite
  • The agent often runs only a targeted test (e.g., the specific test file mentioned in the issue) rather than the full relevant test suite

Root Cause

The MCP server's instructions don't provide guidance on test-running workflows. The agent doesn't know:

  1. How to run the project's test suite
  2. Which test commands are appropriate for the repo
  3. That it should run a broader test suite (not just the specific failing test) before submitting

Impact

If even half of these "so close" tasks were recovered through better test guidance, that would add about 33 resolves, raising the resolve rate from 42.4% to roughly 49%.
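The arithmetic behind that estimate can be checked with a short sketch. The 500-task total is not stated in the issue; it is inferred here from 67 tasks being 13.4% of the run, and the variable names are purely illustrative.

```python
# Figures quoted in the issue.
so_close_tasks = 67
so_close_share = 0.134      # 13.4% of all tasks
baseline_rate = 0.424       # current resolve rate

# Assumption: the total task count is inferred from 67 / 13.4%.
total_tasks = round(so_close_tasks / so_close_share)

# Resolves today, and resolves if half of the "so close" tasks are recovered.
baseline_resolved = round(baseline_rate * total_tasks)
recovered = so_close_tasks // 2
projected_rate = (baseline_resolved + recovered) / total_tasks

print(f"total tasks: {total_tasks}")
print(f"recovered tasks: {recovered}")
print(f"projected resolve rate: {projected_rate:.1%}")
```

Integer division is used for "half of 67", which matches the +33 figure quoted above.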

Recommended Fixes

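  1. Include per-repo test runner commands in the MCP server instructions or tool responses (e.g., "For django: python tests/runtests.py", "For sympy: python -m pytest sympy/core/tests/"). Django's own suite is driven by tests/runtests.py rather than pytest, so the commands need to be verified per repo.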
  2. Add test-running guidance to server instructions: "After making changes, always run the relevant test suite to verify no regressions. Run broader tests, not just the specific test case from the issue."
  3. Return test commands in symbol_context responses: When the agent looks up a symbol, include a hint like "Test this module with: pytest path/to/tests/"
  4. Post-edit workflow hint: After detecting that the agent has edited files, remind it to run tests before submitting
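A minimal sketch of how these four fixes might fit together on the server side is given below. Everything in it is hypothetical: the `TEST_COMMANDS` table, the function names, and the assumption that a module's tests live in a sibling `tests/` directory are illustrations, not the MCP server's actual API. The command strings are placeholders that would need checking against each project's contributor docs.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical per-repo table (fix 1). Command strings are illustrative only.
TEST_COMMANDS = {
    "django": "python tests/runtests.py",
    "sympy": "python -m pytest sympy/core/tests/",
}

# Generic guidance text quoted from fix 2.
GENERIC_GUIDANCE = (
    "After making changes, always run the relevant test suite to verify no "
    "regressions. Run broader tests, not just the specific test case from the issue."
)


def server_instructions(repo: str) -> str:
    """Fixes 1 and 2: generic guidance plus a repo-specific command if known."""
    command = TEST_COMMANDS.get(repo)
    if command is None:
        return GENERIC_GUIDANCE
    return f"{GENERIC_GUIDANCE} For {repo}: {command}"


def symbol_test_hint(source_file: str) -> str:
    """Fix 3: hint for a symbol lookup.

    Assumes tests sit in a tests/ directory next to the source file,
    which will not hold for every repository.
    """
    tests_dir = PurePosixPath(source_file).parent / "tests"
    return f"Test this module with: pytest {tests_dir}/"


def post_edit_reminder(edited_files: list[str], tests_run_since_edit: bool) -> Optional[str]:
    """Fix 4: remind the agent to run tests if it edited files and has not tested since."""
    if edited_files and not tests_run_since_edit:
        return (
            f"You have edited {len(edited_files)} file(s) without running tests. "
            "Run the test suite before submitting."
        )
    return None


print(server_instructions("sympy"))
print(symbol_test_hint("sympy/core/add.py"))
print(post_edit_reminder(["sympy/core/add.py"], tests_run_since_edit=False))
```

Keeping the table as plain data means unknown repos fall back to the generic guidance, so a missing entry never produces a wrong command.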

Labels

enhancement, swe-bench
