SWE-bench: fix SWE-agent hanging, adjust expected scores #1202

Merged
Kipok merged 4 commits into main from ludwig-n/swe-agent-fix-rich on Jan 30, 2026
@ludwig-n ludwig-n commented Jan 30, 2026

  1. Fixes an issue where SWE-agent was hanging forever on 41 Django instances by downgrading to rich==14.2.0 in the SWE-agent environment.
  2. Adjusts the expected scores of Qwen3-Coder-30B-A3B in the tests and docs following this evaluation harness fix: "Fix patches being reversed if repo state in container differs from base_commit" (Kipok/SWE-bench#3).
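
As a rough illustration of the first item, the sketch below shows one way a pinned rich install could be appended to a list of environment setup commands. It is not the code from swebench.py: the helper names (`build_pin_command`, `append_pin`), the use of `python -m pip`, and the example setup command are all assumptions made for illustration. Only the pin itself, rich==14.2.0, comes from this PR.

```python
import shlex

# The pin comes from this PR; everything else in this sketch is hypothetical.
RICH_PIN = "rich==14.2.0"


def build_pin_command(python_bin: str = "python", requirement: str = RICH_PIN) -> str:
    """Return a shell command that installs exactly one pinned requirement."""
    args = [python_bin, "-m", "pip", "install", requirement]
    # Quote each argument so the command stays safe if a path contains spaces.
    return " ".join(shlex.quote(arg) for arg in args)


def append_pin(setup_commands: list[str], python_bin: str = "python") -> list[str]:
    """Return a new setup sequence with the pin applied as the last step.

    Running the pin last means it overrides whatever version of the package
    earlier installation steps resolved.
    """
    return [*setup_commands, build_pin_command(python_bin)]


if __name__ == "__main__":
    # Hypothetical earlier step; the real setup sequence may look different.
    for command in append_pin(["python -m pip install -e ."]):
        print(command)
```

Placing the pin after the main install is a design choice in this sketch: a dependency resolver may otherwise pick a newer release, so the explicit downgrade has to run once the rest of the environment is in place.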

Summary by CodeRabbit

Release Notes

  • Documentation

    • Updated SWE-bench evaluation results with latest benchmark metrics.
  • Chores

    • Added explicit version constraint for a package dependency during setup.
  • Tests

    • Updated test baseline metrics to reflect new measurement results.


Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
@ludwig-n ludwig-n marked this pull request as ready for review January 30, 2026 16:37
@ludwig-n ludwig-n requested a review from Kipok January 30, 2026 16:37
@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


coderabbitai bot commented Jan 30, 2026

Walkthrough

Updates the numeric example values in the evaluation documentation, adds an explicit downgrade of the rich package (v14.2.0) to the SWE-agent setup in swebench.py, and updates the metric ranges and reference date in the test validation thresholds to reflect new measurement data.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation Updates: docs/evaluation/code.md | Updated SWE-bench results example values (issues_resolved: 45.0 → 48.4, no_patch: 0.4 → 1.0, patch_cant_apply: 0.8 → 1.6). |
| Infrastructure Setup: nemo_skills/inference/eval/swebench.py | Added explicit downgrade of rich package to version 14.2.0 in SWE-agent setup sequence with explanatory comment; affects dependency resolution during container initialization. |
| Test Configuration: tests/slurm-tests/qwen3coder_30b_swebench/check_results.py | Updated METRIC_RANGES validation thresholds for openhands (41.9-47.9 → 44.3-52.3) and swe_agent (45.5-52.4 → 45.5-53.5); updated reference date to 2026-01-30 and changed averaging methodology to 3-run average. |
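
To make the threshold change concrete, here is a minimal sketch of how a range check against the updated bounds might look. The dictionary layout, the function name `in_expected_range`, and the inclusive comparison are assumptions; the actual structure of METRIC_RANGES in check_results.py is not shown in this PR. Only the numeric bounds are taken from the summary above.

```python
# Updated bounds quoted in the summary above; the layout is hypothetical.
METRIC_RANGES = {
    "openhands": (44.3, 52.3),
    "swe_agent": (45.5, 53.5),
}


def in_expected_range(framework: str, value: float) -> bool:
    """Return True if a measured score lies within the expected bounds (inclusive)."""
    lower, upper = METRIC_RANGES[framework]
    return lower <= value <= upper


if __name__ == "__main__":
    # Illustrative scores only, not measured results.
    for framework, score in [("openhands", 48.4), ("openhands", 43.0)]:
        status = "ok" if in_expected_range(framework, score) else "out of range"
        print(f"{framework}: {score} -> {status}")
```

In this sketch a score of 43.0 for openhands, which the previous 41.9-47.9 range would have accepted, now falls outside the updated bounds.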

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • Kipok
  • gwarmstrong
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the two main changes: fixing SWE-agent hanging (via rich downgrade) and adjusting expected scores (reflected in documentation and test updates). |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@Kipok Kipok left a comment


Thanks!

@Kipok Kipok merged commit 250c862 into main Jan 30, 2026
5 checks passed
@Kipok Kipok deleted the ludwig-n/swe-agent-fix-rich branch January 30, 2026 18:12