SWE-bench: fix SWE-agent hanging, adjust expected scores by ludwig-n · Pull Request #1202 · NVIDIA-NeMo/Skills

ludwig-n · 2026-01-30T15:45:28Z

Fixes an issue where SWE-agent was hanging forever on 41 Django instances by downgrading to rich==14.2.0 in the SWE-agent environment.
Adjusts the expected scores of Qwen3-Coder-30B-A3B in the tests and docs following this evaluation harness fix: "Fix patches being reversed if repo state in container differs from base_commit" (Kipok/SWE-bench#3).
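As a rough illustration of the pinning described above, here is a minimal sketch of appending a pinned install to a setup command. The helper `build_setup_command` and the `pip install -e .` placeholder step are hypothetical; the actual setup sequence in `nemo_skills/inference/eval/swebench.py` may be structured quite differently.

```python
import shlex


def build_setup_command(steps, pins):
    """Join setup steps with '&&' and append one pinned 'pip install' per package.

    Hypothetical helper for illustration only; not taken from swebench.py.
    """
    pin_steps = [
        "pip install " + shlex.quote(f"{name}=={version}")
        for name, version in pins.items()
    ]
    return " && ".join(list(steps) + pin_steps)


command = build_setup_command(
    steps=["pip install -e ."],  # placeholder for the existing SWE-agent install step
    pins={"rich": "14.2.0"},  # the version named in this PR
)
print(command)
```

Running the pin after the main install step means it overrides whatever version of rich the earlier step resolved, which is the effect a downgrade needs.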

Summary by CodeRabbit

Release Notes

Documentation
- Updated SWE-bench evaluation results with latest benchmark metrics.
Chores
- Added explicit version constraint for a package dependency during setup.
Tests
- Updated test baseline metrics to reflect new measurement results.

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

greptile-apps

No files reviewed, no comments

coderabbitai · 2026-01-30T16:41:50Z

Walkthrough

Updates numeric example values in the evaluation documentation, adds an explicit downgrade of the rich package (v14.2.0) to the SWE-agent setup in swebench.py, and updates the metric ranges and reference date in the test validation thresholds to reflect new measurement data.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation Updates `docs/evaluation/code.md` | Updated SWE-bench results example values (issues_resolved: 45.0 → 48.4, no_patch: 0.4 → 1.0, patch_cant_apply: 0.8 → 1.6). |
| Infrastructure Setup `nemo_skills/inference/eval/swebench.py` | Added explicit downgrade of rich package to version 14.2.0 in SWE-agent setup sequence with explanatory comment; affects dependency resolution during container initialization. |
| Test Configuration `tests/slurm-tests/qwen3coder_30b_swebench/check_results.py` | Updated METRIC_RANGES validation thresholds for openhands (41.9-47.9 → 44.3-52.3) and swe_agent (45.5-52.4 → 45.5-53.5); updated reference date to 2026-01-30 and changed averaging methodology to 3-run average. |
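The range check summarized in the table can be sketched roughly as follows. This is only an assumption about the shape of the check: the bounds are the updated ones listed above, but the dictionary layout, the function `check_issues_resolved`, and the per-run scores are illustrative placeholders, not contents of `check_results.py` or measured results.

```python
from statistics import mean

# Updated bounds as reported in the walkthrough; the dict layout is an assumption.
METRIC_RANGES = {
    "openhands": (44.3, 52.3),
    "swe_agent": (45.5, 53.5),
}


def check_issues_resolved(framework, run_scores):
    """Average the per-run scores and test the average against the allowed range."""
    lower, upper = METRIC_RANGES[framework]
    average = mean(run_scores)
    return lower <= average <= upper, average


# Placeholder scores for three runs, chosen only to exercise the check.
ok, average = check_issues_resolved("swe_agent", [47.0, 48.4, 49.6])
print(f"swe_agent: average={average:.1f}, within range={ok}")
```

Averaging over several runs before comparing against the bounds reduces the chance that run-to-run variance alone trips the test.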

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

- Fixes to support SWE-bench Multilingual #1064: Modifies swebench.py installation sequence, overlapping with rich package management changes in this PR.
- SWE-bench: add Slurm tests & other improvements #920: Updates the same check_results.py test file METRIC_RANGES values.

Suggested reviewers

Kipok
gwarmstrong

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the two main changes: fixing SWE-agent hanging (via rich downgrade) and adjusting expected scores (reflected in documentation and test updates). |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

Kipok

Thanks!

ludwig-n added 4 commits January 30, 2026 17:00

Fix rich==14.2.0 for SWE-agent

dc50480

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

Add comment

a1b957c

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

Adjust expected SWE-bench results for OpenHands

259efd6

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

Adjust upper bound for SWE-agent

d9913c9

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

ludwig-n marked this pull request as ready for review January 30, 2026 16:37

ludwig-n requested a review from Kipok January 30, 2026 16:37

greptile-apps bot reviewed Jan 30, 2026

Kipok approved these changes Jan 30, 2026

Kipok merged commit 250c862 into main Jan 30, 2026
5 checks passed

Kipok deleted the ludwig-n/swe-agent-fix-rich branch January 30, 2026 18:12
