Support mini-swe-agent as agent harness by wasiahmad · Pull Request #1212 · NVIDIA-NeMo/Skills

wasiahmad · 2026-02-04T20:15:12Z

Summary by CodeRabbit

New Features
- Added mini-SWE-agent as a supported agent framework for SWE-bench evaluation, including per-datapoint execution, containerized run support, result generation in SWE-bench format, and integration into the processing flow.
- Added three mini-SWE-agent prompt/config templates (standard, backticks, XML) to drive interactive, single-command iterative agent runs.

Documentation
- Updated SWE-bench docs to reflect support and defaults for three agent frameworks: SWE-agent, mini-SWE-agent, and OpenHands.

greptile-apps

2 files reviewed, 3 comments

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml

nemo_skills/inference/eval/swebench.py

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps

No files reviewed, no comments

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps

1 file reviewed, 1 comment

nemo_skills/inference/eval/swebench.py

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps

2 files reviewed, 2 comments

docs/evaluation/code.md

nemo_skills/inference/eval/swebench.py

greptile-apps

2 files reviewed, 6 comments

docs/evaluation/code.md

nemo_skills/inference/eval/swebench.py

greptile-apps · 2026-02-07T03:27:44Z

Additional Comments (4)

nemo_skills/inference/eval/swebench.py
extra_body type mismatch

`SweBenchInferenceConfig.extra_body` is declared as a plain `dict`, but later code treats it as an OmegaConf object (`OmegaConf.to_container(self.cfg.inference.extra_body, ...)`). If `extra_body` remains a regular dict (as annotated/defaulted here), this call will error at runtime. Either declare `extra_body` with the same OmegaConf-compatible type used elsewhere in the codebase (e.g., `DictConfig`), or convert/merge it without calling `OmegaConf.to_container` when it’s already a dict.
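One way to make the conversion tolerant of both input types is sketched below. The helper name `to_plain_dict` is hypothetical and does not come from the PR; the sketch assumes `OmegaConf.to_container` is only needed when the value is not already a built-in `dict`.

```python
from typing import Any


def to_plain_dict(extra_body: Any) -> dict:
    """Return extra_body as a built-in dict, whatever container it arrived in."""
    if extra_body is None:
        return {}
    if isinstance(extra_body, dict):
        # Already a plain dict: skip OmegaConf entirely (hypothetical helper).
        return dict(extra_body)
    # Imported lazily so the plain-dict path has no OmegaConf dependency.
    from omegaconf import OmegaConf

    return OmegaConf.to_container(extra_body, resolve=True)


print(to_plain_dict({"top_k": 20}))
```

The returned copy can be merged into request kwargs without caring whether the config was built from a dataclass default or from a Hydra/OmegaConf override.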

nemo_skills/inference/eval/swebench.py
Malformed CLI arguments

In `_run_swe_agent`, there’s no `&&` after `uv pip install -e .`, so the next fragment (`uv pip install rich==...`) will be appended as additional arguments to the python invocation instead of running as a separate shell command, causing the container command to fail.

                "uv pip install -e . && "

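The failure mode comes from Python's implicit concatenation of adjacent string literals. A minimal illustration follows, with a placeholder second command (`echo done`) standing in for the real install step; it is not the command string used in `swebench.py`.

```python
import shlex

# Adjacent string literals are concatenated, so without "&&" the shell
# sees a single command with extra arguments.
broken = (
    "uv pip install -e . "
    "echo done"
)

# Joining explicit steps with " && " keeps each one a separate command.
steps = ["uv pip install -e .", "echo done"]
fixed = " && ".join(steps)

print(shlex.split(broken))
print(fixed)
```

Building the command from a list of steps makes a missing separator impossible, at the cost of a slightly less literal-looking command string.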
nemo_skills/inference/eval/swebench.py
Unescaped instance_id in shell

`data_point['instance_id']` is interpolated into the container command for `--output trajectories/{instance_id}.traj.json` without shell quoting. If an instance id contains spaces, quotes, or metacharacters, the command will break or execute unintended tokens. Please wrap it with `shlex.quote(...)` (similar to `problem_statement`) or otherwise ensure it’s safely escaped before embedding it in the shell string.
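A small sketch of the quoting this comment asks for. The function name `output_flag` and the sample ids are illustrative only; the path pattern is the one quoted in the comment.

```python
import shlex


def output_flag(instance_id: str) -> str:
    """Build the --output argument with the path quoted for the shell."""
    path = f"trajectories/{instance_id}.traj.json"
    return f"--output {shlex.quote(path)}"


# A well-behaved id passes through unchanged; a hostile one is wrapped in quotes.
print(output_flag("example__repo-123"))
print(output_flag("bad id; rm -rf /"))
```

Quoting the whole path rather than only the id keeps the result a single shell token even when the id contains spaces.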

nemo_skills/inference/eval/swebench.py
Temp config path traversal

`tmp_config_filename = f"configs/config_{data_point['instance_id']}.yaml"` is used to build `host_tmp_path` under `self.output_dir`. If `instance_id` contains `/` or `..`, this will create nested paths or allow writing/removing files outside `output_dir` when creating/cleaning up the temp config. Sanitize the filename (e.g., replace path separators and strip `..`) or use a hash of `instance_id` for the on-disk name.
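Both suggestions (sanitizing and hashing) can be combined, as in the sketch below. `safe_config_path` is a hypothetical helper, and the whitelist of allowed characters is an assumption rather than a rule taken from the codebase.

```python
import hashlib
import re
from pathlib import Path


def safe_config_path(output_dir: str, instance_id: str) -> Path:
    """Map an arbitrary instance_id to a file directly under output_dir/configs."""
    # Replace anything outside a conservative whitelist, including "/" and ".".
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", instance_id)
    # A short hash keeps names unique even when two ids sanitize to the same slug.
    digest = hashlib.sha256(instance_id.encode("utf-8")).hexdigest()[:12]
    return Path(output_dir) / "configs" / f"config_{slug}_{digest}.yaml"


print(safe_config_path("/tmp/out", "../../etc/passwd"))
```

The readable slug keeps the temp file easy to match to its datapoint, while the hash suffix guards against collisions introduced by the sanitization.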

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/evaluation/code.md (1)

280-281: ⚠️ Potential issue | 🟡 Minor

Add note clarifying that mini-SWE-agent is not supported for multilingual evaluation.

The swe-bench-multilingual section should explicitly document that mini-SWE-agent does not have multilingual support yet. While the current documentation correctly omits mini-SWE-agent from the multilingual example, this should be made explicit for clarity, especially since mini-SWE-agent is mentioned as a supported alternative in the regular SWE-bench section (lines 160-161).

Consider adding a note like: "Currently, only OpenHands and SWE-agent support multilingual evaluation. Mini-SWE-agent support for multilingual datasets is not yet available."

🤖 Fix all issues with AI agents

In `@nemo_skills/inference/eval/swebench.py`:
- Line 635: Replace the silent fallback to {} and fail-fast when expected key is
missing: change the access of trajectory_dict.get("info", {}) to direct
dictionary access trajectory_dict["info"] inside the function where
trajectory_info is assigned (variable name trajectory_info in swebench.py) so a
missing "info" raises a KeyError; if you need clearer context, wrap the access
in a try/except and re-raise a more specific error mentioning the trajectory id
or source (use the same trajectory_dict/trajectory_info variables) rather than
silently using an empty dict.

In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 40-45: The numbered rules in the prompt skip item 2 (they read 1,
3, 4, 5); update the ordered list in swebench.yaml so the sequence is
consecutive by changing "3. The system executes the command(s) in a subshell" to
"2. The system executes the command(s) in a subshell" and then renumber the
following lines accordingly ("4."→"3.", "5."→"4.") so the list reads 1–4 in
proper order.
- Line 18: The XML-like tag casing is inconsistent: the opening tag
`<IMPORTANT>` is uppercase but the closing tag `</important>` is lowercase;
update the closing tag to exactly match the opener (change `</important>` to
`</IMPORTANT>`) or make both tags consistently lowercase (e.g.,
`<important>`/`</important>`), ensuring the pair is identical and preserving the
surrounding sentence and punctuation in the `swebench.yaml` snippet.
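The fail-fast access suggested above for `trajectory_dict["info"]` could look roughly like this. The wrapper function and its `source` parameter are hypothetical, added only to show how a more specific error might carry context.

```python
def get_trajectory_info(trajectory_dict: dict, source: str) -> dict:
    """Return the "info" entry, failing loudly if the trajectory lacks one."""
    try:
        return trajectory_dict["info"]
    except KeyError as exc:
        # Re-raise with context instead of silently falling back to {}.
        raise KeyError(f"trajectory from {source!r} has no 'info' entry") from exc


print(get_trajectory_info({"info": {"example": 1}}, "sample.traj.json"))
```

With the `.get("info", {})` fallback, a malformed trajectory would quietly produce an empty result; here it stops the run at the point where the data is first found to be wrong.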

🧹 Nitpick comments (2)

docs/evaluation/code.md (1)

160-161: Consider adding mini-SWE-agent-specific expected results or a note about expected performance.

The sample run section only shows expected results for OpenHands (issues_resolved: 48.4). While the instructions say to replace openhands with mini_swe_agent, users would benefit from knowing what resolve rate to expect with mini-swe-agent, even approximately. Based on learnings, documentation for benchmarks should include expected results for tested models.

nemo_skills/inference/eval/swebench.py (1)

558-652: 
Overall implementation looks solid — past review issues are resolved.

The three issues flagged in previous reviews are all addressed:

Default config path now correctly points to "eval/swe-bench/mini-swe-agent/swebench" (line 574)

Module invocation uses proper dotted path python -m minisweagent.run.mini (line 613)

Search path {instance_id}.traj.json matches the --output argument (lines 617, 624)

The completion_kwargs construction logic (lines 563–572) is duplicated across _run_swe_agent, _run_mini_swe_agent, and _run_openhands. Consider extracting it into a shared helper to reduce duplication.

nemo_skills/inference/eval/swebench.py

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps

2 files reviewed, 2 comments

nemo_skills/inference/eval/swebench.py

docs/evaluation/code.md

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 21-24: In swebench.yaml update the contradictory prompt
requirement on the response rules: replace the phrase "Provide exactly ONE bash
command to execute" with "Provide AT LEAST ONE bash command to execute" so it
matches the other occurrences and the format_error_template semantics; ensure
the change is made in the response instruction block near the top of the file so
all references (including format_error_template and lines that currently state
"AT LEAST ONE") are consistent.

🧹 Nitpick comments (2)

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml (1)

175-176: Nit: trailing dot on cost_limit value.

cost_limit: 3. is valid YAML (parses as 3.0) but reads oddly. Consider 3.0 for clarity. This is consistent with the same pattern in the sibling swebench.yaml, so it's a minor cosmetic point across all three configs.

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (1)

113-114: Nit: trailing dot on cost_limit value.

Same as the sibling config — cost_limit: 3. is valid YAML but 3.0 would be more conventional.

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml

Kipok · 2026-02-10T00:09:02Z

no issues from my side, but will leave this to @ludwig-n for final approval. @wasiahmad please fix DCO

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

greptile-apps

3 files reviewed, 4 comments

nemo_skills/inference/eval/swebench.py

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml

docs/evaluation/code.md

ludwig-n · 2026-02-10T15:54:34Z

Score for Qwen3-Coder-480B-A35B: ~52.5% (avg of 3 runs). Officially reported score is 55.4%. Good enough to merge in my opinion.

With this framework, patches don't always follow the git diff format, so it causes a higher percentage of "patch can't apply" errors. However, this seems to be a "feature" of mini-swe-agent, as it asks the LLM to create the patch file, rather than running git diff manually after the agent is done like SWE-agent/OpenHands. So it's not an issue with our implementation.

greptile-apps

2 files reviewed, 2 comments

docs/evaluation/code.md

nemo_skills/inference/eval/swebench.py

wasiahmad · 2026-02-16T18:52:31Z

Evaluated Minimax-M2.1 and Minimax-M2.5 on swe-bench-verified using mini-swe-agent and got the following scores.

Minimax-M2.1 => Pass@1: 70.6% (avg-of-3: 71.0, 71.0, 69.8)
Minimax-M2.5 => Pass@1: 75.9% (avg-of-3: 76.6, 75.0, 76.0)

There are no officially reported scores for these models with mini-swe-agent.
Their officially reported swe-bench-verified scores are 74.0 and 80.2, respectively.
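For reference, the Pass@1 figures above are plain means of the three listed runs, rounded to one decimal place. A short check using only the numbers quoted in this comment:

```python
from statistics import mean

# Per-run scores exactly as listed in the comment above.
runs = {
    "Minimax-M2.1": [71.0, 71.0, 69.8],
    "Minimax-M2.5": [76.6, 75.0, 76.0],
}

# Average each model's runs and round to one decimal, matching the report.
averages = {name: round(mean(scores), 1) for name, scores in runs.items()}
print(averages)
```

The officially reported numbers were presumably obtained with a different harness, so the gap to them should be read as indicative rather than as a like-for-like comparison.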

wasiahmad and others added 30 commits November 29, 2025 15:55

some support

537e8b3

updating to support mini-swe-agent

febd824

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

updating to support mini-swe-agent

783354b

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

updating to support mini-swe-agent

71167b0

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing a minor bug

6422992

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing a minor bug

583b9ab

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing a minor bug

3d99c60

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing a minor bug

3846964

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing a minor bug

dac4bca

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing a minor bug

9f3e234

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Add HMMT Nov 2025 dataset (#1061)

fa2e5b5

Signed-off-by: i-vainn <imoshkov@nvidia.com>

Use docker build cache (#1056)

4bd5f7d

Signed-off-by: George Armstrong <georgea@nvidia.com>

ci: Add CodeRabbit configuration file (#1063)

aff6e4a

Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

FIX integration tests by escaping aalcr and adding judge args (#1062)

7c44a0d

Signed-off-by: George Armstrong <georgea@nvidia.com>

ENH add tool calling args (#1067)

e5bcd68

Signed-off-by: George Armstrong <georgea@nvidia.com>

Fix sglang tool calling (#1070)

c74cd99

Signed-off-by: George Armstrong <georgea@nvidia.com>

Network Blocking for Sandbox Code Execution (#1071)

e03f563

Signed-off-by: George Armstrong <georgea@nvidia.com>

Fixes to support SWE-bench Multilingual (#1064)

c376270

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

fix: IFBench error handling and build improvements (#1073)

1b1f66e

Signed-off-by: George Armstrong <georgea@nvidia.com>

FIX math verify handle leading zeros and int literals cases (#1074)

782b083

Signed-off-by: George Armstrong <georgea@nvidia.com>

build: move data preparation to beginning of gpu tests build (#1077)

1545f73

Signed-off-by: George Armstrong <georgea@nvidia.com>

MAINT update langugage-data dependency (#1076)

6594d4c

Signed-off-by: George Armstrong <georgea@nvidia.com>

MAINT: Add audio requirements to vllm image (#1081)

53f1056

Signed-off-by: George Armstrong <georgea@nvidia.com>

Add apex-shortlist dataset (#1080)

7e35ddd

Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

Introduce regex for small differences of formatting from judge (#1082)

0316807

Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings. (#1079)

0807259

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>

MAINT pin nemo-evaluator (#1095)

b74c543

Signed-off-by: George Armstrong <georgea@nvidia.com>

Update issue templates

5c15cf7

Delete .github/ISSUE_TEMPLATE directory

c4eb65f

Signed-off-by: George Armstrong <georgea@nvidia.com>

enable blank issues (#1096)

2d93252

Signed-off-by: George Armstrong <georgea@nvidia.com>

ludwig-n added 3 commits February 6, 2026 19:07

Save configs in separate folder

d567697

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

Update docs

4c3e0db

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

Remove drop_params from configs

20f8580

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

greptile-apps bot reviewed Feb 6, 2026

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (resolved)

nemo_skills/inference/eval/swebench.py (resolved)

nemo_skills/inference/eval/swebench.py (resolved)

supporting agent_max_turns

74922d7

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps bot reviewed Feb 6, 2026

downgrading rich to avoid issues with some instances

6615af3

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps bot reviewed Feb 6, 2026

nemo_skills/inference/eval/swebench.py (outdated, resolved)

Kipok reviewed Feb 6, 2026

missing && added

72f0d46

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps bot reviewed Feb 7, 2026

docs/evaluation/code.md (resolved)

nemo_skills/inference/eval/swebench.py (resolved)

Merge branch 'main' into mini_swe_agent

8e0848a

greptile-apps bot reviewed Feb 7, 2026

docs/evaluation/code.md (resolved)

nemo_skills/inference/eval/swebench.py (resolved)

coderabbitai bot reviewed Feb 7, 2026

nemo_skills/inference/eval/swebench.py (resolved)

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (resolved)

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (resolved)

wasiahmad added 2 commits February 7, 2026 01:06

Merge branch 'main' into mini_swe_agent

6251767

adding reference

8a095f7

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

greptile-apps bot reviewed Feb 7, 2026

nemo_skills/inference/eval/swebench.py (resolved)

docs/evaluation/code.md (resolved)

coderabbitai bot reviewed Feb 7, 2026

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (resolved)

Remove step_limit and set cost_limit=0 in all configs

0acf4a4

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

greptile-apps bot reviewed Feb 10, 2026

nemo_skills/inference/eval/swebench.py (resolved)

nemo_skills/inference/eval/swebench.py (resolved)

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (resolved)

docs/evaluation/code.md (resolved)

ludwig-n approved these changes Feb 10, 2026

Merge branch 'main' into mini_swe_agent

4b05c9d

greptile-apps bot reviewed Feb 10, 2026

docs/evaluation/code.md (resolved)

nemo_skills/inference/eval/swebench.py (resolved)

Merge branch 'main' into mini_swe_agent

276556c

wasiahmad merged commit 9517614 into main Feb 16, 2026
5 checks passed

wasiahmad deleted the mini_swe_agent branch February 16, 2026 19:13

17 participants
