Support mini-swe-agent as agent harness#1212

Merged
wasiahmad merged 104 commits into main from mini_swe_agent
Feb 16, 2026

@wasiahmad
Collaborator

@wasiahmad wasiahmad commented Feb 4, 2026

Summary by CodeRabbit

  • New Features

    • Added mini-SWE-agent as a supported agent framework for SWE-bench evaluation, including per-datapoint execution, containerized run support, result generation in SWE-bench format, and integration into the processing flow.
    • Added three mini-SWE-agent prompt/config templates (standard, backticks, XML) to drive interactive, single-command iterative agent runs.
  • Documentation

    • Updated SWE-bench docs to reflect support and defaults for three agent frameworks: SWE-agent, mini-SWE-agent, and OpenHands.

wasiahmad and others added 30 commits November 29, 2025 15:55
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 3 comments

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 6 comments

@greptile-apps
Contributor

greptile-apps bot commented Feb 7, 2026

Additional Comments (4)

nemo_skills/inference/eval/swebench.py
extra_body type mismatch

SweBenchInferenceConfig.extra_body is declared as a plain dict, but later code treats it as an OmegaConf object (OmegaConf.to_container(self.cfg.inference.extra_body, ...)). If extra_body remains a regular dict (as annotated/defaulted here), this call will error at runtime. Either declare extra_body with the same OmegaConf-compatible type used elsewhere in the codebase (e.g., DictConfig), or convert/merge it without calling OmegaConf.to_container when it’s already a dict.


nemo_skills/inference/eval/swebench.py
Malformed CLI arguments

In _run_swe_agent, there’s no && after uv pip install -e . so the next fragment (uv pip install rich==...) will be appended as additional arguments to the python invocation instead of a separate shell command, causing the container command to fail.

                "uv pip install -e . && "
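
A minimal sketch of one way to avoid this class of bug: keep each shell step as its own list element and join them with `&&`, instead of relying on adjacent string literals that silently concatenate. The helper name `chain_commands` is hypothetical and not part of the PR, and the version pin on `rich` is elided in the review, so it is left out here.

```python
def chain_commands(steps):
    """Join shell steps with '&&' so each runs only if the previous one succeeded."""
    # Drop empty fragments so a stray "" never produces a dangling '&&'.
    cleaned = [step.strip() for step in steps if step.strip()]
    return " && ".join(cleaned)


# Hypothetical steps for illustration only; the real command list lives in _run_swe_agent.
setup_cmd = chain_commands([
    "uv pip install -e .",
    "uv pip install rich",
])
print(setup_cmd)
```

With this shape, forgetting a separator is no longer possible, because the separator is added in exactly one place.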

nemo_skills/inference/eval/swebench.py
Unescaped instance_id in shell

data_point['instance_id'] is interpolated into the container command for --output trajectories/{instance_id}.traj.json without shell quoting. If an instance id contains spaces/quotes/metacharacters, the command will break or execute unintended tokens. Please wrap it with shlex.quote(...) (similar to problem_statement) or otherwise ensure it’s safely escaped before embedding in the shell string.
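
The quoting suggested above could look roughly like the following. This is a sketch under assumptions: the helper name `build_output_arg` and the sample ids are made up for illustration, and only `shlex.quote` and `shlex.split` from the standard library are relied on.

```python
import shlex


def build_output_arg(instance_id):
    """Return a shell-safe --output argument for a trajectory file."""
    # Quote the whole path so spaces or metacharacters in instance_id stay one token.
    path = f"trajectories/{instance_id}.traj.json"
    return f"--output {shlex.quote(path)}"


# A well-behaved id passes through unchanged; a hostile one is wrapped in single quotes.
print(build_output_arg("example__repo-123"))
print(build_output_arg("weird id; rm -rf /"))
```

Quoting the full path rather than only the id keeps the argument a single shell token even if the surrounding template changes later.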


nemo_skills/inference/eval/swebench.py
Temp config path traversal

tmp_config_filename = f"configs/config_{data_point['instance_id']}.yaml" is used to build host_tmp_path under self.output_dir. If instance_id contains / or .., this will create nested paths or allow writing/removing files outside output_dir when creating/cleaning up the temp config. Sanitize the filename (e.g., replace path separators and strip ..) or use a hash of instance_id for the on-disk name.
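
Both mitigations mentioned above (sanitizing the name and hashing the id) can be combined. The sketch below is illustrative only: `safe_config_name` and `resolve_inside` are hypothetical helpers, not functions from the PR, and `Path.is_relative_to` assumes Python 3.9 or newer.

```python
import hashlib
import re
from pathlib import Path


def safe_config_name(instance_id):
    """Map an arbitrary instance_id to a flat, filesystem-safe config filename."""
    # Replace everything outside a conservative allowlist, including '/' and '.'.
    slug = re.sub(r"[^A-Za-z0-9_-]", "_", instance_id)[:64]
    # A short hash keeps names distinct when two ids sanitize to the same slug.
    digest = hashlib.sha256(instance_id.encode("utf-8")).hexdigest()[:12]
    return f"config_{slug}_{digest}.yaml"


def resolve_inside(base_dir, filename):
    """Resolve filename under base_dir and refuse anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"{filename!r} escapes {base}")
    return candidate
```

The second helper is a defense-in-depth check: even if a raw id slipped through, the write or cleanup would fail loudly instead of touching files outside the output directory.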

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/evaluation/code.md (1)

280-281: ⚠️ Potential issue | 🟡 Minor

Add note clarifying that mini-SWE-agent is not supported for multilingual evaluation.

The swe-bench-multilingual section should explicitly document that mini-SWE-agent does not have multilingual support yet. While the current documentation correctly omits mini-SWE-agent from the multilingual example, this should be made explicit for clarity, especially since mini-SWE-agent is mentioned as a supported alternative in the regular SWE-bench section (lines 160-161).

Consider adding a note like: "Currently, only OpenHands and SWE-agent support multilingual evaluation. Mini-SWE-agent support for multilingual datasets is not yet available."

🤖 Fix all issues with AI agents
In `@nemo_skills/inference/eval/swebench.py`:
- Line 635: Replace the silent fallback to {} and fail-fast when expected key is
missing: change the access of trajectory_dict.get("info", {}) to direct
dictionary access trajectory_dict["info"] inside the function where
trajectory_info is assigned (variable name trajectory_info in swebench.py) so a
missing "info" raises a KeyError; if you need clearer context, wrap the access
in a try/except and re-raise a more specific error mentioning the trajectory id
or source (use the same trajectory_dict/trajectory_info variables) rather than
silently using an empty dict.

In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 40-45: The numbered rules in the prompt skip item 2 (they read 1,
3, 4, 5); update the ordered list in swebench.yaml so the sequence is
consecutive by changing "3. The system executes the command(s) in a subshell" to
"2. The system executes the command(s) in a subshell" and then renumber the
following lines accordingly ("4."→"3.", "5."→"4.") so the list reads 1–4 in
proper order.
- Line 18: The XML-like tag casing is inconsistent: the opening tag
`<IMPORTANT>` is uppercase but the closing tag `</important>` is lowercase;
update the closing tag to exactly match the opener (change `</important>` to
`</IMPORTANT>`) or make both tags consistently lowercase (e.g.,
`<important>`/`</important>`), ensuring the pair is identical and preserving the
surrounding sentence and punctuation in the `swebench.yaml` snippet.
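
For the fail-fast suggestion on `trajectory_dict["info"]` above, a minimal sketch might look like this. The function name `get_trajectory_info` and the `source` parameter are assumptions introduced for illustration; the actual code in `swebench.py` may be structured differently.

```python
def get_trajectory_info(trajectory_dict, source):
    """Return trajectory_dict["info"], raising a descriptive error if it is absent."""
    try:
        return trajectory_dict["info"]
    except KeyError as exc:
        # Re-raise with context so the failing trajectory can be identified from the log.
        raise KeyError(f"trajectory from {source!r} has no 'info' key") from exc


print(get_trajectory_info({"info": {"exit_status": "Submitted"}}, "example.traj.json"))
```

Compared with `.get("info", {})`, a malformed trajectory file now stops the run at the point of failure instead of producing an empty result later.
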
🧹 Nitpick comments (2)
docs/evaluation/code.md (1)

160-161: Consider adding mini-SWE-agent-specific expected results or a note about expected performance.

The sample run section only shows expected results for OpenHands (issues_resolved: 48.4). While the instructions say to replace openhands with mini_swe_agent, users would benefit from knowing what resolve rate to expect with mini-swe-agent, even approximately. Based on learnings, documentation for benchmarks should include expected results for tested models.

nemo_skills/inference/eval/swebench.py (1)

558-652: Overall implementation looks solid; past review issues are resolved.

The three issues flagged in previous reviews are all addressed:

  1. Default config path now correctly points to "eval/swe-bench/mini-swe-agent/swebench" (line 574)
  2. Module invocation uses proper dotted path python -m minisweagent.run.mini (line 613)
  3. Search path {instance_id}.traj.json matches the --output argument (lines 617, 624)

The completion_kwargs construction logic (lines 563–572) is duplicated across _run_swe_agent, _run_mini_swe_agent, and _run_openhands. Consider extracting it into a shared helper to reduce duplication.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 21-24: In swebench.yaml update the contradictory prompt
requirement on the response rules: replace the phrase "Provide exactly ONE bash
command to execute" with "Provide AT LEAST ONE bash command to execute" so it
matches the other occurrences and the format_error_template semantics; ensure
the change is made in the response instruction block near the top of the file so
all references (including format_error_template and lines that currently state
"AT LEAST ONE") are consistent.
🧹 Nitpick comments (2)
nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml (1)

175-176: Nit: trailing dot on cost_limit value.

cost_limit: 3. is valid YAML (parses as 3.0) but reads oddly. Consider 3.0 for clarity. This is consistent with the same pattern in the sibling swebench.yaml, so it's a minor cosmetic point across all three configs.

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (1)

113-114: Nit: trailing dot on cost_limit value.

Same as the sibling config — cost_limit: 3. is valid YAML but 3.0 would be more conventional.

@Kipok
Collaborator

Kipok commented Feb 10, 2026

no issues from my side, but will leave this to @ludwig-n for final approval. @wasiahmad please fix DCO

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 4 comments

@ludwig-n
Collaborator

Score for Qwen3-Coder-480B-A35B: ~52.5% (avg of 3 runs). Officially reported score is 55.4%. Good enough to merge in my opinion.

With this framework, patches don't always follow the git diff format, so it causes a higher percentage of "patch can't apply" errors. However, this seems to be a "feature" of mini-swe-agent, as it asks the LLM to create the patch file, rather than running git diff manually after the agent is done like SWE-agent/OpenHands. So it's not an issue with our implementation.
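
To get a rough sense of how often a model-written patch is malformed before it reaches evaluation, a cheap structural check can help. The sketch below is a heuristic only and is not something the PR implements: `looks_like_git_diff` is a hypothetical helper, and passing it does not guarantee that the patch applies.

```python
def looks_like_git_diff(patch_text):
    """Cheap heuristic: does the text contain file headers and at least one hunk?"""
    lines = patch_text.splitlines()
    has_old = any(line.startswith("--- ") for line in lines)
    has_new = any(line.startswith("+++ ") for line in lines)
    has_hunk = any(line.startswith("@@ ") for line in lines)
    return has_old and has_new and has_hunk


good = "diff --git a/f.py b/f.py\n--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
bad = "I changed x from 1 to 2 in f.py"
print(looks_like_git_diff(good), looks_like_git_diff(bad))
```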

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

@wasiahmad
Collaborator Author

Evaluated Minimax-M2.1 and Minimax-M2.5 on swe-bench-verified using mini-swe-agent and got the following scores.

  • Minimax-M2.1 => Pass@1: 70.6% (avg-of-3: 71.0, 71.0, 69.8)
  • Minimax-M2.5 => Pass@1: 75.9% (avg-of-3: 76.6, 75.0, 76.0)

There is no officially reported score for these models with mini-swe-agent.
Their official scores on swe-bench-verified are 74.0 and 80.2, respectively.
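
The averages above can be reproduced from the per-run numbers. This is only a worked check of the arithmetic; the gap is measured against the official scores, which were not obtained with mini-swe-agent, so it is not a like-for-like comparison.

```python
from statistics import mean

# Per-run Pass@1 scores and official swe-bench-verified scores, as quoted in the comment.
runs = {
    "Minimax-M2.1": [71.0, 71.0, 69.8],
    "Minimax-M2.5": [76.6, 75.0, 76.0],
}
official = {"Minimax-M2.1": 74.0, "Minimax-M2.5": 80.2}

for model, scores in runs.items():
    avg = mean(scores)
    gap = official[model] - avg
    print(f"{model}: avg={avg:.1f}, gap to official={gap:.1f}")
```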

@wasiahmad wasiahmad merged commit 9517614 into main Feb 16, 2026
5 checks passed
@wasiahmad wasiahmad deleted the mini_swe_agent branch February 16, 2026 19:13