Support mini-swe-agent as agent harness#1212
Conversation
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml
nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/evaluation/code.md (1)
280-281: ⚠️ Potential issue | 🟡 Minor
Add note clarifying that mini-SWE-agent is not supported for multilingual evaluation.
The swe-bench-multilingual section should explicitly document that mini-SWE-agent does not have multilingual support yet. While the current documentation correctly omits mini-SWE-agent from the multilingual example, this should be made explicit for clarity, especially since mini-SWE-agent is mentioned as a supported alternative in the regular SWE-bench section (lines 160-161).
Consider adding a note like: "Currently, only OpenHands and SWE-agent support multilingual evaluation. Mini-SWE-agent support for multilingual datasets is not yet available."
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/eval/swebench.py`:
- Line 635: Replace the silent fallback to {} and fail-fast when expected key is
missing: change the access of trajectory_dict.get("info", {}) to direct
dictionary access trajectory_dict["info"] inside the function where
trajectory_info is assigned (variable name trajectory_info in swebench.py) so a
missing "info" raises a KeyError; if you need clearer context, wrap the access
in a try/except and re-raise a more specific error mentioning the trajectory id
or source (use the same trajectory_dict/trajectory_info variables) rather than
silently using an empty dict.
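A minimal sketch of the fail-fast access suggested above, in Python. The helper name `get_trajectory_info`, the `source` argument, and the `TrajectoryFormatError` class are hypothetical names chosen for illustration; only `trajectory_dict` and the `"info"` key come from the review comment.

```python
class TrajectoryFormatError(KeyError):
    """Hypothetical error type: a trajectory lacks an expected top-level key."""


def get_trajectory_info(trajectory_dict: dict, source: str) -> dict:
    """Return trajectory_dict["info"], failing loudly if the key is absent.

    `source` is an assumed identifier (for example a file path or instance id)
    used only to make the error message easier to trace.
    """
    try:
        # Direct indexing instead of .get("info", {}) so a missing key is not hidden.
        return trajectory_dict["info"]
    except KeyError as exc:
        raise TrajectoryFormatError(
            f"trajectory from {source!r} has no 'info' key; "
            f"found keys: {sorted(trajectory_dict)}"
        ) from exc
```

Subclassing `KeyError` keeps any existing `except KeyError` handlers working while still attaching the trajectory source to the message.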
In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 40-45: The numbered rules in the prompt skip item 2 (they read 1,
3, 4, 5); update the ordered list in swebench.yaml so the sequence is
consecutive by changing "3. The system executes the command(s) in a subshell" to
"2. The system executes the command(s) in a subshell" and then renumber the
following lines accordingly ("4."→"3.", "5."→"4.") so the list reads 1–4 in
proper order.
- Line 18: The XML-like tag casing is inconsistent: the opening tag
`<IMPORTANT>` is uppercase but the closing tag `</important>` is lowercase;
update the closing tag to exactly match the opener (change `</important>` to
`</IMPORTANT>`) or make both tags consistently lowercase (e.g.,
`<important>`/`</important>`), ensuring the pair is identical and preserving the
surrounding sentence and punctuation in the `swebench.yaml` snippet.
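The tag-casing problem described above could be caught with a small check like the following sketch. It is not part of the PR; the function name and the simple regular expression (bare tags without attributes) are assumptions made for illustration.

```python
import re

# Matches bare tags such as <IMPORTANT> or </important> (no attributes).
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def find_case_mismatched_tags(text: str) -> list[tuple[str, str]]:
    """Return (opener, closer) pairs whose names differ only by letter case."""
    stack: list[str] = []
    mismatches: list[tuple[str, str]] = []
    for slash, name in TAG_RE.findall(text):
        if not slash:
            stack.append(name)  # opening tag
        elif stack and stack[-1].lower() == name.lower():
            opener = stack.pop()
            if opener != name:
                mismatches.append((opener, name))
        # Closing tags that match no open tag are ignored in this sketch.
    return mismatches
```

Running it over the prompt text from a config file would flag a pair like `<IMPORTANT>` / `</important>` while leaving consistently cased pairs alone.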
🧹 Nitpick comments (2)
docs/evaluation/code.md (1)
160-161: Consider adding mini-SWE-agent-specific expected results or a note about expected performance.
The sample run section only shows expected results for OpenHands (`issues_resolved: 48.4`). While the instructions say to replace `openhands` with `mini_swe_agent`, users would benefit from knowing what resolve rate to expect with mini-swe-agent, even approximately. Based on learnings, documentation for benchmarks should include expected results for tested models.
nemo_skills/inference/eval/swebench.py (1)
558-652: Overall implementation looks solid; past review issues are resolved.
The three issues flagged in previous reviews are all addressed:
- Default config path now correctly points to `"eval/swe-bench/mini-swe-agent/swebench"` (line 574)
- Module invocation uses proper dotted path `python -m minisweagent.run.mini` (line 613)
- Search path `{instance_id}.traj.json` matches the `--output` argument (lines 617, 624)
The completion_kwargs construction logic (lines 563–572) is duplicated across `_run_swe_agent`, `_run_mini_swe_agent`, and `_run_openhands`. Consider extracting it into a shared helper to reduce duplication.
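One possible shape for such a shared helper is sketched below. The actual completion_kwargs logic is not shown in this thread, so everything here is an assumption: the `PARAM_NAME_MAP` mapping, the `InferenceConfig` stand-in, and the rule of dropping unset (`None`) values are hypothetical and only illustrate the refactor.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical mapping from internal config field names to API parameter names.
PARAM_NAME_MAP = {
    "tokens_to_generate": "max_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
}


@dataclass
class InferenceConfig:
    """Hypothetical stand-in for the real inference config object."""
    tokens_to_generate: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None


def build_completion_kwargs(inference_cfg: Any) -> dict[str, Any]:
    """Collect the set (non-None) inference parameters under their API names."""
    kwargs: dict[str, Any] = {}
    for cfg_name, api_name in PARAM_NAME_MAP.items():
        value = getattr(inference_cfg, cfg_name, None)
        if value is not None:
            kwargs[api_name] = value
    return kwargs
```

Each of the three runner methods could then call the helper once instead of repeating the construction inline.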
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 21-24: In swebench.yaml update the contradictory prompt
requirement on the response rules: replace the phrase "Provide exactly ONE bash
command to execute" with "Provide AT LEAST ONE bash command to execute" so it
matches the other occurrences and the format_error_template semantics; ensure
the change is made in the response instruction block near the top of the file so
all references (including format_error_template and lines that currently state
"AT LEAST ONE") are consistent.
🧹 Nitpick comments (2)
nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml (1)
175-176: Nit: trailing dot on `cost_limit` value.
`cost_limit: 3.` is valid YAML (parses as `3.0`) but reads oddly. Consider `3.0` for clarity. This is consistent with the same pattern in the sibling `swebench.yaml`, so it's a minor cosmetic point across all three configs.
nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (1)
113-114: Nit: trailing dot on `cost_limit` value.
Same as the sibling config: `cost_limit: 3.` is valid YAML but `3.0` would be more conventional.
no issues from my side, but will leave this to @ludwig-n for final approval. @wasiahmad please fix DCO
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Score for Qwen3-Coder-480B-A35B: ~52.5% (avg of 3 runs). Officially reported score is 55.4%. Good enough to merge in my opinion. With this framework, patches don't always follow the |
Evaluated Minimax-M2.1 and Minimax-M2.5 on swe-bench-verified using mini-swe-agent and got the following scores.
There are no officially reported scores for these models with mini-swe-agent.