Fixes to support SWE-bench Multilingual #1064

Merged
ludwig-n merged 5 commits into main from ludwig-n/swe-bench-multilingual on Dec 4, 2025

ludwig-n (Collaborator) commented Dec 2, 2025

This PR adds two changes to support evaluation on SWE-bench Multilingual:

  • Adds jq as a dependency for OpenHands. This is required for some SWE-bench Multilingual containers where it is not installed by default.
  • Adds a new config variant for SWE-agent where mentions of Python are removed from the prompt, which is more suitable for multilingual datasets.

These changes should not affect standard evaluation on SWE-bench Verified.

The Slurm tests for SWE-bench pass.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added multilingual configuration for SWE-agent enabling language-agnostic prompt handling with repository context and problem-solving guidance.
  • Infrastructure

    • Integrated jq utility into the OpenHands runtime environment to support SWE-bench evaluation pipeline.


wasiahmad (Collaborator) commented Dec 3, 2025

@ludwig-n

  1. Can you add benchmark numbers for the Qwen3 Coder models? It would be good to have them on record.
  2. Just for my understanding, is jq installed in each instance container (the child ones)?
  3. Did we make any changes to the eval harness for multilingual support?
  4. The prompt config is added only for SWE-agent. Don't we need one for OpenHands (OH), or is it already language-agnostic?

[Update] I found this (Kipok/SWE-bench#2) for question 3. It is good to keep a reference here for the record.

coderabbitai bot (Contributor) commented Dec 3, 2025

Walkthrough

The changes add jq binary support to the OpenHands/SWE-agent environment setup procedure and introduce a new multilingual configuration template for SWE-agent prompts with language-agnostic instruction guidance and tool configuration.

Changes

| Cohort / File(s) | Summary |
|---|---|
| OpenHands/SWE-agent jq tooling setup: nemo_skills/inference/eval/swebench.py | Adds jq binary installation, downloads the binary, sets permissions, creates a symlink in /usr/local/bin, and extends container setup to copy and link the jq directory, mirroring existing tmux/poetry handling. |
| Multilingual SWE-agent configuration: nemo_skills/prompt/config/eval/swe-bench/swe-agent/multilingual.yaml | New configuration file defining language-agnostic SWE-agent prompt templates, instruction flow for minimal file changes, tool environment variables, function calling setup, and model configuration (cost/token limits set to zero for Nemo-Skills override). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • The jq integration in swebench.py follows established patterns (similar to the tmux/poetry setup), but verify proper binary handling and permissions
  • Verify multilingual.yaml configuration is syntactically correct and aligns with SWE-agent requirements
  • Ensure jq additions don't interfere with existing OpenHands environment setup logic

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fixes to support SWE-bench Multilingual' accurately captures the main changes: adding jq dependency support and introducing a multilingual config variant for SWE-agent. |

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48cbf77 and 779a655.

📒 Files selected for processing (2)
  • nemo_skills/inference/eval/swebench.py (2 hunks)
  • nemo_skills/prompt/config/eval/swe-bench/swe-agent/multilingual.yaml (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (5)
nemo_skills/inference/eval/swebench.py (1)

538-547: LGTM! jq integration follows the established pattern.

The implementation for copying jq from /root_mount and creating symbolic links mirrors the approach used for tmux and poetry, ensuring consistency across the codebase.
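
As a rough illustration of the copy-and-link pattern described in this comment, the sketch below builds the shell commands for several tools at once. It is not the code from swebench.py. The helper name `build_tool_link_commands`, the destination directory `/<tool>`, and the assumed layout `/root_mount/<tool>/<tool>` for the executable are all hypothetical; only `/root_mount`, `/usr/local/bin`, and the tool names tmux, poetry, and jq come from the review.

```python
import shlex


def build_tool_link_commands(tools, mount_root="/root_mount", bin_dir="/usr/local/bin"):
    """Return one shell command string that copies each tool out of the mount
    and symlinks its executable into bin_dir.

    Assumed (hypothetical) layout: <mount_root>/<tool>/<tool> is the executable.
    """
    commands = []
    for tool in tools:
        src = f"{mount_root}/{tool}"      # directory prepared outside the container
        dst = f"/{tool}"                  # hypothetical destination inside the container
        binary = f"{dst}/{tool}"
        link = f"{bin_dir}/{tool}"
        commands.append(f"cp -r {shlex.quote(src)} {shlex.quote(dst)}")
        # -s creates a symlink, -f replaces a stale link from an earlier run
        commands.append(f"ln -sf {shlex.quote(binary)} {shlex.quote(link)}")
    return " && ".join(commands)


if __name__ == "__main__":
    print(build_tool_link_commands(["tmux", "poetry", "jq"]))
```

Treating every tool the same way keeps the setup uniform, which is the consistency the comment refers to: adding another dependency would only mean adding one name to the list.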

nemo_skills/prompt/config/eval/swe-bench/swe-agent/multilingual.yaml (4)

1-6: Clear documentation of config origin and purpose.

The header comments appropriately document the base configuration source and the language-agnostic modifications, making the intent clear for future maintainers.


7-35: LGTM! Templates are language-agnostic as intended.

The system and instance templates successfully avoid Python-specific language while providing clear instructions. The step-by-step guidance in the instance template is well-structured and appropriate for multilingual code repositories.
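
One way to check a claim like this mechanically is to scan a template for language-specific terms. The sketch below is only an illustration under stated assumptions: the term list `LANGUAGE_TERMS` and the function `find_language_mentions` are hypothetical and are not part of Nemo-Skills or SWE-agent, and the sample template text is made up for the example.

```python
import re

# Hypothetical list of terms that would tie a prompt to a single language.
LANGUAGE_TERMS = ("python", "pytest", "pip")


def find_language_mentions(template, terms=LANGUAGE_TERMS):
    """Return (line_number, term) pairs for every whole-word,
    case-insensitive occurrence of a term in the template."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for lineno, line in enumerate(template.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits


if __name__ == "__main__":
    sample = "Create a Python script to reproduce the error.\nRun the evaluation pipeline."
    print(find_language_mentions(sample))
```

The whole-word match matters here: a term such as "pip" should not be reported inside an unrelated word such as "pipeline".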


67-78: LGTM! Model configuration with clear override documentation.

The model configuration appropriately sets placeholder values with clear documentation that these are overridden by Nemo-Skills, preventing confusion about which parameters take precedence.
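
The precedence described here, where values in the YAML file are placeholders and the values supplied at run time win, can be pictured as a simple dictionary merge. This is a sketch of the general idea only. How Nemo-Skills actually applies its overrides is not shown in this conversation, and the function `resolve_model_config` and every key name below are hypothetical.

```python
def resolve_model_config(yaml_defaults, runtime_overrides):
    """Merge two flat dicts. Runtime values take precedence;
    a runtime value of None means "not supplied" and is ignored."""
    merged = dict(yaml_defaults)
    merged.update({k: v for k, v in runtime_overrides.items() if v is not None})
    return merged


if __name__ == "__main__":
    # Placeholder values as they might appear in a config file (hypothetical keys).
    placeholders = {"name": "placeholder", "cost_limit": 0, "max_input_tokens": 0}
    # Values supplied by the evaluation pipeline at run time (hypothetical).
    overrides = {"name": "my-served-model", "temperature": 0.2, "max_input_tokens": None}
    print(resolve_model_config(placeholders, overrides))
```

Under this reading, a zero limit in the file is a neutral default rather than a real budget, which is why documenting the override in the config comments helps avoid confusion.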


44-47: The tool bundle paths are legitimate, documented SWE-agent tools. They are standard upstream bundles (tools/registry, tools/edit_anthropic, tools/review_on_submit_m) already used consistently across multiple configuration files in this repository. No verification is needed.

Likely an incorrect or invalid review comment.

ludwig-n merged commit dbfad3d into main on Dec 4, 2025
5 checks passed
ludwig-n deleted the ludwig-n/swe-bench-multilingual branch on December 4, 2025 12:55
melllinia pushed a commit that referenced this pull request Dec 5, 2025
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Jorjeous pushed a commit that referenced this pull request Dec 11, 2025
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

3 participants