feat(exercism): adding codex, claude, gemini and hidden tests flag #325
Merged
4 commits
6bdd8cd feat(exercism): added codex and claude inspect swe agents (lvogel04)
3baf5c9 feat(exercism): add gemini cli + add hidden test flag to remove tests… (lvogel04)
ce802a7 fix gemini agent (lvogel04)
600540e chore: update benchmark docs [skip ci] (github-actions[bot])
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,137 +1,69 @@ | ||
| """ | ||
| Claude code agent implementation. | ||
| Claude Code agent backed by inspect_swe. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import os | ||
| from typing import List | ||
|
|
||
| from inspect_ai.agent import AgentState | ||
| from inspect_ai.model import ChatMessageUser, ModelOutput | ||
|
|
||
| from openbench.utils.cli_commands import format_execution_output | ||
|
|
||
| from .base import BaseCodeAgent | ||
| from openbench.utils.cli_commands import ( | ||
| generate_env_setup_script, | ||
| write_prompt_to_file, | ||
| write_and_execute_script, | ||
| read_log_file, | ||
| format_execution_output, | ||
| get_claude_script_template, | ||
| ) | ||
| from openbench.utils.docker import ClaudeCommands | ||
| from inspect_swe import claude_code | ||
|
|
||
|
|
||
| class ClaudeAgent(BaseCodeAgent): | ||
| """Claude-based code editor with file system access.""" | ||
| class ClaudeCodeAgent(BaseCodeAgent): | ||
| """Claude Code CLI agent via inspect_swe.""" | ||
|
|
||
| def __init__(self): | ||
| super().__init__("claude") | ||
| def __init__(self) -> None: | ||
| super().__init__("claude_code") | ||
|
|
||
| async def execute(self, workdir: str, prompt_text: str, model: str) -> str: | ||
| """Execute Claude Code CLI command. | ||
| """Execute Claude Code.""" | ||
|
|
||
| Args: | ||
| workdir: Working directory path for the task | ||
| prompt_text: The prompt to send to claude code | ||
| model: Model string to use with claude code | ||
|
|
||
| Returns: | ||
| Formatted output string with claude code execution results | ||
| """ | ||
| try: | ||
| # Check for required API key | ||
| anthropic_api_key = os.getenv("ANTHROPIC_API_KEY") | ||
| if not anthropic_api_key: | ||
| return "ERROR: ANTHROPIC_API_KEY is not set" | ||
|
|
||
| # Write prompt to avoid shell quoting issues | ||
| if not await write_prompt_to_file(prompt_text, "claude_code_prompt.txt"): | ||
| return "ERROR: failed to write prompt file" | ||
|
|
||
| # Get environment setup script | ||
| env_setup = generate_env_setup_script() | ||
|
|
||
| # Create claude execution script | ||
| script_content = get_claude_script_template().format( | ||
| workdir=workdir, env_setup=env_setup, model=model | ||
| ) | ||
|
|
||
| # Execute the script | ||
| result = await write_and_execute_script( | ||
| script_content, | ||
| "claude_script.sh", | ||
| timeout=1800, # 30 minutes | ||
| ) | ||
|
|
||
| # Read claude-specific log | ||
| additional_logs = [] | ||
| claude_log = await read_log_file( | ||
| "/tmp/claude-code-output.log", "CLAUDE CODE", tail_lines=200 | ||
| ) | ||
| if claude_log: | ||
| additional_logs.append(claude_log) | ||
|
|
||
| return format_execution_output(result, additional_logs) | ||
|
|
||
| except Exception as e: | ||
| return f"ERROR: Failed to run claude code: {str(e)}" | ||
| claude_agent = claude_code(cwd=workdir, model=model) | ||
| state = AgentState(messages=[ChatMessageUser(content=prompt_text)]) | ||
| completed_state = await claude_agent(state) | ||
| stdout_text = _format_agent_output(completed_state.output) | ||
| result = { | ||
| "returncode": 0, | ||
| "success": True, | ||
| "stdout": stdout_text, | ||
| "stderr": "", | ||
| } | ||
| return format_execution_output(result) | ||
| except Exception as exc: # pragma: no cover - defensive | ||
| return f"ERROR: claude_code execution failed: {exc}" | ||
|
|
||
| def resolve_model(self, state_model: str) -> str: | ||
| """Resolve the appropriate model string for Claude. | ||
|
|
||
| Args: | ||
| state_model: Model from TaskState.model | ||
|
|
||
| Returns: | ||
| Resolved model string for Claude (removes anthropic/ prefix) | ||
| """ | ||
| # Claude CLI uses Anthropic models directly (remove prefix) | ||
| if state_model.startswith("anthropic/"): | ||
| return state_model[len("anthropic/") :] | ||
| return state_model | ||
|
|
||
| def get_setup_commands(self) -> List[str]: | ||
| """Get setup commands required by Claude. | ||
|
|
||
| Returns: | ||
| Empty list (no special setup required) | ||
| """ | ||
| return [] | ||
| """Resolve the appropriate model string for Claude Code.""" | ||
| stripped = (state_model or "").strip() | ||
| return stripped if stripped else self.get_default_model() | ||
|
|
||
| def get_default_model(self) -> str: | ||
| """Get the default model for Claude. | ||
|
|
||
| Returns: | ||
| Default model string | ||
| """ | ||
| return "anthropic/claude-sonnet-4-20250514" | ||
| return "anthropic/claude-sonnet-4-5-20250929" | ||
|
|
||
| def get_description(self) -> str: | ||
| """Get description of Claude. | ||
|
|
||
| Returns: | ||
| Description string | ||
| """ | ||
| return "Claude cli code agent" | ||
| return "Claude Code agent." | ||
|
|
||
| def get_dockerfile_commands(self) -> List[str]: | ||
| """Get Dockerfile commands to install Claude Code CLI. | ||
|
|
||
| Returns: | ||
| List of Dockerfile RUN commands | ||
| """ | ||
| return ClaudeCommands.DOCKERFILE_COMMANDS | ||
|
|
||
| def get_base_packages(self) -> List[str]: | ||
| """Get base packages required by Claude. | ||
| return [] | ||
|
|
||
| Returns: | ||
| List of apt package names | ||
| """ | ||
| return ClaudeCommands.BASE_PACKAGES | ||
|
|
||
| def get_env_requirements(self) -> List[str]: | ||
| """Get environment variables required by Claude. | ||
| def _format_agent_output(output: ModelOutput) -> str: | ||
| """Render agent output as plain text.""" | ||
| if not output or not output.choices: | ||
| return "Agent completed without emitting assistant output." | ||
|
|
||
| Returns: | ||
| List of environment variable names | ||
| """ | ||
| return ["ANTHROPIC_API_KEY"] # Claude specifically requires Anthropic API key | ||
| parts: List[str] = [] | ||
| for idx, choice in enumerate(output.choices, start=1): | ||
| message = choice.message | ||
| text = ( | ||
| message.text.strip() if message and message.text else "" | ||
| ) or "(no text output)" | ||
| parts.append(f"[Choice {idx}] {text}") | ||
| return "\n\n".join(parts) |
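
Two behavior changes in this diff are easy to miss in the flattened view: `resolve_model` no longer strips the `anthropic/` prefix and now falls back to the default model for blank input, and the new `_format_agent_output` helper renders each choice as a labeled block. The sketch below reproduces both as standalone functions so they can be run without openbench or inspect_ai installed. `FakeMessage`, `FakeChoice`, and `FakeOutput` are hypothetical stand-ins for the inspect_ai types, and `resolve_model_old` / `resolve_model_new` are illustrative names that do not appear in the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# New default model string, copied from get_default_model() in the diff.
DEFAULT_MODEL = "anthropic/claude-sonnet-4-5-20250929"


def resolve_model_old(state_model: str) -> str:
    """Pre-PR behavior: strip the 'anthropic/' provider prefix."""
    if state_model.startswith("anthropic/"):
        return state_model[len("anthropic/"):]
    return state_model


def resolve_model_new(state_model: Optional[str]) -> str:
    """Post-PR behavior: keep the prefix; use the default when blank."""
    stripped = (state_model or "").strip()
    return stripped if stripped else DEFAULT_MODEL


# Hypothetical stand-ins for inspect_ai's output, choice, and message types.
@dataclass
class FakeMessage:
    text: str = ""


@dataclass
class FakeChoice:
    message: Optional[FakeMessage] = None


@dataclass
class FakeOutput:
    choices: List[FakeChoice] = field(default_factory=list)


def format_agent_output(output: Optional[FakeOutput]) -> str:
    """Same logic as _format_agent_output in the diff, on the stand-in types."""
    if not output or not output.choices:
        return "Agent completed without emitting assistant output."

    parts: List[str] = []
    for idx, choice in enumerate(output.choices, start=1):
        message = choice.message
        text = (
            message.text.strip() if message and message.text else ""
        ) or "(no text output)"
        parts.append(f"[Choice {idx}] {text}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(resolve_model_old("anthropic/claude-sonnet-4-20250514"))
    print(resolve_model_new("anthropic/claude-sonnet-4-20250514"))
    print(format_agent_output(FakeOutput([FakeChoice(FakeMessage(" done "))])))
```

Note that the new `execute` always reports `"returncode": 0` and `"success": True` when the agent call returns, so a failure is only visible through the `ERROR:` string produced by the `except` branch.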