@lvogel04 (Contributor) commented Nov 25, 2025

Summary

  • Refactors our dynamic Claude agent setup to use the packaged Claude Code agent from inspect_swe
  • Adds Codex CLI agent and Gemini CLI agent
  • Changes default coding agent to codex
  • Adds a --hidden-tests flag that removes test files from the agent workspace to make tasks harder. This better tests models on task ambiguity, which is more realistic for real-world coding tasks. A quick 30-sample test shows codex with gpt-5 scoring 30% lower with no tests.
  • Adds test coverage for CLI commands and updates documentation
  • Ran eval coverage to test the inspect-ai bump. Results:

Livemcpbench: 0.432 (0.051 stderr)
Taubench: retail: 0.789, airline: 0.58
gpqa_diamond: 0.846 (0.021 stderr)
sealqa: 0.287 (0.028 stderr)
mrcr: 8n: 0.401, 4n: 0.577, 2n: 0.711
Humaneval: 0.988 (0.006 stderr)
Mathvista: 0.715 (0.014 stderr)

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues

Closes #

Additional Context


Note

Adds Codex (now default), Claude Code, and Gemini agents for Exercism and introduces --hidden-tests with sanitized workspaces, plus docs, deps, and Docker/provider updates.

  • Exercism benchmark:
    • Default --code-agent changed from opencode to codex across tasks and docs.
    • New --hidden-tests flag; tasks propagate hide_tests and swap prompt to EXERCISM_HIDDEN_TEST_PROMPT.
  • Agents:
    • Add CodexAgent (inspect_swe), ClaudeCodeAgent (inspect_swe), and GeminiAgent CLIs; exported via openbench.agents and AgentManager.
    • Enforce model constraints: roo requires openrouter/*; claude_code requires anthropic/*.
  • CLI/Utils:
    • eval command accepts --hidden-tests and forwards hide_tests to tasks.
    • New workspace hiding pipeline: discover_hidden_paths, prepare_hidden_workspace, sync_agent_workspace; rsync-based excludes; Python test filename normalization retained.
  • Docker/Provider:
    • Add rsync to base image; add GeminiCommands; map agent Docker installs (claude_code/codex need none).
    • Google provider includes GEMINI_API_KEY.
  • Dependencies:
    • Bump inspect-ai to 0.3.141; add inspect_swe>=0.2.26 and anthropic>=0.69.0 in core and root projects.
  • Config/Metadata:
    • Remove exercism eval-group from config and snippets (individual tasks remain).
  • Docs:
    • Update Exercism docs/README/release notes for default codex, new agents, and --hidden-tests usage.
  • Tests:
    • Add tests/test_exercism_cli.py covering hidden path discovery, workspace sync, setup/test runners.
    • Minor updates to Groq provider streaming tests.
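
The workspace hiding pipeline listed above can be pictured with a minimal sketch. This is not the PR's implementation: the PR describes rsync-based excludes, while this version uses `shutil.copytree` so it runs with the standard library alone. The function names mirror `discover_hidden_paths` and `prepare_hidden_workspace` from the summary, but their signatures and the `TEST_PATTERNS` list are assumptions made for illustration.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path

# Hypothetical filename patterns; the PR's real discovery rules are not shown.
TEST_PATTERNS = ("*_test.py", "test_*.py", "*.spec.js", "*_test.go")


def discover_hidden_paths(workspace: Path) -> list[Path]:
    """Return test files under workspace as sorted, workspace-relative paths."""
    hidden = []
    for path in workspace.rglob("*"):
        if path.is_file() and any(
            fnmatch.fnmatch(path.name, pattern) for pattern in TEST_PATTERNS
        ):
            hidden.append(path.relative_to(workspace))
    return sorted(hidden)


def prepare_hidden_workspace(workspace: Path, dest: Path) -> list[Path]:
    """Copy workspace to dest (which must not exist yet), skipping test files."""
    hidden = set(discover_hidden_paths(workspace))

    def ignore(directory: str, names: list[str]) -> list[str]:
        # copytree calls this once per directory; return the names to skip.
        rel_dir = Path(directory).relative_to(workspace)
        return [name for name in names if rel_dir / name in hidden]

    shutil.copytree(workspace, dest, ignore=ignore)
    return sorted(hidden)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "task"
        src.mkdir()
        (src / "solution.py").write_text("def two_fer(): ...\n")
        (src / "solution_test.py").write_text("assert True\n")
        removed = prepare_hidden_workspace(src, Path(tmp) / "agent")
        print([str(p) for p in removed])
```

Copying into a fresh directory, rather than deleting files in place, keeps the original task directory intact so the hidden tests can still be run against the agent's output afterwards.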

Written by Cursor Bugbot for commit 600540e. This will update automatically on new commits.

@lvogel04 marked this pull request as ready for review December 1, 2025 06:11
@github-actions bot (Contributor) commented Dec 1, 2025

✅ Benchmark documentation has been automatically updated.

@socket-security bot commented Dec 1, 2025

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Updated | openai@2.8.0 ⏵ 2.8.1 | 96 | 100 | 100 | 100 | 100 |
| Updated | anthropic@0.73.0 ⏵ 0.74.1 | 97 (+1) | 100 | 100 | 100 | 100 |
| Added | inspect-swe@0.2.26 | 100 | 100 | 100 | 100 | 100 |
| Added | pathlib-abc@0.5.2 | 100 | 100 | 100 | 100 | 100 |
| Updated | inspect-ai@0.3.125 ⏵ 0.3.141 | 100 (+27) | 100 | 100 | 100 | 100 |

@lvogel04 changed the title from "feat(exercism): added codex and claude inspect swe agents" to "feat(exercism): adding codex, claude, gemini and hidden tests flag" Dec 1, 2025
@github-actions bot (Contributor) commented Dec 1, 2025

✅ Benchmark documentation has been automatically updated.

@nmayorga7 (Collaborator) left a comment

left a few minor comments; overall, additions follow existing infrastructure though so lgtm. preemptively stamping to unblock.

return []


def _format_agent_output(output: ModelOutput) -> str:

_format_agent_output() is duplicated code: it is identical in claude.py and codex.py. Move it to a shared util?

async def execute(self, workdir: str, prompt_text: str, model: str) -> str:
    """Execute Codex CLI agent."""
    try:
        codex_agent = codex_cli(cwd=workdir, model=model, model_config=model)

why is model passed in twice to codex_cli()?

for model_name in model:
    if not model_name.startswith("anthropic/"):
        raise typer.BadParameter(
            "For claude_code, --model must be an Anthropic model id prefixed with 'anthropic/'. "

Tiny nit for consistency, maybe add an example as in line 706 for openrouter ("Example: --model openrouter/anthropic/claude-sonnet-4-20250514")
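
A minimal sketch of what the suggestion could look like, assuming a single table of required prefixes drives both the check and the example hint. The prefixes come from the constraints stated in this PR (roo requires openrouter/*, claude_code requires anthropic/*). The helper name `validate_models`, the `EXAMPLE_MODEL` table, and the Anthropic example id are hypothetical, and `ValueError` stands in for `typer.BadParameter` so the snippet has no third-party dependency.

```python
# Required --model prefix per code agent, per the constraints stated in this PR.
REQUIRED_PREFIX = {
    "roo": "openrouter/",
    "claude_code": "anthropic/",
}

# Illustrative model ids, used only to build the hint in the error message.
EXAMPLE_MODEL = {
    "roo": "openrouter/anthropic/claude-sonnet-4-20250514",
    "claude_code": "anthropic/claude-sonnet-4-20250514",
}


def validate_models(code_agent: str, models: list[str]) -> None:
    """Raise ValueError if any model lacks the prefix the agent requires."""
    prefix = REQUIRED_PREFIX.get(code_agent)
    if prefix is None:
        return  # this agent has no prefix constraint
    for model_name in models:
        if not model_name.startswith(prefix):
            raise ValueError(
                f"For {code_agent}, --model must be prefixed with '{prefix}'. "
                f"Example: --model {EXAMPLE_MODEL[code_agent]}"
            )
```

Keeping the example next to the prefix means every constrained agent gets the same message shape, which is the consistency the comment asks for.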

## Usage

- Run Exercism evaluation across all languages with the default code agent (opencode):
+ Run Exercism evaluation across all languages with the default code agent (codex):

curious why the switch of default to codex?
