feat(exercism): adding codex, claude, gemini and hidden tests flag #325
93e8b55 to 3baf5c9
✅ Benchmark documentation has been automatically updated.
5ce797a to ce802a7
nmayorga7 left a comment
Left a few minor comments. Overall, the additions follow the existing infrastructure, so LGTM. Preemptively stamping to unblock.
    return []


def _format_agent_output(output: ModelOutput) -> str:
_format_agent_output() is duplicated code: it is identical in claude.py and codex.py. Could it move to a shared util?
async def execute(self, workdir: str, prompt_text: str, model: str) -> str:
    """Execute Codex CLI agent."""
    try:
        codex_agent = codex_cli(cwd=workdir, model=model, model_config=model)
Why is model passed twice to codex_cli() (as both model and model_config)?
for model_name in model:
    if not model_name.startswith("anthropic/"):
        raise typer.BadParameter(
            "For claude_code, --model must be an Anthropic model id prefixed with 'anthropic/'. "
Tiny nit for consistency, maybe add an example as in line 706 for openrouter ("Example: --model openrouter/anthropic/claude-sonnet-4-20250514")
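One way the suggested message could look is sketched below. This is not the PR's code: validate_claude_code_models is a hypothetical helper name, ValueError stands in for typer.BadParameter so the snippet runs without typer installed, and the example model id is an assumption derived from the openrouter example quoted above.

```python
def validate_claude_code_models(model: list[str]) -> None:
    """Reject any model id that is not prefixed with 'anthropic/'.

    Hypothetical helper: the PR raises typer.BadParameter inline in the CLI
    command; ValueError is used here only to keep the sketch dependency-free.
    """
    for model_name in model:
        if not model_name.startswith("anthropic/"):
            raise ValueError(
                "For claude_code, --model must be an Anthropic model id "
                "prefixed with 'anthropic/'. "
                # Assumed example id, mirroring the openrouter message.
                "Example: --model anthropic/claude-sonnet-4-20250514"
            )


# A valid id passes silently; an invalid one raises with the example included.
validate_claude_code_models(["anthropic/claude-sonnet-4-20250514"])
```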
## Usage

- Run Exercism evaluation across all languages with the default code agent (opencode):
+ Run Exercism evaluation across all languages with the default code agent (codex):
Curious: why switch the default to codex?
Summary
Livemcpbench: 0.432 (0.051 stderr)
Taubench: retail: 0.789, airline: 0.58
gpqa_diamond: 0.846 (0.021 stderr)
sealqa: 0.287 (0.028 stderr)
mrcr: 8n: 0.401, 4n: 0.577, 2n: 0.711
Humaneval: 0.988 (0.006 stderr)
Mathvista: 0.715 (0.014 stderr)
What are you adding?
Changes Made
Testing
pytest
pre-commit run --all-files
Checklist
Related Issues
Closes #
Additional Context
Note
Adds Codex (now default), Claude Code, and Gemini agents for Exercism and introduces --hidden-tests with sanitized workspaces, plus docs, deps, and Docker/provider updates.
--code-agent changed from opencode to codex across tasks and docs.
--hidden-tests flag; tasks propagate hide_tests and swap prompt to EXERCISM_HIDDEN_TEST_PROMPT.
CodexAgent (inspect_swe), ClaudeCodeAgent (inspect_swe), and GeminiAgent CLIs; exported via openbench.agents and AgentManager.
roo requires openrouter/*; claude_code requires anthropic/*.
eval command accepts --hidden-tests and forwards hide_tests to tasks.
discover_hidden_paths, prepare_hidden_workspace, sync_agent_workspace; rsync-based excludes; Python test filename normalization retained.
rsync to base image; add GeminiCommands; map agent Docker installs (claude_code/codex need none).
GEMINI_API_KEY.
inspect-ai to 0.3.141; add inspect_swe>=0.2.26 and anthropic>=0.69.0 in core and root projects.
exercism eval-group from config and snippets (individual tasks remain).
codex, new agents, and --hidden-tests usage.
tests/test_exercism_cli.py covering hidden path discovery, workspace sync, setup/test runners.
Written by Cursor Bugbot for commit 600540e.
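To make the hidden-tests flow in the summary easier to picture, here is a minimal sketch of the idea. It is not the PR's implementation: only the three function names come from the summary, while their signatures, the HIDDEN_PATTERNS list, and the _ignore_for helper are assumptions, and shutil.copytree is swapped in for the rsync-based excludes the PR uses.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path

# Assumed patterns; the PR's actual rules for what counts as a test are not shown.
HIDDEN_PATTERNS = ("test_*.py", "*_test.py", "tests")


def discover_hidden_paths(workdir: Path, patterns=HIDDEN_PATTERNS) -> list[Path]:
    """Return paths (relative to workdir) whose name matches a hidden pattern."""
    return [
        path.relative_to(workdir)
        for path in sorted(workdir.rglob("*"))
        if any(fnmatch.fnmatch(path.name, pat) for pat in patterns)
    ]


def _ignore_for(root: Path, hidden: list[Path]):
    """Build a copytree ignore callback that skips the hidden paths under root."""
    blocked = {root / h for h in hidden}

    def ignore(directory: str, names: list[str]) -> set[str]:
        return {n for n in names if Path(directory) / n in blocked}

    return ignore


def prepare_hidden_workspace(workdir: Path, hidden: list[Path]) -> Path:
    """Copy workdir into a fresh temp directory, leaving out hidden paths."""
    sanitized = Path(tempfile.mkdtemp(prefix="hidden_ws_"))
    shutil.copytree(
        workdir, sanitized, ignore=_ignore_for(workdir, hidden), dirs_exist_ok=True
    )
    return sanitized


def sync_agent_workspace(sanitized: Path, workdir: Path, hidden: list[Path]) -> None:
    """Copy the agent's edits back without letting them overwrite hidden files."""
    shutil.copytree(
        sanitized, workdir, ignore=_ignore_for(sanitized, hidden), dirs_exist_ok=True
    )
```

In this sketch the agent would run against the directory returned by prepare_hidden_workspace, and the test runner would run against the original workdir after sync_agent_workspace, so the tests stay out of the agent's view but are still executed.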