feat(exercism): adding codex, claude, gemini and hidden tests flag #325

lvogel04 · 2025-11-25T02:44:40Z

Summary

Refactors our dynamic Claude agent setup to use the packaged Claude Code agent from inspect_swe
Adds a Codex CLI agent and a Gemini CLI agent
Changes the default coding agent to codex
Adds a hidden-tests flag that removes test files from the agent workspace to make tasks harder. This better tests models on task ambiguity, which is closer to real-world coding tasks. A quick 30-sample test shows codex with gpt-5 scoring 30% lower without tests.
Adds test coverage for CLI commands and updates documentation
Ran eval coverage to test the inspect-ai bump. Results:

Livemcpbench: 0.432 (0.051 stderr)
Taubench: retail: 0.789, airline: 0.58
gpqa_diamond: 0.846 (0.021 stderr)
sealqa: 0.287 (0.028 stderr)
mrcr: 8n: 0.401, 4n: 0.577, 2n: 0.711
Humaneval: 0.988 (0.006 stderr)
Mathvista: 0.715 (0.014 stderr)
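
For readers comparing these numbers against earlier runs, the reported standard errors can be turned into approximate intervals. The sketch below is not part of the PR; it assumes a normal approximation (z = 1.96 for a two-sided 95% interval) and covers only the benchmarks that report a stderr above.

```python
# Reported (accuracy, stderr) pairs copied from the results above.
RESULTS = {
    "livemcpbench": (0.432, 0.051),
    "gpqa_diamond": (0.846, 0.021),
    "sealqa": (0.287, 0.028),
    "humaneval": (0.988, 0.006),
    "mathvista": (0.715, 0.014),
}

Z_95 = 1.96  # assumption: normal approximation for a two-sided 95% interval


def confidence_interval(
    score: float, stderr: float, z: float = Z_95
) -> tuple[float, float]:
    """Return (low, high), clipped to the valid accuracy range [0, 1]."""
    half_width = z * stderr
    return max(0.0, score - half_width), min(1.0, score + half_width)


if __name__ == "__main__":
    for name, (score, stderr) in RESULTS.items():
        low, high = confidence_interval(score, stderr)
        print(f"{name}: {score:.3f} (95% CI {low:.3f} to {high:.3f})")
```

With small sample counts the normal approximation is rough, so these intervals are a sanity check rather than a significance test.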

What are you adding?

Changes Made

Testing

I have run the existing test suite (pytest)
I have added tests for my changes
I have tested with multiple model providers (if applicable)
I have run pre-commit hooks (pre-commit run --all-files)

Checklist

My code follows the project's style guidelines
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (if applicable)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Related Issues

Closes #

Additional Context

Note

Adds Codex (now default), Claude Code, and Gemini agents for Exercism and introduces --hidden-tests with sanitized workspaces, plus docs, deps, and Docker/provider updates.

Exercism benchmark:
- Default --code-agent changed from opencode to codex across tasks and docs.
- New --hidden-tests flag; tasks propagate hide_tests and swap prompt to EXERCISM_HIDDEN_TEST_PROMPT.
Agents:
- Add CodexAgent (inspect_swe), ClaudeCodeAgent (inspect_swe), and GeminiAgent CLIs; exported via openbench.agents and AgentManager.
- Enforce model constraints: roo requires openrouter/*; claude_code requires anthropic/*.
CLI/Utils:
- eval command accepts --hidden-tests and forwards hide_tests to tasks.
- New workspace hiding pipeline: discover_hidden_paths, prepare_hidden_workspace, sync_agent_workspace; rsync-based excludes; Python test filename normalization retained.
Docker/Provider:
- Add rsync to base image; add GeminiCommands; map agent Docker installs (claude_code/codex need none).
- Google provider includes GEMINI_API_KEY.
Dependencies:
- Bump inspect-ai to 0.3.141; add inspect_swe>=0.2.26 and anthropic>=0.69.0 in core and root projects.
Config/Metadata:
- Remove exercism eval-group from config and snippets (individual tasks remain).
Docs:
- Update Exercism docs/README/release notes for default codex, new agents, and --hidden-tests usage.
Tests:
- Add tests/test_exercism_cli.py covering hidden path discovery, workspace sync, setup/test runners.
- Minor updates to Groq provider streaming tests.
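
To make the hidden-tests idea concrete, here is a minimal sketch of the "discover, then copy without tests" flow described under CLI/Utils. It is not the PR's code: the function names `find_test_paths` and `copy_without_tests` are hypothetical stand-ins for `discover_hidden_paths` and `prepare_hidden_workspace` (whose real signatures are not shown here), the filename patterns are illustrative, and `shutil.copytree` with an `ignore` callable is swapped in for the rsync-based excludes the PR uses.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: illustrative patterns only; the PR's real detection rules are not shown.
TEST_PATTERNS = ("test_*.py", "*_test.py", "*_test.go", "*.spec.js", "*.test.ts")


def find_test_paths(workspace: Path) -> set[Path]:
    """Hypothetical stand-in for discover_hidden_paths: relative paths of test files."""
    found: set[Path] = set()
    for pattern in TEST_PATTERNS:
        for path in workspace.rglob(pattern):
            if path.is_file():
                found.add(path.relative_to(workspace))
    return found


def copy_without_tests(workspace: Path, destination: Path) -> set[Path]:
    """Hypothetical stand-in for prepare_hidden_workspace.

    Copies the workspace to `destination` (which must not exist yet), skipping
    test files, and returns the relative paths that were left out.
    """
    hidden = find_test_paths(workspace)

    def ignore(directory: str, names: list[str]) -> list[str]:
        # copytree calls this once per directory; return the names to skip.
        rel_dir = Path(directory).relative_to(workspace)
        return [name for name in names if rel_dir / name in hidden]

    shutil.copytree(workspace, destination, ignore=ignore)
    return hidden


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "task"
        src.mkdir()
        (src / "solution.py").write_text("def answer():\n    return 42\n")
        (src / "test_solution.py").write_text("from solution import answer\n")
        hidden = copy_without_tests(src, Path(tmp) / "agent_view")
        print("hidden:", sorted(str(p) for p in hidden))
```

The agent would then run against the sanitized copy. Syncing its edits back to the original workspace before the real tests run (the role of `sync_agent_workspace`) is not covered by this sketch.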

Written by Cursor Bugbot for commit 600540e. This will update automatically on new commits.


github-actions · 2025-12-01T06:11:40Z

✅ Benchmark documentation has been automatically updated.

socket-security · 2025-12-01T06:11:59Z

Review the following changes in direct dependencies.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	openai@2.8.0 ⏵ 2.8.1
	anthropic@0.73.0 ⏵ 0.74.1	⁺¹
	inspect-swe@0.2.26
	pathlib-abc@0.5.2
	inspect-ai@0.3.125 ⏵ 0.3.141	⁺²⁷


src/openbench/evals/exercism/exercism.py

src/openbench/utils/cli_commands.py

github-actions · 2025-12-01T06:38:04Z

✅ Benchmark documentation has been automatically updated.

nmayorga7

Left a few minor comments. Overall, the additions follow the existing infrastructure, so LGTM. Preemptively stamping to unblock.

nmayorga7 · 2025-12-10T20:16:25Z

src/openbench/agents/codex.py

+        return []
+
+
+def _format_agent_output(output: ModelOutput) -> str:


_format_agent_output() is duplicated: it is identical in claude.py and codex.py. Could it move to a shared util?
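
One possible shape for such a shared helper is sketched below. The body of the real `_format_agent_output` is not shown in this thread, so the formatting logic here is a placeholder; `format_agent_output`, `HasCompletion`, and `FakeOutput` are hypothetical names, and the only thing assumed about `ModelOutput` is that it exposes a text `completion`.

```python
from dataclasses import dataclass
from typing import Protocol


class HasCompletion(Protocol):
    """Structural type: anything exposing a text `completion`."""

    @property
    def completion(self) -> str: ...


def format_agent_output(output: HasCompletion, fallback: str = "(no output)") -> str:
    """Hypothetical shared helper that both agent modules could import.

    Placeholder logic: return the stripped completion text, or a fallback
    string when the agent produced nothing.
    """
    text = (output.completion or "").strip()
    return text if text else fallback


@dataclass
class FakeOutput:
    """Stand-in for inspect_ai's ModelOutput so the sketch runs without it."""

    completion: str


if __name__ == "__main__":
    print(format_agent_output(FakeOutput("  patch applied\n")))
    print(format_agent_output(FakeOutput("")))
```

With something like this in one module, claude.py and codex.py would each import the helper instead of carrying their own copy.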

nmayorga7 · 2025-12-10T20:19:07Z

src/openbench/agents/codex.py

+    async def execute(self, workdir: str, prompt_text: str, model: str) -> str:
+        """Execute Codex CLI agent."""
+        try:
+            codex_agent = codex_cli(cwd=workdir, model=model, model_config=model)


Why is model passed twice to codex_cli(), as both model and model_config?

nmayorga7 · 2025-12-10T20:21:29Z

src/openbench/_cli/eval_command.py

+        for model_name in model:
+            if not model_name.startswith("anthropic/"):
+                raise typer.BadParameter(
+                    "For claude_code, --model must be an Anthropic model id prefixed with 'anthropic/'. "


Tiny nit for consistency: maybe add an example, as on line 706 for openrouter ("Example: --model openrouter/anthropic/claude-sonnet-4-20250514")
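
A sketch of what the suggested message could look like follows. It is standalone and dependency-free, so it raises `ValueError` where the PR raises `typer.BadParameter`. The function name `validate_claude_code_models` is hypothetical, and the example model id is an assumption derived from the OpenRouter example quoted above.

```python
ANTHROPIC_PREFIX = "anthropic/"
# Assumption: illustrative model id, derived from the OpenRouter example above.
EXAMPLE_MODEL = "anthropic/claude-sonnet-4-20250514"


def validate_claude_code_models(models: list[str]) -> None:
    """Raise ValueError if any model id lacks the 'anthropic/' prefix."""
    for model_name in models:
        if not model_name.startswith(ANTHROPIC_PREFIX):
            raise ValueError(
                "For claude_code, --model must be an Anthropic model id "
                f"prefixed with '{ANTHROPIC_PREFIX}'. "
                f"Example: --model {EXAMPLE_MODEL} (got: {model_name!r})"
            )


if __name__ == "__main__":
    validate_claude_code_models([EXAMPLE_MODEL])  # passes silently
    try:
        validate_claude_code_models(["openai/gpt-5"])
    except ValueError as err:
        print(err)
```

Echoing the rejected value alongside the example makes the error actionable without a trip to the docs.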

nmayorga7 · 2025-12-10T20:22:32Z

docs/evals/exercism.mdx

 ## Usage

-Run Exercism evaluation across all languages with the default code agent (opencode):
+Run Exercism evaluation across all languages with the default code agent (codex):


Curious: why switch the default to codex?

lvogel04 added 2 commits December 1, 2025 00:10

feat(exercism): added codex and claude inspect swe agents

6bdd8cd

feat(exercism): add gemini cli + add hidden test flag to remove tests…

3baf5c9

… from agent run

lvogel04 force-pushed the fix/exercism2 branch from 93e8b55 to 3baf5c9 Compare December 1, 2025 06:11

lvogel04 marked this pull request as ready for review December 1, 2025 06:11

lvogel04 requested review from AarushSah and nmayorga7 as code owners December 1, 2025 06:11

lvogel04 changed the title ~~feat(exercism): added codex and claude inspect swe agents~~ feat(exercism): adding codex, claude, gemini and hidden tests flag Dec 1, 2025

cursor bot reviewed Dec 1, 2025

src/openbench/evals/exercism/exercism.py (resolved)

src/openbench/utils/cli_commands.py (outdated, resolved)

fix gemini agent

ce802a7

lvogel04 force-pushed the fix/exercism2 branch from 5ce797a to ce802a7 Compare December 1, 2025 06:37

chore: update benchmark docs [skip ci]

600540e

nmayorga7 approved these changes Dec 10, 2025
