
[BUG] Default eval model does not exist but the eval suite still runs in full. #222

@bkutasi

Description

`grok-code` is no longer reachable on OpenCode Zen, but the eval runner still uses it as its default model.

Steps to Reproduce

git clone https://github.com/darrenhinde/OpenAgentsControl
cd evals/frameworks
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

Expected Behavior

The eval suite should fail fast and abort when the configured default model cannot be resolved.

Actual Behavior

The eval suite keeps running all tests. Each prompt hits a `ProviderModelNotFoundError` on the server, yet is still reported as `Completed via sdk: success`.

Environment

  • Ubuntu 24
  • OpenCode CLI ver: 1.2.5
  • Installation profile: developer
  • Bash version: 5.2.21

Possible Solution: Change the default model to `big-pickle`, and add error handling so the runner aborts when the configured model cannot be resolved.
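A minimal sketch of what that error handling could look like (the `Provider` shape mirrors the stack trace below; `assertModelExists` and the stubbed provider are hypothetical names, not the project's actual API):

```typescript
// Hypothetical fail-fast check, run once at startup before any tests execute.
// `Provider` / `assertModelExists` are illustrative, not the real eval-framework API.

interface Provider {
  models: Record<string, unknown>;
}

class ModelNotFoundError extends Error {
  constructor(public modelID: string, public available: string[]) {
    super(`Model "${modelID}" not found; available: ${available.join(", ")}`);
    this.name = "ModelNotFoundError";
  }
}

function assertModelExists(provider: Provider, modelID: string): void {
  if (!(modelID in provider.models)) {
    throw new ModelNotFoundError(modelID, Object.keys(provider.models));
  }
}

// Stubbed usage: validate the default model before running the suite,
// and abort instead of letting all 10 tests run against a missing model.
const provider: Provider = { models: { "big-pickle": {} } };
assertModelExists(provider, "big-pickle"); // passes
let aborted = false;
try {
  assertModelExists(provider, "grok-code");
} catch (e) {
  aborted = e instanceof ModelNotFoundError; // runner should exit(1) here
}
console.log(aborted);
```

With a check like this, the missing-model error surfaces once at startup rather than being buried in per-prompt server stderr.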

npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug

> @opencode-agents/eval-framework@0.1.1 eval:sdk
> tsx src/sdk/run-sdk-tests.ts --agent=openagent --pattern=**/golden/*.yaml --debug

🚀 OpenCode SDK Test Runner

Testing agent: openagent

Found 10 test file(s):

  1. shared/tests/golden/08-error-handling.yaml
  2. shared/tests/golden/07-tool-selection.yaml
  3. shared/tests/golden/06-delegation-decision.yaml
  4. shared/tests/golden/05-multi-turn-context.yaml
  5. shared/tests/golden/04-write-with-approval.yaml
  6. shared/tests/golden/03-read-before-write.yaml
  7. shared/tests/golden/02-delegation-test.yaml
  8. shared/tests/golden/02-context-loading.yaml
  9. shared/tests/golden/02-context-loading-explicit.yaml
  10. shared/tests/golden/00-smoke-test.yaml

Loading test cases...
✅ Loaded 10 test case(s)

[TestRunner] DEBUG_VERBOSE enabled - full conversation logging active
[TestRunner] Git root: /media/nvme/projects/OpenAgentsControl
[TestRunner] Server will use eval-runner.md (dynamically configured)
Using default model: opencode/grok-code-fast (free tier)

Starting test runner...
[TestRunner] Configured eval-runner.md with core/openagent.md
Starting opencode server...
[Server STDOUT]: Warning: OPENCODE_SERVER_PASSWORD is not set; server is unsecured.
[Server STDOUT]: opencode server listening on http://127.0.0.1:4096
Server started at http://127.0.0.1:4096
[TestRunner] Multi-agent logging enabled (verbose mode)
[TestRunner] Multi-agent logging enabled (verbose mode)
[TestRunner] Evaluators initialized with SDK client
✅ Test runner started

Running tests...


============================================================
Running test: golden-08-error-handling - Golden 08: Error Handling - Agent Handles Missing Files Gracefully
============================================================
┌──────────────────────────────────────────────────────────┐
│ 🤖 Agent: OpenAgent                                      │
│ 🧠 Model: opencode/grok-code                             │
└──────────────────────────────────────────────────────────┘
Approval strategy: Auto-approve all permission requests
[Handlers] Before setup: 0
[Handlers] After setup: 9
⚠️   Event stream connection not confirmed within timeout

══════════════════════════════════════════════════════════════════════
📋 EVAL-RUNNER.MD CONTENT (System Prompt Being Tested)
══════════════════════════════════════════════════════════════════════

🔧 FRONTMATTER:
──────────────────────────────────────────────────────────────────────
---
name: OpenAgent
description: "Universal agent for answering queries, executing tasks, and coordinating workflows across any domain"
mode: primary
temperature: 0.2
permission:
  bash:
    # 1. Base Rule: Allow standard workflow to prevent nagging
    "*": "allow"

    # 2. Moderate Risk: Ask for confirmation (prevents accidents)
    "rm *": "ask"
    "mv *": "ask"
    "cp -f *": "ask"       # Overwrite copy
    "> *": "ask"           # Overwrite redirection
    ">> *": "ask"          # Append redirection
    "chmod *": "ask"       # Changing file permissions
    "chown *": "ask"       # Changing file ownership
    "git push *": "ask"    # Pushing to remote
    "ssh *": "ask"         # Remote connections

    # 3. High Danger: Deny outright
    "sudo *": "deny"
    "su *": "deny"
    "rm -rf *": "deny"     # Recursive force delete is too dangerous for agents
    "dd *": "deny"         # Disk writing
    "mkfs *": "deny"       # Formatting
    "fdisk *": "deny"
    "parted *": "deny"
    "shutdown *": "deny"
    "reboot *": "deny"
    "halt *": "deny"
    ":(){:|:&};:": "deny"  # Fork bomb
    "cat /etc/shadow": "deny"

  edit:
    "**/*.env*": "deny"
    "**/*.key": "deny"
    "**/*.pem": "deny"     # AWS/SSH keys
    "**/*.secret": "deny"
    "**/id_rsa*": "deny"   # Private SSH keys
    "**/id_dsa*": "deny"
    "**/.bash_history": "deny"
    "**/.zsh_history": "deny"
---

📝 SYSTEM PROMPT (first 50 lines):
──────────────────────────────────────────────────────────────────────
Always use ContextScout for discovery of new tasks or context files.
ContextScout is exempt from the approval gate rule. ContextScout is your secret weapon for quality, use it where possible.
<context>
  <system_context>Universal AI agent for code, docs, tests, and workflow coordination called OpenAgent</system_context>
  <domain_context>Any codebase, any language, any project structure</domain_context>
  <task_context>Execute tasks directly or delegate to specialized subagents</task_context>
  <execution_context>Context-aware execution with project standards enforcement</execution_context>
</context>

<critical_context_requirement>
PURPOSE: Context files contain project-specific standards that ensure consistency,
quality, and alignment with established patterns. Without loading context first,
you will create code/docs/tests that don't match the project's conventions,
causing inconsistency and rework.

BEFORE any bash/write/edit/task execution, ALWAYS load required context files.
(Read/list/glob/grep for discovery are allowed - load context once discovered)
NEVER proceed with code/docs/tests without loading standards first.
AUTO-STOP if you find yourself executing without context loaded.

WHY THIS MATTERS:
- Code without standards/code-quality.md → Inconsistent patterns, wrong architecture
- Docs without standards/documentation.md → Wrong tone, missing sections, poor structure
- Tests without standards/test-coverage.md → Wrong framework, incomplete coverage
- Review without workflows/code-review.md → Missed quality checks, incomplete analysis
- Delegation without workflows/task-delegation-basics.md → Wrong context passed to subagents

Required context files:
- Code tasks → .opencode/context/core/standards/code-quality.md
- Docs tasks → .opencode/context/core/standards/documentation.md
- Tests tasks → .opencode/context/core/standards/test-coverage.md
- Review tasks → .opencode/context/core/workflows/code-review.md
- Delegation → .opencode/context/core/workflows/task-delegation-basics.md

CONSEQUENCE OF SKIPPING: Work that doesn't match project standards = wasted effort + rework
</critical_context_requirement>

<critical_rules priority="absolute" enforcement="strict">
  <rule id="approval_gate" scope="all_execution">
    Request approval before ANY execution (bash, write, edit, task). Read/list ops don't require approval.
  </rule>

  <rule id="stop_on_failure" scope="validation">
    STOP on test fail/errors - NEVER auto-fix
  </rule>
  <rule id="report_first" scope="error_handling">
    On fail: REPORT→PROPOSE FIX→REQUEST APPROVAL→FIX (never auto-fix)
  </rule>
  <rule id="confirm_cleanup" scope="session_management">
    Confirm before deleting session files/cleanup ops

... (609 more lines)
══════════════════════════════════════════════════════════════════════

Creating session...

┌────────────────────────────────────────────────────────────┐
│ 🎯 PARENT: OpenAgent (ses_399c15b0...)                     │
└────────────────────────────────────────────────────────────┘
📋 Session created
Session created: ses_399c15b0effe5xvbrmBumPfziD
Sending 1 prompts (multi-turn)...
Using smart timeout: 60000ms per prompt, max 120000ms absolute

Prompt 1/1:
  Text: Read the file evals/test_tmp/this-file-does-not-exist-12345.txt and show me its contents.

  🤖 Agent: Read the file evals/test_tmp/this-file-does-not-exist-12345.txt and show me its ...
[Server STDERR]: 1102 |     const info = provider.models[modelID]
1103 |     if (!info) {
1104 |       const availableModels = Object.keys(provider.models)
1105 |       const matches = fuzzysort.go(modelID, availableModels, { limit: 3, threshold: -10000 })
1106 |       const suggestions = matches.map((m) => m.target)
1107 |       throw new ModelNotFoundError({ providerID, modelID, suggestions })
                   ^
ProviderModelNotFoundError: ProviderModelNotFoundError
 data: {
  providerID: "opencode",
  modelID: "grok-code",
  suggestions: [],
},

      at getModel (src/provider/provider.ts:1107:13)
✅ Completed via sdk: success
  Completed

All prompts completed
...
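One plausible cause of the `✅ Completed via sdk: success` line despite the server-side `ProviderModelNotFoundError` (an assumption, not confirmed from the code): the SDK call's rejection is caught, logged, and the prompt is still marked successful. A minimal illustration of that anti-pattern:

```typescript
// Illustration only (assumed, not the project's actual code): if the SDK
// rejection is caught and merely logged, the runner reports success anyway.

async function sendPrompt(): Promise<void> {
  // Stands in for the SDK call that fails server-side.
  throw new Error("ProviderModelNotFoundError");
}

async function runPrompt(): Promise<string> {
  try {
    await sendPrompt();
    return "success";
  } catch (e) {
    console.error("[Server STDERR]:", (e as Error).message);
    return "success"; // bug: failure is swallowed, status stays "success"
  }
}

runPrompt().then((status) => console.log(`Completed via sdk: ${status}`));
```

If this is what happens, rethrowing (or returning a failure status) from the catch block would let the runner stop instead of marching through all ten tests.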

Metadata

Assignees: no one assigned
Labels: bug (Something isn't working)
Projects: none
Milestone: none
Relationships: none yet
Development: no branches or pull requests