2 changes: 1 addition & 1 deletion README.md
@@ -50,7 +50,7 @@
## Features

- **Declarative Eval Config**: Define evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as built-in Agent Engines, plus user-defined agents via `engine.custom` (local transport — see [docs/design/custom-engine.md](docs/design/custom-engine.md)).
- **Flexible Judging**: Supports `rule_based`, `script`, and `agent_judge` evaluation strategies.
- **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
- **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.
2 changes: 1 addition & 1 deletion README.zh.md
@@ -50,7 +50,7 @@
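## Features

- **Declarative Eval Config**: Define the evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
- **Multi-Engine Support**: Supports Qoder CLI, Claude Code, and Codex as Agent engines.
- **Multi-Engine Support**: Built-in support for Qoder CLI, Claude Code, and Codex; user-defined agents can also be connected via `engine.custom` (local transport, see [docs/design/custom-engine.md](docs/design/custom-engine.md)).
- **Flexible Judging**: Supports three evaluation strategies: `rule_based` (rule matching), `script` (script scoring), and `agent_judge` (agent scoring).
- **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
- **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.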
847 changes: 847 additions & 0 deletions docs/design/custom-engine.md

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions docs/guide/writing-evals.md
@@ -127,6 +127,59 @@ report:
skill-up run evals/eval.yaml --engine-kwarg bypass_sandbox=true
```

### Custom Engine

When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Only `transport: local` is implemented today; `transport: http` is reserved and currently fails validation with "not yet implemented".

```yaml
engine:
  name: my-agent
  model:
    provider: anthropic
    name: claude-sonnet-4-6
  custom:
    transport: local                  # local (implemented) | http (planned)
    response_format: session_result   # session_result (default) | text
    timeout_seconds: 300
    env:                              # credentials and secrets; NEVER reference these in command/args
      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
    kwargs:                           # non-secret knobs exposed as ${kwargs.<key>}
      profile: production
    local:
      command: /opt/my-agent/bin/run
      args:
        - --input
        - ${input_file}               # path to the SessionInput JSON skill-up writes
        - --output
        - ${output_file}              # path your agent should write its SessionResult JSON to
      cwd: ${workspace}               # optional; confined to the runtime workspace
      input_file: inputs/messages.json            # optional override (relative to workspace)
      output_file: outputs/session-result.json    # optional override
```

Key fields (full contract in [docs/design/custom-engine.md](../design/custom-engine.md)):

- **`transport`** (required): how skill-up invokes your agent.
  - `local`: run `local.command` inside the current runtime via `runtime.Exec`. The agent process can read the runtime workspace, installed skills, fixtures, MCP config, and process environment variables.
  - `http`: call a remote (or local) HTTP agent service. Designed (Phase 2) but not implemented: validation rejects it today with an explicit "not yet implemented" error.
- **`response_format`** (optional, default `session_result`): how skill-up parses the agent's output.
  - `session_result`: read a full `SessionResult` JSON from `local.output_file` (when configured) or stdout. It carries `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`. **Recommended**: it keeps the full context for judges and reports.
  - `text`: take stdout verbatim as `final_message`. skill-up synthesises a minimal transcript (input messages + the assistant reply) so judges still receive a conversation. Use only for simple scripts that do not produce structured output.
- **`timeout_seconds`** (optional): per-call deadline. Falls back to the case-level timeout when unset; when both are set, skill-up takes the smaller of the two so the value handed to the agent matches the real wall-clock budget.
- **`env`** (optional): credentials and secret parameters. Values are injected into the agent process as environment variables. **This is the only channel allowed to carry credentials**: `command` / `args` / `cwd` / `input_file` / `output_file` reject secret-shaped values at config load.
- **`kwargs`** (optional): non-secret knobs exposed to templates as `${kwargs.<key>}`. Unlike `env`, kwargs are subject to the same strict secret rejection as the command-line fields, so they must not carry credentials or credential-shaped keys.

Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, plus environment variables via `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.

Secret-handling rules (enforced at config load):

- `${api_key}` and any kwarg whose key looks like a credential (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, …) cannot be referenced from `command` / `args` / `cwd` / `input_file` / `output_file`. Pass them through `engine.custom.env`, where they reach your agent as process environment variables instead of leaking into process listings.
- `${SOMEVAR:-...}` defaults that contain recognizable credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs) are likewise rejected in command-line contexts.
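
The two rules above could be approximated as follows. This Go sketch is an assumption-laden illustration, not the validator skill-up ships: `checkCommandLineField`, the key fragments, and the value patterns (including their minimum lengths) are all hypothetical stand-ins for the shapes named in the text.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// credentialKeyParts are fragments that make a kwarg key look like a
// credential once case, "_" and "-" are normalised away (assumed list).
var credentialKeyParts = []string{"token", "secret", "apikey"}

// credentialValue approximates the shapes listed above: sk-... (which also
// covers sk-ant-...), ghp_..., AIza..., AKIA..., and three-part JWTs.
var credentialValue = regexp.MustCompile(
	`\b(?:sk-[A-Za-z0-9_-]{8,}|ghp_[A-Za-z0-9]{8,}|AIza[A-Za-z0-9_-]{8,}|AKIA[A-Z0-9]{8,}` +
		`|eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+)`)

// kwargRef captures the key of each ${kwargs.<key>} reference.
var kwargRef = regexp.MustCompile(`\$\{kwargs\.([^}:?]+)`)

func looksLikeCredentialKey(key string) bool {
	k := strings.ToLower(strings.NewReplacer("_", "", "-", "").Replace(key))
	for _, part := range credentialKeyParts {
		if strings.Contains(k, part) {
			return true
		}
	}
	return false
}

// checkCommandLineField is a hypothetical check for command / args / cwd /
// input_file / output_file values.
func checkCommandLineField(field, value string) error {
	if strings.Contains(value, "${api_key}") {
		return fmt.Errorf("%s: ${api_key} must be passed via engine.custom.env", field)
	}
	for _, m := range kwargRef.FindAllStringSubmatch(value, -1) {
		if looksLikeCredentialKey(m[1]) {
			return fmt.Errorf("%s: kwarg %q looks like a credential", field, m[1])
		}
	}
	if credentialValue.MatchString(value) {
		return fmt.Errorf("%s: value contains a credential-shaped literal", field)
	}
	return nil
}

func main() {
	fmt.Println(checkCommandLineField("args", "--profile ${kwargs.profile}"))
	fmt.Println(checkCommandLineField("args", "--auth ${kwargs.bearerToken}"))
	fmt.Println(checkCommandLineField("args", "${KEY:-sk-ant-abc123def456}"))
}
```

Shape-based checks like this are heuristic by nature, which is why routing real credentials through `engine.custom.env` is the rule rather than relying on detection.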

See `docs/design/custom-engine.md` for the full SessionInput / SessionResult schema your agent must conform to.


### MCP configuration

MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP server into Agents such as `claude_code`, `qodercli`, or `codex`; `mocked` makes `internal/mcp` generate a local stdio mock server that is then installed into the Agent like any other MCP server.
52 changes: 52 additions & 0 deletions docs/zh/guide/writing-evals.md
@@ -111,6 +111,58 @@ report:

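`cases.parallelism` is the default case parallelism set in the config file; for an ad-hoc run you can override it with `skill-up run --parallelism N` without editing `eval.yaml`. The command-line override must be between 1 and 256.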

### Custom Engine

When `engine.name` is not one of the built-in engines (`claude_code` / `codex` / `qodercli`), you must also write an `engine.custom` block that tells skill-up how to invoke your Agent. Only `transport: local` is implemented today; `transport: http` is designed but not yet implemented, and validation reports "not yet implemented" directly.

```yaml
engine:
  name: my-agent
  model:
    provider: anthropic
    name: claude-sonnet-4-6
  custom:
    transport: local                  # local (implemented) | http (planned)
    response_format: session_result   # session_result (default) | text
    timeout_seconds: 300
    env:                              # credentials / sensitive parameters; do not reference these in command/args
      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
    kwargs:                           # non-sensitive knobs, exposed to templates as ${kwargs.<key>}
      profile: production
    local:
      command: /opt/my-agent/bin/run
      args:
        - --input
        - ${input_file}               # path to the SessionInput JSON that skill-up writes
        - --output
        - ${output_file}              # path your Agent should write its SessionResult JSON to
      cwd: ${workspace}               # optional; confined to the runtime workspace
      input_file: inputs/messages.json    # optional override (relative to workspace)
      output_file: outputs/session-result.json
```

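Key fields (full contract in [docs/design/custom-engine.md](../../design/custom-engine.md)):

- **`transport`** (required): how skill-up invokes the agent.
  - `local`: run `local.command` inside the current runtime via `runtime.Exec`. The agent process can access the runtime workspace, installed skills, fixtures, MCP config, and process environment variables.
  - `http`: call a remote (or local) HTTP agent service. The Phase 2 design is complete; in this version the validate stage rejects it directly with "not yet implemented".
- **`response_format`** (optional, default `session_result`): how skill-up parses the agent's output.
  - `session_result`: read a full `SessionResult` JSON from `local.output_file` (if configured) or stdout, containing `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`. **Keeping the default is recommended**, since it gives judges and reports the full context.
  - `text`: treat all of stdout as `final_message`. skill-up automatically synthesises a minimal transcript from the input messages plus the assistant reply, so judges still receive a conversation. Suitable only for simple scripts that do not emit structured results.
- **`timeout_seconds`** (optional): timeout for a single call. Falls back to the case-level timeout when unset; when both are set, skill-up takes the smaller value so the budget passed to the agent matches the real wall clock.
- **`env`** (optional): the channel for credentials and sensitive parameters. Values are injected into the agent as process environment variables. **This is the only field allowed to carry credentials**: `command` / `args` / `cwd` / `input_file` / `output_file` reject credential-shaped values at config load.
- **`kwargs`** (optional): non-sensitive knobs, exposed to templates as `${kwargs.<key>}`. Unlike `env`, kwargs also go through the strict credential check and must not carry credential values or credential-shaped keys (such as `token` / `api_key` / `bearerToken`).

Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, plus environment variables in the forms `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.

Credential-handling rules (strictly validated at config load):

- `${api_key}` and any kwarg key that looks like a credential (`token` / `secret` / `api_key` / `apiKey` / `bearerToken`, etc.) may not appear in `command` / `args` / `cwd` / `input_file` / `output_file`; they must be injected into the child process environment through `engine.custom.env`.
- `${SOMEVAR:-...}` defaults that match common credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs, etc.) are likewise rejected in command-line contexts.

See `docs/design/custom-engine.md` for the full SessionInput / SessionResult JSON contract.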

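### MCP configuration

MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP Server into Agents such as `claude_code`, `qodercli`, or `codex`; `mocked` has `internal/mcp` generate a local stdio Mock Server, which is then installed into the Agent like any ordinary MCP config.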
133 changes: 133 additions & 0 deletions e2e/custom_engine_test.go
@@ -0,0 +1,133 @@
//go:build e2e

package e2e

import (
	"os"
	"path/filepath"
	"runtime"
	"strings"
	"testing"
	"time"
)

// getCustomEngineTestdataDir returns the path to e2e/testdata/custom-engine.
func getCustomEngineTestdataDir() string {
	_, testFile, _, _ := runtime.Caller(0)
	projectRoot := filepath.Dir(filepath.Dir(testFile))
	return filepath.Join(projectRoot, "e2e", "testdata", "custom-engine")
}

// skipIfNoPOSIXShell skips the test on platforms where the agent.sh fixture
// cannot run. agent.sh uses bash with `set -euo pipefail` and is not portable
// to Windows; the test infrastructure for cross-platform fixtures is tracked
// in issue #54.
func skipIfNoPOSIXShell(t *testing.T) {
	t.Helper()
	if runtime.GOOS == "windows" {
		t.Skip("agent.sh fixture is POSIX-only; see issue #54")
	}
}

// customEngineEnv returns an env slice that points the custom engine's
// ${CUSTOM_AGENT_BIN} placeholder at the fixture's agent.sh, while
// preserving PATH and HOME so the test still resolves system tools.
func customEngineEnv(agentBin string) []string {
	return []string{
		"PATH=" + os.Getenv("PATH"),
		"HOME=" + os.Getenv("HOME"),
		"CUSTOM_AGENT_BIN=" + agentBin,
	}
}

// TestPipeline_CustomEngine_LocalTransport runs the entire evaluation pipeline
// against a user-defined Custom Engine (transport: local). It verifies that:
//  1. The eval loads and the custom engine block is env-resolved and validated.
//  2. The framework writes the SessionInput JSON to ${input_file}.
//  3. The agent's stdout is parsed back into a SessionResult.
//  4. final_message reaches the report and an expect.must_contain rule passes.
func TestPipeline_CustomEngine_LocalTransport(t *testing.T) {
	skipIfNoPOSIXShell(t)
	t.Parallel()

	dir := getCustomEngineTestdataDir()
	evalPath := filepath.Join(dir, "evals", "eval.yaml")
	agentBin := filepath.Join(dir, "agent.sh")

	outputDir := t.TempDir()
	result := Run(t, RunConfig{
		Env:     customEngineEnv(agentBin),
		WorkDir: dir,
		Timeout: 60 * time.Second,
	}, "run", evalPath, "--no-delete", "--output-dir", outputDir)

	if result.ExitCode != 0 {
		t.Fatalf("custom engine run failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
			result.ExitCode, result.Stdout, result.Stderr)
	}
	if !strings.Contains(result.Stdout, "Running evaluation") {
		t.Errorf("expected runner stage log in output, got:\n%s", result.Stdout)
	}
	if !strings.Contains(result.Stdout, "PASS") {
		t.Errorf("expected at least one PASS case, got:\n%s", result.Stdout)
	}
	if !strings.Contains(result.Stdout, "1 passed") {
		t.Errorf("expected the summary line to report 1 passed, got:\n%s", result.Stdout)
	}
}

// TestPipeline_CustomEngine_Validate runs `skill-up validate` against the
// custom engine eval.yaml. This exercises the post-load
// config.ResolveCustomEngineConfig path that validate.go invokes, including
// env-variable resolution of ${CUSTOM_AGENT_BIN}.
func TestPipeline_CustomEngine_Validate(t *testing.T) {
	skipIfNoPOSIXShell(t)
	t.Parallel()

	dir := getCustomEngineTestdataDir()
	evalPath := filepath.Join(dir, "evals", "eval.yaml")
	agentBin := filepath.Join(dir, "agent.sh")

	result := Run(t, RunConfig{
		Env:     customEngineEnv(agentBin),
		WorkDir: dir,
		Timeout: 30 * time.Second,
	}, "validate", evalPath)

	if result.ExitCode != 0 {
		t.Fatalf("validate failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
			result.ExitCode, result.Stdout, result.Stderr)
	}
	if !strings.Contains(result.Stdout, "is valid") {
		t.Errorf("expected validation success message, got:\n%s", result.Stdout)
	}
}

// TestPipeline_CustomEngine_RejectsMissingEnvVar verifies the load-time env
// resolution: when ${CUSTOM_AGENT_BIN} is unset, `run` must fail with a
// configuration error before exec'ing anything.
func TestPipeline_CustomEngine_RejectsMissingEnvVar(t *testing.T) {
	skipIfNoPOSIXShell(t)
	t.Parallel()

	dir := getCustomEngineTestdataDir()
	evalPath := filepath.Join(dir, "evals", "eval.yaml")

	// Note: CUSTOM_AGENT_BIN intentionally unset.
	result := Run(t, RunConfig{
		Env: []string{
			"PATH=" + os.Getenv("PATH"),
			"HOME=" + os.Getenv("HOME"),
		},
		WorkDir: dir,
		Timeout: 30 * time.Second,
	}, "run", evalPath, "--no-delete", "--output-dir", t.TempDir())

	if result.ExitCode == 0 {
		t.Fatalf("expected run to fail with missing env var, got exit 0\nstdout:\n%s", result.Stdout)
	}
	combined := result.Stdout + result.Stderr
	if !strings.Contains(combined, "CUSTOM_AGENT_BIN") {
		t.Errorf("expected the error to name the missing env var, got:\n%s", combined)
	}
}
9 changes: 9 additions & 0 deletions e2e/testdata/custom-engine/SKILL.md
@@ -0,0 +1,9 @@
# Custom Engine e2e fixture

This directory is a minimal skill-up fixture that exercises the **Custom
Engine** local transport end-to-end. It is consumed by `e2e/custom_engine_test.go`.

`agent.sh` is a deterministic stand-in for a real custom agent: it reads the
`SessionInput` JSON the framework writes to `${input_file}` and emits a fixed
`SessionResult` on stdout. The case asserts that `final_message` flows from
the agent's stdout into the report.
28 changes: 28 additions & 0 deletions e2e/testdata/custom-engine/agent.sh
@@ -0,0 +1,28 @@
#!/bin/bash
# custom-engine/agent.sh: a deterministic mock for the Custom Engine local
# transport. The agent receives the SessionInput JSON path as its first arg
# (per the eval.yaml in this directory) and emits a SessionResult on stdout.
#
# The script intentionally ignores the input content and returns a fixed
# response so the e2e test is deterministic. It still verifies that the
# framework actually wrote the input file at the configured path, which is the
# main piece of the transport contract this test exercises.
set -euo pipefail

INPUT_FILE="${1:-}"
if [[ -z "$INPUT_FILE" || ! -f "$INPUT_FILE" ]]; then
  echo "agent.sh: SessionInput file not provided or missing (got: '${INPUT_FILE:-}')" >&2
  exit 1
fi

cat <<'JSON'
{
  "exit_code": 0,
  "final_message": "custom-engine-handled",
  "turns": 1,
  "transcript": [
    {"role": "user", "content": "ping"},
    {"role": "assistant", "content": "custom-engine-handled"}
  ]
}
JSON
7 changes: 7 additions & 0 deletions e2e/testdata/custom-engine/evals/cases/hello.yaml
@@ -0,0 +1,7 @@
id: hello
title: Custom engine emits expected final_message
input:
  prompt: ping the custom engine
expect:
  must_contain:
    - "custom-engine-handled"
28 changes: 28 additions & 0 deletions e2e/testdata/custom-engine/evals/eval.yaml
@@ -0,0 +1,28 @@
schema_version: v1alpha1

environment:
  type: none

mcp:
  servers: []

engine:
  name: e2e-custom-agent
  custom:
    transport: local
    response_format: session_result
    local:
      command: ${CUSTOM_AGENT_BIN}
      args:
        - ${input_file}

cases:
  files:
    - evals/cases/hello.yaml
  defaults:
    timeout_seconds: 60
    max_turns: 3
  parallelism: 1

report:
  formats: [json]