alibaba · zpzjzj · May 26, 2026 · Jun 1, 2026
@@ -50,7 +50,7 @@
 ## Features
 
 - **Declarative Eval Config**: Define evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
-- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
+- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as built-in Agent Engines, plus user-defined agents via `engine.custom` (local transport — see [docs/design/custom-engine.md](docs/design/custom-engine.md)).
 - **Flexible Judging**: Supports `rule_based`, `script`, and `agent_judge` evaluation strategies.
 - **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
 - **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.

@@ -50,7 +50,7 @@
 ## 特性
 
 - **声明式评测配置**：通过 YAML（`eval.yaml` + `cases/*.yaml`）定义评测环境、引擎、模型和用例。
-- **多引擎支持**：支持 Qoder CLI、Claude Code、Codex 等 Agent 引擎。
+- **多引擎支持**：内置支持 Qoder CLI、Claude Code、Codex；亦可通过 `engine.custom` 接入用户自定义 Agent（本地传输，详见 [docs/design/custom-engine.md](docs/design/custom-engine.md)）。
 - **灵活评分**：支持 `rule_based`（规则匹配）、`script`（脚本评分）、`agent_judge`（Agent 评分）三种评估策略。
 - **结构化报告**：输出 Anthropic 兼容的 `grading.json`、`benchmark.json`、`benchmark.md`，以及 `result.json`、JUnit XML 和 HTML 报告。
 - **Anthropic 兼容**：通过 `skill-up import` 导入 `evals.json`，或使用 `--auto` 自动识别。

@@ -127,6 +127,59 @@ report:
 skill-up run evals/eval.yaml --engine-kwarg bypass_sandbox=true
 ```
 
+### Custom Engine
+
+When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Only `transport: local` is implemented today; `transport: http` is reserved and currently fails validation with "not yet implemented".
+
+```yaml
+engine:
+  name: my-agent
+  model:
+    provider: anthropic
+    name: claude-sonnet-4-6
+  custom:
+    transport: local             # local (implemented) | http (planned)
+    response_format: session_result   # session_result (default) | text
+    timeout_seconds: 300
+    env:                         # credentials and secrets — NEVER reference these in command/args
+      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
+    kwargs:                      # non-secret knobs exposed as ${kwargs.<key>}
+      profile: production
+    local:
+      command: /opt/my-agent/bin/run
+      args:
+        - --input
+        - ${input_file}          # path to the SessionInput JSON skill-up writes
+        - --output
+        - ${output_file}         # path your agent should write its SessionResult JSON to
+      cwd: ${workspace}          # optional; confined to the runtime workspace
+      input_file: inputs/messages.json   # optional override (relative to workspace)
+      output_file: outputs/session-result.json   # optional override
+```
+
+Key fields (full contract in [docs/design/custom-engine.md](../design/custom-engine.md)):
+
+- **`transport`** (required) — how skill-up invokes your agent.
+  - `local`: run `local.command` inside the current runtime via `runtime.Exec`. The agent process can read the runtime workspace, installed skills, fixtures, MCP config, and process environment variables.
+  - `http`: call a remote (or local) HTTP agent service. Designed as part of Phase 2 but not yet implemented; validation rejects it today with an explicit "not yet implemented" error.
+- **`response_format`** (optional, default `session_result`) — how skill-up parses the agent's output.
+  - `session_result`: read a full `SessionResult` JSON from `local.output_file` (when configured) or stdout. Carries `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`. **Recommended**: keeps the full context for judges and reports.
+  - `text`: take stdout verbatim as `final_message`. skill-up synthesises a minimal transcript (input messages + the assistant reply) so judges still receive a conversation. Use only for simple scripts that do not produce structured output.
+- **`timeout_seconds`** (optional) — per-call deadline. Falls back to the case-level timeout when unset; when both are set, skill-up takes the smaller of the two so the value handed to the agent matches the real wall-clock budget.
+- **`env`** (optional) — credentials and secret parameters. Values are injected into the agent process as environment variables. **This is the only channel allowed to carry credentials**: `command` / `args` / `cwd` / `input_file` / `output_file` reject secret-shaped values at config load.
+- **`kwargs`** (optional) — non-secret knobs exposed to templates as `${kwargs.<key>}`. Unlike `env`, kwargs are subject to the same strict secret-rejection as command-line fields, so they must not carry credentials or credential-shaped keys.
+
+Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
+`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, plus environment variables via `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.
+
+Secret-handling rules (enforced at config load):
+
+- `${api_key}` and any kwarg whose key looks like a credential (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, …) cannot be referenced from `command` / `args` / `cwd` / `input_file` / `output_file`. Pass them through `engine.custom.env`, where they reach your agent as process environment variables instead of leaking into process listings.
+- `${SOMEVAR:-...}` defaults that contain recognizable credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs) are likewise rejected in command-line contexts.
+
+See `docs/design/custom-engine.md` for the full SessionInput / SessionResult schema your agent must conform to.
+
+
 ### MCP configuration
 
 MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP server into Agents such as `claude_code`, `qodercli`, or `codex`; `mocked` makes `internal/mcp` generate a local stdio mock server that is then installed into the Agent like any other MCP server.

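The secret-handling rules in the Custom Engine guide hunk above can be illustrated with a small Go sketch. This is not skill-up's actual validator: the function names (`isCredentialKey`, `looksLikeCredential`, `checkCommandField`), the key-normalisation step, and the exact patterns are assumptions. It covers only the shapes the guide names explicitly (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, and the `sk-`, `ghp_`, `AIza`, `AKIA`, JWT value shapes).

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Substrings that mark a kwarg key as credential-shaped once the key is
// lowercased and stripped of "_" and "-". Only shapes named in the guide
// are listed here; the real list may be longer (assumption).
var credentialKeyParts = []string{"token", "secret", "apikey"}

// Literal value prefixes named in the guide ("sk-" also covers "sk-ant-").
var credentialValuePrefixes = []string{"sk-", "ghp_", "AIza", "AKIA"}

var (
	// Three base64url segments, the first starting with "eyJ" (a JWT shape).
	jwtPattern = regexp.MustCompile(`^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$`)
	// ${kwargs.<key>} references.
	kwargRef = regexp.MustCompile(`\$\{kwargs\.([A-Za-z0-9_-]+)\}`)
	// ${VAR:-default} references; group 1 is the default value.
	defaultRef = regexp.MustCompile(`\$\{[A-Za-z_][A-Za-z0-9_]*:-([^}]*)\}`)
)

// isCredentialKey reports whether a kwarg key looks like a credential name.
func isCredentialKey(key string) bool {
	normalized := strings.NewReplacer("_", "", "-", "").Replace(strings.ToLower(key))
	for _, part := range credentialKeyParts {
		if strings.Contains(normalized, part) {
			return true
		}
	}
	return false
}

// looksLikeCredential reports whether a literal value has a known secret shape.
func looksLikeCredential(value string) bool {
	for _, prefix := range credentialValuePrefixes {
		if strings.HasPrefix(value, prefix) {
			return true
		}
	}
	return jwtPattern.MatchString(value)
}

// checkCommandField applies the rules to one command-line field
// (command, args entry, cwd, input_file, or output_file).
func checkCommandField(field string) error {
	if strings.Contains(field, "${api_key}") {
		return errors.New("${api_key} must be passed through engine.custom.env")
	}
	for _, m := range kwargRef.FindAllStringSubmatch(field, -1) {
		if isCredentialKey(m[1]) {
			return fmt.Errorf("kwarg %q looks like a credential; use engine.custom.env", m[1])
		}
	}
	for _, m := range defaultRef.FindAllStringSubmatch(field, -1) {
		if looksLikeCredential(m[1]) {
			return errors.New("default value looks like a credential; use engine.custom.env")
		}
	}
	return nil
}

func main() {
	// AGENT_KEY and the kwarg names below are hypothetical examples.
	for _, field := range []string{
		"--profile=${kwargs.profile}",
		"--auth=${kwargs.bearerToken}",
		"--key=${AGENT_KEY:-sk-example}",
	} {
		fmt.Printf("%-36s -> %v\n", field, checkCommandField(field))
	}
}
```

A substring match on the normalised key is deliberately coarse: it would also flag a harmless key such as one that merely contains "token", which is arguably the safer failure mode for a check whose purpose is to keep secrets out of process listings.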
@@ -111,6 +111,58 @@ report:
 
 `cases.parallelism` 是配置文件中的默认用例并行数；临时运行时可以用 `skill-up run --parallelism N` 覆盖它，不需要修改 `eval.yaml`。命令行覆盖值必须在 1 到 256 之间。
 
+### 自定义 Engine（Custom Engine）
+
+当 `engine.name` 不是内置引擎（`claude_code` / `codex` / `qodercli`）时，必须再写一个 `engine.custom` 段来告诉 skill-up 怎么调用你的 Agent。当前只实现了 `transport: local`；`transport: http` 已设计但尚未实现，validation 会直接报 "not yet implemented"。
+
+```yaml
+engine:
+  name: my-agent
+  model:
+    provider: anthropic
+    name: claude-sonnet-4-6
+  custom:
+    transport: local              # local（已实现）| http（规划中）
+    response_format: session_result  # session_result（默认）| text
+    timeout_seconds: 300
+    env:                          # 凭据 / 敏感参数 —— 不要在 command/args 里引用这些
+      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
+    kwargs:                       # 非敏感开关，模板里以 ${kwargs.<key>} 暴露
+      profile: production
+    local:
+      command: /opt/my-agent/bin/run
+      args:
+        - --input
+        - ${input_file}           # skill-up 写入的 SessionInput JSON 路径
+        - --output
+        - ${output_file}          # 你的 Agent 应写入 SessionResult JSON 的路径
+      cwd: ${workspace}           # 可选；被限制在 runtime workspace 内
+      input_file: inputs/messages.json     # 可选覆盖（相对 workspace）
+      output_file: outputs/session-result.json
+```
+
+关键字段说明（完整契约见 [docs/design/custom-engine.md](../../design/custom-engine.md)）：
+
+- **`transport`**（必填）—— skill-up 调用 agent 的方式。
+  - `local`：通过 `runtime.Exec` 在当前 runtime 内执行 `local.command`，agent 进程可访问 runtime workspace、已安装的 skill、fixture、MCP 配置以及进程环境变量。
+  - `http`：调用远程（或本地）HTTP agent 服务。Phase 2 已完成设计，本版本 validate 阶段直接拒绝并提示 "not yet implemented"。
+- **`response_format`**（可选，默认 `session_result`）—— skill-up 如何解析 agent 输出。
+  - `session_result`：从 `local.output_file`（若配置）或 stdout 读出完整的 `SessionResult` JSON，包含 `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`。**推荐保留默认**，可以让 judge 和报告拿到完整上下文。
+  - `text`：把 stdout 整体当作 `final_message`，skill-up 自动按输入消息 + 助手回复合成 minimal transcript，使 judge 仍能拿到一段对话。仅适合不输出结构化结果的简易脚本。
+- **`timeout_seconds`**（可选）—— 单次调用的超时时间。未设时回退到 case 级 timeout；两者都设置时 skill-up 取较小值，保证传给 agent 的预算与真实墙钟一致。
+- **`env`**（可选）—— 凭据 / 敏感参数通道。值会以进程环境变量形式注入到 agent。**这是唯一允许携带凭据的字段**：`command` / `args` / `cwd` / `input_file` / `output_file` 在配置加载阶段会拒绝凭据形态的值。
+- **`kwargs`**（可选）—— 非敏感开关，模板里以 `${kwargs.<key>}` 暴露。与 `env` 不同，kwargs 也走严格凭据检查，不允许携带凭据值或凭据形态的 key（如 `token` / `api_key` / `bearerToken` 等）。
+
+`command` / `args` / `cwd` / `env` / `input_file` / `output_file` 中可用的模板变量：
+`${workspace}`、`${input_file}`、`${output_file}`、`${model}`、`${model_provider}`、`${model_name}`、`${case_id}`、`${variant}`、`${max_turns}`、`${timeout_seconds}`、`${kwargs.<key>}`，以及环境变量形式 `${VAR}` / `${VAR:-default}` / `${VAR?error message}`。
+
+凭据收敛规则（配置加载期强校验）：
+
+- `${api_key}` 以及任何看起来像凭据的 kwarg key（`token` / `secret` / `api_key` / `apiKey` / `bearerToken` 等）都不允许出现在 `command` / `args` / `cwd` / `input_file` / `output_file` 中，必须经由 `engine.custom.env` 注入到子进程环境变量里。
+- `${SOMEVAR:-...}` 默认值如果匹配常见凭据特征（`sk-...`、`sk-ant-...`、`ghp_...`、`AIza...`、`AKIA...`、JWT 等），同样会在命令行场景被拒。
+
+SessionInput / SessionResult 的完整 JSON 契约见 `docs/design/custom-engine.md`。
+
 ### MCP 配置说明
 
 MCP 支持 `mode: real` 和 `mode: mocked`。`real` 用于把真实 MCP Server 安装到 `claude_code`、`qodercli` 或 `codex` 等 Agent；`mocked` 会由 `internal/mcp` 生成本地 stdio Mock Server，并按普通 MCP 配置安装到 Agent。

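The `timeout_seconds` rule described in both guide hunks (fall back to the case-level timeout when unset, otherwise take the smaller value) is easy to get subtly wrong, so a minimal Go sketch may help reviewers. The function name `effectiveTimeout` and the convention that a value of zero or less means "unset" are assumptions, not taken from the skill-up source.

```go
package main

import "fmt"

// effectiveTimeout returns the per-call budget in seconds.
// Assumption: a value <= 0 means the field was not set.
func effectiveTimeout(engineSeconds, caseSeconds int) int {
	switch {
	case engineSeconds <= 0:
		// engine.custom.timeout_seconds unset: fall back to the case timeout.
		return caseSeconds
	case caseSeconds <= 0:
		// Only the engine-level value is set (behaviour assumed here).
		return engineSeconds
	case engineSeconds < caseSeconds:
		return engineSeconds
	default:
		return caseSeconds
	}
}

func main() {
	fmt.Println(effectiveTimeout(300, 60)) // both set: the smaller one wins
	fmt.Println(effectiveTimeout(0, 60))   // engine unset: case-level fallback
	fmt.Println(effectiveTimeout(30, 60))  // engine value is the tighter budget
}
```

Taking the minimum means the `${timeout_seconds}` value handed to the agent never exceeds the wall-clock budget the runner will actually enforce.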
@@ -0,0 +1,133 @@
+//go:build e2e
+
+package e2e
+
+import (
+	"os"
+	"path/filepath"
+	"runtime"
+	"strings"
+	"testing"
+	"time"
+)
+
+// getCustomEngineTestdataDir returns the path to e2e/testdata/custom-engine.
+func getCustomEngineTestdataDir() string {
+	_, testFile, _, _ := runtime.Caller(0)
+	projectRoot := filepath.Dir(filepath.Dir(testFile))
+	return filepath.Join(projectRoot, "e2e", "testdata", "custom-engine")
+}
+
+// skipIfNoPOSIXShell skips the test on platforms where the agent.sh fixture
+// cannot run. agent.sh uses bash with `set -euo pipefail` and is not portable
+// to Windows; the test infrastructure for cross-platform fixtures is tracked
+// in issue #54.
+func skipIfNoPOSIXShell(t *testing.T) {
+	t.Helper()
+	if runtime.GOOS == "windows" {
+		t.Skip("agent.sh fixture is POSIX-only; see issue #54")
+	}
+}
+
+// customEngineEnv returns an env slice that points the custom engine's
+// ${CUSTOM_AGENT_BIN} placeholder at the fixture's agent.sh, while
+// preserving PATH and HOME so the test still resolves system tools.
+func customEngineEnv(agentBin string) []string {
+	return []string{
+		"PATH=" + os.Getenv("PATH"),
+		"HOME=" + os.Getenv("HOME"),
+		"CUSTOM_AGENT_BIN=" + agentBin,
+	}
+}
+
+// TestPipeline_CustomEngine_LocalTransport runs the entire evaluation pipeline
+// against a user-defined Custom Engine (transport: local). It verifies that:
+//  1. The eval loads and the custom engine block is env-resolved and validated.
+//  2. The framework writes the SessionInput JSON to ${input_file}.
+//  3. The agent's stdout is parsed back into a SessionResult.
+//  4. final_message reaches the report and an expect.must_contain rule passes.
+func TestPipeline_CustomEngine_LocalTransport(t *testing.T) {
+	skipIfNoPOSIXShell(t)
+	t.Parallel()
+
+	dir := getCustomEngineTestdataDir()
+	evalPath := filepath.Join(dir, "evals", "eval.yaml")
+	agentBin := filepath.Join(dir, "agent.sh")
+
+	outputDir := t.TempDir()
+	result := Run(t, RunConfig{
+		Env:     customEngineEnv(agentBin),
+		WorkDir: dir,
+		Timeout: 60 * time.Second,
+	}, "run", evalPath, "--no-delete", "--output-dir", outputDir)
+
+	if result.ExitCode != 0 {
+		t.Fatalf("custom engine run failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
+			result.ExitCode, result.Stdout, result.Stderr)
+	}
+	if !strings.Contains(result.Stdout, "Running evaluation") {
+		t.Errorf("expected runner stage log in output, got:\n%s", result.Stdout)
+	}
+	if !strings.Contains(result.Stdout, "PASS") {
+		t.Errorf("expected at least one PASS case, got:\n%s", result.Stdout)
+	}
+	if !strings.Contains(result.Stdout, "1 passed") {
+		t.Errorf("expected the summary line to report 1 passed, got:\n%s", result.Stdout)
+	}
+}
+
+// TestPipeline_CustomEngine_Validate runs `skill-up validate` against the
+// custom engine eval.yaml. This exercises the post-load
+// config.ResolveCustomEngineConfig path that validate.go invokes, including
+// env-variable resolution of ${CUSTOM_AGENT_BIN}.
+func TestPipeline_CustomEngine_Validate(t *testing.T) {
+	skipIfNoPOSIXShell(t)
+	t.Parallel()
+
+	dir := getCustomEngineTestdataDir()
+	evalPath := filepath.Join(dir, "evals", "eval.yaml")
+	agentBin := filepath.Join(dir, "agent.sh")
+
+	result := Run(t, RunConfig{
+		Env:     customEngineEnv(agentBin),
+		WorkDir: dir,
+		Timeout: 30 * time.Second,
+	}, "validate", evalPath)
+
+	if result.ExitCode != 0 {
+		t.Fatalf("validate failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
+			result.ExitCode, result.Stdout, result.Stderr)
+	}
+	if !strings.Contains(result.Stdout, "is valid") {
+		t.Errorf("expected validation success message, got:\n%s", result.Stdout)
+	}
+}
+
+// TestPipeline_CustomEngine_RejectsMissingEnvVar verifies the load-time env
+// resolution: when ${CUSTOM_AGENT_BIN} is unset, `run` must fail with a
+// configuration error before exec'ing anything.
+func TestPipeline_CustomEngine_RejectsMissingEnvVar(t *testing.T) {
+	skipIfNoPOSIXShell(t)
+	t.Parallel()
+
+	dir := getCustomEngineTestdataDir()
+	evalPath := filepath.Join(dir, "evals", "eval.yaml")
+
+	// Note: CUSTOM_AGENT_BIN intentionally unset.
+	result := Run(t, RunConfig{
+		Env: []string{
+			"PATH=" + os.Getenv("PATH"),
+			"HOME=" + os.Getenv("HOME"),
+		},
+		WorkDir: dir,
+		Timeout: 30 * time.Second,
+	}, "run", evalPath, "--no-delete", "--output-dir", t.TempDir())
+
+	if result.ExitCode == 0 {
+		t.Fatalf("expected run to fail with missing env var, got exit 0\nstdout:\n%s", result.Stdout)
+	}
+	combined := result.Stdout + result.Stderr
+	if !strings.Contains(combined, "CUSTOM_AGENT_BIN") {
+		t.Errorf("expected the error to name the missing env var, got:\n%s", combined)
+	}
+}
@@ -0,0 +1,9 @@
+# Custom Engine e2e fixture
+
+This directory is a minimal skill-up fixture that exercises the **Custom
+Engine** local transport end-to-end. It is consumed by `e2e/custom_engine_test.go`.
+
+`agent.sh` is a deterministic stand-in for a real custom agent: it reads the
+`SessionInput` JSON the framework writes to `${input_file}` and emits a fixed
+`SessionResult` on stdout. The case asserts that `final_message` flows from
+the agent's stdout into the report.
@@ -0,0 +1,28 @@
+#!/bin/bash
+# custom-engine/agent.sh — a deterministic mock for the Custom Engine local
+# transport. The agent receives the SessionInput JSON path as its first arg
+# (per the eval.yaml in this directory) and emits a SessionResult on stdout.
+#
+# The script intentionally ignores the input content and returns a fixed
+# response so the e2e test is deterministic. It still verifies that the
+# framework actually wrote the input file at the configured path, which is the
+# main piece of the transport contract this test exercises.
+set -euo pipefail
+
+INPUT_FILE="${1:-}"
+if [[ -z "$INPUT_FILE" || ! -f "$INPUT_FILE" ]]; then
+  echo "agent.sh: SessionInput file not provided or missing (got: '${INPUT_FILE:-}')" >&2
+  exit 1
+fi
+
+cat <<'JSON'
+{
+  "exit_code": 0,
+  "final_message": "custom-engine-handled",
+  "turns": 1,
+  "transcript": [
+    {"role": "user", "content": "ping"},
+    {"role": "assistant", "content": "custom-engine-handled"}
+  ]
+}
+JSON
@@ -0,0 +1,7 @@
+id: hello
+title: Custom engine emits expected final_message
+input:
+  prompt: ping the custom engine
+expect:
+  must_contain:
+    - "custom-engine-handled"
@@ -0,0 +1,28 @@
+schema_version: v1alpha1
+
+environment:
+  type: none
+
+mcp:
+  servers: []
+
+engine:
+  name: e2e-custom-agent
+  custom:
+    transport: local
+    response_format: session_result
+    local:
+      command: ${CUSTOM_AGENT_BIN}
+      args:
+        - ${input_file}
+
+cases:
+  files:
+    - evals/cases/hello.yaml
+  defaults:
+    timeout_seconds: 60
+    max_turns: 3
+  parallelism: 1
+
+report:
+  formats: [json]
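
The fixture above relies on `${CUSTOM_AGENT_BIN}` being resolved at load time, and `TestPipeline_CustomEngine_RejectsMissingEnvVar` expects the failure to name the missing variable. The following Go sketch shows one way the three documented forms (`${VAR}`, `${VAR:-default}`, `${VAR?error message}`) could be resolved. It is not the skill-up implementation: `expandEnv`, the `lookup` callback, and the choice to treat only unset (not empty) variables as missing are assumptions, and built-in template variables such as `${input_file}` are assumed to be handled elsewhere.

```go
package main

import (
	"fmt"
	"regexp"
)

// envRef matches ${VAR}, ${VAR:-default}, and ${VAR?message}.
// Group 1 is the name, group 2 the operator, group 3 the operand.
var envRef = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)(?:(:-|\?)([^}]*))?\}`)

// expandEnv resolves environment references in s using lookup, which
// mirrors the signature of os.LookupEnv. The first failure is returned.
func expandEnv(s string, lookup func(string) (string, bool)) (string, error) {
	var firstErr error
	out := envRef.ReplaceAllStringFunc(s, func(ref string) string {
		m := envRef.FindStringSubmatch(ref)
		name, op, operand := m[1], m[2], m[3]
		if val, ok := lookup(name); ok {
			return val
		}
		switch op {
		case ":-":
			// Variable unset: substitute the default.
			return operand
		case "?":
			if firstErr == nil {
				firstErr = fmt.Errorf("%s: %s", name, operand)
			}
		default:
			if firstErr == nil {
				firstErr = fmt.Errorf("environment variable %s is not set", name)
			}
		}
		return ref
	})
	if firstErr != nil {
		return "", firstErr
	}
	return out, nil
}

func main() {
	// A fixed map keeps the example deterministic; os.LookupEnv has the
	// same signature and could be passed instead.
	env := map[string]string{"CUSTOM_AGENT_BIN": "/tmp/agent.sh"}
	lookup := func(name string) (string, bool) {
		val, ok := env[name]
		return val, ok
	}

	fmt.Println(expandEnv("${CUSTOM_AGENT_BIN} --mode ${MODE:-fast}", lookup))
	fmt.Println(expandEnv("${MISSING_BIN}", lookup))
	fmt.Println(expandEnv("${MISSING_BIN?set it to the agent path}", lookup))
}
```

Including the variable name in every error is what lets a test like the one above assert on `CUSTOM_AGENT_BIN` without depending on the exact wording of the message.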