The `skill-eval` skill is the runner for the `agentic-evals` protocol.
It does not define what counts as correct. The `agentic-evals` repo defines targets,
suites, cases, assertions, and the output format.
If you only remember one thing, remember this:
`agentic-evals` defines what to test, `skill-eval` runs the test, and a fresh runtime
sub-agent is the thing being judged.
```
README.md
agentic-evals/
├── AGENT.md
├── docs/
└── targets/
skill-eval/
├── SKILL.md
└── scripts/
```
Install skill-eval with:

```
npx skills add Jiayi-Ye02/skills-evaluation --skill skill-eval
```

`skill-eval` runs the `agentic-evals` evaluation repo against a target skill, collects per-case results, and writes a short report.
It supports two modes:

- `single-run`: one target skill version
- `ab-urls`: two target skill versions supplied as GitHub HTTP URLs
In a normal run there are four separate roles:

- `agentic-evals/`: the test repo and source of truth for targets, suites, cases, assertions, and the report contract.
- `skill-eval`: the evaluator skill that executes the repo-defined protocol.
- target skill: the skill under test, for example `.agents/skills/<target-skill>/`.
- fresh runtime sub-agent: the execution subject that receives the case prompt and produces the trace and answer to judge.
That separation matters:
- the test repo decides pass/fail rules
- the evaluator runs the process
- the fresh sub-agent is what actually gets tested
Your workspace will normally look like this:
```
<workspace>/
├── .agents/
│   └── skills/
│       └── <target-skill>/
├── agentic-evals/
└── skill-eval/
```
To run, skill-eval needs:

- one supported runtime:
  - Codex with `spawn_agent`
  - OpenClaw with `sessions_spawn`, `sessions_history`, and `sessions_yield`
- `git` and `bash`
- local target skill files
- a local `agentic-evals` repo, or network access so it can be cloned if missing
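A quick pre-flight check along these lines can confirm the layout before you start. This is a sketch, not part of skill-eval itself; the target skill name is just the example used throughout this README:

```python
from pathlib import Path

def check_workspace(root: Path, target_skill: str) -> dict:
    """Report which of the directories skill-eval expects exist under root."""
    required = [
        Path(".agents/skills") / target_skill,  # the target skill under test
        Path("agentic-evals"),                  # the test repo
        Path("skill-eval"),                     # the evaluator skill
    ]
    return {str(rel): (root / rel).is_dir() for rel in required}

# example target name taken from this README; substitute your own
for path, present in check_workspace(Path.cwd(), "voice-ai-integration").items():
    print(("ok" if present else "missing") + ": " + path)
```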
Once your workspace is prepared, ask the runtime agent in plain language to use skill-eval. For example:
Single case:
Use skill-eval to test target_id=voice-ai-integration, case_id=convoai-phase1-only-before-gates.
Single suite:
Use skill-eval to test target_id=voice-ai-integration, suite=source-order.
Default target with default suites:
Use skill-eval to run the default target with its default suites.
Chinese example:
用 skill-eval 去测试 target_id=voice-ai-integration,测试范围 case_id=convoai-phase1-only-before-gates
(English: use skill-eval to test target_id=voice-ai-integration, scoped to case_id=convoai-phase1-only-before-gates.)
A/B URL mode:
Use skill-eval in ab-urls mode for target_id=voice-ai-integration.
variant_a_url=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
variant_b_url=https://github.com/org/repo/tree/rewrite/.agents/skills/voice-ai-integration
case_id=auth-prefers-rtc-token
Chinese A/B example:
用 skill-eval 以 ab-urls 模式测试 target_id=voice-ai-integration。
(English: use skill-eval in ab-urls mode to test target_id=voice-ai-integration.)
variant_a_url=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
variant_b_url=https://github.com/org/repo/tree/rewrite/.agents/skills/voice-ai-integration
suite_id=convoai-api
When you ask the runtime agent to run an eval, the expected flow is:

1. The runtime agent invokes `skill-eval`.
2. `skill-eval` reads `agentic-evals/AGENT.md` and `agentic-evals/docs/session-evidence.md`.
3. It resolves the target from `agentic-evals/targets/<target_id>/target.yaml`.
4. It reads the selected suite and case files from `agentic-evals/targets/<target_id>/`.
5. It creates a new run directory under `agentic-evals/runs/<run_id>/`.
6. For each case, it creates a brand-new isolated temp workspace.
7. It spawns a fresh sub-agent for that case using the active runtime.
8. The evaluator copies or normalizes the accepted child session and extracts the final user-facing answer.
9. The evaluator validates isolation and judges the assertions from the accepted child session evidence.
10. The evaluator writes final artifacts under `agentic-evals/runs/<run_id>/`.
In ab-urls mode, steps 6-10 happen twice, once for A and once for B, and then the evaluator writes a parent comparison report.
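The per-case isolation in steps 6-7 can be sketched as below. Here `spawn_fn` is a stand-in for whichever runtime call (Codex's `spawn_agent` or OpenClaw's `sessions_spawn`) actually launches the fresh sub-agent; this function is illustrative, not skill-eval's real code:

```python
import tempfile
from pathlib import Path

def run_case_isolated(case_id: str, spawn_fn):
    """Run one case in a brand-new throwaway workspace (steps 6-7)."""
    with tempfile.TemporaryDirectory(prefix=f"skill-eval-{case_id}-") as ws:
        workspace = Path(ws)
        # The fresh sub-agent only ever sees this temp directory, never the
        # evaluator's main workspace. Evidence must be copied out before the
        # context manager deletes the directory on exit.
        return spawn_fn(case_id, workspace)
```

This is also why step 8 copies the accepted child session out: the per-case workspace is gone once the case finishes.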
During execution, the evaluator should:
- report which target, suite, or case it is using
- create a fresh case workspace under a temp directory
- spawn a fresh sub-agent for the case
- tell you the runtime-native child identifier:
  - Codex mode: nickname or agent id
  - OpenClaw mode: label or child session key
- write artifacts under `agentic-evals/runs/<run_id>/`
The evaluator should not:
- run the case directly in your main workspace
- silently replace the fresh sub-agent with a fallback executor
- mark `pass` from a vague self-report alone
Every run should create:
```
agentic-evals/runs/<run_id>/
├── manifest.json
├── case-artifacts/
├── transcript.md
├── case-results/
└── report.md
```
- `report.md`: short summary and next actions
- `case-results/<case_id>.json`: official status for one case
- `transcript.md`: readable transcript rendered from accepted child session evidence
- `manifest.json`: run metadata, workspace mode, and environment mismatch notes
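A helper along these lines could create that skeleton. This is a sketch only; the actual manifest fields are defined by the agentic-evals report contract, and the `workspace_mode` field name and value here are assumptions:

```python
import json
import time
from pathlib import Path

def init_run_dir(evals_root: Path, run_id: str) -> Path:
    """Create the run-directory skeleton that every run is expected to produce."""
    run_dir = evals_root / "runs" / run_id
    for sub in ("case-artifacts", "case-results"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    manifest = {
        "run_id": run_id,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "workspace_mode": "single-run",  # illustrative field, not the real contract
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return run_dir
```

`transcript.md`, `report.md`, and the per-case results are then filled in as the run proceeds.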
In ab-urls mode the parent run additionally contains:
```
agentic-evals/runs/<ab_run_id>/
├── manifest.json
├── variants/
│   ├── A/
│   │   ├── source-manifest.json
│   │   └── run/
│   └── B/
│       ├── source-manifest.json
│       └── run/
├── comparison.json
└── report.md
```
Helper scripts for this mode:
- `skill-eval/scripts/parse_github_skill_url.py`
- `skill-eval/scripts/prepare_variant_source_workspace.py`
- `skill-eval/scripts/init_ab_run.py`
- `skill-eval/scripts/render_ab_report.py`
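Judging by its name, `parse_github_skill_url.py` splits a GitHub tree URL into repo coordinates plus a skill path. A minimal sketch of that parsing (not the script's actual code) looks like:

```python
from urllib.parse import urlparse

def parse_github_skill_url(url: str) -> dict:
    """Split https://github.com/<org>/<repo>/tree/<ref>/<path> into parts."""
    segments = urlparse(url).path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "tree":
        raise ValueError(f"not a GitHub tree URL: {url}")
    org, repo, _tree, ref, *path = segments
    return {"org": org, "repo": repo, "ref": ref, "skill_path": "/".join(path)}
```

Note that refs containing slashes are ambiguous in this URL shape, which is one reason to treat this as a sketch rather than a drop-in replacement for the script.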