The `skill-eval` skill is the runner for the `agentic-evals` protocol.
It does not define what counts as correct. The `agentic-evals` repo defines targets,
suites, cases, assertions, and the output format.
If you only remember one thing, remember this:
`agentic-evals` defines what to test, `skill-eval` runs the test, and a fresh runtime
sub-agent is the thing being judged.
```
README.md
agentic-evals/
├── AGENT.md
├── docs/
└── targets/
skill-eval/
├── SKILL.md
└── scripts/
```
Install skill-eval with:

```
npx skills add Jiayi-Ye02/skills-evaluation --skill skill-eval
```

`skill-eval` runs the `agentic-evals` evaluation repo against a target skill, collects per-case results, and writes a short report.
It supports two modes:

- `single-run`: one target skill version
- `ab-urls`: two target skill versions supplied as GitHub HTTP URLs
In a normal run there are four separate roles:

- `agentic-evals/`: the test repo and source of truth for targets, suites, cases, assertions, and the report contract.
- `skill-eval`: the evaluator skill that executes the repo-defined protocol.
- target skill: the skill under test, for example `.agents/skills/<target-skill>/`.
- fresh runtime sub-agent: the execution subject that receives the case prompt and produces the trace and answer to judge.
That separation matters:
- the test repo decides pass/fail rules
- the evaluator runs the process
- the fresh sub-agent is what actually gets tested
Your workspace will normally look like this:
```
<workspace>/
├── .agents/
│   └── skills/
│       └── <target-skill>/
├── agentic-evals/
└── skill-eval/
```
To run, skill-eval needs:

- one supported runtime:
  - Codex with `spawn_agent`
  - OpenClaw with `sessions_spawn`, `sessions_history`, and `sessions_yield`
- `git` and `bash`
- local target skill files
- a local `agentic-evals` repo, or network access so it can be cloned if missing
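A quick pre-flight check along these lines can confirm the layout before you start. This is a sketch, not part of skill-eval itself; the target skill name is just the example used throughout this README:

```python
from pathlib import Path

def check_workspace(root: Path, target_skill: str) -> dict:
    """Report which of the directories skill-eval expects exist under root."""
    required = [
        Path(".agents/skills") / target_skill,  # the target skill under test
        Path("agentic-evals"),                  # the test repo
        Path("skill-eval"),                     # the evaluator skill
    ]
    return {str(rel): (root / rel).is_dir() for rel in required}

# example target name taken from this README; substitute your own
for path, present in check_workspace(Path.cwd(), "voice-ai-integration").items():
    print(("ok" if present else "missing") + ": " + path)
```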
Once your workspace is prepared, ask the runtime agent in plain language to use skill-eval. For example:
Single case:
Use skill-eval to test target_id=voice-ai-integration, case_id=convoai-phase1-only-before-gates.
Single suite:
Use skill-eval to test target_id=voice-ai-integration, suite=source-order.
Default target with default suites:
Use skill-eval to run the default target with its default suites.
Chinese example:
用 skill-eval 去测试 target_id=voice-ai-integration,测试范围 case_id=convoai-phase1-only-before-gates
(English: use skill-eval to test target_id=voice-ai-integration, scoped to case_id=convoai-phase1-only-before-gates.)
A/B URL mode:
Use skill-eval in ab-urls mode for target_id=voice-ai-integration.
variant_a_url=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
variant_b_url=https://github.com/org/repo/tree/rewrite/.agents/skills/voice-ai-integration
case_id=auth-prefers-rtc-token
Chinese A/B example:
用 skill-eval 以 ab-urls 模式测试 target_id=voice-ai-integration。
(English: use skill-eval in ab-urls mode to test target_id=voice-ai-integration.)
variant_a_url=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
variant_b_url=https://github.com/org/repo/tree/rewrite/.agents/skills/voice-ai-integration
suite_id=convoai-api
When you ask the runtime agent to run an eval, the expected flow is:

1. The runtime agent invokes `skill-eval`.
2. `skill-eval` reads `agentic-evals/AGENT.md` and `agentic-evals/docs/session-evidence.md`.
3. It resolves the target from `agentic-evals/targets/<target_id>/target.yaml`.
4. It reads the selected suite and case files from `agentic-evals/targets/<target_id>/`.
5. It creates a new run directory under `agentic-evals/runs/<run_id>/`.
6. For each case, it creates a brand-new isolated temp workspace.
7. It spawns a fresh sub-agent for that case using the active runtime.
8. The evaluator copies or normalizes the accepted child session and extracts the final user-facing answer.
9. The evaluator validates isolation and judges the assertions from the accepted child session evidence.
10. The evaluator writes final artifacts under `agentic-evals/runs/<run_id>/`.
In ab-urls mode, steps 6-10 happen twice, once for A and once for B, and then the evaluator writes a parent comparison report.
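The per-case isolation in steps 6-7 can be sketched as below. Here `spawn_fn` is a stand-in for whichever runtime call (Codex's `spawn_agent` or OpenClaw's `sessions_spawn`) actually launches the fresh sub-agent; this function is illustrative, not skill-eval's real code:

```python
import tempfile
from pathlib import Path

def run_case_isolated(case_id: str, spawn_fn):
    """Run one case in a brand-new throwaway workspace (steps 6-7)."""
    with tempfile.TemporaryDirectory(prefix=f"skill-eval-{case_id}-") as ws:
        workspace = Path(ws)
        # The fresh sub-agent only ever sees this temp directory, never the
        # evaluator's main workspace. Evidence must be copied out before the
        # context manager deletes the directory on exit.
        return spawn_fn(case_id, workspace)
```

This is also why step 8 copies the accepted child session out: the per-case workspace is gone once the case finishes.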
During execution, the evaluator should:
- report which target, suite, or case it is using
- create a fresh case workspace under a temp directory
- spawn a fresh sub-agent for the case
- tell you the runtime-native child identifier:
  - Codex mode: nickname or agent id
  - OpenClaw mode: label or child session key
- write artifacts under `agentic-evals/runs/<run_id>/`
The evaluator should not:
- run the case directly in your main workspace
- silently replace the fresh sub-agent with a fallback executor
- mark `pass` from a vague self-report alone
Every run should create:
```
agentic-evals/runs/<run_id>/
├── manifest.json
├── case-artifacts/
├── transcript.md
├── case-results/
└── report.md
```
- `report.md`: short summary and next actions
- `case-results/<case_id>.json`: official status for one case
- `transcript.md`: readable transcript rendered from accepted child session evidence
- `manifest.json`: run metadata, workspace mode, and environment mismatch notes
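A helper along these lines could create that skeleton. This is a sketch only; the actual manifest fields are defined by the agentic-evals report contract, and the `workspace_mode` field name and value here are assumptions:

```python
import json
import time
from pathlib import Path

def init_run_dir(evals_root: Path, run_id: str) -> Path:
    """Create the run-directory skeleton that every run is expected to produce."""
    run_dir = evals_root / "runs" / run_id
    for sub in ("case-artifacts", "case-results"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    manifest = {
        "run_id": run_id,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "workspace_mode": "single-run",  # illustrative field, not the real contract
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return run_dir
```

`transcript.md`, `report.md`, and the per-case results are then filled in as the run proceeds.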
In ab-urls mode the parent run additionally contains:
```
agentic-evals/runs/<ab_run_id>/
├── manifest.json
├── variants/
│   ├── A/
│   │   ├── source-manifest.json
│   │   └── run/
│   └── B/
│       ├── source-manifest.json
│       └── run/
├── comparison.json
└── report.md
```
Helper scripts for this mode:
- `skill-eval/scripts/parse_github_skill_url.py`
- `skill-eval/scripts/prepare_variant_source_workspace.py`
- `skill-eval/scripts/init_ab_run.py`
- `skill-eval/scripts/render_ab_report.py`
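Judging by its name, `parse_github_skill_url.py` splits a GitHub tree URL into repo coordinates plus a skill path. A minimal sketch of that parsing (not the script's actual code) looks like:

```python
from urllib.parse import urlparse

def parse_github_skill_url(url: str) -> dict:
    """Split https://github.com/<org>/<repo>/tree/<ref>/<path> into parts."""
    segments = urlparse(url).path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "tree":
        raise ValueError(f"not a GitHub tree URL: {url}")
    org, repo, _tree, ref, *path = segments
    return {"org": org, "repo": repo, "ref": ref, "skill_path": "/".join(path)}
```

Note that refs containing slashes are ambiguous in this URL shape, which is one reason to treat this as a sketch rather than a drop-in replacement for the script.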