diff --git a/CHANGELOG.md b/CHANGELOG.md index 340a552..91edb0e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,8 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). ### Added +- **`prompt` eval case mode — score prompts and skill outputs (paw-workspace#47)** — the eval framework used to evaluate one thing: a seeded soul. It now also scores a plain prompt or the output of a skill. A case with `mode: prompt` skips the soul completely — no birth, no context, the `seed` block is ignored — and hands the case's `message` straight to the scorer. The new optional `CaseInputs.reference` field holds the text a skill was originally given; when it is set, the `judge` scorer puts it in front of the LLM as its own "Reference input" block, so the criteria can ask whether a candidate output improved on where it started rather than judging it cold. There is no new scoring kind: prompt and skill outputs go through the existing `JudgeScoring`. There is no new CLI command or flag either — `soul eval` runs a prompt-mode spec the same way it runs a soul spec. New reference spec `tests/eval_examples/humanizer_skill.yaml` scores the workspace `/humanize` skill: a deterministic `regex` gate that runs with no engine, plus four `judge` cases that check a humanized rewrite dropped its AI tells and kept the meaning. Docs: `eval-format.md` gains a "Prompt mode" section; `cli-reference.md` and `api-reference.md` cover the new mode and a Case modes table. This is the read-side of the workspace prompt-evaluation pair — a way to catch it when an edit to a tracked skill makes its output worse. + - **Soul optimize / autoresearch (#142)** — autonomous self-improvement loop that pairs with the soul-aware eval framework (#160). The soul runs an eval against itself, identifies failing cases, proposes targeted changes to its own behaviour-shaping "knobs" (OCEAN traits, persona text, memory thresholds, bond strength), keeps changes that improve the eval score, and reverts the rest. New `soul_protocol.optimize` module: `optimize()` entry point, `OptimizeRunner` class with custom knob registration, `Knob` protocol plus four built-in knobs (`OceanTraitKnob` ±0.1/±0.2 within [0,1] per OCEAN dimension; `PersonaTextKnob` LLM-driven persona rephrasings with heuristic no-op fallback; `SignificanceThresholdKnob` for `MemorySettings.importance_threshold` ±1 plus the `skip_deep_processing_on_low_significance` flip; `BondThresholdKnob` for default bond strength ±5/±10), `Proposer` (LLM-assisted with heuristic fallback when no engine or unparseable response), `OptimizeResult`/`OptimizeStep` Pydantic models. Defaults to dry-run (`apply=False`) — every change applied during the run is reverted at the end and no trust chain entries are written; the soul stays byte-identical. With `apply=True` the runner keeps the winning trajectory and appends one `soul.optimize.applied` trust chain entry per kept change with payload `{knob_name, before, after, score_delta}`. Reverted proposals never write entries either way. New `soul optimize ` CLI command (`--iterations`, `--target`, `--apply`, `--engine`, `--json`) and `soul_optimize` MCP tool with the same surface. Pairs naturally with #160 — without the eval, "improvement" is a vibe; with the eval, it's a number that goes up. Full doc at `docs/soul-optimize.md`. - **Graph traversal + typed entity ontology (#108, #190)** — entities now carry one of eight built-in kinds (`person`, `place`, `org`, `concept`, `tool`, `document`, `event`, `relation`) plus open-string extension. Eight matching relation predicates (`mentions`, `related`, `depends_on`, `contributes_to`, `causes`, `follows`, `supersedes`, `owned_by`) ship as `RelationType` with the same open contract. The cognitive engine's `extract_entities` prompt asks for the typed ontology plus a `relations` array per entity with `{target, relation, weight}` triples; heuristic-only souls keep working through a translation table that maps legacy types. New `Soul.graph` returns a `GraphView` with `nodes()`, `edges()`, `neighbors()`, `path()`, `subgraph()`, `to_mermaid()`, `reachable()`, `stats()`. `Soul.recall` accepts `graph_walk={"start": entity_id, "depth": 2, "edge_types": [...]}` plus `page_token` and `token_budget` for pagination + L0-abstract fallback under budget pressure; new `RecallResults` list subclass carries `next_page_token`, `total_estimate`, `truncated_for_budget` (legacy callers still get `list[MemoryEntry]`). Trust chain hooks: `Soul.observe()` appends `graph.entity_added` and `graph.relation_added` entries for net-new entities/edges. New `soul graph` CLI group (`nodes`/`edges`/`neighbors`/`path`/`mermaid`, all with `--json`) and `soul_graph_query` MCP tool. In-memory dict + adjacency-list storage with `to_dict`/`from_dict` round-trip; pre-0.5.0 graphs load cleanly. Heuristic third-person relation edges (e.g. "Alice knows Bob") now flow through to the graph instead of being dropped. diff --git a/docs/api-reference.md b/docs/api-reference.md index 80504c5..9c84101 100644 --- a/docs/api-reference.md +++ b/docs/api-reference.md @@ -16,6 +16,9 @@ soul_protocol.eval module — EvalSpec, EvalCase, EvalResult, CaseResult, the five scoring kinds (keyword/regex/semantic/judge/structural), and run_eval / run_eval_against_soul / run_eval_file entry points. + Updated: 2026-05-21 (paw-workspace#47): Added the Case modes table to the + Evaluation section — documents the new `prompt` mode (scores a verbatim + prompt or skill output, soul skipped) and the `reference` input field. Updated: 2026-04-27 — Documented user-driven memory update primitives: Soul.forget_one (audited single-id delete), Soul.supersede (write new memory + link old.superseded_by), Soul.supersede_audit property. Rewrote stale soul.forget() entry to match the real @@ -1704,6 +1707,18 @@ from soul_protocol.eval import ( - `run_eval_file(path, *, engine=None, case_filter=None) -> EvalResult` — convenience wrapper that loads then runs. - `run_eval_against_soul(spec, soul, *, engine=None, case_filter=None) -> EvalResult` — run cases against an existing `Soul` without re-birthing. Used by the `soul_eval` MCP tool. The `seed` block is ignored — the soul's live state is the seed. +### Case modes + +`CaseInputs.mode` selects how the runner produces the text the scorer sees: + +| Mode | What runs | Output scored | +|------|-----------|---------------| +| `respond` (default) | Soul produces a reply via `context_for` + the engine | The reply | +| `recall` | `Soul.recall(query=message, ...)` | The recalled memories, rendered as text | +| `prompt` | Nothing — the soul is skipped, `seed` is ignored | `inputs.message`, verbatim | + +`prompt` mode scores a standalone prompt or skill output. Set `inputs.reference` (prompt-mode only) to the pre-transform text and a `judge` case compares the candidate against it. See [eval-format.md](eval-format.md#prompt-mode-scoring-prompts-and-skills). + ### Result models `EvalResult`: diff --git a/docs/cli-reference.md b/docs/cli-reference.md index f0450be..65ca0b3 100644 --- a/docs/cli-reference.md +++ b/docs/cli-reference.md @@ -18,6 +18,9 @@ Runs cases against a soul seeded with explicit state (memories, OCEAN, bonds, mood, energy). Supports keyword / regex / semantic / judge / structural scoring. --json, --filter, --judge-engine, --verbose options. Exits 1 on any failure. Count: 47 → 48. + Updated: 2026-05-21 (paw-workspace#47): `soul eval` also scores prompts and skill + outputs — a `mode: prompt` case skips the soul and scores the case text verbatim. + No new command or flag; the `humanizer_skill.yaml` reference spec evaluates /humanize. Updated: 2026-04-29 — v0.4.0 (#42): Added `soul verify` and `soul audit` for trust-chain integrity checks and signed-action timelines. Both support --json. `soul verify` exits 1 on a tampered chain. Count: 45 → 47. @@ -1776,7 +1779,9 @@ Payloads are stored as hashes only — the table shows *what changed when*, not ### `soul eval` -Run YAML-driven soul-aware evals against a freshly seeded soul. The eval framework lets you pin the soul's state (memories, OCEAN, bonds, mood, energy) before each test runs, so you can measure memory-driven behaviour rather than just stateless input-output. See [eval-format.md](eval-format.md) for the full schema. +Run YAML-driven soul-aware evals against a freshly seeded soul. The eval framework lets you pin the soul's state (memories, OCEAN, bonds, mood, energy) before each test runs, so you can measure memory-driven behaviour rather than just stateless input-output. + +It also scores plain prompts and skill outputs. A case with `mode: prompt` skips the soul and scores the case text verbatim — point it at a workspace prompt or a skill's output (for example `/humanize`) to catch regressions when that prompt or skill changes. See [eval-format.md](eval-format.md) for the full schema, including the `prompt` mode and the `humanizer_skill.yaml` reference spec. ```bash soul eval @@ -1809,6 +1814,10 @@ soul eval tests/eval_examples/ # all .yaml in d soul eval tests/eval_examples/ --filter "creative" soul eval my_eval.yaml --json | jq '.specs[].cases' soul eval my_eval.yaml --judge-engine my_module:make_engine + +# Score the /humanize skill (prompt-mode spec). The judge cases need an +# engine; without one they SKIP and only the deterministic checks run. +soul eval tests/eval_examples/humanizer_skill.yaml --judge-engine my_module:make_engine ``` **Output:** one Rich table per spec (Case, Status, Score, Time, optional Details), plus a summary footer with totals. `--json` returns `{specs: [...], duration_ms, pass_count, fail_count, skip_count, error_count}`. diff --git a/docs/eval-format.md b/docs/eval-format.md index 067419a..587ac90 100644 --- a/docs/eval-format.md +++ b/docs/eval-format.md @@ -2,7 +2,11 @@ Created: 2026-04-29 — Documents the YAML schema, scoring kinds, runner contract, and CLI / MCP entry points for soul-aware evals. Companion to docs/api-reference.md (EvalSpec, EvalResult, - run_eval) and docs/cli-reference.md (`soul eval`). --> + run_eval) and docs/cli-reference.md (`soul eval`). + Updated: 2026-05-21 (paw-workspace#47) — Documented the `prompt` + case mode, which scores a verbatim prompt or skill output without + a soul. Used to evaluate workspace prompts and skills (/humanize). + Companion example: tests/eval_examples/humanizer_skill.yaml. --> # Soul-aware Eval Format @@ -17,6 +21,12 @@ interactions, current mood and energy. The same prompt to a soul that's than to one that's "energetic with high bond strength" — and that's the entire point of the protocol. +The format also handles a stateless case. A `prompt`-mode case scores a +verbatim prompt or skill output directly, with no soul involved. That is +how you point the same harness at workspace prompts and skills — for +example scoring the `/humanize` skill's output for AI tells. See +[Prompt mode](#prompt-mode-scoring-prompts-and-skills) below. + This page documents the schema and the runner. For the CLI command see [cli-reference.md](cli-reference.md#soul-eval). For the MCP tool see [mcp-server.md](mcp-server.md#soul_eval). For Python API access see @@ -97,10 +107,11 @@ EvalSpec │ ├── message: str # required │ ├── user_id: str | null │ ├── domain: str | null - │ ├── mode: "respond" | "recall" + │ ├── mode: "respond" | "recall" | "prompt" │ ├── observe: bool # default false │ ├── recall_limit: int # default 5 - │ └── recall_layer: str | null + │ ├── recall_layer: str | null + │ └── reference: str | null # prompt mode — the pre-transform text └── scoring: Scoring # see below ``` @@ -130,16 +141,62 @@ also queryable via `inputs.recall_layer`. A case has three parts: 1. **Mode** — `respond` (the soul produces a reply via context_for + the - engine) or `recall` (`Soul.recall(query=message, ...)`). + engine), `recall` (`Soul.recall(query=message, ...)`), or `prompt` + (the soul is skipped; `message` is scored verbatim — see + [Prompt mode](#prompt-mode-scoring-prompts-and-skills)). 2. **Inputs** — message, optional `user_id` (multi-user routing), - optional `domain` (for v0.4.0 domain isolation), and recall knobs. + optional `domain` (for v0.4.0 domain isolation), recall knobs, and the + prompt-mode `reference`. 3. **Scoring** — one of the five kinds below. The `kind` field is the discriminator; Pydantic resolves the right scorer at parse time. `observe: true` runs `Soul.observe()` after producing the response, so the soul's state mutates. By default `observe: false` keeps the state identical to the seed across cases — recommended for deterministic -evals. +evals. `observe` does nothing in `prompt` mode (there is no soul to +observe). + +### Prompt mode — scoring prompts and skills + +Most cases drive a soul. A `prompt`-mode case does not: it takes the +case's `message` as a verbatim string — a prompt, or the output of a +skill — and hands it straight to the scorer. No soul is birthed, no +context is built, the `seed` block is ignored. + +This is the path for evaluating the workspace's own prompts and skills. +The motivating case is `/humanize`: feed the skill's output in as +`message`, describe the qualities a good humanized text should have in a +`judge` block, and the eval tells you whether an edit to the skill made +its output better or worse. + +```yaml +cases: + - name: "rewrite drops the puffery" + inputs: + mode: prompt + # `reference` — the original text the skill was given. + reference: | + Version 2.0 stands as an enduring testament to our commitment. + # `message` — the candidate output to score. + message: | + Version 2.0 shipped Tuesday with offline mode. + scoring: + kind: judge + criteria: | + The candidate output should state plainly what changed, with no + significance-inflation language. It should keep the facts and + stay shorter than the reference. +``` + +`reference` is optional and prompt-mode only. When set, the `judge` +scorer shows it to the LLM as a separate "Reference input" block, so +criteria can ask whether the candidate improved on the original rather +than judging it in isolation. Any scoring kind works in prompt mode — +`regex` and `keyword` give you deterministic gates that pass without an +engine — but `judge` is the natural fit for "is this output good." + +The shipped reference spec is +[`tests/eval_examples/humanizer_skill.yaml`](../tests/eval_examples/humanizer_skill.yaml). ## Scoring kinds @@ -309,4 +366,5 @@ a follow-up the optimizer would benefit from, file an issue against it. - [api-reference.md](api-reference.md#evaluation) — Python API - [cli-reference.md](cli-reference.md#soul-eval) — `soul eval` command - [mcp-server.md](mcp-server.md#soul_eval) — `soul_eval` MCP tool -- `tests/eval_examples/` — five shipped example specs +- `tests/eval_examples/` — shipped example specs, including + `humanizer_skill.yaml` for the prompt-mode `/humanize` eval diff --git a/src/soul_protocol/eval/runner.py b/src/soul_protocol/eval/runner.py index deec711..54e53b3 100644 --- a/src/soul_protocol/eval/runner.py +++ b/src/soul_protocol/eval/runner.py @@ -4,6 +4,11 @@ # either drives the soul into producing a response (mode="respond") or # calls Soul.recall() (mode="recall"), captures state snapshots, and # delegates to the scoring module. +# Updated: 2026-05-21 (paw-workspace#47) — Added the "prompt" case mode. +# Prompt-mode cases skip the soul entirely: the runner takes the case's +# `message` as verbatim text (a prompt or a skill output) and hands it to +# the scorer. This lets the framework score workspace prompts and skills +# such as /humanize, scored via the existing JudgeScoring kind. # # The "respond" path is the interesting one. soul-protocol does not own a # response generator — that's the consumer's job — so the runner builds the @@ -213,7 +218,17 @@ async def _run_case( mood_before = soul.state.mood energy_before = soul.state.energy - if inputs.mode == "recall": + if inputs.mode == "prompt": + # Prompt mode — the soul is not involved at all. The case's + # ``message`` is the verbatim text under evaluation (a prompt, or a + # skill's output). Hand it straight to the scorer. The judge scorer + # picks up ``inputs.reference`` separately when present. + execution = CaseExecution( + output_text=inputs.message, + mood_before=mood_before, + energy_before=energy_before, + ) + elif inputs.mode == "recall": layer = inputs.recall_layer mtypes: list[MemoryType] | None = None if layer: diff --git a/src/soul_protocol/eval/schema.py b/src/soul_protocol/eval/schema.py index 7123410..6a4e68b 100644 --- a/src/soul_protocol/eval/schema.py +++ b/src/soul_protocol/eval/schema.py @@ -3,6 +3,13 @@ # union (keyword | regex | semantic | judge | structural). Evals are written # in YAML; this module parses and validates them. The runner consumes the # resulting Pydantic models and drives Soul.observe/recall/respond. +# Updated: 2026-05-21 (paw-workspace#47) — Added a third case mode, "prompt". +# In prompt mode the case input is a verbatim prompt or skill output (not a +# soul recall/respond); the runner scores that text directly without +# touching the soul. Lets the eval framework score workspace prompts and +# skills (e.g. /humanize) alongside seeded-soul behaviour. The optional +# `reference` field carries the original input a skill transformed, so a +# judge case can compare a candidate output against where it started. # # Design note: we keep the schema deliberately small. Anything the soul # already exposes (Personality, Mood, MemoryType) is referenced directly so @@ -248,7 +255,7 @@ class StructuralScoring(_ScoringBase): class CaseInputs(BaseModel): """Input for a single case. - Two modes: + Three modes: - ``mode="respond"`` (default) — runner builds a system prompt + context block from the soul, asks the engine for a reply to ``message``, and @@ -256,12 +263,24 @@ class CaseInputs(BaseModel): - ``mode="recall"`` — runner calls ``Soul.recall(query=message, ...)`` and hands the result list to the scorer (rendered as one entry per line for keyword/semantic/judge; full list for structural). + - ``mode="prompt"`` — the soul is left untouched. ``message`` is treated + as a verbatim prompt or skill output and handed straight to the + scorer. This is how the framework evaluates workspace prompts and + skills (e.g. ``/humanize``): the YAML carries the text under test and + a :class:`JudgeScoring` block describes the qualities a good output + should have. The ``seed`` block is ignored for prompt-mode cases. + + ``reference`` — optional. In prompt mode it carries the *original* text + a skill was meant to transform (e.g. the AI-slop input before + ``/humanize`` ran). The judge scorer shows it as a "Reference input" + block so criteria can ask whether the candidate improved on it. + Ignored outside prompt mode. ``observe`` (default false) — when true, the runner additionally calls ``Soul.observe()`` after generating the response, so subsequent cases in the same spec see the updated state. Defaults to false because evals should be deterministic and memory mutations between cases make that - harder. + harder. ``observe`` has no effect in prompt mode (no soul interaction). """ model_config = ConfigDict(extra="forbid") @@ -269,11 +288,13 @@ class CaseInputs(BaseModel): message: str user_id: str | None = None domain: str | None = None - mode: Literal["respond", "recall"] = "respond" + mode: Literal["respond", "recall", "prompt"] = "respond" observe: bool = False # recall-mode specific knobs recall_limit: int = 5 recall_layer: str | None = None + # prompt-mode specific knob — the original text a skill transformed + reference: str | None = None class EvalCase(BaseModel): diff --git a/src/soul_protocol/eval/scoring.py b/src/soul_protocol/eval/scoring.py index a655da1..de5eaf7 100644 --- a/src/soul_protocol/eval/scoring.py +++ b/src/soul_protocol/eval/scoring.py @@ -4,6 +4,16 @@ # functions of (soul, case, output) — no side effects on the soul. The # judge scorer is the only one that requires an engine; it returns a # "skipped" outcome when no engine is configured rather than failing. +# Updated: 2026-05-21 (paw-workspace#47) — The judge scorer now adds a +# "Reference input" block to its prompt when the case carries +# `inputs.reference` (set by prompt-mode cases). This lets a /humanize or +# skill eval ask the judge to compare a candidate output against the +# original text it was meant to transform. No new scoring kind is added — +# prompt/skill outputs reuse JudgeScoring. +# Updated: 2026-05-21 — Gate `inputs.reference` to prompt-mode cases. The +# field's docstring says it is ignored outside prompt mode; score_judge +# now honors that, so a respond/recall case carrying `reference` no +# longer silently drops the user message from the judge prompt. from __future__ import annotations @@ -191,6 +201,26 @@ def score_semantic( """ +# Used for prompt-mode cases that carry a `reference` — the original text a +# skill was meant to transform. The judge compares the candidate against it. +_JUDGE_PROMPT_WITH_REFERENCE = """You are evaluating the output of a text-processing prompt or skill. + +Criteria: +{criteria} + +Reference input (the original text the skill was given): +{reference} + +Candidate output (the text to score): +{output} + +Score the candidate output from 0.0 (does not meet criteria at all) to +1.0 (fully meets criteria). Return JSON only — no other text: + +{{"score": <0.0-1.0>, "reasoning": ""}} +""" + + _JSON_RE = re.compile(r"\{.*?\}", re.DOTALL) @@ -206,6 +236,11 @@ async def score_judge( failed) so a CI run that lacks API credentials can still validate the rest of the eval suite. When the judge call fails or returns unparseable output, score 0.0 with details explaining why. + + When the case carries ``inputs.reference`` (a prompt-mode case scoring + a skill output against the text it transformed), the judge prompt + shows that reference as a separate block so the criteria can ask the + judge to compare candidate against original. """ if engine is None: return ScoreOutcome( @@ -217,11 +252,23 @@ async def score_judge( }, ) - prompt = _JUDGE_PROMPT.format( - criteria=spec.criteria.strip(), - message=case.inputs.message, - output=execution.output_text, - ) + # `reference` is a prompt-mode-only field — its docstring on + # CaseInputs says so. Gate it on the mode so a respond/recall case + # that happens to set `reference` does not silently switch to the + # reference template (which omits the user's actual message). + reference = case.inputs.reference if case.inputs.mode == "prompt" else None + if reference: + prompt = _JUDGE_PROMPT_WITH_REFERENCE.format( + criteria=spec.criteria.strip(), + reference=reference, + output=execution.output_text, + ) + else: + prompt = _JUDGE_PROMPT.format( + criteria=spec.criteria.strip(), + message=case.inputs.message, + output=execution.output_text, + ) try: raw = await engine.think(prompt) except Exception as e: # pragma: no cover — network / engine errors diff --git a/tests/eval_examples/humanizer_skill.yaml b/tests/eval_examples/humanizer_skill.yaml new file mode 100644 index 0000000..862f40d --- /dev/null +++ b/tests/eval_examples/humanizer_skill.yaml @@ -0,0 +1,151 @@ +# humanizer_skill.yaml — Scores the workspace /humanize skill's output. +# Created: 2026-05-21 (paw-workspace#47) — First reference spec for the +# prompt case mode. Each case carries a humanized rewrite as `message` +# and the original AI-slop text as `reference`; JudgeScoring checks the +# rewrite kept the meaning while shedding AI tells. One deterministic +# regex case runs without an engine so the spec still has a real pass +# when no judge engine is wired (judge cases skip cleanly in that case). +# +# This is a regression harness: if an edit to .claude/skills/humanizer +# degrades the skill's output, the judge scores here drop and the eval +# fails. Run with `soul eval tests/eval_examples/humanizer_skill.yaml +# --judge-engine module:attr` to get live judge scores. + +name: "Humanizer skill — outputs read as human, not AI" +description: | + Evaluates the /humanize skill (.claude/skills/humanizer/SKILL.md). The + skill takes AI-generated text and rewrites it to sound human: it strips + significance inflation, promotional language, em-dash overuse, the rule + of three, chatbot artifacts, and the rest of the catalogue in the skill. + + Cases use `mode: prompt`, so no soul is involved. The `message` field + is a candidate humanized rewrite; `reference` is the original slop the + skill was given. The judge compares the two against criteria that + describe what a good rewrite looks like. Judge cases skip when no + engine is wired; the regex case runs everywhere. + +cases: + - name: "regex_no_curly_quotes_or_emoji" + description: | + Deterministic gate, runs with no engine. A humanized output must + use straight quotes and carry no emoji — two of the skill's hard + rules (patterns 18 and 19). The negative lookahead fails the case + if either a curly quote or a rocket/bulb/check emoji slipped + through. This keeps the spec honest even on a no-engine CI run. + inputs: + mode: prompt + message: | + We shipped the new planner this week. It is faster on the cases + we measured and it no longer drops tasks when the queue is long. + There is still a rough edge around retries that we want to fix + before the next release. + scoring: + kind: regex + pattern: '^(?!.*[“”‘’\U0001F680\U0001F4A1✅]).*$' + threshold: 1.0 + + - name: "judge_strips_significance_inflation" + description: | + The skill's first content pattern: kill "testament", "pivotal + moment", "evolving landscape", "vital role". A good rewrite states + what happened plainly and keeps the facts. + inputs: + mode: prompt + reference: | + The release of version 2.0 stands as an enduring testament to + the team's commitment to excellence, marking a pivotal moment in + the evolving landscape of the product and underscoring its vital + role in the broader ecosystem. + message: | + Version 2.0 shipped on Tuesday. It adds offline mode and cuts + cold-start time roughly in half. + scoring: + kind: judge + criteria: | + The candidate output should state plainly what version 2.0 is + and what changed, with no significance-inflation language — + nothing like "testament", "pivotal moment", "evolving + landscape", "vital role", or "broader ecosystem". It should + keep concrete facts and stay shorter than the reference. Score + high when the puffery is gone and the meaning survives; score + low if the rewrite still reaches for grand framing. + + - name: "judge_removes_chatbot_artifacts" + description: | + Pattern 20 — collaborative-communication artifacts. Text meant as + chat ("Great question!", "I hope this helps!", "Let me know if") + should not survive into prose. + inputs: + mode: prompt + reference: | + Great question! Here is a summary of how the cache works. The + cache stores responses for five minutes. I hope this helps! Let + me know if you would like me to expand on any section. + message: | + The cache stores responses for five minutes, then evicts them on + the next read. + scoring: + kind: judge + criteria: | + The candidate output must read as standalone prose with no + chatbot correspondence artifacts: no "Great question!", no + "Here is a...", no "I hope this helps!", no "Let me know if you + would like...". The factual claim about the five-minute cache + must still be present. Score high when the output is clean + prose; score low if any chat-assistant phrasing remains. + + - name: "judge_kills_rule_of_three_and_em_dashes" + description: | + Patterns 10 and 14 — forced triplets and em-dash overuse. A good + rewrite breaks the triad and uses commas or periods instead of + stacked em dashes. + inputs: + mode: prompt + reference: | + The conference offers keynotes, panels, and workshops — sessions + that inform, inspire, and connect — bringing together speakers, + sponsors, and attendees from across the industry. + message: | + The conference has keynote talks and panels, plus hands-on + workshops. There is also time between sessions for people to + meet each other. + scoring: + kind: judge + criteria: | + The candidate output should describe the conference without + forcing ideas into groups of three and without em dashes ("—"). + It is fine for the rewrite to drop the triplet structure + entirely and simply list what the conference includes. Score + high when the rule-of-three cadence and em dashes are gone and + the description still makes sense; score low if either pattern + survives. + + - name: "judge_keeps_voice_not_soulless" + description: | + The skill's PERSONALITY AND SOUL section: clean is not enough, the + rewrite should still have a human behind it — an opinion, varied + rhythm, a first-person take where it fits. + inputs: + mode: prompt + reference: | + The experiment produced interesting results. The agents + generated three million lines of code. Some developers were + impressed while others were skeptical. The implications remain + unclear. + message: | + I honestly don't know what to make of this one. Three million + lines of code, written while everyone slept. Half the people I + follow think it changes everything; the other half are busy + explaining why it doesn't count. I keep getting stuck on the + same thought: those agents running all night with nobody + watching. + scoring: + kind: judge + criteria: | + The candidate output should read like a person with an opinion, + not a neutral report. It should vary sentence rhythm, admit + mixed feelings or uncertainty, and use first person where it + fits — while still covering the same facts as the reference + (three million lines of code, a split reaction). Score high + when the rewrite has a clear voice and keeps the substance; + score low if it reads flat and committee-written. diff --git a/tests/test_eval/test_cli.py b/tests/test_eval/test_cli.py index 65f1938..af6b08b 100644 --- a/tests/test_eval/test_cli.py +++ b/tests/test_eval/test_cli.py @@ -3,6 +3,10 @@ # against tempdir fixtures and the shipped example YAMLs. Validates # exit codes (0 on all-pass, 1 on any-fail), --json output shape, and # --filter narrowing. +# Updated: 2026-05-21 (paw-workspace#47) — Added an end-to-end test that +# runs the humanizer_skill.yaml prompt-mode spec through the CLI with a +# deterministic judge engine. `make_fake_judge_engine` is module-level +# so the CLI's `--judge-engine module:attr` can import it. from __future__ import annotations @@ -17,6 +21,28 @@ EXAMPLES_DIR = Path(__file__).resolve().parents[1] / "eval_examples" +# --------------------------------------------------------------------------- +# Deterministic judge engine — importable via --judge-engine module:attr +# --------------------------------------------------------------------------- + + +class _FakeJudgeEngine: + """CognitiveEngine stand-in that always returns a passing judge verdict. + + Lets the CLI exercise judge-mode cases without API credentials. The + canned JSON scores above every threshold in the shipped specs, so a + structurally sound spec run reports all-pass. + """ + + async def think(self, prompt: str) -> str: + return '{"score": 0.92, "reasoning": "meets the criteria"}' + + +def make_fake_judge_engine() -> _FakeJudgeEngine: + """Factory the CLI resolves from `--judge-engine ...:make_fake_judge_engine`.""" + return _FakeJudgeEngine() + + def _write_spec(tmp: Path, name: str, body: str) -> Path: """Drop a YAML spec into ``tmp`` and return the path.""" path = tmp / name @@ -262,3 +288,66 @@ def test_eval_heuristic_engine_skips_judge() -> None: # the judge will return non-JSON and fail. The other 3 cases pass, # so exit code is 1 (any failure). assert result.exit_code == 1 + + +# --------------------------------------------------------------------------- +# Prompt-mode skill eval — humanizer_skill.yaml end-to-end (paw-workspace#47) +# --------------------------------------------------------------------------- + +HUMANIZER_SPEC = EXAMPLES_DIR / "humanizer_skill.yaml" + +# Module path the CLI uses to import the deterministic judge engine. +_FAKE_ENGINE_REF = "tests.test_eval.test_cli:make_fake_judge_engine" + + +def test_humanizer_spec_no_engine_skips_judge_exits_zero() -> None: + """The humanizer spec runs with no engine: the regex case passes, the + judge cases skip, and the run exits 0 (skips do not fail a run).""" + runner = CliRunner() + result = runner.invoke(cli, ["eval", str(HUMANIZER_SPEC)]) + assert result.exit_code == 0, result.output + assert "PASS" in result.output # the regex gate + assert "SKIP" in result.output # the judge cases + + +def test_humanizer_spec_with_judge_engine_all_pass_exits_zero() -> None: + """Wired with a deterministic judge engine, every case in the humanizer + spec passes and the CLI exits 0 — the success path of a skill eval.""" + runner = CliRunner() + result = runner.invoke( + cli, + ["eval", str(HUMANIZER_SPEC), "--judge-engine", _FAKE_ENGINE_REF, "--json"], + ) + assert result.exit_code == 0, result.output + payload = json.loads(result.output) + assert payload["fail_count"] == 0 + assert payload["error_count"] == 0 + # 5 cases: 1 regex + 4 judge, all of them score a pass with the engine. + assert payload["pass_count"] == 5 + assert payload["skip_count"] == 0 + spec = payload["specs"][0] + assert spec["spec_name"].startswith("Humanizer skill") + judge_cases = [c for c in spec["cases"] if c["name"].startswith("judge_")] + assert len(judge_cases) == 4 + assert all(c["passed"] for c in judge_cases) + + +def test_humanizer_spec_filter_runs_single_case() -> None: + """`--filter` narrows the humanizer spec to one case end-to-end.""" + runner = CliRunner() + result = runner.invoke( + cli, + [ + "eval", + str(HUMANIZER_SPEC), + "--filter", + "regex_no_curly", + "--json", + ], + ) + assert result.exit_code == 0, result.output + payload = json.loads(result.output) + cases = payload["specs"][0]["cases"] + assert len(cases) == 1 + assert cases[0]["name"] == "regex_no_curly_quotes_or_emoji" + assert cases[0]["passed"] diff --git a/tests/test_eval/test_runner.py b/tests/test_eval/test_runner.py index 8f9b4f7..c5daccb 100644 --- a/tests/test_eval/test_runner.py +++ b/tests/test_eval/test_runner.py @@ -3,6 +3,13 @@ # per-case execution (respond + recall modes), all five scoring kinds, # and the no-engine fallback path. Uses an in-memory FakeEngine for # judge-scoring tests so they don't need API credentials. +# Updated: 2026-05-21 (paw-workspace#47) — Added coverage for the "prompt" +# case mode: the runner scores the case message verbatim without +# touching the soul, the judge picks up the optional `reference` block. +# Updated: 2026-05-21 — Added a regression test asserting `reference` is +# ignored outside prompt mode: a respond-mode case carrying `reference` +# still gets the plain judge prompt (user message present, no reference +# block). from __future__ import annotations @@ -395,3 +402,133 @@ async def test_seed_failure_reports_in_result() -> None: assert result.error is None assert result.cases == [] assert result.all_passed + + +# --------------------------------------------------------------------------- +# Prompt case mode (paw-workspace#47) +# --------------------------------------------------------------------------- + + +class CapturingEngine: + """FakeEngine variant that records every prompt it was asked to think on. + + Used to assert what the runner actually puts in front of the judge — + e.g. that prompt-mode cases score the case message verbatim and that a + `reference` is surfaced as its own block. + """ + + def __init__(self, response: str) -> None: + self._response = response + self.prompts: list[str] = [] + + async def think(self, prompt: str) -> str: + self.prompts.append(prompt) + return self._response + + +@pytest.mark.asyncio +async def test_prompt_mode_dispatches_without_engine() -> None: + """A prompt-mode case runs through the runner with no engine. + + The judge scorer skips (no engine), and crucially the runner never + touches the soul — so no birth/recall/respond machinery can raise. + """ + spec = _spec_with( + CaseInputs(message="some candidate text", mode="prompt"), + JudgeScoring(criteria="is the text free of AI tells?"), + ) + result = await run_eval(spec) + assert result.error is None + assert result.cases[0].skipped + assert result.cases[0].error is None + + +@pytest.mark.asyncio +async def test_prompt_mode_scores_message_verbatim() -> None: + """Prompt mode hands the case message straight to the scorer. + + A keyword scorer over the verbatim message passes — proving the + runner did not run the soul fallback (which would prepend its own + "[soul-eval fallback response]" text and other content). + """ + spec = _spec_with( + CaseInputs(message="the planner now retries failed tasks", mode="prompt"), + KeywordScoring(expected=["planner", "retries"], mode="all", threshold=1.0), + ) + result = await run_eval(spec) + assert result.cases[0].passed, result.cases[0].details + # The output the scorer saw is exactly the message — no soul fallback. + assert result.cases[0].output == "the planner now retries failed tasks" + assert "fallback" not in result.cases[0].output + + +@pytest.mark.asyncio +async def test_prompt_mode_judge_scores_with_engine() -> None: + """With an engine wired, a prompt-mode judge case scores and passes.""" + spec = _spec_with( + CaseInputs(message="a clean, human-sounding rewrite", mode="prompt"), + JudgeScoring(criteria="does the text read as human?", threshold=0.5), + ) + engine = FakeEngine('{"score": 0.8, "reasoning": "natural voice"}') + result = await run_eval(spec, engine=engine) + assert result.cases[0].passed + assert result.cases[0].score == pytest.approx(0.8) + assert not result.cases[0].skipped + + +@pytest.mark.asyncio +async def test_prompt_mode_reference_reaches_judge() -> None: + """When a prompt-mode case carries `reference`, the judge sees it. + + The judge prompt must include the reference under its own block so + criteria can compare a candidate output against the original text. + """ + spec = _spec_with( + CaseInputs( + message="Version 2 ships Tuesday with offline mode.", + mode="prompt", + reference="Version 2 stands as a testament to our enduring commitment.", + ), + JudgeScoring(criteria="did the rewrite drop the puffery?", threshold=0.5), + ) + engine = CapturingEngine('{"score": 0.9, "reasoning": "puffery gone"}') + result = await run_eval(spec, engine=engine) + assert result.cases[0].passed + assert len(engine.prompts) == 1 + judge_prompt = engine.prompts[0] + assert "Reference input" in judge_prompt + assert "testament to our enduring commitment" in judge_prompt + assert "Version 2 ships Tuesday with offline mode." in judge_prompt + + +@pytest.mark.asyncio +async def test_reference_ignored_outside_prompt_mode() -> None: + """A non-prompt case that sets `reference` must not use it. + + `CaseInputs.reference` is documented as ignored outside prompt mode. + A respond-mode case that carries `reference` must still get the plain + judge prompt — the one that includes the user's actual message — not + the reference template (which omits the message entirely). + """ + spec = _spec_with( + CaseInputs( + message="how is the rollout going?", + mode="respond", + reference="Our rollout stands as a testament to enduring commitment.", + ), + JudgeScoring(criteria="is the response useful?", threshold=0.5), + ) + engine = CapturingEngine('{"score": 0.8, "reasoning": "useful"}') + result = await run_eval(spec, engine=engine) + assert result.cases[0].passed, result.cases[0].details + # respond mode calls the engine twice: once to generate the response, + # once for the judge. The judge prompt is the one carrying the scoring + # rubric — pick it by that signature rather than by index. + judge_prompts = [p for p in engine.prompts if "Score the output" in p] + assert len(judge_prompts) == 1 + judge_prompt = judge_prompts[0] + # The plain judge template is used: user message present, no reference. + assert "Agent input:" in judge_prompt + assert "how is the rollout going?" in judge_prompt + assert "Reference input" not in judge_prompt + assert "testament to enduring commitment" not in judge_prompt diff --git a/tests/test_eval/test_schema.py b/tests/test_eval/test_schema.py index 6fa52a4..8713cb8 100644 --- a/tests/test_eval/test_schema.py +++ b/tests/test_eval/test_schema.py @@ -2,6 +2,8 @@ # Created: 2026-04-29 — Covers EvalSpec parsing, scoring discriminator, # error cases (invalid threshold, unknown scoring kind, missing required # fields). Validates that all five scoring kinds round-trip cleanly. +# Updated: 2026-05-21 (paw-workspace#47) — Added smoke coverage for the +# "prompt" case mode and the optional `reference` input field. from __future__ import annotations @@ -188,6 +190,49 @@ def test_state_seed_mood_invalid_raises() -> None: parse_eval_spec(data) +# --------------------------------------------------------------------------- +# Prompt case mode (paw-workspace#47) +# --------------------------------------------------------------------------- + + +def test_prompt_mode_parses() -> None: + """A case with mode=prompt validates and keeps the mode.""" + data = _minimal_dict({"kind": "judge", "criteria": "is it humanized?"}) + data["cases"][0]["inputs"] = {"message": "some prompt text", "mode": "prompt"} + spec = parse_eval_spec(data) + assert spec.cases[0].inputs.mode == "prompt" + assert spec.cases[0].inputs.reference is None + + +def test_prompt_mode_reference_field_parses() -> None: + """The optional `reference` field round-trips on a prompt-mode case.""" + data = _minimal_dict({"kind": "judge", "criteria": "did it improve the text?"}) + data["cases"][0]["inputs"] = { + "message": "the humanized rewrite", + "mode": "prompt", + "reference": "the original AI-slop input", + } + spec = parse_eval_spec(data) + inputs = spec.cases[0].inputs + assert inputs.mode == "prompt" + assert inputs.reference == "the original AI-slop input" + + +def test_default_mode_is_respond() -> None: + """Existing specs that omit `mode` still default to respond.""" + spec = parse_eval_spec(_minimal_dict({"kind": "keyword", "expected": ["x"]})) + assert spec.cases[0].inputs.mode == "respond" + assert spec.cases[0].inputs.reference is None + + +def test_unknown_mode_rejected() -> None: + """A bogus mode value is a validation error, not a silent pass.""" + data = _minimal_dict({"kind": "keyword", "expected": ["x"]}) + data["cases"][0]["inputs"] = {"message": "hi", "mode": "teleport"} + with pytest.raises(SchemaValidationError): + parse_eval_spec(data) + + # --------------------------------------------------------------------------- # load_eval_spec end-to-end # ---------------------------------------------------------------------------