qbtrix · prakashUXtech · May 21, 2026 · May 21, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -17,6 +17,8 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ### Added
 
+- **`prompt` eval case mode — score prompts and skill outputs (paw-workspace#47)** — the eval framework used to evaluate one thing: a seeded soul. It now also scores a plain prompt or the output of a skill. A case with `mode: prompt` skips the soul completely — no birth, no context, the `seed` block is ignored — and hands the case's `message` straight to the scorer. The new optional `CaseInputs.reference` field holds the text a skill was originally given; when it is set, the `judge` scorer puts it in front of the LLM as its own "Reference input" block, so the criteria can ask whether a candidate output improved on where it started rather than judging it cold. There is no new scoring kind: prompt and skill outputs go through the existing `JudgeScoring`. There is no new CLI command or flag either — `soul eval` runs a prompt-mode spec the same way it runs a soul spec. New reference spec `tests/eval_examples/humanizer_skill.yaml` scores the workspace `/humanize` skill: a deterministic `regex` gate that runs with no engine, plus four `judge` cases that check a humanized rewrite dropped its AI tells and kept the meaning. Docs: `eval-format.md` gains a "Prompt mode" section; `cli-reference.md` and `api-reference.md` cover the new mode and a Case modes table. This is the read-side of the workspace prompt-evaluation pair — a way to catch it when an edit to a tracked skill makes its output worse.
+
 - **Soul optimize / autoresearch (#142)** — autonomous self-improvement loop that pairs with the soul-aware eval framework (#160). The soul runs an eval against itself, identifies failing cases, proposes targeted changes to its own behaviour-shaping "knobs" (OCEAN traits, persona text, memory thresholds, bond strength), keeps changes that improve the eval score, and reverts the rest. New `soul_protocol.optimize` module: `optimize()` entry point, `OptimizeRunner` class with custom knob registration, `Knob` protocol plus four built-in knobs (`OceanTraitKnob` ±0.1/±0.2 within [0,1] per OCEAN dimension; `PersonaTextKnob` LLM-driven persona rephrasings with heuristic no-op fallback; `SignificanceThresholdKnob` for `MemorySettings.importance_threshold` ±1 plus the `skip_deep_processing_on_low_significance` flip; `BondThresholdKnob` for default bond strength ±5/±10), `Proposer` (LLM-assisted with heuristic fallback when no engine or unparseable response), `OptimizeResult`/`OptimizeStep` Pydantic models. Defaults to dry-run (`apply=False`) — every change applied during the run is reverted at the end and no trust chain entries are written; the soul stays byte-identical. With `apply=True` the runner keeps the winning trajectory and appends one `soul.optimize.applied` trust chain entry per kept change with payload `{knob_name, before, after, score_delta}`. Reverted proposals never write entries either way. New `soul optimize <soul-path> <eval.yaml>` CLI command (`--iterations`, `--target`, `--apply`, `--engine`, `--json`) and `soul_optimize` MCP tool with the same surface. Pairs naturally with #160 — without the eval, "improvement" is a vibe; with the eval, it's a number that goes up. Full doc at `docs/soul-optimize.md`.
 
 - **Graph traversal + typed entity ontology (#108, #190)** — entities now carry one of eight built-in kinds (`person`, `place`, `org`, `concept`, `tool`, `document`, `event`, `relation`) plus open-string extension. Eight matching relation predicates (`mentions`, `related`, `depends_on`, `contributes_to`, `causes`, `follows`, `supersedes`, `owned_by`) ship as `RelationType` with the same open contract. The cognitive engine's `extract_entities` prompt asks for the typed ontology plus a `relations` array per entity with `{target, relation, weight}` triples; heuristic-only souls keep working through a translation table that maps legacy types. New `Soul.graph` returns a `GraphView` with `nodes()`, `edges()`, `neighbors()`, `path()`, `subgraph()`, `to_mermaid()`, `reachable()`, `stats()`. `Soul.recall` accepts `graph_walk={"start": entity_id, "depth": 2, "edge_types": [...]}` plus `page_token` and `token_budget` for pagination + L0-abstract fallback under budget pressure; new `RecallResults` list subclass carries `next_page_token`, `total_estimate`, `truncated_for_budget` (legacy callers still get `list[MemoryEntry]`). Trust chain hooks: `Soul.observe()` appends `graph.entity_added` and `graph.relation_added` entries for net-new entities/edges. New `soul graph` CLI group (`nodes`/`edges`/`neighbors`/`path`/`mermaid`, all with `--json`) and `soul_graph_query` MCP tool. In-memory dict + adjacency-list storage with `to_dict`/`from_dict` round-trip; pre-0.5.0 graphs load cleanly. Heuristic third-person relation edges (e.g. "Alice knows Bob") now flow through to the graph instead of being dropped.

diff --git a/docs/api-reference.md b/docs/api-reference.md
@@ -16,6 +16,9 @@
        soul_protocol.eval module — EvalSpec, EvalCase, EvalResult, CaseResult, the five
        scoring kinds (keyword/regex/semantic/judge/structural), and run_eval /
        run_eval_against_soul / run_eval_file entry points.
+     Updated: 2026-05-21 (paw-workspace#47): Added the Case modes table to the
+       Evaluation section — documents the new `prompt` mode (scores a verbatim
+       prompt or skill output, soul skipped) and the `reference` input field.
      Updated: 2026-04-27 — Documented user-driven memory update primitives: Soul.forget_one
        (audited single-id delete), Soul.supersede (write new memory + link old.superseded_by),
        Soul.supersede_audit property. Rewrote stale soul.forget() entry to match the real
@@ -1704,6 +1707,18 @@ from soul_protocol.eval import (
 - `run_eval_file(path, *, engine=None, case_filter=None) -> EvalResult` — convenience wrapper that loads then runs.
 - `run_eval_against_soul(spec, soul, *, engine=None, case_filter=None) -> EvalResult` — run cases against an existing `Soul` without re-birthing. Used by the `soul_eval` MCP tool. The `seed` block is ignored — the soul's live state is the seed.
 
+### Case modes
+
+`CaseInputs.mode` selects how the runner produces the text the scorer sees:
+
+| Mode | What runs | Output scored |
+|------|-----------|---------------|
+| `respond` (default) | Soul produces a reply via `context_for` + the engine | The reply |
+| `recall` | `Soul.recall(query=message, ...)` | The recalled memories, rendered as text |
+| `prompt` | Nothing — the soul is skipped, `seed` is ignored | `inputs.message`, verbatim |
+
+`prompt` mode scores a standalone prompt or skill output. Set `inputs.reference` (prompt-mode only) to the pre-transform text, and the `judge` scorer shows it to the LLM as a separate "Reference input" block so the criteria can compare the candidate against it. See [eval-format.md](eval-format.md#prompt-mode-scoring-prompts-and-skills).
+
 ### Result models
 
 `EvalResult`:

diff --git a/docs/cli-reference.md b/docs/cli-reference.md
@@ -18,6 +18,9 @@
        Runs cases against a soul seeded with explicit state (memories, OCEAN, bonds, mood,
        energy). Supports keyword / regex / semantic / judge / structural scoring. --json,
        --filter, --judge-engine, --verbose options. Exits 1 on any failure. Count: 47 → 48.
+     Updated: 2026-05-21 (paw-workspace#47): `soul eval` also scores prompts and skill
+       outputs — a `mode: prompt` case skips the soul and scores the case text verbatim.
+       No new command or flag; the `humanizer_skill.yaml` reference spec evaluates /humanize.
      Updated: 2026-04-29 — v0.4.0 (#42): Added `soul verify` and `soul audit` for trust-chain
        integrity checks and signed-action timelines. Both support --json. `soul verify` exits
        1 on a tampered chain. Count: 45 → 47.
@@ -1776,7 +1779,9 @@ Payloads are stored as hashes only — the table shows *what changed when*, not
 
 ### `soul eval`
 
-Run YAML-driven soul-aware evals against a freshly seeded soul. The eval framework lets you pin the soul's state (memories, OCEAN, bonds, mood, energy) before each test runs, so you can measure memory-driven behaviour rather than just stateless input-output. See [eval-format.md](eval-format.md) for the full schema.
+Run YAML-driven soul-aware evals against a freshly seeded soul. The eval framework lets you pin the soul's state (memories, OCEAN, bonds, mood, energy) before each test runs, so you can measure memory-driven behaviour rather than just stateless input-output.
+
+It also scores plain prompts and skill outputs. A case with `mode: prompt` skips the soul and scores the case's `message` verbatim. Point it at a workspace prompt or a skill's output (for example `/humanize`) to catch regressions when that prompt or skill changes. See [eval-format.md](eval-format.md) for the full schema, including the `prompt` mode and the `humanizer_skill.yaml` reference spec.
 
 ```bash
 soul eval <path>
@@ -1809,6 +1814,10 @@ soul eval tests/eval_examples/                                  # all .yaml in d
 soul eval tests/eval_examples/ --filter "creative"
 soul eval my_eval.yaml --json | jq '.specs[].cases'
 soul eval my_eval.yaml --judge-engine my_module:make_engine
+
+# Score the /humanize skill (prompt-mode spec). The judge cases need an
+# engine; without one they SKIP and only the deterministic checks run.
+soul eval tests/eval_examples/humanizer_skill.yaml --judge-engine my_module:make_engine
 ```
 
 **Output:** one Rich table per spec (Case, Status, Score, Time, optional Details), plus a summary footer with totals. `--json` returns `{specs: [...], duration_ms, pass_count, fail_count, skip_count, error_count}`.
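
The deterministic half of a prompt-mode spec needs no engine. As a rough illustration of what a `regex` gate over a humanized rewrite checks, here is a minimal stdlib sketch. The function name `regex_gate` and the tell patterns are hypothetical, invented for this example; they are not the patterns shipped in `humanizer_skill.yaml`, and this is not the framework's scorer.

```python
import re
from typing import Sequence

# Hypothetical "AI tell" patterns, invented for illustration only.
AI_TELL_PATTERNS = (
    r"\bstands as (?:a|an) \w+ testament\b",
    r"\bdelve into\b",
    r"\bin today's fast-paced world\b",
)


def regex_gate(text: str, forbidden: Sequence[str] = AI_TELL_PATTERNS) -> list:
    """Return the forbidden patterns that match `text` (empty list means pass)."""
    return [p for p in forbidden if re.search(p, text, flags=re.IGNORECASE)]


# Sample texts taken from the prompt-mode YAML example in eval-format.md.
reference = "Version 2.0 stands as an enduring testament to our commitment."
candidate = "Version 2.0 shipped Tuesday with offline mode."

reference_hits = regex_gate(reference)  # the original text trips one pattern
candidate_hits = regex_gate(candidate)  # the rewrite trips none
```

A gate like this is cheap enough to run on every edit to a skill, with the `judge` cases reserved for runs where an engine is available.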

diff --git a/docs/eval-format.md b/docs/eval-format.md
@@ -2,7 +2,11 @@
      Created: 2026-04-29 — Documents the YAML schema, scoring kinds,
        runner contract, and CLI / MCP entry points for soul-aware evals.
        Companion to docs/api-reference.md (EvalSpec, EvalResult,
-       run_eval) and docs/cli-reference.md (`soul eval`). -->
+       run_eval) and docs/cli-reference.md (`soul eval`).
+     Updated: 2026-05-21 (paw-workspace#47) — Documented the `prompt`
+       case mode, which scores a verbatim prompt or skill output without
+       a soul. Used to evaluate workspace prompts and skills (/humanize).
+       Companion example: tests/eval_examples/humanizer_skill.yaml. -->
 
 # Soul-aware Eval Format
 
@@ -17,6 +21,12 @@ interactions, current mood and energy. The same prompt to a soul that's
 than to one that's "energetic with high bond strength" — and that's the
 entire point of the protocol.
 
+The format also handles a stateless case. A `prompt`-mode case scores a
+verbatim prompt or skill output directly, with no soul involved. That is
+how you point the same harness at workspace prompts and skills — for
+example scoring the `/humanize` skill's output for AI tells. See
+[Prompt mode](#prompt-mode-scoring-prompts-and-skills) below.
+
 This page documents the schema and the runner. For the CLI command see
 [cli-reference.md](cli-reference.md#soul-eval). For the MCP tool see
 [mcp-server.md](mcp-server.md#soul_eval). For Python API access see
@@ -97,10 +107,11 @@ EvalSpec
         │   ├── message: str            # required
         │   ├── user_id: str | null
         │   ├── domain: str | null
-        │   ├── mode: "respond" | "recall"
+        │   ├── mode: "respond" | "recall" | "prompt"
         │   ├── observe: bool            # default false
         │   ├── recall_limit: int        # default 5
-        │   └── recall_layer: str | null
+        │   ├── recall_layer: str | null
+        │   └── reference: str | null    # prompt mode — the pre-transform text
         └── scoring: Scoring             # see below
 ```
 
@@ -130,16 +141,62 @@ also queryable via `inputs.recall_layer`.
 A case has three parts:
 
 1. **Mode** — `respond` (the soul produces a reply via context_for + the
-   engine) or `recall` (`Soul.recall(query=message, ...)`).
+   engine), `recall` (`Soul.recall(query=message, ...)`), or `prompt`
+   (the soul is skipped; `message` is scored verbatim — see
+   [Prompt mode](#prompt-mode-scoring-prompts-and-skills)).
 2. **Inputs** — message, optional `user_id` (multi-user routing),
-   optional `domain` (for v0.4.0 domain isolation), and recall knobs.
+   optional `domain` (for v0.4.0 domain isolation), recall knobs, and the
+   prompt-mode `reference`.
 3. **Scoring** — one of the five kinds below. The `kind` field is the
    discriminator; Pydantic resolves the right scorer at parse time.
 
 `observe: true` runs `Soul.observe()` after producing the response, so
 the soul's state mutates. By default `observe: false` keeps the state
 identical to the seed across cases — recommended for deterministic
-evals.
+evals. `observe` does nothing in `prompt` mode (there is no soul to
+observe).
+
+### Prompt mode — scoring prompts and skills
+
+Most cases drive a soul. A `prompt`-mode case does not: it takes the
+case's `message` as a verbatim string — a prompt, or the output of a
+skill — and hands it straight to the scorer. No soul is birthed, no
+context is built, the `seed` block is ignored.
+
+This is the path for evaluating the workspace's own prompts and skills.
+The motivating case is `/humanize`: feed the skill's output in as
+`message`, describe the qualities a good humanized text should have in a
+`judge` block, and the eval tells you whether an edit to the skill made
+its output better or worse.
+
+```yaml
+cases:
+  - name: "rewrite drops the puffery"
+    inputs:
+      mode: prompt
+      # `reference` — the original text the skill was given.
+      reference: |
+        Version 2.0 stands as an enduring testament to our commitment.
+      # `message` — the candidate output to score.
+      message: |
+        Version 2.0 shipped Tuesday with offline mode.
+    scoring:
+      kind: judge
+      criteria: |
+        The candidate output should state plainly what changed, with no
+        significance-inflation language. It should keep the facts and
+        stay shorter than the reference.
+```
+
+`reference` is optional and prompt-mode only. When set, the `judge`
+scorer shows it to the LLM as a separate "Reference input" block, so
+criteria can ask whether the candidate improved on the original rather
+than judging it in isolation. Any scoring kind works in prompt mode:
+`regex` and `keyword` give you deterministic gates that run without an
+engine, but `judge` is the natural fit for "is this output good?"
+
+The shipped reference spec is
+[`tests/eval_examples/humanizer_skill.yaml`](../tests/eval_examples/humanizer_skill.yaml).
 
 ## Scoring kinds
 
@@ -309,4 +366,5 @@ a follow-up the optimizer would benefit from, file an issue against it.
 - [api-reference.md](api-reference.md#evaluation) — Python API
 - [cli-reference.md](cli-reference.md#soul-eval) — `soul eval` command
 - [mcp-server.md](mcp-server.md#soul_eval) — `soul_eval` MCP tool
-- `tests/eval_examples/` — five shipped example specs
+- `tests/eval_examples/` — shipped example specs, including
+  `humanizer_skill.yaml` for the prompt-mode `/humanize` eval
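
To make the role of `reference` concrete, here is a minimal sketch of how a judge prompt could place the reference in its own block ahead of the candidate. `build_judge_prompt` and the overall prompt layout are assumptions for illustration; the real `judge` scorer's prompt text is not part of this change. Only the "Reference input" label comes from the docs above.

```python
from typing import Optional


def build_judge_prompt(
    criteria: str, candidate: str, reference: Optional[str] = None
) -> str:
    """Assemble a judge prompt; the reference block appears only when set."""
    sections = [f"Criteria:\n{criteria.strip()}"]
    if reference is not None:
        # A separate block lets criteria say "improved on the reference".
        sections.append(f"Reference input:\n{reference.strip()}")
    sections.append(f"Candidate output:\n{candidate.strip()}")
    return "\n\n".join(sections)


with_ref = build_judge_prompt(
    criteria="State plainly what changed; stay shorter than the reference.",
    candidate="Version 2.0 shipped Tuesday with offline mode.",
    reference="Version 2.0 stands as an enduring testament to our commitment.",
)
without_ref = build_judge_prompt(
    criteria="State plainly what changed.",
    candidate="Version 2.0 shipped Tuesday with offline mode.",
)
```

Keeping the reference out of the candidate block matters for criteria such as "shorter than the reference": the judge has to know which text is which.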
diff --git a/src/soul_protocol/eval/runner.py b/src/soul_protocol/eval/runner.py
@@ -4,6 +4,11 @@
 #   either drives the soul into producing a response (mode="respond") or
 #   calls Soul.recall() (mode="recall"), captures state snapshots, and
 #   delegates to the scoring module.
+# Updated: 2026-05-21 (paw-workspace#47) — Added the "prompt" case mode.
+#   Prompt-mode cases skip the soul entirely: the runner takes the case's
+#   `message` as verbatim text (a prompt or a skill output) and hands it to
+#   the scorer. This lets the framework evaluate workspace prompts and skills
+#   such as /humanize through the existing JudgeScoring kind.
 #
 # The "respond" path is the interesting one. soul-protocol does not own a
 # response generator — that's the consumer's job — so the runner builds the
@@ -213,7 +218,17 @@ async def _run_case(
     mood_before = soul.state.mood
     energy_before = soul.state.energy
 
-    if inputs.mode == "recall":
+    if inputs.mode == "prompt":
+        # Prompt mode — the soul is not involved at all. The case's
+        # ``message`` is the verbatim text under evaluation (a prompt, or a
+        # skill's output). Hand it straight to the scorer. The judge scorer
+        # picks up ``inputs.reference`` separately when present.
+        execution = CaseExecution(
+            output_text=inputs.message,
+            mood_before=mood_before,
+            energy_before=energy_before,
+        )
+    elif inputs.mode == "recall":
         layer = inputs.recall_layer
         mtypes: list[MemoryType] | None = None
         if layer:
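
The branch added in this hunk can be read as a three-way dispatch on `inputs.mode`. The following is a simplified, self-contained sketch of that idea, not the runner's code: `Inputs` is a stand-in dataclass for `CaseInputs`, and `respond_fn` / `recall_fn` are hypothetical callables standing in for the soul-backed paths.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Inputs:
    """Stand-in for CaseInputs; only the fields this sketch needs."""

    message: str
    mode: str = "respond"  # "respond" | "recall" | "prompt"


def select_output_text(
    inputs: Inputs,
    respond_fn: Callable[[str], str],
    recall_fn: Callable[[str], List[str]],
) -> str:
    """Pick the text the scorer sees for each case mode."""
    if inputs.mode == "prompt":
        # No soul involved: the message itself is the text under evaluation.
        return inputs.message
    if inputs.mode == "recall":
        # One recalled entry per line, as the docs describe for text scorers.
        return "\n".join(recall_fn(inputs.message))
    return respond_fn(inputs.message)


calls: List[str] = []


def fake_respond(msg: str) -> str:
    calls.append("respond")
    return f"reply to: {msg}"


def fake_recall(msg: str) -> List[str]:
    calls.append("recall")
    return ["memory one", "memory two"]


prompt_out = select_output_text(
    Inputs("Shipped Tuesday.", mode="prompt"), fake_respond, fake_recall
)
recall_out = select_output_text(
    Inputs("release", mode="recall"), fake_respond, fake_recall
)
```

In the sketch, the prompt branch returns before either callable is touched, which mirrors the documented guarantee that prompt-mode cases do not interact with the soul.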

diff --git a/src/soul_protocol/eval/schema.py b/src/soul_protocol/eval/schema.py
@@ -3,6 +3,13 @@
 #   union (keyword | regex | semantic | judge | structural). Evals are written
 #   in YAML; this module parses and validates them. The runner consumes the
 #   resulting Pydantic models and drives Soul.observe/recall/respond.
+# Updated: 2026-05-21 (paw-workspace#47) — Added a third case mode, "prompt".
+#   In prompt mode the case input is a verbatim prompt or skill output (not a
+#   soul recall/respond); the runner scores that text directly without
+#   touching the soul. Lets the eval framework score workspace prompts and
+#   skills (e.g. /humanize) alongside seeded-soul behaviour. The optional
+#   `reference` field carries the original input a skill transformed, so a
+#   judge case can compare a candidate output against where it started.
 #
 # Design note: we keep the schema deliberately small. Anything the soul
 # already exposes (Personality, Mood, MemoryType) is referenced directly so
@@ -248,32 +255,46 @@ class StructuralScoring(_ScoringBase):
 class CaseInputs(BaseModel):
     """Input for a single case.
 
-    Two modes:
+    Three modes:
 
     - ``mode="respond"`` (default) — runner builds a system prompt + context
       block from the soul, asks the engine for a reply to ``message``, and
       hands the reply to the scorer.
     - ``mode="recall"`` — runner calls ``Soul.recall(query=message, ...)``
       and hands the result list to the scorer (rendered as one entry per
       line for keyword/semantic/judge; full list for structural).
+    - ``mode="prompt"`` — the soul is left untouched. ``message`` is treated
+      as a verbatim prompt or skill output and handed straight to the
+      scorer. This is how the framework evaluates workspace prompts and
+      skills (e.g. ``/humanize``): the YAML carries the text under test and
+      a :class:`JudgeScoring` block describes the qualities a good output
+      should have. The ``seed`` block is ignored for prompt-mode cases.
+
+    ``reference`` — optional. In prompt mode it carries the *original* text
+    a skill was meant to transform (e.g. the AI-slop input before
+    ``/humanize`` ran). The judge scorer shows it as a "Reference input"
+    block so criteria can ask whether the candidate improved on it.
+    Ignored outside prompt mode.
 
     ``observe`` (default false) — when true, the runner additionally calls
     ``Soul.observe()`` after generating the response, so subsequent cases
     in the same spec see the updated state. Defaults to false because evals
     should be deterministic and memory mutations between cases make that
-    harder.
+    harder. ``observe`` has no effect in prompt mode (no soul interaction).
     """
 
     model_config = ConfigDict(extra="forbid")
 
     message: str
     user_id: str | None = None
     domain: str | None = None
-    mode: Literal["respond", "recall"] = "respond"
+    mode: Literal["respond", "recall", "prompt"] = "respond"
     observe: bool = False
     # recall-mode specific knobs
     recall_limit: int = 5
     recall_layer: str | None = None
+    # prompt-mode specific knob — the original text a skill transformed
+    reference: str | None = None
 
 
 class EvalCase(BaseModel):