2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -17,6 +17,8 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

### Added

- **`prompt` eval case mode — score prompts and skill outputs (paw-workspace#47)** — the eval framework used to evaluate one thing: a seeded soul. It now also scores a plain prompt or the output of a skill. A case with `mode: prompt` skips the soul completely — no birth, no context, the `seed` block is ignored — and hands the case's `message` straight to the scorer. The new optional `CaseInputs.reference` field holds the text a skill was originally given; when it is set, the `judge` scorer puts it in front of the LLM as its own "Reference input" block, so the criteria can ask whether a candidate output improved on where it started rather than judging it cold. There is no new scoring kind: prompt and skill outputs go through the existing `JudgeScoring`. There is no new CLI command or flag either — `soul eval` runs a prompt-mode spec the same way it runs a soul spec. New reference spec `tests/eval_examples/humanizer_skill.yaml` scores the workspace `/humanize` skill: a deterministic `regex` gate that runs with no engine, plus four `judge` cases that check a humanized rewrite dropped its AI tells and kept the meaning. Docs: `eval-format.md` gains a "Prompt mode" section; `cli-reference.md` and `api-reference.md` cover the new mode and a Case modes table. This is the read-side of the workspace prompt-evaluation pair — a way to catch it when an edit to a tracked skill makes its output worse.

- **Soul optimize / autoresearch (#142)** — autonomous self-improvement loop that pairs with the soul-aware eval framework (#160). The soul runs an eval against itself, identifies failing cases, proposes targeted changes to its own behaviour-shaping "knobs" (OCEAN traits, persona text, memory thresholds, bond strength), keeps changes that improve the eval score, and reverts the rest. New `soul_protocol.optimize` module: `optimize()` entry point, `OptimizeRunner` class with custom knob registration, `Knob` protocol plus four built-in knobs (`OceanTraitKnob` ±0.1/±0.2 within [0,1] per OCEAN dimension; `PersonaTextKnob` LLM-driven persona rephrasings with heuristic no-op fallback; `SignificanceThresholdKnob` for `MemorySettings.importance_threshold` ±1 plus the `skip_deep_processing_on_low_significance` flip; `BondThresholdKnob` for default bond strength ±5/±10), `Proposer` (LLM-assisted with heuristic fallback when no engine or unparseable response), `OptimizeResult`/`OptimizeStep` Pydantic models. Defaults to dry-run (`apply=False`) — every change applied during the run is reverted at the end and no trust chain entries are written; the soul stays byte-identical. With `apply=True` the runner keeps the winning trajectory and appends one `soul.optimize.applied` trust chain entry per kept change with payload `{knob_name, before, after, score_delta}`. Reverted proposals never write entries either way. New `soul optimize <soul-path> <eval.yaml>` CLI command (`--iterations`, `--target`, `--apply`, `--engine`, `--json`) and `soul_optimize` MCP tool with the same surface. Pairs naturally with #160 — without the eval, "improvement" is a vibe; with the eval, it's a number that goes up. Full doc at `docs/soul-optimize.md`.

- **Graph traversal + typed entity ontology (#108, #190)** — entities now carry one of eight built-in kinds (`person`, `place`, `org`, `concept`, `tool`, `document`, `event`, `relation`) plus open-string extension. Eight matching relation predicates (`mentions`, `related`, `depends_on`, `contributes_to`, `causes`, `follows`, `supersedes`, `owned_by`) ship as `RelationType` with the same open contract. The cognitive engine's `extract_entities` prompt asks for the typed ontology plus a `relations` array per entity with `{target, relation, weight}` triples; heuristic-only souls keep working through a translation table that maps legacy types. New `Soul.graph` returns a `GraphView` with `nodes()`, `edges()`, `neighbors()`, `path()`, `subgraph()`, `to_mermaid()`, `reachable()`, `stats()`. `Soul.recall` accepts `graph_walk={"start": entity_id, "depth": 2, "edge_types": [...]}` plus `page_token` and `token_budget` for pagination + L0-abstract fallback under budget pressure; new `RecallResults` list subclass carries `next_page_token`, `total_estimate`, `truncated_for_budget` (legacy callers still get `list[MemoryEntry]`). Trust chain hooks: `Soul.observe()` appends `graph.entity_added` and `graph.relation_added` entries for net-new entities/edges. New `soul graph` CLI group (`nodes`/`edges`/`neighbors`/`path`/`mermaid`, all with `--json`) and `soul_graph_query` MCP tool. In-memory dict + adjacency-list storage with `to_dict`/`from_dict` round-trip; pre-0.5.0 graphs load cleanly. Heuristic third-person relation edges (e.g. "Alice knows Bob") now flow through to the graph instead of being dropped.
15 changes: 15 additions & 0 deletions docs/api-reference.md
@@ -16,6 +16,9 @@
soul_protocol.eval module — EvalSpec, EvalCase, EvalResult, CaseResult, the five
scoring kinds (keyword/regex/semantic/judge/structural), and run_eval /
run_eval_against_soul / run_eval_file entry points.
Updated: 2026-05-21 (paw-workspace#47): Added the Case modes table to the
Evaluation section — documents the new `prompt` mode (scores a verbatim
prompt or skill output, soul skipped) and the `reference` input field.
Updated: 2026-04-27 — Documented user-driven memory update primitives: Soul.forget_one
(audited single-id delete), Soul.supersede (write new memory + link old.superseded_by),
Soul.supersede_audit property. Rewrote stale soul.forget() entry to match the real
@@ -1704,6 +1707,18 @@ from soul_protocol.eval import (
- `run_eval_file(path, *, engine=None, case_filter=None) -> EvalResult` — convenience wrapper that loads then runs.
- `run_eval_against_soul(spec, soul, *, engine=None, case_filter=None) -> EvalResult` — run cases against an existing `Soul` without re-birthing. Used by the `soul_eval` MCP tool. The `seed` block is ignored — the soul's live state is the seed.

### Case modes

`CaseInputs.mode` selects how the runner produces the text the scorer sees:

| Mode | What runs | Output scored |
|------|-----------|---------------|
| `respond` (default) | Soul produces a reply via `context_for` + the engine | The reply |
| `recall` | `Soul.recall(query=message, ...)` | The recalled memories, rendered as text |
| `prompt` | Nothing — the soul is skipped, `seed` is ignored | `inputs.message`, verbatim |

`prompt` mode scores a standalone prompt or skill output. Set `inputs.reference` (prompt-mode only) to the pre-transform text so that a `judge` case can compare the candidate against it. See [eval-format.md](eval-format.md#prompt-mode-scoring-prompts-and-skills).
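
The table above can be read as a small dispatch on `mode`. The sketch below is not the library's runner; it is a stand-alone illustration built on stdlib dataclasses. `SketchInputs`, `text_to_score`, `fake_respond`, and `fake_recall` are hypothetical names introduced here, and the one-entry-per-line rendering for `recall` is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional

Mode = Literal["respond", "recall", "prompt"]


@dataclass
class SketchInputs:
    """Hypothetical stand-in for CaseInputs; only the fields used below."""

    message: str
    mode: Mode = "respond"
    reference: Optional[str] = None  # meaningful in prompt mode only


def text_to_score(
    inputs: SketchInputs,
    respond: Callable[[str], str],
    recall: Callable[[str], list],
) -> str:
    """Return the text a scorer would see for each mode in the table."""
    if inputs.mode == "prompt":
        # Soul skipped: the message itself is the output under test.
        return inputs.message
    if inputs.mode == "recall":
        # Recalled memories rendered as text, one entry per line.
        return "\n".join(str(m) for m in recall(inputs.message))
    # Default ("respond"): whatever reply the soul plus engine produced.
    return respond(inputs.message)


def fake_respond(message: str) -> str:
    # Placeholder for a soul + engine reply.
    return "reply to: " + message


def fake_recall(message: str) -> list:
    # Placeholder for Soul.recall results.
    return ["memory A", "memory B"]


case = SketchInputs(message="Version 2.0 shipped Tuesday.", mode="prompt")
scored = text_to_score(case, fake_respond, fake_recall)
```

In this sketch a prompt-mode case never calls `respond` or `recall`, which is the property the table describes as "the soul is skipped".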

### Result models

`EvalResult`:
11 changes: 10 additions & 1 deletion docs/cli-reference.md
@@ -18,6 +18,9 @@
Runs cases against a soul seeded with explicit state (memories, OCEAN, bonds, mood,
energy). Supports keyword / regex / semantic / judge / structural scoring. --json,
--filter, --judge-engine, --verbose options. Exits 1 on any failure. Count: 47 → 48.
Updated: 2026-05-21 (paw-workspace#47): `soul eval` also scores prompts and skill
outputs — a `mode: prompt` case skips the soul and scores the case text verbatim.
No new command or flag; the `humanizer_skill.yaml` reference spec evaluates /humanize.
Updated: 2026-04-29 — v0.4.0 (#42): Added `soul verify` and `soul audit` for trust-chain
integrity checks and signed-action timelines. Both support --json. `soul verify` exits
1 on a tampered chain. Count: 45 → 47.
@@ -1776,7 +1779,9 @@ Payloads are stored as hashes only — the table shows *what changed when*, not

### `soul eval`

Run YAML-driven soul-aware evals against a freshly seeded soul. The eval framework lets you pin the soul's state (memories, OCEAN, bonds, mood, energy) before each test runs, so you can measure memory-driven behaviour rather than just stateless input-output. See [eval-format.md](eval-format.md) for the full schema.
Run YAML-driven soul-aware evals against a freshly seeded soul. The eval framework lets you pin the soul's state (memories, OCEAN, bonds, mood, energy) before each test runs, so you can measure memory-driven behaviour rather than just stateless input-output.

It also scores plain prompts and skill outputs. A case with `mode: prompt` skips the soul and scores the case text verbatim — point it at a workspace prompt or a skill's output (for example `/humanize`) to catch regressions when that prompt or skill changes. See [eval-format.md](eval-format.md) for the full schema, including the `prompt` mode and the `humanizer_skill.yaml` reference spec.

```bash
soul eval <path>
@@ -1809,6 +1814,10 @@ soul eval tests/eval_examples/ # all .yaml in d
soul eval tests/eval_examples/ --filter "creative"
soul eval my_eval.yaml --json | jq '.specs[].cases'
soul eval my_eval.yaml --judge-engine my_module:make_engine

# Score the /humanize skill (prompt-mode spec). The judge cases need an
# engine; without one they SKIP and only the deterministic checks run.
soul eval tests/eval_examples/humanizer_skill.yaml --judge-engine my_module:make_engine
```

**Output:** one Rich table per spec (Case, Status, Score, Time, optional Details), plus a summary footer with totals. `--json` returns `{specs: [...], duration_ms, pass_count, fail_count, skip_count, error_count}`.
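
A CI script can consume the `--json` output directly. The sketch below relies only on the top-level keys listed above. The sample payload and the `summarize` helper are illustrative, not real `soul eval` output, and the sketch assumes every case lands in exactly one of the four counters.

```python
import json

# Illustrative payload using only the documented top-level keys.
sample = json.loads("""
{"specs": [], "duration_ms": 1200,
 "pass_count": 4, "fail_count": 1, "skip_count": 2, "error_count": 0}
""")


def summarize(report: dict) -> tuple:
    """Return (total cases, ok); ok here means no failures and no errors."""
    counters = ("pass_count", "fail_count", "skip_count", "error_count")
    total = sum(report[key] for key in counters)
    ok = report["fail_count"] == 0 and report["error_count"] == 0
    return total, ok


total, ok = summarize(sample)
```

Treating skipped cases as non-fatal is a choice made for this sketch; a stricter pipeline could also require `skip_count == 0` so that judge cases skipped for lack of an engine do not go unnoticed.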
72 changes: 65 additions & 7 deletions docs/eval-format.md
@@ -2,7 +2,11 @@
Created: 2026-04-29 — Documents the YAML schema, scoring kinds,
runner contract, and CLI / MCP entry points for soul-aware evals.
Companion to docs/api-reference.md (EvalSpec, EvalResult,
run_eval) and docs/cli-reference.md (`soul eval`). -->
run_eval) and docs/cli-reference.md (`soul eval`).
Updated: 2026-05-21 (paw-workspace#47) — Documented the `prompt`
case mode, which scores a verbatim prompt or skill output without
a soul. Used to evaluate workspace prompts and skills (/humanize).
Companion example: tests/eval_examples/humanizer_skill.yaml. -->

# Soul-aware Eval Format

@@ -17,6 +21,12 @@ interactions, current mood and energy. The same prompt to a soul that's
than to one that's "energetic with high bond strength" — and that's the
entire point of the protocol.

The format also handles a stateless case. A `prompt`-mode case scores a
verbatim prompt or skill output directly, with no soul involved. That is
how you point the same harness at workspace prompts and skills — for
example scoring the `/humanize` skill's output for AI tells. See
[Prompt mode](#prompt-mode-scoring-prompts-and-skills) below.

This page documents the schema and the runner. For the CLI command see
[cli-reference.md](cli-reference.md#soul-eval). For the MCP tool see
[mcp-server.md](mcp-server.md#soul_eval). For Python API access see
@@ -97,10 +107,11 @@ EvalSpec
│ ├── message: str # required
│ ├── user_id: str | null
│ ├── domain: str | null
│ ├── mode: "respond" | "recall"
│ ├── mode: "respond" | "recall" | "prompt"
│ ├── observe: bool # default false
│ ├── recall_limit: int # default 5
│ └── recall_layer: str | null
│ ├── recall_layer: str | null
│ └── reference: str | null # prompt mode — the pre-transform text
└── scoring: Scoring # see below
```

@@ -130,16 +141,62 @@ also queryable via `inputs.recall_layer`.
A case has three parts:

1. **Mode** — `respond` (the soul produces a reply via context_for + the
engine) or `recall` (`Soul.recall(query=message, ...)`).
engine), `recall` (`Soul.recall(query=message, ...)`), or `prompt`
(the soul is skipped; `message` is scored verbatim — see
[Prompt mode](#prompt-mode-scoring-prompts-and-skills)).
2. **Inputs** — message, optional `user_id` (multi-user routing),
optional `domain` (for v0.4.0 domain isolation), and recall knobs.
optional `domain` (for v0.4.0 domain isolation), recall knobs, and the
prompt-mode `reference`.
3. **Scoring** — one of the five kinds below. The `kind` field is the
discriminator; Pydantic resolves the right scorer at parse time.

`observe: true` runs `Soul.observe()` after producing the response, so
the soul's state mutates. By default `observe: false` keeps the state
identical to the seed across cases — recommended for deterministic
evals.
evals. `observe` does nothing in `prompt` mode (there is no soul to
observe).

### Prompt mode — scoring prompts and skills

Most cases drive a soul. A `prompt`-mode case does not: it takes the
case's `message` as a verbatim string — a prompt, or the output of a
skill — and hands it straight to the scorer. No soul is birthed, no
context is built, the `seed` block is ignored.

This is the path for evaluating the workspace's own prompts and skills.
The motivating case is `/humanize`: feed the skill's output in as
`message`, describe the qualities a good humanized text should have in a
`judge` block, and the eval tells you whether an edit to the skill made
its output better or worse.

```yaml
cases:
  - name: "rewrite drops the puffery"
    inputs:
      mode: prompt
      # `reference`: the original text the skill was given.
      reference: |
        Version 2.0 stands as an enduring testament to our commitment.
      # `message`: the candidate output to score.
      message: |
        Version 2.0 shipped Tuesday with offline mode.
    scoring:
      kind: judge
      criteria: |
        The candidate output should state plainly what changed, with no
        significance-inflation language. It should keep the facts and
        stay shorter than the reference.
```

`reference` is optional and prompt-mode only. When set, the `judge`
scorer shows it to the LLM as a separate "Reference input" block, so
criteria can ask whether the candidate improved on the original rather
than judging it in isolation. Any scoring kind works in prompt mode:
`regex` and `keyword` give you deterministic gates that run without an
engine, but `judge` is the natural fit for "is this output good?"

The shipped reference spec is
[`tests/eval_examples/humanizer_skill.yaml`](../tests/eval_examples/humanizer_skill.yaml).
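
To make the role of `reference` concrete, here is a minimal sketch of how a judge prompt could be assembled. The real `judge` scorer's prompt wording is not shown in this document, so `build_judge_prompt` and every label other than "Reference input" are assumptions made for illustration.

```python
from typing import Optional


def build_judge_prompt(
    criteria: str,
    candidate: str,
    reference: Optional[str] = None,
) -> str:
    """Assemble a judge prompt; the layout here is hypothetical."""
    blocks = ["Criteria:\n" + criteria.strip()]
    if reference:
        # Only present when the case sets inputs.reference.
        blocks.append("Reference input:\n" + reference.strip())
    blocks.append("Candidate output:\n" + candidate.strip())
    return "\n\n".join(blocks)


prompt = build_judge_prompt(
    criteria="State plainly what changed.",
    candidate="Version 2.0 shipped Tuesday with offline mode.",
    reference="Version 2.0 stands as an enduring testament to our commitment.",
)
```

Keeping the reference in its own labeled block, rather than folding it into the criteria, lets the same criteria text be reused across cases with different inputs.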

## Scoring kinds

@@ -309,4 +366,5 @@ a follow-up the optimizer would benefit from, file an issue against it.
- [api-reference.md](api-reference.md#evaluation) — Python API
- [cli-reference.md](cli-reference.md#soul-eval) — `soul eval` command
- [mcp-server.md](mcp-server.md#soul_eval) — `soul_eval` MCP tool
- `tests/eval_examples/` — five shipped example specs
- `tests/eval_examples/` — shipped example specs, including
`humanizer_skill.yaml` for the prompt-mode `/humanize` eval
17 changes: 16 additions & 1 deletion src/soul_protocol/eval/runner.py
@@ -4,6 +4,11 @@
# either drives the soul into producing a response (mode="respond") or
# calls Soul.recall() (mode="recall"), captures state snapshots, and
# delegates to the scoring module.
# Updated: 2026-05-21 (paw-workspace#47) — Added the "prompt" case mode.
# Prompt-mode cases skip the soul entirely: the runner takes the case's
# `message` as verbatim text (a prompt or a skill output) and hands it to
# the scorer. This lets the framework score workspace prompts and skills
# such as /humanize, scored via the existing JudgeScoring kind.
#
# The "respond" path is the interesting one. soul-protocol does not own a
# response generator — that's the consumer's job — so the runner builds the
@@ -213,7 +218,17 @@ async def _run_case(
    mood_before = soul.state.mood
    energy_before = soul.state.energy

    if inputs.mode == "recall":
    if inputs.mode == "prompt":
        # Prompt mode: the soul is not involved at all. The case's
        # ``message`` is the verbatim text under evaluation (a prompt, or a
        # skill's output). Hand it straight to the scorer. The judge scorer
        # picks up ``inputs.reference`` separately when present.
        execution = CaseExecution(
            output_text=inputs.message,
            mood_before=mood_before,
            energy_before=energy_before,
        )
    elif inputs.mode == "recall":
        layer = inputs.recall_layer
        mtypes: list[MemoryType] | None = None
        if layer:
27 changes: 24 additions & 3 deletions src/soul_protocol/eval/schema.py
@@ -3,6 +3,13 @@
# union (keyword | regex | semantic | judge | structural). Evals are written
# in YAML; this module parses and validates them. The runner consumes the
# resulting Pydantic models and drives Soul.observe/recall/respond.
# Updated: 2026-05-21 (paw-workspace#47) — Added a third case mode, "prompt".
# In prompt mode the case input is a verbatim prompt or skill output (not a
# soul recall/respond); the runner scores that text directly without
# touching the soul. Lets the eval framework score workspace prompts and
# skills (e.g. /humanize) alongside seeded-soul behaviour. The optional
# `reference` field carries the original input a skill transformed, so a
# judge case can compare a candidate output against where it started.
#
# Design note: we keep the schema deliberately small. Anything the soul
# already exposes (Personality, Mood, MemoryType) is referenced directly so
@@ -248,32 +255,46 @@ class StructuralScoring(_ScoringBase):
class CaseInputs(BaseModel):
    """Input for a single case.

    Two modes:
    Three modes:

    - ``mode="respond"`` (default): runner builds a system prompt + context
      block from the soul, asks the engine for a reply to ``message``, and
      hands the reply to the scorer.
    - ``mode="recall"``: runner calls ``Soul.recall(query=message, ...)``
      and hands the result list to the scorer (rendered as one entry per
      line for keyword/semantic/judge; full list for structural).
    - ``mode="prompt"``: the soul is left untouched. ``message`` is treated
      as a verbatim prompt or skill output and handed straight to the
      scorer. This is how the framework evaluates workspace prompts and
      skills (e.g. ``/humanize``): the YAML carries the text under test and
      a :class:`JudgeScoring` block describes the qualities a good output
      should have. The ``seed`` block is ignored for prompt-mode cases.

    ``reference``: optional. In prompt mode it carries the *original* text
    a skill was meant to transform (e.g. the AI-slop input before
    ``/humanize`` ran). The judge scorer shows it as a "Reference input"
    block so criteria can ask whether the candidate improved on it.
    Ignored outside prompt mode.

    ``observe`` (default false): when true, the runner additionally calls
    ``Soul.observe()`` after generating the response, so subsequent cases
    in the same spec see the updated state. Defaults to false because evals
    should be deterministic and memory mutations between cases make that
    harder.
    harder. ``observe`` has no effect in prompt mode (no soul interaction).
    """

    model_config = ConfigDict(extra="forbid")

    message: str
    user_id: str | None = None
    domain: str | None = None
    mode: Literal["respond", "recall"] = "respond"
    mode: Literal["respond", "recall", "prompt"] = "respond"
    observe: bool = False
    # recall-mode specific knobs
    recall_limit: int = 5
    recall_layer: str | None = None
    # prompt-mode specific knob: the original text a skill transformed
    reference: str | None = None


class EvalCase(BaseModel):