mrviduus · Jun 24, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,10 @@
 
 ## [Unreleased]
 
+### Learning Tutor Agent — plans what to study next over real SRS state (AI-Agent-2) — backend (2026-06-24)
+
+The third and largest agent: a **Tutor** that reasons over the learner's actual vocabulary state and **plans what to study next**, rather than running a fixed review queue. `TutorAgent` runs on the existing `AgentLoop` runtime and calls four thin `ITool`s — `get_due_vocabulary` (due/near-due SRS cards), `get_weak_vocabulary` (lowest-accuracy / earliest-stage words), `get_reading_context` (what they're actually reading — keeps practice tied to reading, the product thesis), and `get_example_sentence` (a real in-context sentence: the learner's saved sentence, else a **spoiler-gated, owner-isolated RAG** pull from their own book) — then emits an **ordered study plan** (`{wordId, word, stage, exerciseType, difficulty, why}` + an overall `rationale` + a `readingNudge`), exercise type/difficulty **recalibrated from the real SRS stage** (recognition→recall→context-cloze). **Server-held `tutor_session`** (new entity/table, jsonb `PlanJson`, status, turn count) persists the plan between turns; **HITL**: `POST /me/tutor/session` starts/resumes and `POST /me/tutor/session/{id}/feedback` re-plans on the learner's results — re-fetching state (so SRS updates are seen), deterministically **dropping cards just answered correctly**, ignoring feedback for ids not in the prior plan, and preserving the session length. **Two hard guarantees, QA-verified**: (1) **anti-hallucination** — every scheduled `wordId` must come from a `get_due`/`get_weak` tool result (harvested ok-only from the transcript), word+stage **re-projected** from the real row, invented ids dropped, empty transcript → empty plan (the model can't fabricate or rename a card); (2) **cross-user isolation** — the example-sentence tool resolves the card with `Id == wordId && UserId == userId` and the RAG path filters on `user_id AND user_book_id`, so no other user's `user_chapter_chunk` content is reachable. 
All inbound book text (example sentences from user uploads, reading titles) is run through `ExternalTextSanitizer` + length-capped before entering the prompt (prompt-injection boundary). Telemetry: each turn persists an `agent_run` (agent=`tutor`, `tool_calls_count`); route `tutor.agent → gpt-4.1-mini`. **Eval**: `TutorEvalRunner` (deterministic structural rubric over synthetic learner states — due-coverage, weak-targeting, difficulty-appropriateness, no-hallucination, thesis-alignment; a golden where weak ∉ due makes weak-targeting discriminating), admin-runnable `POST /admin/ai-quality/tutor/eval`. EF migration `AddTutorSession` (reversible). `dotnet build` green, `dotnet format` clean; 968 unit + 72 AiEvals tests green. **Deferred**: SSE streaming, the tutor UI surface (frontend/mobile slice), generated free-text exercises beyond MC reuse, longitudinal pedagogical-efficacy A/B (offline evals validate planner mechanics, not learning outcomes). Completes the 3-agent roadmap (`docs/04-dev/agents-roadmap.md`); Agent 1 (Enrichment) + Agent 3 (Librarian) already shipped.
+
 ### Librarian Agent — natural-language catalog discovery via a ReAct tool-use loop (AI-Agent-3) — backend (2026-06-23)
 
 A second true **agent**: turn a natural-language request — *"find books like 1984 about surveillance, in English, under 300 pages"* — into a ranked, **reasoned** list of recommendations. The `LibrarianAgent` runs on the existing `AgentLoop` runtime (plan→act→observe, hard `MaxSteps:6`/`CostCapUsd:0.04` caps, persisted transcript) and **decides** how to search: two new `ITool`s wrap the existing catalog search — `search_library` (keyword, wraps the Postgres FTS provider) and `search_library_semantic` ("books like X"/conceptual, wraps the AI-057 hybrid FTS+embedding RRF) — plus it **reuses Agent 1's** `search_open_library`/`get_open_library_work` to expand externally when the library is thin. Both library tools share one `LibrarySearchService` seam that runs the real search, collapses chapter hits to distinct editions, and **enriches** each with the metadata the agent post-filters on (authors, genres, language, aggregate word count → `approxPages` at ~275 w/page, since catalog editions carry no year/page column). **Constraints (language, length) are deterministic post-filters** over the returned metadata — the agent reasons over real rows, FTS isn't trusted to enforce them. **Anti-hallucination is enforced in code, not the prompt**: a `RetrievedCatalog` is rebuilt from the run's `tool_result` transcript (only `ok:true` results), and `Parse` drops any `library` recommendation whose `editionId` wasn't actually retrieved and any `open_library` suggestion whose title wasn't seen — surviving `library` recs **re-project** their title/slug/authors from the retrieved row, so the model can't even rename a real book. Each result carries **provenance** (`library` vs `open_library`) + a one-line `why`; `usedExternal` is derived from what survived grounding, not the model's flag. **Recommend-only this slice** — external hits are clearly-marked suggestions ("not in your library yet"); **no ingest** (copyright + scope; ingest/HITL deferred). 
All external + user free-text runs through `ExternalTextSanitizer` (untrusted DATA, never instructions). Endpoint `POST /me/librarian` (authenticated, rate-limited `librarian` 8/min, **JSON** — SSE deferred) → `{ recommendations[], reasoning, usedExternal, runId }`; persists an `agent_run` (agent=`librarian`) with `tool_calls_count`. Model route `librarian.agent → openai-explain` (gpt-4.1-mini). **Eval**: `LibrarianEvalRunner` (10 goldens: in-library / constrained / needs-external) scores **recall@k**, **constraint-satisfaction**, **coverage-decision accuracy** (expand externally exactly when thin), and the **hallucination-free rate** (every returned library slug genuinely exists, via a DB probe); admin-runnable `POST /admin/ai-quality/librarian/eval` (503 keyless). `dotnet build textstack.sln` green, `dotnet format` clean; 934 unit tests green (tool schema + page-estimate + shaping/provenance, `RetrievedCatalog` grounding incl. failed-result-contributes-nothing, Parse anti-hallucination/re-projection/de-dup/external-allowlist, loop one-shot/library-then-summarize/external-expansion/invented-book-dropped/injection-sanitized/budget-exhausted, eval recall+coverage+hallucination-probe). **No migration** (reuses `agent_run` + existing `tool_calls_count`). **Deferred**: ingest/HITL confirmation, SSE streaming, dedicated book-similarity index, user-library personalization, catalog year/page-count coverage. Design: `docs/04-dev/agents-roadmap.md` §4. **Needs a real-model run** to read live recall/constraint numbers on gpt-4.1-mini against a seeded catalog.

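The Tutor entry in the changelog hunk above describes its anti-hallucination guarantee as a code-level rule: scheduled ids must come from successful `get_due_vocabulary`/`get_weak_vocabulary` results, word and stage are re-projected from the real row, and an empty transcript yields an empty plan. The sketch below illustrates that rule in Python rather than the project's own code. The transcript shape (`tool`, `ok`, `cards`) and the names `harvest_retrieved` and `ground_plan` are assumptions made for illustration, not the actual `TutorAgent` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    word_id: str
    word: str
    stage: int


# Only these tools may introduce schedulable cards (tool names from the changelog).
GROUNDING_TOOLS = {"get_due_vocabulary", "get_weak_vocabulary"}


def harvest_retrieved(transcript: list[dict]) -> dict[str, Card]:
    """Collect cards from successful grounding-tool results only.

    The entry shape {"tool", "ok", "cards"} is hypothetical.
    """
    retrieved: dict[str, Card] = {}
    for entry in transcript:
        if entry.get("tool") not in GROUNDING_TOOLS or not entry.get("ok"):
            continue  # failed or unrelated results contribute nothing
        for row in entry.get("cards", []):
            retrieved[row["wordId"]] = Card(row["wordId"], row["word"], row["stage"])
    return retrieved


def ground_plan(model_items: list[dict], transcript: list[dict]) -> list[dict]:
    """Drop invented ids and overwrite word/stage with the retrieved values."""
    retrieved = harvest_retrieved(transcript)
    grounded = []
    for item in model_items:
        word_id = item.get("wordId")
        card = retrieved.get(word_id) if isinstance(word_id, str) else None
        if card is None:
            continue  # id never came back from a tool call: drop it
        # Re-project: the model keeps its ordering and "why", not the card data.
        grounded.append({**item, "word": card.word, "stage": card.stage})
    return grounded
```

With this shape, a plan item whose `wordId` only appeared in a failed tool result is dropped in the same way as a fully invented id, and a renamed card is silently restored to the retrieved word.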
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/tutor.json b/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/tutor.json
@@ -0,0 +1,73 @@
+[
+  {
+    "Name": "due-and-weak-mix",
+    "Cards": [
+      { "WordId": "11111111-0000-0000-0000-000000000001", "Word": "ostensibly", "Stage": 1, "ConsecutiveCorrect": 0, "Accuracy": 0.30, "Due": true, "HasSentence": true },
+      { "WordId": "11111111-0000-0000-0000-000000000002", "Word": "ephemeral", "Stage": 2, "ConsecutiveCorrect": 1, "Accuracy": 0.45, "Due": true, "HasSentence": true },
+      { "WordId": "11111111-0000-0000-0000-000000000003", "Word": "sanguine", "Stage": 0, "ConsecutiveCorrect": 0, "Accuracy": 0.20, "Due": true, "HasSentence": true },
+      { "WordId": "11111111-0000-0000-0000-000000000004", "Word": "lucid", "Stage": 4, "ConsecutiveCorrect": 5, "Accuracy": 0.95, "Due": false, "HasSentence": true },
+      { "WordId": "11111111-0000-0000-0000-000000000005", "Word": "candid", "Stage": 3, "ConsecutiveCorrect": 2, "Accuracy": 0.80, "Due": false, "HasSentence": true }
+    ],
+    "ReadingBook": "Nineteen Eighty-Four",
+    "ReadingLanguage": "en",
+    "ExpectedDueWordIds": [
+      "11111111-0000-0000-0000-000000000001",
+      "11111111-0000-0000-0000-000000000002",
+      "11111111-0000-0000-0000-000000000003"
+    ],
+    "ExpectedWeakWordIds": [
+      "11111111-0000-0000-0000-000000000003",
+      "11111111-0000-0000-0000-000000000001"
+    ]
+  },
+  {
+    "Name": "all-early-stage",
+    "Cards": [
+      { "WordId": "22222222-0000-0000-0000-000000000001", "Word": "obfuscate", "Stage": 0, "ConsecutiveCorrect": 0, "Accuracy": 0.10, "Due": true, "HasSentence": false },
+      { "WordId": "22222222-0000-0000-0000-000000000002", "Word": "ponderous", "Stage": 1, "ConsecutiveCorrect": 0, "Accuracy": 0.25, "Due": true, "HasSentence": true }
+    ],
+    "ReadingBook": "Dracula",
+    "ReadingLanguage": "en",
+    "ExpectedDueWordIds": [
+      "22222222-0000-0000-0000-000000000001",
+      "22222222-0000-0000-0000-000000000002"
+    ],
+    "ExpectedWeakWordIds": [
+      "22222222-0000-0000-0000-000000000001",
+      "22222222-0000-0000-0000-000000000002"
+    ]
+  },
+  {
+    "Name": "context-stage-no-sentence-downgrades",
+    "Cards": [
+      { "WordId": "33333333-0000-0000-0000-000000000001", "Word": "ineffable", "Stage": 4, "ConsecutiveCorrect": 1, "Accuracy": 0.55, "Due": true, "HasSentence": false },
+      { "WordId": "33333333-0000-0000-0000-000000000002", "Word": "quixotic", "Stage": 3, "ConsecutiveCorrect": 2, "Accuracy": 0.60, "Due": true, "HasSentence": true }
+    ],
+    "ReadingBook": "Don Quixote",
+    "ReadingLanguage": "en",
+    "ExpectedDueWordIds": [
+      "33333333-0000-0000-0000-000000000001",
+      "33333333-0000-0000-0000-000000000002"
+    ],
+    "ExpectedWeakWordIds": [
+      "33333333-0000-0000-0000-000000000001"
+    ]
+  },
+  {
+    "Name": "weak-not-due-vs-due-not-weak",
+    "Cards": [
+      { "WordId": "44444444-0000-0000-0000-000000000001", "Word": "recalcitrant", "Stage": 1, "ConsecutiveCorrect": 0, "Accuracy": 0.15, "Due": false, "HasSentence": true },
+      { "WordId": "44444444-0000-0000-0000-000000000002", "Word": "perfunctory", "Stage": 3, "ConsecutiveCorrect": 4, "Accuracy": 0.90, "Due": true, "HasSentence": true },
+      { "WordId": "44444444-0000-0000-0000-000000000003", "Word": "taciturn", "Stage": 4, "ConsecutiveCorrect": 5, "Accuracy": 0.95, "Due": true, "HasSentence": true }
+    ],
+    "ReadingBook": "Crime and Punishment",
+    "ReadingLanguage": "en",
+    "ExpectedDueWordIds": [
+      "44444444-0000-0000-0000-000000000002",
+      "44444444-0000-0000-0000-000000000003"
+    ],
+    "ExpectedWeakWordIds": [
+      "44444444-0000-0000-0000-000000000001"
+    ]
+  }
+]
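To make the role of the `weak-not-due-vs-due-not-weak` golden concrete, here is a minimal Python sketch of how a structural rubric might score a plan against one case from this dataset. The field names match the JSON above. The function `score_plan` and the exact metric formulas (set-overlap fractions) are assumptions for illustration; the changelog names the rubric dimensions but does not define how `TutorEvalRunner` computes them.

```python
import json

# One golden from tutor.json, trimmed to the fields this sketch uses.
GOLDEN = json.loads("""
{
  "Name": "weak-not-due-vs-due-not-weak",
  "Cards": [
    {"WordId": "44444444-0000-0000-0000-000000000001", "Word": "recalcitrant", "Due": false},
    {"WordId": "44444444-0000-0000-0000-000000000002", "Word": "perfunctory", "Due": true},
    {"WordId": "44444444-0000-0000-0000-000000000003", "Word": "taciturn", "Due": true}
  ],
  "ExpectedDueWordIds": [
    "44444444-0000-0000-0000-000000000002",
    "44444444-0000-0000-0000-000000000003"
  ],
  "ExpectedWeakWordIds": ["44444444-0000-0000-0000-000000000001"]
}
""")


def score_plan(golden: dict, planned_ids: list[str]) -> dict:
    """Score a plan's word ids against one golden (hypothetical formulas)."""
    planned = set(planned_ids)
    known = {card["WordId"] for card in golden["Cards"]}

    def coverage(expected: list[str]) -> float:
        # Fraction of expected ids that the plan scheduled.
        return len(planned & set(expected)) / len(expected) if expected else 1.0

    return {
        "due_coverage": coverage(golden["ExpectedDueWordIds"]),
        "weak_targeting": coverage(golden["ExpectedWeakWordIds"]),
        # True only if every planned id is a real card in the learner state.
        "no_hallucination": planned <= known,
    }


# A planner that only replays the due queue covers every due card
# but misses the weak, not-yet-due word entirely.
due_only = score_plan(GOLDEN, GOLDEN["ExpectedDueWordIds"])
print(due_only)
# {'due_coverage': 1.0, 'weak_targeting': 0.0, 'no_hallucination': True}
```

Because the weak card in this golden is not due, a planner that ignores `get_weak_vocabulary` scores zero on weak-targeting here, whereas in the other three goldens every expected weak id is also due, so due coverage alone would satisfy both metrics.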