mrviduus · mrviduus · Jun 22, 2026 · Jun 22, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,10 @@
 
 ## [Unreleased]
 
+### Ask this book — user-book RAG eval, P1 (live grounding validation) — backend (2026-06-22)
+
+Automated live grounding validation for **any user-uploaded book** — the user-book sibling to the catalog `RagEvalRunner` (AI-027). The catalog eval scores fixed golden sets keyed by `editionId`; user books have no goldens, so the new `UserBookRagEvalRunner` **synthesises probes from the book's own chunks**: it seed-retrieves a spread of chunks (`RetrieveUserBookAsync`, no gate), asks the generator for one self-contained question per chunk (`FeatureTag eval.userbook.gen`), runs the **real Ask path** per question, and judges the resulting answer's citations with the **shared `CitationJudge`** — same rubric + SupportRate (D1≥4) as the catalog. Two behaviour probes round it out: a **warm greeting** ("hi") checked *structurally* (answers, no citations, no `[n]` marker, not refused — no judge call) and a fixed **off-book** question judged for invented facts (passes iff the answer declines or stays grounded). **Empty/un-embedded book → short-circuit**: NO generator/judge LLM call, persist a failed 0-row with a note (mirrors the catalog no-LLM-on-empty invariant). Refactor: the catalog citation judge (`JudgeCitationsAsync`) + the `EvalRun` row factory (`MakeRun`) are extracted into one internal `CitationJudge` helper that both runners call — the rubric never forks; the catalog `RagEvalRunner` + its tests stay byte-for-byte green. New endpoint `POST /admin/rag/userbook/{id}/eval?judge=openai` (admin-auth inherited) resolves the owner via `db.UserBooks`, **logs the target userId** (privacy: admin eval reads private user content), runs 6 probes, persists, and returns `UserBookRagEvalDto` (citation score/supportRate, retrieval fraction, behaviour pass, per-probe breakdown). 404 on unknown book, 503 with no OpenAI key. Persists `rag.userbook.citation` / `rag.userbook.behavior` / `rag.userbook.retrieval` `eval_run` rows. `dotnet build -c Release` clean; AiEvals 61 + UnitTests 886 green (synthesised-probe aggregate, greeting structural pass/fail, off-book judge pass/fail, **empty-chunks → asserted zero generator/judge calls** via a throwing fake, SupportRate math, persist on/off). P2 admin UI is a separate slice.
+
 ### Ask this book — conversational, streaming web chat — backend (AI-028) (2026-06-19)
 
 Backend for the conversational "Ask this book" upgrade: **model bump + multi-turn memory + warm-companion prompt + SSE streaming**, with grounding, citations, and the spoiler gate intact. `rag.ask` now routes to a dedicated keyed provider `openai-rag` on **gpt-4.1-mini** (was gpt-4.1-nano), mirroring `openai-explain` (`OpenAI:RagAsk:Model`, `Ai:Routes:rag.ask → openai-rag`, decorator-loop entry, `ModelRegistrySeeder` row). The system prompt is rewritten from "answer ONLY from excerpts else refuse" to a **warm reading companion** that is still strictly grounded — every book-fact claim must come from the numbered excerpts and cite `[n]` (citation contract + parser unchanged), but greetings/meta ("hi", "what can you do") get a warm invite with **no forced citation and no refusal**, and a genuine question with no matching excerpt gets a graceful "I don't see that in what you've read so far" rather than an invented fact. **Multi-turn**: `AskRequest` gains `History: AskTurnDto[]` (role `"user"`/`"assistant"`); the server defensively clamps to the **last 6 turns**, caps each turn at 4000 chars, normalizes roles, and assembles a real chat (system → numbered-excerpts context block → prior turns → new question last). Retrieval still runs on the latest question only, so the grounding eval is byte-identical with `[]` history. **SSE** (content-negotiated, mirrors Explain): `Accept: text/event-stream` → `delta` events (token fragments) then a terminal `done` carrying `{ citations, lastReadOrd, insufficient }` (camelCase, identical citation shape to the JSON path); empty-chunks → one friendly `delta` + `done {insufficient:true}` with **no model call**; provider/mid-stream failure → terminal `error`. JSON path returns the unchanged `AskResponse` (eval + mobile keep working). Ask `MaxOutputTokens` raised 320 → 400 for conversational length. `dotnet build -c Release` clean; 868 unit tests green (history clamp, multi-turn message assembly, SSE event sequencing over a fake delta stream, companion greeting-vs-content prompt structure) + integration (catalog spoiler-gate, owner-404, SSE content-type + framing, JSON history passthrough — skip-on-unavailable). **Note: the grounding golden eval (`RagEvalRunner`) MUST be re-run on mini post-deploy** — the companion prompt loosened the refusal rule, so this is the real hallucination-risk gate (paid; not runnable in CI). Frontend = parallel agent (AI-026e).
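
Review note: the history handling described in the AI-028 entry (last 6 turns, 4000 chars per turn, normalized roles) can be sketched roughly as below. This is an illustration only, since the server code is not part of this diff. `AskTurn` and `HistoryClamp` are hypothetical names, the rule that any role other than `assistant` falls back to `user` is an assumption, and truncating over-long turns instead of rejecting them is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the AskTurnDto named in the CHANGELOG; the real shape is not shown in this diff.
public sealed record AskTurn(string Role, string Content);

public static class HistoryClamp
{
    public const int MaxTurns = 6;        // "last 6 turns" per the CHANGELOG
    public const int MaxTurnChars = 4000; // "caps each turn at 4000 chars" per the CHANGELOG

    public static IReadOnlyList<AskTurn> Clamp(IReadOnlyList<AskTurn>? history)
    {
        if (history is null || history.Count == 0)
            return Array.Empty<AskTurn>();

        return history
            .Skip(Math.Max(0, history.Count - MaxTurns)) // keep only the most recent turns
            .Select(t => new AskTurn(NormalizeRole(t.Role), Truncate(t.Content)))
            .ToList();
    }

    // Assumption: any role other than "assistant" (case-insensitive) is treated as "user".
    private static string NormalizeRole(string? role) =>
        string.Equals(role?.Trim(), "assistant", StringComparison.OrdinalIgnoreCase) ? "assistant" : "user";

    // Assumption: over-long turns are truncated rather than rejected.
    private static string Truncate(string? content)
    {
        var text = content ?? string.Empty;
        return text.Length <= MaxTurnChars ? text : text[..MaxTurnChars];
    }
}
```

Under these assumptions, passing ten turns returns only the last six, and an empty or null history returns an empty list, which matches the entry's claim that the grounding eval is unaffected by `[]` history.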

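Review note: the structural greeting probe from the new CHANGELOG entry (answers, no citations, no `[n]` marker, not refused, no judge call) could look something like the sketch below. `UserBookRagEvalRunner` is not included in this diff, so `ProbeAnswer` and `GreetingProbe` are hypothetical names, and mapping "not refused" onto an `Insufficient` flag is an assumption based on the `insufficient` field mentioned for the Ask response.

```csharp
using System.Text.RegularExpressions;

// Hypothetical, minimal view of an Ask answer; the real response type is not shown in this diff.
public sealed record ProbeAnswer(string Text, int CitationCount, bool Insufficient);

public static class GreetingProbe
{
    // Matches citation markers such as [1] or [12].
    private static readonly Regex CitationMarker = new(@"\[\d+\]", RegexOptions.Compiled);

    // Purely structural: no LLM judge is involved in this check.
    public static bool Passes(ProbeAnswer answer) =>
        !string.IsNullOrWhiteSpace(answer.Text)   // it answered
        && answer.CitationCount == 0              // no citations attached
        && !CitationMarker.IsMatch(answer.Text)   // no [n] marker in the prose
        && !answer.Insufficient;                  // assumption: "refused" surfaces as Insufficient
}
```

Keeping this probe structural means it costs no judge tokens and is deterministic, which fits the entry's "no judge call" note.
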
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/CitationJudge.cs b/backend/src/Ai/TextStack.Ai.EvalSuite/CitationJudge.cs
@@ -0,0 +1,134 @@
+using Application.Rag;
+using Domain.Entities;
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.AI.Evaluation;
+using TextStack.Ai.Core;
+using TextStack.Ai.Evals;
+using TextStack.Ai.Llm;
+using TextStack.Ai.Rag;
+
+namespace TextStack.Ai.EvalSuite;
+
+/// <summary>
+/// Shared citation-correctness machinery for the RAG evals (AI-027b + user-book P1). Both the catalog
+/// <see cref="RagEvalRunner"/> and the <see cref="UserBookRagEvalRunner"/> call ONE copy of the judge
+/// rubric + scoring + the <see cref="EvalRun"/> row factory, so the support metric never forks. The
+/// "support" axis MUST stay Dim1 — SupportRate reads <see cref="JudgeScore.D1"/>.
+/// </summary>
+internal static class CitationJudge
+{
+    internal const string CitationFeature = "rag.citation";
+    internal const int SupportPassThreshold = 4; // judge ≥4/5 on the support axis = a correct citation
+    internal const string NoJudge = "n/a";
+
+    internal static readonly Rubric Rubric = new(
+        "support: does the cited excerpt actually contain or directly imply the specific claim it is attached to?",
+        "relevance: is the excerpt genuinely on-topic for the answer, not a loosely-related passage?",
+        "faithfulness: does the answer avoid asserting anything the cited excerpts do not support (no outside knowledge)?");
+
+    private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];
+
+    /// <summary>
+    /// Generates a grounded answer per question (over its already-retrieved chunks) and judges each
+    /// citation against the FULL text of the excerpt it points to. Returns the mean 1–5 score and the
+    /// support rate (citations scored ≥<see cref="SupportPassThreshold"/> on the support axis). Shared
+    /// verbatim by both runners.
+    /// </summary>
+    internal static async Task<RagCitationSummary> JudgeCitationsAsync(
+        IRagAskService ask,
+        ILlmService judge,
+        IReadOnlyList<(string Question, IReadOnlyList<RetrievedChunk> Chunks)> retrieved,
+        CancellationToken ct)
+    {
+        var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));
+        var scores = new List<JudgeScore>();
+        var supported = 0;
+        var answersGenerated = 0;
+
+        foreach (var (question, chunks) in retrieved)
+        {
+            ct.ThrowIfCancellationRequested();
+            // lastReadOrd is irrelevant here — chunks are supplied directly (no user gating).
+            var answer = await ask.AskFromChunksAsync(question, chunks, [], [], lastReadOrd: int.MaxValue, ct);
+            if (answer.Insufficient || answer.Citations.Count == 0)
+                continue;
+            answersGenerated++;
+
+            foreach (var cited in answer.Citations)
+            {
+                ct.ThrowIfCancellationRequested();
+                var evidence =
+                    $"Question: {question}\n\nAnswer:\n{answer.Answer}\n\n" +
+                    $"Cited excerpt [{cited.Marker}] (the answer attributes a claim to this passage):\n{cited.Chunk.Text}";
+
+                var evaluator = new RubricEvaluator(CitationFeature, Rubric);
+                var result = await evaluator.EvaluateAsync(
+                    JudgePlaceholderMessages,
+                    new ChatResponse(new ChatMessage(ChatRole.Assistant, answer.Answer)),
+                    chatConfig, [new RubricEvidenceContext(evidence)], ct);
+
+                var score = new JudgeScore(
+                    ReadAxis(result, Rubric.Dim1),
+                    ReadAxis(result, Rubric.Dim2),
+                    ReadAxis(result, Rubric.Dim3),
+                    string.Empty);
+                scores.Add(score);
+                if (score.D1 >= SupportPassThreshold)
+                    supported++;
+            }
+        }
+
+        if (scores.Count == 0)
+            return new RagCitationSummary(0, 0, 0, answersGenerated);
+
+        var summary = JudgeRunner.Aggregate(scores);
+        var supportRate = (double)supported / scores.Count;
+        return new RagCitationSummary(summary.MeanOverall, supportRate, scores.Count, answersGenerated);
+    }
+
+    /// <summary>
+    /// Lightweight faithfulness check for the user-book off-book probe: does <paramref name="answer"/>
+    /// introduce facts not grounded in the book? Returns true when the answer is clean (gracefully
+    /// declines or stays grounded — no invented facts), false when it hallucinates outside knowledge.
+    /// One judge call; parses a leading yes/no.
+    /// </summary>
+    internal static async Task<bool> JudgeNoInventedFactsAsync(
+        ILlmService judge, string question, string answer, CancellationToken ct)
+    {
+        var prompt =
+            "You are grading whether an assistant answer about a book introduces facts NOT grounded in " +
+            "that book. A good answer either gracefully declines (says it can't find this in the book) " +
+            "or stays grounded. A bad answer asserts specific outside facts (dates, names, scores, " +
+            "events) as if from the book.\n\n" +
+            $"Question: {question}\n\nAnswer:\n{answer}\n\n" +
+            "Does the answer introduce facts not grounded in the book? Reply with exactly YES or NO.";
+        var request = new LlmRequest(
+            SystemPrompt: "You are a strict faithfulness grader. Reply YES or NO only.",
+            Messages: [new LlmMessage("user", prompt)],
+            MaxOutputTokens: 4,
+            FeatureTag: "eval.judge");
+        var response = await judge.CompleteAsync(request, ct);
+        var verdict = response.Text.Trim().ToLowerInvariant();
+        // "yes" = invented facts present → NOT clean. Anything else (incl. "no", an empty reply, or an off-format reply) = clean, so this check fails open.
+        return !verdict.StartsWith("yes", StringComparison.Ordinal);
+    }
+
+    // RubricEvaluator names each axis "{feature}.{label}" (label = text before ':').
+    private static int ReadAxis(EvaluationResult result, string dim) =>
+        (int)Math.Round(result.Get<NumericMetric>($"{CitationFeature}.{dim.Split(':')[0].Trim()}").Value ?? 0);
+
+    /// <summary>Shared <see cref="EvalRun"/> row factory — one copy for every RAG eval feature.</summary>
+    internal static EvalRun MakeRun(
+        string feature, string modelId, string judgeModelId, decimal score, int n, string? gitSha, string breakdown) => new()
+        {
+            Id = Guid.NewGuid(),
+            Feature = feature,
+            ModelId = modelId,
+            JudgeModelId = judgeModelId,
+            Score = Math.Round(score, 3),
+            N = n,
+            BreakdownJson = breakdown,
+            GitSha = gitSha,
+            CreatedAt = DateTimeOffset.UtcNow,
+        };
+}
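
Review note: the two scoring rules in `CitationJudge` are easy to misread in the middle of the async plumbing, so here they are isolated as pure functions. This is a sketch that mirrors the logic in the file above, not a replacement for it. `CitationScoringSketch` is a hypothetical name, and the functions take plain integers and strings instead of `JudgeScore` and `LlmResponse`, whose definitions are not part of this diff.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CitationScoringSketch
{
    public const int SupportPassThreshold = 4; // same threshold as CitationJudge

    // Fraction of judged citations whose support-axis score (D1) meets the threshold.
    // Returns 0 when nothing was judged, matching the empty-scores early return.
    public static double SupportRate(IReadOnlyCollection<int> d1Scores) =>
        d1Scores.Count == 0
            ? 0
            : (double)d1Scores.Count(s => s >= SupportPassThreshold) / d1Scores.Count;

    // Same rule as JudgeNoInventedFactsAsync: only a reply that starts with "yes"
    // (case-insensitive, after trimming) marks the answer as not clean.
    public static bool IsClean(string judgeReply) =>
        !judgeReply.Trim().ToLowerInvariant().StartsWith("yes", StringComparison.Ordinal);
}
```

With D1 scores of 5, 4, 3 and 2, two of four citations meet the threshold, so the support rate is 0.5. For the off-book probe, `"No."` and `""` both count as clean while `"YES, it does"` does not. An empty or off-format judge reply therefore passes the probe, which may be worth a test if the judge model or its token cap changes.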
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/RagEvalRunner.cs b/backend/src/Ai/TextStack.Ai.EvalSuite/RagEvalRunner.cs
@@ -1,12 +1,8 @@
 using Application.Common.Interfaces;
 using Application.Rag;
-using Domain.Entities;
-using Microsoft.Extensions.AI;
-using Microsoft.Extensions.AI.Evaluation;
 using Microsoft.Extensions.Logging;
 using TextStack.Ai.Core;
 using TextStack.Ai.Evals;
-using TextStack.Ai.Llm;
 using TextStack.Ai.Rag;
 
 namespace TextStack.Ai.EvalSuite;
@@ -49,17 +45,7 @@ public sealed class RagEvalRunner(ILogger<RagEvalRunner> logger)
 {
     // Retrieval scores 0–1 (recall / 1−leak), unlike the 1–5 judged features — the feature key disambiguates.
     private const string RetrievalModelId = "hybrid-retrieval";
-    private const string NoJudge = "n/a";
-    private const int SupportPassThreshold = 4; // judge ≥4/5 on the support axis = a correct citation
-    private const string CitationFeature = "rag.citation";
-
-    // The "support" axis MUST stay Dim1 — SupportRate reads JudgeScore.D1.
-    private static readonly Rubric CitationRubric = new(
-        "support: does the cited excerpt actually contain or directly imply the specific claim it is attached to?",
-        "relevance: is the excerpt genuinely on-topic for the answer, not a loosely-related passage?",
-        "faithfulness: does the answer avoid asserting anything the cited excerpts do not support (no outside knowledge)?");
-
-    private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];
+    private const string NoJudge = CitationJudge.NoJudge;
 
     public async Task<RagEvalResult> RunAsync(
         IRagService rag,
@@ -110,7 +96,7 @@ public async Task<RagEvalResult> RunAsync(
         // Citation correctness (027b) — only when a generator + judge are supplied.
         RagCitationSummary? citation = null;
         if (ask is not null && judge is not null)
-            citation = await JudgeCitationsAsync(ask, judge, retrievedByQuestion, ct);
+            citation = await CitationJudge.JudgeCitationsAsync(ask, judge, retrievedByQuestion, ct);
 
         logger.LogInformation(
             "RAG eval edition={Edition} recall@{K}={Recall:0.00} (N={RecallN}) spoilerLeakRate={Leak:0.00} (N={SpoilerN}) citation={Cit}",
@@ -119,92 +105,17 @@ public async Task<RagEvalResult> RunAsync(
 
         if (persist && db is not null)
         {
-            db.EvalRuns.Add(MakeRun("rag.retrieval", RetrievalModelId, NoJudge, (decimal)recall, recallCases.Count, gitSha,
+            db.EvalRuns.Add(CitationJudge.MakeRun("rag.retrieval", RetrievalModelId, NoJudge, (decimal)recall, recallCases.Count, gitSha,
                 $"{{\"recallAtK\":{recall:0.000},\"k\":{k},\"hits\":{recallDetail.Count(c => c.Hit)}}}"));
-            db.EvalRuns.Add(MakeRun("rag.spoiler", RetrievalModelId, NoJudge, (decimal)(1.0 - leakRate), spoilerCases.Count, gitSha,
+            db.EvalRuns.Add(CitationJudge.MakeRun("rag.spoiler", RetrievalModelId, NoJudge, (decimal)(1.0 - leakRate), spoilerCases.Count, gitSha,
                 $"{{\"leakRate\":{leakRate:0.000},\"leakingCases\":{spoilerDetail.Count(c => c.LeakCount > 0)}}}"));
             if (citation is not null)
-                db.EvalRuns.Add(MakeRun(CitationFeature, RagAskService.FeatureTag, judgeModelId ?? NoJudge,
+                db.EvalRuns.Add(CitationJudge.MakeRun(CitationJudge.CitationFeature, RagAskService.FeatureTag, judgeModelId ?? NoJudge,
                     (decimal)citation.Score, citation.CitationsJudged, gitSha,
                     $"{{\"supportRate\":{citation.SupportRate:0.000},\"answers\":{citation.AnswersGenerated}}}"));
             await db.SaveChangesAsync(ct);
         }
 
         return new RagEvalResult(recall, recallCases.Count, leakRate, spoilerCases.Count, recallDetail, spoilerDetail, citation);
     }
-
-    /// <summary>
-    /// Generates a grounded answer per question (over its already-retrieved chunks) and judges each
-    /// citation against the FULL text of the excerpt it points to. Returns the mean 1–5 score and the
-    /// support rate (citations scored ≥<see cref="SupportPassThreshold"/> on the support axis).
-    /// </summary>
-    private async Task<RagCitationSummary> JudgeCitationsAsync(
-        IRagAskService ask,
-        ILlmService judge,
-        IReadOnlyList<(string Question, IReadOnlyList<RetrievedChunk> Chunks)> retrieved,
-        CancellationToken ct)
-    {
-        var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));
-        var scores = new List<JudgeScore>();
-        var supported = 0;
-        var answersGenerated = 0;
-
-        foreach (var (question, chunks) in retrieved)
-        {
-            ct.ThrowIfCancellationRequested();
-            // lastReadOrd is irrelevant here — chunks are supplied directly (no user gating).
-            var answer = await ask.AskFromChunksAsync(question, chunks, [], [], lastReadOrd: int.MaxValue, ct);
-            if (answer.Insufficient || answer.Citations.Count == 0)
-                continue;
-            answersGenerated++;
-
-            foreach (var cited in answer.Citations)
-            {
-                ct.ThrowIfCancellationRequested();
-                var evidence =
-                    $"Question: {question}\n\nAnswer:\n{answer.Answer}\n\n" +
-                    $"Cited excerpt [{cited.Marker}] (the answer attributes a claim to this passage):\n{cited.Chunk.Text}";
-
-                var evaluator = new RubricEvaluator(CitationFeature, CitationRubric);
-                var result = await evaluator.EvaluateAsync(
-                    JudgePlaceholderMessages,
-                    new ChatResponse(new ChatMessage(ChatRole.Assistant, answer.Answer)),
-                    chatConfig, [new RubricEvidenceContext(evidence)], ct);
-
-                var score = new JudgeScore(
-                    ReadAxis(result, CitationRubric.Dim1),
-                    ReadAxis(result, CitationRubric.Dim2),
-                    ReadAxis(result, CitationRubric.Dim3),
-                    string.Empty);
-                scores.Add(score);
-                if (score.D1 >= SupportPassThreshold)
-                    supported++;
-            }
-        }
-
-        if (scores.Count == 0)
-            return new RagCitationSummary(0, 0, 0, answersGenerated);
-
-        var summary = JudgeRunner.Aggregate(scores);
-        var supportRate = (double)supported / scores.Count;
-        return new RagCitationSummary(summary.MeanOverall, supportRate, scores.Count, answersGenerated);
-    }
-
-    // RubricEvaluator names each axis "{feature}.{label}" (label = text before ':').
-    private static int ReadAxis(EvaluationResult result, string dim) =>
-        (int)Math.Round(result.Get<NumericMetric>($"{CitationFeature}.{dim.Split(':')[0].Trim()}").Value ?? 0);
-
-    private static EvalRun MakeRun(
-        string feature, string modelId, string judgeModelId, decimal score, int n, string? gitSha, string breakdown) => new()
-        {
-            Id = Guid.NewGuid(),
-            Feature = feature,
-            ModelId = modelId,
-            JudgeModelId = judgeModelId,
-            Score = Math.Round(score, 3),
-            N = n,
-            BreakdownJson = breakdown,
-            GitSha = gitSha,
-            CreatedAt = DateTimeOffset.UtcNow,
-        };
 }
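
Review note: the `BreakdownJson` strings passed to `MakeRun` are built with plain interpolated strings, which format numbers using the current culture. If the eval ever runs under a culture that uses a comma as the decimal separator, `{recall:0.000}` can render as `0,750` and the stored JSON would no longer parse. Whether that can happen depends on how the host configures culture, which this diff does not show. One possible hardening, sketched under that assumption, is to format with the invariant culture. `BreakdownJsonSketch` is a hypothetical helper, not part of the PR.

```csharp
using System;

public static class BreakdownJsonSketch
{
    // FormattableString.Invariant applies the invariant culture to every placeholder,
    // so the decimal separator is always '.' regardless of the host's culture.
    public static string Retrieval(double recall, int k, int hits) =>
        FormattableString.Invariant(
            $"{{\"recallAtK\":{recall:0.000},\"k\":{k},\"hits\":{hits}}}");
}
```

`BreakdownJsonSketch.Retrieval(0.75, 5, 3)` yields `{"recallAtK":0.750,"k":5,"hits":3}`, the same shape the runner writes today. Building the payload with a JSON serializer would also work, at the cost of losing the fixed three-decimal formatting.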