Merged
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,10 @@

## [Unreleased]

### Ask this book — user-book RAG eval, P1 (live grounding validation) — backend (2026-06-22)

Automated live grounding validation for **any user-uploaded book** — the user-book sibling to the catalog `RagEvalRunner` (AI-027). The catalog eval scores fixed golden sets keyed by `editionId`; user books have no goldens, so the new `UserBookRagEvalRunner` **synthesises probes from the book's own chunks**: it seed-retrieves a spread of chunks (`RetrieveUserBookAsync`, no gate), asks the generator for one self-contained question per chunk (`FeatureTag eval.userbook.gen`), runs the **real Ask path** per question, and judges the resulting answer's citations with the **shared `CitationJudge`** — same rubric + SupportRate (D1≥4) as the catalog. Two behaviour probes round it out: a **warm greeting** ("hi") checked *structurally* (answers, no citations, no `[n]` marker, not refused — no judge call) and a fixed **off-book** question judged for invented facts (passes iff the answer declines or stays grounded). **Empty/un-embedded book → short-circuit**: NO generator/judge LLM call, persist a failed 0-row with a note (mirrors the catalog no-LLM-on-empty invariant). Refactor: the catalog citation judge (`JudgeCitationsAsync`) + the `EvalRun` row factory (`MakeRun`) are extracted into one internal `CitationJudge` helper that both runners call — the rubric never forks; the catalog `RagEvalRunner` + its tests stay byte-for-byte green. New endpoint `POST /admin/rag/userbook/{id}/eval?judge=openai` (admin-auth inherited) resolves the owner via `db.UserBooks`, **logs the target userId** (privacy: admin eval reads private user content), runs 6 probes, persists, and returns `UserBookRagEvalDto` (citation score/supportRate, retrieval fraction, behaviour pass, per-probe breakdown). 404 on unknown book, 503 with no OpenAI key. Persists `rag.userbook.citation` / `rag.userbook.behavior` / `rag.userbook.retrieval` `eval_run` rows. 
`dotnet build -c Release` clean; AiEvals 61 + UnitTests 886 green (synthesised-probe aggregate, greeting structural pass/fail, off-book judge pass/fail, **empty-chunks → asserted zero generator/judge calls** via a throwing fake, SupportRate math, persist on/off). P2 admin UI is a separate slice.
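
The structural greeting check described in this entry can be sketched as below. This is a minimal illustration, not the code from `UserBookRagEvalRunner` (which this diff does not show): the `GreetingAnswer` record and `GreetingProbe` class are hypothetical names, and mapping "not refused" onto an `Insufficient` flag is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the Ask response; the real type is not shown in this diff.
public sealed record GreetingAnswer(string Answer, IReadOnlyList<string> Citations, bool Insufficient);

public static class GreetingProbe
{
    // Matches a numeric citation marker such as [1] or [12].
    private static readonly Regex MarkerPattern = new(@"\[\d+\]", RegexOptions.Compiled);

    // Passes only if the reply has text, carries no citations, contains no [n]
    // marker, and was not refused (assumed here to mean Insufficient == false).
    public static bool PassesStructurally(GreetingAnswer a) =>
        !string.IsNullOrWhiteSpace(a.Answer)
        && a.Citations.Count == 0
        && !MarkerPattern.IsMatch(a.Answer)
        && !a.Insufficient;
}
```

Because every condition is a plain string or count check, a probe like this needs no judge call, which matches the "no judge call" note above.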

### Ask this book — conversational, streaming web chat — backend (AI-028) (2026-06-19)

Backend for the conversational "Ask this book" upgrade: **model bump + multi-turn memory + warm-companion prompt + SSE streaming**, with grounding, citations, and the spoiler gate intact. `rag.ask` now routes to a dedicated keyed provider `openai-rag` on **gpt-4.1-mini** (was gpt-4.1-nano), mirroring `openai-explain` (`OpenAI:RagAsk:Model`, `Ai:Routes:rag.ask → openai-rag`, decorator-loop entry, `ModelRegistrySeeder` row). The system prompt is rewritten from "answer ONLY from excerpts else refuse" to a **warm reading companion** that is still strictly grounded — every book-fact claim must come from the numbered excerpts and cite `[n]` (citation contract + parser unchanged), but greetings/meta ("hi", "what can you do") get a warm invite with **no forced citation and no refusal**, and a genuine question with no matching excerpt gets a graceful "I don't see that in what you've read so far" rather than an invented fact. **Multi-turn**: `AskRequest` gains `History: AskTurnDto[]` (role `"user"`/`"assistant"`); the server defensively clamps to the **last 6 turns**, caps each turn at 4000 chars, normalizes roles, and assembles a real chat (system → numbered-excerpts context block → prior turns → new question last). Retrieval still runs on the latest question only, so the grounding eval is byte-identical with `[]` history. **SSE** (content-negotiated, mirrors Explain): `Accept: text/event-stream` → `delta` events (token fragments) then a terminal `done` carrying `{ citations, lastReadOrd, insufficient }` (camelCase, identical citation shape to the JSON path); empty-chunks → one friendly `delta` + `done {insufficient:true}` with **no model call**; provider/mid-stream failure → terminal `error`. JSON path returns the unchanged `AskResponse` (eval + mobile keep working). Ask `MaxOutputTokens` raised 320 → 400 for conversational length. 
`dotnet build -c Release` clean; 868 unit tests green (history clamp, multi-turn message assembly, SSE event sequencing over a fake delta stream, companion greeting-vs-content prompt structure) + integration (catalog spoiler-gate, owner-404, SSE content-type + framing, JSON history passthrough — skip-on-unavailable). **Note: the grounding golden eval (`RagEvalRunner`) MUST be re-run on mini post-deploy** — the companion prompt loosened the refusal rule, so this is the real hallucination-risk gate (paid; not runnable in CI). Frontend = parallel agent (AI-026e).
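
The defensive history clamp described in this entry (last 6 turns, 4000 characters per turn, normalized roles) might look roughly like the sketch below. The limits come from the changelog text; everything else is an assumption: the `Content` property on `AskTurnDto`, the `HistoryClamp` class name, and the rule that any role other than `"assistant"` is treated as `"user"`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape: the changelog names AskTurnDto and its role values, not its other members.
public sealed record AskTurnDto(string Role, string Content);

public static class HistoryClamp
{
    public const int MaxTurns = 6;            // from the changelog: "last 6 turns"
    public const int MaxCharsPerTurn = 4000;  // from the changelog: "caps each turn at 4000 chars"

    public static IReadOnlyList<AskTurnDto> Clamp(IReadOnlyList<AskTurnDto>? history)
    {
        if (history is null || history.Count == 0)
            return Array.Empty<AskTurnDto>();

        return history
            .Skip(Math.Max(0, history.Count - MaxTurns)) // keep only the most recent turns
            .Select(t => new AskTurnDto(NormalizeRole(t.Role), Truncate(t.Content)))
            .ToList();
    }

    // Assumption: unknown or oddly-cased roles fall back to "user".
    private static string NormalizeRole(string? role) =>
        string.Equals(role?.Trim(), "assistant", StringComparison.OrdinalIgnoreCase) ? "assistant" : "user";

    private static string Truncate(string? text)
    {
        var s = text ?? string.Empty;
        return s.Length <= MaxCharsPerTurn ? s : s[..MaxCharsPerTurn];
    }
}
```

An empty or missing history yields an empty list, which is consistent with the note that the grounding eval behaves identically with `[]` history.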
134 changes: 134 additions & 0 deletions backend/src/Ai/TextStack.Ai.EvalSuite/CitationJudge.cs
@@ -0,0 +1,134 @@
using Application.Rag;
using Domain.Entities;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using TextStack.Ai.Core;
using TextStack.Ai.Evals;
using TextStack.Ai.Llm;
using TextStack.Ai.Rag;

namespace TextStack.Ai.EvalSuite;

/// <summary>
/// Shared citation-correctness machinery for the RAG evals (AI-027b + user-book P1). Both the catalog
/// <see cref="RagEvalRunner"/> and the <see cref="UserBookRagEvalRunner"/> call ONE copy of the judge
/// rubric + scoring + the <see cref="EvalRun"/> row factory, so the support metric never forks. The
/// "support" axis MUST stay Dim1 — SupportRate reads <see cref="JudgeScore.D1"/>.
/// </summary>
internal static class CitationJudge
{
    internal const string CitationFeature = "rag.citation";
    internal const int SupportPassThreshold = 4; // judge ≥4/5 on the support axis = a correct citation
    internal const string NoJudge = "n/a";

    internal static readonly Rubric Rubric = new(
        "support: does the cited excerpt actually contain or directly imply the specific claim it is attached to?",
        "relevance: is the excerpt genuinely on-topic for the answer, not a loosely-related passage?",
        "faithfulness: does the answer avoid asserting anything the cited excerpts do not support (no outside knowledge)?");

    private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];

    /// <summary>
    /// Generates a grounded answer per question (over its already-retrieved chunks) and judges each
    /// citation against the FULL text of the excerpt it points to. Returns the mean 1–5 score and the
    /// support rate (citations scored ≥<see cref="SupportPassThreshold"/> on the support axis). Shared
    /// verbatim by both runners.
    /// </summary>
    internal static async Task<RagCitationSummary> JudgeCitationsAsync(
        IRagAskService ask,
        ILlmService judge,
        IReadOnlyList<(string Question, IReadOnlyList<RetrievedChunk> Chunks)> retrieved,
        CancellationToken ct)
    {
        var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));
        var scores = new List<JudgeScore>();
        var supported = 0;
        var answersGenerated = 0;

        foreach (var (question, chunks) in retrieved)
        {
            ct.ThrowIfCancellationRequested();
            // lastReadOrd is irrelevant here — chunks are supplied directly (no user gating).
            var answer = await ask.AskFromChunksAsync(question, chunks, [], [], lastReadOrd: int.MaxValue, ct);
            if (answer.Insufficient || answer.Citations.Count == 0)
                continue;
            answersGenerated++;

            foreach (var cited in answer.Citations)
            {
                ct.ThrowIfCancellationRequested();
                var evidence =
                    $"Question: {question}\n\nAnswer:\n{answer.Answer}\n\n" +
                    $"Cited excerpt [{cited.Marker}] (the answer attributes a claim to this passage):\n{cited.Chunk.Text}";

                var evaluator = new RubricEvaluator(CitationFeature, Rubric);
                var result = await evaluator.EvaluateAsync(
                    JudgePlaceholderMessages,
                    new ChatResponse(new ChatMessage(ChatRole.Assistant, answer.Answer)),
                    chatConfig, [new RubricEvidenceContext(evidence)], ct);

                var score = new JudgeScore(
                    ReadAxis(result, Rubric.Dim1),
                    ReadAxis(result, Rubric.Dim2),
                    ReadAxis(result, Rubric.Dim3),
                    string.Empty);
                scores.Add(score);
                if (score.D1 >= SupportPassThreshold)
                    supported++;
            }
        }

        if (scores.Count == 0)
            return new RagCitationSummary(0, 0, 0, answersGenerated);

        var summary = JudgeRunner.Aggregate(scores);
        var supportRate = (double)supported / scores.Count;
        return new RagCitationSummary(summary.MeanOverall, supportRate, scores.Count, answersGenerated);
    }

    /// <summary>
    /// Lightweight faithfulness check for the user-book off-book probe: does <paramref name="answer"/>
    /// introduce facts not grounded in the book? Returns true when the answer is clean (gracefully
    /// declines or stays grounded — no invented facts), false when it hallucinates outside knowledge.
    /// One judge call; parses a leading yes/no.
    /// </summary>
    internal static async Task<bool> JudgeNoInventedFactsAsync(
        ILlmService judge, string question, string answer, CancellationToken ct)
    {
        var prompt =
            "You are grading whether an assistant answer about a book introduces facts NOT grounded in " +
            "that book. A good answer either gracefully declines (says it can't find this in the book) " +
            "or stays grounded. A bad answer asserts specific outside facts (dates, names, scores, " +
            "events) as if from the book.\n\n" +
            $"Question: {question}\n\nAnswer:\n{answer}\n\n" +
            "Does the answer introduce facts not grounded in the book? Reply with exactly YES or NO.";
        var request = new LlmRequest(
            SystemPrompt: "You are a strict faithfulness grader. Reply YES or NO only.",
            Messages: [new LlmMessage("user", prompt)],
            MaxOutputTokens: 4,
            FeatureTag: "eval.judge");
        var response = await judge.CompleteAsync(request, ct);
        var verdict = response.Text.Trim().ToLowerInvariant();
        // "yes" = invented facts present → NOT clean. Anything else (incl. "no") = clean.
        return !verdict.StartsWith("yes", StringComparison.Ordinal);
    }

    // RubricEvaluator names each axis "{feature}.{label}" (label = text before ':').
    private static int ReadAxis(EvaluationResult result, string dim) =>
        (int)Math.Round(result.Get<NumericMetric>($"{CitationFeature}.{dim.Split(':')[0].Trim()}").Value ?? 0);

    /// <summary>Shared <see cref="EvalRun"/> row factory — one copy for every RAG eval feature.</summary>
    internal static EvalRun MakeRun(
        string feature, string modelId, string judgeModelId, decimal score, int n, string? gitSha, string breakdown) => new()
    {
        Id = Guid.NewGuid(),
        Feature = feature,
        ModelId = modelId,
        JudgeModelId = judgeModelId,
        Score = Math.Round(score, 3),
        N = n,
        BreakdownJson = breakdown,
        GitSha = gitSha,
        CreatedAt = DateTimeOffset.UtcNow,
    };
}
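
The support-rate arithmetic in `JudgeCitationsAsync` reduces to "citations whose support score meets the threshold, divided by all judged citations". The standalone sketch below isolates that step on plain integers; the `SupportRateSketch` class and its method are illustrative names, not part of this PR, and only the threshold value of 4 is taken from the file above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SupportRateSketch
{
    // Same threshold as CitationJudge.SupportPassThreshold in the diff above.
    public const int SupportPassThreshold = 4;

    // supportScores: one 1-5 support-axis (Dim1) score per judged citation.
    public static double SupportRate(IReadOnlyList<int> supportScores)
    {
        if (supportScores.Count == 0)
            return 0; // mirrors the zeroed summary returned when nothing was judged

        var supported = supportScores.Count(s => s >= SupportPassThreshold);
        return (double)supported / supportScores.Count;
    }
}
```

For example, support scores of 5, 4, 3 and 2 give two passing citations out of four, so the rate is 0.5.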
99 changes: 5 additions & 94 deletions backend/src/Ai/TextStack.Ai.EvalSuite/RagEvalRunner.cs
@@ -1,12 +1,8 @@
using Application.Common.Interfaces;
using Application.Rag;
using Domain.Entities;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.Logging;
using TextStack.Ai.Core;
using TextStack.Ai.Evals;
using TextStack.Ai.Llm;
using TextStack.Ai.Rag;

namespace TextStack.Ai.EvalSuite;
@@ -49,17 +45,7 @@ public sealed class RagEvalRunner(ILogger<RagEvalRunner> logger)
 {
     // Retrieval scores 0–1 (recall / 1−leak), unlike the 1–5 judged features — the feature key disambiguates.
     private const string RetrievalModelId = "hybrid-retrieval";
-    private const string NoJudge = "n/a";
-    private const int SupportPassThreshold = 4; // judge ≥4/5 on the support axis = a correct citation
-    private const string CitationFeature = "rag.citation";
-
-    // The "support" axis MUST stay Dim1 — SupportRate reads JudgeScore.D1.
-    private static readonly Rubric CitationRubric = new(
-        "support: does the cited excerpt actually contain or directly imply the specific claim it is attached to?",
-        "relevance: is the excerpt genuinely on-topic for the answer, not a loosely-related passage?",
-        "faithfulness: does the answer avoid asserting anything the cited excerpts do not support (no outside knowledge)?");
-
-    private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];
+    private const string NoJudge = CitationJudge.NoJudge;

     public async Task<RagEvalResult> RunAsync(
         IRagService rag,
@@ -110,7 +96,7 @@ public async Task<RagEvalResult> RunAsync(
         // Citation correctness (027b) — only when a generator + judge are supplied.
         RagCitationSummary? citation = null;
         if (ask is not null && judge is not null)
-            citation = await JudgeCitationsAsync(ask, judge, retrievedByQuestion, ct);
+            citation = await CitationJudge.JudgeCitationsAsync(ask, judge, retrievedByQuestion, ct);

         logger.LogInformation(
             "RAG eval edition={Edition} recall@{K}={Recall:0.00} (N={RecallN}) spoilerLeakRate={Leak:0.00} (N={SpoilerN}) citation={Cit}",
@@ -119,92 +105,17 @@ public async Task<RagEvalResult> RunAsync(

         if (persist && db is not null)
         {
-            db.EvalRuns.Add(MakeRun("rag.retrieval", RetrievalModelId, NoJudge, (decimal)recall, recallCases.Count, gitSha,
+            db.EvalRuns.Add(CitationJudge.MakeRun("rag.retrieval", RetrievalModelId, NoJudge, (decimal)recall, recallCases.Count, gitSha,
                 $"{{\"recallAtK\":{recall:0.000},\"k\":{k},\"hits\":{recallDetail.Count(c => c.Hit)}}}"));
-            db.EvalRuns.Add(MakeRun("rag.spoiler", RetrievalModelId, NoJudge, (decimal)(1.0 - leakRate), spoilerCases.Count, gitSha,
+            db.EvalRuns.Add(CitationJudge.MakeRun("rag.spoiler", RetrievalModelId, NoJudge, (decimal)(1.0 - leakRate), spoilerCases.Count, gitSha,
                 $"{{\"leakRate\":{leakRate:0.000},\"leakingCases\":{spoilerDetail.Count(c => c.LeakCount > 0)}}}"));
             if (citation is not null)
-                db.EvalRuns.Add(MakeRun(CitationFeature, RagAskService.FeatureTag, judgeModelId ?? NoJudge,
+                db.EvalRuns.Add(CitationJudge.MakeRun(CitationJudge.CitationFeature, RagAskService.FeatureTag, judgeModelId ?? NoJudge,
                     (decimal)citation.Score, citation.CitationsJudged, gitSha,
                     $"{{\"supportRate\":{citation.SupportRate:0.000},\"answers\":{citation.AnswersGenerated}}}"));
             await db.SaveChangesAsync(ct);
         }

         return new RagEvalResult(recall, recallCases.Count, leakRate, spoilerCases.Count, recallDetail, spoilerDetail, citation);
     }
-
-    /// <summary>
-    /// Generates a grounded answer per question (over its already-retrieved chunks) and judges each
-    /// citation against the FULL text of the excerpt it points to. Returns the mean 1–5 score and the
-    /// support rate (citations scored ≥<see cref="SupportPassThreshold"/> on the support axis).
-    /// </summary>
-    private async Task<RagCitationSummary> JudgeCitationsAsync(
-        IRagAskService ask,
-        ILlmService judge,
-        IReadOnlyList<(string Question, IReadOnlyList<RetrievedChunk> Chunks)> retrieved,
-        CancellationToken ct)
-    {
-        var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));
-        var scores = new List<JudgeScore>();
-        var supported = 0;
-        var answersGenerated = 0;
-
-        foreach (var (question, chunks) in retrieved)
-        {
-            ct.ThrowIfCancellationRequested();
-            // lastReadOrd is irrelevant here — chunks are supplied directly (no user gating).
-            var answer = await ask.AskFromChunksAsync(question, chunks, [], [], lastReadOrd: int.MaxValue, ct);
-            if (answer.Insufficient || answer.Citations.Count == 0)
-                continue;
-            answersGenerated++;
-
-            foreach (var cited in answer.Citations)
-            {
-                ct.ThrowIfCancellationRequested();
-                var evidence =
-                    $"Question: {question}\n\nAnswer:\n{answer.Answer}\n\n" +
-                    $"Cited excerpt [{cited.Marker}] (the answer attributes a claim to this passage):\n{cited.Chunk.Text}";
-
-                var evaluator = new RubricEvaluator(CitationFeature, CitationRubric);
-                var result = await evaluator.EvaluateAsync(
-                    JudgePlaceholderMessages,
-                    new ChatResponse(new ChatMessage(ChatRole.Assistant, answer.Answer)),
-                    chatConfig, [new RubricEvidenceContext(evidence)], ct);
-
-                var score = new JudgeScore(
-                    ReadAxis(result, CitationRubric.Dim1),
-                    ReadAxis(result, CitationRubric.Dim2),
-                    ReadAxis(result, CitationRubric.Dim3),
-                    string.Empty);
-                scores.Add(score);
-                if (score.D1 >= SupportPassThreshold)
-                    supported++;
-            }
-        }
-
-        if (scores.Count == 0)
-            return new RagCitationSummary(0, 0, 0, answersGenerated);
-
-        var summary = JudgeRunner.Aggregate(scores);
-        var supportRate = (double)supported / scores.Count;
-        return new RagCitationSummary(summary.MeanOverall, supportRate, scores.Count, answersGenerated);
-    }
-
-    // RubricEvaluator names each axis "{feature}.{label}" (label = text before ':').
-    private static int ReadAxis(EvaluationResult result, string dim) =>
-        (int)Math.Round(result.Get<NumericMetric>($"{CitationFeature}.{dim.Split(':')[0].Trim()}").Value ?? 0);
-
-    private static EvalRun MakeRun(
-        string feature, string modelId, string judgeModelId, decimal score, int n, string? gitSha, string breakdown) => new()
-    {
-        Id = Guid.NewGuid(),
-        Feature = feature,
-        ModelId = modelId,
-        JudgeModelId = judgeModelId,
-        Score = Math.Round(score, 3),
-        N = n,
-        BreakdownJson = breakdown,
-        GitSha = gitSha,
-        CreatedAt = DateTimeOffset.UtcNow,
-    };
 }