feat(eval): user-book RAG eval — grounding validation for uploads #387
Merged
…loads
User books are the priority; the catalog-only RAG eval (editionId + hardcoded
DDIA golden set) didn't cover them. New content-agnostic user-book eval:
- UserBookRagEvalRunner: 6 generated grounding probes (question synthesised from
each retrieved chunk -> real Ask path -> judge citation support, shared rubric)
+ greeting probe (structural: warm, 0 citations, not insufficient, no [n]) +
off-book probe (judge: no invented facts). No expected-chapter recall (can't,
arbitrary content); no spoiler gate (user owns the doc).
- POST /admin/rag/userbook/{id}/eval?judge=openai (admin; resolves owner, logs
target userId for privacy; 503 no key). Persists eval_run tags
rag.userbook.{citation,behavior,retrieval} -> visible on /ai-quality.
- Empty/un-embedded book -> short-circuit, NO generator/judge LLM call.
- Extracted shared CitationJudge (rubric + JudgeCitations + MakeRun) so the
catalog RagEvalRunner and the user-book runner share one copy.
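
As an illustration of the probe flow described in the list above, here is a minimal Python sketch; it is not the project's implementation. The `AskResponse` shape, its field names, and the `generate_question` / `ask` / `judge` callables are all hypothetical stand-ins. The sketch covers only the parts of the greeting check that are plainly structural (0 citations, not insufficient, no [n] marker); the commit does not say how warmth is assessed, so that is left out.

```python
import re
from dataclasses import dataclass, field

# Hypothetical response shape; the real Ask path's fields are not shown in the PR.
@dataclass
class AskResponse:
    text: str
    citations: list = field(default_factory=list)
    insufficient: bool = False

# Matches inline citation markers such as [1] or [12].
CITATION_MARKER = re.compile(r"\[\d+\]")

def greeting_probe_passes(resp: AskResponse) -> bool:
    """Structural greeting check: 0 citations, not insufficient, no [n] marker."""
    return (
        len(resp.citations) == 0
        and not resp.insufficient
        and CITATION_MARKER.search(resp.text) is None
    )

def run_eval(chunks, generate_question, ask, judge, max_probes=6):
    """Run up to max_probes grounding probes over the retrieved chunks.

    An empty chunk list (empty or un-embedded book) short-circuits before
    any generator or judge callable is invoked.
    """
    if not chunks:
        return {"status": "skipped", "results": []}
    results = []
    for chunk in chunks[:max_probes]:
        question = generate_question(chunk)  # question synthesised from the chunk
        answer = ask(question)               # stand-in for the real Ask path
        results.append({"question": question,
                        "supported": judge(question, answer, chunk)})
    return {"status": "completed", "results": results}
```

Passing the LLM-backed steps in as callables keeps the short-circuit easy to verify: a test can hand in callables that record their invocations and assert that none fired for an empty book.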
architect-planned. 61 AiEvals (catalog runner still green after extraction) +
886 unit tests green; build clean. P2 = admin UI trigger button. Validated the
behavior live earlier on the owner's DDIA upload (grounds+cites, warm greeting,
graceful off-book, multi-turn) — this automates it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
User books are the priority; the catalog RAG eval (editionId + hardcoded DDIA golden set) didn't cover them. Adds a content-agnostic user-book eval that automates the live grounding validation.
- UserBookRagEvalRunner: 6 generated grounding probes (synthesise a question from each retrieved chunk → real Ask path → judge citation support via the shared rubric) + a greeting probe (structural: warm, 0 citations, not insufficient, no [n]) + an off-book probe (judge: no invented facts). No recall@k (arbitrary content); no spoiler gate (user owns the doc).
- POST /admin/rag/userbook/{id}/eval?judge=openai: admin; resolves owner, logs target userId (privacy), 503 if no key. Persists eval_run tags rag.userbook.{citation,behavior,retrieval} → /ai-quality.
- Extracted a shared CitationJudge so the catalog RagEvalRunner + the user-book runner share one rubric copy.

61 AiEvals (catalog runner still green after the extraction) + 886 unit tests green; build clean. P2 = admin UI trigger. Already validated the behavior live on the owner's DDIA upload; this automates it.
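
For reference, a request against the endpoint above could be built as in the following Python sketch. Only the path and the `judge` query parameter come from this PR; the base URL is a placeholder, and the admin authentication the endpoint requires is not described here, so it is omitted.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

def build_eval_request(base_url: str, book_id: str, judge: str = "openai") -> Request:
    """Build (but do not send) the POST that triggers a user-book eval.

    base_url is a hypothetical host; admin auth headers are not shown in the
    PR and would have to be added before sending.
    """
    path = f"/admin/rag/userbook/{quote(book_id, safe='')}/eval"
    url = f"{base_url.rstrip('/')}{path}?{urlencode({'judge': judge})}"
    return Request(url, method="POST")

# Example with a placeholder host and book id.
req = build_eval_request("http://localhost:5000", "abc-123")
print(req.get_method(), req.full_url)
```

When sent with `urllib.request.urlopen`, a 503 response (the "no key" case) would surface as an `urllib.error.HTTPError` whose `code` is 503.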
🤖 Generated with Claude Code