Problem
vitest-evals@0.9.0-beta.1 now exposes the configured suite harness to judges through JudgeContext.harness, and Harness.prompt is the right provider-agnostic model seam. That resolves the most important design issue from #45 / #46.
The remaining repeated work for consumers is the rubric judge itself. Any app that wants LLM-judged natural-language behavior still has to implement the same pieces locally:
- prompt text that asks for a calibrated rubric verdict
- a score scale, usually A-E or pass/partial/fail
- JSON-only response instructions
- JSON extraction / parse error handling
- score mapping
- rationale metadata
- optional raw answer metadata for reporter output
This is not app-specific enough to justify every consumer carrying its own version.
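To make the duplication concrete, here is a minimal sketch of the prompt-building piece that each consumer currently writes by hand. The names `ScaleEntry`, `exampleScale`, and `buildRubricPrompt` are hypothetical and not part of vitest-evals; the wording of the prompt is only an illustration.

```typescript
// Hypothetical helper: none of these names come from vitest-evals.
type ScaleEntry = { label: string; score: number; description: string };

// Example pass/partial/fail scale with assumed descriptions.
const exampleScale: ScaleEntry[] = [
  { label: "pass", score: 1, description: "Meets every criterion." },
  { label: "partial", score: 0.5, description: "Meets some criteria." },
  { label: "fail", score: 0, description: "Misses the core criteria." },
];

// Builds the rubric prompt: criteria, output under test, the allowed
// answers, and a JSON-only response instruction.
function buildRubricPrompt(
  output: string,
  criteria: string,
  scale: ScaleEntry[],
): string {
  const options = scale.map((s) => `- ${s.label}: ${s.description}`).join("\n");
  const labels = scale.map((s) => JSON.stringify(s.label)).join(" | ");
  return [
    "Grade the output against the criteria.",
    `Criteria:\n${criteria}`,
    `Output:\n${output}`,
    `Choose exactly one answer:\n${options}`,
    `Respond with JSON only, in the form {"answer": ${labels}, "rationale": "<one or two sentences>"}.`,
  ].join("\n\n");
}
```

Every app that carries a local rubric judge ends up with some variant of this function plus the parsing and score mapping that go with it.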
Desired Shape
Add a first-party RubricJudge that returns a normal JudgeFn<JudgeContext<...>> and calls ctx.harness.prompt(...) internally.
Example:
import { RubricJudge, describeEval } from "vitest-evals";

const BehaviorJudge = RubricJudge<SlackEvalInput>({
  getCriteria: ({ inputValue }) => [
    `Contract:\n${inputValue.criteria.contract}`,
    formatList("Pass", inputValue.criteria.pass),
    formatList("Allow", inputValue.criteria.allow),
    formatList("Fail", inputValue.criteria.fail),
  ].filter(Boolean).join("\n\n"),
});

describeEval("Slack behavior", {
  harness: slackHarness,
  judges: [BehaviorJudge],
  judgeThreshold: 0.75,
}, (it) => {
  it("answers the user-visible request", async ({ run }) => {
    await run(input);
  });
});
The important property is that model/provider/auth stays with the harness:
const response = await ctx.harness.prompt(prompt, {
  system,
  metadata: { judge: "RubricJudge" },
});
RubricJudge should not import or require AI SDK, OpenAI, Pi, or a specific Gateway provider.
Possible API
Something like:
type RubricJudgeConfig<TInput, TMetadata, THarness> = {
  name?: string;
  getCriteria: (ctx: JudgeContext<TInput, TMetadata, THarness>) => string;
  system?: string | ((ctx: JudgeContext<TInput, TMetadata, THarness>) => string);
  prompt?: (args: {
    output: string;
    criteria: string;
    ctx: JudgeContext<TInput, TMetadata, THarness>;
  }) => string;
  scale?: "abcde" | Array<{
    label: string;
    score: number;
    description: string;
  }>;
  parser?: (text: string) => {
    label: string;
    rationale?: string;
    metadata?: Record<string, unknown>;
  };
};
Default behavior could be an A-E scale:
- A: 1
- B: 0.75
- C: 0.5
- D: 0.25
- E: 0
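One way the `scale` option could resolve to a score lookup is sketched below. `resolveScale` and `scoreFor` are hypothetical names, the A-E descriptions are placeholders, and the case-insensitive matching and the [0, 1] range check are assumptions rather than anything decided in this issue.

```typescript
type ScaleEntry = { label: string; score: number; description: string };
type ScaleOption = "abcde" | ScaleEntry[];

// Scores follow the proposed default; the descriptions are placeholders.
const ABCDE: ScaleEntry[] = [
  { label: "A", score: 1, description: "Fully satisfies the criteria." },
  { label: "B", score: 0.75, description: "Mostly satisfies the criteria." },
  { label: "C", score: 0.5, description: "Partially satisfies the criteria." },
  { label: "D", score: 0.25, description: "Barely satisfies the criteria." },
  { label: "E", score: 0, description: "Does not satisfy the criteria." },
];

// Turns either the built-in scale or a custom one into a label -> score map.
function resolveScale(scale: ScaleOption = "abcde"): Map<string, number> {
  const entries = scale === "abcde" ? ABCDE : scale;
  const byLabel = new Map<string, number>();
  for (const { label, score } of entries) {
    // Assumption: scores are normalized to [0, 1], matching judgeThreshold.
    if (!(score >= 0 && score <= 1)) {
      throw new RangeError(`Score for "${label}" must be within [0, 1]`);
    }
    const key = label.trim().toUpperCase();
    if (byLabel.has(key)) throw new Error(`Duplicate scale label: ${label}`);
    byLabel.set(key, score);
  }
  return byLabel;
}

// Returns undefined for labels that are not on the scale.
function scoreFor(label: string, scale?: ScaleOption): number | undefined {
  return resolveScale(scale).get(label.trim().toUpperCase());
}
```

Returning `undefined` for an unknown label leaves it to the caller to decide whether that is a failed judgment or an error.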
The returned metadata should include at least:
{
  answer: "A" | "B" | "C" | "D" | "E",
  rationale: string,
}
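Putting the pieces together, a rough sketch of the judge body might look like this. The `Harness`, `JudgeContext`, and `JudgeResult` types are local stand-ins, not the real vitest-evals types: in particular, `prompt` resolving to a plain string, `ctx.output` being a string, and malformed judge output scoring 0 are all assumptions made for illustration. `sketchRubricJudge` is a hypothetical name.

```typescript
// Local stand-ins for the real vitest-evals types; the shapes are assumptions.
type Harness = {
  prompt: (
    text: string,
    options?: { system?: string; metadata?: Record<string, unknown> },
  ) => Promise<string>;
};
type JudgeContext<TInput> = { inputValue: TInput; output: string; harness: Harness };
type JudgeResult = { score: number; metadata: Record<string, unknown> };

// Default A-E scores from the proposal.
const SCORES = new Map<string, number>([
  ["A", 1], ["B", 0.75], ["C", 0.5], ["D", 0.25], ["E", 0],
]);

function sketchRubricJudge<TInput>(config: {
  getCriteria: (ctx: JudgeContext<TInput>) => string;
}) {
  return async (ctx: JudgeContext<TInput>): Promise<JudgeResult> => {
    const prompt = [
      `Criteria:\n${config.getCriteria(ctx)}`,
      `Output:\n${ctx.output}`,
      'Reply with JSON only: {"answer": "A" | "B" | "C" | "D" | "E", "rationale": string}',
    ].join("\n\n");

    // The harness is the only model seam; no provider SDK is imported here.
    const raw = await ctx.harness.prompt(prompt, {
      system: "You are a strict rubric grader.",
      metadata: { judge: "RubricJudge" },
    });

    try {
      const parsed = JSON.parse(raw) as { answer?: unknown; rationale?: unknown };
      const answer = String(parsed.answer ?? "").toUpperCase();
      const score = SCORES.get(answer);
      if (score === undefined) throw new Error(`Unknown answer: ${answer}`);
      return { score, metadata: { answer, rationale: String(parsed.rationale ?? "") } };
    } catch (error) {
      // Assumption: malformed output scores 0 and keeps the raw text for reporters.
      return { score: 0, metadata: { error: String(error), raw } };
    }
  };
}
```

Because the judge only touches `ctx.harness.prompt`, a test can pass a fake harness that returns canned JSON and never call a real model.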
Design Constraints
- Stay harness-first. Do not reintroduce global configure(...) or scorer-first APIs.
- Stay provider-agnostic. The only model seam should be ctx.harness.prompt.
- Keep it useful for automatic suite judges and explicit toSatisfyJudge(...).
- Do not hide Vitest's it(..., async ({ run }) => ...) shape.
- Do not make deterministic judges depend on this helper.
Open Questions
- Should the default rubric output schema be strict JSON only, or should the parser recover fenced JSON / JSON embedded in text?
- Should RubricJudge export its default prompt formatter so consumers can snapshot or customize it incrementally?
- Is A-E the right default scale, or should the default be pass/partial/fail with custom scales documented?
- Should getCriteria accept a structured rubric object directly, or should vitest-evals only require a final criteria string?
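For the first question, a tolerant parser does not have to be complicated. The sketch below tries strict JSON first, then a fenced block, then the outermost brace-delimited span. `extractJsonCandidates` and `parseVerdict` are hypothetical names, and the `answer` / `rationale` keys are assumed from the metadata shape proposed above.

```typescript
type Verdict = { label: string; rationale?: string };

// Three backticks, built at runtime so this snippet can sit inside a fence.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Candidate strings in order: whole text, fenced block, outermost {...} span.
function extractJsonCandidates(text: string): string[] {
  const candidates = [text.trim()];
  const fenced = text.match(FENCED_JSON);
  if (fenced) candidates.push(fenced[1].trim());
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));
  return candidates;
}

// Returns the first candidate that parses to an object with a string answer.
function parseVerdict(text: string): Verdict {
  for (const candidate of extractJsonCandidates(text)) {
    try {
      const value: unknown = JSON.parse(candidate);
      if (typeof value === "object" && value !== null && "answer" in value) {
        const { answer, rationale } = value as { answer: unknown; rationale?: unknown };
        if (typeof answer === "string") {
          return {
            label: answer,
            rationale: typeof rationale === "string" ? rationale : undefined,
          };
        }
      }
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  throw new SyntaxError("No rubric verdict found in judge output");
}
```

A strict-only mode would keep just the first candidate; the trade-off is fewer false recoveries against more failed judgments when a model wraps its answer in prose.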
Acceptance Criteria
- A consumer can replace a local custom LLM rubric judge with a small RubricJudge({ getCriteria }) call.
- The judge uses JudgeContext.harness.prompt and does not require provider-specific dependencies.
- Reporter output includes score, selected answer/label, and rationale.
- Tests cover valid JSON, malformed judge output, custom scales, and explicit matcher usage.
- Docs show the helper with a custom harness and with a first-party harness package.
Related