Add a first-party harness-backed RubricJudge #47

@dcramer

Problem

vitest-evals@0.9.0-beta.1 now exposes the configured suite harness to judges through JudgeContext.harness, and Harness.prompt is the right provider-agnostic model seam. That resolves the most important design issue from #45 / #46.

The remaining repeated work for consumers is the rubric judge itself. Any app that wants an LLM to judge natural-language behavior still has to implement the same pieces locally:

  • prompt text that asks for a calibrated rubric verdict
  • a score scale, usually A-E or pass/partial/fail
  • JSON-only response instructions
  • JSON extraction / parse error handling
  • score mapping
  • rationale metadata
  • optional raw answer metadata for reporter output

None of this is app-specific enough to justify every consumer carrying its own version.

Desired Shape

Add a first-party RubricJudge that returns a normal JudgeFn<JudgeContext<...>> and calls ctx.harness.prompt(...) internally.

Example:

import { RubricJudge, describeEval } from "vitest-evals";
// SlackEvalInput, formatList, slackHarness, and input are app-local and not shown here.

const BehaviorJudge = RubricJudge<SlackEvalInput>({
  getCriteria: ({ inputValue }) => [
    `Contract:\n${inputValue.criteria.contract}`,
    formatList("Pass", inputValue.criteria.pass),
    formatList("Allow", inputValue.criteria.allow),
    formatList("Fail", inputValue.criteria.fail),
  ].filter(Boolean).join("\n\n"),
});

describeEval("Slack behavior", {
  harness: slackHarness,
  judges: [BehaviorJudge],
  judgeThreshold: 0.75,
}, (it) => {
  it("answers the user-visible request", async ({ run }) => {
    await run(input);
  });
});

The important property is that model/provider/auth stays with the harness:

const response = await ctx.harness.prompt(prompt, {
  system,
  metadata: { judge: "RubricJudge" },
});

RubricJudge should not import or require AI SDK, OpenAI, Pi, or a specific Gateway provider.
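
To make the harness-only seam concrete, here is a minimal sketch of a judge that touches nothing except ctx.harness.prompt. It is not the proposed implementation. HarnessLike, JudgeContextLike, JudgeResultLike, and sketchRubricJudge are hypothetical names; the sketch assumes prompt resolves to plain text and that a judge returns { score, metadata }, neither of which is confirmed here.

```typescript
// Hypothetical minimal shapes; the real vitest-evals types may differ.
type HarnessLike = {
  prompt: (
    text: string,
    options?: { system?: string; metadata?: Record<string, unknown> },
  ) => Promise<string>; // assumption: the harness resolves to plain text
};
type JudgeContextLike = { output: string; harness: HarnessLike };
type JudgeResultLike = {
  score: number;
  metadata: { answer?: string; rationale: string };
};

const SCORES = new Map<string, number>([
  ["A", 1], ["B", 0.75], ["C", 0.5], ["D", 0.25], ["E", 0],
]);

function sketchRubricJudge(getCriteria: (ctx: JudgeContextLike) => string) {
  return async (ctx: JudgeContextLike): Promise<JudgeResultLike> => {
    const prompt = [
      `Criteria:\n${getCriteria(ctx)}`,
      `Output to grade:\n${ctx.output}`,
      'Reply with JSON only: {"answer": "A".."E", "rationale": string}',
    ].join("\n\n");

    // The harness owns model, provider, and auth; the judge only sends text.
    const response = await ctx.harness.prompt(prompt, {
      system: "You are a strict grader.",
      metadata: { judge: "RubricJudge" },
    });

    try {
      const parsed = JSON.parse(response) as { answer?: unknown; rationale?: unknown };
      const answer = String(parsed.answer);
      const score = SCORES.get(answer);
      if (score === undefined) throw new Error(`unknown answer: ${answer}`);
      return { score, metadata: { answer, rationale: String(parsed.rationale ?? "") } };
    } catch (error) {
      // Malformed judge output scores 0 in this sketch; failing loudly is another option.
      return { score: 0, metadata: { rationale: `Unparseable judge output: ${String(error)}` } };
    }
  };
}
```

Nothing in the sketch imports a model SDK, so swapping providers would only mean swapping the harness.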

Possible API

Something like:

type RubricJudgeConfig<TInput, TMetadata, THarness> = {
  name?: string;
  getCriteria: (ctx: JudgeContext<TInput, TMetadata, THarness>) => string;
  system?: string | ((ctx: JudgeContext<TInput, TMetadata, THarness>) => string);
  prompt?: (args: {
    output: string;
    criteria: string;
    ctx: JudgeContext<TInput, TMetadata, THarness>;
  }) => string;
  scale?: "abcde" | Array<{
    label: string;
    score: number;
    description: string;
  }>;
  parser?: (text: string) => {
    label: string;
    rationale?: string;
    metadata?: Record<string, unknown>;
  };
};
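
As an illustration of the parser hook, a strict default could look roughly like the sketch below. The function name strictParser and the JSON keys label and rationale in the model reply are assumptions made for this example; only the return shape comes from the config type above.

```typescript
type ParsedVerdict = {
  label: string;
  rationale?: string;
  metadata?: Record<string, unknown>;
};

// Hypothetical strict parser: the whole response must be one JSON object.
function strictParser(text: string): ParsedVerdict {
  let value: unknown;
  try {
    value = JSON.parse(text);
  } catch {
    throw new Error("Judge output is not valid JSON");
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("Judge output must be a JSON object");
  }
  const { label, rationale, ...rest } = value as Record<string, unknown>;
  if (typeof label !== "string" || label.length === 0) {
    throw new Error('Judge output is missing a string "label"');
  }
  return {
    label,
    rationale: typeof rationale === "string" ? rationale : undefined,
    // Any extra keys are passed through so reporters can show them.
    metadata: rest,
  };
}
```

Throwing on bad input keeps the decision about how to score malformed output in the judge rather than in the parser.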

The default could be an A-E scale mapped to evenly spaced scores:

  • A: 1
  • B: 0.75
  • C: 0.5
  • D: 0.25
  • E: 0
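
One way the scale option could be normalized into a label-to-score lookup is sketched below. resolveScale and scoreFor are hypothetical helpers, the description strings are placeholders, and the requirement that scores fall in [0, 1] is an assumption inferred from the judgeThreshold: 0.75 example.

```typescript
type ScaleEntry = { label: string; score: number; description: string };
type ScaleConfig = "abcde" | ScaleEntry[];

// Default A-E scale from the list above; descriptions are placeholders.
const DEFAULT_SCALE: ScaleEntry[] = [
  { label: "A", score: 1, description: "placeholder" },
  { label: "B", score: 0.75, description: "placeholder" },
  { label: "C", score: 0.5, description: "placeholder" },
  { label: "D", score: 0.25, description: "placeholder" },
  { label: "E", score: 0, description: "placeholder" },
];

function resolveScale(scale: ScaleConfig = "abcde"): Map<string, number> {
  const entries = scale === "abcde" ? DEFAULT_SCALE : scale;
  const scores = new Map<string, number>();
  for (const { label, score } of entries) {
    if (scores.has(label)) throw new Error(`Duplicate scale label: ${label}`);
    // Assumption: scores are normalized to [0, 1]; NaN also fails this check.
    if (!(score >= 0 && score <= 1)) {
      throw new Error(`Score for ${label} must be in [0, 1]`);
    }
    scores.set(label, score);
  }
  if (scores.size === 0) throw new Error("Scale must have at least one entry");
  return scores;
}

function scoreFor(scores: Map<string, number>, label: string): number {
  const score = scores.get(label.trim());
  if (score === undefined) throw new Error(`Label not in scale: ${label}`);
  return score;
}
```

A pass/partial/fail scale would go through the same path by passing three entries instead of "abcde".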

The returned metadata should include at least:

{
  answer: "A" | "B" | "C" | "D" | "E",
  rationale: string,
}

Design Constraints

  • Stay harness-first. Do not reintroduce global configure(...) or scorer-first APIs.
  • Stay provider-agnostic. The only model seam should be ctx.harness.prompt.
  • Keep it useful both as an automatic suite judge and with explicit toSatisfyJudge(...) assertions.
  • Do not hide Vitest's it(..., async ({ run }) => ...) shape.
  • Do not make deterministic judges depend on this helper.

Open Questions

  • Should the default rubric output schema be strict JSON only, or should the parser recover fenced JSON / JSON embedded in text?
  • Should RubricJudge export its default prompt formatter so consumers can snapshot or customize it incrementally?
  • Is A-E the right default scale, or should the default be pass/partial/fail with custom scales documented?
  • Should getCriteria accept a structured rubric object directly, or should vitest-evals only require a final criteria string?
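
For the first question, a lenient parser could fall back from strict JSON to a fenced block and then to the first brace-delimited span. The sketch below shows one possible recovery order; extractJsonObject is a hypothetical helper, not an existing vitest-evals export.

```typescript
// Built at runtime so this snippet can itself sit inside a fenced block.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Try the whole text, then a fenced block, then the outermost {...} span.
function extractJsonObject(text: string): Record<string, unknown> | undefined {
  const candidates: string[] = [text.trim()];

  const fenced = text.match(FENCED_BLOCK);
  if (fenced && fenced[1] !== undefined) candidates.push(fenced[1].trim());

  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));

  for (const candidate of candidates) {
    try {
      const value: unknown = JSON.parse(candidate);
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        return value as Record<string, unknown>;
      }
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  return undefined;
}
```

The trade-off is that recovery hides prompt drift: a strict-only default surfaces a model that stops following the JSON-only instruction, while the lenient path keeps suites green.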

Acceptance Criteria

  • A consumer can replace a local custom LLM rubric judge with a small RubricJudge({ getCriteria }) call.
  • The judge uses JudgeContext.harness.prompt and does not require provider-specific dependencies.
  • Reporter output includes score, selected answer/label, and rationale.
  • Tests cover valid JSON, malformed judge output, custom scales, and explicit matcher usage.
  • Docs show the helper with a custom harness and with a first-party harness package.
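
The valid-JSON and malformed-output tests could run without any provider by using a stub harness that replays canned replies. The sketch below assumes prompt takes (text, options) and resolves to a string; stubHarness and its calls array are hypothetical test helpers.

```typescript
type PromptOptions = { system?: string; metadata?: Record<string, unknown> };
type PromptCall = { text: string; options: PromptOptions | undefined };

// Hypothetical test double: replays canned replies in order and records each call.
function stubHarness(replies: string[]) {
  const calls: PromptCall[] = [];
  let next = 0;
  return {
    calls,
    prompt: async (text: string, options?: PromptOptions): Promise<string> => {
      calls.push({ text, options });
      const reply = replies[next];
      if (reply === undefined) throw new Error("stubHarness ran out of replies");
      next += 1;
      return reply;
    },
  };
}

// Example setup: one well-formed verdict followed by one malformed reply.
const harness = stubHarness(['{"answer":"A","rationale":"ok"}', "not json"]);
```

Recording the calls also lets a test assert on the prompt text and on the metadata passed to the harness.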

Related
