Add a first-party harness-backed RubricJudge #47

## Problem

`vitest-evals@0.9.0-beta.1` now exposes the configured suite harness to judges through `JudgeContext.harness`, and `Harness.prompt` is the right provider-agnostic model seam. That resolves the most important design issue from #45 / #46.

The remaining repeated work for consumers is the rubric judge itself. Any app that wants LLM-judged natural-language behavior still has to implement the same pieces locally:

- prompt text that asks for a calibrated rubric verdict
- a score scale, usually A-E or pass/partial/fail
- JSON-only response instructions
- JSON extraction / parse error handling
- score mapping
- rationale metadata
- optional raw answer metadata for reporter output

None of these pieces is app-specific enough to justify every consumer carrying its own version.
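To make the repeated work concrete, here is a minimal sketch of the parse and error-handling piece that each consumer currently rewrites. The `RubricVerdict` type and `parseVerdict` function are hypothetical names for illustration, not part of `vitest-evals`, and the `{ label, rationale }` reply shape is an assumption.

```ts
// Hypothetical result shape; the real judge result type in vitest-evals may differ.
type RubricVerdict =
  | { ok: true; label: string; rationale: string }
  | { ok: false; error: string; raw: string };

// Parse a JSON-only judge reply and validate the label against the allowed scale.
function parseVerdict(raw: string, allowed: readonly string[]): RubricVerdict {
  let value: unknown;
  try {
    value = JSON.parse(raw.trim());
  } catch {
    return { ok: false, error: "judge reply was not valid JSON", raw };
  }
  if (typeof value !== "object" || value === null) {
    return { ok: false, error: "judge reply was not a JSON object", raw };
  }
  const { label, rationale } = value as { label?: unknown; rationale?: unknown };
  if (typeof label !== "string" || !allowed.includes(label)) {
    return { ok: false, error: "judge reply had a missing or unknown label", raw };
  }
  // A missing rationale is tolerated and reported as an empty string.
  return { ok: true, label, rationale: typeof rationale === "string" ? rationale : "" };
}
```

Keeping the raw reply on the failure branch is what would let a reporter show the unparsed answer alongside the error.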

## Desired Shape

Add a first-party `RubricJudge` that returns a normal `JudgeFn<JudgeContext<...>>` and calls `ctx.harness.prompt(...)` internally.

Example:

```ts
import { RubricJudge, describeEval } from "vitest-evals";
// SlackEvalInput, formatList, slackHarness, and input are defined by the consuming app.

const BehaviorJudge = RubricJudge<SlackEvalInput>({
  getCriteria: ({ inputValue }) => [
    `Contract:\n${inputValue.criteria.contract}`,
    formatList("Pass", inputValue.criteria.pass),
    formatList("Allow", inputValue.criteria.allow),
    formatList("Fail", inputValue.criteria.fail),
  ].filter(Boolean).join("\n\n"),
});

describeEval("Slack behavior", {
  harness: slackHarness,
  judges: [BehaviorJudge],
  judgeThreshold: 0.75,
}, (it) => {
  it("answers the user-visible request", async ({ run }) => {
    await run(input);
  });
});
```
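The example above relies on an app-side `formatList` helper that is not shown. One possible shape, assuming it takes a heading plus an optional list of strings and returns an empty string when there is nothing to list (so that `.filter(Boolean)` drops the section), might look like this. The sample criteria strings are invented for illustration.

```ts
// Hypothetical helper: renders "Heading:\n- item\n- item", or "" when the list is empty.
function formatList(heading: string, items: readonly string[] | undefined): string {
  if (!items || items.length === 0) return "";
  return `${heading}:\n${items.map((item) => `- ${item}`).join("\n")}`;
}

// Empty sections return "" and are removed by filter(Boolean) before joining.
const criteria = [
  "Contract:\nReply in the thread.",
  formatList("Pass", ["Answers the question"]),
  formatList("Allow", []),
  formatList("Fail", ["Posts to the channel", "Ignores the request"]),
].filter(Boolean).join("\n\n");
```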

The important property is that model, provider, and auth configuration all stay with the harness:

```ts
const response = await ctx.harness.prompt(prompt, {
  system,
  metadata: { judge: "RubricJudge" },
});
```

`RubricJudge` should not import or require AI SDK, OpenAI, Pi, or a specific Gateway provider.
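As a rough illustration of that constraint, the sketch below implements a judge factory against local stand-in types only. Everything here is an assumption made for the example: the `HarnessLike`, `JudgeContextLike`, and `JudgeResultLike` types, the idea that `prompt` resolves to a string, the `{ label, rationale }` reply shape, and the choice to score malformed output as `0`. It is not the proposed implementation.

```ts
// Local stand-ins for illustration; the real vitest-evals types may differ.
type PromptOptions = { system?: string; metadata?: Record<string, unknown> };
type HarnessLike = {
  prompt: (prompt: string, options?: PromptOptions) => Promise<string>;
};
type JudgeContextLike<TInput> = {
  inputValue: TInput;
  output: string;
  harness: HarnessLike;
};
type JudgeResultLike = {
  score: number;
  metadata: { answer?: string; rationale: string };
};

const DEFAULT_SCORES = new Map<string, number>([
  ["A", 1], ["B", 0.75], ["C", 0.5], ["D", 0.25], ["E", 0],
]);

function sketchRubricJudge<TInput>(config: {
  getCriteria: (ctx: JudgeContextLike<TInput>) => string;
}): (ctx: JudgeContextLike<TInput>) => Promise<JudgeResultLike> {
  return async (ctx) => {
    const prompt = [
      `Criteria:\n${config.getCriteria(ctx)}`,
      `Output to grade:\n${ctx.output}`,
      'Reply with JSON only: {"label": "A".."E", "rationale": "..."}',
    ].join("\n\n");

    // The harness is the only model seam; no provider SDK is imported here.
    const reply = await ctx.harness.prompt(prompt, {
      system: "You are a calibrated grader.",
      metadata: { judge: "RubricJudge" },
    });

    try {
      const parsed = JSON.parse(reply) as { label?: unknown; rationale?: unknown };
      const label = typeof parsed.label === "string" ? parsed.label : "";
      const score = DEFAULT_SCORES.get(label);
      if (score === undefined) throw new Error(`unknown label: ${label}`);
      const rationale = typeof parsed.rationale === "string" ? parsed.rationale : "";
      return { score, metadata: { answer: label, rationale } };
    } catch (error) {
      // Assumption: malformed judge output scores 0 and keeps the reason for the reporter.
      return { score: 0, metadata: { rationale: `Unparseable judge reply: ${String(error)}` } };
    }
  };
}

// A stub harness stands in for any provider, which is the point of the seam.
const stubHarness: HarnessLike = {
  prompt: async () => '{"label":"B","rationale":"Meets the contract with one minor gap."}',
};
const judge = sketchRubricJudge<{ criteria: string }>({
  getCriteria: ({ inputValue }) => inputValue.criteria,
});
```

Because the judge only sees `ctx.harness.prompt`, swapping the stub for a real harness changes nothing in the judge itself.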

## Possible API

Something like:

```ts
type RubricJudgeConfig<TInput, TMetadata, THarness> = {
  name?: string;
  getCriteria: (ctx: JudgeContext<TInput, TMetadata, THarness>) => string;
  system?: string | ((ctx: JudgeContext<TInput, TMetadata, THarness>) => string);
  prompt?: (args: {
    output: string;
    criteria: string;
    ctx: JudgeContext<TInput, TMetadata, THarness>;
  }) => string;
  scale?: "abcde" | Array<{
    label: string;
    score: number;
    description: string;
  }>;
  parser?: (text: string) => {
    label: string;
    rationale?: string;
    metadata?: Record<string, unknown>;
  };
};
```

The default could be an A-E scale with evenly spaced scores:

- A: `1`
- B: `0.75`
- C: `0.5`
- D: `0.25`
- E: `0`
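A small sketch of how the `scale` option could resolve to a label-to-score lookup, covering both the `"abcde"` shorthand and a custom array. The `ScaleEntry` name, the helper functions, and the description strings are placeholders invented for this example.

```ts
type ScaleEntry = { label: string; score: number; description: string };

// The A-E default from the list above; descriptions are placeholder wording.
const ABCDE: ScaleEntry[] = [
  { label: "A", score: 1, description: "Fully satisfies the criteria" },
  { label: "B", score: 0.75, description: "Satisfies the criteria with minor gaps" },
  { label: "C", score: 0.5, description: "Partially satisfies the criteria" },
  { label: "D", score: 0.25, description: "Mostly fails the criteria" },
  { label: "E", score: 0, description: "Fails the criteria" },
];

// Expand the "abcde" shorthand; pass custom scales through unchanged.
function resolveScale(scale: "abcde" | ScaleEntry[] = "abcde"): ScaleEntry[] {
  return scale === "abcde" ? ABCDE : scale;
}

// Returns undefined for labels that are not on the scale.
function scoreFor(scale: ScaleEntry[], label: string): number | undefined {
  return scale.find((entry) => entry.label === label)?.score;
}

// A custom pass/partial/fail scale expressed in the same shape.
const passPartialFail = resolveScale([
  { label: "pass", score: 1, description: "Meets the criteria" },
  { label: "partial", score: 0.5, description: "Meets some of the criteria" },
  { label: "fail", score: 0, description: "Does not meet the criteria" },
]);
```

With `judgeThreshold: 0.75` as in the earlier example, and assuming the threshold is an inclusive lower bound, only A and B would pass on the default scale.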

The returned metadata should include at least:

```ts
{
  answer: "A" | "B" | "C" | "D" | "E",
  rationale: string,
}
```

## Design Constraints

- Stay harness-first. Do not reintroduce global `configure(...)` or scorer-first APIs.
- Stay provider-agnostic. The only model seam should be `ctx.harness.prompt`.
- Keep it useful for automatic suite judges and explicit `toSatisfyJudge(...)`.
- Do not hide Vitest's `it(..., async ({ run }) => ...)` shape.
- Do not make deterministic judges depend on this helper.

## Open Questions

- Should the default rubric output schema be strict JSON only, or should the parser recover fenced JSON / JSON embedded in text?
- Should `RubricJudge` export its default prompt formatter so consumers can snapshot or customize it incrementally?
- Is A-E the right default scale, or should the default be pass/partial/fail with custom scales documented?
- Should `getCriteria` accept a structured rubric object directly, or should `vitest-evals` only require a final criteria string?
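For the first question, the two parser behaviors can be compared side by side. This is only a sketch of the trade-off: `parseStrict` and `parseLenient` are hypothetical names, and the recovery order (whole reply, then fenced block, then outermost brace span) is one possible choice among several.

```ts
// Built at runtime so the fence marker does not appear literally in this example.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Strict mode: the whole reply must be a JSON document, otherwise JSON.parse throws.
function parseStrict(text: string): unknown {
  return JSON.parse(text.trim());
}

// Lenient mode: try the whole reply, then a fenced block, then the outermost {...} span.
function parseLenient(text: string): unknown {
  const candidates: string[] = [text.trim()];
  const fenced = text.match(FENCED_BLOCK);
  if (fenced && fenced[1] !== undefined) candidates.push(fenced[1].trim());
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Fall through to the next candidate.
    }
  }
  throw new SyntaxError("no JSON found in judge reply");
}
```

Strict parsing surfaces prompt drift early, while lenient recovery tolerates models that wrap JSON in prose or fences at the cost of occasionally accepting a reply the prompt did not ask for.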

## Acceptance Criteria

- A consumer can replace a local custom LLM rubric judge with a small `RubricJudge({ getCriteria })` call.
- The judge uses `JudgeContext.harness.prompt` and does not require provider-specific dependencies.
- Reporter output includes score, selected answer/label, and rationale.
- Tests cover valid JSON, malformed judge output, custom scales, and explicit matcher usage.
- Docs show the helper with a custom harness and with a first-party harness package.

## Related

- #45
- #46

