Support assertions on evals #8

@dcramer

Description

While evals are primarily qualitative, one important thing we care about is how an LLM arrives at its answer.

For example, I want to test that a specific tool was called as part of an eval.

import { describeEval } from "vitest-evals";
import { Factuality, FIXTURES, TaskRunner } from "./utils";

describeEval("begin-autofix", {
  data: async () => {
    return [
      {
        input: `Can you root cause this issue in Sentry?\n${FIXTURES.autofixIssueUrl}\n\nJust kick off the process and give me the Run ID.`,
        expected: "The analysis has started.\nRun ID: 123",
        assert: () => {
          // do something here
        }
      },
      {
        input: `What's the status on root-causing this issue in Sentry?\n${FIXTURES.autofixIssueUrl}`,
        expected:
          'Batched TRPC request incorrectly passed bottle ID 3216 to `bottleById`, instead of 16720, resulting in a "Bottle not found" error.',
      },
      {
        input: `Can you root cause this issue and retrieve the analysis?\n${FIXTURES.autofixIssueUrl}`,
        expected:
          'Batched TRPC request incorrectly passed bottle ID 3216 to `bottleById`, instead of 16720, resulting in a "Bottle not found" error.',
      },
    ];
  },
  task: TaskRunner(),
  scorers: [Factuality()],
  threshold: 0.6,
  timeout: 30000,
});
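A sketch of what such an `assert` hook could verify, assuming the task runner exposes the list of tool calls it made during the run. The `TaskResult` shape and the `assertToolCalled` helper below are hypothetical, not part of the vitest-evals API:

```typescript
// Hypothetical shape of a task result that records tool invocations.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface TaskResult {
  output: string;
  toolCalls: ToolCall[];
}

// Reusable assertion helper: throws unless the named tool was invoked.
function assertToolCalled(result: TaskResult, toolName: string): void {
  const called = result.toolCalls.some((call) => call.name === toolName);
  if (!called) {
    const seen = result.toolCalls.map((c) => c.name).join(", ") || "none";
    throw new Error(
      `Expected tool "${toolName}" to be called; tools called: ${seen}`,
    );
  }
}

// Example: the "begin-autofix" case would pass this assertion.
const result: TaskResult = {
  output: "The analysis has started.\nRun ID: 123",
  toolCalls: [{ name: "begin_autofix", arguments: { issueUrl: "<url>" } }],
};
assertToolCalled(result, "begin_autofix");
```

This keeps the qualitative scorers (like `Factuality`) for judging the answer text, while the `assert` callback enforces hard, binary requirements on the trajectory.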
