Add helpers for constructing HarnessRun objects #49

@dcramer

Problem

Custom harnesses currently need to hand-assemble HarnessRun objects. For realistic app/integration harnesses, that means every consumer repeats low-level normalization code:

  • convert arbitrary app artifacts into JsonValue
  • build normalized messages
  • attach assistant output and session.outputText
  • attach tool calls to a synthetic assistant/tool message
  • preserve metadata
  • fill usage, errors, artifacts, and optional timings

The primitives exist (HarnessRun, NormalizedMessage, ToolCallRecord, toJsonValue, etc.), but the core package does not provide a higher-level construction helper. This makes custom harnesses verbose and easy to get subtly wrong, especially in the output that reporters display and the text that judges read.

Desired Shape

Add a small helper for constructing a normalized HarnessRun from common app-harness artifacts.

Example:

import { createHarnessRun } from "vitest-evals/harness";

return createHarnessRun({
  output: {
    assistant_posts: posts,
    channel_posts: channelPosts,
    reactions,
  },
  messages: posts.map((post) => ({
    role: "assistant",
    content: post.text,
    metadata: {
      channel: post.channel,
      thread_ts: post.thread_ts,
      files: post.files,
    },
  })),
  toolCalls,
  usage: {
    toolCalls: toolCalls.length,
  },
});

The helper should normalize values into the existing JsonValue contract and fill the fields judges/reporters expect.

Possible API

Minimal option:

type CreateHarnessRunOptions = {
  output?: unknown;
  outputText?: string;
  messages?: Array<{
    role: NormalizedMessage["role"];
    content?: unknown;
    toolCalls?: ToolCallRecord[];
    metadata?: Record<string, unknown>;
  }>;
  toolCalls?: ToolCallRecord[];
  usage?: UsageSummary;
  timings?: TimingSummary;
  artifacts?: Record<string, unknown>;
  errors?: Array<Record<string, unknown>>;
  metadata?: Record<string, unknown>;
};

function createHarnessRun(options: CreateHarnessRunOptions): HarnessRun;

Useful defaults:

  • output is normalized with toJsonValue.
  • session.outputText defaults to outputText, then string output, then pretty JSON for object/array output.
  • messages are normalized with normalizeContent / normalizeMetadata.
  • toolCalls can either be appended to an existing assistant/tool message or added as a synthetic assistant tool-call message when no message includes them.
  • usage.toolCalls defaults to toolCalls.length when omitted.
  • errors defaults to [].
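
A minimal sketch of how the outputText, usage.toolCalls, and errors defaults above could resolve. The types and the names resolveOutputText and applyRunDefaults are hypothetical stand-ins, not the package's API, and returning undefined for number/boolean output is an assumption the list above does not cover:

```typescript
// Hypothetical local types; the real vitest-evals types may differ.
type DefaultsInput = {
  output?: unknown; // assumed to be JSON-safe already
  outputText?: string;
  toolCalls?: Array<{ name: string }>;
  usage?: { toolCalls?: number };
  errors?: Array<Record<string, unknown>>;
};

// Precedence: explicit outputText, then string output, then pretty JSON
// for object/array output.
function resolveOutputText(input: DefaultsInput): string | undefined {
  if (input.outputText !== undefined) return input.outputText;
  if (typeof input.output === "string") return input.output;
  if (input.output !== null && typeof input.output === "object") {
    return JSON.stringify(input.output, null, 2);
  }
  return undefined; // assumption: no text for numbers, booleans, or missing output
}

function applyRunDefaults(input: DefaultsInput) {
  const toolCalls = input.toolCalls ?? [];
  return {
    outputText: resolveOutputText(input),
    toolCalls,
    // An explicit usage.toolCalls wins; otherwise count the calls.
    usage: {
      ...input.usage,
      toolCalls: input.usage?.toolCalls ?? toolCalls.length,
    },
    errors: input.errors ?? [],
  };
}
```

Checking against undefined (and using ??) rather than truthiness means an explicit empty string or a zero count from the caller is kept instead of being treated as "omitted".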

Potential convenience helpers:

assistantMessage(content, metadata?)
userMessage(content, metadata?)
toolCall(name, args?, result?)

Those should only be added if they keep the API smaller in practice; a single createHarnessRun may be enough.
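
For a sense of scale, the three helpers could be as small as the sketch below. The message and tool-call shapes here are assumptions (in particular the arguments and result field names), so treat this as an illustration of size rather than the real ToolCallRecord or NormalizedMessage contract:

```typescript
// Hypothetical shapes for illustration only.
type SketchMessage = {
  role: "user" | "assistant";
  content: unknown;
  metadata?: Record<string, unknown>;
};

type SketchToolCall = {
  name: string;
  arguments?: unknown; // assumed field name
  result?: unknown; // assumed field name
};

function message(
  role: SketchMessage["role"],
  content: unknown,
  metadata?: Record<string, unknown>,
): SketchMessage {
  // Leave the metadata key off entirely when the caller did not pass one.
  return metadata === undefined ? { role, content } : { role, content, metadata };
}

function assistantMessage(content: unknown, metadata?: Record<string, unknown>): SketchMessage {
  return message("assistant", content, metadata);
}

function userMessage(content: unknown, metadata?: Record<string, unknown>): SketchMessage {
  return message("user", content, metadata);
}

function toolCall(name: string, args?: unknown, result?: unknown): SketchToolCall {
  const call: SketchToolCall = { name };
  if (args !== undefined) call.arguments = args;
  if (result !== undefined) call.result = result;
  return call;
}
```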

Design Constraints

  • Do not make app harnesses opaque. The helper should build a standard HarnessRun, not introduce a new abstraction layer around Harness.run.
  • Preserve explicit consumer values. If callers pass outputText (which populates session.outputText), usage, errors, or artifacts, the helper should not silently replace them.
  • Keep normalization predictable and JSON-safe.
  • Avoid app-specific concepts like Slack posts, web requests, or agent decisions.
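
To make "predictable and JSON-safe" concrete, here is one possible set of rules as a sketch. toJsonSafe is a hypothetical stand-in written for this issue; it is not the existing toJsonValue, whose actual rules may differ (for example in how it treats dates, errors, or cycles):

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical normalizer: every input maps to exactly one JSON value.
function toJsonSafe(value: unknown, ancestors: object[] = []): Json {
  if (value === null || value === undefined) return null;
  if (typeof value === "string" || typeof value === "boolean") return value;
  if (typeof value === "number") return isFinite(value) ? value : null; // NaN/Infinity -> null
  if (typeof value === "bigint") return String(value);
  if (typeof value !== "object") return null; // functions and symbols

  const obj = value as object;
  if (obj instanceof Date) return isNaN(obj.getTime()) ? null : obj.toISOString();
  if (obj instanceof Error) return { name: obj.name, message: obj.message };

  // Only the current ancestor chain is tracked, so a value that is merely
  // shared between two branches is not reported as circular.
  if (ancestors.indexOf(obj) !== -1) return "[Circular]";
  const path = [...ancestors, obj];

  if (Array.isArray(obj)) return obj.map((item) => toJsonSafe(item, path));

  const source = obj as Record<string, unknown>;
  const result: { [key: string]: Json } = {};
  for (const key of Object.keys(source)) {
    // Undefined properties are dropped, matching JSON.stringify.
    if (source[key] !== undefined) result[key] = toJsonSafe(source[key], path);
  }
  return result;
}
```

Whatever rules the real helper uses, the point of the constraint is that callers can predict the result for awkward inputs without reading the implementation.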

Why This Helps

  • Reduces boilerplate in custom harnesses that are not covered by first-party runtime adapters.
  • Makes reporter-facing and judge-facing output more consistent.
  • Lowers the chance that custom harnesses omit errors, forget outputText, or attach tool calls in a shape built-in judges do not read.
  • Gives docs a canonical way to teach custom harness authoring.

Acceptance Criteria

  • A custom app harness can build a correct HarnessRun with one helper call.
  • The helper covers output, messages, tool calls, metadata, usage, artifacts, timings, and errors.
  • Tests cover non-JSON values, omitted fields/defaults, tool-call attachment, and explicit override behavior.
  • Existing lower-level helpers remain available for advanced harnesses.
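
The tool-call attachment criterion is the easiest one to get subtly wrong, so here is a sketch of one of the two behaviors described under Useful defaults: adding a synthetic assistant message only when no message already carries tool calls. attachToolCalls and its types are hypothetical, and the alternative (appending to an existing assistant/tool message) is not shown:

```typescript
// Hypothetical shapes; the real NormalizedMessage/ToolCallRecord may differ.
type CallSketch = { name: string };
type MsgSketch = {
  role: "user" | "assistant" | "tool";
  content?: unknown;
  toolCalls?: CallSketch[];
};

function attachToolCalls(messages: MsgSketch[], toolCalls: CallSketch[]): MsgSketch[] {
  // Nothing to attach.
  if (toolCalls.length === 0) return messages;

  // If the caller already placed tool calls on a message, respect that
  // and do not duplicate them.
  const alreadyAttached = messages.some((m) => (m.toolCalls?.length ?? 0) > 0);
  if (alreadyAttached) return messages;

  // Otherwise add one synthetic assistant message that carries the calls.
  // The input array is not mutated.
  return [...messages, { role: "assistant", toolCalls }];
}
```

Tests for the criteria above would then cover three cases: no tool calls (messages unchanged), tool calls already attached by the caller (unchanged), and top-level tool calls only (one synthetic message appended).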
