Problem
Custom harnesses currently need to hand-assemble HarnessRun objects. For realistic app/integration harnesses, that means every consumer repeats low-level normalization code:
- convert arbitrary app artifacts into JsonValue
- build normalized messages
- attach assistant output and session.outputText
- attach tool calls to a synthetic assistant/tool message
- preserve metadata
- fill usage, errors, artifacts, and optional timings
The primitives exist (HarnessRun, NormalizedMessage, ToolCallRecord, toJsonValue, etc.), but the core package does not provide a higher-level construction helper. This makes custom harnesses verbose and easier to get subtly wrong, especially around reporter-facing output and judge-facing text.
Desired Shape
Add a small helper for constructing a normalized HarnessRun from common app-harness artifacts.
Example:
    import { createHarnessRun } from "vitest-evals/harness";

    return createHarnessRun({
      output: {
        assistant_posts: posts,
        channel_posts: channelPosts,
        reactions,
      },
      messages: posts.map((post) => ({
        role: "assistant",
        content: post.text,
        metadata: {
          channel: post.channel,
          thread_ts: post.thread_ts,
          files: post.files,
        },
      })),
      toolCalls,
      usage: {
        toolCalls: toolCalls.length,
      },
    });
The helper should normalize values into the existing JsonValue contract and fill the fields judges/reporters expect.
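To make "normalize values into the JsonValue contract" concrete, here is a minimal sketch of JSON-safe normalization. `toJsonSafe` is a hypothetical stand-in, not the package's `toJsonValue`, and the local `JsonValue` type is an assumed shape. The real helper may treat dates, errors, or cycles differently.

```typescript
// Assumed shape of the JSON contract; the package's own JsonValue may differ.
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

// Hypothetical normalizer: turns arbitrary app artifacts into JSON-safe data.
function toJsonSafe(
  value: unknown,
  seen: WeakSet<object> = new WeakSet<object>(),
): JsonValue {
  if (value === null || value === undefined) return null;
  if (typeof value === "string" || typeof value === "boolean") return value;
  // NaN and Infinity have no JSON representation.
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "bigint") return String(value);
  if (value instanceof Date) {
    return Number.isNaN(value.getTime()) ? null : value.toISOString();
  }
  if (value instanceof Error) {
    return { name: value.name, message: value.message };
  }
  // Functions and symbols carry no data worth reporting.
  if (typeof value !== "object") return null;

  const obj = value as object;
  // Only objects on the current path count as cycles; shared references are fine.
  if (seen.has(obj)) return "[Circular]";
  seen.add(obj);

  let result: JsonValue;
  if (Array.isArray(obj)) {
    result = obj.map((item) => toJsonSafe(item, seen));
  } else if (obj instanceof Set) {
    const items: JsonValue[] = [];
    obj.forEach((item) => items.push(toJsonSafe(item, seen)));
    result = items;
  } else {
    const out: { [key: string]: JsonValue } = {};
    if (obj instanceof Map) {
      obj.forEach((item, key) => {
        out[String(key)] = toJsonSafe(item, seen);
      });
    } else {
      for (const key of Object.keys(obj)) {
        const item = (obj as Record<string, unknown>)[key];
        // Undefined properties are dropped, as JSON.stringify would do.
        if (item !== undefined) out[key] = toJsonSafe(item, seen);
      }
    }
    result = out;
  }

  seen.delete(obj);
  return result;
}
```

Whatever the exact rules end up being, the property consumers rely on is that the result always survives `JSON.stringify` without throwing.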
Possible API
Minimal option:
    type CreateHarnessRunOptions = {
      output?: unknown;
      outputText?: string;
      messages?: Array<{
        role: NormalizedMessage["role"];
        content?: unknown;
        toolCalls?: ToolCallRecord[];
        metadata?: Record<string, unknown>;
      }>;
      toolCalls?: ToolCallRecord[];
      usage?: UsageSummary;
      timings?: TimingSummary;
      artifacts?: Record<string, unknown>;
      errors?: Array<Record<string, unknown>>;
      metadata?: Record<string, unknown>;
    };

    function createHarnessRun(options: CreateHarnessRunOptions): HarnessRun;
Useful defaults:
- output is normalized with toJsonValue.
- session.outputText defaults to outputText, then string output, then pretty JSON for object/array output.
- messages are normalized with normalizeContent / normalizeMetadata.
- toolCalls can either be appended to an existing assistant/tool message or added as a synthetic assistant tool-call message when no message includes them.
- usage.toolCalls defaults to toolCalls.length when omitted.
- errors defaults to [].
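Two of these defaults are easy to get subtly wrong, so a rough sketch may help. The types below are simplified, hypothetical stand-ins for NormalizedMessage and ToolCallRecord, and the function names are illustrative only. For tool calls, the sketch picks the "synthetic assistant message" option; appending to an existing message would be an equally valid reading of the proposal.

```typescript
// Simplified, assumed shapes; the package's real types carry more fields.
type SketchToolCall = { name: string; args?: unknown; result?: unknown };
type SketchMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content?: unknown;
  toolCalls?: SketchToolCall[];
};

// outputText fallback chain: explicit text, then string output,
// then pretty-printed JSON for object/array output.
function resolveOutputText(
  output: unknown,
  outputText?: string,
): string | undefined {
  if (outputText !== undefined) return outputText;
  if (typeof output === "string") return output;
  if (output !== null && typeof output === "object") {
    return JSON.stringify(output, null, 2);
  }
  // Other output kinds are not covered by the proposal; left undefined here.
  return undefined;
}

// Tool-call attachment: if any message already carries tool calls, leave the
// transcript untouched; otherwise append one synthetic assistant message.
function attachToolCalls(
  messages: SketchMessage[],
  toolCalls: SketchToolCall[],
): SketchMessage[] {
  if (toolCalls.length === 0) return messages;
  const alreadyAttached = messages.some(
    (message) => (message.toolCalls?.length ?? 0) > 0,
  );
  if (alreadyAttached) return messages;
  return [...messages, { role: "assistant", toolCalls }];
}
```

Checking for existing tool calls before adding the synthetic message keeps the helper from reporting each call twice when the caller has already attached them.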
Potential convenience helpers:
- assistantMessage(content, metadata?)
- userMessage(content, metadata?)
- toolCall(name, args?, result?)
Those should only be added if they keep the API smaller in practice; a single createHarnessRun may be enough.
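For a sense of scale, the helpers named above could be as thin as the sketch below. The signatures come from the list; the return shapes (`role`, `content`, `metadata`, and a tool-call record with `name`, `args`, `result`) are assumptions made for illustration and may not match NormalizedMessage or ToolCallRecord.

```typescript
// Assumed, simplified shapes for illustration only.
type HelperMessage = {
  role: "user" | "assistant";
  content: unknown;
  metadata?: Record<string, unknown>;
};
type HelperToolCall = { name: string; args?: unknown; result?: unknown };

// Shared builder that omits the metadata key entirely when none is given.
function buildMessage(
  role: HelperMessage["role"],
  content: unknown,
  metadata?: Record<string, unknown>,
): HelperMessage {
  return metadata === undefined ? { role, content } : { role, content, metadata };
}

function assistantMessage(
  content: unknown,
  metadata?: Record<string, unknown>,
): HelperMessage {
  return buildMessage("assistant", content, metadata);
}

function userMessage(
  content: unknown,
  metadata?: Record<string, unknown>,
): HelperMessage {
  return buildMessage("user", content, metadata);
}

// Only sets args/result when provided, so omitted values stay absent.
function toolCall(name: string, args?: unknown, result?: unknown): HelperToolCall {
  const call: HelperToolCall = { name };
  if (args !== undefined) call.args = args;
  if (result !== undefined) call.result = result;
  return call;
}
```

Each helper saves little over an object literal, which is consistent with the caution that a single createHarnessRun may be enough.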
Design Constraints
- Do not make app harnesses opaque. The helper should build a standard HarnessRun, not introduce a new abstraction layer around Harness.run.
- Preserve explicit consumer values. If callers pass session.outputText, usage, errors, or artifacts, the helper should not silently replace them.
- Keep normalization predictable and JSON-safe.
- Avoid app-specific concepts like Slack posts, web requests, or agent decisions.
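The "preserve explicit consumer values" constraint mostly comes down to how defaults are applied. A minimal sketch, assuming a simplified usage shape (`UsageSketch` and both function names are hypothetical): nullish coalescing fills a field only when the caller left it out, so an explicit `0` or an explicit empty array is kept.

```typescript
// Assumed, simplified stand-in for the package's UsageSummary.
type UsageSketch = {
  toolCalls?: number;
  inputTokens?: number;
  outputTokens?: number;
};

// Fill usage.toolCalls only when the caller did not provide it.
// `??` (rather than `||`) keeps an explicit 0 intact.
function resolveUsage(
  explicit: UsageSketch | undefined,
  toolCallCount: number,
): UsageSketch {
  return { ...explicit, toolCalls: explicit?.toolCalls ?? toolCallCount };
}

// Default errors to [] without replacing a caller-supplied array.
function resolveErrors(
  explicit?: Array<Record<string, unknown>>,
): Array<Record<string, unknown>> {
  return explicit ?? [];
}
```

Using `||` here would silently turn a deliberate `toolCalls: 0` into the derived count, which is exactly the kind of replacement the constraint rules out.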
Why This Helps
- Reduces boilerplate in custom harnesses that are not covered by first-party runtime adapters.
- Makes reporter-facing and judge-facing output more consistent.
- Lowers the chance that custom harnesses omit errors, forget outputText, or attach tool calls in a shape built-in judges do not read.
- Gives docs a canonical way to teach custom harness authoring.
Acceptance Criteria
- A custom app harness can build a correct HarnessRun with one helper call.
- The helper covers output, messages, tool calls, metadata, usage, artifacts, timings, and errors.
- Tests cover non-JSON values, omitted fields/defaults, tool-call attachment, and explicit override behavior.
- Existing lower-level helpers remain available for advanced harnesses.