Simplify multimodal evals around transcripts by dcramer · Pull Request #36 · getsentry/vitest-evals

dcramer · 2026-03-17T18:19:14Z

This replaces the public message-chain eval surface with a transcript-first
contract. `TaskFn`, `describeEval()`, and custom scorers can now accept and
return structured transcript messages for multimodal cases, while plain string
inputs and outputs still work as the shorthand for text-only tests. Tool usage
also stays separate as explicit `toolCalls` metadata instead of being mixed
into the user-visible conversation.
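
As a rough illustration of the contract described above, the sketch below accepts either a plain string or a structured transcript and normalizes both into one message shape, with tool calls kept alongside the transcript instead of inside it. The type names (`TranscriptPart`, `TranscriptMessage`, `ToolCall`, `TaskResult`) and the `normalizeInput` helper are hypothetical and chosen for illustration; the actual exports in vitest-evals may differ.

```typescript
// Hypothetical shapes for illustration only; not the library's actual types.
type TranscriptPart =
  | { type: "text"; text: string }
  | { type: "image"; url: string };

interface TranscriptMessage {
  role: "user" | "assistant";
  parts: TranscriptPart[];
}

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Assumption: tool usage travels next to the transcript, not inside it.
interface TaskResult {
  transcript: TranscriptMessage[];
  toolCalls?: ToolCall[];
}

// Plain strings remain the shorthand for a single text-only user message.
function normalizeInput(input: string | TranscriptMessage[]): TranscriptMessage[] {
  if (typeof input === "string") {
    return [{ role: "user", parts: [{ type: "text", text: input }] }];
  }
  return input;
}

const fromString = normalizeInput("What is in this image?");

const fromTranscript = normalizeInput([
  {
    role: "user",
    parts: [
      { type: "text", text: "What is in this image?" },
      { type: "image", url: "https://example.com/cat.png" },
    ],
  },
]);

// A task result that keeps the tool call out of the visible conversation.
const exampleResult: TaskResult = {
  transcript: [
    ...fromTranscript,
    { role: "assistant", parts: [{ type: "text", text: "A cat on a sofa." }] },
  ],
  toolCalls: [{ name: "describeImage", arguments: { detail: "low" } }],
};
```

Both call styles end up as the same array of messages, which is what lets the string form stay a thin compatibility layer.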

The earlier multimodal support worked, but it still centered the API around
derived input and output strings plus a result field. Moving the public
surface to transcripts makes the data model line up with what users actually
see, gives scorers the full conversation when they need it, and keeps the
string helpers as a compatibility layer rather than the primary abstraction.

`evaluate()` now judges the transcript directly, including multimodal parts,
instead of flattening everything into a single text prompt. That keeps the
judge focused on the user-facing exchange and avoids accidentally scoring tool
metadata or other internal scaffolding as if it were assistant output.
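
To make that separation concrete, here is a minimal sketch of how judge input might be built from the transcript alone. `buildJudgeInput` and the `JudgedResult` shape are assumptions made for illustration and are not taken from the actual implementation of evaluate().

```typescript
// Hypothetical shapes; the real types in vitest-evals may differ.
type JudgePart =
  | { type: "text"; text: string }
  | { type: "image"; url: string };

type JudgeMessage = { role: "user" | "assistant"; parts: JudgePart[] };

type JudgedResult = {
  transcript: JudgeMessage[];
  toolCalls?: { name: string }[];
};

// Pass messages through with their parts intact and leave toolCalls out,
// so the judge only sees the user-facing exchange.
function buildJudgeInput(result: JudgedResult): JudgeMessage[] {
  return result.transcript.map((message) => ({
    role: message.role,
    parts: [...message.parts],
  }));
}

const judgeInput = buildJudgeInput({
  transcript: [
    { role: "user", parts: [{ type: "image", url: "https://example.com/chart.png" }] },
    { role: "assistant", parts: [{ type: "text", text: "A bar chart of weekly sales." }] },
  ],
  toolCalls: [{ name: "fetchImage" }],
});
```

The image part survives as a structured part instead of being flattened to text, and the `fetchImage` call never reaches the judge.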

I considered continuing to layer multimodal support onto the existing
result-oriented API, but that kept leaking normalization details into the
public surface. Pulling the transcript validation, normalization, and debug
formatting into `messages.ts` gives us one place to keep scorer payloads,
judge input, and failure output aligned.

github-actions · 2026-03-17T18:19:29Z

Semver Impact of This PR

⚪ None (no version bump detected)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).

New Features ✨

(evaluate) Add evaluate() for single-scenario LLM-judged evals by dcramer in #31

Bug Fixes 🐛

(release) Remove github target from Craft config by dcramer in #33
Bump version to 0.6.0 by dcramer in #34

Documentation 📚

Add StructuredOutputScorer documentation to README by dcramer in #24

Internal Changes 🔧

Release

Fix changelog-preview permissions by BYK in #30
Replace free-text version input with bump type selector by dcramer in #32
Bump Craft version to fix issues by BYK in #28
Switch from action-prepare-release to Craft by BYK in #27

Other

Widen ai and zod peer dependency ranges by dcramer in #35
Use pull_request_target for changelog preview by BYK in #29

Other

Simplify multimodal evals around transcripts by dcramer in #36

🤖 This preview updates automatically when you update the PR.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

cursor · 2026-03-17T19:00:15Z

+        Array.isArray(message.parts),
+    )
+  );
+}


`isTranscript` matches empty arrays causing wrong formatting

Low Severity

`isTranscript` returns `true` for empty arrays because `[].every(...)` is vacuously true. This causes `formatEvalValue([])` to produce `"(empty transcript)"` instead of `"[]"`. Since `formatEvalValue` is used in `formatScores` for arbitrary `score.metadata.output` values, any scorer returning an empty array as output metadata gets it misformatted as a transcript.

Additional Locations (1)

src/messages.ts#L269-L280
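
The vacuous-truth behavior reported above can be reproduced in isolation. The sketch below is not the code from `src/messages.ts`: the message shape, `isMessage`, and both guard variants are hypothetical, and adding a length check is only one possible fix.

```typescript
// Hypothetical message shape; the real guard may check more fields.
type GuardedMessage = { role: string; parts: unknown[] };

function isMessage(value: unknown): value is GuardedMessage {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { role?: unknown }).role === "string" &&
    Array.isArray((value as { parts?: unknown }).parts)
  );
}

// Mirrors the reported behavior: every() on an empty array returns true.
function isTranscriptLoose(value: unknown): value is GuardedMessage[] {
  return Array.isArray(value) && value.every(isMessage);
}

// One possible fix: require at least one message before matching.
function isTranscriptStrict(value: unknown): value is GuardedMessage[] {
  return Array.isArray(value) && value.length > 0 && value.every(isMessage);
}

const looseOnEmpty = isTranscriptLoose([]);
const strictOnEmpty = isTranscriptStrict([]);
const strictOnMessage = isTranscriptStrict([{ role: "user", parts: [] }]);
```

With the length check, a bare `[]` no longer matches the guard. The trade-off is that a genuinely empty transcript would no longer be detected by the guard alone, so callers that rely on the `"(empty transcript)"` output would need to handle that case explicitly.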

Add multimodal message-chain eval support

48a7957

cursor Bot reviewed Mar 17, 2026

Comment thread src/evaluate/index.ts

Simplify multimodal evals around transcripts

8a8e78b

dcramer changed the title ~~Add multimodal message-chain support to evals~~ Simplify multimodal evals around transcripts Mar 17, 2026

cursor Bot reviewed Mar 17, 2026

dcramer wants to merge 2 commits into main from multimodal-message-chain-evals
