Cursor Bugbot has reviewed your changes and found 1 potential issue.
          Array.isArray(message.parts),
        )
      );
    }
`isTranscript` matches empty arrays, causing wrong formatting
Low Severity
`isTranscript` returns `true` for empty arrays because `[].every(...)` is vacuously true. This causes `formatEvalValue([])` to produce `"(empty transcript)"` instead of `"[]"`. Since `formatEvalValue` is used in `formatScores` for arbitrary `score.metadata.output` values, any scorer returning an empty array as output metadata gets it misformatted as a transcript.
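
To make the report above concrete, here is a minimal sketch of the vacuous-truth problem and one possible guard. The message shape (`role` plus `parts`) and the helper names `isTranscriptLoose`, `isTranscriptStrict`, and `formatValue` are assumptions for illustration; only the `Array.isArray(message.parts)` check is visible in the reviewed diff, and the real `isTranscript` and `formatEvalValue` may differ.

```typescript
// Assumed message shape; only `parts` being an array is visible in the diff.
type TranscriptMessage = { role: string; parts: unknown[] };

function isTranscriptMessage(value: unknown): value is TranscriptMessage {
  if (typeof value !== "object" || value === null) return false;
  const message = value as { role?: unknown; parts?: unknown };
  return typeof message.role === "string" && Array.isArray(message.parts);
}

// Mirrors the reported behavior: every() on an empty array is vacuously true,
// so [] is classified as a transcript.
function isTranscriptLoose(value: unknown): value is TranscriptMessage[] {
  return Array.isArray(value) && value.every(isTranscriptMessage);
}

// One possible fix: require at least one message before checking each element.
function isTranscriptStrict(value: unknown): value is TranscriptMessage[] {
  return (
    Array.isArray(value) &&
    value.length > 0 &&
    value.every(isTranscriptMessage)
  );
}

// Hypothetical stand-in for formatEvalValue, using the strict check.
function formatValue(value: unknown): string {
  if (isTranscriptStrict(value)) {
    return value.map((m) => `${m.role}: ${m.parts.length} part(s)`).join("\n");
  }
  return JSON.stringify(value);
}
```

With the strict check, an empty array falls through to the generic JSON path. The trade-off is that a genuinely empty transcript is no longer recognized as one, so a caller that needs the `"(empty transcript)"` label would have to pass that intent explicitly.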


This replaces the public message-chain eval surface with a transcript-first
contract.

`TaskFn`, `describeEval()`, and custom scorers can now accept and return
structured transcript messages for multimodal cases, while plain string
inputs and outputs still work as the shorthand for text-only tests. Tool usage
also stays separate as explicit `toolCalls` metadata instead of being mixed
into the user-visible conversation.
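
As a rough illustration of the shorthand described above, the sketch below normalizes either a plain string or a structured transcript into one shape while keeping tool calls in a separate field. All type and function names here (`Part`, `Message`, `ToolCall`, `toTranscript`, `normalizeOutput`) and the part layout are hypothetical; the actual types exported by this PR may look different.

```typescript
// Hypothetical part and message shapes for a multimodal transcript.
type Part = { type: "text"; text: string } | { type: "image"; url: string };
type Role = "user" | "assistant";
type Message = { role: Role; parts: Part[] };

// Tool usage is carried next to the transcript, never inside it.
type ToolCall = { name: string; arguments: Record<string, unknown> };
type TaskOutput = string | { transcript: Message[]; toolCalls?: ToolCall[] };
type NormalizedOutput = { transcript: Message[]; toolCalls: ToolCall[] };

// A plain string is shorthand for a single text-only message.
function toTranscript(value: string | Message[], role: Role): Message[] {
  if (typeof value === "string") {
    return [{ role, parts: [{ type: "text", text: value }] }];
  }
  return value;
}

// Normalize a task's return value without merging tool calls into messages.
function normalizeOutput(output: TaskOutput): NormalizedOutput {
  if (typeof output === "string") {
    return { transcript: toTranscript(output, "assistant"), toolCalls: [] };
  }
  return {
    transcript: output.transcript,
    toolCalls: output.toolCalls ?? [],
  };
}
```

Under these assumptions, a text-only test can keep returning `"Paris"` from its task, while a multimodal test returns a `transcript` array, and both reach scorers in the same normalized form.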

The earlier multimodal support worked, but it still centered the API around
derived `input` and `output` strings plus a `result` field. Moving the public
surface to transcripts makes the data model line up with what users actually
see, gives scorers the full conversation when they need it, and keeps the
string helpers as a compatibility layer rather than the primary abstraction.

`evaluate()` now judges the transcript directly, including multimodal parts,
instead of flattening everything into a single text prompt. That keeps the
judge focused on the user-facing exchange and avoids accidentally scoring tool
metadata or other internal scaffolding as if it were assistant output.
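
The difference between flattening and judging the transcript directly can be sketched as follows. This is not the implementation of `evaluate()`; the shapes and the helpers `flattenToText` and `buildJudgeMessages` are hypothetical and only illustrate why a text-only prompt loses information that a transcript payload keeps.

```typescript
// Hypothetical shapes, assumed for this sketch only.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; url: string };
type Part = TextPart | ImagePart;
type Message = { role: "user" | "assistant"; parts: Part[] };
type TaskResult = {
  transcript: Message[];
  toolCalls?: { name: string; arguments: unknown }[];
};

// Flattening approach, for contrast: non-text parts are silently dropped.
function flattenToText(transcript: Message[]): string {
  return transcript
    .map((m) => {
      const text = m.parts
        .filter((p): p is TextPart => p.type === "text")
        .map((p) => p.text)
        .join(" ");
      return `${m.role}: ${text}`;
    })
    .join("\n");
}

// Transcript-first approach: pass messages through with every part intact,
// and leave toolCalls out of the judge payload entirely.
function buildJudgeMessages(result: TaskResult): Message[] {
  return result.transcript.map((m) => ({
    role: m.role,
    parts: m.parts.map((p) => ({ ...p })),
  }));
}
```

In this sketch the judge payload is built only from `transcript`, so tool metadata can never be scored as assistant output, and image parts survive for a judge model that can read them.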

I considered continuing to layer multimodal support onto the existing
`result`-oriented API, but that kept leaking normalization details into the
public surface. Pulling the transcript validation, normalization, and debug
formatting into `messages.ts` gives us one place to keep scorer payloads,
judge input, and failure output aligned.