
Simplify multimodal evals around transcripts#36

Draft
dcramer wants to merge 2 commits into main from multimodal-message-chain-evals


@dcramer
Member

@dcramer dcramer commented Mar 17, 2026

This replaces the public message-chain eval surface with a transcript-first
contract. TaskFn, describeEval(), and custom scorers can now accept and
return structured transcript messages for multimodal cases, while plain string
inputs and outputs still work as the shorthand for text-only tests. Tool usage
also stays separate as explicit toolCalls metadata instead of being mixed
into the user-visible conversation.
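
A minimal sketch of what a transcript-first contract could look like, assuming hypothetical type names (`TranscriptMessage`, `MessagePart`, `ToolCall`, `TaskResult`) and a hypothetical `toTranscript` helper. None of these are taken from the PR's code; they only illustrate how a plain string can serve as shorthand for a one-message transcript while tool calls stay in a separate field.

```typescript
// Hypothetical shapes for illustration only; the real types in the PR may differ.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; url: string };
type MessagePart = TextPart | ImagePart;

interface TranscriptMessage {
  role: "user" | "assistant";
  parts: MessagePart[];
}

type Transcript = TranscriptMessage[];

// Tool usage travels beside the transcript, not inside it.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface TaskResult {
  output: string | Transcript;
  toolCalls?: ToolCall[];
}

// Expand the text-only shorthand into a one-message transcript.
function toTranscript(
  value: string | Transcript,
  role: "user" | "assistant",
): Transcript {
  if (typeof value === "string") {
    return [{ role, parts: [{ type: "text", text: value }] }];
  }
  return value;
}

const shorthand = toTranscript("What is in this image?", "user");

const multimodal = toTranscript(
  [
    {
      role: "user",
      parts: [
        { type: "text", text: "Describe this" },
        { type: "image", url: "https://example.com/cat.png" },
      ],
    },
  ],
  "user",
);

const result: TaskResult = {
  output: "A cat on a sofa.",
  toolCalls: [{ name: "describeImage", arguments: { detail: "low" } }],
};
```

Under these assumptions, text-only tests keep passing strings, and only multimodal cases need to build the structured form.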

The earlier multimodal support worked, but it still centered the API around
derived input and output strings plus a result field. Moving the public
surface to transcripts makes the data model line up with what users actually
see, gives scorers the full conversation when they need it, and keeps the
string helpers as a compatibility layer rather than the primary abstraction.

evaluate() now judges the transcript directly, including multimodal parts,
instead of flattening everything into a single text prompt. That keeps the
judge focused on the user-facing exchange and avoids accidentally scoring tool
metadata or other internal scaffolding as if it were assistant output.
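
To illustrate the difference from flattening, here is a rough sketch of building judge input from the transcript alone. `EvalRun`, `JudgeMessage`, and `buildJudgeInput` are hypothetical names invented for this example and do not reflect how `evaluate()` is actually implemented.

```typescript
// Hypothetical types and helper; not the PR's actual implementation.
type JudgePart =
  | { type: "text"; text: string }
  | { type: "image"; url: string };

interface RunMessage {
  role: "user" | "assistant";
  parts: JudgePart[];
}

interface JudgeMessage {
  role: "user" | "assistant";
  content: JudgePart[];
}

interface EvalRun {
  transcript: RunMessage[];
  toolCalls: { name: string; arguments: Record<string, unknown> }[];
}

// Map transcript messages one-to-one, keeping multimodal parts intact.
// toolCalls is deliberately never read, so it cannot leak into the judge input.
function buildJudgeInput(run: EvalRun): JudgeMessage[] {
  return run.transcript.map((message) => ({
    role: message.role,
    content: [...message.parts],
  }));
}

const run: EvalRun = {
  transcript: [
    {
      role: "user",
      parts: [
        { type: "text", text: "What is the weather in this photo?" },
        { type: "image", url: "https://example.com/sky.png" },
      ],
    },
    { role: "assistant", parts: [{ type: "text", text: "It looks overcast." }] },
  ],
  toolCalls: [{ name: "lookupWeather", arguments: { city: "Paris" } }],
};

const judgeInput = buildJudgeInput(run);
```

In this sketch the image part reaches the judge as an image rather than as a stringified placeholder, and the `lookupWeather` call is absent from what the judge sees.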

I considered continuing to layer multimodal support onto the existing
result-oriented API, but that kept leaking normalization details into the
public surface. Pulling the transcript validation, normalization, and debug
formatting into messages.ts gives us one place to keep scorer payloads,
judge input, and failure output aligned.

@github-actions

github-actions Bot commented Mar 17, 2026

Semver Impact of This PR

None (no version bump detected)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

  • (evaluate) Add evaluate() for single-scenario LLM-judged evals by dcramer in #31

Bug Fixes 🐛

  • (release) Remove github target from Craft config by dcramer in #33
  • Bump version to 0.6.0 by dcramer in #34

Documentation 📚

  • Add StructuredOutputScorer documentation to README by dcramer in #24

Internal Changes 🔧

Release

  • Fix changelog-preview permissions by BYK in #30
  • Replace free-text version input with bump type selector by dcramer in #32
  • Bump Craft version to fix issues by BYK in #28
  • Switch from action-prepare-release to Craft by BYK in #27

Other

  • Widen ai and zod peer dependency ranges by dcramer in #35
  • Use pull_request_target for changelog preview by BYK in #29

Other

  • Simplify multimodal evals around transcripts by dcramer in #36

🤖 This preview updates automatically when you update the PR.

Comment thread src/evaluate/index.ts
@dcramer dcramer changed the title from "Add multimodal message-chain support to evals" to "Simplify multimodal evals around transcripts" on Mar 17, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread src/messages.ts
Array.isArray(message.parts),
)
);
}

isTranscript matches empty arrays causing wrong formatting

Low Severity

isTranscript returns true for empty arrays because [].every(...) is vacuously true. This causes formatEvalValue([]) to produce "(empty transcript)" instead of "[]". Since formatEvalValue is used in formatScores for arbitrary score.metadata.output values, any scorer returning an empty array as output metadata gets it misformatted as a transcript.
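
One possible fix, sketched under assumptions: the functions below are simplified stand-ins that reuse the names `isTranscript` and `formatEvalValue` from the report, but their bodies are hypothetical, since only the `Array.isArray(message.parts)` check is visible in the review snippet. The sketch adds a length guard so an empty array falls through to generic formatting.

```typescript
// Simplified stand-ins for the functions named in the report; bodies are assumed.
type LooseMessage = { role: string; parts: unknown[] };

function looksLikeMessage(value: unknown): value is LooseMessage {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate.role === "string" && Array.isArray(candidate.parts);
}

// Without the length check, [].every(...) is vacuously true and an empty
// array would be classified as a transcript.
function isTranscript(value: unknown): value is LooseMessage[] {
  return (
    Array.isArray(value) && value.length > 0 && value.every(looksLikeMessage)
  );
}

function formatEvalValue(value: unknown): string {
  if (isTranscript(value)) {
    return value
      .map((message) => `${message.role}: ${message.parts.length} part(s)`)
      .join("\n");
  }
  // Arbitrary scorer metadata, including [], gets generic JSON formatting.
  return JSON.stringify(value);
}

const emptyFormatted = formatEvalValue([]);
const transcriptFormatted = formatEvalValue([{ role: "user", parts: ["hi"] }]);
```

The trade-off of this approach is that a genuinely empty transcript is no longer labeled as one. If that distinction matters, the call sites that know they hold a transcript could format it explicitly instead of relying on detection.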

