
Change the output interface of evaluate #8003


Open

TomeHirata wants to merge 10 commits into main from feat/evaluate-response

Conversation

Collaborator

@TomeHirata commented Mar 24, 2025

In this PR, we change the output interface of Evaluate.__call__.
Instead of returning score, (score, outputs), or (score, scores, outputs) depending on the arguments, it now always returns a dspy.Prediction containing the following fields:

  • score: A float percentage score (e.g., 67.30) representing overall performance
  • all_outputs: A list of (example, prediction, score) tuples, one for each example in the devset

Since this is a breaking change, it should go out in the next minor release rather than a patch release.
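
A minimal usage sketch of the new interface, assuming the field names listed above (the devset, metric, and program names below are placeholders):

    import dspy

    evaluator = dspy.Evaluate(devset=devset, metric=my_metric, num_threads=4)
    result = evaluator(my_program)  # now always a dspy.Prediction

    print(result.score)  # overall percentage score, e.g. 67.30
    for example, prediction, score in result.all_outputs:
        print(score)  # per-example score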

Collaborator

@chenmoneygithub left a comment


Solid work! LGTM with one minor comment.

Let's talk offline about the potential breakage and align on the release schedule.

if isinstance(other, (float, int)):
    return self.__float__() == other
elif isinstance(other, Prediction):
    return self.__float__() == float(other)
Collaborator


nit: shall we do float(self) == float(other) for consistency?

Collaborator Author

@TomeHirata Mar 25, 2025


I guess this should be consistent with how __ge__ or __le__ are implemented?
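
For reference, a sketch of what a fully consistent set of comparison methods might look like (a hypothetical illustration, not the actual diff in this PR):

    def __eq__(self, other):
        # Compare by the evaluation score in both cases (sketch only).
        if isinstance(other, (float, int)):
            return float(self) == other
        elif isinstance(other, Prediction):
            return float(self) == float(other)
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, (float, int)):
            return float(self) >= other
        elif isinstance(other, Prediction):
            return float(self) >= float(other)
        return NotImplemented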

@TomeHirata force-pushed the feat/evaluate-response branch from 7aeb618 to ed8fd13 on March 27, 2025 at 00:35
@Nasreddine

Please merge this PR so that we can get the individual example-level evaluation scores. This will be useful for MLflow tracing.
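
For illustration, one way the per-example scores could be consumed once this is merged, e.g. logging them as MLflow metrics (a hypothetical sketch, not part of this PR; MLflow shown only as an example):

    import mlflow

    result = evaluator(my_program)
    with mlflow.start_run():
        mlflow.log_metric("overall_score", result.score)
        for i, (example, prediction, score) in enumerate(result.all_outputs):
            mlflow.log_metric("example_score", score, step=i)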
