
Change the output interface of evaluate #8003


Open

TomeHirata wants to merge 10 commits into main from feat/evaluate-response

Conversation

Collaborator

@TomeHirata commented Mar 24, 2025

In this PR, we change the output interface of Evaluate.__call__.
Instead of returning score, (score, outputs), or (score, scores, outputs) depending on the arguments, it now always returns a dspy.Prediction containing the following fields:

  • score: A float percentage score (e.g., 67.30) representing overall performance
  • all_outputs: A list of (example, prediction, score) tuples, one for each example in the devset

Since this is a breaking change, it should go out in the next minor release rather than a patch release.
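
A minimal usage sketch of the new interface, assuming the field names listed above (the devset, metric, and program names below are placeholders):

    import dspy

    evaluator = dspy.Evaluate(devset=devset, metric=my_metric, num_threads=4)
    result = evaluator(my_program)  # now always a dspy.Prediction

    print(result.score)  # overall percentage score, e.g. 67.30
    for example, prediction, score in result.all_outputs:
        print(score)  # per-example score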

Collaborator

@chenmoneygithub left a comment


Solid work! LGTM with one minor comment.

Let's talk offline about the potential breakage and align on the release schedule.

if isinstance(other, (float, int)):
    return self.__float__() == other
elif isinstance(other, Prediction):
    return self.__float__() == float(other)
Collaborator


nit: shall we do float(self) == float(other) for consistency?

Collaborator Author

@TomeHirata Mar 25, 2025


I guess this should be consistent with how __ge__ or __le__ are implemented?
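
For reference, a sketch of what a fully consistent set of comparison methods might look like (a hypothetical illustration, not the actual diff in this PR):

    def __eq__(self, other):
        # Compare by the evaluation score in both cases (sketch only).
        if isinstance(other, (float, int)):
            return float(self) == other
        elif isinstance(other, Prediction):
            return float(self) == float(other)
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, (float, int)):
            return float(self) >= other
        elif isinstance(other, Prediction):
            return float(self) >= float(other)
        return NotImplemented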

@TomeHirata force-pushed the feat/evaluate-response branch from 7aeb618 to ed8fd13 on March 27, 2025 at 00:35
@Nasreddine

Please merge this PR so that we can get the individual example-level evaluation scores. This will be useful for MLflow tracing.
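
For illustration, one way the per-example scores could be consumed once this is merged, e.g. logging them as MLflow metrics (a hypothetical sketch, not part of this PR; MLflow shown only as an example):

    import mlflow

    result = evaluator(my_program)
    with mlflow.start_run():
        mlflow.log_metric("overall_score", result.score)
        for i, (example, prediction, score) in enumerate(result.all_outputs):
            mlflow.log_metric("example_score", score, step=i)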
