As a sanity test on evaluating agents on DA-Code, I checked the evaluation of the gold standard against itself; see https://github.com/michaelrglass/da-code/blob/gold-vs-gold/tests/self_eval_check.py
This revealed a few issues:
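For context, the check itself is simple: treat each instance's gold files as if an agent had produced them, run the evaluator, and flag anything that does not come back as exactly 1.0. A minimal sketch of that idea follows (not the linked script; the evaluate hook, the gold/ and self_eval/ paths, and the per-instance directory layout are assumptions about how the evaluator is driven):

```python
import shutil
from pathlib import Path
from typing import Callable

GOLD_ROOT = Path("gold")        # assumed layout: gold/<instance-id>/...
WORK_ROOT = Path("self_eval")   # scratch directory standing in for agent output


def self_eval_check(evaluate: Callable[[Path, Path], float]) -> dict[str, float]:
    """Score every instance's own gold files against its gold answer.

    `evaluate(gold_dir, output_dir)` is a hypothetical hook for however the
    DA-Code evaluator is actually invoked; it is assumed to return a score
    in [0, 1]. Anything that does not score exactly 1.0 is reported.
    """
    failures: dict[str, float] = {}
    for instance_dir in sorted(GOLD_ROOT.iterdir()):
        if not instance_dir.is_dir():
            continue
        # Pretend the agent produced exactly the gold files.
        output_dir = WORK_ROOT / instance_dir.name
        if output_dir.exists():
            shutil.rmtree(output_dir)
        shutil.copytree(instance_dir, output_dir)

        score = evaluate(instance_dir, output_dir)
        if score != 1.0:
            failures[instance_dir.name] = score
    return failures
```

Every instance listed below was flagged by this kind of gold-vs-gold comparison.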
Easily fixed
- In calculate_list, the empty list is not equal to itself:
https://github.com/michaelrglass/da-code/blob/270c59b36c7c961c82a4d4c73b83e2932cd52638/da_agent/evaluators/metrics/text.py#L20
This is a problem in di-text-029 and di-text-035; a possible guard is sketched after this list.
- The file gold/ml-cluster-003/cluster.csv is not present; instead it is clustering.csv.
However, renaming this file gives a total score of 0.005763917074549833. Like a number of other 'ml' instances, this has a non-zero, non-perfect score.
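For the calculate_list case, I have not worked out the intended semantics, but the empty-list case looks like it just needs an explicit guard. A minimal sketch, assuming the metric scores a predicted list against a gold list as the fraction of matching entries (the real calculate_list in text.py may compute something different, and list_score here is just an illustrative name):

```python
def list_score(pred: list, gold: list) -> float:
    """Fraction of gold entries reproduced, with the empty case made explicit."""
    # Two empty lists are a perfect match; without this guard a ratio-style
    # metric either divides by zero or defaults to 0 when gold == [].
    if not gold and not pred:
        return 1.0
    if not gold or not pred:
        return 0.0
    matched = sum(1 for item in gold if item in pred)
    return matched / len(gold)
```

Whatever the actual formula is, returning 1.0 when both sides are empty should make di-text-029 and di-text-035 score perfectly against themselves.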
Unknown failures
- data-wrangling-038 (maybe because it is a .db file?)
ML score below 1
- ml-cluster-001
- ml-cluster-002
- ml-cluster-003
- ml-cluster-004
- ml-cluster-006
- ml-cluster-007
- ml-cluster-008
- ml-cluster-009
- ml-cluster-010
- ml-cluster-015
- ml-cluster-017
- ml-cluster-018
- ml-cluster-021
- ml-competition-003
- ml-competition-005
- ml-competition-007
- ml-competition-009
- ml-competition-010
- ml-competition-011
- ml-competition-013
- ml-competition-014
- ml-competition-015
- ml-competition-019