Bug in Benchmark in chat format #1653

@yoavkatz

I think there is a problem with Benchmark when it is used with the chat format (formats.chat_api).

from unitxt.benchmark import Benchmark
from unitxt.api import DatasetRecipe, load_dataset


benchmark = Benchmark(
    # format="formats.user_agent",
    format="formats.chat_api",
    max_samples_per_subset=5,
    loader_limit=5,
    subsets={
        "cola": DatasetRecipe(
            card="cards.cola",
            template="templates.classification.multi_class.instruction",
        ),
    },
)

print("*****Load as BENCHMARK ")
dataset = list(benchmark()["test"])
print(dataset[0])
print(type(dataset[0]['source']))

print("*****Load as DATASET ")
dataset = load_dataset(
    card="cards.cola",
    template="templates.classification.multi_class.instruction",
    format="formats.chat_api",
    loader_limit=5,
    split="test",
)
print(dataset[0])
print(type(dataset[0]['source']))

When loading as a benchmark, the messages in 'source' come back as a JSON string, whereas when loading as a dataset they correctly come back as a list of messages:

*****Load as BENCHMARK 

Loading limited to 5 instances by setting LoadHF.loader_limit;
{'metrics': ['metrics.matthews_correlation'], 'data_classification_policy': ['public'], 'media': {'images': [], 'audios': []}, 'postprocessors': ['processors.take_first_non_empty_line', 'processors.lower_case_till_punc'], 'target': 'acceptable', 'references': ['acceptable'], 'source': '[{"role": "system", "content": "Classify the grammatical acceptability of the following text to one of these options: unacceptable, acceptable."}, {"role": "user", "content": "text: The sailors rode the breeze clear of the rocks."}]', 'task_data': '{"text": "The sailors rode the breeze clear of the rocks.", "text_type": "text", "classes": ["unacceptable", "acceptable"], "type_of_class": "grammatical acceptability", "metadata": {"data_classification_policy": ["public"], "num_demos": 0, "demos_pool_size": 0, "template": "templates.classification.multi_class.instruction"}, "label": "acceptable"}', 'groups': [], 'subset': ['cola']}
<class 'str'>
*****Load as DATASET 
Loader line limit was set to  5
Generating test split: 5 examples [00:00, 1284.00 examples/s]
/Users/yoavkatz/miniforge3/envs/fme/lib/python3.10/site-packages/datasets/builder.py:1243: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=all_checks' instead.
 warnings.warn(
{'metrics': ['metrics.matthews_correlation'], 'data_classification_policy': ['public'], 'media': {'audios': [], 'images': []}, 'postprocessors': ['processors.take_first_non_empty_line', 'processors.lower_case_till_punc'], 'target': 'acceptable', 'references': ['acceptable'], 'source': [{'role': 'system', 'content': 'Classify the grammatical acceptability of the following text to one of these options: unacceptable, acceptable.'}, {'role': 'user', 'content': 'text: The sailors rode the breeze clear of the rocks.'}], 'task_data': '{"text": "The sailors rode the breeze clear of the rocks.", "text_type": "text", "classes": ["unacceptable", "acceptable"], "type_of_class": "grammatical acceptability", "metadata": {"data_classification_policy": ["public"], "num_demos": 0, "demos_pool_size": 0, "template": "templates.classification.multi_class.instruction"}, "label": "acceptable"}', 'groups': [], 'subset': []}
<class 'list'>
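
Until the Benchmark path is fixed, a caller could work around the difference by decoding 'source' when it arrives as a JSON string. The following is only a sketch of such a workaround, not a fix in unitxt itself: the helper name normalize_source is hypothetical, and it assumes the string form is valid JSON holding a list of role/content messages, as in the output above.

```python
import json


def normalize_source(instance):
    """Return the instance with 'source' as a list of chat messages.

    Hypothetical helper (not part of unitxt). If 'source' is a JSON string
    that decodes to a list of dicts with a 'role' key, a copy of the
    instance with the decoded list is returned. Anything else is returned
    unchanged, so plain-text prompts are left alone.
    """
    source = instance["source"]
    if not isinstance(source, str):
        return instance  # already a list (the load_dataset behaviour)
    try:
        parsed = json.loads(source)
    except json.JSONDecodeError:
        return instance  # an ordinary text prompt, not serialized messages
    looks_like_chat = isinstance(parsed, list) and all(
        isinstance(m, dict) and "role" in m for m in parsed
    )
    if looks_like_chat:
        return {**instance, "source": parsed}
    return instance
```

Applied to the reproduction above, this would be used as dataset = [normalize_source(x) for x in benchmark()["test"]], after which type(dataset[0]['source']) should be list in both loading paths.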
