Bug in Benchmark in chat format

I think there is a problem with `Benchmark` when it is used with `format="formats.chat_api"`. Minimal reproduction:
```python
from unitxt.api import DatasetRecipe, load_dataset
from unitxt.benchmark import Benchmark

benchmark = Benchmark(
    # format="formats.user_agent",
    format="formats.chat_api",
    max_samples_per_subset=5,
    loader_limit=5,
    subsets={
        "cola": DatasetRecipe(
            card="cards.cola",
            template="templates.classification.multi_class.instruction",
        ),
    },
)

# Load through the Benchmark and inspect the type of 'source'.
print("*****Load as BENCHMARK ")
dataset = list(benchmark()["test"])
print(dataset[0])
print(type(dataset[0]["source"]))

# Load the same card and template directly and inspect 'source' again.
print("*****Load as DATASET ")
dataset = load_dataset(
    card="cards.cola",
    template="templates.classification.multi_class.instruction",
    format="formats.chat_api",
    loader_limit=5,
    split="test",
)
print(dataset[0])
print(type(dataset[0]["source"]))
```

When loading through `Benchmark`, `source` comes back as a string (the JSON-serialized messages), whereas loading the same card through `load_dataset` correctly returns a list of messages:

```
*****Load as BENCHMARK 

Loading limited to 5 instances by setting LoadHF.loader_limit;
{'metrics': ['metrics.matthews_correlation'], 'data_classification_policy': ['public'], 'media': {'images': [], 'audios': []}, 'postprocessors': ['processors.take_first_non_empty_line', 'processors.lower_case_till_punc'], 'target': 'acceptable', 'references': ['acceptable'], 'source': '[{"role": "system", "content": "Classify the grammatical acceptability of the following text to one of these options: unacceptable, acceptable."}, {"role": "user", "content": "text: The sailors rode the breeze clear of the rocks."}]', 'task_data': '{"text": "The sailors rode the breeze clear of the rocks.", "text_type": "text", "classes": ["unacceptable", "acceptable"], "type_of_class": "grammatical acceptability", "metadata": {"data_classification_policy": ["public"], "num_demos": 0, "demos_pool_size": 0, "template": "templates.classification.multi_class.instruction"}, "label": "acceptable"}', 'groups': [], 'subset': ['cola']}
<class 'str'>
*****Load as DATASET 
Loader line limit was set to  5
Generating test split: 5 examples [00:00, 1284.00 examples/s]
/Users/yoavkatz/miniforge3/envs/fme/lib/python3.10/site-packages/datasets/builder.py:1243: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=all_checks' instead.
 warnings.warn(
{'metrics': ['metrics.matthews_correlation'], 'data_classification_policy': ['public'], 'media': {'audios': [], 'images': []}, 'postprocessors': ['processors.take_first_non_empty_line', 'processors.lower_case_till_punc'], 'target': 'acceptable', 'references': ['acceptable'], 'source': [{'role': 'system', 'content': 'Classify the grammatical acceptability of the following text to one of these options: unacceptable, acceptable.'}, {'role': 'user', 'content': 'text: The sailors rode the breeze clear of the rocks.'}], 'task_data': '{"text": "The sailors rode the breeze clear of the rocks.", "text_type": "text", "classes": ["unacceptable", "acceptable"], "type_of_class": "grammatical acceptability", "metadata": {"data_classification_policy": ["public"], "num_demos": 0, "demos_pool_size": 0, "template": "templates.classification.multi_class.instruction"}, "label": "acceptable"}', 'groups': [], 'subset': []}
<class 'list'>
```
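
Until this is fixed, one possible consumer-side workaround is to decode the string back into messages. This is only a sketch based on the output above, not part of unitxt: `normalize_source` is a hypothetical helper name, and it assumes the string form is always the JSON serialization of the message list, as it is in this example.

```python
import json


def normalize_source(source):
    """Return chat messages as a list when 'source' is their JSON string form.

    Hypothetical helper, not a unitxt API. Anything that is not a JSON-encoded
    list (for example a plain-text prompt) is returned unchanged.
    """
    if not isinstance(source, str):
        return source  # already a list of messages
    try:
        parsed = json.loads(source)
    except json.JSONDecodeError:
        return source  # plain-text prompt, leave as-is
    return parsed if isinstance(parsed, list) else source


# 'source' as returned by the Benchmark in the output above (a str).
benchmark_source = (
    '[{"role": "system", "content": "Classify the grammatical acceptability '
    'of the following text to one of these options: unacceptable, acceptable."}, '
    '{"role": "user", "content": "text: The sailors rode the breeze clear of the rocks."}]'
)

messages = normalize_source(benchmark_source)
print(type(messages))
print(messages[1]["role"], "|", messages[1]["content"])
```

Note that a plain-text prompt which happens to be a valid JSON list would also be decoded, so this should only be applied when a chat format is known to be in use.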


Bug in Benchmark in chat format #1653
