[Task Submission] mmlusr (`mmlusr`) by SkySuperCat · Pull Request #3 · GenBench/genbench_cbt

SkySuperCat · 2024-10-22T06:54:15Z

MMLU-SR

mmlusr aims to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms.

Authors

Wentian Wang, wwang834@usc.edu
Sarthak Jain
Paul Kantor
Jacob Feldman
Lazaros Gallos
Hao Wang

Implementation

task.py in the mmlusr folder implements a custom method for loading the answer choices from HuggingFace.

Usage

We still need a way to run all tasks on GenBench. In our Git repo it is easy to run all tasks: each task is a single line in a config file, so a user can pick any task they want. With the loading strategy used here, however, we currently have to change the task name in config.jsonnet manually. We cannot change this on the HuggingFace side, because the dataset is already used in the lm-eval-harness repo.
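One possible workaround, sketched below, is to generate one config per task from a template instead of editing config.jsonnet by hand. This is only an illustration: the template fields, the dataset id, the `SUBSET_PLACEHOLDER` marker, and the subset names are hypothetical and are not taken from the actual config.jsonnet in this PR.

```python
# Hypothetical config fragment. The real config.jsonnet fields may differ.
TEMPLATE = """{
  name: 'mmlusr',
  data_source: {
    type: 'hf',
    hf_id: ['HYPOTHETICAL/dataset-id', 'SUBSET_PLACEHOLDER'],
  },
}
"""

# Illustrative subset names only; substitute the real task names.
SUBSETS = ["example_subset_a", "example_subset_b"]


def render_config(template: str, subset: str) -> str:
    """Return the template text with the placeholder replaced by one subset name."""
    if "SUBSET_PLACEHOLDER" not in template:
        raise ValueError("template has no SUBSET_PLACEHOLDER marker")
    return template.replace("SUBSET_PLACEHOLDER", subset)


def render_all(template: str, subsets: list[str]) -> dict[str, str]:
    """Build a mapping from subset name to its rendered config text."""
    return {subset: render_config(template, subset) for subset in subsets}


configs = render_all(TEMPLATE, SUBSETS)
```

Each rendered string could then be written to its own file and tested in turn, which avoids touching the HuggingFace side.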

Checklist:

[√] I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
[√] Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
[√] I have read the description of what should be in the doc.md of my task, and have added the required arguments.
[√] I have submitted or will submit an accompanying paper to the GenBench workshop.

Add mmlusr

47cbbef

SkySuperCat wants to merge 1 commit into GenBench:main from SkySuperCat:mmlusr
