Hello, I was looking at eval_meta_data.csv and I noticed that the "Random baseline" column seems to be set somewhat arbitrarily (and perhaps incorrectly?) in some cases. For example, when a multiple-choice task has 4 choices, the baseline is 25% - great, that makes sense and matches what I assumed from the paper. But why is siqa at 33% when it's a binary task? Or commonsense_qa at 20% when it's a 4-choice task? Or boolq at 62% when it's a binary task? (That last one makes e.g. Olmo1B, with an accuracy of only ~64%, contribute a centered accuracy of -0.25 towards the average CORE score, which looks wrong/buggy.) Can you help me understand where these numbers come from?
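
For reference, here is the rescaling I'm assuming when I say "centered accuracy" (if the actual formula in the code is different, that alone might explain my confusion). The accuracies in the sketch are purely illustrative, not values from the CSV:

```python
# Sketch of the centered-accuracy rescaling as I understand it:
# the random baseline maps to 0 and perfect accuracy maps to 1.
# (This formula is my assumption - please correct me if the repo does it differently.)

def centered_accuracy(acc: float, baseline: float) -> float:
    return (acc - baseline) / (1.0 - baseline)

# With a 25% baseline (4-way multiple choice) this behaves as I'd expect:
print(centered_accuracy(0.64, 0.25))  # ~0.52

# But with boolq's 62% baseline, any accuracy even slightly below 62%
# contributes a sizable negative value to the CORE average:
print(centered_accuracy(0.53, 0.62))  # ~-0.24
print(centered_accuracy(0.64, 0.62))  # ~0.05
```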