Hello, I was looking at eval_meta_data.csv and I noticed that the "Random baseline" column seems to be set somewhat arbitrarily (and perhaps incorrectly?) in some cases. For example, when a multiple-choice task has 4 choices, the baseline is 25% - great, that makes sense and matches what I assumed from the paper. But why is siqa at 33% when it's a binary task? Or commonsense_qa at 20% when it's a 4-choice task? Or boolq at 62% when it's a binary task? (That last one makes e.g. Olmo1B, with an accuracy of only ~64%, contribute a centered accuracy of -0.25 towards the average CORE score, which looks wrong/buggy.) Can you help me understand where these numbers come from?
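
For reference, here is the rescaling I'm assuming when I say "centered accuracy" (if the actual formula in the code is different, that alone might explain my confusion). The accuracies in the sketch are purely illustrative, not values from the CSV:

```python
# Sketch of the centered-accuracy rescaling as I understand it:
# the random baseline maps to 0 and perfect accuracy maps to 1.
# (This formula is my assumption - please correct me if the repo does it differently.)

def centered_accuracy(acc: float, baseline: float) -> float:
    return (acc - baseline) / (1.0 - baseline)

# With a 25% baseline (4-way multiple choice) this behaves as I'd expect:
print(centered_accuracy(0.64, 0.25))  # ~0.52

# But with boolq's 62% baseline, any accuracy even slightly below 62%
# contributes a sizable negative value to the CORE average:
print(centered_accuracy(0.53, 0.62))  # ~-0.24
print(centered_accuracy(0.64, 0.62))  # ~0.05
```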