This repository contains the code for the paper *Improving score reliability of multiple choice benchmarks with consistency evaluation and altered answer choices*.
To run this code, we advise that you first create a Python environment and install the required libraries, then follow the Steps to generate results section and execute the scripts to generate results.
After cloning this repository, create a virtual environment:

```bash
python -m venv .venv
```

Activate the virtual environment:

```bash
source .venv/bin/activate
```
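On Windows, activate with `.venv\Scripts\activate` instead.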
Install the required packages:

```bash
pip install -r requirements.txt
```
Inside the `src` folder, we have enumerated the script folders in the order they are run:

- Data preparation
- Output generation
  - Generate alternative evaluations
  - Generate outputs
  - Evaluate outputs
- Output analysis

There are no enumerations for Data preparation and Output analysis, because Data preparation only leaves the data in a specific format, and Output analysis contains the files for generating the metrics after running Output generation.
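For reference, the module paths used in the commands below suggest a layout along these lines (the actual file names, including any numeric prefixes from the enumeration, may differ):

```
src/
├── data_preparation/
│   └── prepare_dataset.py
├── output_generation/
│   ├── generate_alternative_evaluations.py
│   ├── generate_outputs.py
│   └── evaluate_outputs.py
└── output_analysis/
    └── compute_accuracies.py
```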
Initially, prepare the data; then generate multiple versions of the data, which are the alternative evaluations. With those evaluations, generate the model outputs and, finally, evaluate those outputs to obtain the results.
As an example, we have provided the MedQA data, the same data we used in the paper, originally from the MedQA paper's GitHub repository. You can modify the `format_data` script to add different data.
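If you adapt the pipeline to a different dataset, the formatting step only needs to write a spreadsheet in the shape the later scripts expect. Below is a minimal sketch of such a step, assuming hypothetical column names (`question`, `options`, `answer`); check the scripts in `src/data_preparation` for the schema the pipeline actually uses:

```python
import pandas as pd

# Hypothetical target schema; the real column names are defined by
# the scripts in src/data_preparation.
EXPECTED_COLUMNS = ["question", "options", "answer"]

def format_data(input_path: str, output_path: str) -> None:
    """Load a raw dataset and write it out in the prepared format."""
    df = pd.read_excel(input_path)
    # Map source-specific column names to the expected ones.
    # (This mapping is illustrative, not the repository's actual one.)
    df = df.rename(columns={"Question": "question",
                            "Choices": "options",
                            "Correct": "answer"})
    df[EXPECTED_COLUMNS].to_excel(output_path, index=False)

if __name__ == "__main__":
    format_data("data/MyDataset/raw.xlsx", "data/MyDataset/prepared.xlsx")
```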
These are the steps to generate the results:

```bash
python -m src.data_preparation.prepare_dataset -i data/MedQA/MedQA.xlsx -o data/MedQA/MedQA_prepared.xlsx
python -m src.output_generation.generate_alternative_evaluations -i data/MedQA/MedQA_prepared.xlsx -o data/MedQA/MedQA_wAlternativeEvaluations.xlsx
python -m src.output_generation.generate_outputs -i data/MedQA/MedQA_wAlternativeEvaluations.xlsx -o data/MedQA/MedQA_wOutputs.xlsx
python -m src.output_generation.evaluate_outputs -i data/MedQA/MedQA_wOutputs.xlsx -o data/MedQA/MedQA_results.json
python -m src.output_analysis.compute_accuracies -i data/MedQA/MedQA_results.json
```
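To give intuition for what the alternative evaluations do, the idea from the paper is to re-ask each question with the answer choices altered and then check whether the model selects the same underlying answer every time. The sketch below illustrates that idea with a simple choice shuffle and a consistency score; it is illustrative only, and the repository's actual logic lives in `src/output_generation` and `src/output_analysis`:

```python
import random

def make_alternative_evaluations(options, n_versions=3, seed=0):
    """Create versions of a question with shuffled answer choices.

    Returns (shuffled_options, correct_index) pairs, where correct_index
    tracks where the correct answer moved to. Assumes options[0] is the
    correct choice. (Illustrative; the repo may alter choices differently.)
    """
    rng = random.Random(seed)
    versions = []
    for _ in range(n_versions):
        shuffled = options[:]
        rng.shuffle(shuffled)
        versions.append((shuffled, shuffled.index(options[0])))
    return versions

def consistency(picked_answers):
    """Fraction of versions on which the model picked its modal answer."""
    if not picked_answers:
        return 0.0
    modal = max(set(picked_answers), key=picked_answers.count)
    return picked_answers.count(modal) / len(picked_answers)

# Toy example: options[0] is the correct choice.
opts = ["aspirin", "ibuprofen", "paracetamol", "naproxen"]
for shuffled, correct_idx in make_alternative_evaluations(opts):
    print(shuffled, "-> correct answer now at index", correct_idx)
print(consistency(["aspirin", "aspirin", "ibuprofen"]))  # 0.666...
```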
Documentation can be found primarily in this file and, soon, on Cora's GitHub Pages.
If you have any questions or issues, you can create a new issue here.
Pull requests are very welcome! Make sure your patches are well tested. Ideally, create a topic branch for every separate change you make. For example:

- Fork the repo
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Added some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request
This project is licensed under the Apache License 2.0.