RuozhiBench is a dataset for evaluating Large Language Models (LLMs) through the lens of logical fallacies and misleading premises. Our benchmark provides insights into how different LLMs handle logically challenging scenarios.
- Overall Results: Overall scores of advanced LLMs across different categories.
- LLM Evaluator Correlation: While absolute scores vary among LLM evaluators, we observed high correlation between their assessments.
- Paired Question Analysis: Normal questions consistently received higher scores than tricky questions containing logical fallacies.
- Evaluation Method Comparison: Free-form generation evaluation showed strong correlation with multiple-choice evaluation.
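The evaluator-correlation finding above can be illustrated in a few lines of Python. The scores below are made-up placeholders, not actual RuozhiBench results; the point is only that two evaluators can disagree on absolute scores while ranking models almost identically:

```python
# Pearson correlation between two evaluators' per-model scores.
# The score lists are illustrative placeholders, not benchmark numbers.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores from two LLM evaluators on the same five models:
evaluator_a = [72.0, 65.5, 81.2, 58.9, 77.4]
evaluator_b = [68.3, 60.1, 79.8, 55.2, 73.0]  # lower absolute scores, same ordering

print(f"Pearson r = {pearson(evaluator_a, evaluator_b):.3f}")
```

Despite a systematic offset between the two evaluators, the correlation comes out close to 1, which is the pattern the benchmark reports.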
```bash
git clone https://github.com/LibrAIResearch/libra-eval
cd libra-eval
pip install -e .
```

Create an API configuration file at `libra_eval/config/api_config.json`:
```json
{
    "OPENAI_API_KEY": "your_openai_api_key"
}
```

- Generate model responses:
```bash
python get_response.py \
    --mode gen \
    --model gpt-4o-mini \
    --client openai \
    --data_dir ../data/ \
    --api_config ../libra_eval/config/api_config.json
```

- Run evaluation:
```bash
python evaluate.py \
    --mode gen \
    --evaluator gpt-4o-2024-08-06 \
    --client openai \
    --data_dir ../data/ \
    --api_config ../libra_eval/config/api_config.json
```

- Run multiple-choice evaluation:

```bash
python evaluate_mc.py \
    --model gpt-4o-mini \
    --client openai \
    --data_dir ../data/ \
    --api_config ../libra_eval/config/api_config.json
```

If you use RuozhiBench in your research, please cite our paper:
```bibtex
@misc{zhai2025ruozhibenchevaluatingllmslogical,
      title={RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises},
      author={Zenan Zhai and Hao Li and Xudong Han and Zhenxuan Zhang and Yixuan Zhang and Timothy Baldwin and Haonan Li},
      year={2025},
      eprint={2502.13125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13125},
}
```



