RuozhiBench is a dataset for evaluating Large Language Models (LLMs) through the lens of logical fallacies and misleading premises. Our benchmark provides insights into how different LLMs handle logically challenging scenarios.
- Overall Results: Overall scores of advanced LLMs across different categories.
- LLM Evaluator Correlation: While absolute scores vary among LLM evaluators, we observed high correlation between their assessments.
- Paired Question Analysis: Normal questions consistently received higher scores than tricky questions containing logical fallacies.
- Evaluation Method Comparison: Free-form generation evaluation showed strong correlation with multiple-choice evaluation.
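The evaluator-correlation finding above can be illustrated in a few lines of Python. The scores below are made-up placeholders, not actual RuozhiBench results; the point is only that two evaluators can disagree on absolute scores while ranking models almost identically:

```python
# Pearson correlation between two evaluators' per-model scores.
# The score lists are illustrative placeholders, not benchmark numbers.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores from two LLM evaluators on the same five models:
evaluator_a = [72.0, 65.5, 81.2, 58.9, 77.4]
evaluator_b = [68.3, 60.1, 79.8, 55.2, 73.0]  # lower absolute scores, same ordering

print(f"Pearson r = {pearson(evaluator_a, evaluator_b):.3f}")
```

Despite a systematic offset between the two evaluators, the correlation comes out close to 1, which is the pattern the benchmark reports.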
```bash
git clone https://github.com/LibrAIResearch/libra-eval
cd libra-eval
pip install -e .
```

Create an API configuration file at `libra_eval/config/api_config.json`:
```json
{
    "OPENAI_API_KEY": "your_openai_api_key"
}
```

- Generate model responses:
```bash
python get_response.py \
    --mode gen \
    --model gpt-4o-mini \
    --client openai \
    --data_dir ../data/ \
    --api_config ../libra_eval/config/api_config.json
```

- Run evaluation:
```bash
python evaluate.py \
    --mode gen \
    --evaluator gpt-4o-2024-08-06 \
    --client openai \
    --data_dir ../data/ \
    --api_config ../libra_eval/config/api_config.json
```

- Run multiple-choice evaluation:

```bash
python evaluate_mc.py \
    --model gpt-4o-mini \
    --client openai \
    --data_dir ../data/ \
    --api_config ../libra_eval/config/api_config.json
```

If you use RuozhiBench in your research, please cite our paper:
```bibtex
@misc{zhai2025ruozhibenchevaluatingllmslogical,
      title={RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises},
      author={Zenan Zhai and Hao Li and Xudong Han and Zhenxuan Zhang and Yixuan Zhang and Timothy Baldwin and Haonan Li},
      year={2025},
      eprint={2502.13125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13125},
}
```



