Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards

This repository contains the code and data for the paper “Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards.”

We introduce SAR-Agent, the first AI agent for detecting annotation errors in text-to-SQL benchmarks via multi-turn interaction with the database. Using SAR-Agent and expert analysis, we find that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively.
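
For intuition, the sketch below shows one way such a multi-turn verification loop could look: the model proposes a probe query, the probe is executed against the benchmark database, and the execution result is fed back until the model issues a verdict on the annotated gold SQL. This is a minimal illustration and not the SAR-Agent implementation; the prompt wording, the JSON protocol, and the verify_example helper are assumptions.

    import json
    import sqlite3
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def run_sql(db_path, sql, limit=20):
        """Execute a probe query against the SQLite database and return a few rows."""
        with sqlite3.connect(db_path) as conn:
            try:
                return {"rows": conn.execute(sql).fetchmany(limit)}
            except sqlite3.Error as exc:
                return {"error": str(exc)}

    def verify_example(question, gold_sql, db_path, max_turns=5):
        """Multi-turn loop: the model probes the database until it issues a verdict."""
        messages = [{
            "role": "user",
            "content": (
                "You are checking a text-to-SQL annotation.\n"
                f"Question: {question}\nGold SQL: {gold_sql}\n"
                'Answer with JSON only: {"action": "query", "sql": "..."} to probe '
                'the database, or {"action": "verdict", "correct": true/false, "reason": "..."}.'
            ),
        }]
        for _ in range(max_turns):
            reply = client.chat.completions.create(model="o3", messages=messages)
            content = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": content})
            step = json.loads(content)
            if step["action"] == "verdict":
                return step
            result = run_sql(db_path, step["sql"])
            messages.append({"role": "user", "content": f"Execution result: {result}"})
        return {"action": "verdict", "correct": None, "reason": "turn limit reached"}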

We corrected 100 examples sampled from the BIRD Dev set and re-evaluated all 16 open-source agents from the BIRD leaderboard. We observe relative performance changes between −7% and +31% and ranking changes between −9 and +9 positions.

Figure 1: Execution accuracy of agents on original and corrected BIRD Dev subsets.

Figure 2: Agent ranking changes from original to corrected BIRD Dev subset.

Repository layout

  • SAR-Agent: implementation of SAR-Agent
  • text_to_sql_agents: code and outputs for the 16 open-source agents we re-evaluated

SAR-Agent

We used OpenAI’s o3 model in our experiments.

Prerequisites

  • Python 3.12

Install

cd SAR-Agent
pip install -r requirements.txt

Set credentials

  • OpenAI:
export OPENAI_API_KEY='<your_api_key>'
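
If you want to confirm the key is picked up from Python, the official OpenAI client reads OPENAI_API_KEY from the environment by default. A minimal check, assuming the openai Python package is installed via requirements.txt:

    import os
    from openai import OpenAI

    assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment automatically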

Data setup

BIRD

gdown 1H_CoROs_rr-11cNg7UeP_cN9PxvGR1qf
unzip bird.zip

Spider 2.0-Snow

Run SAR-Agent

Example invocations:

  • BIRD:

    python sql_verifier.py
    
  • Spider 2.0-Snow:

    python sql_verifier_sf.py
    
  • Spider 2.0-Snow-0713:

    python sql_verifier_sf.py --old_sf
    

Re-evaluation of open-source agents

We include 16 open-source agents from the BIRD leaderboard under text_to_sql_agents. For convenience, we also include the generated SQL outputs used in our study.

Use released outputs

  • Each agent folder under text_to_sql_agents contains a results/ subfolder with generated queries for both the original and corrected subsets (see each agent’s README for file names).
  • To compute execution accuracy, run the evaluation script, for example (a sketch of the comparison it performs appears after this list):
    python evaluate.py \
      --pred ./CHESS/results/dev.json \
      --gold ./data/dev/dev.json \
      --db_path ./data/dev/dev_databases
    
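
Execution accuracy treats a predicted query as correct when its result matches the gold query's result on the target database. The sketch below illustrates that comparison on SQLite; the JSON layout (db_id and SQL fields), the database path convention, and the unordered row comparison are assumptions, not the exact logic of evaluate.py.

    import json
    import sqlite3
    from collections import Counter

    def same_result(db_file, pred_sql, gold_sql):
        """A prediction counts as correct when it returns the same rows as the gold SQL."""
        with sqlite3.connect(db_file) as conn:
            try:
                pred_rows = conn.execute(pred_sql).fetchall()
            except sqlite3.Error:
                return False  # queries that fail to execute are counted as wrong
            gold_rows = conn.execute(gold_sql).fetchall()
        # Compare results as unordered multisets of rows.
        return Counter(pred_rows) == Counter(gold_rows)

    def execution_accuracy(pred_path, gold_path, db_root):
        """Fraction of examples whose predicted SQL matches the gold result."""
        preds = json.load(open(pred_path))
        golds = json.load(open(gold_path))
        hits = 0
        for pred, gold in zip(preds, golds):
            db_file = f"{db_root}/{gold['db_id']}/{gold['db_id']}.sqlite"
            hits += same_result(db_file, pred["SQL"], gold["SQL"])
        return hits / len(golds)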

Run agents yourself (optional)

Data setup

gdown 1lhWvaI15UnAa7Mjs1dtKiBEfLRzJJpEz

Run agents on the original and corrected Dev subset

  • Each agent folder contains a README with environment setup, checkpoints, and run commands; follow it to run the agent on both the original and corrected subsets.
