Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards

This repository contains the code and data for the paper “Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards.”

We introduce SAR-Agent, the first AI agent for detecting annotation errors in text-to-SQL benchmarks via multi-turn interaction with the database. Using SAR-Agent and expert analysis, we find that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively.
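
For intuition, the sketch below shows one way such a multi-turn verification loop could look: the model proposes a probe query, the probe is executed against the benchmark database, and the execution result is fed back until the model issues a verdict on the annotated gold SQL. This is a minimal illustration and not the SAR-Agent implementation; the prompt wording, the JSON protocol, and the verify_example helper are assumptions.

    import json
    import sqlite3
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def run_sql(db_path, sql, limit=20):
        """Execute a probe query against the SQLite database and return a few rows."""
        with sqlite3.connect(db_path) as conn:
            try:
                return {"rows": conn.execute(sql).fetchmany(limit)}
            except sqlite3.Error as exc:
                return {"error": str(exc)}

    def verify_example(question, gold_sql, db_path, max_turns=5):
        """Multi-turn loop: the model probes the database until it issues a verdict."""
        messages = [{
            "role": "user",
            "content": (
                "You are checking a text-to-SQL annotation.\n"
                f"Question: {question}\nGold SQL: {gold_sql}\n"
                'Answer with JSON only: {"action": "query", "sql": "..."} to probe '
                'the database, or {"action": "verdict", "correct": true/false, "reason": "..."}.'
            ),
        }]
        for _ in range(max_turns):
            reply = client.chat.completions.create(model="o3", messages=messages)
            content = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": content})
            step = json.loads(content)
            if step["action"] == "verdict":
                return step
            result = run_sql(db_path, step["sql"])
            messages.append({"role": "user", "content": f"Execution result: {result}"})
        return {"action": "verdict", "correct": None, "reason": "turn limit reached"}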

We corrected 100 examples sampled from the BIRD Dev set and re-evaluated all 16 open-source agents from the BIRD leaderboard. We observe relative performance changes between −7% and +31% and ranking changes between −9 and +9 positions.

Figure 1: Execution accuracy of agents on original and corrected BIRD Dev subsets.

Figure 2: Agent ranking changes from original to corrected BIRD Dev subset.

Repository layout

  • SAR-Agent: implementation of SAR-Agent
  • text_to_sql_agents: code and outputs for the 16 open-source agents we re-evaluated

SAR-Agent

We used OpenAI’s o3 model in our experiments.

Prerequisites

  • Python 3.12

Install

cd SAR-Agent
pip install -r requirements.txt

Set credentials

  • OpenAI:
export OPENAI_API_KEY='<your_api_key>'
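
If you want to confirm the key is picked up from Python, the official OpenAI client reads OPENAI_API_KEY from the environment by default. A minimal check, assuming the openai Python package is installed via requirements.txt:

    import os
    from openai import OpenAI

    assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment automatically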

Data setup

BIRD

gdown 1H_CoROs_rr-11cNg7UeP_cN9PxvGR1qf
unzip bird.zip

Spider 2.0-Snow

Run SAR-Agent

Example invocations:

  • BIRD:

    python sql_verifier.py
    
  • Spider 2.0-Snow:

    python sql_verifier_sf.py
    
  • Spider 2.0-Snow-0713:

    python sql_verifier_sf.py --old_sf
    

Re-evaluation of open-source agents

We include 16 open-source agents from the BIRD leaderboard under text_to_sql_agents. For convenience, we also include the generated SQL outputs used in our study.

Use released outputs

  • Each agent folder under text_to_sql_agents contains a results/ subfolder with generated queries for both the original and corrected subsets (see each agent’s README for file names).
  • To compute execution accuracy, run the evaluation script, for example (a sketch of the comparison it performs appears after this list):
    python evaluate.py \
      --pred ./CHESS/results/dev.json \
      --gold ./data/dev/dev.json \
      --db_path ./data/dev/dev_databases
    
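
Execution accuracy treats a predicted query as correct when its result matches the gold query's result on the target database. The sketch below illustrates that comparison on SQLite; the JSON layout (db_id and SQL fields), the database path convention, and the unordered row comparison are assumptions, not the exact logic of evaluate.py.

    import json
    import sqlite3
    from collections import Counter

    def same_result(db_file, pred_sql, gold_sql):
        """A prediction counts as correct when it returns the same rows as the gold SQL."""
        with sqlite3.connect(db_file) as conn:
            try:
                pred_rows = conn.execute(pred_sql).fetchall()
            except sqlite3.Error:
                return False  # queries that fail to execute are counted as wrong
            gold_rows = conn.execute(gold_sql).fetchall()
        # Compare results as unordered multisets of rows.
        return Counter(pred_rows) == Counter(gold_rows)

    def execution_accuracy(pred_path, gold_path, db_root):
        """Fraction of examples whose predicted SQL matches the gold result."""
        preds = json.load(open(pred_path))
        golds = json.load(open(gold_path))
        hits = 0
        for pred, gold in zip(preds, golds):
            db_file = f"{db_root}/{gold['db_id']}/{gold['db_id']}.sqlite"
            hits += same_result(db_file, pred["SQL"], gold["SQL"])
        return hits / len(golds)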

Run agents yourself (optional)

Data setup

gdown 1lhWvaI15UnAa7Mjs1dtKiBEfLRzJJpEz

Run agents on the original and corrected Dev subset

  • Each agent folder contains a README with environment setup, checkpoints, and run commands; follow it to run the agent on both the original and corrected subsets.
