CRED-SQL

This is the official repository for the paper "CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description". CRED-SQL is a Text-to-SQL framework designed for large-scale databases that addresses the semantic mismatch between natural language questions and SQL queries through Cluster Retrieval and Execution Description.

Core Challenges

Traditional Text-to-SQL approaches face two major challenges in real-world large-scale scenarios:

Schema Mismatch: In databases with thousands of tables, existing retrieval techniques struggle to accurately identify relevant tables and columns.
Semantic Deviation: Direct mapping from natural language questions to SQL queries involves significant semantic gaps.

Innovative Solutions

CRED-SQL overcomes these limitations through a two-stage approach:

🔄 Cluster-based Large-scale Schema Retrieval (CLSR)

Clusters tables and columns based on semantic similarity
Applies a dynamic attribute-weighting strategy that boosts relevant attributes
Significantly improves schema selection accuracy in large-scale databases
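
To make the clustering idea concrete, here is a minimal sketch of grouping tables by similarity and then picking the cluster that best matches a question. It is not the CLSR implementation from this repository: the bag-of-words `embed` function is a toy stand-in for a real embedding model such as bge-m3, the greedy clustering rule and the `threshold` value are assumptions, the example tables are made up, and the dynamic attribute-weighting step is left out entirely.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_tables(tables, threshold=0.4):
    """Greedy clustering: a table joins the first cluster whose seed table is similar enough."""
    vectors = {name: embed(name + " " + " ".join(cols)) for name, cols in tables.items()}
    clusters = []  # each cluster is a list of table names; the first entry is the seed
    for name in tables:
        for cluster in clusters:
            if cosine(vectors[name], vectors[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters, vectors


def retrieve(question, clusters, vectors):
    """Return the cluster whose best-matching table is most similar to the question."""
    q = embed(question)
    return max(clusters, key=lambda c: max(cosine(q, vectors[t]) for t in c))


# Hypothetical schema, for illustration only.
tables = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "stadium_id", "year"],
    "singer_in_concert": ["concert_id", "singer_id"],
    "orders": ["order_id", "customer_id", "order_date"],
    "customers": ["customer_id", "customer_name", "city"],
}
clusters, vectors = cluster_tables(tables)
best = retrieve("What is the name and country of each singer?", clusters, vectors)
print(best)
```

Retrieving a whole cluster rather than single tables means that closely related tables (here, a table and its join table) are returned together, which is the intuition behind cluster-level retrieval on databases with many tables.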

📝 Execution Description Language (EDL)

Novel natural language intermediate representation describing SQL execution intent
Decomposes Text-to-SQL into Text-to-EDL and EDL-to-SQL subtasks
Better leverages LLMs' general reasoning capabilities while reducing semantic deviation
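
The decomposition into two subtasks can be sketched as two chained LLM calls. This is only an illustration under stated assumptions: the `LLM` callable interface, the prompt wording, and the `fake_llm` stand-in are hypothetical and are not taken from this repository, and the actual EDL format is defined by the paper and the released datasets, not by this sketch.

```python
from typing import Callable, Tuple

# Hypothetical interface: a function that takes a prompt and returns a completion.
LLM = Callable[[str], str]


def text_to_edl(question: str, schema: str, llm: LLM) -> str:
    """Stage 1: ask the model for a natural language description of the SQL execution."""
    prompt = (
        "Describe, in natural language, the steps a SQL query would execute "
        "to answer the question.\n"
        f"Schema:\n{schema}\nQuestion: {question}\nExecution description:"
    )
    return llm(prompt).strip()


def edl_to_sql(edl: str, schema: str, llm: LLM) -> str:
    """Stage 2: translate the execution description into SQL."""
    prompt = (
        "Translate the execution description into a single SQL query.\n"
        f"Schema:\n{schema}\nExecution description: {edl}\nSQL:"
    )
    return llm(prompt).strip()


def text_to_sql(question: str, schema: str, llm: LLM) -> Tuple[str, str]:
    """Run both stages and return (EDL, SQL) so the intermediate step can be inspected."""
    edl = text_to_edl(question, schema, llm)
    return edl, edl_to_sql(edl, schema, llm)


def fake_llm(prompt: str) -> str:
    """Canned stand-in so the sketch runs without a model."""
    if prompt.rstrip().endswith("SQL:"):
        return "SELECT COUNT(*) FROM singer"
    return "Count the rows in the singer table."


edl, sql = text_to_sql("How many singers are there?", "singer(singer_id, name)", fake_llm)
print(edl)
print(sql)
```

Returning the EDL alongside the SQL keeps the intermediate representation visible, which makes it easier to tell whether an error came from the Text-to-EDL stage or the EDL-to-SQL stage.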


Environment

Clone the repository and install required packages:

pip install -r requirements.txt

Run

Install the Weaviate vector database (https://weaviate.io/)

cd /weaviate
sh docker_pull.sh
sh docker_run.sh
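
Before moving on, it can help to confirm that the Weaviate container is actually accepting requests. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) using only the standard library. The host `localhost` and port `8080` are assumptions (8080 is Weaviate's default HTTP port); adjust them to whatever `docker_run.sh` actually exposes.

```python
import urllib.error
import urllib.request


def ready_url(host="localhost", port=8080):
    """Build the readiness URL (host and port are assumptions, not read from docker_run.sh)."""
    return f"http://{host}:{port}/v1/.well-known/ready"


def weaviate_is_ready(host="localhost", port=8080, timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: treat as not ready.
        return False
```

Calling `weaviate_is_ready()` after `sh docker_run.sh` should return `True` once the container has finished starting; a `False` result usually means the container is still booting or is listening on a different port.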

Download bge-m3 and copy it to /var/rag/models

cd /models
mkdir -p /var/rag/models
python download_bge_m3.py
cp -r ./bge-m3 /var/rag/models

Start CLSR schema retrieval

cd /CLSR/schema_retrieval/evaluation
sh run_evaluation.sh
cd /CLSR/schema-choose/src/
python llm_schema_choose.py

NLQ-to-EDL and EDL-to-SQL

Using few-shot prompting or fine-tuned LLMs, first generate the EDL and then generate the SQL, based on the selected database schema and the question.

The EDL datasets are available at https://huggingface.co/datasets/ZR00/Spider_EDL (Spider) and https://huggingface.co/datasets/ZR00/Bird_EDL (Bird)

The best-performing fine-tuned LLMs on the Spider dataset are open-sourced at https://huggingface.co/ZR00/spider_qwen32b_train_q_to_edl_orischema (NLQ-to-EDL) and https://huggingface.co/ZR00/spider_qwen32b_train_edl_to_sql (EDL-to-SQL)

cd /CLSR/EDL-generation/
python generate_spider.py
cd /CLSR/sql_mapping/
python generate_spider_edl_to_sql.py

Cite

@misc{duan2025credsqlenhancingrealworldlarge,
      title={CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description}, 
      author={Shaoming Duan and Zirui Wang and Chuanyi Liu and Zhibin Zhu and Yuhao Zhang and Peiyi Han and Liang Yan and Zewu Penge},
      year={2025},
      eprint={2508.12769},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.12769}, 
}
