This is the official repository for the paper "CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description". CRED-SQL is a Text-to-SQL framework designed for large-scale databases that addresses the semantic mismatch between natural language questions and SQL queries through Cluster Retrieval and Execution Description.
Traditional Text-to-SQL approaches face two major challenges in real-world large-scale scenarios: Schema Mismatch: In databases with thousands of tables, existing retrieval techniques struggle to accurately identify relevant tables and columns Semantic Deviation: Direct mapping from natural language questions to SQL queries involves significant semantic gaps
CRED-SQL overcomes these limitations through a two-stage approach:
- Clusters tables and columns based on semantic similarity
- Dynamic attribute-weighting strategy that boosts relevant attributes
- Significantly improves schema selection accuracy in large-scale databases
- Novel natural language intermediate representation describing SQL execution intent
- Decomposes Text-to-SQL into Text-to-EDL and EDL-to-SQL subtasks
- Better leverages LLMs' general reasoning capabilities while reducing semantic deviation
Clone the repository and install required packages:
pip install -r requirements.txt
- Install Vector Database Weaviate( https://weaviate.io/ )
cd /weaviate
sh docker_pull.sh
sh docker_run.sh
- Download bge-m3 and copy it to /var/rag/models
cd /models
mkdir -p /var/rag/models
python download_bge_m3.py
cp -r ./bge-m3 /var/rag/models
- Start CLSR schema retrieval
cd /CLSR/schema_retrieval/evaluation
sh run_evaluation.sh
cd /CLSR/schema-choose/src/
python llm_schema_choose.py
- NLQ-to-EDL and EDL-to-SQL
Through few-shot or fine-tuning LLMs, generate EDL first and then generate SQL based on the selected database schema and question.
The link to the EDL dataset: https://huggingface.co/datasets/ZR00/Spider_EDL, and https://huggingface.co/datasets/ZR00/Bird_EDL
The best-performing fine-tuned LLM on the Spider dataset is open-sourced as follows https://huggingface.co/ZR00/spider_qwen32b_train_q_to_edl_orischema and https://huggingface.co/ZR00/spider_qwen32b_train_edl_to_sql.
cd /CLSR/EDL-generation/
python generate_spider.py
cd /CLSR/sql_mapping/
python generate_spider_edl_to_sql.py
@misc{duan2025credsqlenhancingrealworldlarge,
title={CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description},
author={Shaoming Duan and Zirui Wang and Chuanyi Liu and Zhibin Zhu and Yuhao Zhang and Peiyi Han and Liang Yan and Zewu Penge},
year={2025},
eprint={2508.12769},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.12769},
}