This repository contains the code for the following paper:
LEAP: LLM-powered End-to-end Automatic Library for Processing Social Science Queries on Unstructured Data (VLDB'25)
Chuxuan Hu, Austin Peters, and Daniel Kang
In this paper, we introduce LEAP, an end-to-end library designed to support social science research by automatically analyzing user-collected unstructured data in response to natural language queries. Along with LEAP, we present QUIET-ML, a dataset that comprehensively covers 120 popular queries in social science research. On QUIET-ML, LEAP achieves 100% pass @ 3 and 92% pass @ 1 at an average cost of $1.06 per query. Please refer to our paper for more details.
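As a rough illustration of how a pass @ k figure can be read, the sketch below counts a query as passed if any of its first k attempts is correct. This definition, the helper name pass_at_k, and the toy outcomes are assumptions for illustration only; the paper describes the actual evaluation protocol.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of queries with at least one correct result among the first k attempts.

    Hypothetical helper, not part of LEAP: `attempts` holds one inner sequence
    per query, with one boolean per attempt (True means the attempt was correct).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("attempts must contain at least one query")
    solved = sum(1 for outcomes in attempts if any(outcomes[:k]))
    return solved / len(attempts)


# Toy outcomes (NOT results from the paper): four queries, three attempts each.
toy = [
    [True, True, True],     # correct on the first attempt
    [False, True, True],    # correct on the second attempt
    [False, False, True],   # correct on the third attempt
    [False, False, False],  # never correct
]
print(pass_at_k(toy, 1))  # 0.25
print(pass_at_k(toy, 3))  # 0.75
```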
You can install LEAP and QUIET-ML from PyPI with pip:
pip install autopipeline==0.1.393
After importing the library, you should set up your OpenAI API key and organization ID:
import autopipeline
autopipeline.api_key = "your-openai-api-key"
autopipeline.organization = "your-openai-organization"
You can execute LEAP using one function:
from autopipeline.Interactive import leap
result, table = leap(query, data, description)
Here, query is the user's query in natural language, data contains the unstructured data, and description contains the description of the unstructured data, which can be provided by the user or generated with a provided helper function:
from autopipeline.util import formalize_desc
desc_dict = {"doc_id": "the pdf document id", "pdf_orig": "the pdf document"}
description = formalize_desc(desc_dict)
result is the response to query on data, and table contains all the information necessary to answer query.
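Before building the description, it can help to confirm that every field in the data has an entry in desc_dict. The sketch below is a hypothetical helper, not part of the library. It assumes the keys of desc_dict correspond to the field (column) names of data, as the doc_id and pdf_orig example suggests; the "year" field is made up for illustration.

```python
from typing import Dict, Iterable, List


def missing_descriptions(columns: Iterable[str], desc_dict: Dict[str, str]) -> List[str]:
    """Return the field names that have no non-empty description in desc_dict.

    Hypothetical helper: assumes desc_dict is keyed by the data's field names.
    """
    return [name for name in columns if not desc_dict.get(name, "").strip()]


desc_dict = {"doc_id": "the pdf document id", "pdf_orig": "the pdf document"}

# Every field is described, so nothing is reported.
print(missing_descriptions(["doc_id", "pdf_orig"], desc_dict))  # []

# "year" is an illustrative extra field with no description yet.
print(missing_descriptions(["doc_id", "pdf_orig", "year"], desc_dict))  # ['year']
```

If the returned list is empty, desc_dict can be passed on to formalize_desc as shown above.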
We provide the following examples to demonstrate the features and use cases of LEAP:
General introduction (non-vague queries w/o unspecified numerical values)
Non-vague queries with unspecified numerical values
If you do not find the ML functions you need in LEAP's internal function list, feel free to Add UDFs.
We invited a legal researcher to use LEAP in their research. Below are representative examples that proved helpful:
Named entity recognition (NER)
QUIET-ML consists of 40 real-world social science research questions, the corresponding unstructured data, and the answers to these questions. You can load the QUIET-ML dataset from the library:
from autopipeline.data import QUIET_ML
dataset = QUIET_ML()
You can access a specific query with its qid (ranging from 1 to 40):
query_struct = dataset.query(qid)
query_struct is a dictionary where query_struct["query"] is the social science query in natural language, query_struct["data"] contains the unstructured data, and query_struct["desc"] contains descriptions of the unstructured data.
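The loading and execution steps above can be combined into a small loop. The sketch below is a hypothetical helper, not part of the library: run_queries takes the dataset and the runner as arguments so it can be tried with stand-ins before spending API credits. It assumes query_struct["desc"] can be passed directly as the description argument of leap, which the text above does not confirm.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_queries(
    dataset: Any,
    runner: Callable[[str, Any, Any], Tuple[Any, Any]],
    qids: Iterable[int],
) -> Dict[int, Any]:
    """Run `runner` on each selected query and collect the results by qid.

    Hypothetical helper: `dataset` must expose query(qid) returning a dict with
    "query", "data", and "desc" keys; `runner` must accept (query, data,
    description) and return a (result, table) pair, like leap.
    """
    results: Dict[int, Any] = {}
    for qid in qids:
        query_struct = dataset.query(qid)
        result, _table = runner(
            query_struct["query"], query_struct["data"], query_struct["desc"]
        )
        results[qid] = result
    return results


# With the real library installed and credentials set, usage might look like:
#   from autopipeline.data import QUIET_ML
#   from autopipeline.Interactive import leap
#   answers = run_queries(QUIET_ML(), leap, range(1, 41))
```

Passing a smaller range of qids first keeps the cost of a trial run low.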
To reproduce our quantitative results, we provide the following example code to load and test queries in QUIET-ML using LEAP:
