[VLDB'2025] LEAP: LLM-powered End-to-end Automatic Library for Processing Social Science Queries on Unstructured Data

uiuc-kang-lab/leap

LEAP and QUIET-ML

arXiv

This repository contains the code for the following paper:

LEAP: LLM-powered End-to-end Automatic Library for Processing Social Science Queries on Unstructured Data (VLDB'25)
Chuxuan Hu, Austin Peters, and Daniel Kang

In this paper, we introduce LEAP, an end-to-end library that supports social science research by automatically analyzing user-collected unstructured data in response to natural language queries. Along with LEAP, we present QUIET-ML, a dataset that comprehensively covers 120 popular queries in social science research. On QUIET-ML, LEAP achieves 100% pass@3 and 92% pass@1 at an average cost of $1.06 per query. Please refer to our paper for more details.

Installing and Importing

You can install LEAP and QUIET-ML from PyPI with pip:

pip install autopipeline==0.1.393

After importing the library, set your OpenAI API key and organization ID:

import autopipeline
autopipeline.api_key = "your-openai-api-key"
autopipeline.organization = "your-openai-organization"
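
To avoid hardcoding credentials in a script, the two values can be read from the environment first. The following is a minimal sketch: the helper name `load_openai_credentials` and the variable names `OPENAI_API_KEY` and `OPENAI_ORGANIZATION` are assumptions made for illustration and are not part of the library.

```python
import os


def load_openai_credentials(env=os.environ):
    """Return (api_key, organization) read from a mapping of environment variables.

    The variable names used here are assumptions for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    organization = env.get("OPENAI_ORGANIZATION")
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return api_key, organization


# Usage (requires autopipeline to be installed):
#   import autopipeline
#   autopipeline.api_key, autopipeline.organization = load_openai_credentials()
```

The helper only gathers the values; assigning them to `autopipeline.api_key` and `autopipeline.organization` works exactly as shown above.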

LEAP

You can execute LEAP using one function:

from autopipeline.Interactive import leap
result, table = leap(query, data, description)

Here, query is the user's query in natural language, data contains the unstructured data, and description describes the unstructured data. The description can be provided by the user or generated with the provided helper function:

from autopipeline.util import formalize_desc
desc_dict = {"doc_id": "the pdf document id", "pdf_orig": "the pdf document"}
description = formalize_desc(desc_dict)

result is the response to query on data, and table contains all necessary information to answer query.

LEAP's General Use Case Demo

We provide the following examples to demonstrate the features and use cases of LEAP:

General introduction (non-vague queries w/o unspecified numerical values)

Vague queries

Non-vague queries with unspecified numerical values

If you don't find the ML functions you need in LEAP's internal function list, feel free to Add UDFs.

Using LEAP in Legal Research

We invited a legal researcher to use LEAP in their research. Below are representative examples that proved helpful:

Named entity recognition (NER)

Part of speech (POS)

Legal text summary

Legal document analysis

QUIET-ML

QUIET-ML consists of 40 real-world social science research questions, the corresponding unstructured data, and the answers to these questions. You can load the QUIET-ML dataset from the library:

from autopipeline.data import QUIET_ML
dataset = QUIET_ML()

You can access a specific query with its qid (ranging from 1 to 40):

query_struct = dataset.query(qid)

query_struct is a dictionary where query_struct["query"] is the social science query in natural language, query_struct["data"] contains the unstructured data, and query_struct["desc"] contains descriptions of the unstructured data.
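
The three fields of query_struct line up with the three arguments of leap(query, data, description), so a query can be passed along in one step. The sketch below uses a hypothetical wrapper, `run_quiet_ml_query`, which is not part of the library. It also assumes that query_struct["desc"] is already in the form leap expects, which the text above does not confirm.

```python
def run_quiet_ml_query(query_struct, leap_fn):
    """Unpack a QUIET-ML query dictionary and pass its fields to a leap-style callable.

    leap_fn is any callable with the signature leap(query, data, description).
    Assumption: query_struct["desc"] can be used directly as the description.
    """
    return leap_fn(
        query_struct["query"],  # natural language query
        query_struct["data"],   # unstructured data
        query_struct["desc"],   # descriptions of the unstructured data
    )


# Usage (requires autopipeline and OpenAI credentials):
#   from autopipeline.Interactive import leap
#   from autopipeline.data import QUIET_ML
#   result, table = run_quiet_ml_query(QUIET_ML().query(1), leap)
```

Passing the callable in as an argument keeps the wrapper independent of the library import, so it can be exercised with a stand-in function before any API cost is incurred.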

Reproduction

To reproduce our quantitative results, we provide the following example code to load and test queries in QUIET-ML using LEAP:

Testing QUIET-ML
