feat: add self-query functionality #163

Robbe-Superlinear · 2025-10-13T11:52:02Z · edited by jirastorza

Self-Query: Automatic Metadata Filter Extraction

This pull request introduces a self-query feature, enabling automatic extraction of metadata filters from natural language queries using an LLM. This enhancement allows users to search more intuitively without manually specifying metadata filters.

Key Features

🔍 Self-Query Functionality

Automatic metadata extraction: Extracts metadata filters directly from natural language queries
Context-aware filtering: The Metadata table provides the LLM with available metadata fields and their possible values, ensuring generated filters are valid and grounded
Integration: Works with both vector_search and keyword_search methods
Configurable: Enable via RAGLiteConfig(self_query=True) (disabled by default)
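
The grounding idea in the list above can be sketched as a post-check on the LLM output. This is a minimal illustration, not the PR's implementation: `keep_grounded_values` and the in-memory `catalog` dict are hypothetical stand-ins for whatever the Metadata table and `_self_query` do internally.

```python
from collections.abc import Mapping, Sequence


def keep_grounded_values(
    candidate: Mapping[str, Sequence[str]],
    catalog: Mapping[str, Sequence[str]],
) -> dict[str, list[str]]:
    """Drop fields and values that do not appear in the metadata catalog."""
    grounded: dict[str, list[str]] = {}
    for field, values in candidate.items():
        allowed = set(catalog.get(field, ()))
        kept = [value for value in values if value in allowed]
        if kept:  # Skip fields that end up with no valid value.
            grounded[field] = kept
    return grounded


# Hypothetical catalog: field name -> values seen at insertion time.
catalog = {
    "manufacturer": ["Audi", "Honda", "Chevrolet"],
    "type": ["electric", "sedan", "truck"],
}
# Hypothetical raw LLM output with an unknown value and an unknown field.
llm_output = {"manufacturer": ["Audi", "Tesla"], "color": ["red"]}

print(keep_grounded_values(llm_output, catalog))  # {'manufacturer': ['Audi']}
```

Filtering after the fact is only one option; constraining the prompt to the catalogued values, as the description suggests, reduces how often such a check has anything to drop.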

📊 Metadata Management System

Normalized storage: All document and chunk metadata values are now stored as lists via the _adapt_metadata utility
Metadata tracking: New Metadata table tracks all metadata fields and their allowed unique values, providing a catalog of available filters for self-query
Automatic aggregation: Metadata table updated during document insertion
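
A rough sketch of the two steps described above, normalizing values to lists and aggregating a catalog of unique values per field. The helper names `as_list_metadata` and `aggregate_catalog` are hypothetical; the actual `_adapt_metadata` utility and Metadata table logic may differ. The sketch assumes metadata values are hashable.

```python
from collections import defaultdict
from typing import Any


def as_list_metadata(metadata: dict[str, Any]) -> dict[str, list[Any]]:
    """Wrap scalar values in a list; copy list and tuple values into a list."""
    return {
        key: list(value) if isinstance(value, (list, tuple)) else [value]
        for key, value in metadata.items()
    }


def aggregate_catalog(documents_metadata: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Collect the unique values observed for each metadata field."""
    unique_values: dict[str, set[Any]] = defaultdict(set)
    for metadata in documents_metadata:
        for key, values in as_list_metadata(metadata).items():
            unique_values[key].update(values)
    # Sort by string form so mixed value types still produce a stable order.
    return {key: sorted(values, key=str) for key, values in unique_values.items()}


docs = [
    {"manufacturer": "Audi", "year": 2022},
    {"manufacturer": "Honda", "year": 2023, "tags": ["sedan", "compact"]},
    {"manufacturer": "Audi", "year": 2015},
]
print(as_list_metadata(docs[0]))  # {'manufacturer': ['Audi'], 'year': [2022]}
print(aggregate_catalog(docs))
```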

Performance Benchmarks

Dataset: CUAD (Contract Understanding Atticus Dataset)
Settings: Default RAGLite benchmarking configuration

Metric	Self-Query	Main
Exact Matching MAP	0.6330	0.6202
Exact Transformed Matching MAP	0.6342	0.6212
Answers Found Ratio	0.7641	0.8653
Average Rank of Found	2.2100	2.2409
Std Dev Rank of Found	2.0487	2.0765
Median Rank of Found	1.0	1.0
Mean Reciprocal Rank (MRR)	0.5526	0.6212
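
The rank-based rows of the table can be related to each other with a small sketch. The benchmark's exact definitions are not given in this PR, so this assumes the common convention that a query whose answer is not retrieved contributes zero to MRR; `rank_metrics` is a hypothetical helper.

```python
from statistics import mean, median
from typing import Optional


def rank_metrics(ranks: list[Optional[int]]) -> dict[str, Optional[float]]:
    """Summarize 1-based ranks of the first relevant chunk; None means not found."""
    found = [rank for rank in ranks if rank is not None]
    return {
        "answers_found_ratio": len(found) / len(ranks),
        "average_rank_of_found": mean(found) if found else None,
        "median_rank_of_found": median(found) if found else None,
        # Assumption: a missing answer counts as a reciprocal rank of 0.
        "mrr": mean(1 / rank if rank is not None else 0.0 for rank in ranks),
    }


example = rank_metrics([1, 2, None, 4])
print(example)
```

Under that convention, a lower Answers Found Ratio pulls MRR down even when the average rank of the answers that are found improves slightly, which matches the pattern in the table.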

On CUAD, self-query adds little when the extracted filter is correct and hurts retrieval when it is not: MAP barely moves (0.6330 vs. 0.6202), while Answers Found Ratio drops from 0.8653 to 0.7641 and MRR from 0.6212 to 0.5526. In CUAD, every chunk begins with a header like # <Agreement Category> between <Company A> and <Company B>, which already states the agreement type and the companies involved. This gives standard RAG a strong advantage in retrieving relevant chunks and leaves little for a metadata filter to add. Self-query might prove more useful where chunks do not include such descriptive headers carrying document-level metadata.

Each self-query–generated filter was compared to the metadata of the golden chunk. A filter was considered correct when the predicted categories and companies exactly matched the ground truth. If the filter included extra categories or companies beyond the ground truth, it was labeled overspecified. If it missed some expected elements, it was marked underspecified. When the filter didn’t align as either a subset or a superset of the ground truth, it was categorized as a mismatch. Mismatch and overspecified cases are the most critical, as they can lead to retrieving zero relevant chunks and severely impact performance.
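
The four labels above map directly onto set comparisons. The sketch below assumes the comparison is done per field (for example, on the set of predicted companies versus the golden chunk's companies); `classify_filter` is a hypothetical name, not a function from the benchmark code.

```python
def classify_filter(predicted: set[str], truth: set[str]) -> str:
    """Label a predicted value set relative to the ground-truth value set."""
    if predicted == truth:
        return "correct"
    if predicted > truth:  # Strict superset: extra values beyond the truth.
        return "overspecified"
    if predicted < truth:  # Strict subset: some expected values are missing.
        return "underspecified"
    return "mismatch"  # Neither a subset nor a superset.


truth = {"Company A", "Company B"}
print(classify_filter({"Company A", "Company B"}, truth))               # correct
print(classify_filter({"Company A", "Company B", "Company C"}, truth))  # overspecified
print(classify_filter({"Company A"}, truth))                            # underspecified
print(classify_filter({"Company A", "Company C"}, truth))               # mismatch
```

Note that an empty prediction falls under "underspecified" in this sketch, since the empty set is a strict subset of any non-empty ground truth.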

The confusion matrices show where self-query fails:

Category errors often come from near-identical labels, e.g. predicting Distributorship Agreement instead of Distributor Agreement.
Company errors are more varied, sometimes due to near-duplicates (e.g., Gridiron BioNutrients vs Gridiron BioNutrients, Inc.).
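
Both error types above are near misses that an exact-match filter treats as complete misses. One possible mitigation, shown here purely as an illustration, is to snap a predicted value to the closest catalogued value with the standard library's `difflib`. The helper `snap_to_catalog`, the 0.8 cutoff, and the extra catalog entries ("License Agreement", "Acme Corp") are assumptions for the example, not part of this PR.

```python
from difflib import get_close_matches
from typing import Optional


def snap_to_catalog(predicted: str, allowed: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest allowed value, or None when nothing is similar enough."""
    matches = get_close_matches(predicted, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


categories = ["Distributor Agreement", "License Agreement"]
companies = ["Gridiron BioNutrients", "Acme Corp"]

print(snap_to_catalog("Distributorship Agreement", categories))   # Distributor Agreement
print(snap_to_catalog("Gridiron BioNutrients, Inc.", companies))  # Gridiron BioNutrients
print(snap_to_catalog("Totally Different", categories))           # None
```

Fuzzy matching can also snap to the wrong value when two catalogue entries are themselves near-duplicates, so the cutoff would need tuning on real data.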

Usage Example

from raglite import Document, RAGLiteConfig, insert_documents, rag
from raglite._search import _self_query
# Configure with self-query enabled
my_config = RAGLiteConfig(
    db_url="duckdb:///raglite.db",
    llm="gpt-4.1-nano",
    embedder="text-embedding-3-small",
    self_query=True,  # Enable automatic metadata extraction
)
# Insert documents with metadata
car_docs = [
    Document.from_text(
        "# Audi e-tron\nThe Audi e-tron is a fully electric mid-size luxury crossover SUV.",
        manufacturer="Audi",
        year=2022,
        type="electric",
    ),
    Document.from_text(
        "# Honda Civic\nThe Honda Civic is a line of cars manufactured by Honda since 1972.",
        manufacturer="Honda",
        year=2023,
        type="sedan",
    ),
    Document.from_text(
        "# Chevrolet Silverado\nThe Chevrolet Silverado is a range of trucks by General Motors.",
        manufacturer="Chevrolet",
        year=2015,
        type="truck",
    ),
]
insert_documents(car_docs, config=my_config)
# Query naturally - metadata filters extracted automatically
query = "What car does Audi offer?"
metadata_filter = _self_query(query, config=my_config)
print(metadata_filter)  # {'manufacturer': ['Audi']}
# Use in RAG pipeline
messages = [{"role": "user", "content": query}]
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")
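
To make the effect of an extracted filter concrete, the sketch below applies a list-valued filter to list-valued chunk metadata. The containment semantics (every filter value must be present on the chunk) are an assumption inferred from the note that overspecified filters can return zero chunks; `chunk_matches` is a hypothetical helper and has not been checked against RAGLite's search code.

```python
def chunk_matches(chunk_metadata: dict[str, list], metadata_filter: dict[str, list]) -> bool:
    """Return True when every filter value is present in the chunk's list for that field."""
    return all(
        set(values) <= set(chunk_metadata.get(field, []))
        for field, values in metadata_filter.items()
    )


# List-valued metadata mirroring the three car documents in the usage example.
cars = [
    {"manufacturer": ["Audi"], "year": [2022], "type": ["electric"]},
    {"manufacturer": ["Honda"], "year": [2023], "type": ["sedan"]},
    {"manufacturer": ["Chevrolet"], "year": [2015], "type": ["truck"]},
]

exact = {"manufacturer": ["Audi"]}
overspecified = {"manufacturer": ["Audi", "Honda"]}

print([car["manufacturer"][0] for car in cars if chunk_matches(car, exact)])  # ['Audi']
print([car for car in cars if chunk_matches(car, overspecified)])             # []
```

Under these semantics an empty filter matches every chunk, so a self-query call that extracts nothing falls back to unfiltered search.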

Copilot

Pull Request Overview

This PR adds self-query functionality to the RAGLite library, enabling automatic extraction of metadata filters from natural language queries to improve search precision.

Implements _self_query function that uses LLM to extract metadata filters from queries
Adds metadata tracking in the database with a new Metadata table
Integrates self-query capability into the retrieval pipeline with a configurable flag

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File	Description
src/raglite/_config.py	Adds `self_query` boolean flag to RAGLiteConfig
src/raglite/_database.py	Defines new Metadata table for tracking available metadata values
src/raglite/_insert.py	Implements metadata aggregation and database updates during document insertion
src/raglite/_rag.py	Adds core self-query functionality and integrates it into retrieval pipeline
tests/test_insert.py	Tests metadata tracking functionality
tests/test_rag.py	Tests self-query extraction and retrieval integration

emilradix · 2025-10-15T12:00:56Z

Could you add a PR description? You can edit Robbe's first comment to add it. @jirastorza

Robbe-Superlinear

Some small comments, but one big topic. I propose to have a sync, when you have the time, to align.

Robbe-Superlinear

LGTM

emilradix · 2025-10-22T08:55:01Z

We will probably need a more difficult dataset to see a performance gain from this. I think we are already finding all chunks from the right document, so it is just a matter of ordering them, which the filter won't help with. So the result is not that surprising, in my opinion.

Also, could you make sure the PR description is up to date? For example, did you incorporate the change that ensures all metadata used for filtering has its values stored as a list? If so, it should be mentioned in the description. @jirastorza

jirastorza added 3 commits October 9, 2025 14:16

feat: add self-query functionality

11a3850

fix: modified self_query_prompt

f954bca

fix: modified self_query_prompt

f0e66da

Robbe-Superlinear commented Oct 13, 2025

Robbe-Superlinear assigned Robbe-Superlinear and jirastorza Oct 13, 2025

jirastorza added 2 commits October 14, 2025 07:17

fix: code simplification

2e6c436

fix: test rag

3507ad5

jirastorza marked this pull request as ready for review October 14, 2025 13:01

jirastorza marked this pull request as draft October 14, 2025 13:03

fix: add self_query option to config and update tool calling logic.

238d3a1

jirastorza marked this pull request as ready for review October 14, 2025 13:56

emilradix requested a review from Copilot October 15, 2025 10:46

Copilot AI reviewed Oct 15, 2025

jirastorza and others added 8 commits October 15, 2025 14:02

fix: corret logger

b8055da

Co-authored-by: Copilot <[email protected]>

fix: linting

9e32790

fix: simplify rag test.

c8e4fa9

fix: remove repetitive self_query call.

ff97cd2

fix: move self_query to _search.py

e12ed5b

fix: modify test structure.

752ea2b

fix: allow list metadata values.

b0b46a6

fix: allow list type metadata handling.

5d575e9

Robbe-Superlinear commented Oct 17, 2025

jirastorza added 5 commits October 17, 2025 11:40

fix: reduce MetadataValues to hashable types, modify document metadata storage, simplified self_query and insert logic.

b32f070

fix: adapt test.

f937fe6

fix: adapt test case to changes.

f68d1c7

fix: additional test fix.

ecbcae2

fix: database chunk and document metadata.

fb5a01b

Robbe-Superlinear removed their assignment Oct 22, 2025

Robbe-Superlinear commented Oct 22, 2025

jirastorza added 4 commits October 22, 2025 11:27

fix: update README.

15a6000

Merge remote-tracking branch 'origin/main' into self-query

f20c512

fix: ensure metadata is stored as proper JSON without escape characters

1e10550

fix: handle hex byte escape sequences in metadata filter values

723931d

jirastorza closed this Oct 29, 2025

jirastorza reopened this Oct 29, 2025

jirastorza added 8 commits October 30, 2025 08:31

fix: sanitize LLM metadata output to remove NULs and decode escaped chars

775cae3

docs: clarify comment explaining why LLM output is cleaned after extraction

1e3cb2d

fix: remove metadata filter decoding

ed8558e

fix: decode escaped Unicode sequences in metadata_filter

5586c08

fix: encode query with ensure_ascii for consistent Unicode handling in self_query

f83e57a

feat: use ID-based metadata mapping for more reliable self-query extraction

dc4e62a

feat: use ID-based metadata mapping for more reliable self-query extraction

1e2c9a0

fix: update self_query template for small model extraction

f8225f5

4 participants
