@Robbe-Superlinear Robbe-Superlinear commented Oct 13, 2025

Self-Query: Automatic Metadata Filter Extraction

This pull request introduces a self-query feature, enabling automatic extraction of metadata filters from natural language queries using an LLM. This enhancement allows users to search more intuitively without manually specifying metadata filters.

Key Features

🔍 Self-Query Functionality

  • Automatic metadata extraction: Extracts metadata filters directly from natural language queries
  • Context-aware filtering: The Metadata table provides the LLM with available metadata fields and their possible values, ensuring generated filters are valid and grounded
  • Integration: Works with both vector_search and keyword_search methods
  • Configurable: Enable via RAGLiteConfig(self_query=True) (disabled by default)
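One way to picture the grounding step described above is a check that keeps only the fields and values already present in the metadata catalog. This is a minimal sketch, not the PR's implementation: the `catalog` dictionary and the `validate_filter` helper are hypothetical names, and dropping unknown values is an assumption about how invalid LLM output could be handled.

```python
from typing import Any

# Hypothetical catalog: metadata field -> values seen at insertion time.
catalog: dict[str, list[Any]] = {
    "manufacturer": ["Audi", "Honda", "Chevrolet"],
    "year": [2015, 2022, 2023],
    "type": ["electric", "sedan", "truck"],
}


def validate_filter(
    candidate: dict[str, list[Any]], catalog: dict[str, list[Any]]
) -> dict[str, list[Any]]:
    """Keep only the fields and values that exist in the catalog."""
    valid: dict[str, list[Any]] = {}
    for field, values in candidate.items():
        allowed = catalog.get(field)
        if allowed is None:
            continue  # Unknown field: drop it entirely.
        kept = [value for value in values if value in allowed]
        if kept:
            valid[field] = kept
    return valid


# Simulated LLM output with one valid field, one unknown field, one unknown value.
llm_output = {"manufacturer": ["Audi"], "color": ["red"], "year": [1999]}
print(validate_filter(llm_output, catalog))  # {'manufacturer': ['Audi']}
```

Under these assumptions, a filter that names nothing from the catalog collapses to an empty dictionary, which would amount to an unfiltered search.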

📊 Metadata Management System

  • Normalized storage: All document and chunk metadata values are now stored as lists via the _adapt_metadata utility
  • Metadata tracking: A new Metadata table tracks all metadata fields and their allowed unique values, providing a catalog of available filters for self-query
  • Automatic aggregation: The Metadata table is updated during document insertion
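The normalization and aggregation steps can be sketched in plain Python. The helpers below (`normalize_metadata`, `aggregate_catalog`) are hypothetical stand-ins, not the PR's `_adapt_metadata` or its database code, and the sample documents are illustrative.

```python
from collections import defaultdict
from typing import Any


def normalize_metadata(metadata: dict[str, Any]) -> dict[str, list[Any]]:
    """Wrap scalar values in a list; copy lists and tuples into lists."""
    return {
        key: list(value) if isinstance(value, (list, tuple)) else [value]
        for key, value in metadata.items()
    }


def aggregate_catalog(documents: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Collect the unique values seen for each metadata field, in insertion order."""
    catalog: dict[str, list[Any]] = defaultdict(list)
    for metadata in documents:
        for field, values in normalize_metadata(metadata).items():
            for value in values:
                if value not in catalog[field]:
                    catalog[field].append(value)
    return dict(catalog)


docs = [
    {"manufacturer": "Audi", "year": 2022, "type": "electric"},
    {"manufacturer": "Honda", "year": 2023, "type": ["sedan", "hybrid"]},
    {"manufacturer": "Audi", "year": 2015, "type": "electric"},
]
catalog = aggregate_catalog(docs)
print(catalog["manufacturer"])  # ['Audi', 'Honda']
```

Storing every value as a list means a filter can treat single-valued and multi-valued fields the same way.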

Performance Benchmarks

Dataset: CUAD (Contract Understanding Atticus Dataset)
Settings: Default RAGLite benchmarking configuration

| Metric | Self-Query | Main |
| --- | --- | --- |
| Exact Matching MAP | 0.6330 | 0.6202 |
| Exact Transformed Matching MAP | 0.6342 | 0.6212 |
| Answers Found Ratio | 0.7641 | 0.8653 |
| Average Rank of Found | 2.2100 | 2.2409 |
| Std Dev Rank of Found | 2.0487 | 2.0765 |
| Median Rank of Found | 1.0 | 1.0 |
| Mean Reciprocal Rank (MRR) | 0.5526 | 0.6212 |
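To see how a lower Answers Found Ratio can pull MRR down even when the rank of found answers barely changes, here is a small sketch using the standard definitions. The exact definitions used by the RAGLite benchmark are not stated in this PR, so treating an unfound answer as a reciprocal rank of 0 is an assumption, and the rank lists are toy data.

```python
from statistics import mean
from typing import Optional


def summarize(ranks: list[Optional[int]]) -> dict[str, float]:
    """Summarize per-query ranks of the first relevant chunk (None = not found)."""
    found = [rank for rank in ranks if rank is not None]
    return {
        "answers_found_ratio": len(found) / len(ranks),
        "average_rank_of_found": mean(found),
        # Assumption: a query with no relevant chunk contributes 0 to MRR.
        "mrr": mean(1 / rank if rank is not None else 0.0 for rank in ranks),
    }


baseline = summarize([1, 2, 1, 4])     # every answer found
filtered = summarize([1, 2, 1, None])  # a wrong filter hides one answer
print(baseline["mrr"], filtered["mrr"])  # 0.6875 0.625
```

In this toy case the average rank of found answers improves after filtering, yet MRR drops because one query returns nothing relevant.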

Self-query adds little value when the extracted filter is correct, and performance drops when it is not: MAP improves only marginally (0.6202 to 0.6330), while the Answers Found Ratio falls from 0.8653 to 0.7641 and MRR from 0.6212 to 0.5526. In CUAD, every chunk begins with a header like # <Agreement Category> between <Company A> and <Company B>. Because this header already states the agreement type and the companies involved, standard RAG has a strong advantage in retrieving relevant chunks, and a metadata filter has little left to add. Self-query might prove more useful in cases where chunks do not include such descriptive headers carrying document-level metadata.

[Figure: tag_distribution]

Each self-query–generated filter was compared to the metadata of the golden chunk. A filter was considered correct when the predicted categories and companies exactly matched the ground truth. If the filter included extra categories or companies beyond the ground truth, it was labeled overspecified. If it missed some expected elements, it was marked underspecified. When the filter didn’t align as either a subset or a superset of the ground truth, it was categorized as a mismatch. Mismatch and overspecified cases are the most critical, as they can lead to retrieving zero relevant chunks and severely impact performance.
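The four labels map naturally onto set comparisons. The sketch below assumes each filter and each ground truth can be flattened into a single set of values; how the PR combines categories and companies is not specified, and `classify_filter` is a hypothetical name.

```python
def classify_filter(predicted: set[str], truth: set[str]) -> str:
    """Label a predicted filter relative to the ground-truth metadata."""
    if predicted == truth:
        return "correct"
    if predicted > truth:  # strict superset: extra elements
        return "overspecified"
    if predicted < truth:  # strict subset: missing elements
        return "underspecified"
    return "mismatch"      # neither subset nor superset


truth = {"Distributor Agreement", "Company A"}
print(classify_filter({"Distributor Agreement", "Company A"}, truth))  # correct
print(classify_filter({"Distributor Agreement"}, truth))               # underspecified
print(classify_filter({"Distributorship Agreement", "Company A"}, truth))  # mismatch
```

Note that under this sketch an empty predicted filter counts as underspecified, since the empty set is a strict subset of any non-empty ground truth.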
[Figure: category_confusion_matrix]
[Figure: companies_confusion_matrix]

Here we can see where self-query fails. The confusion matrices show that:

  • Category errors often come from near-identical labels, such as predicting Distributorship Agreement instead of Distributor Agreement.
  • Company errors are more varied, sometimes due to near-duplicates (e.g., Gridiron BioNutrients vs Gridiron BioNutrients, Inc.).
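Near-duplicate errors of this kind could in principle be caught by snapping a predicted value to the closest catalog entry. This is not part of the PR; it is a sketch of one possible mitigation using the standard library's `difflib`, with a hypothetical `snap_to_catalog` helper and an arbitrarily chosen similarity cutoff of 0.8.

```python
from difflib import get_close_matches
from typing import Optional


def snap_to_catalog(value: str, allowed: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the allowed value most similar to `value`, or None if none is close."""
    if value in allowed:
        return value
    matches = get_close_matches(value, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


allowed_categories = ["Distributor Agreement", "License Agreement"]
print(snap_to_catalog("Distributorship Agreement", allowed_categories))
# Distributor Agreement
```

A cutoff that is too low would map unrelated values onto catalog entries, so any such threshold would need tuning against the confusion matrices.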

Usage Example

from raglite import Document, RAGLiteConfig, insert_documents, rag
from raglite._search import _self_query
# Configure with self-query enabled
my_config = RAGLiteConfig(
    db_url="duckdb:///raglite.db",
    llm="gpt-4.1-nano",
    embedder="text-embedding-3-small",
    self_query=True,  # Enable automatic metadata extraction
)
# Insert documents with metadata
car_docs = [
    Document.from_text(
        "# Audi e-tron\nThe Audi e-tron is a fully electric mid-size luxury crossover SUV.",
        manufacturer="Audi",
        year=2022,
        type="electric",
    ),
    Document.from_text(
        "# Honda Civic\nThe Honda Civic is a line of cars manufactured by Honda since 1972.",
        manufacturer="Honda",
        year=2023,
        type="sedan",
    ),
    Document.from_text(
        "# Chevrolet Silverado\nThe Chevrolet Silverado is a range of trucks by General Motors.",
        manufacturer="Chevrolet",
        year=2015,
        type="truck",
    ),
]
insert_documents(car_docs, config=my_config)
# Query naturally - metadata filters extracted automatically
query = "What car does Audi offer?"
metadata_filter = _self_query(query, config=my_config)
print(metadata_filter)  # {'manufacturer': ['Audi']}
# Use in RAG pipeline
messages = [{"role": "user", "content": query}]
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")

@jirastorza jirastorza marked this pull request as ready for review October 14, 2025 13:01
@jirastorza jirastorza marked this pull request as draft October 14, 2025 13:03
@jirastorza jirastorza marked this pull request as ready for review October 14, 2025 13:56
@emilradix emilradix requested a review from Copilot October 15, 2025 10:46
Copilot AI left a comment

Pull Request Overview

This PR adds self-query functionality to the RAGLite library, enabling automatic extraction of metadata filters from natural language queries to improve search precision.

  • Implements a _self_query function that uses an LLM to extract metadata filters from queries
  • Adds metadata tracking in the database with a new Metadata table
  • Integrates self-query capability into the retrieval pipeline with a configurable flag

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/raglite/_config.py | Adds self_query boolean flag to RAGLiteConfig |
| src/raglite/_database.py | Defines new Metadata table for tracking available metadata values |
| src/raglite/_insert.py | Implements metadata aggregation and database updates during document insertion |
| src/raglite/_rag.py | Adds core self-query functionality and integrates it into retrieval pipeline |
| tests/test_insert.py | Tests metadata tracking functionality |
| tests/test_rag.py | Tests self-query extraction and retrieval integration |

@emilradix
Contributor

Could you add a PR description? You can edit Robbe's first comment to add it. @jirastorza

Author

@Robbe-Superlinear Robbe-Superlinear left a comment

Some small comments, but one big topic. I propose to have a sync, when you have the time, to align.

@Robbe-Superlinear Robbe-Superlinear removed their assignment Oct 22, 2025
Author

@Robbe-Superlinear Robbe-Superlinear left a comment

LGTM

@emilradix
Contributor

emilradix commented Oct 22, 2025

We will probably need a more difficult dataset for benchmarking this to see a gain in performance. I think we are just finding all chunks from the right document, and it is just a matter of ordering them, which the filter won't help with. So this is not so surprising, in my opinion.

Also, could you make sure the PR description is up to date? For example, did you incorporate the change ensuring that all metadata used for filtering has its values stored as a list? If so, it should be mentioned in the description. @jirastorza

@jirastorza jirastorza closed this Oct 29, 2025
@jirastorza jirastorza reopened this Oct 29, 2025