feat: add self-query functionality #163
Pull Request Overview
This PR adds self-query functionality to the RAGLite library, enabling automatic extraction of metadata filters from natural language queries to improve search precision.
- Implements `_self_query` function that uses LLM to extract metadata filters from queries
- Adds metadata tracking in the database with a new `Metadata` table
- Integrates self-query capability into the retrieval pipeline with a configurable flag
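A minimal sketch of the metadata-tracking idea, assuming the catalog is simply a mapping from each metadata field to the unique values seen at insertion time, with every value normalized to a list first. The names `as_list` and `build_metadata_catalog` are hypothetical and are not taken from the RAGLite code base; the actual `Metadata` table in the PR may be structured differently.

```python
from collections import defaultdict
from typing import Any


def as_list(value: Any) -> list[Any]:
    """Wrap a scalar metadata value in a list; copy list-like values into a list."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return list(value)
    return [value]


def build_metadata_catalog(documents: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Map each metadata field to the sorted unique values seen across documents.

    Hypothetical helper: it only illustrates the aggregation step, not the
    database writes performed during document insertion.
    """
    catalog: dict[str, set[Any]] = defaultdict(set)
    for metadata in documents:
        for field, value in metadata.items():
            catalog[field].update(as_list(value))
    return {field: sorted(values, key=str) for field, values in catalog.items()}


# Illustrative document metadata (made-up values).
docs = [
    {"category": "License Agreement", "parties": ["Acme", "Globex"]},
    {"category": "Supply Agreement", "parties": ["Acme"]},
]
catalog = build_metadata_catalog(docs)
```

A catalog of this shape is what could be shown to the LLM so that it only proposes filter values that actually exist in the database.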
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/raglite/_config.py | Adds self_query boolean flag to RAGLiteConfig |
| src/raglite/_database.py | Defines new Metadata table for tracking available metadata values |
| src/raglite/_insert.py | Implements metadata aggregation and database updates during document insertion |
| src/raglite/_rag.py | Adds core self-query functionality and integrates it into retrieval pipeline |
| tests/test_insert.py | Tests metadata tracking functionality |
| tests/test_rag.py | Tests self-query extraction and retrieval integration |
Could you add a PR description? You can edit Robbe's first comment to add it. @jirastorza
Co-authored-by: Copilot <[email protected]>
Robbe-Superlinear left a comment
Some small comments, but one big topic. I propose to have a sync, when you have the time, to align.
…a storage, simplified self_query and insert logic.
Robbe-Superlinear left a comment
LGTM
We will probably need a more difficult dataset for benchmarking this to see a gain in performance. I think we are just finding all chunks from the right document, and it is just a matter of ordering them, which the filter won't help with. So this is not so surprising, in my opinion. Also, could you make sure the PR description is up to date? For example, did you incorporate the change ensuring that all metadata used for filtering have their values stored as a list? If so, it should be mentioned in the description. @jirastorza
Self-Query: Automatic Metadata Filter Extraction
This pull request introduces a self-query feature, enabling automatic extraction of metadata filters from natural language queries using an LLM. This enhancement allows users to search more intuitively without manually specifying metadata filters.
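To make the idea concrete, here is a rough sketch of how an LLM-proposed filter could be checked against the known metadata before it is applied. Everything in it is an assumption for illustration: the names `ground_filter` and `matches` are hypothetical, the filter is modeled as a plain dictionary of field to list of values, and the matching rule (a chunk must contain every requested value) is one plausible choice, not necessarily what the PR implements.

```python
from typing import Any

Filter = dict[str, list[Any]]


def ground_filter(proposed: Filter, catalog: Filter) -> Filter:
    """Drop fields and values that do not appear in the metadata catalog."""
    grounded: Filter = {}
    for field, values in proposed.items():
        allowed = set(catalog.get(field, []))
        kept = [value for value in values if value in allowed]
        if kept:  # Skip fields with no valid values left.
            grounded[field] = kept
    return grounded


def matches(chunk_metadata: Filter, metadata_filter: Filter) -> bool:
    """Assumed rule: the chunk must contain every requested value of every field."""
    return all(
        set(values) <= set(chunk_metadata.get(field, []))
        for field, values in metadata_filter.items()
    )


# Illustrative catalog and LLM output (made-up values).
catalog = {"category": ["License Agreement"], "parties": ["Acme", "Globex"]}
proposed = {"category": ["License Agreement"], "parties": ["Acme", "Initech"], "year": [2020]}
grounded = ground_filter(proposed, catalog)
```

Under this assumed rule, an empty filter matches every chunk, so a query from which no metadata can be extracted falls back to unfiltered search.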
Key Features
🔍 Self-Query Functionality
- `Metadata` table provides the LLM with available metadata fields and their possible values, ensuring generated filters are valid and grounded
- … `vector_search` and `keyword_search` methods
- Enabled via `RAGLiteConfig(self_query=True)` (disabled by default)

📊 Metadata Management System
- … `_adapt_metadata` utility
- `Metadata` table tracks all metadata fields and their allowed unique values, providing a catalog of available filters for self-query

Performance Benchmarks
Dataset: CUAD (Contract Understanding Atticus Dataset)
Settings: Default RAGLite benchmarking configuration
Self-query doesn’t add much value when it works, and performance drops when it doesn’t. In CUAD, every chunk begins with a header like `# <Agreement Category> between <Company A> and <Company B>`, which already captures key metadata-like information and reduces the added value of self-query. Having this clear header that states the agreement type and companies involved gives standard RAG a strong advantage in retrieving relevant chunks. Self-query might prove more useful in cases where chunks do not include such descriptive headers carrying document-level metadata.

Each self-query–generated filter was compared to the metadata of the golden chunk. A filter was considered correct when the predicted categories and companies exactly matched the ground truth. If the filter included extra categories or companies beyond the ground truth, it was labeled overspecified. If it missed some expected elements, it was marked underspecified. When the filter didn’t align as either a subset or a superset of the ground truth, it was categorized as a mismatch. Mismatch and overspecified cases are the most critical, as they can lead to retrieving zero relevant chunks and severely impact performance.
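The four labels described above map directly onto set comparisons. The sketch below is one way to express that, assuming the predicted filter and the golden chunk's metadata are each flattened into a set of values; `classify_filter` is a hypothetical name, not a function from the PR.

```python
def classify_filter(predicted: set[str], truth: set[str]) -> str:
    """Label a predicted filter relative to the golden chunk's metadata."""
    if predicted == truth:
        return "correct"
    if predicted > truth:  # Proper superset: extra categories or companies.
        return "overspecified"
    if predicted < truth:  # Proper subset: some expected elements are missing.
        return "underspecified"
    return "mismatch"  # Neither a subset nor a superset of the ground truth.


# Illustrative ground truth (made-up values).
truth = {"License Agreement", "Acme", "Globex"}

labels = [
    classify_filter({"License Agreement", "Acme", "Globex"}, truth),
    classify_filter({"License Agreement", "Acme", "Globex", "Initech"}, truth),
    classify_filter({"License Agreement", "Acme"}, truth),
    classify_filter({"Supply Agreement", "Acme"}, truth),
]
```

Note that with this definition an empty predicted filter counts as underspecified whenever the ground truth is non-empty, since the empty set is a proper subset of it.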


Here we can see where self-query fails. The confusion matrices show that:
Usage Example