Add support for preprocessing user prompts and metadata filter for retrieval and answering #531
A lot of what I have read about query preprocessing seems to reduce to sending the prompt to an LLM and/or ML model for reformulation. (This is separate from the LLM that is part of the 'G' stage of RAG.) For example, sending the user prompt to some form of classifier model that returns a set of relevant metadata tags, e.g. the prompt is 'Legal'. Or sending the user prompt to an LLM whose system prompt contains a number of examples of how an optimal query should be written; the LLM then tries to rewrite the user query in this optimal way. I don't know if this aligns with what @pmeier has written above, e.g. this may be something like an agentic workflow; it would need to be confirmed with @pmeier before proceeding any further. There are also more advanced techniques for query reformulation and retrieval, e.g. HyDE. I'm not aware of how many such techniques realistically exist; all I know is that the techniques that get all the attention on X and Reddit tend to involve sending the prompt to an LLM and/or ML model. Some relevant links: https://x.com/andrew_n_carr/status/1731058196801077409?t=GnGM2txliCJi-8svc-bX4g&s=03
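As a rough sketch of the LLM-rewriting variant described above (the `llm_complete` callable and the example pairs are placeholders for illustration, not part of Ragna):

```python
# System prompt with examples of how an optimal retrieval query should be
# written; the examples here are invented for illustration.
REWRITE_SYSTEM_PROMPT = """\
Rewrite the user's question into an optimal retrieval query.

Examples:
User: how do I make ragna talk to my own db
Query: configure a custom source storage backend

User: pdf upload broken??
Query: troubleshooting PDF document upload errors
"""


def rewrite_query(prompt: str, llm_complete) -> str:
    """Ask an LLM to reformulate the user prompt for retrieval.

    `llm_complete` is any callable that sends (system, user) messages to a
    chat model and returns the text of the reply.
    """
    return llm_complete(system=REWRITE_SYSTEM_PROMPT, user=prompt)
```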
I am in agreement with @nenb. It's hard to see how you could take text input and efficiently preprocess it without using an LLM. That being said, I think the code can be written in such a way that leaves the door open for non-LLM preprocessors if someone wanted to go down that road. I gave this discussion to Claude and it suggested this as a starting point, which looks pretty good to me:

```python
import abc
from dataclasses import dataclass, field
from typing import List, Optional

from ragna.core import Assistant, MetadataFilter


@dataclass
class QueryProcessingStep:
    """Represents a single preprocessing step"""

    original_query: str
    processed_query: str
    metadata_filter: Optional[MetadataFilter] = None
    processor_name: str = ""
    metadata: dict = field(default_factory=dict)


@dataclass
class ProcessedQuery:
    """Results of query preprocessing"""

    retrieval_query: str
    answer_query: str
    metadata_filter: Optional[MetadataFilter]
    processing_history: List[QueryProcessingStep] = field(default_factory=list)


class QueryPreprocessor(abc.ABC):
    """Abstract base class for query preprocessors"""

    @abc.abstractmethod
    def process(
        self,
        query: str,
        metadata_filter: Optional[MetadataFilter] = None,
    ) -> ProcessedQuery:
        """Process a query before retrieval"""
        ...

    @classmethod
    def display_name(cls) -> str:
        return cls.__name__


# Example implementations
class DefaultQueryPreprocessor(QueryPreprocessor):
    """Default no-op preprocessor"""

    def process(
        self, query: str, metadata_filter: Optional[MetadataFilter] = None
    ) -> ProcessedQuery:
        return ProcessedQuery(
            retrieval_query=query,
            answer_query=query,
            metadata_filter=metadata_filter,
            processing_history=[
                QueryProcessingStep(query, query, metadata_filter, self.display_name())
            ],
        )


class LLMQueryPreprocessor(QueryPreprocessor):
    """Uses an LLM to preprocess queries"""

    def __init__(self, llm_assistant: Assistant):
        self.llm = llm_assistant

    def process(
        self, query: str, metadata_filter: Optional[MetadataFilter] = None
    ) -> ProcessedQuery:
        # Use the LLM to generate improved retrieval and answer queries,
        # track the processing steps, and return the processed query.
        ...
```

Also, similar to one of @pmeier's points above: do we want to enforce agreement between the prompt and the metadata filter, or should this be handled by the user? For example, the prompt could be something like "what happened 2 years ago" while the metadata filter points somewhere else entirely.
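A hypothetical illustration of such a mismatch, assuming `MetadataFilter` combinators along the lines of `and_`, `eq`, and `ge` (the metadata field names here are made up):

```python
from ragna.core import MetadataFilter

# The prompt asks about events from two years ago ...
prompt = "what happened 2 years ago"

# ... but the filter only admits recent documents, so prompt and filter
# disagree about the time range.
metadata_filter = MetadataFilter.and_(
    [
        MetadataFilter.eq("department", "legal"),
        MetadataFilter.ge("year", 2024),
    ]
)
```

Enforcing agreement would mean detecting conflicts like this automatically; leaving it to the user keeps the core simple.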
Thanks for the additional points, Nick!

That is good, because then we are all in agreement 🙂. I'm aware that most preprocessors will use LLMs, and that is fine; I just don't want to build this in a way that requires LLMs. This is similar to the source storages that we have: initially everyone was sure that you need a vector DB to do RAG, but over time the community realized that this is not a requirement and that other techniques, or a combination of multiple ones, can work better. Meaning, I don't want to lock us in, even if right now it looks like there is only one type of solution.
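As a concrete illustration that the design doesn't have to require LLMs, here is a hypothetical rule-based preprocessor built on the `QueryPreprocessor` sketch above (the keyword-to-tag mapping and the `tag` metadata field are made up):

```python
class KeywordTagPreprocessor(QueryPreprocessor):
    """Non-LLM preprocessor: derives a metadata filter from keyword rules."""

    # Made-up mapping from prompt keywords to metadata tags.
    TAGS = {"contract": "legal", "invoice": "finance"}

    def process(
        self, query: str, metadata_filter: Optional[MetadataFilter] = None
    ) -> ProcessedQuery:
        filters = [] if metadata_filter is None else [metadata_filter]
        for keyword, tag in self.TAGS.items():
            if keyword in query.lower():
                filters.append(MetadataFilter.eq("tag", tag))
        combined = MetadataFilter.and_(filters) if filters else None
        return ProcessedQuery(
            retrieval_query=query,
            answer_query=query,
            metadata_filter=combined,
            processing_history=[
                QueryProcessingStep(query, query, combined, self.display_name())
            ],
        )
```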
I don't think we realistically can, given that there is no required metadata field. Checking whether a filter got wider "only" requires us to compare it with the original and check if any additional … The more I think about it, the more I gravitate towards ignoring this requirement and just leaving it up to the user, to keep things simpler for us. And TBH, changing the metadata filter will be a niche thing anyway.
**Overall Impression**

My takeaway so far is that query preprocessing is hard to do right. Its effectiveness appears to be highly dependent on both the source data and the input prompt.

**What Have I Explored?**

I have tried a handful of different querying methods, including context injection and query expansion; a sketch of the expansion idea follows.
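One common reading of "query expansion via document retrieval" is pseudo-relevance feedback; a minimal sketch of that idea is below, where `retrieve` and `extract_terms` are hypothetical helpers, not Ragna APIs:

```python
def expand_query(prompt: str, retrieve, extract_terms, num_docs: int = 3) -> str:
    """Retrieve with the original prompt, then widen the query with
    terms taken from the top hits (pseudo-relevance feedback)."""
    seed_docs = retrieve(prompt, k=num_docs)  # hypothetical retriever
    extra_terms = extract_terms(seed_docs)    # e.g. TF-IDF keywords
    return " ".join([prompt, *extra_terms])
```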
First, I was able to demonstrate an example where context injection worked. This experiment could be expanded to better determine what context is needed and to inject only that context.

Next, I would say the query expansion experiments were generally inconclusive. I found query expansion via document retrieval to be unreliable; however, I suspect that may be at least partially due to limitations in the source data. I was using the Ragna docs, and I think they may not have enough breadth to demonstrate it working. My understanding is that query expansion can be useful when the relationship between the prompt and the source material resembles a subset Venn diagram: the source material is the outer circle, the prompt similarity is the inner circle, and query expansion should extend the inner circle. My issue so far with demonstrating this is finding verifiable relationships between the source material and prompt similarity that can be appropriately expanded.

**Is It Worth Integrating into Ragna?**
Currently, Ragna only supports a minimal RAG pattern: during the retrieval stage we pass the prompt, without any processing, to the source storage and retrieve sources from it. This has several downsides.

As touched on above, a common strategy to improve RAG results is to preprocess the prompt before passing it on to the source storage and assistant.

I'd like to use this issue as a discussion on enabling this functionality in Ragna. I'm just going to dump my thoughts here so we can use them to sort out the proper design:
- Preprocessing should happen through a function `preprocess(prompt: str, metadata_filter: MetadataFilter) -> Prompt`. The returned `Prompt` object should just be a dataclass that carries a `retrieval_prompt`, an `answer_prompt`, and the metadata filter. In case no preprocessing is defined, we can just do `Prompt(retrieval_prompt=prompt, answer_prompt=prompt, metadata_filter=metadata_filter)`.
- The `Prompt` object needs to contain a list of intermediate results that can be inserted into our DB, while the actual RAG procedure only moves on with the last entry.
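A minimal sketch of that `Prompt` dataclass and the no-op default, following the description above (the `intermediate_results` field name is my own choice, not settled API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

from ragna.core import MetadataFilter


@dataclass
class Prompt:
    """Carrier object produced by preprocessing (sketch)."""

    retrieval_prompt: str  # what gets sent to the source storage
    answer_prompt: str     # what gets sent to the assistant
    metadata_filter: Optional[MetadataFilter] = None
    # Intermediate results to be inserted into the DB; the actual RAG
    # procedure only moves on with the last entry.
    intermediate_results: List[str] = field(default_factory=list)


def preprocess(prompt: str, metadata_filter: Optional[MetadataFilter] = None) -> Prompt:
    """No-op default used when no preprocessing is configured."""
    return Prompt(
        retrieval_prompt=prompt,
        answer_prompt=prompt,
        metadata_filter=metadata_filter,
    )
```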