
Add support for preprocessing user prompts and metadata filter for retrieval and answering #531

Open
pmeier opened this issue Jan 8, 2025 · 4 comments
Labels
type: RFD ⚖️ Decision making

Comments


pmeier commented Jan 8, 2025

Currently, Ragna only supports a minimal RAG pattern: during the retrieval stage we pass the prompt to the source storage without any processing and retrieve sources from it. This has several downsides:

  • An embedding of a question might not be a close match to an embedding of a statement that contains the answer to the question. Although this is technically something that should be solved on the embedding model side, in practice it is usually addressed by rephrasing the question before using it to retrieve sources.
  • A question might contain hidden context assumptions that cannot be handled by the embedding model. For example, a question like "What happened last year?" likely won't get any close matches. Rephrasing the prompt to "What happened in 2024?" will help here.

As touched on above, a common strategy to improve RAG results is to preprocess the prompt before passing it on to the source storage and assistant.

I'd like to use this issue as a discussion on how to enable this functionality in Ragna. I'm just going to dump my thoughts here so that we can use them to sort out the proper design:

  • Input preprocessing should be optional. As this is an advanced pattern, we don't want to force users into it.
  • Input preprocessing should be rather generic. I don't want to enforce an agentic workflow or the like. IMO, the interface should be something simple like preprocess(prompt: str, metadata_filter: MetadataFilter) -> Prompt. The returned Prompt object should just be a dataclass that carries a retrieval_prompt, an answer_prompt, and the metadata filter. In case no preprocessing is defined, we can just do Prompt(retrieval_prompt=prompt, answer_prompt=prompt, metadata_filter=metadata_filter) (see the sketch after this list).
  • While we should allow the preprocessing to alter the metadata filter (imagine the prompt "What happened in this project in 2024?" narrowing down the metadata filter to only include documents from the year 2024), we might need to enforce that the preprocessing does not widen the filter. However, since the processing is user-defined, we might also not need to enforce this at all and push the responsibility to the user.
  • To be able to evaluate the preprocessing or the full RAG workflow with the preprocessing, we need to be able to track the individual steps. So maybe the Prompt object needs to contain a list of intermediate results that can be inserted into our DB, while the actual RAG procedure only moves on with the last entry.
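
A minimal sketch of the Prompt object and the no-op fallback described above (the apply_preprocessing helper and its signature are illustrative assumptions, not a settled API):

from dataclasses import dataclass
from typing import Callable, Optional

from ragna.core import MetadataFilter

@dataclass
class Prompt:
    retrieval_prompt: str
    answer_prompt: str
    metadata_filter: Optional[MetadataFilter]

def apply_preprocessing(
    prompt: str,
    metadata_filter: Optional[MetadataFilter],
    preprocess: Optional[Callable[[str, Optional[MetadataFilter]], Prompt]] = None,
) -> Prompt:
    # Without a user-defined preprocessor, fall back to the no-op default:
    # both prompts equal the input and the metadata filter is left untouched.
    if preprocess is None:
        return Prompt(
            retrieval_prompt=prompt,
            answer_prompt=prompt,
            metadata_filter=metadata_filter,
        )
    return preprocess(prompt, metadata_filter)
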
@pmeier pmeier added the type: RFD ⚖️ Decision making label Jan 8, 2025

nenb commented Jan 8, 2025

A lot of what I have read around query preprocessing seems to reduce to sending the prompt to an LLM and/or ML model for reformulation. (This is separate from the LLM that is part of the 'G' stage of RAG.)

For example, sending the user prompt to some form of classifier model that returns a bunch of relevant metadata tags, e.g. the prompt is classified as 'Legal'. Or sending the user prompt to an LLM whose system prompt contains a number of examples of how an optimal query should be written. The LLM then tries to rewrite the user query in this optimal way.
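
To make the rewriting idea concrete, here is a minimal sketch of few-shot query rewriting; the complete callable and the example prompts are placeholders for any chat-completion call, not an existing Ragna API:

from typing import Callable

# Few-shot system prompt with examples of how an "optimal" retrieval query looks.
REWRITE_SYSTEM_PROMPT = """\
Rewrite the user's question into a standalone search query for document retrieval.
Resolve relative references (e.g. "last year" -> the concrete year) and drop filler words.

Example: "What happened last year?" -> "What happened in 2024?"
Example: "How do I set it up?" -> "How do I install and configure Ragna?"
"""

def rewrite_query(prompt: str, complete: Callable[[str, str], str]) -> str:
    # `complete` stands in for any LLM call that takes a system prompt and a
    # user prompt and returns the model's text response.
    rewritten = complete(REWRITE_SYSTEM_PROMPT, prompt)
    return rewritten.strip() or prompt  # fall back to the original prompt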

I don't know if this aligns with what @pmeier has written above, e.g. this may be something like an agentic workflow; it would need to be confirmed with @pmeier before proceeding any further.

There are also more advanced techniques for query reformulation and retrieval, e.g. HyDE. I'm not aware of how many such techniques realistically exist. All I know is that the techniques that get all the attention on X and Reddit tend to involve sending the prompt to an LLM and/or ML model.

Some relevant links:
https://aksdesai1998.medium.com/improving-rag-with-query-expansion-reranking-models-31d252856580

https://x.com/andrew_n_carr/status/1731058196801077409?t=GnGM2txliCJi-8svc-bX4g&s=03

https://en.wikipedia.org/wiki/Query_expansion


andrewfulton9 commented Jan 9, 2025

I am in agreement with @nenb. It's hard to see how you could take text input and efficiently preprocess it without using an LLM. That being said, I think the code can be written in such a way that leaves the door open for non-LLM preprocessors if someone wanted to go down that road.

I gave this discussion to Claude and it suggested the following as a starting point, which looks pretty good to me:

import abc
from dataclasses import dataclass, field
from typing import List, Optional

from ragna.core import Assistant, MetadataFilter

@dataclass
class QueryProcessingStep:
    """Represents a single preprocessing step"""
    original_query: str
    processed_query: str
    metadata_filter: Optional[MetadataFilter] = None
    processor_name: str = ""
    metadata: dict = field(default_factory=dict)

@dataclass
class ProcessedQuery:
    """Results of query preprocessing"""
    retrieval_query: str
    answer_query: str
    metadata_filter: Optional[MetadataFilter]
    processing_history: List[QueryProcessingStep] = field(default_factory=list)

class QueryPreprocessor(abc.ABC):
    """Abstract base class for query preprocessors"""

    @abc.abstractmethod
    def process(
        self,
        query: str,
        metadata_filter: Optional[MetadataFilter] = None
    ) -> ProcessedQuery:
        """Process a query before retrieval"""
        ...

    @classmethod
    def display_name(cls) -> str:
        return cls.__name__

# Example implementations
class DefaultQueryPreprocessor(QueryPreprocessor):
    """Default no-op preprocessor"""

    def process(self, query: str, metadata_filter: Optional[MetadataFilter] = None) -> ProcessedQuery:
        return ProcessedQuery(
            retrieval_query=query,
            answer_query=query,
            metadata_filter=metadata_filter,
            processing_history=[
                QueryProcessingStep(query, query, metadata_filter, self.display_name())
            ]
        )

class LLMQueryPreprocessor(QueryPreprocessor):
    """Uses an LLM to preprocess queries"""

    def __init__(self, llm_assistant: Assistant):
        self.llm = llm_assistant

    def process(self, query: str, metadata_filter: Optional[MetadataFilter] = None) -> ProcessedQuery:
        # Use the LLM to generate improved retrieval and answer queries,
        # track the individual processing steps, and return the processed query.
        raise NotImplementedError
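
For illustration, using the no-op preprocessor from the sketch above could look roughly like this (hypothetical wiring, since Ragna does not expose a preprocessor hook yet; the MetadataFilter.eq classmethod is assumed to follow Ragna's filter API):

from ragna.core import MetadataFilter

preprocessor = DefaultQueryPreprocessor()
processed = preprocessor.process(
    "What happened in this project in 2024?",
    metadata_filter=MetadataFilter.eq("year", 2024),
)
assert processed.retrieval_query == processed.answer_query
# processed.processing_history records each step for later evaluation or DB storage.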

Also, similar to one of @pmeier's points above: do we want to enforce agreement between the prompt and the metadata filter, or should this be handled by the user? For example, if the prompt is something like "what happened 2 years ago" but the metadata filter is something like MetadataFilter.gt("year", 2024), should that raise an error, or should it be left to the developers writing the processors to handle?


pmeier commented Jan 10, 2025

Thanks for the additional points, Nick!

Re @andrewfulton9

I am in agreement with @nenb. It's hard to see how you could take text input and efficiently preprocess it without using an LLM. That being said, I think the code can be written in such a way that leaves the door open for non-LLM preprocessors if someone wanted to go down that road.

That is good, because then we are all in agreement 🙂. I'm aware that most preprocessors will use LLMs and that is fine. I just don't want to build it in a way that requires LLMs.

This is similar to the source storages that we have. Initially, everyone was sure that you need a vector DB to do RAG. But over time the community realized that this is not a requirement and that other techniques, or a combination of multiple ones, can work better.

Meaning, I don't want to lock us in even if right now it looks like there is only one type of solution.

Also, similar to one of @pmeier's points above: do we want to enforce agreement between the prompt and the metadata filter, or should this be handled by the user? For example, if the prompt is something like "what happened 2 years ago" but the metadata filter is something like MetadataFilter.gt("year", 2024), should that raise an error, or should it be left to the developers writing the processors to handle?

I don't think we realistically can, given that there is no required metadata field. Checking whether a filter got wider "only" requires us to compare it with the original and check whether any additional ORs were introduced. But even that can be hard if the structure of the filter has changed.

The more I think about it, the more I gravitate towards dropping this requirement and just leaving it up to the user, which keeps things simpler for us. And TBH, changing the metadata filter will be a niche thing anyway.
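
For concreteness, a naïve version of the widening check mentioned above could just count OR nodes in the filter tree. This is a sketch only; it assumes a filter node exposes an operator attribute and, for AND/OR nodes, its child filters in value, which may not match Ragna's actual MetadataFilter internals, and it already shows why the check becomes fragile once the structure changes:

def count_or_nodes(node) -> int:
    # Count OR nodes in a (possibly nested) filter tree; None means "no filter".
    if node is None:
        return 0
    operator = str(getattr(node, "operator", ""))
    children = node.value if operator.endswith(("AND", "OR")) else []
    return operator.endswith("OR") + sum(count_or_nodes(child) for child in children)

def maybe_widened(original, processed) -> bool:
    # Heuristic only: more OR nodes than before suggests the filter got wider.
    # As soon as the preprocessing restructures the filter, this stops being reliable.
    return count_or_nodes(processed) > count_or_nodes(original)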


andrewfulton9 commented Jan 30, 2025

Experimentation Notebook

Draft PR

Overall Impression

My takeaway so far is that query preprocessing is hard to do right. Its effectiveness appears to be highly dependent on both the source data and the input prompt.

What Have I Explored?

I have tried a handful of different querying methods. These methods involve:

  • Rewording prompts with more context
  • Query expansion via document retrieval based on a naïve answer (sketched after this list)
  • Query expansion with rank limits (same as above but only keeping the n most similar documents to the question)
  • Having an LLM write steps for better query expansion
  • Having an LLM write metadata filters for the question
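
As a rough illustration of the query expansion items above, the expansion step could look like the following; generate_naive_answer and retrieve are placeholders for an LLM call and a source-storage lookup, not existing Ragna functions:

from typing import Callable, List

def expand_query(
    prompt: str,
    generate_naive_answer: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    top_n: int = 3,
) -> str:
    # Generate a best-guess answer without sources, retrieve snippets that are
    # similar to that answer, and append only the top_n of them to the prompt
    # (the rank limit mentioned above).
    naive_answer = generate_naive_answer(prompt)
    snippets = retrieve(naive_answer)[:top_n]
    return "\n\n".join([prompt, *snippets])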

First, I was able to demonstrate an example where context injection worked. This experiment could be expanded to better determine what context is needed and only inject that context.

Next, I would say the query expansion experiments were generally inconclusive. I found query expansion via document retrieval to be unreliable; however, I suspect that may be at least partially due to limitations in the source data. I was using the Ragna docs, and I think they may not have enough breadth to demonstrate it working.

My understanding is that query expansion can be useful when the relationship between the prompt and the source material resembles a subset Venn diagram, with the source material being the outer circle and the prompt similarity being the inner circle. Query expansion should be useful to extend the inner circle.

My issue so far with demonstrating this is finding verifiable relationships between the source material and prompt similarity that can be appropriately expanded.

Is It Worth Integrating into Ragna?

Although the experiments haven't been as conclusive as I would have liked, I would still recommend integrating prompt preprocessing into Ragna. My reasoning is that prompt engineering is a significant factor in LLM effectiveness, and I think it is logical to assume that automating some aspects of prompt engineering must be valuable.

In my opinion, more work should be done to understand where and when query preprocessing may be useful. I think integrating the feature into Ragna could help give more people the opportunity to explore this topic, which I believe is valuable in and of itself.

What More Can We Do?

One thing I find lacking in online resources is true demonstrations of the effectiveness of different query preprocessors. I think that is likely because it can be hard to understand the relationship between source material and prompts.

I believe it could be valuable to find more example datasets and then develop prompts that demonstrate some relationship to the data. Examples of these sorts of relationships would be:

  • A prompt with high similarity scores to many source tokens
  • A prompt with only a few low similarity scores to the source material, etc.

Furthermore, we should explore how agentic workflows can be integrated into preprocessors. I think it would be valuable to give an LLM more tools for improving a query beyond just returning text.
