This project provides a set of Python scripts to populate and query a vectorstore database using LangChain. The main features include:
- Database Population: `populate_database.py` loads documents from a PDF file, splits them into manageable chunks, and adds them to a Chroma vectorstore and docstore (if enabled). This process relies on embeddings: numerical representations of text that capture semantic meaning and enable efficient similarity search and retrieval of relevant content (a short embedding example follows this list). The script also handles parent document retrieval, preserving the context of each chunk and allowing related content to be traced back to its original source for comprehensive data management and search accuracy.
- Interactive Querying: `query_data.py` provides an interface to query the database interactively or via command-line arguments. It uses a Large Language Model (LLM) with multiquery capabilities to generate alternative versions of user queries, enhancing the retrieval of relevant documents and generating comprehensive answers.
- Testing: Basic tests are provided to validate the querying capabilities using `pytest`.
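The embedding idea behind the population step can be pictured with a short, standalone snippet. This is only a sketch, assuming the `langchain-ollama` package and a local Ollama server with `nomic-embed-text` pulled; it is not code from this repository:

```python
# Illustrative only: turn two texts into vectors and compare them.
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Each text becomes a fixed-length vector of floats.
v1 = embeddings.embed_query("How do I check the tire pressure?")
v2 = embeddings.embed_query("Recommended tyre inflation values")

# Cosine similarity: semantically related texts score close to 1.
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) ** 0.5) * (sum(b * b for b in v2) ** 0.5)
print(f"similarity: {dot / norm:.3f}")
```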
- Python 3.11
- pipenv: Required to manage dependencies
- Ollama: Required to run the LLM and embeddings locally. It ensures that a language model and an embedding function are available for processing queries and documents.
By default, `llama3.1` is used as the LLM, and `nomic-embed-text` is utilized as the embedding function (these settings can be modified in the `env.py` file).
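For orientation, `env.py` typically holds settings along these lines. The variable names below are illustrative assumptions, not the repository's actual contents:

```python
# Hypothetical env.py-style settings; names and paths are assumptions.
LLM_MODEL = "llama3.1"                 # Ollama model used to answer queries
EMBEDDING_MODEL = "nomic-embed-text"   # Ollama model used for embeddings
CHROMA_PATH = "chroma"                 # directory where the Chroma vectorstore is persisted
DATA_PATH = "Owners_Manual.pdf"        # source PDF to ingest
```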
- Clone the repository to your local machine.

  ```sh
  git clone https://github.com/madslundt/rag-sample
  cd rag-sample
  ```

- Install the required dependencies using `pipenv`.

  ```sh
  pipenv install
  ```

- Activate the virtual environment.

  ```sh
  pipenv shell
  ```
To populate the Chroma vectorstore and docstore with the content of `Owners_Manual.pdf`, use `populate_database.py`. You can reset the database before loading new documents using the `--reset` flag. The embedding process converts text into numerical vectors for efficient similarity search. Each document chunk is assigned a unique ID and a hash value to ensure that duplicates are not added; the script checks these identifiers against the existing vectorstore entries to determine if the document is already present. Additionally, the script supports parent document retrieval, allowing each chunk to be traced back to its original source, preserving context and ensuring accurate and comprehensive search results.

```sh
python populate_database.py [--reset]
```
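For readers curious how this kind of deduplicated ingestion usually looks in LangChain, here is a simplified sketch. It omits the parent-document/docstore part, assumes the `langchain-community`, `langchain-text-splitters`, `langchain-chroma`, `langchain-ollama`, and `pypdf` packages, and may differ from the actual `populate_database.py`:

```python
# Simplified ingestion sketch: load, split, deduplicate by ID, add to Chroma.
import hashlib

from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF and split it into overlapping chunks.
docs = PyPDFLoader("Owners_Manual.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks = splitter.split_documents(docs)

db = Chroma(
    persist_directory="chroma",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)

# Derive a deterministic ID (source + page + content hash) per chunk so that
# re-running the script skips chunks already present in the vectorstore.
existing_ids = set(db.get(include=[])["ids"])
new_chunks, new_ids = [], []
for chunk in chunks:
    digest = hashlib.sha256(chunk.page_content.encode()).hexdigest()[:16]
    chunk_id = f"{chunk.metadata.get('source')}:{chunk.metadata.get('page')}:{digest}"
    if chunk_id not in existing_ids:
        new_chunks.append(chunk)
        new_ids.append(chunk_id)

if new_chunks:
    db.add_documents(new_chunks, ids=new_ids)
```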
You can query the database using `query_data.py`, either by providing a query directly as a command-line argument or interactively.
This script uses the LLM to generate multiple versions of the input query, improving the retrieval of relevant information and generating a well-informed response.
The sources of the retrieved information, extracted from the metadata of the documents stored in the vectorstore, are displayed alongside the answer.
To query using the command line:

```sh
python query_data.py --query_text "Your query here"
```
To enter interactive mode:

```sh
python query_data.py
```
In interactive mode, you can type your queries directly, and type `exit` or `q` to quit.
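The multiquery flow can be pictured with the sketch below. It shows the general pattern (LangChain's `MultiQueryRetriever` driving a Chroma retriever with local Ollama models) with an assumed prompt; the actual chain in `query_data.py` may differ:

```python
# Sketch of multiquery retrieval followed by answer generation.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(model="llama3.1")
db = Chroma(
    persist_directory="chroma",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)

# The retriever asks the LLM for alternative phrasings of the question and
# merges the documents retrieved for each variant.
retriever = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

question = "How do I check the tire pressure?"
docs = retriever.invoke(question)

context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(
    f"Answer the question based only on the following context:\n\n{context}\n\nQuestion: {question}"
)

print(answer.content)
# Sources come from the metadata stored alongside each chunk.
print([doc.metadata.get("source") for doc in docs])
```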
To run tests and validate the querying logic, use `pytest`.

```sh
pytest
```
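A test in that style might look like the sketch below. The `query_rag` entry point, the question, and the judging prompt are assumptions made for illustration; they are not necessarily what `test_rag.py` contains:

```python
# Hypothetical test: ask a question, then let the local LLM judge the answer.
from langchain_ollama import ChatOllama

from query_data import query_rag  # assumed entry point; the real name may differ

JUDGE_PROMPT = """\
Expectation: {expected}
Actual answer: {actual}
Does the actual answer satisfy the expectation? Answer 'true' or 'false' only.
"""


def test_query_returns_relevant_answer():
    actual = query_rag("How do I check the tire pressure?")
    verdict = ChatOllama(model="llama3.1").invoke(
        JUDGE_PROMPT.format(
            expected="The answer should explain how to check or adjust tire pressure.",
            actual=actual,
        )
    )
    assert "true" in verdict.content.strip().lower()
```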
- `populate_database.py`: Script for clearing, loading, splitting, and adding documents to the vectorstore.
- `query_data.py`: Script for querying the vectorstore database and generating responses to user questions using the LLM.
- `test_rag.py`: Test cases that verify the functionality of the scripts by using an LLM to check whether the results are accurate.
- `env.py`: Contains environment variables and configurations such as paths and model names.
The database is cleared when running `populate_database.py` with the `--reset` flag.
Customize the paths and configurations in `env.py` according to your project setup.
This project is licensed under the terms of the MIT license.
Contributions are welcome! Please fork the repository and submit a pull request with your changes.