This project consists of four Python scripts designed for different functionalities:
- Download documents: Downloads and manages documents from remote sources based on a provided configuration file.
- Database and Document Processing: Handles document processing, including splitting and adding documents to a database.
- Query and Retrieval System: Manages querying, interactive loops, and retrieval of relevant documents using a multi-vector retrieval approach.
- Chat and Retrieval System: Manages chatting, interactive loops, and retrieval of relevant documents using a multi-vector retrieval approach.
These scripts work together to manage document processing, querying, and downloading, creating a cohesive system for document management and information retrieval.
- Download documents
- Database and Document Processing
- Query and Retrieval System
- Chat and Retrieval System
- Setup and Installation
- Usage
- Configuration
This script manages the downloading of documents from remote sources. It supports checking if the remote file has been updated before downloading and handles concurrent downloads using threading.
- main(file_path: str, download_path: str) -> None: The entry point for the document download process.
- load_json_file(file_path: str) -> Optional[dict]: Loads and parses a JSON configuration file.
- download_file(...) -> None: Downloads a file from a specified URL to a local path.
- is_remote_file_updated(...) -> bool: Checks if a remote file is newer or different in size compared to the local file.
- download_docs(...) -> None: Manages downloading multiple documents based on configuration settings.
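The update check described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `remote_looks_updated` is hypothetical, and it assumes the remote server reports `Content-Length` and `Last-Modified` headers, which are passed in as a plain mapping so the logic can be tried without any network access.

```python
import os
from email.utils import parsedate_to_datetime
from typing import Mapping


def remote_looks_updated(headers: Mapping[str, str], local_path: str) -> bool:
    """Return True when the remote copy appears newer or differs in size.

    Hypothetical helper: `headers` is assumed to hold the HTTP response
    headers of the remote file (for example from a HEAD request).
    """
    if not os.path.exists(local_path):
        return True  # nothing local yet, so a download is needed

    # Compare sizes when the server reports one.
    remote_size = headers.get("Content-Length")
    if remote_size is not None and int(remote_size) != os.path.getsize(local_path):
        return True

    # Compare modification times when the server reports one.
    last_modified = headers.get("Last-Modified")
    if last_modified is not None:
        remote_mtime = parsedate_to_datetime(last_modified).timestamp()
        if remote_mtime > os.path.getmtime(local_path):
            return True

    return False
```

If neither header is present, the sketch treats an existing local file as current; a real implementation might prefer to download again in that case.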
This script handles the processing of documents, including splitting them into chunks and managing document hashes. The documents are then stored in a vectorstore and bytestore for efficient retrieval.
- main(path: str) -> None: Handles database reset and document processing.
- add_document_hash(file_path: str) -> None: Adds a hash of the document to the database to track changes.
- load_documents(path: str) -> list[Document]: Loads PDF documents from the specified path.
- split_documents(...): Splits documents into parent and child chunks based on specified sizes.
- add_documents_to_store(...): Adds documents to vector and bytestore databases.
- clear_database() -> None: Clears all stored data in both vector and bytestore databases.
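To make the parent and child chunking concrete, here is a small sketch in plain Python. It is an illustration under assumptions, not the script's implementation: the function names, the fixed-width character splitting, and the `parent-N` id scheme are all hypothetical. The idea it shows is that each small child chunk keeps the id of the larger parent chunk it came from, so the children can go into the vectorstore for similarity search while the parents are stored by id and returned as context.

```python
def split_text(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_parent_child(
    text: str, parent_size: int, child_size: int
) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Return (parents, children); every child records its parent's id."""
    parents: dict[str, str] = {}
    children: list[dict[str, str]] = []
    for index, parent_text in enumerate(split_text(text, parent_size)):
        parent_id = f"parent-{index}"  # hypothetical id scheme
        parents[parent_id] = parent_text
        # Children are cut from the parent, so they never span two parents.
        for child_text in split_text(parent_text, child_size):
            children.append({"parent_id": parent_id, "text": child_text})
    return parents, children
```

At query time, a match on a child chunk would be resolved to its parent through `parent_id`, which gives the language model more surrounding context than the small chunk alone.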
This script provides functionality to perform queries against a stored set of documents using multi-vector retrieval. It allows for both direct queries and an interactive query loop. The retrieval process leverages self-querying and hybrid search methods, combining keyword-based searches with vector similarity to provide more accurate and relevant results.
- Self-Querying: Automatically generates alternative queries from the user's input to enhance retrieval performance.
- Hybrid Search: Utilizes both keyword-based and vector-based similarity searches to maximize the relevance of the retrieved documents.
- main() -> None: Initializes the command-line interface for querying.
- interactive_query_loop() -> None: Provides an interactive loop for continuous user queries.
- query_rag(query_text: str) -> None: Main function that handles the entire query process, including generating alternative questions, retrieving documents, and generating responses using self-querying and hybrid search techniques.
- generate_response(...) -> str: Generates a response based on the context of relevant documents.
- retrieve_relevant_docs(...) -> tuple[list, list]: Retrieves relevant documents based on provided questions and sources.
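This README does not say how the keyword-based and vector-based results are merged. One common option is reciprocal rank fusion, sketched below purely as an illustration of hybrid search; it is not confirmed to be what `query_rag.py` does, and the function name, the document ids, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # hypothetical keyword search result
vector_hits = ["b", "d", "a"]   # hypothetical vector search result
merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

The same function could also merge the result lists obtained from the alternative questions generated by self-querying, since it accepts any number of rankings.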
This script provides functionality to chat with a stored set of documents using multi-vector retrieval. The retrieval process leverages self-querying and previous chat messages, using vector similarity to provide more accurate and relevant results.
- Self-Querying: Automatically generates alternative queries from the user's input to enhance retrieval performance.
- main() -> None: Initializes the command-line interface for querying.
- interactive_query_loop() -> None: Provides an interactive loop for continuous user chats.
- Clone the Repository: Clone the repository to your local machine.
  git clone https://github.com/madslundt/aws-rag.git
  cd aws-rag
- Install Dependencies: Use pip or pipenv to install the necessary packages.
  pipenv install
- Activate the Virtual Environment: If you are using pipenv, enter the virtual environment shell.
  pipenv shell
- Download documents:
  python download_docs.py
- Database and Document Processing:
  python populate_database.py [--reset]
- Query or chat: Query with the RAG in interactive mode:
  python query_rag.py
  Chat with the RAG in interactive mode:
  python chat_rag.py
- --reset: Clears the existing database before processing new documents.
- --query_text: Directly queries the system without entering interactive mode.
In interactive mode the following keywords can be used:
- q or exit: Terminates interactive mode.
- r or reset: Resets the chat history.
- ch or history or chat_history: Shows the chat history.
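The keyword handling in interactive mode could look roughly like the sketch below. The keywords come from the list above; the function name `classify_command`, the returned labels, and the choice to ignore case and surrounding whitespace are assumptions made for illustration.

```python
EXIT_WORDS = {"q", "exit"}
RESET_WORDS = {"r", "reset"}
HISTORY_WORDS = {"ch", "history", "chat_history"}


def classify_command(user_input: str) -> str:
    """Map raw input to 'exit', 'reset', 'history', or 'query'."""
    # Assumption: keywords match regardless of case or padding.
    word = user_input.strip().lower()
    if word in EXIT_WORDS:
        return "exit"
    if word in RESET_WORDS:
        return "reset"
    if word in HISTORY_WORDS:
        return "history"
    # Anything else is treated as a question for the RAG system.
    return "query"
```

An interactive loop would call this on every line read from the user and only forward the text to retrieval when the result is `query`.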
- CONFIG_PATH: Path to the configuration JSON file.
- DOCSTORE_PATH: Directory path for storing document databases.
- DOCUMENTS_PATH: Directory path where documents are downloaded and stored.
- OLLAMA_MODEL: Specifies the language model to be used. The default is llama3.1.
- EMBEDDING_MODEL: Specifies the embedding model to be used. The default is nomic-embed-text via Ollama.
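This README does not state where these settings are defined. Assuming they are read from environment variables (an assumption, they may equally be constants in the scripts), a loader could be sketched as below. The function name `load_settings` is hypothetical. Only the two model defaults are taken from the list above; the path settings have no documented default, so the sketch leaves them as `None` when unset.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, Optional[str]]:
    """Collect the project's settings, applying the documented defaults."""
    if env is None:
        env = os.environ  # assumption: settings live in the environment
    return {
        # No defaults are documented for the paths, so none are invented here.
        "CONFIG_PATH": env.get("CONFIG_PATH"),
        "DOCSTORE_PATH": env.get("DOCSTORE_PATH"),
        "DOCUMENTS_PATH": env.get("DOCUMENTS_PATH"),
        # Defaults taken from the configuration list.
        "OLLAMA_MODEL": env.get("OLLAMA_MODEL", "llama3.1"),
        "EMBEDDING_MODEL": env.get("EMBEDDING_MODEL", "nomic-embed-text"),
    }
```

Passing a plain dictionary instead of the real environment makes the defaults easy to check in isolation.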
This project is licensed under the MIT License. See the LICENSE file for more information.