This repository contains the code for setting up a Retrieval-Augmented Generation (RAG) agent in Python that scrapes data from the web and uses it to ground response generation. The system integrates several technologies: Tune Studio serving the Llama 3 model, the OpenAI API for generating embeddings, and Supabase for data storage. The agent can:
- Scrape web content using BeautifulSoup (a sketch of the scrape-and-embed flow follows this list)
- Extract and clean text data from websites
- Generate embeddings using the OpenAI API
- Store and query data in Supabase
- Serve results through a FastAPI server
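A minimal sketch of that scrape-and-embed flow, assuming the `requests`, `beautifulsoup4`, and `openai` packages; the function names and the embedding model choice are illustrative, not necessarily the repository's exact code:

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def scrape_text(url: str) -> str:
    """Fetch a page and return its visible text with markup stripped."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

def embed(text: str) -> list[float]:
    """Return a 1536-dimensional embedding, matching the vector(1536) column below."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # assumption: a 1536-dim OpenAI model
        input=text,
    )
    return response.data[0].embedding
```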
To install the required dependencies, clone the repository and run:
```bash
pip install -r requirements.txt
```
You must provide the following environment variables. Create a `.env` file in your project's root directory with your keys:
```
TUNE_API_KEY={your_tune_api_key}
SUPABASE_URL={your_supabase_url}
SUPABASE_KEY={your_supabase_key}
OPENAI_API_KEY={your_openai_api_key}
```
Keys can be obtained from:
- Tune API key: Tune Studio
- Supabase URL and Key: Supabase Project Settings
- OpenAI API Key: OpenAI Platform
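At startup these can be loaded with python-dotenv; a minimal sketch, assuming python-dotenv is among the installed dependencies:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

TUNE_API_KEY = os.environ["TUNE_API_KEY"]
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
```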
Use the following SQL commands in Supabase to set up your database:
```sql
-- Create vector extension
create extension vector;

-- Create documents table
create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536),
  url text,
  title text
);

-- Create function for matching documents
create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  url text,
  title text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.url,
    documents.title,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where documents.embedding <=> query_embedding < 1 - match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
```
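Once the table and function exist, both can be exercised from Python with the supabase client; a sketch building on the helpers above (the threshold, count, and title values are illustrative):

```python
from supabase import create_client

supabase = create_client(SUPABASE_URL, SUPABASE_KEY)  # values loaded from .env

url = "http://example.com"
text = scrape_text(url)  # helper from the earlier sketch
vector = embed(text)

# Store the scraped document with its embedding
supabase.table("documents").insert({
    "content": text,
    "embedding": vector,
    "url": url,
    "title": "Example Domain",  # illustrative; real code would extract <title>
}).execute()

# Retrieve similar documents via the match_documents function
matches = supabase.rpc("match_documents", {
    "query_embedding": embed("example query"),
    "match_threshold": 0.75,  # illustrative value; tune for your data
    "match_count": 5,
}).execute()
print(matches.data)  # rows of id, content, url, title, similarity
```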
To start up the FastAPI server, navigate to the project's root directory and execute:
```bash
python main.py
```
The server will be available at `http://localhost:8000`.
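For `python main.py` to start the server, the file presumably ends with a uvicorn entry point along these lines (a sketch, assuming the default port 8000):

```python
import uvicorn
from fastapi import FastAPI

app = FastAPI()  # endpoints are registered on this app in main.py

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```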
The server provides the following endpoints:
- Endpoint: `POST /add_documents`
- Input: JSON payload with the URL of the website to extract data from.
- Functionality: Extracts text from the website, generates embeddings, and stores the data in Supabase (a handler sketch follows the example).
- Example curl:

```bash
curl -X POST http://localhost:8000/add_documents \
  -H "Content-Type: application/json" \
  -d '{"url":"http://example.com"}'
```
- Endpoint: `POST /prompt`
- Input: JSON payload with a user query string.
- Functionality: Embeds the query, searches the database by embedding similarity, and returns the most relevant documents.
- Example curl (a Python equivalent follows):

```bash
curl -X POST http://localhost:8000/prompt \
  -H "Content-Type: application/json" \
  -d '{"query":"example query"}'
```
- `main.py`: FastAPI server setup and endpoints.
- `utils.py`: Utility functions, including data scraping, cleaning, embedding generation, and database interaction.
For more details, refer to the inline comments in the code.
Keep Tuning!