A RAG Approach to Disease Outbreak Monitoring
- Enhancing Epidemiological Intelligence
This repository contains the code for my master's thesis project. It is designed to be self-contained and can be run with or without knowledge of the paper's broader context.
It contains all the elements required for a running RAG application, from data collection and pre-processing through to storage, retrieval, and generation.
The corresponding text can be found in this Google Doc.
This RAG application consists of various elements. The setup of each is detailed below:
graph TD;
A[Frontend UI] --> |Query| B[Embedding Model]
B --> |Embedded Query| C[Vector Store]
C --> |Retrieved Relevant Documents| D[Enhanced Query with Context]
D --> E[LLM]
E --> |Generated Response| A
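In LangChain terms, the flow above boils down to roughly the following sketch. This is illustrative only: the component names follow langchain-postgres / langchain-ollama, and the connection string, collection name, and model names are placeholders rather than this repo's actual wiring.

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_postgres import PGVector

# 1. The embedding model turns the user's query into a vector.
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# 2. The vector store returns the chunks closest to that vector.
store = PGVector(
    embeddings=embeddings,
    collection_name="who_dons",  # placeholder collection name
    connection="postgresql+psycopg://postgres:postgres@localhost:5432/postgres",
)
question = "Which diseases are prevalent in Kenya?"
docs = store.similarity_search(question, k=4)

# 3. The retrieved context is folded into the prompt sent to the LLM.
context = "\n\n".join(doc.page_content for doc in docs)
llm = ChatOllama(model="llama3.2")
response = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(response.content)
```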
For ease of use, this setup guide has been written with Anaconda in mind; however, it should work equally well with other (virtual) Python environments.
On macOS, if you are running on Apple Silicon (M-series processors), you must install a native ARM build of Python. Otherwise, Python will default to an x86 build running under Rosetta (i.e. under emulation) and the ML libraries will not run at all. See also here.
CONDA_SUBDIR=osx-arm64 conda create --name pg-vector-rag python=3.12 -c conda-forge
conda activate pg-vector-rag
pip install -r requirements.txt
Should you need to remove the environment and start fresh for any reason:
conda env remove --name pg-vector-rag
This project uses the pgvector Postgres extension as a vector store. This allows the data to be stored alongside the embeddings and as such both can be accessed easily through any SQL querying utility.
If you do not wish to run this locally, a cloud-based service such as Supabase could also be used.
There is a docker-compose.yml which sets up a local PGVector instance as a Docker container. Note: Please create an empty subdirectory pgvector_data
before bringing the container up for the first time. This will be mounted as a volume within the Docker container.
mkdir pgvector_data
docker compose up -d
Next, move to the app/db folder and prepare the vector store. Ensure you have a .env
file present (use the provided .example.env for guidance) and run:
python create_db.py
erDiagram
document {
INTEGER id PK
INTEGER batch "nullable"
TEXT contents "nullable"
TEXT summary "nullable"
JSON meta "nullable"
DATETIME event_date "nullable"
TEXT url "nullable"
DATETIME published_at "nullable"
DATETIME created_at "nullable"
}
embedding {
INTEGER id PK
TEXT model "nullable"
INTEGER document_id FK
INTEGER chunk_id
HALFVEC(256) embedding_256 "nullable"
HALFVEC(384) embedding_384 "nullable"
HALFVEC(512) embedding_512 "nullable"
HALFVEC(768) embedding_768 "nullable"
HALFVEC(1024) embedding_1024 "nullable"
HALFVEC(1536) embedding_1536 "nullable"
HALFVEC(3072) embedding_3072 "nullable"
HALFVEC(4096) embedding_4096 "nullable"
HALFVEC(8192) embedding_8192 "nullable"
}
v_doc_embedding {
INTEGER document_id PK
INTEGER embedding_id PK
INTEGER batch "nullable"
TEXT model "nullable"
INTEGER chunk_id "nullable"
TEXT contents "nullable"
HALFVEC embedding "nullable"
TEXT summary "nullable"
JSON meta "nullable"
DATETIME published_at "nullable"
TEXT url "nullable"
}
country_lookup {
INTEGER id PK
TEXT country_code "nullable"
TEXT country_name "nullable"
TEXT region "nullable"
TEXT subregion "nullable"
DATETIME created_at "nullable"
}
document ||--o{ embedding : document_id
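Since both the documents and their embeddings live in ordinary tables, the store can be inspected with plain SQL. Below is a minimal sketch, assuming the schema above, the 768-dimension column, psycopg installed, and default local credentials; the query vector is a placeholder and would normally come from the embedding model.

```python
import psycopg

# Placeholder 768-dimension query vector; in the app this comes from the embedding model.
query_vec = "[" + ",".join(["0.01"] * 768) + "]"

with psycopg.connect("postgresql://postgres:postgres@localhost:5432/postgres") as conn:
    rows = conn.execute(
        """
        SELECT d.url, e.chunk_id,
               e.embedding_768 <=> %s::halfvec(768) AS cosine_distance
        FROM embedding e
        JOIN document d ON d.id = e.document_id
        ORDER BY cosine_distance
        LIMIT 5
        """,
        (query_vec,),
    ).fetchall()

for url, chunk_id, distance in rows:
    print(f"{distance:.3f}  chunk {chunk_id}  {url}")
```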
This code uses LangChain to abstract away some of the lower-level interactions with our LLMs and data.
At the time of writing (January 2025), LangChain is quite far behind in the version of the pgvector Python package it supports (v0.2.5, while the current release is v0.3.6). There is an open PR adding support for the newer features (in particular the half-precision vector type halfvec).
This version of langchain-postgres can be installed directly from GitHub:
pip install git+https://github.com/langchain-ai/langchain-postgres@c32f6beb108e37aad615ee3cbd4c6bd4a693a76d
A list of interesting data sources pertaining to malaria and other tropical diseases can be found in the subfolders under ./data-collection. The code in this repo currently uses data scraped from WHO Disease Outbreak News (DONs) to populate our RAG knowledge base. It should be fairly straightforward to adapt it to other sources.
See ./app/who-don-retriever for scripts to scrape and clean the data. In this directory, you'll also find a README outlining the process.
Note: At the time of writing, a non-packaged version of the Markdownify library must be installed. This has better support for tables in Markdown. Some of the DONs contain HTML tables which would otherwise be lost:
pip install git+https://github.com/matthewwithanm/python-markdownify@3026602686f9a77ba0b2e0f6e0cbd42daea978f5
Copy the pre-processed data retrieved in the step above to ./app/data from where it can be loaded into the database. Populate the document store by running:
python load_documents.py
This will read the CSV data file, do some light pre-processing and load the documents into the database.
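In essence, this step amounts to something along the following lines. This is only a rough sketch: the connection string, CSV file name, and column handling are assumptions, and the actual logic lives in app/db/load_documents.py.

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection string and CSV name are placeholders -- take them from your .env and data folder.
engine = create_engine("postgresql+psycopg://postgres:postgres@localhost:5432/postgres")

df = pd.read_csv("../data/who_dons.csv")

# Light pre-processing: tidy whitespace and parse dates before loading.
df["contents"] = df["contents"].str.strip()
df["published_at"] = pd.to_datetime(df["published_at"])

df.to_sql("document", engine, if_exists="append", index=False)
```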
All components of this application can be run locally without accessing resources in the cloud. This includes the language and embedding models.
Ollama makes it easy to run LLMs locally. Download and run the installer. Once installed, run your model of choice, e.g. llama3.2 3B, with ollama run llama3.2 (but see the note on context length below).
The following are some suggested options for running the model on a separate computer in the same network:
export OLLAMA_HOST=0.0.0.0
export OLLAMA_KEEP_ALIVE=15m
export OLLAMA_FLASH_ATTENTION=true
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
By default, Ollama will expose its API on port 11434.
Also, by default, Ollama will limit its context window to 2048 tokens. This is too low for our use case. Therefore, we should adjust it before running our model or simply create our own model version with an expanded context window. To do so:
ollama run llama3.2
...
>>> /set parameter num_ctx 16768
>>> /save llama3.2_16kctx
>>> /bye
...
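On the application side, the client then only needs to point at wherever Ollama is running and at the model variant you created. A hedged example using the langchain-ollama wrapper (the host address is a placeholder, and the actual wiring in this repo is driven by the .env file):

```python
from langchain_ollama import ChatOllama

# Remote Ollama instance on the local network (address is a placeholder);
# num_ctx can also be set per request instead of baking it into a saved model.
llm = ChatOllama(
    model="llama3.2_16kctx",
    base_url="http://192.168.1.50:11434",
    num_ctx=16768,
)
print(llm.invoke("Summarise the key facts about malaria transmission.").content)
```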
By default, this app uses the all-MiniLM-L6-v2 Sentence Transformer model to generate the embeddings for our vector store. Another model which works very well for embeddings is nomic-embed-text-v1.5. Run the following to pull the models into Ollama:
ollama pull all-minilm
...
ollama pull nomic-embed-text
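As a quick sanity check that the pulled models respond, and to see their output dimensionalities (which correspond to the embedding_384 and embedding_768 columns in the schema), something along these lines works via the langchain-ollama wrapper:

```python
from langchain_ollama import OllamaEmbeddings

for model in ("all-minilm", "nomic-embed-text"):
    emb = OllamaEmbeddings(model=model)
    vec = emb.embed_query("Malaria outbreak reported in coastal Kenya")
    print(model, len(vec))  # all-minilm -> 384 dimensions, nomic-embed-text -> 768
```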
Once the models have been downloaded, the next step is to create the embeddings for our documents. Run ./app/db/load_embeddings.py:
python load_embeddings.py
Note: This will take some time to process – expect at least 15 minutes on a modern Mac laptop.
TODO:
If you've made it this far – great! At this point, run the front-end application (from within the app directory):
flask run
You should now be able to reach the chat-style interface at http://127.0.0.1:5000. Some example questions to try:
- In which countries is Malaria most prevalent?
- Which diseases are prevalent in Kenya?
- Which were the largest disease outbreaks in the last 20 years?
- Where were outbreaks with the most severe impacts, e.g. deaths?
- Baseline attempt
- DONs were put together from two fields mainly
- Embeddings loaded for nomic and embed-all
- I think this data is now stored in backup tables in the db (check)
- Refined attempt
- Full DONs were pieced together from all relevant fields
- The documents were distilled into Markdown storage for the db
- Vectors were then taken from them for both nomic and embed-all (sometimes exceeding context window)
- Side quest: The Markdown docs were summarized with gpt-4o-mini
- Could make a Batch 2 with embeddings for these summaries
- Alternatively, these could be embedded inline as processing of the requests happens (though embedding is slow)
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It's a measure of orientation rather than magnitude.
- Range: -1 to 1 (for normalized vectors, which is typical in text embeddings)
- 1: Vectors point in the same direction (most similar)
- 0: Vectors are orthogonal (unrelated)
- -1: Vectors point in opposite directions (most dissimilar)
In pgvector, the <=> operator computes cosine distance, which is 1 - cosine similarity.
- Range: 0 to 2
- 0: Identical vectors (most similar)
- 1: Orthogonal vectors
- 2: Opposite vectors (most dissimilar)
When you get results from similarity_search:
- Lower distance values indicate higher similarity.
- A distance of 0 would mean exact match (rarely happens with embeddings).
- Distances closer to 0 indicate high similarity.
- Distances around 1 suggest little to no similarity.
- Distances approaching 2 indicate opposite meanings (rare in practice).
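To make the relationship between the two measures concrete, here is a small worked example with toy vectors (not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

similarity = cosine_similarity(a, b)
distance = 1.0 - similarity  # what pgvector's <=> operator returns
print(f"cosine similarity: {similarity:.3f}")
print(f"cosine distance:   {distance:.3f}")
```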
- Reliable, fully local RAG agents with LLaMA3.2-3b - Langchain
- Generate LLM Embeddings On Your Local Machine
- Don’t Embed Wrong! - Matt Williams
- Python RAG Tutorial (with Local LLMs): AI For Your PDFs – pixegami
- AI for Good: Defeating Dengue with AI
- Building a High-Performance RAG Solution with Pgvectorscale and Python
- https://github.com/ryogesh/llm-rag-pgvector
- Swiss TPH OpenMalaria Wiki
- technovangelist
- https://github.com/AlbertoFormaggio1/conversational_rag_web_interface
- https://github.com/nlmatics/nlm-ingestor
- https://github.com/nlmatics/llmsherpa
- https://github.com/segment-any-text/wtpsplit
- https://github.com/aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation
- https://github.com/aurelio-labs/semantic-chunkers
- Leveraging computational tools to combat malaria: assessment and development of new therapeutics
- Systematic review on the application of machine learning to quantitative structure–activity relationship modeling against Plasmodium falciparum
- Predicting malaria outbreaks using earth observation measurements and spatiotemporal deep learning modelling: a South Asian case study from 2000 to 2017
- New Study uses AI to predict malaria outbreaks in South Asia
- Load vector embeddings up to 67x faster with pgvector and Amazon Aurora
- TF-IDF and BM25 for RAG— a complete guide
- Chunking Strategies for LLM Applications
- Simplifying RAG with PostgreSQL and PGVector
- Unleashing the power of vector embeddings with PostgreSQL
- PostgreSQL Extensions: Turning PostgreSQL Into a Vector Database With pgvector
- Late Chunking in Long-Context Embedding Models
- Chunk + Document Hybrid Retrieval with Long-Context Embeddings (Together.ai)
- Retrieval Augmented Generation (RAG) for LLMs
- Build your RAG web application with Streamlit
- Auto-Merging: RAG Retrieval Technique