
Enhancing Epidemiological Intelligence

A RAG Approach to Disease Outbreak Monitoring

Introduction

This repository contains the code for my master's thesis project at ZHAW. It is designed to be self-contained and can be run without knowledge of the thesis's broader context.

The repository contains all the elements required for a running RAG application, from data collection and pre-processing through storage, retrieval, and generation.

The corresponding text can be found in this Google Doc.

Getting Started

This RAG application consists of various elements. The setup of each is detailed below:

graph TD;
    A[Frontend UI] --> |Query| B[Embedding Model]
    B --> |Embedded Query| C[Vector Store]
    C --> |Retrieved Relevant Documents| D[Enhanced Query with Context]
    D --> E[LLM]
    E --> |Generated Response| A

Python Environment

For ease of use, this setup guide has been written with Anaconda in mind; however, it should work equally well with other (virtual) Python environments.

On macOS you must install a native ARM build if you are running on Apple Silicon (M-series processors). Otherwise, Python will default to an x86 build that runs under Rosetta (i.e. in emulation), and the ML libraries will not run at all. See also here.

CONDA_SUBDIR=osx-arm64 conda create --name pg-vector-rag python=3.12 -c conda-forge
conda activate pg-vector-rag
pip install -r requirements.txt

Should you need to remove the environment and start fresh for any reason:

conda env remove --name pg-vector-rag

Database Setup

This project uses the pgvector Postgres extension as a vector store. This allows the data to be stored alongside the embeddings, so both can be accessed easily through any SQL querying utility.

If you do not wish to run this locally, a cloud-based service such as Supabase could also be used.

There is a docker-compose.yml which sets up a local PGVector instance as a Docker container. Note: Please create an empty subdirectory pgvector_data before bringing the container up for the first time. This will be mounted as a volume within the Docker container.

mkdir pgvector_data
docker compose up -d

Next, move to the app/db folder and prepare the vector store. Ensure you have a .env file present (use the provided .example.env for guidance) and run:

python create_db.py
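Once the script has run, a quick connectivity check can confirm that the pgvector extension is installed and reachable. The snippet below is a minimal sketch using psycopg; the environment variable names are assumptions for illustration and may differ from those in .example.env:

# Minimal sanity check for the pgvector instance (hypothetical env var names).
import os
import psycopg

conn = psycopg.connect(
    host=os.getenv("DB_HOST", "localhost"),
    port=os.getenv("DB_PORT", "5432"),
    dbname=os.getenv("DB_NAME", "postgres"),
    user=os.getenv("DB_USER", "postgres"),
    password=os.getenv("DB_PASSWORD", ""),
)
with conn.cursor() as cur:
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print(cur.fetchone())  # e.g. ('0.8.0',) if the extension is installed
conn.close()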

DB Schema

erDiagram
  document {
    INTEGER id PK
    INTEGER batch "nullable"
    TEXT contents "nullable"
    TEXT summary "nullable"
    JSON meta "nullable"
    DATETIME event_date "nullable"
    TEXT url "nullable"
    DATETIME published_at "nullable"
    DATETIME created_at "nullable"
  }

  embedding {
    INTEGER id PK
    TEXT model "nullable"
    INTEGER document_id FK
    INTEGER chunk_id
    HALFVEC(256) embedding_256 "nullable"
    HALFVEC(384) embedding_384 "nullable"
    HALFVEC(512) embedding_512 "nullable"
    HALFVEC(768) embedding_768 "nullable"
    HALFVEC(1024) embedding_1024 "nullable"
    HALFVEC(1536) embedding_1536 "nullable"
    HALFVEC(3072) embedding_3072 "nullable"
    HALFVEC(4096) embedding_4096 "nullable"
    HALFVEC(8192) embedding_8192 "nullable"
  }

  v_doc_embedding {
    INTEGER document_id PK
    INTEGER embedding_id PK
    INTEGER batch "nullable"
    TEXT model "nullable"
    INTEGER chunk_id "nullable"
    TEXT contents "nullable"
    HALFVEC embedding "nullable"
    TEXT summary "nullable"
    JSON meta "nullable"
    DATETIME published_at "nullable"
    TEXT url "nullable"
  }

  country_lookup {
    INTEGER id PK
    TEXT country_code "nullable"
    TEXT country_name "nullable"
    TEXT region "nullable"
    TEXT subregion "nullable"
    DATETIME created_at "nullable"
  }

  document ||--o{ embedding : document_id

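To make the schema concrete, the sketch below shows how the v_doc_embedding view could be queried for the chunks nearest to a query embedding. Column names follow the diagram above; the vector literal and model name are placeholders for illustration:

# Hedged example: nearest-neighbour lookup against v_doc_embedding.
# `query_vec` would normally come from the embedding model; pgvector accepts
# its text representation "[x, y, ...]".
query_vec = "[0.12, -0.03, 0.57]"  # placeholder; real vectors have e.g. 384 or 768 dimensions

sql = """
    SELECT document_id, chunk_id, contents,
           embedding <=> %s::halfvec AS cosine_distance
    FROM v_doc_embedding
    WHERE model = 'nomic-embed-text'
    ORDER BY cosine_distance
    LIMIT 5;
"""
# Executing cur.execute(sql, (query_vec,)) with a psycopg cursor returns the five closest chunks.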

LangChain

This code uses LangChain to abstract away some of the lower-level interactions with our LLMs and data.

At the time of writing (Jan 2025), LangChain is quite far behind in the version of pgvector it supports (v0.2.5 – the current version is v0.3.6). There is an open PR adding support for the new features (in particular the half-precision vector type halfvec). That version of the code can be installed directly from GitHub:

pip install git+https://github.com/langchain-ai/langchain-postgres@c32f6beb108e37aad615ee3cbd4c6bd4a693a76d
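As a rough illustration of how the store can then be used from LangChain (the collection name and connection string below are assumptions, not the repo's actual configuration):

# Sketch: wiring an Ollama embedding model to a PGVector store via LangChain.
from langchain_ollama import OllamaEmbeddings
from langchain_postgres import PGVector

store = PGVector(
    embeddings=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="who_dons",  # hypothetical collection name
    connection="postgresql+psycopg://postgres:postgres@localhost:5432/postgres",
)
docs = store.similarity_search("malaria outbreaks in Kenya", k=4)

Note that langchain-postgres manages its own tables, while this repository also stores documents and embeddings in the schema shown above, so the snippet is only meant to illustrate the library's API.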

Data

A list of interesting data sources pertaining to malaria and other tropical diseases can be found in the subfolders under ./data-collection. The code in this repo currently uses data scraped from WHO Disease Outbreak News (DONs) to populate the RAG knowledge base. It should be fairly straightforward to adapt it to other sources.

Acquisition & Pre-Processing

See ./app/who-don-retriever for scripts to scrape and clean the data. In this directory, you'll also find a README outlining the process.

Note: At the time of writing, a non-packaged version of the Markdownify library must be installed, as it has better support for tables in Markdown. Some of the DONs contain HTML tables which would otherwise be lost:

pip install git+https://github.com/matthewwithanm/python-markdownify@3026602686f9a77ba0b2e0f6e0cbd42daea978f5
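As a small illustration of why this matters, the patched version converts HTML tables into Markdown rather than dropping them:

# Sketch: converting an HTML table (as found in some DONs) to Markdown.
from markdownify import markdownify as md

html = "<table><tr><th>Country</th><th>Cases</th></tr><tr><td>Kenya</td><td>123</td></tr></table>"
print(md(html))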

Data Ingestion

Copy the pre-processed data produced in the step above to ./app/data, from where it can be loaded into the database. Populate the document store by running:

python load_documents.py

This will read the CSV data file, do some light pre-processing and load the documents into the database.
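Conceptually, the loading step looks roughly like the sketch below (the file name and column names are assumptions for illustration; load_documents.py contains the actual logic):

# Rough sketch of the ingestion step, not the repo's exact implementation.
import pandas as pd

df = pd.read_csv("data/who_dons.csv")  # hypothetical file name
df["published_at"] = pd.to_datetime(df["published_at"], errors="coerce")
records = df[["contents", "url", "published_at"]].to_dict(orient="records")
# Each record would then be inserted as a row in the `document` table.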

LLM

All components of this application can be run locally without accessing resources in the cloud. This includes the language and embedding models.

Ollama

Ollama makes it easy to run LLMs locally. Download and run the installer. Once installed, run your model of choice, e.g. llama3.2 3B.

The following are some suggested options for running the model on a separate computer on the same network:

export OLLAMA_HOST=0.0.0.0
export OLLAMA_KEEP_ALIVE=15m
export OLLAMA_FLASH_ATTENTION=true
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve

By default, Ollama will expose its API on port 11434.

Also by default, Ollama limits its context window to 2048 tokens, which is too low for our use case. Adjust it before running the model, or simply create your own model variant with an expanded context window. To do so:

ollama run llama3.2
...
>>> /set parameter num_ctx 16768
>>> /save llama3.2_16kctx
>>> /bye
...
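The expanded-context model can then be used from Python, for example via LangChain's Ollama integration (a sketch; the host address below is a hypothetical LAN address for a remote Ollama instance):

# Sketch: calling the custom model over the network with LangChain.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2_16kctx",
    base_url="http://192.168.1.50:11434",  # hypothetical address of the Ollama host
    num_ctx=16768,  # can alternatively be set here instead of via /set parameter
)
print(llm.invoke("Summarise the most recent WHO Disease Outbreak News on malaria.").content)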

Embeddings

By default, this app uses the all-MiniLM-L6-v2 Sentence Transformer model to generate the embeddings for our vector store. Another model which works very well for embeddings is nomic-embed-text-v1.5. Run the following to pull the models into Ollama:

ollama pull all-minilm
...
ollama pull nomic-embed-text

Once the models have been downloaded, the next step is to create the embeddings for our documents. Run ./app/db/load_embeddings.py:

python load_embeddings.py

Note: This will take some time to process – expect at least 15 minutes on a modern Mac laptop.
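As a quick sanity check, both models can be exercised directly; all-minilm produces 384-dimensional vectors and nomic-embed-text 768-dimensional ones, matching the halfvec(384) and halfvec(768) columns in the schema above (a sketch using LangChain's Ollama integration):

# Sketch: confirm the embedding dimensionality of both models.
from langchain_ollama import OllamaEmbeddings

for model in ("all-minilm", "nomic-embed-text"):
    vec = OllamaEmbeddings(model=model).embed_query("malaria outbreak in Kenya")
    print(model, len(vec))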

Retrieval

TODO:

UI

If you've made it this far – great! At this point, run the front-end application (from within the app directory):

flask run

You should now be able to reach the chat-style interface at http://127.0.0.1:5000.

Screenshot of the chat interface: ./app/static/rag-chat-ui

Evaluation

Sample Questions

  • In which countries is Malaria most prevalent?
  • Which diseases are prevalent in Kenya?
  • Which were the largest disease outbreaks in the last 20 years?
  • Where were outbreaks with the most severe impacts, e.g. deaths?

Notes on Batches

Batch 0

  • Baseline attempt
  • DONs were put together mainly from two fields
  • Embeddings loaded for nomic and embed-all
  • I think this data is now stored in backup tables in the db (check)

Batch 1

  • Refined attempt
  • Full DONs were pieced together from all relevant fields
  • The documents were distilled into Markdown for storage in the db
  • Embeddings were then generated from them for both nomic and embed-all (sometimes exceeding the context window)
  • Side quest: The Markdown docs were summarized with gpt-4o-mini
  • Could make a Batch 2 with embeddings for these summaries
  • Alternatively, these could be embedded on the fly as requests are processed (though embedding is slow)

Cosine Similarity in Vector Search

What is Cosine Similarity?

Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It's a measure of orientation rather than magnitude.

  • Range: -1 to 1 (for normalized vectors, which is typical in text embeddings)
  • 1: Vectors point in the same direction (most similar)
  • 0: Vectors are orthogonal (unrelated)
  • -1: Vectors point in opposite directions (most dissimilar)
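A tiny numerical example makes these ranges concrete:

# Cosine similarity and cosine distance for two toy vectors.
import numpy as np

a = np.array([1.0, 0.0, 1.0])
b = np.array([0.5, 0.5, 1.0])

cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
cos_dist = 1.0 - cos_sim  # what pgvector's <=> operator returns

print(round(cos_sim, 3), round(cos_dist, 3))  # 0.866 0.134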

Cosine Distance

In pgvector, the <=> operator computes cosine distance, which is 1 - cosine similarity.

  • Range: 0 to 2
  • 0: Identical vectors (most similar)
  • 1: Orthogonal vectors
  • 2: Opposite vectors (most dissimilar)

Interpreting Results

When you get results from similarity_search:

  • Lower distance values indicate higher similarity.
  • A distance of 0 would mean exact match (rarely happens with embeddings).
  • Distances closer to 0 indicate high similarity.
  • Distances around 1 suggest little to no similarity.
  • Distances approaching 2 indicate opposite meanings (rare in practice).
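With the LangChain PGVector store from the earlier sketch, the distance is returned alongside each document by similarity_search_with_score, so these rules of thumb can be applied directly (again a sketch with hypothetical connection details):

# Sketch: inspecting cosine distances returned by the vector store.
from langchain_ollama import OllamaEmbeddings
from langchain_postgres import PGVector

store = PGVector(
    embeddings=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="who_dons",  # hypothetical collection name
    connection="postgresql+psycopg://postgres:postgres@localhost:5432/postgres",
)

for doc, distance in store.similarity_search_with_score("cholera outbreak in Yemen", k=3):
    print(round(distance, 3), doc.page_content[:80])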

