Biomedical RAG Agent

A clinician/researcher-facing agent that answers biomedical questions by retrieving from open-access literature (PubMed/PMC) and returning cited, grounded answers — benchmarked across frontier vs. open-weight models with a rigorous faithfulness eval.

Framing: this is an evidence-retrieval tool for clinicians and researchers, not patient-facing medical advice. Answers are grounded in retrieved literature and the agent abstains when the evidence is insufficient.

Why

It pairs a clickable demo (the agent) with a paper-style evaluation study (the eval harness). The differentiator is not "a RAG chatbot" — it's the measurement: retrieval quality, answer faithfulness, hallucination rate, citation accuracy, and abstention correctness, reported across frontier and open-weight models on public biomedical benchmarks.

Architecture

ingest → chunk → embed → vector store (pgvector) + BM25
      → hybrid retriever (optional reranker)
      → model-agnostic synthesizer → cited answer (+ abstain)
      → eval harness runs the pipeline over benchmarks → metrics report

| Layer | Package | Role |
| --- | --- | --- |
| Ingestion | `biomed_rag.ingest` | Acquire corpus (PubMedQA contexts + PMC-OA slice), chunk |
| Retrieval | `biomed_rag.retrieval` | Dense (fastembed/BGE) + BM25 hybrid (RRF) over pgvector |
| Models | `biomed_rag.models` | Model-agnostic LLM interface for frontier and open models, via LLMGateway |
| Agent | `biomed_rag.agent` | Retrieve → synthesize cited answer → abstain |
| Eval | `biomed_rag.eval` | Metric suite + benchmark runners + report |
| API | `api/` | FastAPI service backing the web demo |
| Frontend | `frontend/` | React + Vite demo (Milestone 6) |
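
The hybrid retriever combines the dense and BM25 rankings with reciprocal rank fusion (RRF). The following is a minimal sketch of that fusion step, not the code in `biomed_rag.retrieval`: the function name `rrf_fuse`, the constant `k=60` (a commonly used RRF default), the `top_n=6` cut-off, and the passage ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=6):
    """Fuse ranked lists of passage ids with reciprocal rank fusion.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k and top_n are illustrative defaults.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


dense_hits = ["p3", "p1", "p7"]  # hypothetical dense (pgvector) ranking
bm25_hits = ["p1", "p9", "p3"]   # hypothetical BM25 ranking
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)  # ['p1', 'p3', 'p9', 'p7']
```

RRF only looks at ranks, so the dense similarity scores and BM25 scores never need to be put on a common scale.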

Eval metrics

Retrieval: Recall@k, nDCG/MRR
Grounding/faithfulness: claim-level support vs. retrieved context
Hallucination rate: unsupported claims per answer
Citation accuracy: cited passage actually supports the claim
Abstention correctness: whether the agent declines when evidence is weak, and whether declining was the right call
Task accuracy: PubMedQA (yes/no/maybe); BioASQ stretch set
Comparison axis: all of the above × {frontier, open} + cost & latency
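
The retrieval metrics listed above can be sketched for a single question as follows. This assumes binary relevance (a passage is either a gold passage or not); the function names and the example ids are hypothetical, and the harness in `biomed_rag.eval` may define the metrics differently.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """nDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["p4", "p1", "p8", "p2"]  # hypothetical ranked retriever output
relevant = {"p1", "p2"}               # hypothetical gold passages

r_at_3 = recall_at_k(retrieved, relevant, k=3)  # 0.5: only p1 is in the top 3
rr = reciprocal_rank(retrieved, relevant)       # 0.5: first hit at rank 2
ndcg = ndcg_at_k(retrieved, relevant, k=4)      # about 0.651
```

MRR is the mean of `reciprocal_rank` over all benchmark questions; Recall@k and nDCG@k are averaged the same way.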

Demo

FastAPI backend + React/Vite frontend: ask a biomedical question, get a grounded answer with inline citations linked to PubMed, an abstention state when evidence is weak, and a model selector (frontier / open). Run both servers (after docker compose up -d and one ingest):

# backend (port 8000)
PYTHONPATH=src .venv/Scripts/python.exe -m uvicorn api.main:app --port 8000
# frontend (port 5173) — in frontend/
npm install && npm run dev

Status

Milestone 6: web demo live (see Demo above). Milestone 5: frontier vs. open comparison plus grounding ablation. Full results are in eval_results/comparison.md. The table below uses gpt-4.1 as judge, hybrid retrieval, and 150 questions over a 1363-passage corpus:

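| model | tier | acc | macroF1 | faith | halluc | cit_acc | recall@6 | abst(off)↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4.1-mini | frontier | 0.63 | 0.53 | 0.94 | 0.06 | 0.84 | 0.82 | 0.80 |
| claude-haiku-4-5 | frontier | 0.58 | 0.53 | 0.96 | 0.04 | 0.87 | 0.82 | 1.00 |
| deepseek-v3.2 | open | 0.60 | 0.51 | 0.98 | 0.02 | 0.97 | 0.82 | 1.00 |
| qwen-flash | open | 0.59 | 0.50 | 0.91 | 0.09 | 0.93 | 0.82 | 0.60 |
| gpt-4.1-mini (closed-book) | ablation | 0.45 | 0.37 | – | – | – | – | – |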
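
The table reports both accuracy and macro-F1 for the yes/no/maybe task. Macro-F1 averages the per-class F1 with equal weight, so a class the model rarely gets right pulls it below accuracy. The sketch below illustrates this with made-up labels; `macro_f1` is a hypothetical helper, and how the harness scores abstentions within these metrics is not stated here.

```python
def macro_f1(gold, pred, labels=("yes", "no", "maybe")):
    """Unweighted mean of per-class F1 over the given labels."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for illustration only (not benchmark data).
gold = ["yes", "yes", "yes", "no", "no", "maybe"]
pred = ["yes", "yes", "yes", "no", "yes", "yes"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # 4/6, about 0.667
score = macro_f1(gold, pred)                                    # about 0.472
```

Here the single missed "maybe" contributes an F1 of 0 for its class, so macro-F1 lands well below accuracy even though four of six predictions are correct.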

Findings:

Retrieval works: closed-book → hybrid RAG lifts accuracy 0.45 → 0.63 (+18 pts).
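Open ≥ frontier on grounding: DeepSeek-V3.2 has the best faithfulness (0.98), hallucination rate (0.02), and citation accuracy (0.97), while its accuracy (0.60) falls between the two frontier models (0.58 and 0.63).
Caution differs: Qwen-flash answers aggressively (lowest faithfulness, 0.91); Claude and DeepSeek abstain conservatively. recall@6 is identical across models (0.82), as expected when every model receives the same retrieved passages; this serves as a harness sanity check.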
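
In the results table, the faith and halluc columns sum to 1.00 in every row, which is consistent with both being fractions of the same pool of judged claims. The sketch below aggregates claim-level judgments under that assumption. The `ClaimJudgment` fields and `grounding_metrics` are hypothetical names, and the actual harness may aggregate differently (for example per answer rather than over pooled claims).

```python
from dataclasses import dataclass


@dataclass
class ClaimJudgment:
    """One judged claim from a generated answer (hypothetical schema)."""
    supported: bool          # judge: claim is backed by the retrieved context
    cited: bool              # the answer attached a citation to this claim
    citation_supports: bool  # judge: the cited passage backs the claim


def grounding_metrics(judgments):
    """Pool claim judgments into faithfulness, hallucination rate, citation accuracy."""
    if not judgments:
        return {"faith": None, "halluc": None, "cit_acc": None}
    n = len(judgments)
    supported = sum(j.supported for j in judgments)
    cited = [j for j in judgments if j.cited]
    cit_acc = (
        sum(j.citation_supports for j in cited) / len(cited) if cited else None
    )
    return {
        "faith": supported / n,         # share of supported claims
        "halluc": (n - supported) / n,  # share of unsupported claims
        "cit_acc": cit_acc,             # share of citations that support their claim
    }


# Made-up judgments for illustration only.
judgments = [
    ClaimJudgment(supported=True, cited=True, citation_supports=True),
    ClaimJudgment(supported=True, cited=True, citation_supports=False),
    ClaimJudgment(supported=True, cited=False, citation_supports=False),
    ClaimJudgment(supported=False, cited=True, citation_supports=True),
]
metrics = grounding_metrics(judgments)
# faith = 0.75, halluc = 0.25, cit_acc = 2/3
```

Keeping citation accuracy separate from faithfulness matters because a claim can be supported by the retrieved context while pointing at the wrong passage, as in the second judgment above.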

docker compose up -d                                                  # Postgres + pgvector
PYTHONPATH=src .venv/Scripts/python.exe scripts/run_eval.py --n 40     # single model, full metrics
PYTHONPATH=src .venv/Scripts/python.exe scripts/run_compare.py --n 150 # frontier vs open + ablation
PYTHONPATH=src .venv/Scripts/python.exe scripts/tune_abstain.py        # abstention threshold sweep (free)

Earlier milestone scripts: run_m2.py is the BM25-only skeleton, and run_m3.py runs hybrid retrieval diagnostics. See SPEC.md for milestones.
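
An abstention threshold sweep like the one `scripts/tune_abstain.py` is described as running can be sketched as below. This is not the script itself. It assumes the agent abstains when the top retrieval score falls under a threshold, and that each tuning example records whether the question was actually answerable from the corpus; both the data layout and the name `sweep_abstain_threshold` are hypothetical. Reading "(free)" as "no model calls needed" is also an assumption, though it would fit a sweep that uses retrieval scores only.

```python
def sweep_abstain_threshold(examples, thresholds):
    """Score each threshold by abstention correctness.

    examples: list of (top_score, answerable) pairs.
    A decision is correct when the agent abstains on an unanswerable
    question or answers an answerable one.
    """
    if not examples:
        return []
    results = []
    for threshold in thresholds:
        correct = 0
        for top_score, answerable in examples:
            abstained = top_score < threshold
            correct += abstained != answerable
        results.append((threshold, correct / len(examples)))
    return results


# Made-up (top retrieval score, answerable) pairs for illustration only.
examples = [(0.82, True), (0.74, True), (0.41, False), (0.35, False), (0.55, True)]
sweep = sweep_abstain_threshold(examples, thresholds=[0.3, 0.5, 0.7])
best_threshold, best_score = max(sweep, key=lambda row: row[1])
print(sweep)           # [(0.3, 0.6), (0.5, 1.0), (0.7, 0.8)]
print(best_threshold)  # 0.5
```

A threshold that is too low never abstains, and one that is too high declines questions the corpus can answer, so the sweep looks for the value between those failure modes.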

Setup

python -m venv .venv && . .venv/Scripts/activate   # Windows; use bin/activate on *nix
pip install -e ".[dev]"
cp .env.example .env                               # then fill in keys

License

MIT — see LICENSE.
