Feat/pubmed knowledge base integration by AchiTsa · Pull Request #83 · arvindsis11/Ai-Healthcare-Chatbot

AchiTsa · 2026-05-04T12:17:34Z

Key Changes:

Data Acquisition

New Script: scripts/fetch_pubmed_data.py — An asynchronous tool to query NCBI Entrez API, fetch medical abstracts, and parse XML into structured YAML.
Updated: scripts/ingest_data.py — Orchestrates the ingestion of new PubMed data into the vector database.
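
As a rough illustration of the fetch loop described above, the sketch below spaces out requests with a fixed pause and keeps going when one term fails. The names `fetch_all_terms`, `fetch_one`, and `_fake_fetch` are hypothetical and are not taken from scripts/fetch_pubmed_data.py; the real script's HTTP client and signatures may differ.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all_terms(
    terms: List[str],
    fetch_one: Callable[[str], Awaitable[List[str]]],
    delay_seconds: float = 1.0,
) -> Dict[str, List[str]]:
    """Fetch results for each term sequentially, pausing between requests.

    `fetch_one` is a hypothetical coroutine standing in for the real Entrez
    call. A failure for one term is recorded as an empty list so that the
    remaining terms are still processed.
    """
    results: Dict[str, List[str]] = {}
    for index, term in enumerate(terms):
        if index > 0:
            # Pause before every request except the first one.
            await asyncio.sleep(delay_seconds)
        try:
            results[term] = await fetch_one(term)
        except Exception as exc:  # broad on purpose: keep the run going
            print(f"Fetch failed for {term!r}: {exc}")
            results[term] = []
    return results


async def _fake_fetch(term: str) -> List[str]:
    """Stand-in for a network call, used only to demonstrate the loop."""
    if term == "broken":
        raise RuntimeError("simulated network error")
    return [f"{term}-1", f"{term}-2"]


# delay_seconds=0.0 keeps the demo instant; a real run would keep the pause.
demo = asyncio.run(
    fetch_all_terms(["fever", "broken", "fracture"], _fake_fetch, delay_seconds=0.0)
)
```

Passing the fetch coroutine in as a parameter is only a convenience for the demo: it lets the loop be exercised without touching the network.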

Enhanced RAG Pipeline

Intelligent Ingestion (backend/app/rag/data_ingestion.py): Added content-based routing and specialized chunking that preserves sentence boundaries for medical text.
Text Processing (backend/app/rag/text_processing.py): Improved YAML loading for nested structures and added medical term normalization.
Vector DB Compatibility (backend/app/repositories/vector_db.py): Implemented _sanitize_metadata to flatten complex PubMed metadata (e.g., author lists) for ChromaDB.
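
To make the sentence-boundary chunking idea concrete, here is a minimal sketch. It assumes a naive regex splitter and a character budget; the function name `chunk_by_sentence` and the `max_chars` parameter are hypothetical, and the actual logic in backend/app/rag/data_ingestion.py may work differently (for example, by using spaCy for sentence detection).

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_chars: int = 500) -> List[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own
    chunk rather than being cut mid-sentence.
    """
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    # Abbreviations such as "e.g." would be split incorrectly by this rule.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Fever is common. It often resolves alone. Seek care if it persists."
chunks = chunk_by_sentence(sample, max_chars=45)
```

With the 45-character budget, the first two sentences fit into one chunk and the third starts a new one, so no chunk ends in the middle of a sentence.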

Documentation

New Guide: docs/KNOWLEDGE_BASE_INTEGRATION.md — Detailed instructions on how to fetch, update, and maintain the PubMed knowledge base.

Quality Assurance

Added comprehensive test suites:
- tests/test_pubmed_fetcher.py: API and XML parsing validation.
- tests/test_pubmed_ingestion.py: Ingestion pipeline verification.
- tests/test_pubmed_api.py: End-to-end integration tests.

Impact
The assistant can now retrieve and cite peer-reviewed medical research, shifting from a general-purpose chatbot to an evidence-based healthcare intelligence tool.

Closes #29

AchiTsa · 2026-05-04T12:18:46Z

@luca55466 could you please review this before I take it out of Draft?

luca55466 · 2026-05-17T14:10:07Z

Hey @AchiTsa,

Great work on this PR. Connecting the chatbot to PubMed and grounding responses with peer-reviewed abstracts is a meaningful upgrade — it shifts the whole thing from a rule-based Q&A tool to something that can actually cite evidence. The architecture is clean and fits naturally into the existing RAG pipeline.

I ran this locally end-to-end: the fetcher pulls 30 abstracts across all 6 search terms from NCBI without issues, and the ingestion pipeline lands them correctly in ChromaDB as 43 text_chunk entries alongside the existing 130 Q&A pairs. The stack comes up cleanly.

Here's my review:

Logic & Implementation

Async Fetcher: Splitting the fetcher into dedicated fetch_pubmed_ids, fetch_pubmed_abstracts, and parse_pubmed_xml functions is exactly the right call. Error handling per step means one bad article or a failed request won't silently kill the whole run.
Rate Limiting: The 1-second delay between NCBI requests is correct and shows you actually read the Entrez docs. Without it, the 3 req/sec limit that applies to requests made without an API key would bite immediately on any meaningful run.
_sanitize_metadata: This is a neat fix for the ChromaDB compatibility issue. Flattening the author list to a comma-separated string is pragmatic and unblocks ingestion without overcomplicating things.
Path Fix in ingest_data.py: Good catch — the old Path(__file__).parent / "data" was pointing to scripts/data/ which doesn't exist. The fix to repo_root / "data" is the correct way to resolve this and one of those bugs that would've been annoying to diagnose cold.
rglob in text_processing.py: Small but necessary change. Without it the loader would never pick up anything inside data/pubmed/. Glad this was caught.
data_ingestion.py refactor: The old conversation processing was parsing raw YAML text manually using - - prefix matching, which was fragile. The new metadata-based routing is much cleaner — text_processing.py now handles splitting per item upstream, and data_ingestion.py just routes by type. Good simplification.
YAML Fixes: Quoting the colon-containing strings across the data/ files and fixing the indentation in headache.yml and fracture.yml are legitimate fixes, not just cleanup. These would've caused silent parse failures.
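
For readers who have not hit the ChromaDB metadata restriction before, the flattening idea can be sketched as follows. This is an illustration only, not the PR's `_sanitize_metadata`: the function name `sanitize_metadata`, the choice to drop `None` values, and the JSON fallback for nested dicts are all assumptions. The one firm premise is that ChromaDB accepts str, int, float, and bool metadata values.

```python
import json
from typing import Any, Dict, Union

Scalar = Union[str, int, float, bool]


def sanitize_metadata(metadata: Dict[str, Any]) -> Dict[str, Scalar]:
    """Return a flat copy of `metadata` containing only scalar values."""
    clean: Dict[str, Scalar] = {}
    for key, value in metadata.items():
        if value is None:
            # Assumption: missing values are dropped rather than stored.
            continue
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        elif isinstance(value, (list, tuple)):
            # e.g. an author list becomes "A. Smith, B. Jones"
            clean[key] = ", ".join(str(item) for item in value)
        else:
            # Assumption: anything else (such as a nested dict) is serialized to JSON.
            clean[key] = json.dumps(value, default=str)
    return clean


raw = {
    "pmid": "12345",
    "authors": ["A. Smith", "B. Jones"],
    "year": 2024,
    "journal": None,
    "flags": {"reviewed": True},
}
flat = sanitize_metadata(raw)
```

One trade-off of the comma-separated form is that individual authors can no longer be filtered on exactly, which is usually acceptable when the field is only shown in citations.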

Suggestions

1. Hardcoded topic in the fetcher

Right now every fetched article gets "topic": "Medical Literature" regardless of which search term produced it. That means the RAG system can't differentiate a fever question from a fracture one at retrieval time — they all land in the same bucket. I confirmed this locally: all 43 PubMed chunks show up under Medical Literature in ChromaDB regardless of their origin. Since term is already in scope in main(), it's a small change to thread it through:

# In parse_pubmed_xml, add the term parameter:
def parse_pubmed_xml(xml_content: str, term: str) -> List[Dict[str, Any]]:

# Then in the metadata block, replace:
"topic": "Medical Literature"
# with:
"topic": term

# And update the call site in main():
articles = parse_pubmed_xml(xml_content, term)

This would make retrieval meaningfully more precise at basically zero cost.
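
Put together, the change could look roughly like the sketch below. It is a simplified stand-in for the real function: the returned dict shape (`text` plus `metadata`) and the sample XML are assumptions made for illustration, and only the `term` parameter and the `topic` field come from the suggestion above.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def parse_pubmed_xml(xml_content: str, term: str) -> List[Dict[str, Any]]:
    """Parse PubMed XML and tag every article with the search term used."""
    root = ET.fromstring(xml_content)
    articles: List[Dict[str, Any]] = []
    for article_tag in root.findall(".//PubmedArticle"):
        # findtext returns the default when the element is missing.
        pmid = article_tag.findtext(".//PMID", default="")
        title = article_tag.findtext(".//ArticleTitle", default="")
        articles.append(
            {
                "text": title,
                # The topic now reflects the query instead of a fixed label.
                "metadata": {"pmid": pmid, "topic": term},
            }
        )
    return articles


# Hypothetical, heavily trimmed response used only for this demo.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article><ArticleTitle>Managing fever in adults</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()

articles = parse_pubmed_xml(SAMPLE_XML, "fever")
```

With the topic stored per article, a retrieval query can filter or boost on that field instead of treating every PubMed chunk as one bucket.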

2. Null-pointer risk in parse_pubmed_xml

The title and PMID extraction assumes those XML elements always exist, but .find() returns None when they don't, and accessing .text on None raises AttributeError. The broad except catches it, so the run won't crash, but the affected articles get dropped silently with no useful log entry. Worth being explicit:

pmid_el = article_tag.find(".//PMID")
title_el = article_tag.find(".//ArticleTitle")
if pmid_el is None or title_el is None:
    logger.warning("Skipping article with missing PMID or title")
    continue
pmid = pmid_el.text
title = title_el.text
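
A self-contained version of that guard, runnable against a small sample, might look like this. The function name `extract_ids_and_titles`, the logger name, and the sample XML are hypothetical. The use of `itertext()` and the empty-value check go slightly beyond the snippet above and are included as assumptions about what a robust parser would want.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Dict, List

logger = logging.getLogger("pubmed_fetcher_sketch")


def extract_ids_and_titles(xml_content: str) -> List[Dict[str, str]]:
    """Return PMID/title pairs, skipping articles where either is missing."""
    root = ET.fromstring(xml_content)
    parsed: List[Dict[str, str]] = []
    for article_tag in root.findall(".//PubmedArticle"):
        pmid_el = article_tag.find(".//PMID")
        title_el = article_tag.find(".//ArticleTitle")
        if pmid_el is None or title_el is None:
            logger.warning("Skipping article with missing PMID or title")
            continue
        # .text can be None for an empty element, so fall back to "".
        pmid = (pmid_el.text or "").strip()
        # itertext() also collects text nested inside inline tags such as <i>.
        title = "".join(title_el.itertext()).strip()
        if not pmid or not title:
            logger.warning("Skipping article with empty PMID or title")
            continue
        parsed.append({"pmid": pmid, "title": title})
    return parsed


# Hypothetical sample: the second article has no ArticleTitle element.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><ArticleTitle>Stress <i>fracture</i> care</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()

parsed = extract_ids_and_titles(SAMPLE_XML)
```

The second article is skipped with a warning instead of raising, so a missing title shows up in the logs rather than disappearing inside a broad except.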

3. Version pins in requirements.txt

Loosening pandas, spacy, and pydantic-settings to >= without an upper bound means a future pip install could pull in a breaking major version without warning. Something like pandas>=2.2.3,<3.0.0 would be safer. Also worth adding a quick comment on the pydantic-settings change specifically — dropping from ==2.7.1 to >=2.5.0 looks like a compatibility fix but it's not obvious why from the diff.

Verdict: The feature does what it says and the core changes are solid. The one thing worth addressing before merging is the hardcoded topic metadata — real impact on retrieval quality, confirmed locally. Everything else is minor polish.

Happy to see this merged once that's tidied up!

AchiTsa and others added 4 commits March 23, 2026 18:28

Fix of fix/dockerignore-issue-59

42aecbc

Merge branch 'arvindsis11:master' into master

5da1401

Merge branch 'arvindsis11:master' into master

23a86b2

implementation of the feature of pubmed knowledge integration

0fec931

AchiTsa and others added 2 commits May 17, 2026 17:48

incoporated suggested changes from review

02c82a2

Merge branch 'master' into feat/pubmed-knowledge-base-integration

e5d6b91

AchiTsa marked this pull request as ready for review May 17, 2026 14:50

AchiTsa requested a review from arvindsis11 May 17, 2026 14:50

AchiTsa and others added 2 commits May 31, 2026 10:26

Merge branch 'master' into feat/pubmed-knowledge-base-integration

75d7436

resolved dependency issue

87ba4d5

AchiTsa marked this pull request as draft June 1, 2026 12:30

AchiTsa marked this pull request as ready for review June 1, 2026 12:32

AchiTsa wants to merge 8 commits into arvindsis11:master from AchiTsa:feat/pubmed-knowledge-base-integration
