Feat/pubmed knowledge base integration #83

Open
AchiTsa wants to merge 8 commits into arvindsis11:master from AchiTsa:feat/pubmed-knowledge-base-integration

AchiTsa commented May 4, 2026

Key Changes:

Data Acquisition

  • New Script: scripts/fetch_pubmed_data.py, an asynchronous tool that queries the NCBI Entrez API, fetches medical abstracts, and parses the XML into structured YAML.
  • Updated: scripts/ingest_data.py — Orchestrates the ingestion of new PubMed data into the vector database.
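
As a rough illustration of the fetch loop, here is a minimal sketch that requests one search term at a time, pauses between requests, and keeps going when a single request fails. The names fetch_terms_sequentially, fetch_one, and delay_seconds are hypothetical and not taken from scripts/fetch_pubmed_data.py, and the fetch function is a stand-in that makes no network call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_terms_sequentially(
    terms: List[str],
    fetch_one: Callable[[str], Awaitable[str]],
    delay_seconds: float = 1.0,
) -> Dict[str, str]:
    """Fetch each term in turn, pausing between consecutive requests."""
    results: Dict[str, str] = {}
    for index, term in enumerate(terms):
        if index > 0:
            # Space out requests so the remote rate limit is not exceeded.
            await asyncio.sleep(delay_seconds)
        try:
            results[term] = await fetch_one(term)
        except Exception as exc:  # illustrative only; narrow this in real code
            # Report the failure and continue with the remaining terms.
            print(f"Skipping {term!r}: {exc}")
    return results


async def _fake_fetch(term: str) -> str:
    # Hypothetical stand-in for an HTTP request to the Entrez API.
    if term == "bad":
        raise RuntimeError("simulated request failure")
    return f"<xml for {term}>"


demo = asyncio.run(
    fetch_terms_sequentially(["fever", "bad", "fracture"], _fake_fetch, delay_seconds=0.01)
)
```

Passing the fetch function in as a parameter keeps the pacing logic testable without network access.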

Enhanced RAG Pipeline

  • Intelligent Ingestion (backend/app/rag/data_ingestion.py): Added content-based routing and specialized chunking that preserves sentence boundaries for medical text.
  • Text Processing (backend/app/rag/text_processing.py): Improved YAML loading for nested structures and added medical term normalization.
  • Vector DB Compatibility (backend/app/repositories/vector_db.py): Implemented _sanitize_metadata to flatten complex PubMed metadata (e.g., author lists) for ChromaDB.
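
To illustrate the metadata flattening, here is a minimal sketch, assuming the vector store only accepts scalar metadata values (str, int, float, bool). The function below is a hypothetical stand-in and is not the _sanitize_metadata implementation from backend/app/repositories/vector_db.py; the field names in the example are also assumptions.

```python
from typing import Any, Dict, Union

Scalar = Union[str, int, float, bool]


def sanitize_metadata(metadata: Dict[str, Any]) -> Dict[str, Scalar]:
    """Return a copy of metadata that contains only scalar values."""
    clean: Dict[str, Scalar] = {}
    for key, value in metadata.items():
        if value is None:
            # Drop missing values instead of storing a placeholder.
            continue
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        elif isinstance(value, (list, tuple)):
            # Flatten sequences such as author lists into one string.
            clean[key] = ", ".join(str(item) for item in value)
        else:
            # Fall back to the string form for any other type.
            clean[key] = str(value)
    return clean


example = sanitize_metadata(
    {"pmid": "123", "authors": ["A. Smith", "B. Jones"], "year": 2024, "doi": None}
)
```

Joining lists into a single string loses the ability to filter on individual authors, which may or may not matter for retrieval.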

Documentation

  • New Guide: docs/KNOWLEDGE_BASE_INTEGRATION.md — Detailed instructions on how to fetch, update, and maintain the PubMed knowledge base.

Quality Assurance

  • Added comprehensive test suites:
    • tests/test_pubmed_fetcher.py: API and XML parsing validation.
    • tests/test_pubmed_ingestion.py: Ingestion pipeline verification.
    • tests/test_pubmed_api.py: End-to-end integration tests.

Impact
The assistant can now retrieve and cite peer-reviewed medical research, shifting from a general-purpose chatbot to an evidence-based healthcare intelligence tool.

Closes #29

AchiTsa commented May 4, 2026

@luca55466 could you please review this before I take it out of Draft?

@luca55466
Contributor

Hey @AchiTsa,

Great work on this PR. Connecting the chatbot to PubMed and grounding responses with peer-reviewed abstracts is a meaningful upgrade — it shifts the whole thing from a rule-based Q&A tool to something that can actually cite evidence. The architecture is clean and fits naturally into the existing RAG pipeline.

I ran this locally end-to-end: the fetcher pulls 30 abstracts across all 6 search terms from NCBI without issues, and the ingestion pipeline lands them correctly in ChromaDB as 43 text_chunk entries alongside the existing 130 Q&A pairs. The stack comes up cleanly.

Here's my review:

Logic & Implementation

  • Async Fetcher: Splitting the fetcher into dedicated fetch_pubmed_ids, fetch_pubmed_abstracts, and parse_pubmed_xml functions is exactly the right call. Error handling per step means one bad article or a failed request won't silently kill the whole run.
  • Rate Limiting: The 1-second delay between NCBI requests is a safe choice and shows you actually read the Entrez docs. Without any delay, the limit of 3 requests per second (for clients without an API key) would bite immediately on any meaningful run.
  • _sanitize_metadata: This is a neat fix for the ChromaDB compatibility issue. Flattening the author list to a comma-separated string is pragmatic and unblocks ingestion without overcomplicating things.
  • Path Fix in ingest_data.py: Good catch — the old Path(__file__).parent / "data" was pointing to scripts/data/ which doesn't exist. The fix to repo_root / "data" is the correct way to resolve this and one of those bugs that would've been annoying to diagnose cold.
  • rglob in text_processing.py: Small but necessary change. Without it the loader would never pick up anything inside data/pubmed/. Glad this was caught.
  • data_ingestion.py refactor: The old conversation processing was parsing raw YAML text manually using - - prefix matching, which was fragile. The new metadata-based routing is much cleaner — text_processing.py now handles splitting per item upstream, and data_ingestion.py just routes by type. Good simplification.
  • YAML Fixes: Quoting the colon-containing strings across the data/ files and fixing the indentation in headache.yml and fracture.yml are legitimate fixes, not just cleanup. These would've caused silent parse failures.
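
For anyone skimming, the rglob point can be reproduced with a small sketch. It builds a throwaway directory that mimics data/ with a pubmed/ subfolder (the file names are hypothetical) and compares glob with rglob.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    (data_dir / "pubmed").mkdir(parents=True)
    (data_dir / "headache.yml").write_text("placeholder\n")
    (data_dir / "pubmed" / "fever.yml").write_text("placeholder\n")

    # glob("*.yml") only matches files directly inside data_dir.
    top_level = sorted(
        p.relative_to(data_dir).as_posix() for p in data_dir.glob("*.yml")
    )
    # rglob("*.yml") also descends into subdirectories such as pubmed/.
    recursive = sorted(
        p.relative_to(data_dir).as_posix() for p in data_dir.rglob("*.yml")
    )

print(top_level)
print(recursive)
```

With a non-recursive pattern, anything placed under data/pubmed/ would never reach the loader.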

Suggestions

1. Hardcoded topic in the fetcher

Right now every fetched article gets "topic": "Medical Literature" regardless of which search term produced it. That means the RAG system can't differentiate a fever question from a fracture one at retrieval time — they all land in the same bucket. I confirmed this locally: all 43 PubMed chunks show up under Medical Literature in ChromaDB regardless of their origin. Since term is already in scope in main(), it's a small change to thread it through:

# In parse_pubmed_xml, add the term parameter:
def parse_pubmed_xml(xml_content: str, term: str) -> List[Dict[str, Any]]:

# Then in the metadata block, replace:
"topic": "Medical Literature"
# with:
"topic": term

# And update the call site in main():
articles = parse_pubmed_xml(xml_content, term)

This would make retrieval meaningfully more precise at basically zero cost.

2. Missing-element (None) risk in parse_pubmed_xml

The title and PMID extraction assumes those XML elements always exist, but .find() returns None when they don't, and accessing .text on None raises AttributeError. The broad except catches it, so the run won't crash, but the affected articles get silently dropped with no useful log. Worth being explicit:

pmid_el = article_tag.find(".//PMID")
title_el = article_tag.find(".//ArticleTitle")
if pmid_el is None or title_el is None:
    logger.warning("Skipping article with missing PMID or title")
    continue
pmid = pmid_el.text
title = title_el.text
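
Putting suggestions 1 and 2 together, a self-contained sketch could look like the following. It assumes the standard PubmedArticleSet / PubmedArticle layout and uses xml.etree.ElementTree. The returned dict shape and the sample XML are hypothetical and are not copied from the PR.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def parse_pubmed_xml(xml_content: str, term: str) -> List[Dict[str, Any]]:
    """Parse PubMed XML, tagging each article with the search term."""
    articles: List[Dict[str, Any]] = []
    root = ET.fromstring(xml_content)
    for article_tag in root.findall(".//PubmedArticle"):
        pmid_el = article_tag.find(".//PMID")
        title_el = article_tag.find(".//ArticleTitle")
        if pmid_el is None or title_el is None:
            # Log and skip instead of failing on a missing element.
            logger.warning("Skipping article with missing PMID or title")
            continue
        # itertext() also collects text nested inside inline markup.
        abstract_parts = [
            "".join(el.itertext()).strip()
            for el in article_tag.findall(".//AbstractText")
        ]
        articles.append(
            {
                "pmid": pmid_el.text,
                "title": "".join(title_el.itertext()).strip(),
                "abstract": " ".join(part for part in abstract_parts if part),
                "topic": term,  # search term instead of a fixed label
            }
        )
    return articles


SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article><ArticleTitle>No PMID here</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

parsed = parse_pubmed_xml(SAMPLE_XML, "fever")
```

The second sample article has no PMID, so it is skipped with a warning instead of disappearing inside a broad except.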

3. Version pins in requirements.txt

Loosening pandas, spacy, and pydantic-settings to >= without an upper bound means a future pip install could pull in a breaking major version without warning. Something like pandas>=2.2.3,<3.0.0 would be safer. Also worth adding a quick comment on the pydantic-settings change specifically — dropping from ==2.7.1 to >=2.5.0 looks like a compatibility fix but it's not obvious why from the diff.

Verdict: The feature does what it says and the core changes are solid. The one thing worth addressing before merging is the hardcoded topic metadata — real impact on retrieval quality, confirmed locally. Everything else is minor polish.

Happy to see this merged once that's tidied up!

AchiTsa marked this pull request as ready for review May 17, 2026 14:50
AchiTsa requested a review from arvindsis11 May 17, 2026 14:50
AchiTsa marked this pull request as draft June 1, 2026 12:30
AchiTsa marked this pull request as ready for review June 1, 2026 12:32

Development

Successfully merging this pull request may close these issues.

Feature: Integrate Medical Knowledge Base for Better AI Responses

2 participants