This project is developed as part of Google Summer of Code 2025, mentored by Google DeepMind.
See my GSoC work record for documentation of my contributions during the program.
A benchmark for evaluating audience-adaptive medical explanation quality in large language models.
MedExplain-Evals provides infrastructure for:
- Evaluating how well LLMs generate medical explanations for different audiences (physicians, nurses, patients, caregivers)
- Measuring explanation quality across six dimensions (accuracy, terminology, completeness, actionability, safety, empathy)
- Using ensemble LLM-as-Judge with weighted scoring from multiple frontier models
- Grounding evaluations against medical knowledge bases (UMLS, RxNorm, SNOMED-CT)
- Generating publication-ready analysis reports and visualizations
- Python 3.10+
- API keys for at least one LLM provider (OpenAI, Anthropic, Google)
- 10 GB+ disk space (datasets and results)
Rule-of-thumb estimates for dense decoder-only models at moderate context lengths. Long contexts and large batch sizes can dominate VRAM through the KV cache.
| Model Size | Minimum VRAM | Recommended | With Quantization |
|---|---|---|---|
| 7-9B | 8 GB | 16 GB | 5 GB |
| 13-14B | 16 GB | 24 GB | 8 GB |
| 70B+ | 40 GB | 80 GB+ | 20 GB |
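As a sanity check on these numbers, memory can be estimated from the parameter count plus the KV cache. The sketch below is illustrative only; the layer count, KV-head count, and head dimension are assumptions for a typical 8B-class architecture, not measurements of any particular model.

```python
# Rough VRAM estimate for a dense decoder-only model (illustrative assumptions only).

def estimate_vram_gb(
    n_params_b: float,        # parameter count in billions
    bytes_per_param: float,   # 2.0 for fp16/bf16, roughly 0.5-0.7 for 4-bit quantization
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    kv_bytes: float = 2.0,    # fp16 KV cache
) -> float:
    weights = n_params_b * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) x layers x KV heads x head dim x tokens x batch x bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * kv_bytes
    overhead = 0.10 * weights  # activations and buffers; a rough fudge factor
    return (weights + kv_cache + overhead) / 1e9

# Assumed 8B-class shape: 32 layers, 8 KV heads, head_dim 128, 4k context.
print(f"fp16:  {estimate_vram_gb(8, 2.0, 32, 8, 128, 4096):.1f} GB")
print(f"4-bit: {estimate_vram_gb(8, 0.55, 32, 8, 128, 4096):.1f} GB")
```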
git clone https://github.com/heilcheng/medexplain-evals.git
cd medexplain-evals
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

                      MedExplain-Evals Pipeline

┌──────────────┐    ┌──────────────┐    ┌────────────────────────────┐
│   Input      │    │  LLM Under   │    │   Generated Explanation    │
│  Clinical    │───▶│    Test      │───▶│    for Target Audience     │
│  Scenario    │    │ (API/Local)  │    └─────────────┬──────────────┘
└──────────────┘    └──────────────┘                  │
                                                      ▼
┌────────────────────────────────────────────────────────────────────┐
│                         Evaluation Engine                          │
├────────────────────────────────────────────────────────────────────┤
│ Knowledge Grounding    Ensemble LLM-as-Judge  Safety Evaluator     │
│ • UMLS Lookup          • GPT-5.2               • Drug interactions   │
│ • RxNorm Match         • Claude 4.5            • Contraindications   │
│ • SNOMED-CT            • Gemini 3              • Harm classification │
│ • NLI Verify           • DeepSeek-V3.2         • Warning detection   │
└────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                    Dimension Scoring (Weighted)                    │
├────────────────────────────────────────────────────────────────────┤
│  Factual Accuracy (25%)   Terminology (15%)    Completeness (20%)  │
│  Actionability (15%)      Safety (15%)         Empathy & Tone (10%)│
└────────────────────────────────────────────────────────────────────┘
MedExplain-Evals grounds factual claims against established medical ontologies:
The UMLS integrates over 200 source vocabularies and provides a comprehensive framework for medical concept normalization.
| Resource | Description | Usage in MedExplain |
|---|---|---|
| Metathesaurus | Unified concepts from 200+ sources | Entity linking, concept validation |
| Semantic Network | Semantic types and relations | Relationship verification |
| SPECIALIST Lexicon | Biomedical language tools | Term normalization |
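For illustration, a term can be linked to a UMLS concept (CUI) through the UTS REST search endpoint. This is a sketch rather than MedExplain-Evals code: it assumes a valid UTS API key in the `UMLS_API_KEY` environment variable, and the endpoint and response fields should be verified against the current UTS documentation.

```python
import os
import requests

# Sketch: look up a term in the UMLS Metathesaurus via the UTS search API.
UTS_SEARCH_URL = "https://uts-ws.nlm.nih.gov/rest/search/current"

def search_umls(term: str) -> list[dict]:
    resp = requests.get(
        UTS_SEARCH_URL,
        params={"string": term, "apiKey": os.environ["UMLS_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    hits = resp.json()["result"]["results"]
    # Each hit carries a CUI ("ui") and a preferred name ("name").
    return [{"cui": h["ui"], "name": h["name"]} for h in hits]

print(search_umls("type 2 diabetes mellitus")[:3])
```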
Citation:
@article{bodenreider2004umls,
author = {Bodenreider, Olivier},
title = {The Unified Medical Language System (UMLS): integrating biomedical terminology},
journal = {Nucleic Acids Research},
volume = {32},
number = {suppl\_1},
pages = {D267--D270},
year = {2004},
doi = {10.1093/nar/gkh061},
url = {https://www.nlm.nih.gov/research/umls/}
}

RxNorm provides normalized names and identifiers for clinical drugs, enabling drug safety verification.
| Component | Description | Usage in MedExplain |
|---|---|---|
| RxCUI | Unique drug identifiers | Drug entity validation |
| Drug classes | Therapeutic categories | Drug-condition matching |
| Ingredient links | Active ingredient mapping | Interaction checking |
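Drug mentions can similarly be resolved to RxCUIs through the public RxNav REST service. Again, this is an illustrative sketch rather than project code; confirm endpoint behavior against the RxNav documentation.

```python
import requests

# Sketch: map a drug name to its RxNorm identifier (RxCUI) via RxNav.
def rxcui_for(drug_name: str) -> str | None:
    resp = requests.get(
        "https://rxnav.nlm.nih.gov/REST/rxcui.json",
        params={"name": drug_name},
        timeout=30,
    )
    resp.raise_for_status()
    ids = resp.json().get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None

print(rxcui_for("metformin"))  # prints the RxCUI if a match is found
```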
Citation:
@article{nelson2011rxnorm,
author = {Nelson, Stuart J and Zeng, Kelly and Kilbourne, John and Powell, Tammy and Moore, Robin},
title = {Normalized names for clinical drugs: RxNorm at 6 years},
journal = {Journal of the American Medical Informatics Association},
volume = {18},
number = {4},
pages = {441--448},
year = {2011},
doi = {10.1136/amiajnl-2011-000116},
url = {https://www.nlm.nih.gov/research/umls/rxnorm/}
}

SNOMED CT is the most comprehensive clinical terminology, covering diseases, procedures, and clinical findings.
| Hierarchy | Description | Usage in MedExplain |
|---|---|---|
| Clinical finding | Disorders, symptoms | Diagnosis validation |
| Procedure | Medical interventions | Treatment verification |
| Body structure | Anatomical concepts | Anatomical accuracy |
| Pharmaceutical product | Medications | Drug reference checking |
Citation:
@article{donnelly2006snomed,
author = {Donnelly, Kevin},
title = {SNOMED-CT: The advanced terminology and coding system for eHealth},
journal = {Studies in Health Technology and Informatics},
volume = {121},
pages = {279--290},
year = {2006},
url = {https://www.snomed.org/}
}

| Resource | Purpose | Citation |
|---|---|---|
| MedDRA | Adverse event terminology | ICH MedDRA |
| ICD-10/11 | Disease classification | WHO ICD |
| LOINC | Lab/clinical observations | Regenstrief LOINC |
| DrugBank | Drug interaction data | DrugBank 5.0 |
Important: MedExplain-Evals does not redistribute any licensed medical terminologies. Users must obtain their own access:
- UMLS: Requires a free UMLS Terminology Services (UTS) account and license acceptance
- SNOMED CT: Licensing requirements apply, especially for deployment outside IHTSDO member territories. See SNOMED International
- DrugBank: Academic use requires acceptance of terms; commercial usage requires a commercial license
The default pipeline can run with open alternatives if licensed resources are not provided.
Model naming convention: This table shows marketing names. For API calls, use provider-specific model IDs (e.g., claude-opus-4-5, gemini-3-pro-preview).
| Provider | Models | Multimodal | Notes |
|---|---|---|---|
| OpenAI | GPT-5.2, GPT-5.1, GPT-5, GPT-4o | ✓ | API ID: gpt-5.2 |
| Anthropic | Claude Opus 4.5, Sonnet 4.5, Haiku 4.5 | ✓ | API ID: claude-opus-4-5 |
| Google | Gemini 3 Pro, Flash | ✓ | API ID: gemini-3-pro-preview |
| Meta | Llama 4 Maverick, Scout | ✓ | Behemoth: preview only |
| DeepSeek | DeepSeek-V3.2 | | API ID: deepseek-v3.2 |
| Alibaba | Qwen3-Max, Qwen3 family | ✓ | |
| Amazon | Nova 2 Pro, Nova 2 Omni | ✓ | |
| Dimension | Weight | Description |
|---|---|---|
| Factual Accuracy | 25% | Clinical correctness and evidence alignment |
| Terminological Appropriateness | 15% | Language complexity matching audience needs |
| Explanatory Completeness | 20% | Comprehensive yet accessible coverage |
| Actionability | 15% | Clear, practical guidance |
| Safety | 15% | Appropriate warnings and harm avoidance |
| Empathy & Tone | 10% | Audience-appropriate communication style |
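The overall score is the weighted average of the per-dimension scores using the weights above (they sum to 1.0). The snippet below sketches that arithmetic; the dictionary keys mirror the configuration file shown later, but the function itself is illustrative rather than the library's implementation.

```python
# Illustrative weighted aggregation of per-dimension scores (1-5 scale).
DIMENSION_WEIGHTS = {
    "factual_accuracy": 0.25,
    "terminological_appropriateness": 0.15,
    "explanatory_completeness": 0.20,
    "actionability": 0.15,
    "safety": 0.15,
    "empathy_tone": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    assert abs(sum(DIMENSION_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * dimension_scores[dim] for dim, w in DIMENSION_WEIGHTS.items())

example = {
    "factual_accuracy": 4.5, "terminological_appropriateness": 4.0,
    "explanatory_completeness": 3.5, "actionability": 4.0,
    "safety": 5.0, "empathy_tone": 4.5,
}
print(f"{overall_score(example):.2f}")  # weighted average, roughly 4.2 here
```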
| Audience | Variants | Health Literacy |
|---|---|---|
| Physicians | Specialist, Generalist | Expert |
| Nurses | ICU, General Ward, Specialty | Professional |
| Patients | Low, Medium, High literacy | Variable |
| Caregivers | Family, Professional, Pediatric | Variable |
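Persona identifiers follow an audience_variant pattern (e.g., patient_low_literacy, physician_specialist), as used in the quick-start commands and configuration below. A minimal sketch of retrieving predefined personas:

```python
from src import PersonaFactory

# The IDs below appear in the example configuration; others follow the same pattern.
for persona_id in ["patient_low_literacy", "patient_medium_literacy",
                   "physician_specialist", "nurse_general"]:
    persona = PersonaFactory.get_predefined_persona(persona_id)
    print(persona_id, "->", persona)
```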
# Set API keys
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
# Validate environment
python scripts/validate_environment.py
# Run evaluation
python scripts/run_evaluation.py \
--config configs/evaluation_config.yaml \
--models gpt-5.2 claude-opus-4-5 \
--audiences patient_low_literacy physician_specialist \
--output results/

Example configuration (configs/evaluation_config.yaml):

models:
  gpt-5.2:
    provider: openai
    temperature: 0.3
    max_tokens: 2048
  claude-opus-4-5:
    provider: anthropic
    temperature: 0.3

audiences:
  - patient_low_literacy
  - patient_medium_literacy
  - physician_specialist
  - nurse_general

evaluation:
  dimensions:
    factual_accuracy: 0.25
    terminological_appropriateness: 0.15
    explanatory_completeness: 0.20
    actionability: 0.15
    safety: 0.15
    empathy_tone: 0.10
  judge:
    ensemble:
      - model: gpt-5.2
        weight: 0.30
      - model: claude-opus-4-5
        weight: 0.30
      - model: gemini-3-pro-preview
        weight: 0.25
      - model: deepseek-v3.2
        weight: 0.15

output:
  path: ./results
  formats: [json, csv, html]

The evaluation pipeline can also be driven directly from Python:

from src import UnifiedModelClient, EnsembleLLMJudge, PersonaFactory
# Initialize
client = UnifiedModelClient()
judge = EnsembleLLMJudge(client)
persona = PersonaFactory.get_predefined_persona("patient_low_literacy")
# Generate explanation
explanation = client.generate(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Explain type 2 diabetes simply"}]
)
# Evaluate
score = judge.evaluate(
    original="Type 2 diabetes mellitus with HbA1c 8.5%...",
    explanation=explanation.content,
    audience=persona
)
print(f"Overall: {score.overall:.2f}/5.0")
print(f"Agreement: {score.agreement_score:.2f}")

An interactive web interface is available for browser-based evaluation:
# Backend
cd web/backend
pip install -r requirements.txt
uvicorn app.main:app --port 8000
# Frontend
cd web/frontend
npm install && npm run dev

Access the interface at http://localhost:3000
Features:
- Dashboard: Overview of evaluation runs and statistics
- Playground: Interactive testing of medical explanations
- Models: Configure API and local model providers
- Audiences: Browse 11 medical audience personas
- Results: Visualize evaluation scores and rankings
results/
├── 20251229_143022/
│   ├── scores.json
│   ├── explanations/
│   │   ├── gpt-5.2/
│   │   └── claude-opus-4-5/
│   ├── analysis/
│   │   ├── dimension_breakdown.png
│   │   ├── audience_comparison.png
│   │   └── model_rankings.png
│   └── report.html
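Scores can be post-processed straight from scores.json. The sketch below assumes the file holds a list of per-explanation records with model and overall fields; those names are assumptions for illustration, so adjust them to the actual schema.

```python
import json
from pathlib import Path

# Sketch: print the mean overall score per model for one evaluation run.
run_dir = Path("results/20251229_143022")
records = json.loads((run_dir / "scores.json").read_text())

by_model: dict[str, list[float]] = {}
for rec in records:
    by_model.setdefault(rec["model"], []).append(rec["overall"])

for model, scores in sorted(by_model.items()):
    print(f"{model}: {sum(scores) / len(scores):.2f} (n={len(scores)})")
```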
pytest tests/
pytest --cov=src tests/

medexplain-evals/
├── src/                  # Core Python library
│   ├── clients/          # LLM API clients
│   ├── evaluation/       # Scoring and judging
│   ├── personas/         # Audience modeling
│   ├── data/             # Data loading
│   ├── knowledge/        # UMLS/RxNorm grounding
│   ├── core/             # Shared utilities
│   └── rubrics/          # G-Eval scoring rubrics
├── scripts/              # CLI tools
├── web/                  # Web platform (Next.js + FastAPI)
├── docs/                 # Sphinx documentation
├── analysis/             # Visualization and reporting
├── configs/              # Configuration templates
├── data/                 # Sample datasets
├── tests/                # Test suite
└── examples/             # Usage examples
Full documentation is available at: https://heilcheng.github.io/medexplain-evals/
If you use MedExplain-Evals in your research, please cite:
@software{medexplain2025,
author = {Cheng Hei Lam},
title = {MedExplain-Evals: A Benchmark for Audience-Adaptive Medical Explanation Quality in LLMs},
year = {2025},
url = {https://github.com/heilcheng/medexplain-evals}
}

MIT License. See LICENSE for details.
MedExplain-Evals is designed for research evaluation purposes only. Generated explanations should not be used for actual medical advice. Always consult qualified healthcare professionals for medical decisions.