Open-source synthetic healthcare data generation for clinical AI training and evaluation.
CaseCrawler generates validated synthetic healthcare datasets for AI training and evaluation.
It combines grounded medical knowledge retrieval, structured clinical data generation, messy clinical text synthesis, labs, vitals, time-series scaffolding, optional medical imaging hooks, validation, and fine-tuning exports.
The goal is not to simulate a classroom case. The goal is to produce multimodal synthetic records that are ready to inspect, validate, and export as JSONL, FHIR NDJSON, parquet, or model-specific fine-tuning formats.
Healthcare AI teams need training and evaluation data that is clinically rich, auditable, privacy-safe, and exportable without depending on real patient records. CaseCrawler is a dataset engine for that workflow:
- Generate patient timelines with structured EHR facts, labs, vitals, medication history, allergies, orders, time series, clinical notes, radiology reports, and radiology image assets
- Add messy clinical text variants for OCR, message-style, and noisy documentation tasks
- Plug in local, hosted, Hugging Face, diffusers, Synthea, and external command backends without locking the project to one vendor
- Validate records with schema, clinical consistency, privacy, utility, image/report alignment, benchmark, and release-readiness gates
- Export fine-tuning-ready artifacts for SFT, chat, tool use, note-fact extraction, clinical observations, medication reconciliation, multimodal image/report tasks, time series, DPO/RL, FHIR NDJSON, and parquet
Start with the docs hub:
- Documentation index
- Open source roadmap
- Getting started
- BYO-key onboarding — every API key, what it unlocks, where to put it
- Costs and tokens — rough tokens-per-case so you can budget LLM runs
- DPO / RL export quickstart
- Example configs: drop-in starter `config.yaml` files
- Architecture
- Release packages
- Validation and benchmarking
- Reference data and model adapters
- CLI and API guide
- Testing
- Contributing clinical content
- Synthetic healthcare data landscape research
- Implementation plan
CaseCrawler is suitable for synthetic-data research, training-pipeline prototyping, benchmark construction, and release-package experimentation. It is not a medical device, and synthetic outputs require validation before downstream clinical or operational use.
See the open source roadmap for maturity levels, current strengths, and near-term priorities.
```bash
git clone https://github.com/txmed82/case-crawler.git
cd case-crawler
pip install -e ".[dev]"

# See available knowledge sources
casecrawler sources

# Ingest medical knowledge for grounding
casecrawler ingest "sepsis"

# Generate synthetic healthcare records without an LLM key
casecrawler generate-dataset "sepsis" --count 10

# Generate, benchmark, export, and verify a multimodal release package
casecrawler generate-release-package "mixed acute care cohort" \
  --count 25 \
  --max-validation-retries 2 \
  --output-dir release-package \
  --format multimodal_jsonl
casecrawler verify-split-package --require-multimodal-release release-package

# Search the grounded knowledge base
casecrawler search "sepsis lactate fluid resuscitation"
```

With Docker:

```bash
cp .env.example .env
docker compose up
```

The synthetic dataset path produces `SyntheticRecord` objects with:
- Structured patient demographics, encounters, diagnoses, medication history, allergy/intolerance safety facts, and clinical orders
- Diagnoses and procedure/code slots
- Labs with units, reference ranges, flags, and timestamps
- Vitals with timestamps
- Clean clinical notes and messy note variants
- Time-series channels for longitudinal vitals, labs, ECG lead II, and pleth waveform-like data
- Imaging asset metadata, optional image-generation backend hooks, and inline image payloads in multimodal exports when image files exist
- Provenance metadata
- Validation reports with schema, clinical consistency, privacy, utility, and modality-alignment scores
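
As an illustration of the lab fields listed above (units, reference ranges, flags, timestamps), the sketch below derives an abnormal flag from a value and its reference range. The `LabResult` class, its field names, the `L`/`H`/`N` flag codes, and the example range are assumptions made for illustration, not CaseCrawler's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LabResult:
    """Hypothetical lab observation; field names are illustrative only."""

    name: str
    value: float
    unit: str
    ref_low: float
    ref_high: float
    timestamp: datetime

    @property
    def flag(self) -> str:
        """Return 'L' below range, 'H' above range, 'N' otherwise."""
        if self.value < self.ref_low:
            return "L"
        if self.value > self.ref_high:
            return "H"
        return "N"


# Example values are placeholders, not clinical guidance.
lactate = LabResult(
    name="lactate",
    value=4.1,
    unit="mmol/L",
    ref_low=0.5,
    ref_high=2.0,
    timestamp=datetime(2026, 1, 1, 8, 30, tzinfo=timezone.utc),
)
print(lactate.flag)  # H
```

Deriving the flag from the stored range, rather than storing it separately, keeps the value, range, and flag from drifting apart, which is the kind of inconsistency a clinical consistency validator would otherwise have to catch.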
The primary workflow is `generate-dataset`, followed by dataset quality reports, reference benchmarks, human review, and export profiles.
CaseCrawler works with zero API keys using free public sources. Paid keys unlock richer data.
| Source | Key Required | What You Get |
|---|---|---|
| PubMed | None | Biomedical citations and abstracts |
| OpenFDA | None | Drug adverse events and labeling |
| DailyMed | None | Structured drug labels |
| RxNorm | None | Drug names and classes |
| medRxiv | None | Medical preprints |
| ClinicalTrials.gov | None | Trial protocols, eligibility, outcomes |
| Glass Health | `GLASS_API_KEY` | Curated clinical reasoning content |
| Anna's Archive | `ANNAS_ARCHIVE_API_KEY` | Full-text papers and medical textbooks |
| Firecrawl | `FIRECRAWL_API_KEY` | Web scraping for guidelines and unstructured content |
Run `casecrawler sources` to see what is available with your current keys.
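
The key-gating idea in the table above can be sketched in a few lines. The source names and environment variable names come from the table; the `SOURCE_KEYS` mapping and the `available_sources` helper are hypothetical and are not how `casecrawler sources` is necessarily implemented.

```python
import os
from typing import Mapping, Optional

# Source name -> required environment variable (None means no key needed).
# Mirrors the table above; the lookup logic is an illustrative assumption.
SOURCE_KEYS: dict[str, Optional[str]] = {
    "PubMed": None,
    "OpenFDA": None,
    "DailyMed": None,
    "RxNorm": None,
    "medRxiv": None,
    "ClinicalTrials.gov": None,
    "Glass Health": "GLASS_API_KEY",
    "Anna's Archive": "ANNAS_ARCHIVE_API_KEY",
    "Firecrawl": "FIRECRAWL_API_KEY",
}


def available_sources(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the sources usable with the given environment."""
    return [
        name
        for name, key in SOURCE_KEYS.items()
        if key is None or env.get(key)  # empty strings count as "not set"
    ]


print(len(available_sources({})))  # 6
print("Firecrawl" in available_sources({"FIRECRAWL_API_KEY": "example"}))  # True
```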
The dataset-first path starts with a no-key deterministic slice and is designed for pluggable model backends:
```
Topic + GenerationRequest
        |
[1. Structured Generator] -> patient, encounter, labs, vitals
        |
[2. Text Generator] -> clean and messy clinical notes
        |
[3. Validators] -> schema, clinical rules, privacy, utility
        |
[4. DatasetStore] -> SQLite synthetic_records
        |
[5. Exporters] -> SFT/note-fact/chat/multimodal/RL/FHIR/parquet profiles
```
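
The five stages above can be read as a simple function chain. The toy sketch below shows only that control flow; every function name, the dict-based record, and the in-memory list standing in for the SQLite store are assumptions for illustration and do not reflect CaseCrawler's internal API.

```python
from typing import Callable

Record = dict  # stand-in for a SyntheticRecord; the real schema is richer


def structured_generator(topic: str) -> Record:
    # Stage 1: structured facts (values here are placeholders).
    return {"topic": topic, "labs": [{"name": "lactate", "value": 4.1}], "notes": []}


def text_generator(record: Record) -> Record:
    # Stage 2: a clean note plus a crude "messy" variant.
    clean = f"Patient evaluated for {record['topic']}."
    record["notes"] = [clean, clean.lower().replace(".", "")]
    return record


def validate(record: Record) -> Record:
    # Stage 3: a toy check standing in for the validator suite.
    record["valid"] = bool(record["labs"]) and bool(record["notes"])
    return record


def run_pipeline(topic: str, store: list, export: Callable[[Record], str]) -> list[str]:
    record = validate(text_generator(structured_generator(topic)))
    if record["valid"]:
        store.append(record)  # Stage 4: persist (SQLite in the diagram)
    return [export(r) for r in store]  # Stage 5: apply an export profile


store: list = []
lines = run_pipeline("sepsis", store, export=lambda r: r["notes"][0])
print(lines)  # ['Patient evaluated for sepsis.']
```

Passing the exporter in as a callable mirrors the diagram's idea that one stored record can feed several export profiles.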
Optional backends are intentionally lazy:
- `casecrawler[hf]` for Hugging Face helpers
- `casecrawler[imaging]` for diffusers/image validation backends
- Imaging model profiles include chest X-ray focused adapters, CheXGenBench Sana (`raman07/CheXGenBench-Models-Sana-e20`) for chest-radiograph experiments, and `medisyn` (`hiesingerlab/MediSyn`) for broader text-guided medical image synthesis
- `casecrawler imaging-models` and `/api/datasets/capabilities` expose each image profile contract: diffusers command template, prompt inputs, generated `ImagingAsset` fields, and required image/report validation gates
- File-backed `ImagingAsset` records include per-asset metadata for byte size, MIME type, SHA-256, raster dimensions when available, generation backend, and model profile or external command provenance
- `BiomedCLIPImageValidator` scores generated image/report alignment when `casecrawler[imaging]` dependencies are installed
- `MedGemmaImageTextValidator` can use gated MedGemma multimodal models through `casecrawler[hf]` plus imaging dependencies for report/image consistency checks
- `casecrawler[parquet]` for parquet exports
- Time-series model profiles include TimeDiff, RawMed, and MIRA (`MIRA-Mode/MIRA`) wrappers for external generation, forecasting, or validation commands
- `casecrawler timeseries-models` and `casecrawler datasets capabilities` expose each external adapter contract: suggested command template, stdin JSON fields, expected `TimeSeriesChannel[]` stdout, and validation requirements
- Existing OpenAI, Anthropic, OpenRouter, and Ollama providers remain available for model-backed generation
- `synthetic.clinical_text_backend: llm` routes clinical document drafting through the configured LLM provider while the default deterministic backend remains no-key
- `synthetic.clinical_text_noise_profile` controls deterministic messy-note variants: `standard`, `message`, `ocr`, or `heavy`
- `synthetic.clinical_text_backend: external` wraps local or Hugging Face note generators as stdin/stdout commands and validates their returned `ClinicalDocument[]` records
- Clinical text model profiles include MedGemma (`google/medgemma-4b-it`), Meditron (`epfl-llm/meditron-7b`), and a generic external note-generator contract; `casecrawler clinical-text-models` lists the adapter metadata
- Registered Hugging Face references include BeTraC/Synth-DoPaCo (`BeTraC/betrac-2026`) for doctor-patient transcript to SOAP-note benchmarking
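
The external adapter contract mentioned in the list above is a stdin/stdout exchange, so a wrapper can be quite small. The sketch below assumes the time-series request carries `record`, `channels`, and `points` fields and that a `{"channels": [...]}` wrapper is accepted on stdout, as the configuration comments in this README describe. The meaning given to `channels` (a list of names) and `points` (a sample count), and the `name`/`values` shape of each channel object, are guesses; check `casecrawler timeseries-models` for the real contract.

```python
import io
import json
from typing import IO


def build_channels(request: dict) -> list[dict]:
    """Build one flat placeholder series per requested channel.

    Assumes request["channels"] is a list of channel names and
    request["points"] is the number of samples; both are guesses.
    """
    points = int(request.get("points", 0))
    return [
        {"name": name, "values": [0.0] * points}  # hypothetical channel shape
        for name in request.get("channels", [])
    ]


def run(stdin: IO[str], stdout: IO[str]) -> None:
    """Read one JSON request and write the wrapped channel list."""
    request = json.load(stdin)
    json.dump({"channels": build_channels(request)}, stdout)


# Exercise the adapter with in-memory streams instead of a real process.
out = io.StringIO()
run(io.StringIO('{"record": {}, "channels": ["heart_rate"], "points": 3}'), out)
print(out.getvalue())
# {"channels": [{"name": "heart_rate", "values": [0.0, 0.0, 0.0]}]}
```

In a real adapter script, `run(sys.stdin, sys.stdout)` would sit under an `if __name__ == "__main__":` guard, and `build_channels` would call the external model instead of returning zeros.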
The full CLI/API guide is in docs/api-and-cli.md. Common commands are shown below.
```bash
# Knowledge ingestion and search
casecrawler ingest "sepsis"
casecrawler ingest "pulmonary embolism" --sources pubmed,openfda
casecrawler search "elevated lactate septic shock"
casecrawler sources
casecrawler config

# Synthetic healthcare dataset generation
casecrawler generate-dataset "sepsis" --count 25
casecrawler generate-dataset "heart failure exacerbation" --count 100 --complexity complex
casecrawler generate-dataset "pulmonary embolism" \
  --count 50 \
  --modalities structured_ehr,clinical_text,labs,vitals,time_series,imaging \
  --age-min 45 --age-max 85 --sexes female,male
casecrawler generate-dataset "mixed acute care cohort" \
  --count 90 \
  --topic-mix "sepsis,pneumonia,heart failure exacerbation" \
  --modalities structured_ehr,clinical_text,labs,vitals,time_series
casecrawler datasets capabilities
casecrawler reference-datasets
casecrawler import-reference-dataset asclepius --dataset-id ds-asclepius-ref --limit 100
casecrawler import-reference-dataset betrac_2026 --dataset-id ds-betrac-ref --limit 100
casecrawler import-reference-dataset clinical_notes_to_fhir --dataset-id ds-fhir-ref --limit 100
casecrawler import-reference-dataset radiology_report_consistency --dataset-id ds-rad-ref --limit 100
casecrawler import-reference-dataset synthchex_75k --dataset-id ds-synthchex-ref --limit 100
casecrawler import-reference-dataset synthetic_chest_xray_pneumonia --dataset-id ds-cxr-pneumonia-ref --limit 100
casecrawler import-synthea ./synthea/output/fhir --dataset-id ds-synthea-ref
# The Synthea import accepts FHIR JSON bundles, FHIR NDJSON resource directories,
# and standard Synthea CSV directories such as output/csv with patients.csv.
casecrawler run-synthea \
  --synthea-executable ./synthea/run_synthea \
  --output-dir ./synthea/output/fhir \
  --dataset-id ds-synthea-ref \
  --population 100
casecrawler import-reference-dataset \
  --repo-id org/custom-synthetic-notes \
  --dataset-id ds-custom-ref \
  --note-field clinical_note \
  --question-field prompt \
  --answer-field completion \
  --split eval \
  --limit 100
casecrawler import-reference-dataset \
  --repo-id org/custom-image-caption-dataset \
  --dataset-id ds-custom-image-ref \
  --note-field caption \
  --image-field image \
  --image-label-field label \
  --image-label-map '{"0":"normal","1":"pneumonia"}' \
  --split train \
  --limit 100
casecrawler import-reference-dataset local-validation-notes \
  --path ./validation/local-notes.jsonl \
  --dataset-id ds-local-ref \
  --note-field clinical_note \
  --question-field prompt \
  --answer-field completion \
  --lab-values-field labs \
  --limit 100
casecrawler benchmark-dataset \
  --dataset-id <dataset_id> \
  --reference-dataset-id ds-asclepius-ref \
  --min-overall-score 0.8 \
  --min-metric-score 0.5
casecrawler datasets quality <dataset_id>
casecrawler export-dataset \
  --dataset-id <dataset_id> \
  --reference-dataset-id ds-asclepius-ref \
  --min-overall-score 0.8 \
  --min-metric-score 0.5 \
  --format sft_jsonl \
  --output train.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format note_fact_sft_jsonl --output note_facts.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format tool_call_jsonl --output tools.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format time_series_jsonl --output time_series.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format dpo_jsonl --output preference.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format rl_jsonl --output episodes.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format fhir_ndjson --output fhir.ndjson
casecrawler export-dataset --dataset-id <dataset_id> --format parquet --output records.parquet
casecrawler export-dataset-splits \
  --dataset-id <dataset_id> \
  --format clinical_observation_jsonl \
  --reference-dataset-id ds-asclepius-ref \
  --min-overall-score 0.8 \
  --min-metric-score 0.5 \
  --output-dir finetune-package
casecrawler generate-release-package "mixed acute care cohort" \
  --count 25 \
  --max-validation-retries 2 \
  --output-dir release-package \
  --format multimodal_jsonl \
  --seed casecrawler \
  --age-min 45 \
  --age-max 88 \
  --encounter-count 3
casecrawler verify-split-package --require-multimodal-release release-package
```

Datasets generated with `--require-human-review` are blocked from export until each record is approved through `casecrawler reviews mark <record_id> --status approved` or the matching REST review endpoint. Quality reports surface missing human approvals as `human_review.missing` blockers.
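
The human-review gate described above amounts to a check that every record is approved before export. A minimal sketch, assuming a per-record status field: the `ReviewedRecord` class, the status values, and the `:record_id` suffix on the blocker string are hypothetical; only the `human_review.missing` blocker name comes from this README.

```python
from dataclasses import dataclass


@dataclass
class ReviewedRecord:
    """Hypothetical record stub carrying only review state."""

    record_id: str
    review_status: str = "pending"  # assumed values: pending, approved, rejected


def export_blockers(records: list[ReviewedRecord], require_human_review: bool) -> list[str]:
    """Return blocker strings; an empty list means export may proceed."""
    if not require_human_review:
        return []
    return [
        f"human_review.missing:{r.record_id}"
        for r in records
        if r.review_status != "approved"
    ]


records = [ReviewedRecord("rec-1", "approved"), ReviewedRecord("rec-2")]
print(export_blockers(records, require_human_review=True))
# ['human_review.missing:rec-2']
```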
`generate-release-package` is the shortest offline smoke path for a fine-tuning-ready multimodal package. It:
- generates with the full multimodal acute-care recipe by default
- writes file-backed synthetic radiology images
- seeds bundled reference fixtures for the generated recipe
- runs the benchmark gate
- writes dataset and model cards plus quality and benchmark audit reports
- copies file-backed radiology images into the package `images/` directory with image artifact metadata and provenance
- exports train/validation/test JSONL splits
- verifies strict multimodal release readiness

The default benchmark thresholds are a non-zero bundled-fixture smoke gate: `--min-overall-score 0.1` and `--min-metric-score 0.0`. Raise them when benchmarking against larger imported reference datasets.
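
For the image artifact metadata that travels with packaged radiology images, the sketch below collects the byte size, MIME type, and SHA-256 fields this README lists for file-backed assets. The `asset_metadata` function and its dictionary keys are assumptions for illustration, not the package's actual metadata format.

```python
import hashlib
import mimetypes
import tempfile
from pathlib import Path


def asset_metadata(path: Path) -> dict:
    """Collect byte size, guessed MIME type, and SHA-256 for one file."""
    data = path.read_bytes()
    mime, _ = mimetypes.guess_type(path.name)
    return {
        "file_name": path.name,
        "byte_size": len(data),
        "mime_type": mime or "application/octet-stream",
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Use a throwaway file so the example does not depend on a real package.
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "cxr-0001.png"
    image.write_bytes(b"\x89PNG")  # four placeholder bytes, not a valid image
    meta = asset_metadata(image)

print(meta["byte_size"], meta["mime_type"])  # 4 image/png
```

Recording the digest at packaging time lets a later verification step detect images that were modified or swapped after export.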
Benchmark reports compare generated cohorts to stored reference datasets and return explicit pass/fail gates plus failing metric names. They compare across demographics, note types, artifact density, declared-modality artifact coverage, extracted clinical fact targets, labs, vitals, medication history, time-series channels and backend provenance, imaging findings and backend provenance, and approval rates. Export commands and API downloads can require the same benchmark gate by passing a reference dataset id and thresholds, which prevents unbenchmarked or underperforming synthetic data from silently becoming fine-tuning input.
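
The pass/fail gate described above can be sketched as a threshold check that also reports which metrics failed. The defaults mirror the `--min-overall-score 0.8` and `--min-metric-score 0.5` values used in the CLI examples; the `GateResult` type, the metric names, and the choice to take the overall score as an input (rather than computing it) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    failing_metrics: list[str] = field(default_factory=list)


def benchmark_gate(
    metric_scores: dict[str, float],
    overall_score: float,
    min_overall_score: float = 0.8,
    min_metric_score: float = 0.5,
) -> GateResult:
    """Pass only if the overall score and every metric clear their thresholds."""
    failing = sorted(
        name for name, score in metric_scores.items() if score < min_metric_score
    )
    passed = overall_score >= min_overall_score and not failing
    return GateResult(passed=passed, failing_metrics=failing)


# Hypothetical metric names and scores.
scores = {"demographics": 0.92, "labs": 0.81, "imaging_findings": 0.42}
result = benchmark_gate(scores, overall_score=0.83)
print(result.passed, result.failing_metrics)  # False ['imaging_findings']
```

Note that a dataset can clear the overall threshold and still fail the gate because one metric falls below the per-metric floor, which is why the failing metric names are returned explicitly.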
Registered Hugging Face reference datasets include synthetic clinical notes,
doctor-patient dialogue to SOAP-note rows, clinical-note-to-FHIR rows,
radiology consistency rows, de-identification and ICD-coding references such as
Technetium-I, Synthea imports, and image-reference datasets such as
SynthCheX-75K-v2 and synthetic chest X-ray pneumonia. Custom
Hugging Face imports can map text fields, FHIR answer fields, PHI annotations,
diagnosis-code fields, image fields, image-label fields, explicit lab/vital
arrays, medication-history arrays, and time-series channel arrays into the local
SyntheticRecord schema. The same field mapping works for local JSONL, NDJSON,
JSON arrays, or {"rows": [...]} files via --path or the REST import path
field, so private validation sets can stay local. Persisted image-reference
imports also attach per-asset file metadata and Hugging Face license/use-policy
provenance so benchmark images can be audited alongside generated images.
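
To make the local-file formats and field mapping above concrete, the sketch below parses JSONL/NDJSON text, a JSON array, or a `{"rows": [...]}` object, then renames the source columns. The `load_rows` and `map_row` helpers and the `note`/`question`/`answer` output keys are hypothetical; the column names match the `--note-field`, `--question-field`, and `--answer-field` values used in the CLI examples.

```python
import json


def load_rows(text: str) -> list[dict]:
    """Parse a JSON array, a {"rows": [...]} object, or JSONL/NDJSON text."""
    stripped = text.strip()
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        # Not a single JSON document: treat each non-empty line as one row.
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
    if isinstance(parsed, dict) and "rows" in parsed:
        return list(parsed["rows"])
    if isinstance(parsed, list):
        return parsed
    return [parsed]  # a one-line JSONL file parses as a single object


def map_row(row: dict, note_field: str, question_field: str, answer_field: str) -> dict:
    """Rename source columns to a hypothetical local record layout."""
    return {
        "note": row.get(note_field, ""),
        "question": row.get(question_field, ""),
        "answer": row.get(answer_field, ""),
    }


jsonl = (
    '{"clinical_note": "Fever, HR 118.", "prompt": "Dx?", "completion": "Sepsis"}\n'
    '{"clinical_note": "Cough x3d.", "prompt": "Dx?", "completion": "Pneumonia"}'
)
rows = [map_row(r, "clinical_note", "prompt", "completion") for r in load_rows(jsonl)]
print(len(rows), rows[0]["answer"])  # 2 Sepsis
```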
The current external landscape and model/dataset research notes are tracked in
docs/research/2026-05-08-synthetic-healthcare-data-landscape.md.
Start the server with `casecrawler serve` or `docker compose up`.
| Endpoint | Method | Description |
|---|---|---|
| `/api/ingest` | POST | Ingest content for a topic |
| `/api/ingest/{job_id}` | GET | Poll ingestion status |
| `/api/search?q=...` | GET | Search the knowledge base |
| `/api/sources` | GET | List available sources |
| `/api/datasets/capabilities` | GET | List modalities, export formats, strict release coverage requirements, validators, and model/profile adapters |
| `/api/datasets/generate` | POST | Generate synthetic healthcare records |
| `/api/datasets/release-package` | POST | Generate, benchmark, export, and return a multimodal release package zip |
| `/api/datasets/reference-catalog` | GET | List registered Hugging Face reference datasets |
| `/api/datasets/reference-import` | POST | Import registered reference datasets into local storage |
| `/api/datasets/synthea-import` | POST | Import Synthea FHIR JSON bundles, FHIR NDJSON directories, or CSV directories into local storage |
| `/api/datasets/{dataset_id}/benchmark` | GET | Compare a generated dataset to a reference dataset with configurable pass/fail thresholds |
| `/api/datasets/{dataset_id}/reference-fixtures` | POST | Seed bundled benchmark fixtures for a generated dataset recipe |
| `/api/datasets/{dataset_id}/benchmark-plan` | GET | Show recommended reference readiness for a generated dataset |
| `/api/datasets/{dataset_id}/quality` | GET | Summarize validation and fine-tuning export readiness |
| `/api/datasets/{dataset_id}/export` | GET | Stream fine-tuning/export records |
| `/api/datasets/{dataset_id}/export-splits` | GET | Download train/validation/test split package zip |
Example one-call API release package:
```bash
curl -X POST http://localhost:8000/api/datasets/release-package \
  -H 'Content-Type: application/json' \
  -o release-package.zip \
  -d '{
    "topic": "mixed acute care cohort",
    "count": 25,
    "recipe": "full_multimodal_acute_care",
    "export_format": "multimodal_jsonl",
    "seed": "casecrawler"
  }'
```

Environment variables in `.env`:

```bash
# Optional LLM providers
ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...
# OPENROUTER_API_KEY=sk-or-...

# Optional paid data sources
# GLASS_API_KEY=
# ANNAS_ARCHIVE_API_KEY=
# FIRECRAWL_API_KEY=

# Optional free-source rate-limit keys
# NCBI_API_KEY=
# OPENFDA_API_KEY=
```

Settings in `config.yaml`:

```yaml
ingestion:
  default_limit_per_source: 20
sources:
  priority: [pubmed, glass, openfda, annas_archive, dailymed, rxnorm, medrxiv, clinicaltrials, firecrawl]
  disabled: []
chunking:
  default_chunk_size: 500
  overlap: 50
embedding:
  model: "all-MiniLM-L6-v2"
storage:
  chroma_persist_dir: "./data/chroma"
llm:
  provider: "anthropic"
  model: "claude-sonnet-4-6"
synthetic:
  clinical_text_backend: "deterministic"  # or "llm" or "external"
  clinical_text_noise_profile: "standard"  # standard, message, ocr, or heavy
  clinical_text_model_profile: null  # e.g. medgemma_4b_it or meditron_7b
  clinical_text_command: null  # e.g. ["hf-note-sample", "--model", "local-notes"]
  # GenerationRequest can override clinical_text_backend, llm_provider,
  # llm_model, ollama_base_url, clinical_text_noise_profile,
  # clinical_text_model_profile, and clinical_text_command for one dataset run.
  # External clinical text commands receive stdin JSON with record and must print
  # ClinicalDocument[] or {"documents": ClinicalDocument[]} to stdout.
  imaging_backend: "placeholder"  # or "diffusers" or "external"
  imaging_model_profile: null  # e.g. cxr_pneumonia_dreambooth
  diffusers_model_id: "stabilityai/stable-diffusion-2-1"
  imaging_command: null  # e.g. ["hf-image-sample", "--model", "local-cxr"]
  # Imaging profiles declare prompt inputs, generated ImagingAsset output fields,
  # licensing/use policy, and validation gates for file integrity and alignment.
  # External imaging commands receive stdin JSON with output_dir, prompt, modality,
  # and body_region. They must print an ImagingAsset JSON object or
  # {"asset": ImagingAsset} to stdout.
  time_series_backend: "deterministic"  # or "external"
  time_series_model_profile: null  # e.g. timediff or rawmed
  time_series_command: null  # e.g. ["timediff-sample", "--checkpoint", "local.pt"]
  # External commands receive stdin JSON with record, channels, and points.
  # They must print a JSON array of TimeSeriesChannel objects or
  # {"channels": [TimeSeriesChannel, ...]} to stdout.
  synthea_executable: null
  # GenerationRequest.cohort_constraints supports age_min, age_max, sexes,
  # sex_cycle, topic_mix, and base_time for deterministic cohort composition.
  # GenerationRequest can also override imaging_backend, imaging_model_profile,
  # and diffusers_model_id for a single dataset generation run.
  # It can also override time_series_backend, time_series_model_profile,
  # and time_series_command for external EHR time-series adapters.
  export_formats:
    - raw_jsonl
    - sft_jsonl
    - note_fact_sft_jsonl
    - chat_jsonl
    - tool_call_jsonl
    - multimodal_jsonl
    - time_series_jsonl
    - dpo_jsonl
    - rl_jsonl
    - fhir_ndjson
    - parquet
api:
  host: "0.0.0.0"
  port: 8000
```

Development install, tests, and lint:

```bash
pip install -e ".[dev]"
pytest tests/ -v
ruff check src tests
```

See `docs/testing.md` for pull-request, optional-backend, UI, network, and release-package smoke test tiers.
```
src/casecrawler/
  sources/      # Public and paid medical source adapters
  pipeline/     # Chunking, tagging, embedding, Chroma storage
  generation/   # Synthetic dataset generators and backend adapters
  validation/   # Synthetic record validation
  storage/      # SQLite stores
  export/       # Fine-tuning and benchmark-ready export profiles
  api/          # FastAPI routes
```