CaseCrawler

Open-source synthetic healthcare data generation for clinical AI training and evaluation.

CaseCrawler produces validated datasets by combining grounded medical knowledge retrieval, structured clinical data generation, messy clinical text synthesis, labs, vitals, time-series scaffolding, optional medical imaging hooks, validation, and fine-tuning exports.

The goal is not to simulate a classroom case. The goal is to produce multimodal synthetic records that are ready to inspect, validate, and export as JSONL, FHIR NDJSON, parquet, or model-specific fine-tuning formats.

Why CaseCrawler Exists

Healthcare AI teams need training and evaluation data that is clinically rich, auditable, privacy-safe, and exportable without depending on real patient records. CaseCrawler is a dataset engine for that workflow:

Generate patient timelines with structured EHR facts, labs, vitals, medication history, allergies, orders, time series, clinical notes, radiology reports, and radiology image assets
Add messy clinical text variants for OCR, message-style, and noisy documentation tasks
Plug in local, hosted, Hugging Face, diffusers, Synthea, and external command backends without locking the project to one vendor
Validate records with schema, clinical consistency, privacy, utility, image/report alignment, benchmark, and release-readiness gates
Export fine-tuning-ready artifacts for SFT, chat, tool use, note-fact extraction, clinical observations, medication reconciliation, multimodal image/report tasks, time series, DPO/RL, FHIR NDJSON, and parquet

Documentation

Start with the docs hub:

Documentation index
Open source roadmap
Getting started
BYO-key onboarding — every API key, what it unlocks, where to put it
Costs and tokens — rough tokens-per-case so you can budget LLM runs
DPO / RL export quickstart
Example configs — drop-in starter config.yaml files
Architecture
Release packages
Validation and benchmarking
Reference data and model adapters
CLI and API guide
Testing
Contributing clinical content
Synthetic healthcare data landscape research
Implementation plan

Project Maturity

CaseCrawler is suitable for synthetic-data research, training-pipeline prototyping, benchmark construction, and release-package experimentation. It is not a medical device, and synthetic outputs require validation before downstream clinical or operational use.

See the open source roadmap for maturity levels, current strengths, and near-term priorities.

Quick Start

git clone https://github.com/txmed82/case-crawler.git
cd case-crawler
pip install -e ".[dev]"

# See available knowledge sources
casecrawler sources

# Ingest medical knowledge for grounding
casecrawler ingest "sepsis"

# Generate synthetic healthcare records without an LLM key
casecrawler generate-dataset "sepsis" --count 10

# Generate, benchmark, export, and verify a multimodal release package
casecrawler generate-release-package "mixed acute care cohort" \
  --count 25 \
  --max-validation-retries 2 \
  --output-dir release-package \
  --format multimodal_jsonl

casecrawler verify-split-package --require-multimodal-release release-package

# Search the grounded knowledge base
casecrawler search "sepsis lactate fluid resuscitation"

With Docker

cp .env.example .env
docker compose up

What It Generates

The synthetic dataset path produces SyntheticRecord objects with:

Structured patient demographics, encounters, diagnoses, medication history, allergy/intolerance safety facts, and clinical orders
Diagnoses and procedure/code slots
Labs with units, reference ranges, flags, and timestamps
Vitals with timestamps
Clean clinical notes and messy note variants
Time-series channels for longitudinal vitals, labs, ECG lead II, and pleth waveform-like data
Imaging asset metadata, optional image-generation backend hooks, and inline image payloads in multimodal exports when image files exist
Provenance metadata
Validation reports with schema, clinical consistency, privacy, utility, and modality-alignment scores

The primary workflow runs from generate-dataset through dataset quality reports, reference benchmarks, and human review to export profiles.

Data Sources

CaseCrawler works with zero API keys using free public sources. Paid keys unlock richer data.

Source	Key Required	What You Get
PubMed	None	Biomedical citations and abstracts
OpenFDA	None	Drug adverse events and labeling
DailyMed	None	Structured drug labels
RxNorm	None	Drug names and classes
medRxiv	None	Medical preprints
ClinicalTrials.gov	None	Trial protocols, eligibility, outcomes
Glass Health	`GLASS_API_KEY`	Curated clinical reasoning content
Anna's Archive	`ANNAS_ARCHIVE_API_KEY`	Full-text papers and medical textbooks
Firecrawl	`FIRECRAWL_API_KEY`	Web scraping for guidelines and unstructured content

Run casecrawler sources to see what is available with your current keys.
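
As a rough illustration of the table above, the sketch below reports which paid sources a given environment would unlock. The environment variable names come from the table; the `PAID_SOURCE_KEYS` mapping and the `unlocked_paid_sources` helper are hypothetical and not part of CaseCrawler, whose own check is `casecrawler sources`.

```python
import os

# Environment variable -> paid source, copied from the table above.
PAID_SOURCE_KEYS = {
    "GLASS_API_KEY": "Glass Health",
    "ANNAS_ARCHIVE_API_KEY": "Anna's Archive",
    "FIRECRAWL_API_KEY": "Firecrawl",
}


def unlocked_paid_sources(env=None):
    """Return the paid sources whose key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        name
        for key, name in PAID_SOURCE_KEYS.items()
        if env.get(key, "").strip()
    ]


if __name__ == "__main__":
    print(unlocked_paid_sources() or "no paid keys set; free sources only")
```

An empty result means only the free public sources are in play, which is the zero-key mode described above.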

Synthetic Generation Pipeline

The dataset-first path works out of the box with a deterministic backend that needs no API key, and it is designed for pluggable model backends:

Topic + GenerationRequest
      |
[1. Structured Generator]  -> patient, encounter, labs, vitals
      |
[2. Text Generator]        -> clean and messy clinical notes
      |
[3. Validators]            -> schema, clinical rules, privacy, utility
      |
[4. DatasetStore]          -> SQLite synthetic_records
      |
[5. Exporters]             -> SFT/note-fact/chat/multimodal/RL/FHIR/parquet profiles

Optional backends are intentionally lazy:

casecrawler[hf] for Hugging Face helpers
casecrawler[imaging] for diffusers/image validation backends
Imaging model profiles include chest X-ray focused adapters, CheXGenBench Sana (raman07/CheXGenBench-Models-Sana-e20) for chest-radiograph experiments, and medisyn (hiesingerlab/MediSyn) for broader text-guided medical image synthesis
casecrawler imaging-models and /api/datasets/capabilities expose each image profile contract: diffusers command template, prompt inputs, generated ImagingAsset fields, and required image/report validation gates
File-backed ImagingAsset records include per-asset metadata for byte size, MIME type, SHA-256, raster dimensions when available, generation backend, and model profile or external command provenance
BiomedCLIPImageValidator scores generated image/report alignment when casecrawler[imaging] dependencies are installed
MedGemmaImageTextValidator can use gated MedGemma multimodal models through casecrawler[hf] plus imaging dependencies for report/image consistency checks
casecrawler[parquet] for parquet exports
Time-series model profiles include TimeDiff, RawMed, and MIRA (MIRA-Mode/MIRA) wrappers for external generation, forecasting, or validation commands
casecrawler timeseries-models and casecrawler datasets capabilities expose each external adapter contract: suggested command template, stdin JSON fields, expected TimeSeriesChannel[] stdout, and validation requirements
Existing OpenAI, Anthropic, OpenRouter, and Ollama providers remain available for model-backed generation
synthetic.clinical_text_backend: llm routes clinical document drafting through the configured LLM provider while the default deterministic backend remains no-key
synthetic.clinical_text_noise_profile controls deterministic messy-note variants: standard, message, ocr, or heavy
synthetic.clinical_text_backend: external wraps local or Hugging Face note generators as stdin/stdout commands and validates their returned ClinicalDocument[] records
Clinical text model profiles include MedGemma (google/medgemma-4b-it), Meditron (epfl-llm/meditron-7b), and a generic external note-generator contract; casecrawler clinical-text-models lists the adapter metadata
Registered Hugging Face references include BeTraC/Synth-DoPaCo (BeTraC/betrac-2026) for doctor-patient transcript to SOAP-note benchmarking
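
The external clinical text backend described above talks to a command over stdin and stdout. The sketch below shows one possible shape for such a command, assuming only what the configuration comments state: the input JSON carries a `record`, and the output is `{"documents": ClinicalDocument[]}`. The field names `topic`, `note_type`, and `text` are placeholders, since the real record and ClinicalDocument schemas are defined by CaseCrawler and are not shown in this README.

```python
import json
import sys


def build_documents(payload):
    """Turn the stdin payload into a {"documents": [...]} response."""
    record = payload.get("record", {})
    # "topic", "note_type", and "text" are placeholder field names; check
    # the CaseCrawler schemas for the fields the validator actually expects.
    topic = record.get("topic", "unspecified topic")
    note = {"note_type": "progress_note", "text": f"Synthetic note for {topic}."}
    return {"documents": [note]}


def main():
    raw = sys.stdin.read()
    # Tolerate empty input so the script can be smoke-tested by hand.
    payload = json.loads(raw) if raw.strip() else {}
    json.dump(build_documents(payload), sys.stdout)


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as note_command.py, a script like this could be wired in with synthetic.clinical_text_backend: external and clinical_text_command: ["python", "note_command.py"]. A real adapter would call a local or Hugging Face model where the placeholder note is built.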

CLI Reference

The full CLI/API guide is in docs/api-and-cli.md. Common commands are shown below.

# Knowledge ingestion and search
casecrawler ingest "sepsis"
casecrawler ingest "pulmonary embolism" --sources pubmed,openfda
casecrawler search "elevated lactate septic shock"
casecrawler sources
casecrawler config

# Synthetic healthcare dataset generation
casecrawler generate-dataset "sepsis" --count 25
casecrawler generate-dataset "heart failure exacerbation" --count 100 --complexity complex
casecrawler generate-dataset "pulmonary embolism" \
  --count 50 \
  --modalities structured_ehr,clinical_text,labs,vitals,time_series,imaging \
  --age-min 45 --age-max 85 --sexes female,male
casecrawler generate-dataset "mixed acute care cohort" \
  --count 90 \
  --topic-mix "sepsis,pneumonia,heart failure exacerbation" \
  --modalities structured_ehr,clinical_text,labs,vitals,time_series
casecrawler datasets capabilities
casecrawler reference-datasets
casecrawler import-reference-dataset asclepius --dataset-id ds-asclepius-ref --limit 100
casecrawler import-reference-dataset betrac_2026 --dataset-id ds-betrac-ref --limit 100
casecrawler import-reference-dataset clinical_notes_to_fhir --dataset-id ds-fhir-ref --limit 100
casecrawler import-reference-dataset radiology_report_consistency --dataset-id ds-rad-ref --limit 100
casecrawler import-reference-dataset synthchex_75k --dataset-id ds-synthchex-ref --limit 100
casecrawler import-reference-dataset synthetic_chest_xray_pneumonia --dataset-id ds-cxr-pneumonia-ref --limit 100
casecrawler import-synthea ./synthea/output/fhir --dataset-id ds-synthea-ref
# The Synthea import accepts FHIR JSON bundles, FHIR NDJSON resource directories,
# and standard Synthea CSV directories such as output/csv with patients.csv.
casecrawler run-synthea \
  --synthea-executable ./synthea/run_synthea \
  --output-dir ./synthea/output/fhir \
  --dataset-id ds-synthea-ref \
  --population 100
casecrawler import-reference-dataset \
  --repo-id org/custom-synthetic-notes \
  --dataset-id ds-custom-ref \
  --note-field clinical_note \
  --question-field prompt \
  --answer-field completion \
  --split eval \
  --limit 100
casecrawler import-reference-dataset \
  --repo-id org/custom-image-caption-dataset \
  --dataset-id ds-custom-image-ref \
  --note-field caption \
  --image-field image \
  --image-label-field label \
  --image-label-map '{"0":"normal","1":"pneumonia"}' \
  --split train \
  --limit 100
casecrawler import-reference-dataset local-validation-notes \
  --path ./validation/local-notes.jsonl \
  --dataset-id ds-local-ref \
  --note-field clinical_note \
  --question-field prompt \
  --answer-field completion \
  --lab-values-field labs \
  --limit 100
casecrawler benchmark-dataset \
  --dataset-id <dataset_id> \
  --reference-dataset-id ds-asclepius-ref \
  --min-overall-score 0.8 \
  --min-metric-score 0.5
casecrawler datasets quality <dataset_id>
casecrawler export-dataset \
  --dataset-id <dataset_id> \
  --reference-dataset-id ds-asclepius-ref \
  --min-overall-score 0.8 \
  --min-metric-score 0.5 \
  --format sft_jsonl \
  --output train.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format note_fact_sft_jsonl --output note_facts.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format tool_call_jsonl --output tools.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format time_series_jsonl --output time_series.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format dpo_jsonl --output preference.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format rl_jsonl --output episodes.jsonl
casecrawler export-dataset --dataset-id <dataset_id> --format fhir_ndjson --output fhir.ndjson
casecrawler export-dataset --dataset-id <dataset_id> --format parquet --output records.parquet
casecrawler export-dataset-splits \
  --dataset-id <dataset_id> \
  --format clinical_observation_jsonl \
  --reference-dataset-id ds-asclepius-ref \
  --min-overall-score 0.8 \
  --min-metric-score 0.5 \
  --output-dir finetune-package
casecrawler generate-release-package "mixed acute care cohort" \
  --count 25 \
  --max-validation-retries 2 \
  --output-dir release-package \
  --format multimodal_jsonl \
  --seed casecrawler \
  --age-min 45 \
  --age-max 88 \
  --encounter-count 3
casecrawler verify-split-package --require-multimodal-release release-package

Datasets generated with --require-human-review are blocked from export until each record is approved through casecrawler reviews mark <record_id> --status approved or the matching REST review endpoint. Quality reports surface missing human approvals as human_review.missing blockers.

generate-release-package is the shortest offline smoke path to a fine-tuning-ready multimodal package. By default it uses the full multimodal acute-care recipe and performs these steps:

Generates the cohort and writes file-backed synthetic radiology images
Seeds bundled reference fixtures for the generated recipe
Runs the benchmark gate
Writes dataset and model cards plus quality and benchmark audit reports
Copies the file-backed radiology images into the package images/ directory with image artifact metadata and provenance
Exports train/validation/test JSONL splits
Verifies strict multimodal release readiness

The default benchmark thresholds are a low but non-zero smoke gate for the bundled fixtures: --min-overall-score 0.1 and --min-metric-score 0.0. Raise them when benchmarking against larger imported reference datasets.

Benchmark reports compare generated cohorts to stored reference datasets and return explicit pass/fail gates plus failing metric names. They compare across demographics, note types, artifact density, declared-modality artifact coverage, extracted clinical fact targets, labs, vitals, medication history, time-series channels and backend provenance, imaging findings and backend provenance, and approval rates. Export commands and API downloads can require the same benchmark gate by passing a reference dataset id and thresholds, which prevents unbenchmarked or underperforming synthetic data from silently becoming fine-tuning input.
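
The pass/fail gate can be pictured with a small sketch. It assumes the overall score is the unweighted mean of the per-metric scores and that a cohort fails if the overall score or any single metric falls below its threshold; CaseCrawler's actual aggregation may differ. The function name `benchmark_gate` and the metric names in the usage note are hypothetical. The default values mirror the smoke-gate defaults quoted for generate-release-package.

```python
def benchmark_gate(metric_scores, min_overall_score=0.1, min_metric_score=0.0):
    """Return (passed, overall_score, failing_metric_names)."""
    if not metric_scores:
        # Nothing was measured, so the gate cannot pass.
        return False, 0.0, []
    # Assumption: overall score is the plain mean of the metric scores.
    overall = sum(metric_scores.values()) / len(metric_scores)
    failing = sorted(
        name for name, score in metric_scores.items() if score < min_metric_score
    )
    passed = overall >= min_overall_score and not failing
    return passed, overall, failing
```

With thresholds of 0.8 overall and 0.5 per metric, scores of 0.9, 0.4, and 0.8 for demographics, labs, and vitals would fail on both counts: the mean is about 0.7, and labs is reported as the failing metric.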

Registered Hugging Face reference datasets include synthetic clinical notes, doctor-patient dialogue to SOAP-note rows, clinical-note-to-FHIR rows, radiology consistency rows, de-identification and ICD-coding references such as Technetium-I, Synthea imports, and image-reference datasets such as SynthCheX-75K-v2 and synthetic chest X-ray pneumonia. Custom Hugging Face imports can map text fields, FHIR answer fields, PHI annotations, diagnosis-code fields, image fields, image-label fields, explicit lab/vital arrays, medication-history arrays, and time-series channel arrays into the local SyntheticRecord schema. The same field mapping works for local JSONL, NDJSON, JSON arrays, or {"rows": [...]} files via --path or the REST import path field, so private validation sets can stay local. Persisted image-reference imports also attach per-asset file metadata and Hugging Face license/use-policy provenance so benchmark images can be audited alongside generated images.
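
For a local import, the reference file only needs rows whose keys match the field flags passed on the command line. The sketch below writes a one-row JSONL file using the field names from the local-validation-notes example in the CLI reference (clinical_note, prompt, completion, labs). The row content is invented for illustration, and the shape of each labs entry (name, value, unit) is an assumption rather than CaseCrawler's documented lab schema.

```python
import json


def write_reference_jsonl(path, rows):
    """Write one JSON object per line and return the number of rows written."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


# Illustrative synthetic row; the keys inside each "labs" entry are assumed.
ROWS = [
    {
        "clinical_note": "Synthetic note: adult patient with fever and hypotension.",
        "prompt": "Summarize the key findings in this note.",
        "completion": "Fever and hypotension are documented.",
        "labs": [{"name": "lactate", "value": 3.1, "unit": "mmol/L"}],
    }
]
```

Calling write_reference_jsonl("local-notes.jsonl", ROWS) produces a file that can then be passed to casecrawler import-reference-dataset via --path together with --note-field, --question-field, --answer-field, and --lab-values-field.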

The current external landscape and model/dataset research notes are tracked in docs/research/2026-05-08-synthetic-healthcare-data-landscape.md.

REST API

Start the server with casecrawler serve or docker compose up.

Endpoint	Method	Description
`/api/ingest`	POST	Ingest content for a topic
`/api/ingest/{job_id}`	GET	Poll ingestion status
`/api/search?q=...`	GET	Search the knowledge base
`/api/sources`	GET	List available sources
`/api/datasets/capabilities`	GET	List modalities, export formats, strict release coverage requirements, validators, and model/profile adapters
`/api/datasets/generate`	POST	Generate synthetic healthcare records
`/api/datasets/release-package`	POST	Generate, benchmark, export, and return a multimodal release package zip
`/api/datasets/reference-catalog`	GET	List registered Hugging Face reference datasets
`/api/datasets/reference-import`	POST	Import registered reference datasets into local storage
`/api/datasets/synthea-import`	POST	Import Synthea FHIR JSON bundles, FHIR NDJSON directories, or CSV directories into local storage
`/api/datasets/{dataset_id}/benchmark`	GET	Compare a generated dataset to a reference dataset with configurable pass/fail thresholds
`/api/datasets/{dataset_id}/reference-fixtures`	POST	Seed bundled benchmark fixtures for a generated dataset recipe
`/api/datasets/{dataset_id}/benchmark-plan`	GET	Show recommended reference readiness for a generated dataset
`/api/datasets/{dataset_id}/quality`	GET	Summarize validation and fine-tuning export readiness
`/api/datasets/{dataset_id}/export`	GET	Stream fine-tuning/export records
`/api/datasets/{dataset_id}/export-splits`	GET	Download train/validation/test split package zip

Example: build a release package with a single API call:

curl -X POST http://localhost:8000/api/datasets/release-package \
  -H 'Content-Type: application/json' \
  -o release-package.zip \
  -d '{
    "topic": "mixed acute care cohort",
    "count": 25,
    "recipe": "full_multimodal_acute_care",
    "export_format": "multimodal_jsonl",
    "seed": "casecrawler"
  }'
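
The same request can be made from Python with the standard library alone. This is a sketch that mirrors the curl call above: the endpoint, headers, and JSON body are taken from it, while the helper names `build_release_request` and `download_release_package` are hypothetical.

```python
import json
import shutil
import urllib.request


def build_release_request(base_url="http://localhost:8000"):
    """Build the POST request for /api/datasets/release-package."""
    body = {
        "topic": "mixed acute care cohort",
        "count": 25,
        "recipe": "full_multimodal_acute_care",
        "export_format": "multimodal_jsonl",
        "seed": "casecrawler",
    }
    return urllib.request.Request(
        f"{base_url}/api/datasets/release-package",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def download_release_package(output_path="release-package.zip"):
    """Send the request and stream the returned zip to disk."""
    request = build_release_request()
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return output_path
```

download_release_package() needs a running server (casecrawler serve or docker compose up); build_release_request() can be inspected without one.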

Configuration

`.env`

# Optional LLM providers
ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...
# OPENROUTER_API_KEY=sk-or-...

# Optional paid data sources
# GLASS_API_KEY=
# ANNAS_ARCHIVE_API_KEY=
# FIRECRAWL_API_KEY=

# Optional free-source rate-limit keys
# NCBI_API_KEY=
# OPENFDA_API_KEY=

`config.yaml`

ingestion:
  default_limit_per_source: 20
  sources:
    priority: [pubmed, glass, openfda, annas_archive, dailymed, rxnorm, medrxiv, clinicaltrials, firecrawl]
    disabled: []

chunking:
  default_chunk_size: 500
  overlap: 50

embedding:
  model: "all-MiniLM-L6-v2"

storage:
  chroma_persist_dir: "./data/chroma"

llm:
  provider: "anthropic"
  model: "claude-sonnet-4-6"

synthetic:
  clinical_text_backend: "deterministic" # or "llm" or "external"
  clinical_text_noise_profile: "standard" # standard, message, ocr, or heavy
  clinical_text_model_profile: null # e.g. medgemma_4b_it or meditron_7b
  clinical_text_command: null # e.g. ["hf-note-sample", "--model", "local-notes"]
  # GenerationRequest can override clinical_text_backend, llm_provider,
  # llm_model, ollama_base_url, clinical_text_noise_profile,
  # clinical_text_model_profile, and clinical_text_command for one dataset run.
  # External clinical text commands receive stdin JSON with record and must print
  # ClinicalDocument[] or {"documents": ClinicalDocument[]} to stdout.
  imaging_backend: "placeholder" # or "diffusers" or "external"
  imaging_model_profile: null # e.g. cxr_pneumonia_dreambooth
  diffusers_model_id: "stabilityai/stable-diffusion-2-1"
  imaging_command: null # e.g. ["hf-image-sample", "--model", "local-cxr"]
  # Imaging profiles declare prompt inputs, generated ImagingAsset output fields,
  # licensing/use policy, and validation gates for file integrity and alignment.
  # External imaging commands receive stdin JSON with output_dir, prompt, modality,
  # and body_region. They must print an ImagingAsset JSON object or
  # {"asset": ImagingAsset} to stdout.
  time_series_backend: "deterministic" # or "external"
  time_series_model_profile: null # e.g. timediff or rawmed
  time_series_command: null # e.g. ["timediff-sample", "--checkpoint", "local.pt"]
  # External commands receive stdin JSON with record, channels, and points.
  # They must print a JSON array of TimeSeriesChannel objects or
  # {"channels": [TimeSeriesChannel, ...]} to stdout.
  synthea_executable: null
  # GenerationRequest.cohort_constraints supports age_min, age_max, sexes,
  # sex_cycle, topic_mix, and base_time for deterministic cohort composition.
  # GenerationRequest can also override imaging_backend, imaging_model_profile,
  # and diffusers_model_id for a single dataset generation run.
  # It can also override time_series_backend, time_series_model_profile,
  # and time_series_command for external EHR time-series adapters.
  export_formats:
    - raw_jsonl
    - sft_jsonl
    - note_fact_sft_jsonl
    - chat_jsonl
    - tool_call_jsonl
    - multimodal_jsonl
    - time_series_jsonl
    - dpo_jsonl
    - rl_jsonl
    - fhir_ndjson
    - parquet

api:
  host: "0.0.0.0"
  port: 8000
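
The chunking block of config.yaml above can be read as a sliding window: each chunk holds up to default_chunk_size units and shares overlap units with the previous chunk, so the window advances by 450 units per step with the defaults shown. The sketch below assumes the units are characters, which this README does not confirm (CaseCrawler may count tokens instead), and `chunk_text` is a hypothetical helper rather than the project's own chunker.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size that overlap by `overlap` units."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of storing and embedding more of them.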

Development

pip install -e ".[dev]"
pytest tests/ -v
ruff check src tests

See docs/testing.md for pull-request, optional-backend, UI, network, and release-package smoke test tiers.

Project Structure

src/casecrawler/
  sources/       # Public and paid medical source adapters
  pipeline/      # Chunking, tagging, embedding, Chroma storage
  generation/    # Synthetic dataset generators and backend adapters
  validation/    # Synthetic record validation
  storage/       # SQLite stores
  export/        # Fine-tuning and benchmark-ready export profiles
  api/           # FastAPI routes
