In this step, we use the LinkML ecosystem, specifically schema-automator, to infer and generate formal LinkML schemas from our raw, heterogeneous source data (Synthea CSV files). This gives the input data a declarative, machine-readable structure, which is essential for AI-augmented mapping.
Create and activate a Python virtual environment, then install project dependencies from requirements.txt:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The data/ directory contains synthetic health records generated by Synthea (like patients.csv, observations.csv, etc.).
Instead of writing data definition models by hand, we leverage schemauto (from the schema-automator package).
- generalize-tsv: This tool reads the CSV/TSV headers and the values in each column to infer a base LinkML schema.
- annotate-schema: Optionally, schema-automator also provides annotators and generalizers that can enrich the auto-generated schema with metadata or standard ontology references, if appropriately configured.
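To make the inference idea concrete, here is a minimal Python sketch of how column ranges could be guessed from CSV values. It is an illustration only, not schema-automator's actual algorithm; the helper names infer_range and infer_slots and the sample rows are made up for this example.

```python
import csv
import io


def infer_range(values):
    """Guess a LinkML-style range from a column's string values (illustrative only)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the narrowest type first; fall back to string if any value fails to parse.
    for range_name, caster in (("integer", int), ("float", float)):
        try:
            for value in non_empty:
                caster(value)
            return range_name
        except ValueError:
            continue
    return "string"


def infer_slots(csv_text, delimiter=","):
    """Return {column: {"range": ..., "required": ...}} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = list(reader)
    slots = {}
    for name in reader.fieldnames or []:
        column = [row[name] or "" for row in rows]
        slots[name] = {
            "range": infer_range(column),
            # A column counts as required here only if every row has a value.
            "required": bool(column) and all(v != "" for v in column),
        }
    return slots


# Made-up rows, loosely shaped like a patient table (not real Synthea output).
SAMPLE = (
    "Id,BIRTHDATE,DEATHDATE,HEALTHCARE_EXPENSES\n"
    "p1,1990-01-01,,1200.50\n"
    "p2,1985-06-30,2020-03-04,300\n"
)

if __name__ == "__main__":
    for column, slot in infer_slots(SAMPLE).items():
        print(column, slot)
```

A real generalizer has to handle far more cases (dates, enumerations, identifiers, multivalued fields), which is why this step relies on schemauto rather than hand-written inference.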
schemauto --help
schemauto generalize-tsvs --help
schemauto generalize-tsv -s , --schema-name Patient -o patients.yaml data/patients.csv
We have provided a bash script (generate_schemas.sh) to automate this pipeline across all the CSV files in your data/ folder.
- Make the script executable (if on macOS/Linux):
chmod +x generate_schemas.sh
- Execute the script:
./generate_schemas.sh
What the script does:
- It uses schemauto generalize-tsv to parse the data/patients.csv file.
- It infers an initial LinkML YAML schema (raw_patients_schema.yaml) with the Patients class and its corresponding attributes.
- It runs a Python script (enrich_schema.py) to post-process the generated schema. This ensures each slot incorporates OMOP-CDM-like metadata (e.g., description, imported_from, range, identifier, and required).
- It outputs the final, enriched model to source_schemas/patients_schema.yaml.
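The enrichment step can be pictured with a small Python sketch. This is not the contents of enrich_schema.py; the function enrich_schema, its parameters, and the default description text are assumptions, and the schema is modeled as a plain dict shaped like the LinkML YAML.

```python
import copy


def enrich_schema(schema, class_name, source_file, id_slot):
    """Return a copy of `schema` whose attributes carry basic metadata (hypothetical helper)."""
    enriched = copy.deepcopy(schema)  # leave the raw schema untouched
    attributes = enriched["classes"][class_name]["attributes"]
    for name in list(attributes):
        # An attribute with no properties may be None when loaded from YAML.
        slot = attributes[name] or {}
        slot.setdefault("description", f"Column {name} from {source_file}")
        slot.setdefault("imported_from", source_file)
        slot.setdefault("range", "string")
        if name == id_slot:
            slot["identifier"] = True
            slot["required"] = True
        else:
            slot.setdefault("required", False)
        attributes[name] = slot
    return enriched


if __name__ == "__main__":
    # Made-up raw schema fragment for illustration.
    raw = {
        "classes": {
            "Patients": {
                "attributes": {"Id": {"range": "string"}, "BIRTHDATE": None}
            }
        }
    }
    result = enrich_schema(raw, "Patients", "data/patients.csv", "Id")
    for slot_name, slot in result["classes"]["Patients"]["attributes"].items():
        print(slot_name, slot)
```

Using setdefault keeps anything the generalizer already inferred, so the enrichment only fills gaps rather than overwriting the raw schema.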
Once the script completes, open source_schemas/patients_schema.yaml in your text editor.
You will see a fully declarative dictionary representing the patient structure: the Patients class and enriched attributes tailored for seamless alignment with the OMOP-CDM schema. This foundational schema allows the "Model Alignment Agent" in subsequent steps to reliably reason over the source data structures.
To infer enriched LinkML schemas for every CSV under data/ (same generalize-tsv → enrich_schema.py pipeline as above), run from the repo root with schemauto on your PATH:
python scripts/profile_synthea_tables.py
Outputs appear as source_schemas/<csv_stem>_schema.yaml (plus raw_<stem>_schema.yaml intermediates).
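As a rough sketch of what such a batch run might look like (not the actual scripts/profile_synthea_tables.py), the following Python plans one generalize-tsv command per CSV. The helper plan_jobs, the choice to derive the schema name from the file stem, and the placement of raw intermediates next to the final files are assumptions; the command-line flags are the ones used in the single-file example earlier in this step.

```python
import shlex
from pathlib import Path


def plan_jobs(data_dir, out_dir):
    """Build one schemauto command and its output paths per CSV in data_dir (hypothetical helper)."""
    jobs = []
    # Non-recursive: only CSV files directly inside data_dir are considered.
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        stem = csv_path.stem
        raw_path = Path(out_dir) / f"raw_{stem}_schema.yaml"
        final_path = Path(out_dir) / f"{stem}_schema.yaml"
        command = [
            "schemauto", "generalize-tsv",
            "-s", ",",
            "--schema-name", stem.capitalize(),  # assumption: name derived from file stem
            "-o", str(raw_path),
            str(csv_path),
        ]
        jobs.append({"command": command, "raw": raw_path, "final": final_path})
    return jobs


if __name__ == "__main__":
    # Dry run: print the planned commands instead of executing them.
    for job in plan_jobs("data", "source_schemas"):
        print(shlex.join(job["command"]))
```

Each planned command can then be executed with subprocess.run(job["command"], check=True), followed by whatever enrichment step turns the raw file into the final one.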
Next step: use these profiles with the LLM map generator (see LLM_MAP_PIPELINE.md).