
Step 1: Generating LinkML Models from Source Data

In this step, we will use the LinkML ecosystem, specifically schema-automator, to automatically infer and generate formal LinkML schemas from our raw, heterogeneous source data (Synthea CSV files). This provides a declarative, machine-readable description of the input data that is essential for AI-augmented mapping.

Prerequisites & Environment

Create and activate a Python virtual environment, then install project dependencies from requirements.txt:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Understanding the Process

The data/ directory contains synthetic health records generated by Synthea (e.g., patients.csv, observations.csv).

Instead of writing data definition models by hand, we leverage schemauto (from the schema-automator package).

  • generalize-tsv: This tool reads the CSV/TSV headers and data types to infer a robust base LinkML schema.
  • annotate-schema: Optionally, schema-automator also provides annotators and generalizers that can enrich your auto-generated schema with metadata or standard ontology references when appropriately configured.

Explore the CLI before running the full pipeline:

schemauto --help
schemauto generalize-tsvs --help
schemauto generalize-tsv -s , --schema-name Patient -o patients.yaml data/patients.csv
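For orientation, the inferred schema is a LinkML YAML document roughly of the following shape. This is an illustrative sketch only: the slot names come from the Synthea CSV headers, and the exact structure and inferred ranges in the real tool output will differ.

```yaml
# Illustrative sketch of an auto-inferred schema; actual schemauto output differs.
name: Patient
classes:
  Patient:
    slots:
      - Id
      - BIRTHDATE
      - GENDER
slots:
  Id:
    range: string
  BIRTHDATE:
    range: date
  GENDER:
    range: string
```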

Running the Schema Generation Script

We have provided a bash script (generate_schemas.sh) to automate this pipeline across all the CSV files in your data/ folder.

  1. Make the script executable (if on macOS/Linux):

    chmod +x generate_schemas.sh
  2. Execute the script:

    ./generate_schemas.sh

What the script does:

  • It uses schemauto generalize-tsv to parse the data/patients.csv file.
  • It infers an initial LinkML YAML schema (raw_patients_schema.yaml) with the Patients class and its corresponding attributes.
  • It runs a Python script (enrich_schema.py) to post-process the generated schema. This ensures each slot incorporates OMOP-CDM-like metadata (e.g., description, imported_from, range, identifier, and required).
  • It outputs the final, enriched model to source_schemas/patients_schema.yaml.
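The post-processing step can be sketched as follows. This is a minimal illustration of the idea behind enrich_schema.py, not the actual script; the slot names and OMOP metadata values shown are hypothetical.

```python
# Minimal sketch of the enrichment idea behind enrich_schema.py (hypothetical,
# not the actual script). It merges OMOP-CDM-like metadata into each slot of a
# LinkML schema held as a plain dict (as loaded from YAML).

# Hypothetical per-slot metadata keyed by slot name.
OMOP_METADATA = {
    "Id": {"description": "Primary key for the person.",
           "imported_from": "OMOP person.person_id",
           "range": "string", "identifier": True, "required": True},
    "BIRTHDATE": {"description": "Date of birth.",
                  "imported_from": "OMOP person.birth_datetime",
                  "range": "date", "required": True},
}

def enrich_schema(schema: dict, metadata: dict) -> dict:
    """Return a copy of `schema` whose slots carry the extra metadata."""
    enriched = {**schema, "slots": {}}
    for slot_name, slot_def in schema.get("slots", {}).items():
        extra = metadata.get(slot_name, {})
        # Curated metadata overrides the auto-inferred definition.
        enriched["slots"][slot_name] = {**slot_def, **extra}
    return enriched

raw = {"name": "Patient",
       "slots": {"Id": {"range": "string"}, "BIRTHDATE": {"range": "string"}}}
enriched = enrich_schema(raw, OMOP_METADATA)
```

The merge order means the curated metadata wins when it disagrees with the inferred value (e.g., BIRTHDATE is promoted from string to date), while slots without curated entries pass through unchanged.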

Reviewing the Output

Once the script completes, open source_schemas/patients_schema.yaml in your text editor.

You will see a fully declarative model of the patient structure: the Patients class and enriched attributes prepared for alignment with the OMOP-CDM schema. This foundational schema allows the "Model Alignment Agent" in subsequent steps to reason reliably over the source data structures.
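As a rough illustration of what an enriched slot looks like (the slot name and metadata values here are hypothetical, not a verbatim excerpt from the generated file):

```yaml
# Hypothetical excerpt; your generated patients_schema.yaml will differ.
slots:
  Id:
    description: Primary key for the person.
    imported_from: OMOP person.person_id
    range: string
    identifier: true
    required: true
```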

Profiling all Synthea tables in data/

To infer enriched LinkML schemas for every CSV under data/ (the same generalize-tsv + enrich_schema.py pipeline as above), run from the repo root with schemauto on your PATH:

python scripts/profile_synthea_tables.py

Outputs appear as source_schemas/<csv_stem>_schema.yaml (plus raw_<stem>_schema.yaml intermediates).
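The per-table loop can be sketched as below. This is a hypothetical illustration in the spirit of scripts/profile_synthea_tables.py, not the actual script; it builds one schemauto invocation per CSV, mirroring the single-file command shown earlier.

```python
# Sketch of a per-table driver (hypothetical; the real
# scripts/profile_synthea_tables.py may differ). Builds one schemauto
# invocation per CSV under data/, mirroring the single-file command above.
from pathlib import Path

def build_commands(data_dir: str = "data", out_dir: str = "source_schemas"):
    """Return (command, output_path) pairs for every CSV in data_dir."""
    jobs = []
    for csv in sorted(Path(data_dir).glob("*.csv")):
        stem = csv.stem                      # e.g. "patients"
        raw_out = f"{out_dir}/raw_{stem}_schema.yaml"
        cmd = ["schemauto", "generalize-tsv", "-s", ",",
               "--schema-name", stem.capitalize(),
               "-o", raw_out, str(csv)]
        jobs.append((cmd, raw_out))
    return jobs

# Each command could then be executed with subprocess.run(cmd, check=True),
# followed by the enrich_schema.py post-processing step for each raw schema.
```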

Next step: use these profiles with the LLM map generator — see LLM_MAP_PIPELINE.md.