
Step 1: Generating LinkML Models from Source Data

In this step, we will use the LinkML ecosystem, specifically schema-automator, to automatically infer and generate formal LinkML schemas from our raw, heterogeneous source data (Synthea CSV files). This provides a declarative, machine-readable description of the input data that is essential for AI-augmented mapping.

Prerequisites & Environment

Create and activate a Python virtual environment, then install project dependencies from requirements.txt:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Understanding the Process

The data/ directory contains synthetic health records generated by Synthea (e.g., patients.csv, observations.csv).

Instead of writing data definition models by hand, we leverage schemauto (from the schema-automator package).

  • generalize-tsv: This tool reads the CSV/TSV headers and data types to infer a robust base LinkML schema.
  • annotate-schema: Optionally, schema-automator also provides annotators and generalizers that can enrich your auto-generated schema with metadata or standard ontology references when appropriately configured.

Explore the CLI before running the full pipeline:

schemauto --help
schemauto generalize-tsvs --help
schemauto generalize-tsv -s , --schema-name Patient -o patients.yaml data/patients.csv
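For orientation, the inferred schema is a LinkML YAML document roughly of the following shape. This is an illustrative sketch only: the slot names come from the Synthea CSV headers, and the exact structure and inferred ranges in the real tool output will differ.

```yaml
# Illustrative sketch of an auto-inferred schema; actual schemauto output differs.
name: Patient
classes:
  Patient:
    slots:
      - Id
      - BIRTHDATE
      - GENDER
slots:
  Id:
    range: string
  BIRTHDATE:
    range: date
  GENDER:
    range: string
```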

Running the Schema Generation Script

We have provided a bash script (generate_schemas.sh) to automate this pipeline across all the CSV files in your data/ folder.

  1. Make the script executable (if on macOS/Linux):

    chmod +x generate_schemas.sh
  2. Execute the script:

    ./generate_schemas.sh

What the script does:

  • It uses schemauto generalize-tsv to parse the data/patients.csv file.
  • It infers an initial LinkML YAML schema (raw_patients_schema.yaml) with the Patients class and its corresponding attributes.
  • It runs a Python script (enrich_schema.py) to post-process the generated schema. This ensures each slot incorporates OMOP-CDM-like metadata (e.g., description, imported_from, range, identifier, and required).
  • It outputs the final, enriched model to source_schemas/patients_schema.yaml.
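The post-processing step can be sketched as follows. This is a minimal illustration of the idea behind enrich_schema.py, not the actual script; the slot names and OMOP metadata values shown are hypothetical.

```python
# Minimal sketch of the enrichment idea behind enrich_schema.py (hypothetical,
# not the actual script). It merges OMOP-CDM-like metadata into each slot of a
# LinkML schema held as a plain dict (as loaded from YAML).

# Hypothetical per-slot metadata keyed by slot name.
OMOP_METADATA = {
    "Id": {"description": "Primary key for the person.",
           "imported_from": "OMOP person.person_id",
           "range": "string", "identifier": True, "required": True},
    "BIRTHDATE": {"description": "Date of birth.",
                  "imported_from": "OMOP person.birth_datetime",
                  "range": "date", "required": True},
}

def enrich_schema(schema: dict, metadata: dict) -> dict:
    """Return a copy of `schema` whose slots carry the extra metadata."""
    enriched = {**schema, "slots": {}}
    for slot_name, slot_def in schema.get("slots", {}).items():
        extra = metadata.get(slot_name, {})
        # Curated metadata overrides the auto-inferred definition.
        enriched["slots"][slot_name] = {**slot_def, **extra}
    return enriched

raw = {"name": "Patient",
       "slots": {"Id": {"range": "string"}, "BIRTHDATE": {"range": "string"}}}
enriched = enrich_schema(raw, OMOP_METADATA)
```

The merge order means the curated metadata wins when it disagrees with the inferred value (e.g., BIRTHDATE is promoted from string to date), while slots without curated entries pass through unchanged.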

Reviewing the Output

Once the script completes, open source_schemas/patients_schema.yaml in your text editor.

You will see a fully declarative model of the patient structure: the Patients class and enriched attributes prepared for alignment with the OMOP-CDM schema. This foundational schema allows the "Model Alignment Agent" in subsequent steps to reason reliably over the source data structures.
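As a rough illustration of what an enriched slot looks like (the slot name and metadata values here are hypothetical, not a verbatim excerpt from the generated file):

```yaml
# Hypothetical excerpt; your generated patients_schema.yaml will differ.
slots:
  Id:
    description: Primary key for the person.
    imported_from: OMOP person.person_id
    range: string
    identifier: true
    required: true
```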

Profiling all Synthea tables in data/

To infer enriched LinkML schemas for every CSV under data/ (the same generalize-tsv + enrich_schema.py pipeline as above), run from the repo root with schemauto on your PATH:

python scripts/profile_synthea_tables.py

Outputs appear as source_schemas/<csv_stem>_schema.yaml (plus raw_<stem>_schema.yaml intermediates).
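The per-table loop can be sketched as below. This is a hypothetical illustration in the spirit of scripts/profile_synthea_tables.py, not the actual script; it builds one schemauto invocation per CSV, mirroring the single-file command shown earlier.

```python
# Sketch of a per-table driver (hypothetical; the real
# scripts/profile_synthea_tables.py may differ). Builds one schemauto
# invocation per CSV under data/, mirroring the single-file command above.
from pathlib import Path

def build_commands(data_dir: str = "data", out_dir: str = "source_schemas"):
    """Return (command, output_path) pairs for every CSV in data_dir."""
    jobs = []
    for csv in sorted(Path(data_dir).glob("*.csv")):
        stem = csv.stem                      # e.g. "patients"
        raw_out = f"{out_dir}/raw_{stem}_schema.yaml"
        cmd = ["schemauto", "generalize-tsv", "-s", ",",
               "--schema-name", stem.capitalize(),
               "-o", raw_out, str(csv)]
        jobs.append((cmd, raw_out))
    return jobs

# Each command could then be executed with subprocess.run(cmd, check=True),
# followed by the enrich_schema.py post-processing step for each raw schema.
```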

Next step: use these profiles with the LLM map generator — see LLM_MAP_PIPELINE.md.