Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
d4ee629
Add staging models for all PLUTO inputs
alexrichey Feb 12, 2026
e66d864
Fix missed source table references (n58.1.9)
alexrichey Feb 13, 2026
2e8ee82
Integrate dbt into build pipeline (n58.1.10)
alexrichey Feb 13, 2026
dfc11f9
Migrate CSV lookups to dbt seeds (n58.1.11)
alexrichey Feb 13, 2026
8309006
Replace 01_load_local_csvs.sh with dbt seed workflow
alexrichey Feb 19, 2026
9220ba2
Fix dbt profiles and schema config issues
alexrichey Feb 19, 2026
759d0d1
Fix seed column type errors for BBL fields
alexrichey Feb 19, 2026
48046a4
Fix column name case in pluto_input_condolot_descriptiveattributes
alexrichey Feb 19, 2026
0a6df0b
change pluto_input_research to ref
alexrichey Feb 19, 2026
f4eff29
model fixups
alexrichey Feb 19, 2026
d031eb2
Fix seed schema and staging model references
alexrichey Feb 24, 2026
6921766
Remove unnecessary schema override from PLUTO seeds
alexrichey Feb 24, 2026
8fdc51f
Fix case-sensitive column names for pluto_input_condolot_descriptivea…
alexrichey Feb 24, 2026
49633da
Fix case-sensitive column names for pluto_input_condo_bldgclass
alexrichey Feb 24, 2026
328ef62
use recipe cache
alexrichey Feb 24, 2026
d055d0d
Fix pluto_input_bsmtcode column types
alexrichey Feb 24, 2026
2977a42
remove duped recipe sources
alexrichey Mar 23, 2026
7d29cbb
fix staging table references
alexrichey Mar 23, 2026
3442817
downcase the zoning_maxfar seed
alexrichey Mar 23, 2026
1824901
remaining issues
alexrichey Mar 24, 2026
003fd97
fix corrections mismappings
alexrichey Mar 24, 2026
d064af1
Add comparison to night_qa model
alexrichey Mar 27, 2026
26392f1
fluff
alexrichey Mar 27, 2026
674bbcc
Run staging models first and update geocodes references
alexrichey Mar 28, 2026
9c88b92
Materialize stg__pluto_input_geocodes as table for mutation
alexrichey Mar 28, 2026
bf5f256
DBT'ize all the models
alexrichey Mar 27, 2026
a0ee93e
Wave 2: Convert create_allocated.sql to dbt model
alexrichey Mar 30, 2026
de67294
Wave 2: Absorb yearbuiltalt.sql into int__pluto_allocated
alexrichey Mar 30, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions .github/workflows/pluto_build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -81,10 +81,10 @@ jobs:
- name: Dataloading
working-directory: products/pluto
run: python3 -m dcpy lifecycle builds load load --recipe-path ${{ inputs.recipe_file
}}.lock.yml
}}.lock.yml --cache-schema recipe_cache --cached-entity-type view

- name: Load Local Data
run: ./01_load_local_csvs.sh
- name: Load Local Data and DBT Staging
run: ./01a_dbt_staging.sh

- name: building ...
run: ./02_build.sh
Expand Down
1 change: 0 additions & 1 deletion bash/bin/dcp_load_recipe_dwim
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,6 @@ fi
echo "📥 Loading recipe to database..."
python3 -m dcpy lifecycle builds load load \
--recipe-path "${RECIPE_FILE}.lock.yml" \
--clear-schema false \
--cache-schema "${RECIPE_SCHEMA}" \
--cached-entity-type view

Expand Down
7 changes: 4 additions & 3 deletions dcpy/lifecycle/builds/load.py
Original file line number Diff line number Diff line change
Expand Up @@ -252,9 +252,9 @@ def _cli_wrapper_load(
help="Path of recipe lock file to use",
),
clear_pg_schema: bool = typer.Option(
True,
"--clear-schema",
"-x",
False, # Default is now False
"--clear-schema/--no-clear-schema", # Enable/Disable flags
"-x/-X", # Optional short versions
help="Clear the build schema?",
),
cache_schema: str = typer.Option(
Expand All @@ -270,6 +270,7 @@ def _cli_wrapper_load(
help="How to cache datasets: 'view' creates views (read-only), 'copy' creates table copies (modifiable)",
),
):
print(f"clearing schema? {clear_pg_schema}")
recipe_lock_path = recipe_lock_path or (
Path(plan.DEFAULT_RECIPE).parent / "recipe.lock.yml"
)
Expand Down
274 changes: 274 additions & 0 deletions products/pluto/DATA_FLOW_ANALYSIS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,274 @@
# PLUTO Data Flow Analysis

## Big Picture: How PLUTO Gets Built

### Phase 1: Setup (Preparation Tables)
```
01a_dbt_staging.sh
├── Creates staging tables from raw inputs
└── Sets up: pluto_rpad_geo, pluto_allocated, pluto_input_cama, etc.
```

### Phase 2: Main Build (02_build.sh)

#### Step 1: Create Empty PLUTO Table (line 21)
```sql
-- create.sql creates empty table with ~100 columns
CREATE TABLE pluto (...);
```

#### Step 2: INSERT Initial Rows (line 22)
```sql
-- bbl.sql - THIS IS WHERE ROWS ARE INSERTED
INSERT INTO pluto (bbl, borocode, borough, block, lot)
SELECT DISTINCT primebbl, ...
FROM pluto_rpad_geo;
```
**This creates one row per BBL** (the grain of PLUTO)

#### Step 3: First Major Population (line 25)
```sql
-- allocated.sql - Adds ~25 fields from pluto_allocated
UPDATE pluto SET bldgclass, ownername, lotarea, bldgarea, ...
FROM pluto_allocated WHERE pluto.bbl = pluto_allocated.bbl;
```

#### Step 4: Add Geocodes (line 28)
```sql
-- geocodes.sql - Adds ~20 fields from pluto_rpad_geo
UPDATE pluto SET cd, ct2010, schooldist, zipcode, address, ...
FROM pluto_rpad_geo WHERE pluto.bbl = pluto_rpad_geo.primebbl;
```

#### Step 5: Progressive Enhancement (lines 31-90)
Then 60+ SQL files run, each doing UPDATE statements to add/refine fields:
- CAMA fields (bsmttype, lottype, proxcode, easements)
- Zoning fields (splitzone, specialdistrict)
- Geocoding (dtmgeoms, latlong)
- Classifications (bldgclass)
- etc.

#### Step 6: DBT Integration (line 87-90)
```bash
dbt run --select tag:pluto_enrichment pluto_enriched
run_sql_file sql/apply_dbt_enrichments.sql
```
Creates pluto_enriched table (from dbt models), then copies ~28 fields back to pluto

### Phase 3: Corrections (03_corrections.sh)
Research-driven corrections to specific BBLs

---

## Answer to Your Questions

### 1️⃣ When are rows inserted?

**Line 22 of 02_build.sh: `bbl.sql`**

This is the ONLY place where rows are inserted into pluto:
```sql
INSERT INTO pluto (bbl, borocode, borough, block, lot)
SELECT DISTINCT primebbl, ... FROM pluto_rpad_geo;
```

Everything else is UPDATE statements that enrich these rows.

### 2️⃣ DBT Strategy - Two Approaches:

---

## Strategy A: "Big Bang" - Replace PLUTO with Pure DBT

### Concept
Stop doing UPDATE statements entirely. Build pluto as a single dbt model (or series of CTEs).

### Approach
```sql
-- models/product/pluto.sql
WITH base_bbls AS (
SELECT DISTINCT primebbl as bbl, ... FROM pluto_rpad_geo
),

allocated_data AS (
SELECT bbl, bldgclass, ownername, ... FROM pluto_allocated
),

geocode_data AS (
SELECT primebbl as bbl, cd, zipcode, ... FROM pluto_rpad_geo
),

-- Import existing dbt models
far_data AS (SELECT * FROM {{ ref('far') }}),
landuse_data AS (SELECT * FROM {{ ref('landuse') }}),
... all other dbt models ...

-- New dbt models for remaining SQL
cama_data AS (SELECT * FROM {{ ref('cama_bsmttype') }}),
zoning_data AS (SELECT * FROM {{ ref('zoning_splitzone') }}),
...

final AS (
SELECT
base_bbls.*,
allocated_data.* EXCLUDE (bbl),
geocode_data.* EXCLUDE (bbl),
far_data.* EXCLUDE (bbl),
cama_data.* EXCLUDE (bbl),
zoning_data.* EXCLUDE (bbl),
...
FROM base_bbls
LEFT JOIN allocated_data USING (bbl)
LEFT JOIN geocode_data USING (bbl)
LEFT JOIN far_data USING (bbl)
LEFT JOIN cama_data USING (bbl)
LEFT JOIN zoning_data USING (bbl)
...
)

SELECT * FROM final
```

### Pros
- ✅ Pure declarative dbt - no imperative UPDATEs
- ✅ Easy to test, debug, and reason about
- ✅ Can regenerate entire PLUTO from scratch
- ✅ Clear dependency graph in dbt

### Cons
- ❌ Big migration effort - all at once
- ❌ Need to rewrite complex UPDATE logic (e.g., spatial joins, geocoding)
- ❌ May be harder to preserve exact current behavior

---

## Strategy B: "Progressive Migration" (RECOMMENDED)

### Concept
Keep the current UPDATE pattern but gradually move logic to dbt models.

### Current State (What You've Done)
```bash
# Phase 1: SQL creates and populates base pluto
run_sql_file sql/create.sql
run_sql_file sql/bbl.sql # INSERT rows
run_sql_file sql/allocated.sql # UPDATE with allocated data
run_sql_file sql/geocodes.sql # UPDATE with geocode data
... more SQL UPDATEs ...

# Phase 2: DBT enrichment
dbt run --select tag:pluto_enrichment pluto_enriched

# Phase 3: Apply dbt results
run_sql_file sql/apply_dbt_enrichments.sql # Copies from pluto_enriched
```

### Next Steps for Progressive Migration

#### Wave 1: CAMA + Zoning + Classification (SAFE - No dependencies)
Convert these to dbt models:
- `cama_bsmttype.sql` → `models/intermediate/cama/cama_bsmttype.sql`
- `cama_lottype.sql` → `models/intermediate/cama/cama_lottype.sql`
- `cama_proxcode.sql` → `models/intermediate/cama/cama_proxcode.sql`
- `cama_easements.sql` → `models/intermediate/cama/cama_easements.sql`
- `zoning_splitzone.sql` → `models/intermediate/zoning/zoning_splitzone.sql`
- `zoning_specialdistrict.sql` → `models/intermediate/zoning/zoning_specialdistrict.sql`
- `bldgclass.sql` → `models/intermediate/simple/bldgclass.sql`

Then:
1. Add these models to pluto_enriched (add LEFT JOINs)
2. Add fields to apply_dbt_enrichments.sql SET clause
3. Remove run_sql_file calls from 02_build.sh
4. Delete original SQL files

#### Wave 2: Prime BBL + Year Built (Separate tables)
- `primebbl.sql` → Updates dof_pts_propmaster, pluto_rpad_geo
- `yearbuiltalt.sql` → Updates pluto_allocated

These are trickier because they update tables OTHER than pluto.
Options:
a) Leave as SQL (they're prep tables)
b) Move to dbt staging models that materialize these tables

#### Wave 3: Geocoding (Test DBT dependencies)
- `latlong.sql` → Currently runs BEFORE dbt
- `dtmgeoms.sql` → Geometry operations

Need to verify: Do any existing dbt models read centroid/lat/long from pluto?
If yes, either:
a) Keep these as SQL before dbt run
b) Create dbt source for these fields and manage dependencies

#### Wave 4: Remaining
- `plutomapid.sql`, `versions.sql`, `numericfields_geomfields.sql`

---

## Recommended DBT Strategy: Progressive Migration

### Why This Works Better

1. **Incremental risk** - Migrate 7 files in Wave 1, test, then proceed
2. **Preserves current pattern** - Keep the pluto table as mutable state
3. **Easy rollback** - Just revert apply_dbt_enrichments.sql changes
4. **Clear testing** - After each wave, dbt run + 02_build.sh should work

### The DBT Architecture Should Be:

```
models/
├── intermediate/
│ ├── cama/
│ │ ├── cama_bsmttype.sql (NEW - Wave 1)
│ │ ├── cama_lottype.sql (NEW - Wave 1)
│ │ ├── cama_proxcode.sql (NEW - Wave 1)
│ │ └── cama_easements.sql (NEW - Wave 1)
│ ├── zoning/
│ │ ├── zoning_splitzone.sql (NEW - Wave 1)
│ │ └── zoning_specialdistrict.sql (NEW - Wave 1)
│ └── simple/
│ ├── far.sql (DONE)
│ ├── landuse.sql (DONE)
│ ├── bldgclass.sql (NEW - Wave 1)
│ └── ...
└── product/
└── pluto_enriched.sql (Expand with Wave 1 models)
```

### Modified 02_build.sh After Wave 1:
```bash
# Lines 33-40: Remove these run_sql_file calls
# run_sql_file sql/cama_bsmttype.sql ❌ DELETE
# run_sql_file sql/cama_lottype.sql ❌ DELETE
# run_sql_file sql/cama_proxcode.sql ❌ DELETE
# run_sql_file sql/cama_easements.sql ❌ DELETE

# Line 67: Remove this
# run_sql_file sql/zoning_splitzone.sql ❌ DELETE

# Line 63: Remove this
# run_sql_file sql/zoning_specialdistrict.sql ❌ DELETE

# Line 71: Remove this
# run_sql_file sql/bldgclass.sql ❌ DELETE

# Line 87: DBT run now includes Wave 1 models
(cd .. && dbt run --select tag:pluto_enrichment pluto_enriched)

# Line 90: apply_dbt_enrichments.sql now includes Wave 1 fields
run_sql_file sql/apply_dbt_enrichments.sql
```

---

## Key Insight: The Current Pattern Is Actually Good!

The current hybrid approach (SQL UPDATEs + dbt enrichment + apply back) is working well because:

1. **Keeps complex spatial/imperative logic in SQL** where it's easier
2. **Moves declarative transformations to dbt** where they belong
3. **pluto_enriched acts as staging** - dbt builds it clean, SQL applies it
4. **Incremental migration path** - doesn't require rewriting everything

The goal isn't to eliminate the pluto table and UPDATEs entirely - it's to move the **business logic** to dbt while keeping the **table building mechanics** in SQL.

Loading
Loading