A practical DuckDB lakehouse that shows how to build a dev/prod workflow using the medallion architecture (Bronze → Silver → Gold). No clusters or services — just DuckDB, Parquet, and SQL.
- Dev/Prod separation – Work safely in a development catalog while keeping production intact.
- Multiple data access patterns – Views over Parquet files, read‑through views, and physical table copies.
- Medallion layers – Bronze (raw), Silver (cleaned), Gold (aggregated).
- Zero infra – Single‑file databases, no servers, no auth.
The repo includes real SQL scripts, Python utilities, and shell helpers to run common data‑engineering tasks on a single machine.
duckdb-medallion/
├── README.md
├── python/
│ ├── duckdb_init.py # Initialize production warehouse
│ ├── duckdb_parquet_reader.py # Utility examples (Parquet + DuckDB)
│ └── duckdb_disaster_clone.py # Copy a DuckDB file (prod → dev/backup)
├── sql/
│ ├── dev_init_bronze_catalog.sql # Clone PROD bronze → DEV (editable tables)
│ ├── duckdb_views_prod_to_dev.sql # Read-through views in DEV → PROD
│ ├── duckdb_views_from_parquet.sql # Views over the Parquet lake
│ └── duckdb_clone_db_to_parquet.sql # Export warehouse tables to Parquet (DR snapshot)
└── scripts/
├── make_dirs.sh # Create lakehouse dir structure
└── init_env.sh # Export environment variables
Paths and names are examples; adjust to your system. Linux/macOS assumed.
- DuckDB CLI ≥ 0.10
- Python ≥ 3.10
- Python package:
duckdb
pip install duckdb
git clone https://github.com/yourusername/duckdb-medallion.git
cd duckdb-medallion
# Create lakehouse directories
bash scripts/make_dirs.sh
This yields a structure like:
/media/ares/data/db/
├── duck/
│ ├── warehouse/ # Production catalog
│ ├── dev/ # Development catalog
│ ├── backups/ # Database backups
│ └── tmp/ # Temp files
└── lake/
├── bronze/ # Raw Parquet files / landing area
├── silver/ # Cleaned data
└── gold/ # Aggregated data
scripts/make_dirs.sh (reference):
#!/usr/bin/env bash
set -euo pipefail
ROOT=/media/ares/data/db
mkdir -p "$ROOT/duck/warehouse" "$ROOT/duck/dev" "$ROOT/duck/backups" "$ROOT/duck/tmp"
mkdir -p "$ROOT/lake/bronze" "$ROOT/lake/silver" "$ROOT/lake/gold"
echo "Created base lakehouse directories under $ROOT"bash scripts/init_env.sh
source ~/.bashrc   # or: source ~/.zshrc
scripts/init_env.sh (reference):
#!/usr/bin/env bash
cat >> "$HOME/.bashrc" <<'EOF'
export DUCK_WH_DB="/media/ares/data/db/duck/warehouse/sports_ml_warehouse.duckdb"
export DUCK_DEV_DB="/media/ares/data/db/duck/dev/sports_ml_warehouse.duckdb"
export LAKE_ROOT="/media/ares/data/db/lake"
EOF
echo "Environment vars appended to ~/.bashrc: DUCK_WH_DB, DUCK_DEV_DB, LAKE_ROOT"python python/duckdb_init.pyThis script creates the warehouse database, sets up baseline schemas (e.g., bronze), and applies performance settings (threads, memory limits).
duckdb "$DUCK_WH_DB" -c "
PRAGMA database_list;
SELECT schema_name FROM information_schema.schemata ORDER BY 1;
"Create views that read directly from Parquet files. New files are picked up automatically.
duckdb "$DUCK_WH_DB" -init sql/duckdb_views_from_parquet.sqlsql/duckdb_views_from_parquet.sql
PRAGMA threads=8;
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE OR REPLACE VIEW bronze.ufc_fighters AS
SELECT * FROM read_parquet(
'/media/ares/data/db/lake/bronze/ufc/ufc_fighters/**/*.parquet',
hive_partitioning=true,
union_by_name=true
);
CREATE OR REPLACE VIEW bronze.ufc_fights AS
SELECT * FROM read_parquet(
'/media/ares/data/db/lake/bronze/ufc/ufc_fights/**/*.parquet',
hive_partitioning=true,
union_by_name=true
);
Query immediately:
duckdb "$DUCK_WH_DB" -c "
SELECT COUNT(*) FROM bronze.ufc_fighters;
SELECT * FROM bronze.ufc_fights LIMIT 5;
"Create DEV views that point at PROD tables. Read‑only and zero storage in DEV.
duckdb "$DUCK_DEV_DB" -init sql/duckdb_views_prod_to_dev.sqlsql/duckdb_views_prod_to_dev.sql
PRAGMA threads=8;
ATTACH '/media/ares/data/db/duck/warehouse/sports_ml_warehouse.duckdb' AS wh (READ_ONLY);
CREATE SCHEMA IF NOT EXISTS bronze_views;
CREATE OR REPLACE VIEW bronze_views.ufc_fighters AS SELECT * FROM wh.bronze.ufc_fighters;
CREATE OR REPLACE VIEW bronze_views.ufc_fights AS SELECT * FROM wh.bronze.ufc_fights;
CREATE OR REPLACE VIEW bronze_views.one_athletes AS SELECT * FROM wh.bronze.one_athletes;
CREATE OR REPLACE VIEW bronze_views.one_fights AS SELECT * FROM wh.bronze.one_fights;
Check what's available:
duckdb "$DUCK_DEV_DB" -c "
SELECT table_schema, table_name, 'VIEW' AS table_type
FROM information_schema.views
WHERE table_schema = 'bronze_views'
ORDER BY 1,2;
SELECT COUNT(*) FROM bronze_views.ufc_fighters;
"Clone PROD tables into DEV as materialized, editable copies. Great for experiment‑heavy work.
duckdb "$DUCK_DEV_DB" -init sql/dev_init_bronze_catalog.sqlsql/dev_init_bronze_catalog.sql
PRAGMA threads=8;
ATTACH '/media/ares/data/db/duck/warehouse/sports_ml_warehouse.duckdb' AS wh (READ_ONLY);
CREATE SCHEMA IF NOT EXISTS bronze;
-- Clone with basic deduplication
DROP TABLE IF EXISTS bronze.ufc_fighters;
CREATE TABLE bronze.ufc_fighters AS
SELECT DISTINCT * FROM wh.bronze.ufc_fighters;
CREATE UNIQUE INDEX IF NOT EXISTS pk_ufc_fighters ON bronze.ufc_fighters(fighter_id);
DROP TABLE IF EXISTS bronze.ufc_fights;
CREATE TABLE bronze.ufc_fights AS
SELECT DISTINCT * FROM wh.bronze.ufc_fights;
CREATE UNIQUE INDEX IF NOT EXISTS pk_ufc_fights ON bronze.ufc_fights
(fighter_id, opponent, event, round, time, method, result);
DROP TABLE IF EXISTS bronze.one_athletes;
CREATE TABLE bronze.one_athletes AS
SELECT DISTINCT * FROM wh.bronze.one_athletes;
DROP TABLE IF EXISTS bronze.one_fights;
CREATE TABLE bronze.one_fights AS
SELECT DISTINCT * FROM wh.bronze.one_fights;
CHECKPOINT;
Now you have editable tables:
duckdb "$DUCK_DEV_DB" -c "
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_schema='bronze'
ORDER BY 1,2;
-- Example edit (DEV only)
UPDATE bronze.ufc_fighters SET weight_class = 'Lightweight' WHERE fighter_id = 123;
"-- In DEV
ATTACH '/media/ares/data/db/duck/warehouse/sports_ml_warehouse.duckdb' AS wh (READ_ONLY);
-- Compare row counts
SELECT 'DEV' AS env, COUNT(*) FROM bronze.ufc_fights
UNION ALL
SELECT 'PROD', COUNT(*) FROM wh.bronze.ufc_fights;
-- Build a trial Silver table in DEV
CREATE SCHEMA IF NOT EXISTS silver;
CREATE OR REPLACE TABLE silver.fight_stats AS
SELECT
fight_id,
event_date,
fighter_a,
fighter_b,
winner,
CAST(round AS INTEGER) AS round_num
FROM bronze.ufc_fights
WHERE event_date >= '2020-01-01';
# Re-run the clone script to sync DEV with current PROD
duckdb "$DUCK_DEV_DB" -init sql/dev_init_bronze_catalog.sql
# Verify
duckdb "$DUCK_DEV_DB" -c "
SELECT COUNT(*) AS fighters FROM bronze.ufc_fighters;
SELECT COUNT(*) AS fights FROM bronze.ufc_fights;
"duckdb -c "
SELECT *
FROM read_parquet('/media/ares/data/db/lake/bronze/ufc/ufc_fighters/*.parquet')
LIMIT 10;
"python/duckdb_parquet_reader.py
import duckdb
con = duckdb.connect('/media/ares/data/db/duck/warehouse/sports_ml_warehouse.duckdb')
fighters = con.execute(
"""
SELECT fighter_name, weight_class, wins, losses
FROM bronze.ufc_fighters
WHERE wins > 10
ORDER BY wins DESC
LIMIT 20
"""
).df()
print(fighters)
parquet_data = con.execute(
"""
SELECT * FROM read_parquet('/media/ares/data/db/lake/bronze/ufc/ufc_fighters/*.parquet')
"""
).df()
Shell (simple copy)
# Copy prod → backup (DuckDB is a single file)
cp "$DUCK_WH_DB" "/media/ares/data/db/duck/backups/warehouse_$(date +%Y%m%d).duckdb"**Copy From DB to Parquet duckdb_clone_db_to_parquet.sql reference ** export SNAPSHOT_DIR="/media/user/data/db/lake/disaster_recovery/dr_ml_sports_wh_medallion/$(date +%Y%m%d_%H%M%S)"
envsubst < sql/duckdb_clone_db_to_parquet.sql \
  | duckdb -batch ":memory:"
python python/duckdb_disaster_clone.py \
  /mnt/data/db/duck/warehouse/sports_ml_warehouse.duckdb \
  /mnt/data/db/duck/warehouse/dr/dr_sports_ml_wh.duckdb
python python/duckdb_disaster_clone.py \
  /mnt/data/db/duck/warehouse/sports_ml_warehouse.duckdb \
  /mnt/data/db/duck/warehouse/dr/
python python/duckdb_disaster_clone.py \
  /mnt/data/db/duck/warehouse/sports_ml_warehouse.duckdb \
  /mnt/data/db/duck/warehouse/dr/dr_sports_ml_wh.duckdb --force
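The directory destination and the --force flag above assume the clone helper derives a target filename and refuses to overwrite an existing clone unless forced. A hedged sketch of that behavior (the argument handling and the derived filename are assumptions, not necessarily what the repo script does):
# Sketch of a clone helper that accepts a file or directory destination
# plus an optional --force flag. The real duckdb_disaster_clone.py may be simpler.
import os
import shutil
import sys

def clone(src: str, dst: str, force: bool = False) -> str:
    if os.path.isdir(dst):                                  # directory target: derive a file name
        dst = os.path.join(dst, "dr_" + os.path.basename(src))
    if os.path.exists(dst) and not force:
        raise SystemExit(f"{dst} exists; pass --force to overwrite")
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    shutil.copy2(src, dst)                                  # a DuckDB database is one file
    return dst

if __name__ == "__main__":
    args = [a for a in sys.argv[1:] if a != "--force"]
    out = clone(args[0], args[1], force="--force" in sys.argv)
    print(f"Copied {args[0]} -> {out}")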
Python (prod → dev)
python python/duckdb_disaster_clone.py \
/media/ares/data/db/duck/warehouse/sports_ml_warehouse.duckdb \
/media/ares/data/db/duck/dev/sports_ml_warehouse.duckdb
python/duckdb_disaster_clone.py (reference):
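# Minimal reference: a DuckDB database is a single file, so copying it clones prod → dev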
import shutil, sys
src = sys.argv[1]
dst = sys.argv[2]
shutil.copy2(src, dst)
print(f"Copied {src} → {dst}")Parquet files as ingested from source systems. Immutable, append‑only.
$LAKE_ROOT/bronze/
├── ufc/
│ ├── ufc_fighters/
│ │ └── 2025-10-27.parquet
│ └── ufc_fights/
│ └── 2025-10-27.parquet
└── one_championship/
├── one_athletes.parquet
└── one_fights.parquet
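Given the layout above, landing a new daily extract is just writing another dated Parquet file; because the Bronze views glob the directory, the new rows appear on the next query. A minimal sketch using only the duckdb Python package (the inline rows and column names are illustrative):
# Sketch: write a dated Parquet file into the Bronze landing area, then confirm
# the warehouse view over that path sees it. Example rows and columns are made up.
import datetime
import os
import duckdb

lake_root = os.environ["LAKE_ROOT"]
target_dir = os.path.join(lake_root, "bronze", "ufc", "ufc_fighters")
os.makedirs(target_dir, exist_ok=True)
out_file = os.path.join(target_dir, f"{datetime.date.today()}.parquet")

con = duckdb.connect(os.environ["DUCK_WH_DB"])
con.execute(f"""
    COPY (
        SELECT 9001 AS fighter_id, 'Example One' AS fighter_name, 'Lightweight' AS weight_class
        UNION ALL
        SELECT 9002, 'Example Two', 'Welterweight'
    ) TO '{out_file}' (FORMAT PARQUET)
""")
print(con.execute("SELECT COUNT(*) FROM bronze.ufc_fighters").fetchone())  # includes the new file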
CREATE SCHEMA IF NOT EXISTS silver;
CREATE OR REPLACE TABLE silver.fight_stats AS
SELECT
fight_id,
event_date,
fighter_a_id,
fighter_b_id,
winner_id,
method,
CAST(round AS INTEGER) AS round_num
FROM bronze.ufc_fights
WHERE fight_id IS NOT NULL
AND method IS NOT NULL
AND event_date >= '2020-01-01';
CREATE SCHEMA IF NOT EXISTS gold;
CREATE OR REPLACE TABLE gold.fighter_win_rates AS
WITH per_fighter AS (
  SELECT fight_id, winner_id, fighter_a_id AS fighter_id FROM silver.fight_stats
  UNION ALL
  SELECT fight_id, winner_id, fighter_b_id AS fighter_id FROM silver.fight_stats
)
SELECT
  fighter_id,
  COUNT(*) AS total_fights,
  SUM(CASE WHEN winner_id = fighter_id THEN 1 ELSE 0 END) AS wins,
  ROUND(100.0 * wins / total_fights, 2) AS win_rate_pct
FROM per_fighter
GROUP BY fighter_id
HAVING total_fights >= 5
ORDER BY win_rate_pct DESC;
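Gold tables are what downstream consumers (notebooks, dashboards, ML feature pipelines) read. A small hedged example of pulling the aggregate into pandas via the duckdb Python API (assumes pandas is installed):
# Sketch: read a Gold aggregate into pandas for downstream work (e.g., features).
import os
import duckdb

con = duckdb.connect(os.environ["DUCK_WH_DB"], read_only=True)  # readers don't need the write lock
win_rates = con.execute(
    "SELECT * FROM gold.fighter_win_rates ORDER BY win_rate_pct DESC LIMIT 20"
).df()
print(win_rates)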
-- List all schemas
SELECT schema_name FROM information_schema.schemata ORDER BY 1;
-- List all tables and views
SELECT table_schema, table_name, table_type
FROM information_schema.tables
ORDER BY 1,2;
-- Show table definition
DESCRIBE bronze.ufc_fighters;
-- Check attached databases
PRAGMA database_list;
SET memory_limit='4GB';
SET threads=8;
VACUUM;
ANALYZE;
CHECKPOINT;
# CSV
duckdb "$DUCK_WH_DB" -c "
COPY (SELECT * FROM bronze.ufc_fighters)
TO '/tmp/fighters.csv' (HEADER, DELIMITER ',');
"
# Parquet
duckdb "$DUCK_WH_DB" -c "
COPY (SELECT * FROM gold.fighter_win_rates)
TO '/tmp/win_rates.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);
""Conflicting lock is held"
# Close all DuckDB sessions, then check/remove locks (if safe)
ls -la /media/ares/data/db/duck/warehouse/*.lock
rm -f /media/ares/data/db/duck/warehouse/*.lock
Views show 0 rows
PRAGMA database_list; -- verify ATTACH path
SELECT * FROM wh.bronze.ufc_fighters LIMIT 1;
Parquet views empty
ls -lh /media/ares/data/db/lake/bronze/ufc/ufc_fighters/
duckdb -c "SELECT COUNT(*) FROM read_parquet('/media/.../ufc_fighters.parquet');"Local development
- Test with production‑sized datasets without cloud costs.
- Experiment freely in DEV without touching PROD.
- Instant feedback (DuckDB queries are fast on a single machine).
Small/medium production
- No infra to manage; scales to 100GB+ on one box.
- Simple backups (copy a file).
- Version control your SQL.
Learning data engineering
- See medallion architecture in practice.
- Understand dev/prod separation.
- Learn SQL optimization and modeling patterns.
DuckDB is an embedded analytical database (SQLite for analytics), columnar and vectorized, MIT‑licensed, great for OLAP/ETL and data science.
DuckDB isn’t an OLTP database, a distributed system, or a real‑time streaming engine.
- ML pipelines (feature engineering from Silver/Gold)
- dbt integration (declarative transforms + tests)
- Data quality checks (Great Expectations)
- Scheduling (cron/Airflow)
- Docker packaging
- Optional serverless endpoints (e.g., Azure Functions)
MIT — use however you like.
- Replace example paths with your own.
- Keep DUCK_WH_DB, DUCK_DEV_DB, and LAKE_ROOT in sync with your directory layout.
- If you’re on Windows, use WSL or adjust paths accordingly.