Production-like, end-to-end Data Engineering project: ingest -> warehouse -> transform -> test -> orchestrate -> serve.
Portfolio focus: reproducibility, data quality, schema-as-code, and deployable components.
- Idempotent ingestion into a local analytical warehouse
- Warehouse-first modeling with dbt (staging -> intermediate -> marts)
- Automated data quality (pytest + dbt test + data reconciliation tests)
- Orchestration-ready workflows (Prefect flows)
- API serving layer for curated datasets (FastAPI)
- Observability hooks (structured logs + basic metrics)
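The "idempotent ingestion" point above can be sketched as follows. This is illustrative only — `scripts/ingest_raw.py` may use a different scheme; the state file and the `load_csv` loader here are hypothetical. The pattern: skip files whose content hash is unchanged since the last run, and load each table with full replace rather than append, so reruns never duplicate rows.

```python
# One way to make CSV ingestion idempotent: hash-based change detection plus
# full-replace loads. Hypothetical sketch -- not the repo's actual code.
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("data/warehouse/.ingest_state.json")  # hypothetical state file

def file_digest(path: Path) -> str:
    """Content hash identifying a file version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def ingest(csv_paths: list[Path], load_csv, state_file: Path = STATE_FILE) -> list[str]:
    """Load only changed CSVs; return the names of tables (re)loaded."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    loaded = []
    for path in csv_paths:
        digest = file_digest(path)
        if state.get(path.name) == digest:
            continue  # unchanged since last run -> skip (rerun is a no-op)
        load_csv(path)  # e.g. CREATE OR REPLACE TABLE raw.<name> AS SELECT ...
        state[path.name] = digest
        loaded.append(path.stem)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state))
    return loaded
```

Full replace keeps each table's contents a pure function of the source files, which is what makes the downstream dbt run reproducible.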
CSV dataset -> Ingestion (Python) -> DuckDB (`WAREHOUSE_PATH`, default: `data/warehouse/ecommerce.duckdb`) -> dbt models -> FastAPI endpoints
Quality gates:
- pytest (pipeline/unit checks)
- dbt test (schema + business rules + reconciliation tests)
Orchestration (planned):
- Prefect flows schedule ingestion + dbt
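The planned flow can be sketched in plain Python. In the real Phase 3 implementation these steps would be Prefect `@task` functions wrapped in a `@flow` with `retries=` set; `subprocess` plus an explicit loop is used here only to make the control flow visible, and the step commands are taken from the run instructions below.

```python
# Plain-Python stand-in for the planned Prefect flow: ingest, then dbt build,
# each step retried on failure (Prefect provides this via task retries).
import subprocess
import time

def run_step(cmd: list[str], retries: int = 2, delay_s: float = 5.0) -> None:
    """Run one pipeline step, retrying on failure."""
    for attempt in range(retries + 1):
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise  # exhausted retries -> fail the flow
            time.sleep(delay_s)  # back off before retrying

def daily_pipeline() -> None:
    # Step order mirrors the architecture: ingest first, then transform + test.
    run_step(["python", "scripts/ingest_raw.py"])
    run_step(["dbt", "build", "--project-dir", "./dbt", "--profiles-dir", "./dbt"])
```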
| Layer | Technology | Why it is here |
|---|---|---|
| Language | Python 3.11+ | Pipelines, orchestration, API |
| Warehouse | DuckDB | Portable analytics warehouse |
| Transformations | dbt | SQL modeling + tests + docs |
| Orchestration | Prefect | Scheduling + retries |
| API | FastAPI | Serve curated data products |
| Testing | pytest | Unit/integration checks |
| CI | GitHub Actions | Automated checks on PRs |
| Phase | Status | Deliverable |
|---|---|---|
| 1 - Raw ingestion | Completed | Download + ingest + validation tests |
| 2 - dbt transformations | Completed | staging + intermediate + marts + facts + dbt tests + reconciliation |
| 3 - Orchestration | In progress | Prefect flows + schedules |
| 4 - API serving | In progress | FastAPI endpoints over marts |
| 5 - Observability | In progress | Logs + metrics + basic dashboards |
- Python 3.11+
- Git
- (Optional) DuckDB CLI
```bash
git clone https://github.com/Flames4fun/dataops-ecommerce-platform.git
cd dataops-ecommerce-platform

python -m venv .venv
# Linux/Mac
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1

pip install -r requirements.lock

# Optional: override warehouse path (default already works)
# Linux/Mac
export WAREHOUSE_PATH=data/warehouse/ecommerce.duckdb
# Windows (PowerShell)
$env:WAREHOUSE_PATH="data/warehouse/ecommerce.duckdb"
```

```bash
# 1) Download dataset into data/raw/
python scripts/download_dataset.py

# 2) Ingest raw CSVs into DuckDB
python scripts/ingest_raw.py

# 3) Validate ingestion
pytest -q
```

Expected:
- warehouse file created at `WAREHOUSE_PATH` (default: `data/warehouse/ecommerce.duckdb`)
- 9 tables under the `raw` schema
- tests pass
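The "9 tables" check can be sketched as a small pytest-style helper. The repo's actual tests in `tests/` may differ, and the table names below are assumed from the Olist dataset's nine CSVs, not taken from the code.

```python
# Sketch of an ingestion validation check (assumed table names).
EXPECTED_RAW_TABLES = {
    "customers", "geolocation", "orders", "order_items", "order_payments",
    "order_reviews", "products", "sellers",
    "product_category_name_translation",
}

def missing_raw_tables(actual_tables) -> set[str]:
    """Expected tables not present in the warehouse's raw schema.

    In a real test, `actual_tables` would be the names returned by
    conn.execute("SHOW TABLES FROM raw"); pytest would then assert the
    result is empty.
    """
    return EXPECTED_RAW_TABLES - set(actual_tables)
```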
```bash
# 1) Validate dbt configuration and warehouse connection
dbt debug --project-dir ./dbt --profiles-dir ./dbt

# 2) Run end-to-end dbt pipeline (models + tests)
dbt build --project-dir ./dbt --profiles-dir ./dbt

# 3) Run reconciliation tests only (counts/sums)
dbt test --project-dir ./dbt --profiles-dir ./dbt --select path:tests/test_reconcile_fact_orders_count.sql
dbt test --project-dir ./dbt --profiles-dir ./dbt --select path:tests/test_reconcile_fact_order_items_count.sql
dbt test --project-dir ./dbt --profiles-dir ./dbt --select path:tests/test_reconcile_fact_order_items_sums.sql

# 4) Generate and serve dbt documentation locally
dbt docs generate --project-dir ./dbt --profiles-dir ./dbt
dbt docs serve --project-dir ./dbt --profiles-dir ./dbt
```

Important:
- `dbt/target/` and `dbt/logs/` are intentionally gitignored.
- Generated docs artifacts (`manifest.json`, `catalog.json`, `index.html`, etc.) are local build outputs and are not committed.
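A dbt "singular" data test is a SQL file under `tests/` that fails when its query returns any rows. The reconciliation tests above follow that pattern; a hypothetical sketch of the orders count check (the repo's actual SQL may differ):

```sql
-- Hypothetical sketch in the shape of tests/test_reconcile_fact_orders_count.sql:
-- returns a row (and therefore fails) only when the fact table has dropped
-- or duplicated orders relative to the raw source.
with source_count as (
    select count(*) as n from {{ source('raw', 'orders') }}
),
fact_count as (
    select count(*) as n from {{ ref('fact_orders') }}
)
select s.n as source_rows, f.n as fact_rows
from source_count s
cross join fact_count f
where s.n != f.n
```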
| Model | Type | Grain |
|---|---|---|
| `dim_customers` | Dimension | 1 row per `customer_unique_id` |
| `dim_products` | Dimension | 1 row per `product_id` |
| `dim_sellers` | Dimension | 1 row per `seller_id` |
| `dim_dates` | Dimension | 1 row per `date_day` |
| `fact_orders` | Fact | 1 row per `order_id` |
| `fact_order_items` | Fact | 1 row per `order_item_key` (derived from `order_id` + `order_item_id`) |
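With these grains, analytical questions become simple joins between facts and dimensions. A sketch (the join keys follow the grains above, but the measure and attribute column names are assumptions based on the Olist source data, not taken from the repo):

```sql
-- Gross revenue by customer state (column names assumed, for illustration).
select
    c.customer_state,
    count(distinct f.order_id)     as orders,
    sum(i.price + i.freight_value) as gross_revenue
from fact_order_items i
join fact_orders f   on f.order_id = i.order_id
join dim_customers c on c.customer_unique_id = f.customer_unique_id
group by c.customer_state
order by gross_revenue desc;
```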
```bash
# Linux/Mac
duckdb "${WAREHOUSE_PATH:-data/warehouse/ecommerce.duckdb}"
# Windows (PowerShell)
duckdb $env:WAREHOUSE_PATH
```

```sql
SHOW TABLES FROM raw;

SELECT order_status, COUNT(*) AS n
FROM raw.orders
GROUP BY order_status
ORDER BY n DESC;
```

```python
from scripts.db_utils import get_connection

with get_connection() as conn:
    df = conn.execute(
        """
        SELECT order_status, COUNT(*) AS n
        FROM raw.orders
        GROUP BY order_status
        ORDER BY n DESC
        """
    ).fetchdf()

print(df)
```

```text
dataops-ecommerce-platform/
|-- api/            # FastAPI app (Phase 4)
|-- dbt/            # dbt project (Phase 2)
|   |-- models/
|   |   |-- staging/       # Source cleanup and standardization models
|   |   |-- intermediate/  # Reusable business logic models
|   |   `-- marts/         # Final dimensional/fact models for analytics
|   |-- tests/      # Custom data reconciliation tests (SQL)
|   |-- macros/     # Reusable SQL macros
|   `-- target/     # Generated artifacts (gitignored)
|-- docs/           # Documentation (data dictionary, decisions)
|-- pipelines/      # Prefect flows (Phase 3)
|-- scripts/        # Download + ingestion + helpers
|-- tests/          # Python unit/integration tests
`-- data/           # Local data (gitignored)
    |-- raw/        # Source CSVs (gitignored)
    `-- warehouse/  # DuckDB file: data/warehouse/ecommerce.duckdb (gitignored)
```
- Data directory: `data/README.md` (local layout + verification)
- Data dictionary: `docs/data_dictionary.md` (raw schema field-level docs)
- Business metrics: `docs/business_metrics.md` (GMV, AOV, cancel_rate, late_delivery_rate)
- dbt docs (local): run `dbt docs generate` and `dbt docs serve` inside `dbt/`
- Dataset: Brazilian E-Commerce Public Dataset by Olist
- Source: Kaggle (`olistbr/brazilian-ecommerce`)
- License: CC BY-NC-SA 4.0 (Non-Commercial). Do not use this dataset for commercial purposes; see the Kaggle dataset page for details.
This is a portfolio project. If you want to suggest improvements, open an issue.
MIT - see LICENSE.