5 changes: 3 additions & 2 deletions .env.example
@@ -1,7 +1,8 @@
ENVIRONMENT=local

# DuckDB (MVP)
DUCKDB_PATH=./data/warehouse/ecommerce.duckdb
# Warehouse
WAREHOUSE_PATH=data/warehouse/ecommerce.duckdb
WAREHOUSE_TEST_PATH=data/warehouse/ecommerce_test.duckdb

# API
API_HOST=0.0.0.0
62 changes: 62 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,62 @@
name: ci

on:
pull_request:
push:
branches:
- main

jobs:
validate:
runs-on: ubuntu-latest
env:
WAREHOUSE_PATH: data/warehouse/ecommerce.duckdb
DBT_PROFILES_DIR: .ci_dbt_profiles

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: pip
cache-dependency-path: requirements.lock

- name: Install dependencies
run: pip install -r requirements.lock

- name: Download raw dataset
run: python scripts/download_dataset.py

- name: Ingest raw data into DuckDB
run: python scripts/ingest_raw.py

- name: Run pytest
run: pytest -q

- name: Create dbt profile for CI
run: |
mkdir -p "${DBT_PROFILES_DIR}"
cat > "${DBT_PROFILES_DIR}/profiles.yml" <<'YAML'
dataops_ecommerce:
target: dev
outputs:
dev:
type: duckdb
path: "{{ env_var('WAREHOUSE_PATH', 'data/warehouse/ecommerce.duckdb') }}"
schema: main
threads: 4
extensions:
- parquet
- json
settings:
memory_limit: "4GB"
YAML

- name: Install dbt packages
run: dbt deps --project-dir ./dbt --profiles-dir "${DBT_PROFILES_DIR}"

- name: Run dbt build
run: dbt build --project-dir ./dbt --profiles-dir "${DBT_PROFILES_DIR}"
3 changes: 3 additions & 0 deletions .gitignore
@@ -81,3 +81,6 @@ dbt/.user.yml

# OS
.DS_Store

# Personal planning notes (not part of project deliverables)
docs/cv_phase2_plan_upgrade.md
19 changes: 15 additions & 4 deletions README.md
@@ -19,7 +19,7 @@ Production-like, end-to-end **Data Engineering** project: ingest -> warehouse ->

## Architecture

**CSV dataset** -> **Ingestion (Python)** -> **DuckDB (`data/warehouse/ecommerce.duckdb`)** -> **dbt models** -> **FastAPI endpoints**
**CSV dataset** -> **Ingestion (Python)** -> **DuckDB (`WAREHOUSE_PATH`, default: `data/warehouse/ecommerce.duckdb`)** -> **dbt models** -> **FastAPI endpoints**

Quality gates:
- `pytest` (pipeline/unit checks)
@@ -75,7 +75,13 @@ source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1

pip install -r requirements.txt
pip install -r requirements.lock

# Optional: override warehouse path (default already works)
# Linux/Mac
export WAREHOUSE_PATH=data/warehouse/ecommerce.duckdb
# Windows (PowerShell)
$env:WAREHOUSE_PATH="data/warehouse/ecommerce.duckdb"
```

### Run Phase 1 (download -> ingest -> validate)
@@ -92,7 +98,7 @@ pytest -q
```

Expected:
- `data/warehouse/ecommerce.duckdb` created
- warehouse file created at `WAREHOUSE_PATH` (default: `data/warehouse/ecommerce.duckdb`)
- 9 tables under the `raw` schema
- tests pass

@@ -137,7 +143,11 @@ Important:
### DuckDB CLI

```bash
duckdb data/warehouse/ecommerce.duckdb
# Linux/Mac
duckdb "${WAREHOUSE_PATH:-data/warehouse/ecommerce.duckdb}"

# Windows (PowerShell)
duckdb $env:WAREHOUSE_PATH
```

```sql
@@ -197,6 +207,7 @@ dataops-ecommerce-platform/

- **Data directory**: `data/README.md` (local layout + verification)
- **Data dictionary**: `docs/data_dictionary.md` (raw schema field-level docs)
- **Business metrics**: `docs/business_metrics.md` (GMV, AOV, cancel_rate, late_delivery_rate)
- **dbt docs (local)**: run `dbt docs generate` and `dbt docs serve` inside `dbt/`

---
4 changes: 2 additions & 2 deletions data/README.md
@@ -5,7 +5,7 @@ This directory documents the project's local data layout.
## Structure

- `data/raw/`: source CSV files downloaded from Kaggle (excluded from Git via `.gitignore`)
- `data/warehouse/ecommerce.duckdb`: local DuckDB analytical warehouse file (excluded from Git via `.gitignore`)
- `WAREHOUSE_PATH` (default: `data/warehouse/ecommerce.duckdb`): local DuckDB analytical warehouse file (excluded from Git via `.gitignore`)
- Raw files are not tracked because they are:
- Large (1.5M+ rows combined)
- Reproducible (can be re-downloaded)
@@ -20,7 +20,7 @@ This directory documents the project's local data layout.

## Local verification (recommended)

After ingestion, validate the date range directly from the warehouse (`data/warehouse/ecommerce.duckdb`):
After ingestion, validate the date range directly from the warehouse file configured by `WAREHOUSE_PATH` (default: `data/warehouse/ecommerce.duckdb`):

```sql
SELECT
95 changes: 95 additions & 0 deletions docs/business_metrics.md
@@ -0,0 +1,95 @@
# Business Metrics Definitions

This document defines the core business metrics used in the project for consistent KPI tracking.

## Scope

- Primary fact tables: `marts.fact_orders`, `marts.fact_order_items`
- Default time grain: `order_purchase_date` (daily)
- Currency: BRL

## Metrics

### GMV

- Definition: Gross Merchandise Value of sold items.
- Formula: `SUM(item_total)`
- Grain: Daily (`order_purchase_date`), aggregable to week/month.
- Filters:
- Include only orders with `is_canceled = false`.
- Recommended for "realized GMV": also filter `is_delivered = true`.

SQL reference:

```sql
select
fo.order_purchase_date,
sum(foi.item_total) as gmv
from marts.fact_order_items foi
join marts.fact_orders fo on fo.order_id = foi.order_id
where fo.is_canceled = false
group by 1;
```

### AOV

- Definition: Average Order Value.
- Formula: `GMV / COUNT(DISTINCT order_id)`
- Grain: Daily (`order_purchase_date`), aggregable to week/month.
- Filters:
- Same population as GMV (recommended: non-canceled orders).

SQL reference:

```sql
select
fo.order_purchase_date,
sum(foi.item_total) / nullif(count(distinct fo.order_id), 0) as aov
from marts.fact_order_items foi
join marts.fact_orders fo on fo.order_id = foi.order_id
where fo.is_canceled = false
group by 1;
```

### cancel_rate

- Definition: Share of canceled orders over total orders.
- Formula: `COUNT_IF(is_canceled) / COUNT(order_id)`
- Grain: Daily (`order_purchase_date`), aggregable to week/month.
- Filters:
- Include all orders in denominator.

SQL reference:

```sql
select
order_purchase_date,
avg(case when is_canceled then 1.0 else 0.0 end) as cancel_rate
from marts.fact_orders
group by 1;
```

### late_delivery_rate

- Definition: Share of delivered orders that arrived after the estimated delivery date.
- Formula: `COUNT_IF(is_late_delivery) / COUNT_IF(is_delivered)`
- Grain: Daily (`order_purchase_date`), aggregable to week/month.
- Filters:
- Denominator should include delivered orders only.
- Exclude canceled/non-delivered orders from denominator.

SQL reference:

```sql
select
order_purchase_date,
sum(case when is_late_delivery and is_delivered then 1 else 0 end) * 1.0
/ nullif(sum(case when is_delivered then 1 else 0 end), 0) as late_delivery_rate
from marts.fact_orders
group by 1;
```

## Notes

- Keep metric filters identical across dashboards and API endpoints.
- For monthly reporting, aggregate from daily grain rather than recalculating from mixed granularities.
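The reason for the last note is that ratio metrics such as AOV are non-additive: a monthly value must be recomputed from monthly totals, not averaged from daily values. A minimal illustration with made-up numbers (all figures hypothetical, not from the dataset):

```python
# Hypothetical daily rows for one month: (day_label, gmv, distinct_orders).
daily = [
    ("2018-01-01", 100.0, 10),  # daily AOV = 10.0
    ("2018-01-02", 300.0, 5),   # daily AOV = 60.0
]

# Additive metric: monthly GMV is simply the sum of daily GMV.
monthly_gmv = sum(gmv for _, gmv, _ in daily)          # 400.0

# Ratio metric: recompute from monthly numerator / denominator.
monthly_orders = sum(n for _, _, n in daily)           # 15
monthly_aov = monthly_gmv / monthly_orders             # 400 / 15 ≈ 26.67

# Naive mean of daily AOVs gives a different (wrong) answer.
naive_aov = (100.0 / 10 + 300.0 / 5) / 2               # (10 + 60) / 2 = 35.0
```

The gap between `monthly_aov` and `naive_aov` grows with day-to-day variation in order counts, which is why dashboards and API endpoints should share the daily grain as the single source.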
2 changes: 1 addition & 1 deletion docs/data_dictionary.md
@@ -1,6 +1,6 @@
# Data Dictionary — Raw Layer (`raw` schema)

Field-level documentation for the **raw ingestion layer** stored in `warehouse/ecommerce.duckdb` under the `raw` schema.
Field-level documentation for the **raw ingestion layer** stored in the DuckDB warehouse configured by `WAREHOUSE_PATH` (default: `data/warehouse/ecommerce.duckdb`) under the `raw` schema.

## Scope

20 changes: 20 additions & 0 deletions pyproject.toml
@@ -0,0 +1,20 @@
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "dataops-ecommerce-platform"
version = "0.2.0"
description = "DataOps e-commerce ELT pipeline with DuckDB + dbt."
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"duckdb>=1.4,<2.0",
"requests>=2.32,<3.0",
"pytest>=9.0,<10.0",
"dbt-core>=1.8,<1.9",
"dbt-duckdb>=1.8,<1.9",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
Binary file added requirements.lock
Binary file not shown.
5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
duckdb==1.4.4
requests==2.32.5
pytest==9.0.2
dbt-core==1.8.7
dbt-duckdb==1.8.2
38 changes: 29 additions & 9 deletions scripts/db_utils.py
@@ -1,23 +1,43 @@
from contextlib import contextmanager
import os
from pathlib import Path
from typing import Generator
from contextlib import contextmanager

import duckdb

DB_PATH = Path("warehouse") / "ecommerce.duckdb"
PROJECT_ROOT = Path(__file__).resolve().parent.parent
DEFAULT_WAREHOUSE_PATH = Path("data") / "warehouse" / "ecommerce.duckdb"


def resolve_warehouse_path() -> Path:
"""
Resolve DuckDB warehouse path from environment variables.

Priority:
1. WAREHOUSE_PATH
2. DUCKDB_PATH (legacy fallback)
3. data/warehouse/ecommerce.duckdb
"""
configured = os.getenv("WAREHOUSE_PATH") or os.getenv("DUCKDB_PATH")
path = Path(configured) if configured else DEFAULT_WAREHOUSE_PATH
if not path.is_absolute():
path = PROJECT_ROOT / path
return path


def init_warehouse() -> None:
"""Asegura que exista el directorio del warehouse."""
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
def init_warehouse(db_path: Path | None = None) -> Path:
"""Ensure warehouse directory exists and return resolved path."""
resolved_path = db_path or resolve_warehouse_path()
resolved_path.parent.mkdir(parents=True, exist_ok=True)
return resolved_path


@contextmanager
def get_connection() -> Generator[duckdb.DuckDBPyConnection, None, None]:
"""Context manager para DuckDB."""
init_warehouse() # asegura carpeta antes de conectar
conn = duckdb.connect(str(DB_PATH))
"""Context manager for DuckDB connections."""
db_path = init_warehouse()
conn = duckdb.connect(str(db_path))
try:
yield conn
finally:
conn.close()
conn.close()
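The precedence implemented by `resolve_warehouse_path` above can be exercised in isolation. A minimal stand-alone sketch of the same resolution order (`WAREHOUSE_PATH`, then the legacy `DUCKDB_PATH`, then the default); `resolve` is an illustrative name and the function takes an explicit mapping instead of reading `os.environ`, so it touches no real environment or filesystem:

```python
from pathlib import Path

DEFAULT = Path("data") / "warehouse" / "ecommerce.duckdb"

def resolve(env: dict) -> Path:
    """Mirror the resolution order: WAREHOUSE_PATH > DUCKDB_PATH > default."""
    configured = env.get("WAREHOUSE_PATH") or env.get("DUCKDB_PATH")
    return Path(configured) if configured else DEFAULT

# Legacy variable is still honored when the new one is absent.
print(resolve({"DUCKDB_PATH": "legacy.duckdb"}))  # legacy.duckdb
```

This makes the backward-compatibility behavior easy to unit-test without creating a DuckDB file.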