Production-like, end-to-end Data Engineering project: ingest -> warehouse -> transform -> test -> orchestrate -> serve.
Portfolio focus: reproducibility, data quality, schema-as-code, and deployable components.
- Idempotent ingestion into a local analytical warehouse
- Warehouse-first modeling with dbt (staging -> intermediate -> marts)
- Automated data quality (pytest + dbt test + data reconciliation tests)
- Orchestration-ready workflows (Prefect flows)
- API serving layer for curated datasets (FastAPI)
- Observability hooks (structured logs + basic metrics)
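The "idempotent ingestion" point above can be sketched as follows. This is illustrative only — `scripts/ingest_raw.py` may use a different scheme; the state file and the `load_csv` loader here are hypothetical. The pattern: skip files whose content hash is unchanged since the last run, and load each table with full replace rather than append, so reruns never duplicate rows.

```python
# One way to make CSV ingestion idempotent: hash-based change detection plus
# full-replace loads. Hypothetical sketch -- not the repo's actual code.
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("data/warehouse/.ingest_state.json")  # hypothetical state file

def file_digest(path: Path) -> str:
    """Content hash identifying a file version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def ingest(csv_paths: list[Path], load_csv, state_file: Path = STATE_FILE) -> list[str]:
    """Load only changed CSVs; return the names of tables (re)loaded."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    loaded = []
    for path in csv_paths:
        digest = file_digest(path)
        if state.get(path.name) == digest:
            continue  # unchanged since last run -> skip (rerun is a no-op)
        load_csv(path)  # e.g. CREATE OR REPLACE TABLE raw.<name> AS SELECT ...
        state[path.name] = digest
        loaded.append(path.stem)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state))
    return loaded
```

Full replace keeps each table's contents a pure function of the source files, which is what makes the downstream dbt run reproducible.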
CSV dataset -> Ingestion (Python) -> DuckDB (`WAREHOUSE_PATH`, default: `data/warehouse/ecommerce.duckdb`) -> dbt models -> FastAPI endpoints
Quality gates:
- pytest (pipeline/unit checks)
- dbt test (schema + business rules + reconciliation tests)
Orchestration (planned):
- Prefect flows schedule ingestion + dbt
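The planned flow can be sketched in plain Python. In the real Phase 3 implementation these steps would be Prefect `@task` functions wrapped in a `@flow` with `retries=` set; `subprocess` plus an explicit loop is used here only to make the control flow visible, and the step commands are taken from the run instructions below.

```python
# Plain-Python stand-in for the planned Prefect flow: ingest, then dbt build,
# each step retried on failure (Prefect provides this via task retries).
import subprocess
import time

def run_step(cmd: list[str], retries: int = 2, delay_s: float = 5.0) -> None:
    """Run one pipeline step, retrying on failure."""
    for attempt in range(retries + 1):
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise  # exhausted retries -> fail the flow
            time.sleep(delay_s)  # back off before retrying

def daily_pipeline() -> None:
    # Step order mirrors the architecture: ingest first, then transform + test.
    run_step(["python", "scripts/ingest_raw.py"])
    run_step(["dbt", "build", "--project-dir", "./dbt", "--profiles-dir", "./dbt"])
```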
| Layer | Technology | Why it is here |
|---|---|---|
| Language | Python 3.11+ | Pipelines, orchestration, API |
| Warehouse | DuckDB | Portable analytics warehouse |
| Transformations | dbt | SQL modeling + tests + docs |
| Orchestration | Prefect | Scheduling + retries |
| API | FastAPI | Serve curated data products |
| Testing | pytest | Unit/integration checks |
| CI | GitHub Actions | Automated checks on PRs |
| Phase | Status | Deliverable |
|---|---|---|
| 1 - Raw ingestion | Completed | Download + ingest + validation tests |
| 2 - dbt transformations | Completed | staging + intermediate + marts + facts + dbt tests + reconciliation |
| 3 - Orchestration | In progress | Prefect flows + schedules |
| 4 - API serving | In progress | FastAPI endpoints over marts |
| 5 - Observability | In progress | Logs + metrics + basic dashboards |
- Python 3.11+
- Git
- (Optional) DuckDB CLI
```bash
git clone https://github.com/Flames4fun/dataops-ecommerce-platform.git
cd dataops-ecommerce-platform

python -m venv .venv
# Linux/Mac
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1

pip install -r requirements.lock

# Optional: override warehouse path (default already works)
# Linux/Mac
export WAREHOUSE_PATH=data/warehouse/ecommerce.duckdb
# Windows (PowerShell)
$env:WAREHOUSE_PATH="data/warehouse/ecommerce.duckdb"
```

```bash
# 1) Download dataset into data/raw/
python scripts/download_dataset.py

# 2) Ingest raw CSVs into DuckDB
python scripts/ingest_raw.py

# 3) Validate ingestion
pytest -q
```

Expected:
- warehouse file created at `WAREHOUSE_PATH` (default: `data/warehouse/ecommerce.duckdb`)
- 9 tables under the `raw` schema
- tests pass
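The "9 tables" check can be sketched as a small pytest-style helper. The repo's actual tests in `tests/` may differ, and the table names below are assumed from the Olist dataset's nine CSVs, not taken from the code.

```python
# Sketch of an ingestion validation check (assumed table names).
EXPECTED_RAW_TABLES = {
    "customers", "geolocation", "orders", "order_items", "order_payments",
    "order_reviews", "products", "sellers",
    "product_category_name_translation",
}

def missing_raw_tables(actual_tables) -> set[str]:
    """Expected tables not present in the warehouse's raw schema.

    In a real test, `actual_tables` would be the names returned by
    conn.execute("SHOW TABLES FROM raw"); pytest would then assert the
    result is empty.
    """
    return EXPECTED_RAW_TABLES - set(actual_tables)
```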
```bash
# 1) Validate dbt configuration and warehouse connection
dbt debug --project-dir ./dbt --profiles-dir ./dbt

# 2) Run end-to-end dbt pipeline (models + tests)
dbt build --project-dir ./dbt --profiles-dir ./dbt

# 3) Run reconciliation tests only (counts/sums)
dbt test --project-dir ./dbt --profiles-dir ./dbt --select path:tests/test_reconcile_fact_orders_count.sql
dbt test --project-dir ./dbt --profiles-dir ./dbt --select path:tests/test_reconcile_fact_order_items_count.sql
dbt test --project-dir ./dbt --profiles-dir ./dbt --select path:tests/test_reconcile_fact_order_items_sums.sql

# 4) Generate and serve dbt documentation locally
dbt docs generate --project-dir ./dbt --profiles-dir ./dbt
dbt docs serve --project-dir ./dbt --profiles-dir ./dbt
```

Important:
- `dbt/target/` and `dbt/logs/` are intentionally gitignored.
- Generated docs artifacts (`manifest.json`, `catalog.json`, `index.html`, etc.) are local build outputs and are not committed.
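A dbt "singular" data test is a SQL file under `tests/` that fails when its query returns any rows. The reconciliation tests above follow that pattern; a hypothetical sketch of the orders count check (the repo's actual SQL may differ):

```sql
-- Hypothetical sketch in the shape of tests/test_reconcile_fact_orders_count.sql:
-- returns a row (and therefore fails) only when the fact table has dropped
-- or duplicated orders relative to the raw source.
with source_count as (
    select count(*) as n from {{ source('raw', 'orders') }}
),
fact_count as (
    select count(*) as n from {{ ref('fact_orders') }}
)
select s.n as source_rows, f.n as fact_rows
from source_count s
cross join fact_count f
where s.n != f.n
```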
| Model | Type | Grain |
|---|---|---|
| `dim_customers` | Dimension | 1 row per `customer_unique_id` |
| `dim_products` | Dimension | 1 row per `product_id` |
| `dim_sellers` | Dimension | 1 row per `seller_id` |
| `dim_dates` | Dimension | 1 row per `date_day` |
| `fact_orders` | Fact | 1 row per `order_id` |
| `fact_order_items` | Fact | 1 row per `order_item_key` (derived from `order_id` + `order_item_id`) |
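With these grains, analytical questions become simple joins between facts and dimensions. A sketch (the join keys follow the grains above, but the measure and attribute column names are assumptions based on the Olist source data, not taken from the repo):

```sql
-- Gross revenue by customer state (column names assumed, for illustration).
select
    c.customer_state,
    count(distinct f.order_id)     as orders,
    sum(i.price + i.freight_value) as gross_revenue
from fact_order_items i
join fact_orders f   on f.order_id = i.order_id
join dim_customers c on c.customer_unique_id = f.customer_unique_id
group by c.customer_state
order by gross_revenue desc;
```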
```bash
# Linux/Mac
duckdb "${WAREHOUSE_PATH:-data/warehouse/ecommerce.duckdb}"
# Windows (PowerShell)
duckdb $env:WAREHOUSE_PATH
```

```sql
SHOW TABLES FROM raw;

SELECT order_status, COUNT(*) AS n
FROM raw.orders
GROUP BY order_status
ORDER BY n DESC;
```

```python
from scripts.db_utils import get_connection

with get_connection() as conn:
    df = conn.execute(
        """
        SELECT order_status, COUNT(*) AS n
        FROM raw.orders
        GROUP BY order_status
        ORDER BY n DESC
        """
    ).fetchdf()

print(df)
```

```text
dataops-ecommerce-platform/
|-- api/            # FastAPI app (Phase 4)
|-- dbt/            # dbt project (Phase 2)
|   |-- models/
|   |   |-- staging/       # Source cleanup and standardization models
|   |   |-- intermediate/  # Reusable business logic models
|   |   `-- marts/         # Final dimensional/fact models for analytics
|   |-- tests/      # Custom data reconciliation tests (SQL)
|   |-- macros/     # Reusable SQL macros
|   `-- target/     # Generated artifacts (gitignored)
|-- docs/           # Documentation (data dictionary, decisions)
|-- pipelines/      # Prefect flows (Phase 3)
|-- scripts/        # Download + ingestion + helpers
|-- tests/          # Python unit/integration tests
`-- data/           # Local data (gitignored)
    |-- raw/        # Source CSVs (gitignored)
    `-- warehouse/  # DuckDB file: data/warehouse/ecommerce.duckdb (gitignored)
```
- Data directory: `data/README.md` (local layout + verification)
- Data dictionary: `docs/data_dictionary.md` (raw schema field-level docs)
- Business metrics: `docs/business_metrics.md` (GMV, AOV, cancel_rate, late_delivery_rate)
- dbt docs (local): run `dbt docs generate` and `dbt docs serve` inside `dbt/`
- Dataset: Brazilian E-Commerce Public Dataset by Olist
- Source: Kaggle (`olistbr/brazilian-ecommerce`)
- License: CC BY-NC-SA 4.0 (Non-Commercial). Do not use this dataset for commercial purposes; see the Kaggle dataset page for details.
This is a portfolio project. If you want to suggest improvements, open an issue.
MIT - see LICENSE.