**Milestone 1 (Weeks 1-2): Environment Setup & Pipeline Design**

Configured Airflow Docker environment (`docker-compose.yaml`, `config/airflow.cfg`), defined ETL architecture, set up extraction scripts in `dags/`.

**Milestone 2 (Weeks 3-4): Data Cleaning & Transformation**

Implemented pandas-based cleaning rules in `data_preprocessing.py`, transformations for `salesorder.csv` → `clean_salesorder.csv`, pipeline testing via multiple DAG runs.

**Milestone 3 (Weeks 5-6): Orchestration & Monitoring**

Added Airflow DAGs (`my_etl_dag.py`, `monitoring_dag.py`) with scheduling, logging (`logs/`), Gmail alerts (EmailOperator).

**Milestone 4 (Weeks 7-8): Dashboards & Deployment**

Integrated dashboard framework (Streamlit/Dash ready), tested on production-scale datasets, finalized deployable framework.

This project provides a complete, Dockerized Apache Airflow environment for orchestrating ETL (Extract, Transform, Load) workflows on sales order data. It features a production-ready setup with persistent volumes for DAGs, logs, data, and Airflow metadata.
Key Features:
- Fully containerized Airflow stack (webserver, scheduler, worker)
- Custom ETL DAGs for sales order data processing
- Persistent PostgreSQL metadata database
- Sample data pipeline: `salesorder.csv` → `clean_salesorder.csv`
- Scheduled and manual DAG execution with historical logs
- Configurable via `docker-compose.yaml` and `airflow.cfg`

```
airflow-docker/
├── docker-compose.yaml        # Docker Compose configuration
├── config/
│   └── airflow.cfg            # Airflow configuration
├── dags/                      # Airflow DAGs (Python files)
│   ├── my_etl_dag.py          # Main ETL DAG
│   └── data_preprocessing.py  # Data preprocessing utilities
├── data/                      # Input/output data files
│   ├── salesorder.csv         # Raw sales order data (input)
│   └── clean_salesorder.csv   # Processed sales order data (output)
├── logs/                      # Airflow logs (persistent)
│   ├── dag_id=etl/            # Legacy ETL DAG logs
│   └── dag_id=etl_dag/        # Current ETL DAG logs
├── plugins/                   # Custom Airflow plugins (if needed)
└── README.md                  # This file
```

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Local Data      │    │ Airflow UI       │    │ PostgreSQL DB   │
│ (salesorder.csv)│───►│ (localhost:8080) │───►│ (Metadata/Log)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                           ┌────▼────┐
                           │  DAGs   │  extract_data ──► transform_data ──► load_data
                           │ (dags/) │
                           └─────────┘
                                │
                           ┌────▼────┐
                           │  Logs   │  (persistent volume)
                           └─────────┘
```

- Docker & Docker Compose
- 4GB+ RAM (recommended for Airflow + Postgres)
- Windows 11 / Linux / macOS

1. Clone/Navigate to project:
   ```bash
   cd c:/Users/ADMIN/airflow-docker
   ```
2. Start the stack:
   ```bash
   docker-compose up -d
   ```
3. Initialize Airflow database (first run only):
   ```bash
   docker-compose exec airflow-worker airflow db init
   ```
4. Create Airflow admin user:
   ```bash
   docker-compose exec airflow-worker airflow users create \
     --username admin \
     --firstname Admin \
     --lastname User \
     --role Admin \
     --email admin@example.com \
     --password admin
   ```
5. Access Airflow UI: Open http://localhost:8080 (admin/admin)
6. Enable & Trigger DAGs:
   - Navigate to `etl_dag` in UI
   - Toggle ON and trigger manually or wait for schedule

### ETL DAG (`etl_dag`)
- **Schedule:** Daily (`@daily`)
- **Tasks:**
  1. **`wait_for_data`**: FileSensor waits for `salesorder.csv`
  2. **`extract_data`**: Reads raw CSV (retries=3)
  3. **`transform_data`**: Cleans/normalizes data (retries=3, structured logging)
  4. **`load_data`**: Writes cleaned CSV
  5. **`send_failure_email`**: Gmail alert on transform failure

Features Added:
- Retry logic with 5min delay
- Gmail notifications (SMTP configured)
- File existence check
- Structured logging
- Tags: ['etl', 'sales']
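
The retry and tagging settings listed above are typically passed to Airflow as a `default_args` dictionary plus a few DAG-level keyword arguments. The sketch below only builds those dictionaries with the standard library, so it runs without Airflow installed. `retries` and `retry_delay` are standard operator arguments; the constant names `DEFAULT_ARGS` and `DAG_SETTINGS` are hypothetical, and the name of the schedule argument depends on the Airflow version (`schedule_interval` in older 2.x releases, `schedule` from 2.4 onward).

```python
from datetime import timedelta

# Task-level defaults that Airflow applies to every task in the DAG.
DEFAULT_ARGS = {
    "retries": 3,                         # retries=3, as listed for extract/transform
    "retry_delay": timedelta(minutes=5),  # the 5-minute delay between attempts
}

# DAG-level settings taken from this README (constant name is hypothetical).
DAG_SETTINGS = {
    "dag_id": "etl_dag",
    "schedule": "@daily",        # called schedule_interval on older Airflow versions
    "tags": ["etl", "sales"],
}
```

In a DAG file these would be unpacked into the constructor, for example `DAG(default_args=DEFAULT_ARGS, **DAG_SETTINGS)`, assuming an Airflow version that accepts `schedule`.
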
Data Flow:

```
salesorder.csv ──[wait]──[extract]──► [transform] ──► clean_salesorder.csv [load]
                                           │ (fail)
                                      Email Alert
```
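
The extract, transform, and load steps in this flow can be sketched as three plain functions of the kind a `PythonOperator` would wrap. This is an illustration only: the project's real rules live in `data_preprocessing.py` and use pandas, whereas this sketch uses the standard `csv` module so it runs anywhere. The cleaning rules shown (trim whitespace, drop blank rows, drop exact duplicates) and the function signatures are assumptions, not the project's actual logic.

```python
import csv
from pathlib import Path


def extract_data(src: Path) -> list[dict]:
    """Read the raw CSV into a list of row dictionaries."""
    with src.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def transform_data(rows: list[dict]) -> list[dict]:
    """Apply assumed cleaning rules: trim, drop blank rows, de-duplicate."""
    cleaned, seen = [], set()
    for row in rows:
        # Ignore overflow columns (key None) and treat missing values as "".
        trimmed = {k.strip(): (v or "").strip()
                   for k, v in row.items() if k is not None}
        if not any(trimmed.values()):
            continue  # skip rows where every field is empty
        key = tuple(trimmed.items())
        if key in seen:
            continue  # skip exact duplicates of an earlier row
        seen.add(key)
        cleaned.append(trimmed)
    return cleaned


def load_data(rows: list[dict], dst: Path) -> int:
    """Write cleaned rows to the output CSV and return the row count."""
    if not rows:
        raise ValueError("no rows to write")
    with dst.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Chained together, `load_data(transform_data(extract_data(Path("data/salesorder.csv"))), Path("data/clean_salesorder.csv"))` mirrors the flow above; inside the containers the `data/` directory may be mounted at a different path.
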
### Monitoring DAG (`monitoring_dag`)
- **Schedule:** Weekly (Monday 9AM)
- **Tasks:**
1. **`health_check`**: Calculates ETL success rate (last 7 days)
2. **`send_alert`**: Email if success rate <95%
**Alert Threshold:** >5% failure rate triggers Gmail notification
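
The health check boils down to a ratio and a threshold comparison. A minimal sketch follows, assuming the run states for the last 7 days have already been collected as a list of strings such as `"success"` and `"failed"` (how they are read from the metadata database is not covered in this README). The function names are hypothetical.

```python
ALERT_THRESHOLD = 0.95  # alert when the success rate drops below 95%


def success_rate(states: list[str]) -> float:
    """Return the fraction of runs that succeeded."""
    if not states:
        return 1.0  # assumption: no runs in the window means nothing to alert on
    return sum(1 for s in states if s == "success") / len(states)


def should_alert(states: list[str], threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the success rate is strictly below the threshold."""
    return success_rate(states) < threshold
```

With 19 successes and 1 failure the rate is exactly 95%, so no alert is sent; 18 successes out of 20 (90%) would trigger one.
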
### Customization
Edit DAGs in `dags/` and restart scheduler:
```bash
docker-compose restart airflow-scheduler
```

| Service | Volume Path | Purpose |
|---|---|---|
| postgres | `./postgres-data/` | Metadata database |
| airflow | `./logs/` | Task/DAG logs |
| dags | `./dags/` | DAG Python files |
| data | `./data/` | Input/output datasets |

```bash
# View logs
docker-compose logs -f

# Airflow CLI (via worker container)
docker-compose exec airflow-worker airflow dags list
docker-compose exec airflow-worker airflow dags test etl_dag 2026-03-22

# Stop/Reset
docker-compose down -v  # Removes volumes (data loss!)
docker-compose down     # Keeps volumes
```

```bash
docker-compose exec airflow-worker airflow dags test etl_dag $(date -Iseconds --date='1 days ago')
```

- Airflow Settings: `config/airflow.cfg`
- Executor: CeleryExecutor (scalable)
- Ports:
  - Airflow UI: `8080`
  - Flower (Celery): `5555`
  - Postgres: `5432`

| Issue | Solution |
|---|---|
| DAG not visible | Check `dags/` permissions, restart scheduler |
| DB Init failed | `docker-compose down -v && docker-compose up -d` |
| Out of Memory | Increase Docker RAM limit |
| Tasks failing | Check `./logs/dag_id=etl_dag/` |
| CSV not found | Verify `data/salesorder.csv` exists |

Log Locations: `./logs/dag_id=etl_dag/[run_id]/[task_id]/`
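
When a task fails, the quickest route to its output is to build the log path from the DAG, run, and task identifiers. The helper below is a sketch that follows the layout given above literally; the real directory names depend on the Airflow version and its `log_filename_template` setting, and both function names are hypothetical.

```python
from pathlib import Path


def task_log_dir(dag_id: str, run_id: str, task_id: str,
                 base: Path = Path("./logs")) -> Path:
    """Build <base>/dag_id=<dag_id>/<run_id>/<task_id>, the layout this README describes."""
    return base / f"dag_id={dag_id}" / run_id / task_id


def list_log_files(log_dir: Path) -> list[Path]:
    """Return the *.log files under a task's log directory, sorted by path."""
    if not log_dir.is_dir():
        return []  # nothing logged yet, or the layout differs from the assumption
    return sorted(p for p in log_dir.rglob("*.log") if p.is_file())
```

For example, `list_log_files(task_log_dir("etl_dag", run_id, "transform_data"))` lists the attempts logged for the transform task, where `run_id` is copied from the Airflow UI.
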
- Add new DAG: `dags/my_new_dag.py`
- New data source: `data/new_source.csv`
- Custom operators: Place in `plugins/`

- Secrets: Use Airflow Variables/Connections
- Monitoring: Enable Flower (`docker-compose up flower`)
- Scaling: Increase `worker_count` in `docker-compose.yaml`
- Backup: Regularly back up `./postgres-data/` and `./logs/`

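
Backing up the two directories named in the last item can be scripted in a few lines. The sketch below uses `shutil.make_archive` to write timestamped `.tar.gz` archives; the `backups/` destination and the function name are assumptions. Copying `./postgres-data/` while Postgres is running may capture an inconsistent snapshot, so stopping the stack first (or using `pg_dump`) is the safer route.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_dirs(sources: list[Path], dest: Path) -> list[Path]:
    """Archive each existing source directory into dest as <name>-<timestamp>.tar.gz."""
    dest.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archives = []
    for src in sources:
        if not src.is_dir():
            continue  # skip directories that do not exist
        base_name = dest / f"{src.name}-{stamp}"
        # make_archive appends the .tar.gz suffix and returns the archive path.
        archive = shutil.make_archive(str(base_name), "gztar", root_dir=src)
        archives.append(Path(archive))
    return archives
```

A typical call from the project root would be `backup_dirs([Path("./postgres-data"), Path("./logs")], Path("./backups"))`.
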
See licence.txt (MIT License)
Production-ready 8-week ETL framework built for sales order workflows.