A modular data pipeline for extracting data from the Superset API and loading it into PostgreSQL, with Airflow integration for scheduling and monitoring.
pip install -r requirements.txt
Required dependencies:
- psycopg2-binary (>=2.9.0)
- PyYAML (>=6.0)
- requests (>=2.28.0)
- python-dotenv (>=1.0.0)
You can configure the pipeline using either a config file (JSON or YAML) or environment variables.
Copy and customize one of the example config files:
# For JSON configuration
cp config.example.json config.json
# For YAML configuration
cp config.example.yaml config.yaml
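If you need to read one of these files programmatically, a minimal sketch looks like the following (it assumes PyYAML for YAML and the standard library for JSON; the key names printed at the end are illustrative, not the pipeline's actual schema):

```python
# Minimal sketch of loading either config format; the exact schema used by
# ucs_data_pipeline.py may differ -- the keys below are illustrative only.
import json
from pathlib import Path

import yaml  # provided by the PyYAML dependency


def load_config(path: str) -> dict:
    """Load a JSON or YAML config file based on its extension."""
    config_path = Path(path)
    text = config_path.read_text(encoding="utf-8")
    if config_path.suffix in (".yaml", ".yml"):
        return yaml.safe_load(text)
    return json.loads(text)


if __name__ == "__main__":
    config = load_config("config.yaml")
    print(config.get("superset", {}))  # e.g. a hypothetical 'superset' section
```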
Copy and customize the environment file:
cp .env.example .env
Then edit the .env file with your specific configuration values.
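python-dotenv (already listed in the requirements) can load these values at runtime; a minimal sketch:

```python
# Sketch of pulling settings from .env using python-dotenv; the variable
# names match the environment variables documented below.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory by default

superset_url = os.getenv("SUPERSET_BASE_URL")
pg_host = os.getenv("PG_HOST", "localhost")
print(superset_url, pg_host)
```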
Run the database setup script to create the required tables:
psql -U your_postgres_user -d your_database -f setup_database.sql
Or manually create the PostgreSQL database and table:
CREATE DATABASE ucs_data;
\c ucs_data;
CREATE TABLE superset_data (
id VARCHAR(255) PRIMARY KEY,
value JSONB,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_superset_data_updated_at ON superset_data(updated_at);
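For reference, here is a hedged sketch of how rows matching this schema could be upserted with psycopg2; the connection parameters are placeholders and the pipeline's loader may batch differently:

```python
# Illustrative upsert into the superset_data table created above, using
# psycopg2; connection parameters are placeholders.
import psycopg2
from psycopg2.extras import Json, execute_values

rows = [("record-1", {"metric": 42}), ("record-2", {"metric": 7})]

conn = psycopg2.connect(
    host="localhost", dbname="ucs_data", user="your_postgres_user", password="your_password"
)
with conn, conn.cursor() as cur:
    execute_values(
        cur,
        """
        INSERT INTO superset_data (id, value)
        VALUES %s
        ON CONFLICT (id) DO UPDATE
        SET value = EXCLUDED.value, updated_at = CURRENT_TIMESTAMP
        """,
        [(row_id, Json(payload)) for row_id, payload in rows],
    )
conn.close()
```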
# Using config file
python ucs_data_pipeline.py -c config.yaml
# Using environment variables only
python ucs_data_pipeline.py
# With verbose logging
python ucs_data_pipeline.py -c config.yaml -v
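The flags above (-c and -v) could be wired up roughly like this; the actual argument parsing in ucs_data_pipeline.py may differ:

```python
# Rough sketch of the command-line interface described above (-c / -v).
import argparse
import logging


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="UCS data pipeline")
    parser.add_argument("-c", "--config", help="Path to config.json or config.yaml")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable DEBUG logging")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
    logging.info("Running pipeline with config=%s", args.config)
```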
- Set up the Airflow environment:
./setup_airflow.sh
- Run the pipeline using Airflow:
./run_pipeline.sh
- Check the pipeline status:
./check_dag_status.sh
- `SUPERSET_BASE_URL`: Superset API endpoint URL
- `SUPERSET_USERNAME`: Username for user login authentication
- `SUPERSET_PASSWORD`: Password for user login authentication
- `SUPERSET_CHART_ID`: ID of the Superset chart to extract data from
- `SUPERSET_TIMEOUT`: Request timeout in seconds (default: 30)
- `SUPERSET_RETRIES`: Number of retry attempts (default: 3)
- `SUPERSET_BATCH_SIZE`: Batch size for data extraction (default: 5000)
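For orientation, here is a rough sketch of authenticating and pulling chart data with requests using these variables; Superset endpoint paths and payload shapes vary between versions, so treat this as an outline rather than the pipeline's actual extractor:

```python
# Hedged sketch of logging in to the Superset API and requesting chart data
# with retries; endpoint paths may differ for your Superset version.
import os

import requests

BASE_URL = os.getenv("SUPERSET_BASE_URL", "http://localhost:8088")
TIMEOUT = int(os.getenv("SUPERSET_TIMEOUT", "30"))
RETRIES = int(os.getenv("SUPERSET_RETRIES", "3"))


def login(session: requests.Session) -> None:
    """Obtain a bearer token via username/password login."""
    resp = session.post(
        f"{BASE_URL}/api/v1/security/login",
        json={
            "username": os.getenv("SUPERSET_USERNAME"),
            "password": os.getenv("SUPERSET_PASSWORD"),
            "provider": "db",
            "refresh": True,
        },
        timeout=TIMEOUT,
    )
    resp.raise_for_status()
    session.headers["Authorization"] = f"Bearer {resp.json()['access_token']}"


def fetch_chart_data(session: requests.Session, chart_id: str) -> dict:
    """Fetch chart data, retrying on transient failures."""
    last_error = None
    for _ in range(RETRIES):
        try:
            resp = session.get(f"{BASE_URL}/api/v1/chart/{chart_id}/data/", timeout=TIMEOUT)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc
    raise RuntimeError(f"Chart {chart_id} could not be fetched") from last_error


if __name__ == "__main__":
    with requests.Session() as session:
        login(session)
        data = fetch_chart_data(session, os.getenv("SUPERSET_CHART_ID", "1"))
        print(type(data))
```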
- `PG_HOST`: Database host (default: localhost)
- `PG_DB`: Database name
- `PG_USER`: Database user
- `PG_PASSWORD`: Database password
- `PG_PORT`: Database port (default: 5432)
- `PG_BATCH_SIZE`: Batch size for data loading (default: 1000)
- `LOG_LEVEL`: Log level - DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
- `LOG_FORMAT`: Log message format
- `LOG_FILE`: Log file path (default: logs/ucs_pipeline.log)
- `PIPELINE_PARALLEL`: Enable parallel processing (default: false)
- `PIPELINE_MAX_WORKERS`: Maximum worker threads for parallel processing (default: 4)
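A small sketch of resolving the logging and pipeline settings above from the environment, falling back to the documented defaults (illustrative only):

```python
# Resolve logging and parallelism settings from the environment, using the
# defaults documented above when a variable is unset.
import logging
import os

LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
LOG_FILE = os.getenv("LOG_FILE", "logs/ucs_pipeline.log")
PIPELINE_PARALLEL = os.getenv("PIPELINE_PARALLEL", "false").lower() == "true"
PIPELINE_MAX_WORKERS = int(os.getenv("PIPELINE_MAX_WORKERS", "4"))

log_dir = os.path.dirname(LOG_FILE)
if log_dir:
    os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(
    level=getattr(logging, LOG_LEVEL, logging.INFO),
    filename=LOG_FILE,
    format=os.getenv("LOG_FORMAT", "%(asctime)s %(levelname)s %(name)s: %(message)s"),
)
logging.getLogger(__name__).info(
    "parallel=%s workers=%s", PIPELINE_PARALLEL, PIPELINE_MAX_WORKERS
)
```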
The project includes Airflow integration for scheduled pipeline execution:
- DAG ID: `ucs_data_pipeline`
- Schedule: Daily
- Start Date: July 1, 2025
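A DAG with this ID, schedule, and start date might look roughly as follows; the real definition in airflow/dags/ may use different operators, tasks, and parameter names (e.g. `schedule` vs. `schedule_interval` across Airflow versions):

```python
# Approximate shape of the DAG described above; paths are placeholders and
# the actual DAG file in airflow/dags/ may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ucs_data_pipeline",
    schedule="@daily",
    start_date=datetime(2025, 7, 1),
    catchup=False,
) as dag:
    run_pipeline = BashOperator(
        task_id="run_pipeline",
        bash_command="python /path/to/ucs_data_pipeline.py -c /path/to/config.yaml",
    )
```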
- `setup_airflow.sh`: Sets up the Airflow environment
- `run_pipeline.sh`: Triggers the pipeline DAG
- `check_dag_status.sh`: Checks the status of the pipeline DAG
- `reset_airflow.sh`: Resets the Airflow database
- `restart_scheduler.sh`: Restarts the Airflow scheduler
ucs_pipeline/
├── config/ # Configuration management
├── extractor/ # Data extraction from Superset
├── transformer/ # Data transformation logic
├── loader/ # Data loading to PostgreSQL
└── utils/ # Utilities (logging, error handling)
airflow/
├── dags/ # Airflow DAG definitions
└── logs/ # Airflow execution logs
logs/ # Pipeline execution logs
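The run-time flow mirrors this layout (config → extractor → transformer → loader); the stub below is purely illustrative and does not reflect the modules' real APIs:

```python
# Rough end-to-end flow mirroring the package layout above; every function
# here is a stand-in for the corresponding package, not its real interface.
from typing import Any


def extract(config: dict[str, Any]) -> list[dict[str, Any]]:
    """extractor/: pull chart data from the Superset API."""
    return [{"id": "record-1", "metric": 42}]


def transform(rows: list[dict[str, Any]]) -> list[tuple[str, dict[str, Any]]]:
    """transformer/: reshape API rows into (id, value) pairs for superset_data."""
    return [(row["id"], {k: v for k, v in row.items() if k != "id"}) for row in rows]


def load(records: list[tuple[str, dict[str, Any]]]) -> int:
    """loader/: upsert records into PostgreSQL (see the psycopg2 sketch above)."""
    return len(records)


if __name__ == "__main__":
    loaded = load(transform(extract({"chart_id": "123"})))
    print(f"Loaded {loaded} records")
```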
After running the pipeline, an execution summary is displayed with:
- Execution time
- Records extracted
- Records loaded
- Any errors encountered
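A summary with these fields could be modeled like the sketch below; the actual fields and formatting come from the pipeline itself:

```python
# Illustrative container for the run summary listed above.
from dataclasses import dataclass, field


@dataclass
class ExecutionSummary:
    duration_seconds: float
    records_extracted: int
    records_loaded: int
    errors: list[str] = field(default_factory=list)

    def report(self) -> str:
        return (
            f"Execution time: {self.duration_seconds:.1f}s | "
            f"extracted: {self.records_extracted} | "
            f"loaded: {self.records_loaded} | "
            f"errors: {len(self.errors)}"
        )


print(ExecutionSummary(12.3, 5000, 5000).report())
```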