This ETL (Extract, Transform, Load) pipeline project is designed to automate the process of gathering data from various sources, transforming it into a usable format, and loading it into a SQL database. The project utilizes Python for scripting and includes various data sources such as CSV files, JSON data from APIs, and Google Sheets.
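At a high level, the pipeline follows the classic extract-transform-load pattern. The project's actual source isn't reproduced here; the sketch below is illustrative only, assuming pandas and SQLAlchemy (both common choices for this kind of pipeline), with the connection string, table name, and transformations invented for the example:

```python
# Illustrative ETL sketch; function bodies, paths, and names are assumptions.
import pandas as pd
from sqlalchemy import create_engine

def extract() -> pd.DataFrame:
    # Extract: read one of the sample datasets shipped in data/.
    return pd.read_csv("data/sample_data.csv")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: deduplicate and normalize column names.
    df = df.drop_duplicates()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

def load(df: pd.DataFrame) -> None:
    # Load: write the cleaned frame to a SQL table (placeholder connection string).
    engine = create_engine("sqlite:///etl.db")
    df.to_sql("cleaned_data", engine, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```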
The project is organized into the following directories and files:
- `etl_pipeline.py`: The main script that orchestrates the ETL process.
- `config/db_config.json`: Database configuration settings and API keys (a hypothetical example appears after this list).
- `data/`: Sample datasets for extraction:
  - `sample_data.csv`: Example structured dataset.
  - `sample_weather.json`: Weather data sourced from an API.
  - `google_sheet_sample.csv`: Exported Google Sheets data.
- `scheduler.py`: Script for scheduling ETL pipeline execution.
- `requirements.txt`: Required Python packages for the project.
- `output/`: Processed output data:
  - `final_cleaned_data.csv`: The cleaned data produced by the ETL run.
- `load_to_db.py`: Loads the cleaned data into the SQL database.
- `.github/workflows/ci_cd.yml`: CI/CD configuration for GitHub Actions.
- `report.pdf`: Detailed technical documentation of the project.
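The exact schema of `config/db_config.json` isn't documented here, so the snippet below is only a hypothetical shape: every key and value is a placeholder showing the kind of information the file holds.

```json
{
  "database": {
    "host": "localhost",
    "port": 5432,
    "user": "etl_user",
    "password": "<your-password>",
    "name": "etl_db"
  },
  "api_keys": {
    "weather_api": "<your-weather-api-key>"
  }
}
```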
To get started:

- Clone the Repository: Clone this repository to your local machine:

  ```bash
  git clone <repository-url>
  ```

- Install Dependencies: Navigate to the project directory and install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Configure Database Settings: Update the `config/db_config.json` file with your database connection details and API keys.
- Run the ETL Pipeline: Execute the main ETL script:

  ```bash
  python etl_pipeline.py
  ```

- Schedule the Pipeline: Use `scheduler.py` to run the ETL process at specified intervals (a sketch follows this list).
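`scheduler.py` itself isn't shown in this README; one minimal way to implement such a script, assuming the third-party `schedule` package, is sketched below (the daily 02:00 run time is an arbitrary example):

```python
# Hypothetical scheduler sketch using the third-party `schedule` package.
import subprocess
import time

import schedule

def run_pipeline():
    # Invoke the main ETL script as a subprocess.
    subprocess.run(["python", "etl_pipeline.py"], check=True)

# Run the pipeline every day at 02:00.
schedule.every().day.at("02:00").do(run_pipeline)

while True:
    schedule.run_pending()
    time.sleep(60)  # poll once a minute
```

A cron entry that invokes `python etl_pipeline.py` would be an equally standard alternative.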
- Ensure that the data sources are accessible and correctly formatted.
- Modify the transformation logic in `etl_pipeline.py` as needed to suit your data processing requirements (a sketch follows this list).
- Monitor the `output/` directory for the cleaned data file after execution.
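As an example of such a modification, a customized cleaning step might look like the following pandas sketch; the `temperature` column and the value range are invented placeholders, not fields guaranteed to exist in the sample data:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative transformation; adapt columns and rules to your own data.
    df = df.dropna(subset=["temperature"])       # drop rows missing a key field
    df["temperature"] = df["temperature"].astype(float)
    df = df[df["temperature"].between(-80, 60)]  # discard implausible readings
    return df.reset_index(drop=True)
```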
This project includes a CI/CD pipeline defined in `.github/workflows/ci_cd.yml`, which automates testing and deployment. Ensure that your GitHub repository has GitHub Actions enabled so the workflow runs for continuous integration and delivery.
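The workflow file itself isn't reproduced here; a minimal GitHub Actions workflow of this kind, assuming the project's tests run under pytest, typically looks like the sketch below (job names, action versions, and the Python version are illustrative):

```yaml
# Illustrative workflow only; the repository's ci_cd.yml may differ.
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest
```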
This ETL pipeline project serves as a robust framework for data processing and integration, leveraging various data sources and automation techniques to streamline workflows. For further details, refer to `report.pdf` for technical documentation and design explanations.