
brmhastra/local-data-platform

 
 


Note

Hi followers, thank you for taking the time to read this README. The items below should make the project's scope and progress easier to follow:

  1. Milestones
  2. README.md (This document)
  3. Enhancements
  4. Documentation (Coming Soon)

Updated At Thu 3 Oct 2024

Local Data Engineering Toolkit in Python

Plan

| Milestone | Epic | Target Date | Delivery Date | Release Owner | Comment |
| --- | --- | --- | --- | --- | --- |
| 0.1.0 | HelloWorld | 1st Oct 24 | 1st Oct 24 | @tusharchou | Good Start |
| 0.1.1 | Ingestion | 3rd Oct 24 | 9th Oct 24 | @tusharchou | First Sprint |
| 0.1.2 | Warehousing | 18th Oct 24 | TBD | @tusharchou | Coming Soon |
| 1.0.0 | Ready for Production | 1st Nov 24 | TBD | TBD | End Game |

Milestone

Local Data Platform

Business information systems need fresh data every day, organised so that retrieval is cost-effective. Building a local data platform requires a setup in which you can recreate production use cases and develop new pipelines.

Problem Statement

| Question | Answer |
| --- | --- |
| What? | A local data platform that can scale up to the cloud |
| Why? | Save costs on cloud infrastructure and development time |
| When? | At the start of the product development life cycle |
| Where? | Local first |
| Who? | Businesses that want a product data platform that runs locally and scales up when the time comes |

A Python library that uses open-source tools to orchestrate data platform operations locally for development and testing.

Components

  1. Orchestrator
    • cron
    • Airflow
  2. Source
    • APIs
    • Files
  3. Target
    • Iceberg
    • DuckDB
    • Space and Time
  4. Catalog
    • REST
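
One way to picture how these components relate: a source yields records, a target stores them, and the orchestrator triggers the hand-off. The sketch below is illustrative only. `Source`, `Target`, `run_pipeline`, and the in-memory classes are hypothetical names chosen for this example, not the library's actual API.

```python
from typing import Dict, Iterable, Iterator, List, Protocol

Record = Dict[str, object]  # one row of data, keyed by column name


class Source(Protocol):
    """Anything that can produce records (an API, a file, ...)."""

    def read(self) -> Iterator[Record]: ...


class Target(Protocol):
    """Anything that can store records (Iceberg, DuckDB, ...)."""

    def write(self, records: Iterable[Record]) -> int: ...


class InMemorySource:
    """Hypothetical stand-in for a real source, useful in tests."""

    def __init__(self, rows: List[Record]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class InMemoryTarget:
    """Hypothetical stand-in for a real target, useful in tests."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of rows written


def run_pipeline(source: Source, target: Target) -> int:
    """Move every record from source to target; return the row count."""
    return target.write(source.read())
```

An orchestrator such as cron or Airflow would then only need to call `run_pipeline` on a schedule, which keeps sources and targets swappable between local runs and the cloud.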

Source

Parquet

Data may be available as a single file in the source format. For example, the New York yellow taxi trip data for January 2023 can be downloaded with:

curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
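
The same download can be done from Python when it needs to be part of a pipeline. This is a minimal sketch using only the standard library. The helper names `download_file` and `looks_like_parquet` are assumptions made for illustration and are not part of this project.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the curl command above.
TRIP_DATA_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_2023-01.parquet"
)


def download_file(url: str, dest: Path) -> Path:
    """Stream the content at url into dest and return dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def looks_like_parquet(path: Path) -> bool:
    """Cheap sanity check: unencrypted Parquet files start and end with b'PAR1'."""
    if path.stat().st_size < 8:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 = seek relative to the end of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

Calling `download_file(TRIP_DATA_URL, Path("/tmp/yellow_tripdata_2023-01.parquet"))` mirrors the curl command, and `looks_like_parquet` gives a quick check that the download was not truncated or replaced by an error page.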

local-data-platform/

Target

  1. CSV
  2. Google Sheet
  3. Iceberg
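
As an illustration of the simplest target in this list, records can be written to CSV with the standard library alone. The function name `write_csv` and its signature are assumptions for this sketch, not the project's API. Google Sheet and Iceberg targets would need their own client libraries and are not shown.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List


def write_csv(
    records: Iterable[Dict[str, object]], path: Path, columns: List[str]
) -> int:
    """Write records to path as CSV with a header row; return the row count.

    Keys not listed in columns are ignored; missing keys become empty cells.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            writer.writerow(record)
            count += 1
    return count
```

Passing the column list explicitly keeps the output schema stable even when individual records carry extra or missing fields.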

References

  1. iceberg-python
  2. near-data-lake
  3. duckdb

Self Promotion

  1. Reliable Change Data Capture using Iceberg
  2. Introduction to pyiceberg

About

omop for local data platform
