dbt is a data transformation tool; it is not an orchestration tool. As such, it does not handle failing models on its own.
The dbt_basic DAG shows how all models can be built inside a single task.
To improve on this, the dbt_advanced DAG splits each model, and each model's tests, into individual tasks.
The dbt_selectors_standard_schedule DAG goes one step further and splits the project up by selector, so different parts of the dbt project can run at different times and intervals.
The dbt DAGs in this repository are built on top of this blog post on beginner and advanced implementation concepts at the intersection of dbt and Airflow.
To run these DAGs locally:
- Download the Astro CLI
- Download and run Docker
- Clone this repository and `cd` into it.
- Run `astro dev start` to spin up a local Airflow environment and run the accompanying DAGs on your machine.
- Once Airflow is running, add a dbt profiles file at `/home/astro/.dbt/profiles.yml`.
- dbt's `manifest.json` needs to be built each time inside the Airflow Docker container, under `/usr/local/airflow/data-cicd/target`.
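
For example, a minimal sketch (assuming an Airflow 2-style DAG; the DAG id and schedule are illustrative) of a task that compiles the dbt project inside the container so `target/manifest.json` exists before anything tries to parse it:

```python
# Sketch only: compile the dbt project inside the Airflow container so
# target/manifest.json is available for downstream parsing.
# The DAG id and schedule are illustrative, not this repo's actual values.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/usr/local/airflow/data-cicd"  # dbt project root inside the container

with DAG(
    dag_id="dbt_compile_manifest_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    compile_dbt = BashOperator(
        task_id="dbt_compile",
        # --profiles-dir points at the profiles.yml added in the step above
        bash_command=f"cd {DBT_DIR} && dbt compile --profiles-dir /home/astro/.dbt",
    )
```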
- Runs some conditional logic to clone the dbt repo.
- Runs dbt without splitting each dbt model's build into separate tasks.
- Uses the `manifest.json` to build out dependencies as individual Airflow DAG tasks, giving greater visibility into errors and bringing Airflow's retry logic to dbt.
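
A rough sketch of that manifest-parsing idea, assuming an Airflow 2-style DAG and the compiled `manifest.json` path above; the task naming, run/test commands, and dependency wiring shown here follow the common pattern rather than this repo's exact code:

```python
# Sketch only: read manifest.json and create one run task and one test task
# per dbt model, wiring Airflow dependencies from dbt's depends_on graph.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/usr/local/airflow/data-cicd"
MANIFEST_PATH = f"{DBT_DIR}/target/manifest.json"

with open(MANIFEST_PATH) as f:
    manifest = json.load(f)

with DAG(
    dag_id="dbt_advanced_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_tasks = {}

    # One "dbt run" and one "dbt test" task per model node in the manifest.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        model = node["name"]
        run = BashOperator(
            task_id=f"run_{model}",
            bash_command=f"cd {DBT_DIR} && dbt run --models {model}",
        )
        test = BashOperator(
            task_id=f"test_{model}",
            bash_command=f"cd {DBT_DIR} && dbt test --models {model}",
        )
        run >> test  # a model's tests run right after it builds
        run_tasks[node_id] = run

    # Wire model-to-model dependencies from the manifest's depends_on info.
    for node_id, run in run_tasks.items():
        for upstream_id in manifest["nodes"][node_id]["depends_on"]["nodes"]:
            if upstream_id in run_tasks:
                run_tasks[upstream_id] >> run
```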
This DAG builds all of its dbt tasks based on dbt selectors (see the dbt selectors docs). This allows us to take dbt_advanced and break the dbt project out by selector, which is useful when we need to run DAGs at different intervals and times.
CICD Tool
- Loads the `manifest.json`
- Runs `generate_all_model_dependencies`
- Pickles the resulting dependencies
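
A hedged sketch of that CI-side flow; `generate_all_model_dependencies` is named in this repo, but the body below is an assumed implementation, and the manifest path, output path, and shape of the pickled data are illustrative:

```python
# Sketch only: CI-side utility that turns manifest.json into a pickled
# model-dependency map. Paths and the dependency-map structure are assumptions.
import json
import pickle

MANIFEST_PATH = "target/manifest.json"
OUTPUT_PATH = "dbt_dags/data/model_dependencies.pickle"


def generate_all_model_dependencies(manifest: dict) -> dict:
    """Map each dbt model to the models it depends on (assumed implementation)."""
    deps = {}
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        deps[node["name"]] = [
            manifest["nodes"][up]["name"]
            for up in node["depends_on"]["nodes"]
            if up in manifest["nodes"]
            and manifest["nodes"][up]["resource_type"] == "model"
        ]
    return deps


if __name__ == "__main__":
    with open(MANIFEST_PATH) as f:
        manifest = json.load(f)

    dependencies = generate_all_model_dependencies(manifest)

    # Pickle the dependency map so the Airflow DAGs can load it without
    # needing manifest.json at parse time.
    with open(OUTPUT_PATH, "wb") as f:
        pickle.dump(dependencies, f)
```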
Airflow DAG
- Loads the pickle file based on selectors
- Generates dbt tasks based on selectors
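
A rough sketch of the DAG side, assuming the pickle holds a per-model dependency map like the one above; the selector-to-model grouping, file paths, and DAG id are assumptions:

```python
# Sketch only: loads the pickled dependency map and builds run tasks for the
# models covered by one selector/schedule. Grouping and paths are assumptions.
import pickle
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/usr/local/airflow/data-cicd"
PICKLE_PATH = "/usr/local/airflow/include/data/model_dependencies.pickle"

with open(PICKLE_PATH, "rb") as f:
    dependencies = pickle.load(f)  # {model_name: [upstream model names]}

# Hypothetical: the subset of models that this DAG's selector covers.
SELECTED_MODELS = {"orders", "customers"}

with DAG(
    dag_id="dbt_selectors_standard_schedule_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {
        model: BashOperator(
            task_id=f"run_{model}",
            bash_command=f"cd {DBT_DIR} && dbt run --models {model}",
        )
        for model in SELECTED_MODELS
    }

    # Only wire dependencies that stay inside this selector's model set.
    for model, upstreams in dependencies.items():
        if model not in tasks:
            continue
        for upstream in upstreams:
            if upstream in tasks:
                tasks[upstream] >> tasks[model]
```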
This script lives at `/home/dave/data-engineering/data-cicd/.github/workflows/dependency_graph.py`.
Currently the pickle file is written out to `/home/dave/data-engineering/data-cicd/dbt_dags/data/*`.
This file is a utility script that is run via CircleCI in the deploy step. It is not run via Airflow in any way. The point of this script is to generate a pickle file that contains all of the dependencies between dbt models for each dag (usually corresponding to a different schedule) that we want to run.
- The current dependency pickles live in `include/data/`; they will need to be moved into S3.
Example of using Great Expectations with local data.
NOT working; provides boilerplate for dbt with Great Expectations.