Formula 1 Data Analysis Project

Overview

This project is designed to analyze Formula 1 data using various tools and technologies. It covers the ingestion, processing, and analysis of race data from the ergast.com API. The main components of the project are:

  • SQL Script: To structure the ClickHouse database.
  • dbt Models: To transform the raw data into tables used for analysis.
  • Kestra and its flows: To orchestrate the data workflows and automate tasks. The flows cover several data ingestion and storage steps, implemented mainly as Python scripts.
  • Metabase: To visualize the data and create dashboards for analysis.
  • Docker and Docker Compose: To containerize the project and simplify deployment.

They interoperate as follows:

  • Kestra retrieves the data from the API, respecting its rate limits, and loads it into a ClickHouse database (see the sketch after this list).
  • dbt models transform the raw data into analysis tables within the ClickHouse database.
  • Metabase is used for data visualization and dashboard creation.
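
As an illustration of the first step, here is a minimal Python sketch of a rate-limited Ergast API call followed by a ClickHouse insert. It is not taken from the actual Kestra flows: the raw_results table name and the throttle interval are assumptions, and it uses the clickhouse-connect client rather than whatever driver the flows use.

    import time

    import clickhouse_connect
    import requests

    BASE_URL = "https://ergast.com/api/f1"
    REQUEST_INTERVAL = 0.5  # assumed pause between calls to respect the API's rate limits

    def fetch_season_results(season, limit=100):
        """Page through the Ergast results endpoint for one season."""
        offset = 0
        while True:
            resp = requests.get(
                f"{BASE_URL}/{season}/results.json",
                params={"limit": limit, "offset": offset},
                timeout=30,
            )
            resp.raise_for_status()
            races = resp.json()["MRData"]["RaceTable"]["Races"]
            if not races:
                break
            yield from races
            offset += limit
            time.sleep(REQUEST_INTERVAL)  # stay under the rate limit between pages

    client = clickhouse_connect.get_client(host="localhost", port=8123)
    rows = [
        (race["season"], race["round"], race["raceName"],
         result["Driver"]["driverId"], result["position"])
        for race in fetch_season_results(2023)
        for result in race["Results"]
    ]
    # "raw_results" is a placeholder; the real flow writes to the tables
    # created by the SQL script in the clickhouse/ directory.
    client.insert(
        "raw_results",
        rows,
        column_names=["season", "round", "race_name", "driver_id", "position"],
    )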

Project Structure

  • clickhouse: Contains the SQL script for creating the basic ClickHouse database schema.
  • dbt: Contains the dbt project with models for transforming the raw data into tables used for analysis.
  • kestra-flows: Contains the flow that ingests and stores normalized data.
  • metabase: Contains the Metabase backup and restore scripts for dashboards and questions, as well as the .jar driver for the ClickHouse connection.

Project architecture

(Diagram: project architecture)

Database Schema (fact and dimension tables)

(Diagram: database schema with fact and dimension tables)

dbt lineage

(Diagram: dbt model lineage)
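
Because the warehouse follows a fact/dimension layout, a typical analysis query joins a fact table to its dimensions. The Python sketch below is purely illustrative: fact_results and dim_drivers are hypothetical table names, since the real names come from the SQL script and the dbt models.

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost", port=8123)

    # Hypothetical star-schema query: race wins per driver, joining a fact
    # table to a driver dimension. Substitute the real table names.
    result = client.query(
        """
        SELECT d.driver_name, count() AS wins
        FROM fact_results AS f
        JOIN dim_drivers AS d ON f.driver_id = d.driver_id
        WHERE f.position = 1
        GROUP BY d.driver_name
        ORDER BY wins DESC
        LIMIT 10
        """
    )
    for driver_name, wins in result.result_rows:
        print(driver_name, wins)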

Prerequisites

  • Python and pip
  • dbt (installed via pip, preferably in a virtual environment)
  • Docker and Docker Compose

Setup

  1. Clone the repository:

    git clone [email protected]:exalt-it-dojo/candidats/pablo-calvo-flight-radar-c9ea6ea7-ed6a-4ee2-9e62-905ae0c43e4b.git
    cd pablo-calvo-flight-radar-c9ea6ea7-ed6a-4ee2-9e62-905ae0c43e4b
  2. Create a virtual environment (optional):

    python -m venv .venv
    source .venv/bin/activate
  3. Install dependencies (optional):

    pip install -r requirements.txt

Usage

Running with Docker Compose

  1. Start the services:

    docker-compose up --build -d

    Once the setup has finished, the following services should be running:

    • PostgreSQL database for Metabase data.
    • ClickHouse database for storing the normalized and analytics data.
    • Metabase service for data visualization.
    • PostgreSQL database for Kestra.
  2. Access the Kestra web interface:

    • Open the Kestra web interface at http://localhost:8080.
    • Go to the Flows section and start the f1-data.ergast_api_call flow.
  3. Restore the Metabase dump:

    • Run the following command to restore the Metabase database from the provided backup:
    ./metabase/restore_metabase.sh
    • You can back up the current state again by running:
    ./metabase/backup_metabase.sh
    • The backup will be stored in the metabase/ directory.
    • If the script doesn't run, you may need to make it executable:
    sudo chmod +x metabase/restore_metabase.sh
  4. Run dbt models:

    • Change into the dbt project folder:
    cd dbt/f1_ergast_api
    • Run the dbt models:
    dbt run
    • You can also test the models by running:
    dbt test
  5. Access the Metabase web interface:

    • Open the Metabase web interface at http://localhost:3000.
    • Log in with the credentials provided by email.
    • Go to Admin → Databases and sync the ClickHouse database so the dashboards contain the appropriate data (a scripted alternative is sketched after this list).
    • Check the dashboard prepared for you at http://localhost:3000/dashboard/1-f1-data.
    • The data in the dashboard will update as the Kestra flow runs, since the dbt models are materialized views.
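
The Admin → Databases sync from step 5 can also be triggered through Metabase's HTTP API, which is convenient if you script the setup. A minimal sketch, assuming the port from docker-compose and that ClickHouse is database id 2 in Metabase (both assumptions; check Admin → Databases for the real id):

    import requests

    METABASE_URL = "http://localhost:3000"

    # Authenticate with the credentials provided for the Metabase instance.
    session = requests.post(
        f"{METABASE_URL}/api/session",
        json={"username": "user@example.com", "password": "change-me"},
        timeout=30,
    ).json()

    # Trigger a schema sync for the ClickHouse database. The database id
    # (2 here) is an assumption; look it up under Admin -> Databases.
    requests.post(
        f"{METABASE_URL}/api/database/2/sync_schema",
        headers={"X-Metabase-Session": session["id"]},
        timeout=30,
    ).raise_for_status()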

Next steps

  1. Simplifying Kestra flows:

    • The flows could be refactored to make them more modular and easier to maintain, especially the Python scripts.
    • The flows could be optimized for performance by reducing unnecessary steps and improving data processing efficiency.
  2. Adding Kestra flow with triggers to load new dbt models:

    • A new flow could be created to trigger the dbt models after the data ingestion flow has completed.
  3. Adding more dbt models:

    • More dbt models could be added to transform the raw data into additional tables for analysis.
    • The existing models could be optimized for performance and efficiency.
  4. Improvement of sensitive variables handling:

    • Sensitive variables are currently hardcoded in the Kestra flows and dbt profiles. In a production environment, it would be better to store and retrieve them via environment variables or a secrets management tool (see the sketch after this list).
  5. CI/CD Pipeline:

    • Set up a CI/CD pipeline using Jenkins, GitLab CI/CD, or GitHub Actions.
    • The pipeline should include steps for linting, testing, building, and deploying the project.
    • The pipeline should be triggered on every push to the main branch.
    • Smaller, more focused pipelines could be created for development branches.
    • The pipeline should include steps for deploying the project to the production environment; alternatively, this could be done locally with Terraform.
  6. Comprehensive testing:

    • Implement unit tests, integration tests, and more thorough data tests for the project.
  7. Security:

    • Implement security best practices for the project, including encryption, access control, and data protection.
    • Use secrets management tools like HashiCorp Vault or AWS Secrets Manager to store sensitive information.
    • Implement role-based access control (RBAC) for Kestra and Metabase.
  8. Monitoring and Alerting:

    • Set up monitoring and alerting for the project using tools like Prometheus, Grafana, and ELK Stack.
    • Monitor key metrics such as data ingestion rate, latency, and error rates.
    • Set up alerts for critical issues such as data pipeline failures or database outages.
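
For the secrets handling mentioned in item 4, the simplest first step is to read credentials from the environment instead of hardcoding them. A minimal sketch; the variable names are illustrative, not the ones the flows currently use:

    import os

    # Read connection settings from the environment, failing fast when a
    # required secret is missing instead of falling back to a hardcoded value.
    CLICKHOUSE_HOST = os.environ.get("CLICKHOUSE_HOST", "localhost")
    CLICKHOUSE_USER = os.environ["CLICKHOUSE_USER"]          # required, no default
    CLICKHOUSE_PASSWORD = os.environ["CLICKHOUSE_PASSWORD"]  # required, no default

In docker-compose these can be injected with an env_file entry, and Kestra also offers a built-in secret() function that serves the same purpose inside flows.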

Project Choices and Explanations

ClickHouse instead of Neon Postgres

I am more familiar with ClickHouse than with Neon, and as an OLAP database ClickHouse also offers strong performance for analytical queries.

Metabase instead of Rill

I chose Metabase because it is more user-friendly and easier to set up than Rill, especially since I didn't know Rill beforehand. It also provides a sufficient range of visualization options, and its dashboards can be easily shared with other users.

Contact

For any questions or feedback, feel free to reach out to me at [email protected]
