use a dummy operator at the start of parallel pipelines #2197

Open
wants to merge 2 commits into
base: devel

@alucryd alucryd commented Jan 8, 2025

Description

Use a DummyOperator, instead of the first source, as the start of parallel pipelines so that all sources run in parallel, including the first one.

Related Issues

Additional Context

The first source can take a long time to run. Parallelizing it along with the other sources can make pipelines faster.
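To make the expected speedup concrete, here is a rough back-of-the-envelope sketch in plain Python. It does not use Airflow or dlt. The durations are made-up numbers, and it assumes there are enough worker slots to run every source at once and that scheduling overhead is negligible.

```python
from typing import Sequence


def first_then_parallel(durations: Sequence[float]) -> float:
    """Wall-clock time when the first source runs alone and the rest run in parallel after it."""
    first, rest = durations[0], durations[1:]
    return first + (max(rest) if rest else 0.0)


def all_parallel(durations: Sequence[float]) -> float:
    """Wall-clock time when every source starts at once behind a no-op start task."""
    return max(durations)


# Hypothetical run times in minutes for three sources; the first one is the slowest.
durations = [30.0, 10.0, 20.0]
print(first_then_parallel(durations))  # 50.0
print(all_parallel(durations))         # 30.0
```

The gain is largest when the first source is the slow one. If the first source is quick, the two layouts finish at nearly the same time.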

netlify bot commented Jan 8, 2025

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 7fc82a3
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/677e7bfdb52f8300086fd4d2

@rudolfix rudolfix self-assigned this Jan 14, 2025
Collaborator

@rudolfix rudolfix left a comment

@alucryd there's a reason to run the first task and then all the others in parallel: the first task creates the initial schema in the database and the standard dlt tables. All tasks share the same dataset.

If you still want to work on this PR, then let's add a new option to add_run, e.g. dummy_task_first, and if it is set to True, do what you do right now.

I do not want to change the existing behavior: too many deployments that may rely on it are in production.
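As a rough illustration of the proposed switch, the plain-Python sketch below only computes which task would sit upstream of which. It is not the actual add_run implementation and does not touch Airflow. The option name dummy_task_first comes from the comment above, while the function name, the "start" task name, and the source names are hypothetical.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]  # (upstream task, downstream task)


def plan_parallel_edges(
    sources: Sequence[str],
    dummy_task_first: bool = False,
    start_task: str = "start",  # hypothetical name for the no-op start task
) -> List[Edge]:
    """Return the dependency edges for a parallel layout of the given sources."""
    if not sources:
        return []
    if dummy_task_first:
        # Opt-in behavior: a no-op start task fans out to every source.
        return [(start_task, name) for name in sources]
    # Default behavior: the first source runs alone, then the rest fan out from it.
    first, rest = sources[0], sources[1:]
    return [(first, name) for name in rest]


print(plan_parallel_edges(["users", "orders", "events"]))
print(plan_parallel_edges(["users", "orders", "events"], dummy_task_first=True))
```

Keeping False as the default leaves the existing layout unchanged, so the first task still creates the schema and dlt tables before the others start.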

@alucryd
Author

alucryd commented Jan 27, 2025

> @alucryd there's a reason to run the first task and then all the others in parallel: the first task creates the initial schema in the database and the standard dlt tables. All tasks share the same dataset.
>
> If you still want to work on this PR, then let's add a new option to add_run, e.g. dummy_task_first, and if it is set to True, do what you do right now.
>
> I do not want to change the existing behavior: too many deployments that may rely on it are in production.

I see, thanks for the heads-up. I don't have the full picture yet, but I'm getting there. I ran this change in production and didn't run into any issues with a completely new data source, so I wrongly assumed it would be harmless.

I assume it would be too much work to split out the schema and table creation and run only that in the first task?

In any case, I'll add the proposed option and default it to False so it doesn't impact anyone.

@rudolfix
Collaborator

@alucryd yeah, we could think of some "preparatory" task, but IMO in that case it is better to just create a callback that receives the DAG from the airflow helper and can modify it. We already have on_before_run; we could also add on_dag_created, where you get this tree of tasks.

But that's a separate ticket, I'd say, if you'd like to try to add it.
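A minimal sketch of how such a callback might look, in plain Python with the task tree modeled as a dictionary. This is only an assumption about the shape of the idea: on_dag_created is the name proposed above and does not exist yet, and build_graph, add_start_task, and the "start" task name are hypothetical stand-ins for the airflow helper and user code.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Task name mapped to the names of its upstream tasks.
TaskGraph = Dict[str, List[str]]


def build_graph(
    sources: Sequence[str],
    on_dag_created: Optional[Callable[[TaskGraph], None]] = None,
) -> TaskGraph:
    """Build the default layout, then let the caller modify it in place."""
    first, *rest = sources
    graph: TaskGraph = {first: []}
    for name in rest:
        graph[name] = [first]  # default: everything waits for the first source
    if on_dag_created is not None:
        on_dag_created(graph)  # hypothetical hook: receives the tree of tasks
    return graph


def add_start_task(graph: TaskGraph) -> None:
    """Example callback: make every source depend on a no-op start task instead."""
    for name in graph:
        graph[name] = ["start"]
    graph["start"] = []


print(build_graph(["users", "orders", "events"], on_dag_created=add_start_task))
```

With a hook like this the helper's default layout stays as it is, and a deployment that wants the dummy-first layout opts in from its own code.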

Successfully merging this pull request may close these issues.

Parallelize all sources in Airflow, including the first one
2 participants