use a dummy operator at the start of parallel pipelines #2197

alucryd · 2025-01-08T09:52:50Z

Description

Used a DummyOperator instead of the first source to parallelize all sources, including the first one.

Related Issues

Fixes #2196 (Parallelize all sources in Airflow, including the first one)

Additional Context

The first source can take a long time to run, so running it in parallel with the other sources can make pipelines faster.
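To illustrate the expected speed-up, here is a minimal sketch that models the two task layouts as plain dependency maps and compares their total run time. It does not use Airflow or dlt; the source names, the `start` node, and the durations are made-up assumptions for illustration only.

```python
from typing import Dict, List

Deps = Dict[str, List[str]]  # task name -> list of upstream task names


def first_then_parallel(tasks: List[str]) -> Deps:
    """Existing layout: the first source runs alone, the rest wait for it."""
    first, rest = tasks[0], tasks[1:]
    deps: Deps = {first: []}
    for task in rest:
        deps[task] = [first]
    return deps


def start_then_parallel(tasks: List[str], start: str = "start") -> Deps:
    """Proposed layout: a no-op start node, then every source in parallel."""
    deps: Deps = {start: []}
    for task in tasks:
        deps[task] = [start]
    return deps


def finish_time(deps: Deps, durations: Dict[str, int]) -> int:
    """Length of the longest path, i.e. the wall-clock time with enough workers."""
    memo: Dict[str, int] = {}

    def done(task: str) -> int:
        if task not in memo:
            upstream = max((done(u) for u in deps[task]), default=0)
            memo[task] = upstream + durations.get(task, 0)  # no-op nodes cost 0
        return memo[task]

    return max(done(task) for task in deps)


# Hypothetical durations in minutes; the first source is the slowest one.
durations = {"source_a": 30, "source_b": 10, "source_c": 20}
tasks = list(durations)

serial_first = finish_time(first_then_parallel(tasks), durations)
all_parallel = finish_time(start_then_parallel(tasks), durations)
```

With these assumed numbers the existing layout finishes after 50 minutes (30 for the first source plus 20 for the slowest remaining one), while the layout with a no-op start node finishes after 30.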

netlify · 2025-01-08T09:53:08Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`7fc82a3`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/677e7bfdb52f8300086fd4d2

rudolfix

@alucryd there's a reason to run the first task and then all the others in parallel: it will create the initial schema in the database and the standard dlt tables. all tasks share the same dataset.

if you still want to work on this PR then let's add a new option to add_run, e.g. dummy_task_first, and if it is set to True, do what you do right now.

I do not want to change existing behavior, too many deployments that may rely on that are in production
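The opt-in approach can be sketched as follows. This is a simplified model in plain Python, not the actual `add_run` implementation: `plan_parallel` and the `start` node name are hypothetical, and only the flag name `dummy_task_first` comes from the comment above.

```python
from typing import Dict, List


def plan_parallel(tasks: List[str], dummy_task_first: bool = False) -> Dict[str, List[str]]:
    """Return a map of task name -> upstream task names.

    Default (False): the first task runs alone, so it can create the schema
    and the standard tables before the other tasks start in parallel.
    Opt-in (True): a no-op "start" node comes first and all tasks, including
    the first one, run in parallel after it.
    """
    if not tasks:
        return {}
    if dummy_task_first:
        root, dependents = "start", list(tasks)
    else:
        root, dependents = tasks[0], list(tasks[1:])
    deps: Dict[str, List[str]] = {root: []}
    for task in dependents:
        deps[task] = [root]
    return deps


sources = ["source_a", "source_b", "source_c"]
default_plan = plan_parallel(sources)                       # existing behavior
opt_in_plan = plan_parallel(sources, dummy_task_first=True)  # new behavior
```

Because the flag defaults to `False`, callers that do not pass it get the same dependency layout as before.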

alucryd · 2025-01-27T17:35:20Z

> @alucryd there's a reason to run the first task and then all the others in parallel: it will create the initial schema in the database and the standard dlt tables. all tasks share the same dataset.
>
> if you still want to work on this PR then let's add a new option to add_run, e.g. dummy_task_first, and if it is set to True, do what you do right now.
>
> I do not want to change existing behavior, too many deployments that may rely on that are in production

I see, thanks for the heads up. I don't have the full picture yet, but I'm getting there. I ran this change in production and didn't run into any issues with a completely new data source, so I wrongly assumed it would be harmless.

I assume it would be too much work to split out the schema and table creation and run only that in the first task?

In any case I'll add the proposed option and default it to false so it doesn't impact anyone.

rudolfix · 2025-01-29T17:24:00Z

@alucryd yeah we could think of some "preparatory" task, but IMO in that case it is better to just create a callback that receives a DAG from the airflow helper and can modify it... we already have on_before_run; we could also add on_dag_created, where you get this tree of tasks.

but that's a separate ticket I'd say - if you'd like to try to add it
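A rough sketch of how such a callback could work, again as a plain-Python model rather than the real helper: `build_graph`, the `TaskGraph` type, and the `prepare` task are hypothetical, and the `on_dag_created` signature is only an assumption based on the suggestion above.

```python
from typing import Callable, Dict, List, Optional

TaskGraph = Dict[str, List[str]]  # task name -> upstream task names


def build_graph(
    tasks: List[str],
    on_dag_created: Optional[Callable[[TaskGraph], None]] = None,
) -> TaskGraph:
    """Build the default layout, then let the caller modify it in place."""
    first, rest = tasks[0], tasks[1:]
    graph: TaskGraph = {first: []}
    for task in rest:
        graph[task] = [first]
    if on_dag_created is not None:
        on_dag_created(graph)  # the callback receives the whole task tree
    return graph


def add_prepare_task(graph: TaskGraph) -> None:
    """Example callback: run a 'prepare' task first, then all sources in parallel."""
    for upstream in graph.values():
        upstream[:] = ["prepare"]  # rewire every existing task in place
    graph["prepare"] = []


unchanged = build_graph(["source_a", "source_b"])
customized = build_graph(["source_a", "source_b"], on_dag_created=add_prepare_task)
```

The default layout stays untouched when no callback is passed, so a hook like this would not change existing deployments.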

use a dummy operator at the start of parallel pipelines

1f234a5

replace deprecated DummyOperator with EmptyOperator

7fc82a3

rudolfix self-assigned this Jan 14, 2025

rudolfix requested changes Jan 27, 2025
