Support push-down of the filter on coalesce over join keys #18848

vadimpiven · 2025-11-20T14:33:40Z

Which issue does this PR close?

Closes Support push-down of the filter on coalesce over join keys #18830.

Rationale for this change

The standard PushDownFilter optimizer rule does not propagate filters on coalesce. But as far as I can tell, push-down of the filter on coalesce over join keys is correct. So I propose adding such optimization to DataFusion.

What changes are included in this PR?

Added test-case and PushDownCoalesceFilterHelper applying optimization.

Are these changes tested?

Yes, new test-case passes, old test-cases are not affected.

Are there any user-facing changes?

No user-facing changes.

alamb

Can you please also write an end to end test with sqllogictests?

Given that coalesce is rewritten to CASE during simplification

datafusion/datafusion/functions/src/core/coalesce.rs

Lines 97 to 122 in 1833093

    
           fn simplify( 
        
               &self, 
        
               args: Vec<Expr>, 
        
               _info: &dyn SimplifyInfo, 
        
           ) -> Result<ExprSimplifyResult> { 
        
               if args.is_empty() { 
        
                   return plan_err!("coalesce must have at least one argument"); 
        
               } 
        
               if args.len() == 1 { 
        
                   return Ok(ExprSimplifyResult::Simplified( 
        
                       args.into_iter().next().unwrap(), 
        
                   )); 
        
               } 
        
               let n = args.len(); 
        
               let (init, last_elem) = args.split_at(n - 1); 
        
               let whens = init 
        
                   .iter() 
        
                   .map(|x| x.clone().is_not_null()) 
        
                   .collect::<Vec<_>>(); 
        
               let cases = init.to_vec(); 
        
               Ok(ExprSimplifyResult::Simplified( 
        
                   CaseBuilder::new(None, whens, cases, Some(Box::new(last_elem[0].clone()))) 
        
                       .end()?, 
        
               )) 
        
           }

I suspect that this optimization may not be triggered when run end to end

alamb · 2025-11-20T18:42:39Z

datafusion/optimizer/src/push_down_filter.rs

+
+    fn extract_join_columns(&self, expr: &Expr) -> Option<(Column, Column)> {
+        if let Expr::ScalarFunction(ScalarFunction { func, args }) = expr {
+            if func.name() != "coalesce" {


this seems very specific to coalesce and will likely break anyone who provides their own implementation of coalesce that overrides the built in one

Can we formualte this as some more general property of the function that allows pushing down? That way we could mark coalesce as having this property

I think the fact that coalesce is basically identity function over kept side of the join allows such optimization...

I will add end-to-end test tomorrow. I have a problem on my data where I need this optimization, so I will try to reproduce it in end-to-end test.

alamb · 2025-11-20T18:44:18Z

Thank you for this contribution @vadimpiven

alamb · 2025-11-20T19:11:37Z

datafusion/optimizer/src/push_down_filter.rs

+    }
+}
+
+fn push_full_join_coalesce_filters(


to push filters into the inputs of a FULL JOIN , you need to guarantee that the join doens't reintroduce rows (with nulls) that would have been filtered if the filter was applied beforehand

In other words, it is not clear to me that this optimization is correct

Sorry, forgot to change the name. I am using this optimization in my code specifically for chains (up to 50-table long) of FULL OUTER JOINs. I am making joins with a sequence join->project with coalesce over join keys -> alias, like:

let plan = LogicalPlanBuilder::scan("t1", scan.clone(), None)? .join( LogicalPlanBuilder::scan("t2", scan.clone(), None)?.build()?, JoinType::Full, (vec!["a"], vec!["a"]), None, )? .project(vec![ coalesce(vec![col("t1.a"), col("t2.a")]).alias("a"), col("t1.b").alias("b1"), col("t2.b").alias("b2"), ])? .alias("j1")? .build()?;

This way the initial data which looks like

{ "table1": { "1": 100, "2": 200, "3": 300 }, "table2": { "2": 2000, "3": 3000, "4": 4000 }, "table3": { "3": 30000, "4": 40000, "5": 50000 }

is joined into

key table1 table2 table3

1 100 null null

2 200 2000 null

3 300 3000 30000

4 null 4000 40000

5 null null 50000

instead of

key1 key2 key3 table1 table2 table3

1 null null 100 null null

2 2 null 200 2000 null

3 3 3 300 3000 30000

null 4 4 null 4000 40000

null null 5 null null 50000

You can check the illustration https://docs.platforma.bio/guides/vdj-analysis/diversity-analysis/#results-table where different sample properties are joined by Sample Id from different parquet files.

When I apply filter on Key what I effectively want is to replicate this filter to all input tables. And optimization that I provided does exactly that.

I am applying the chain join->project with coalesce over join keys -> alias for each new table, so for 50 tables I would have 49 projections with coalesce. Without my optimization, each optimizer pass has simplification which turns coalesce into CASE and then performs push-down which again turns case to coalesce. So 1 optimizer pass gives me propagation through 1 layer, and for 50 tables I would have to have 49 optimizer passes for full propagation. The optimization in this PR allows to optimize such scenario in 1 optimizer pass.

I realized that this optimization seems correct for any type of join if coalesce is applied to the join keys, so I do not have explicit check for FULL OUTER JOIN in proposed code.

Let me know if you believe that nobody else has the scenario I described, this way we can simply close the PR and issue without further discussion)

github-actions bot added the optimizer Optimizer rules label Nov 20, 2025

vadimpiven force-pushed the main branch from ed3e3f3 to 875a9b4 Compare November 20, 2025 17:22

Support push-down of the filter on coalesce over join keys

be8f642

vadimpiven force-pushed the main branch from 875a9b4 to be8f642 Compare November 20, 2025 17:22

alamb reviewed Nov 20, 2025

View reviewed changes

alamb mentioned this pull request Nov 20, 2025

Support push-down of the filter on coalesce over join keys #18830

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support push-down of the filter on coalesce over join keys #18848

Support push-down of the filter on coalesce over join keys #18848

vadimpiven commented Nov 20, 2025

Uh oh!

alamb left a comment

Uh oh!

alamb Nov 20, 2025

Uh oh!

vadimpiven Nov 20, 2025

Uh oh!

alamb commented Nov 20, 2025

Uh oh!

alamb Nov 20, 2025

Uh oh!

vadimpiven Nov 20, 2025

Uh oh!

vadimpiven Nov 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	fn simplify(
	&self,
	args: Vec<Expr>,
	_info: &dyn SimplifyInfo,
	) -> Result<ExprSimplifyResult> {
	if args.is_empty() {
	return plan_err!("coalesce must have at least one argument");
	}
	if args.len() == 1 {
	return Ok(ExprSimplifyResult::Simplified(
	args.into_iter().next().unwrap(),
	));
	}

	let n = args.len();
	let (init, last_elem) = args.split_at(n - 1);
	let whens = init
	.iter()
	.map(\|x\| x.clone().is_not_null())
	.collect::<Vec<_>>();
	let cases = init.to_vec();
	Ok(ExprSimplifyResult::Simplified(
	CaseBuilder::new(None, whens, cases, Some(Box::new(last_elem[0].clone())))
	.end()?,
	))
	}

key	table1	table2	table3
1	100	null	null
2	200	2000	null
3	300	3000	30000
4	null	4000	40000
5	null	null	50000

Support push-down of the filter on coalesce over join keys #18848

Are you sure you want to change the base?

Support push-down of the filter on coalesce over join keys #18848

Conversation

vadimpiven commented Nov 20, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

vadimpiven Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 20, 2025

Uh oh!

alamb Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

vadimpiven Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

vadimpiven Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants