feat: Prune complex/nested predicates via statistics propagation #19609
adriangb left a comment:
Some initial comments. Need to read a couple more times to actually wrap my head around how it's working.
Is the plan to make multiple subsequent PRs to add more handling, e.g. for Like expressions, UDFs, etc., and then eventually, once we reach feature parity, replace the current system?
pub range_stats: Option<RangeStats>,
pub null_stats: Option<NullStats>,
I think it's important to point out that if the null stats are missing (NullPresence::UnknownOrMixed), we cannot make any inferences from the min/max values; they should be treated as missing as well.
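A minimal sketch of the rule as described here (the PR's actual inference logic may differ); the presence field name is an assumption, while ColumnStats, RangeStats, and NullPresence::UnknownOrMixed come from the PR:
fn range_usable_for_inference(stats: &ColumnStats) -> Option<&RangeStats> {
    match &stats.null_stats {
        // Null info missing or unknown/mixed: min/max cannot be trusted,
        // so treat the range statistics as missing as well.
        None => None,
        Some(n) if matches!(n.presence, NullPresence::UnknownOrMixed) => None,
        // Null picture is known: the range statistics may be used.
        Some(_) => stats.range_stats.as_ref(),
    }
}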
The actual inference logic is more aggressive than the algorithm you have described; it's implemented in https://github.com/apache/datafusion/pull/19609/changes#diff-32f7f18dcd86a268e7e1e0134eae6ae002bd42e61180cfabd60944566b10f6d8R660
I'll add more comments here also.
///
/// # Errors
/// Returns Internal Error if unsupported operator is provided.
fn compare_ranges(
Some unit tests for this method specifically that ensure 100% coverage would be great
I have checked; it has reached 100% test line coverage via:
cargo +nightly llvm-cov \
--package datafusion-physical-expr \
--test pruning \
--all-features \
--html \
--open \
-- --nocapture
Though it's covered by the higher-level pruning API tests rather than unit tests directly on it. The benefit is that the test coverage won't be lost when we change the implementation to a vectorized compare_ranges_vectorized.
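For reference, a direct unit test might look roughly like the following; the compare_ranges signature, RangeStats::new, and RangeCompareResult are assumptions for illustration, not the PR's actual API:
#[test]
fn compare_ranges_lt_disjoint() {
    // lhs entirely below rhs: lhs < rhs must hold for every possible row pair.
    let lhs = RangeStats::new(ScalarValue::from(1i32), ScalarValue::from(5i32));
    let rhs = RangeStats::new(ScalarValue::from(10i32), ScalarValue::from(20i32));
    let result = compare_ranges(Operator::Lt, &lhs, &rhs).unwrap();
    assert_eq!(result, RangeCompareResult::AlwaysTrue);
}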
Thank you for the review; that feedback makes sense to me. I'll address it in a batch later.
Please let me know if anything is unclear. I’m trying to make both the implementation and the documentation clearer, but the logic and edge cases for this feature are admittedly quite tricky.
Yes, the initial milestone should be reaching coverage equivalent to the existing system.
I plan to review this PR carefully tomorrow. I am sorry for the delay, but I have been out for a few days and I am quite backed up.
alamb left a comment:
Thank you @2010YOUY01 -- I think this API is very nice and quite clever and I think it is the right basic approach to generalized vectorized range based analysis.
I have a bunch of API suggestions I think we can iterate on and make the code better / cleaner.
The biggest challenge in this project, in my mind, is that it proposes to add (yet) another API for some sort of range/statistics propagation.
Even before this PR, we already have 4 APIs on PhysicalExpr:
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_statistics
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_constraints
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_statistics
Some of these methods seem to be unused in the core codebase (e.g. the "propagate" variants, and some of the new V2 statistics API added in #14699 by @Fly-Style -- I don't see any uses of it in the code: https://github.com/apache/datafusion/blob/998f534aafbf55c83daaa6fd4985ba143954b0e0/datafusion/physical-expr/src/statistics/stats_solver.rs#L39-L38). It also has provisions for various statistical distributions for which I still don't understand the use case.
If we are going to add a new API, I think we should deprecate some/all of the others.
//! about one container and may track richer distribution details.
//! Pruning must reason about *all* containers (potentially thousands) to decide
//! which to skip, so it favors a vectorized, array-backed representation with
//! lighter-weight stats. These are intentionally separate interfaces.
Is there any fundamental reason they need to be separate interfaces?
Like I am thinking, is there some potential future where we are able to rewrite [PhysicalExpr::evaluate_bounds] to use the new API in this PR?
That way having multiple APIs would be only a temporary, intermediate state as we worked to fill out the rest of the functionality 🤔
After reading this code more, I don't think there is any fundamental difference (other than being vectorized) with evaluate_bounds
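For comparison, a rough sketch of the two shapes side by side; the evaluate_bounds signature mirrors the existing PhysicalExpr method, while the vectorized signature is only an illustration (not this PR's exact API):
// Sketch for comparison only.
trait RangeAnalysisSketch {
    // Existing scalar API: one Interval per child, one Interval out.
    fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval>;

    // Illustrative vectorized counterpart: each ColumnStats holds arrays
    // with one entry per container, and the output is again per-container.
    fn evaluate_statistics_vectorized(
        &self,
        children: &[&ColumnStats],
    ) -> Result<ColumnStats>;
}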
    false
}

/// Evaluates pruning statistics via propagation. See the pruning module
I realize the primary use case for this evaluation is pruning, but I think it is a more general concept -- basically propagating statistical information through this expression.
What would you think of calling this something more like propagate_ranges? (I realize it is getting very similar to evaluate_ranges and propagate_constraints...)
How about propagate_pruning_statistics()? It covers a rich set of statistics beyond just ranges, and we can use _pruning_ to make it less ambiguous.
If the only thing it will ever be used for is pruning then putting pruning in the name makes sense
I still have (not so) secret hopes that we can somehow unify these range / expression analysis APIs, as they all seem so similar in theory.
I changed it to evaluate_statistics_vectorized in 9d256a5; it makes sense to make it more general, and potentially unify those APIs.
/// implemented pruning; returning `None` signals that no pruning statistics
/// are available.
///
/// In the future, propagation may expose dedicated APIs such as:
Rather than different APIs, I would recommend a single API, propagate_ranges, and adding the different types of information to the object that is propagated.
This is an excellent idea, we should do this in the future.
}
}

/// Pruning intermediate type propagated through `PhysicalExpr` nodes.
I am not sure we need to distinguish between intermediate statistics and intermediate results
Specifically, I think ColumnStats for boolean expressions will be trivially convertible to PruningResults (if the min/max are both true then we know the boolean value is always true. If the min/max are both false then we know the value is always false, etc)
This is similar to how Interval works: https://github.com/apache/datafusion/blob/81512da2b0aaa474f6c4ba205b05eea7b3095176/datafusion/expr-common/src/interval_arithmetic.rs#L182-L181
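For illustration, a small sketch of the convertibility being described; the function and type names are hypothetical:
// If a boolean expression's per-container min and max are both known,
// the pruning verdict for that container follows directly.
fn verdict_from_bool_minmax(min: Option<bool>, max: Option<bool>) -> Option<bool> {
    match (min, max) {
        (Some(true), Some(true)) => Some(true),    // predicate always true for this container
        (Some(false), Some(false)) => Some(false), // predicate always false: container can be pruned
        _ => None,                                 // mixed or unknown: cannot decide
    }
}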
Yes, this is doable. To encode PruningResult, we only need a single inner Option<BooleanArray>. The key consideration, I think, is which approach leads to a simpler implementation.
The answer is not clear to me yet; I will build a small prototype to confirm.
Using BooleanArray I think will make handling all other expressions easier -- implementations of each expression will not have to pick between results or ranges, they will all use statistics
With some wrappers to interpret BooleanArray I think the APIs could be quite nice.
I thought about it carefully again, and I don’t think it is actually simpler.
It does reduce the number of lines of code for most arithmetic expressions. However, for predicate expressions, we still have to inspect the semantic meaning inside column statistics. This is not really “eliminating a special case by definition”; rather, it is trying to encode multiple concepts into a single struct. I think the downside of this approach is that it weakens type safety and also makes predicate nodes harder to reason about.
To simplify the implementation and avoid repetitive enum-variant checking, we can probably add some utility functions.
#[derive(Debug, Clone)]
pub struct ColumnStats {
    pub range_stats: Option<RangeStats>,
    pub null_stats: Option<NullStats>,
    // <--- Change here. We are adding an implicit constraint that if this is
    // Some, then this node no longer represents statistics, and all other
    // stat fields must be None.
    pub evaluate_results: Option<BooleanArray>,
    /// Number of containers. Needed to infer the result if all stats types are `None`.
    pub num_containers: usize,
}

Another consideration is that we might want a parent enum in the future for extensibility. It could be used to pass control information if we want to implement certain optimizations, or if we want to unify this PR's stat propagation API with the existing ones, where additional intermediate stat variants may be needed.
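A rough sketch of what such a parent enum might look like; PruningIntermediate and its IntermediateStats variant appear in the PR diff below, while the second variant here is only a guess at the extensibility idea:
enum PruningIntermediate {
    // Statistics propagated up through the expression tree (as in the PR).
    IntermediateStats(ColumnStats),
    // Hypothetical: a per-container verdict once an expression's outcome is
    // fully known, e.g. a BooleanArray of keep/prune flags.
    IntermediateResult(BooleanArray),
    // Future variants could carry control information or richer stats when
    // unifying with the existing propagation APIs.
}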
};
match child {
    PruningIntermediate::IntermediateStats(stats) => {
        if let Some(null_stats) = stats.null_stats() {
I swear that @pepijnve recently implemented very similar logic (for the bounds evaluation of IsNull, albeit one row at a time) but now I can't find it...
My suggested next step is to decide if we have the effort / motivation to see it through (I am willing to help review, organize, and build consensus, as I think it is a foundational piece of DataFusion). If we do want to proceed, I think the first thing we should do is figure out how we will eventually unify the existing APIs that have overlap (specifically, I think as long as there is a realistic way towards a unified framework, we can then start pounding out the code).
FYI @ozankabak and @berkaysynnada as you have been instrumental in previous versions of statistics and range analysis.
Let's work through the API design first. After we agree to move forward, I'll apply all minor code review suggestions.
I do plan to spend significant effort on the future implementation, as I believe this is a very important optimization feature. Thank you for your help, @alamb!
I took a quick look at those APIs; their high-level ideas are:
The API proposed in this PR can replace
I did some archaeology, but I am not very sure. If
Regarding
They are not. The latter are the simplest building blocks that calculate ranges; the former are designed to use them to generate richer statistical information.
Indeed, I didn't have time yet to look at this big PR, but I looked at the issue and design. As a general thought, I think replacing the
So, we have two facts about our needs:
A good first step is to add
However, my hunch is that this won't be possible for the time being, and we will need a new API that does vectorized bounds, and make it clear in the docs that we need the two APIs to cater to 1 and 2. I hope this context helps @2010YOUY01 and @alamb
This is a really good point. I didn't consider rounding safety so far; I'll make sure to include it in the vectorized version also. By the way, do you have any references on the high-level ideas behind "join pruning", and why we need the inverse path (
Thanks for the context. For now, I think we shouldn't touch the existing statistics propagation APIs and should introduce a vectorized one in this work. I'll add more documentation to explain the rationale.
As an example, when one implements a join on an ordered table with a "sliding window" condition, the
IMO this blog post explains the ideas well.
Yes, this sounds like a great plan. I still really feel that we can unify these APIs somehow, starting with
Also, I should be clear: my concern isn't just that the multiple implementations are harder to maintain; it is also that we already have significant code and test coverage for single-row range analysis -- so replicating it again in vectorized fashion entirely separately won't leverage the past experience, and may result in different behaviors between the two paths.
All review feedback has been addressed (except for #19609 (comment), which might need further discussion). It's ready for another look.
Which issue does this PR close?
Rationale for this change
See the issue for the rationale, and design considerations.
For the PR structure, start with the module-level comment in datafusion/physical-expr-common/src/physical_expr/pruning.rs and follow along.
What changes are included in this PR?
The core change in this PR is only a few hundred LoC by estimation; the PR diff is mainly tests and docs.
And we now support pruning for expressions like
The issue also includes some thoughts on future implementation plans.
Are these changes tested?
Unit tests.
Are there any user-facing changes?
No