
Conversation

@2010YOUY01 (Contributor) commented Jan 2, 2026

Which issue does this PR close?

Rationale for this change

See the issue for the rationale, and design considerations.

For PR structure, start with the module-level comment in datafusion/physical-expr-common/src/physical_expr/pruning.rs, and follow along.

What changes are included in this PR?

The core change in this PR is only a few hundred lines of code by estimation; the PR diff is mainly tests and docs.

  • Defined core APIs/data structures for stat-propagation-based predicate pruning
  • Implemented statistics pruning on:
    • Literals (like 3)
    • Column references (like c1)
    • Comparison operators >, <, =, >=, <=

And we now support pruning for expressions like

  • c1 > 1
  • c1 >= c2
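The min/max comparison underlying these cases can be sketched roughly as follows (a minimal standalone sketch; `RangeStats`, `PruneResult`, and `prune_gt` are illustrative names, not the PR's actual API, which is array-backed and evaluates many containers at once):

```rust
/// Per-container min/max statistics (illustrative single-container form).
#[derive(Debug, Clone, Copy)]
pub struct RangeStats {
    pub min: i64,
    pub max: i64,
}

/// Tri-state outcome of evaluating a predicate against statistics.
#[derive(Debug, PartialEq)]
pub enum PruneResult {
    AlwaysTrue,  // every row in the container satisfies the predicate
    AlwaysFalse, // no row can satisfy it: the container can be skipped
    Unknown,     // stats are inconclusive: the container must be scanned
}

/// Evaluate `c1 > literal` for one container using only min/max stats.
pub fn prune_gt(stats: &RangeStats, literal: i64) -> PruneResult {
    if stats.min > literal {
        PruneResult::AlwaysTrue
    } else if stats.max <= literal {
        PruneResult::AlwaysFalse
    } else {
        PruneResult::Unknown
    }
}
```

Only containers yielding `AlwaysFalse` can be skipped; note that this sketch ignores null handling, which complicates the real implementation.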

The issue also includes some thoughts on future implementation plans.

Are these changes tested?

Unit tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Jan 2, 2026
@2010YOUY01 2010YOUY01 changed the title feat: Predicate pruning via statistics propagation feat: Prune complex/nested predicates via statistics propagation Jan 2, 2026
@adriangb (Contributor) left a comment


Some initial comments. Need to read a couple more times to actually wrap my head around how it's working.

Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?

Comment on lines +251 to +252
pub range_stats: Option<RangeStats>,
pub null_stats: Option<NullStats>,
Contributor:

I think it's important to point out that if null stats are missing (NullPresence::UnknownOrMixed), we cannot make any inferences from the min/max values; they should be treated as missing as well.

Contributor Author:

The actual inference logic is more aggressive than the algorithm you have described, it's implemented in https://github.com/apache/datafusion/pull/19609/changes#diff-32f7f18dcd86a268e7e1e0134eae6ae002bd42e61180cfabd60944566b10f6d8R660

I'll add more comments here also.

///
/// # Errors
/// Returns an Internal Error if an unsupported operator is provided.
fn compare_ranges(
Contributor:

Some unit tests for this method specifically that ensure 100% coverage would be great

Contributor Author:

I have checked, it has reached 100% test line coverage by

cargo +nightly llvm-cov \
    --package datafusion-physical-expr \
    --test pruning \
    --all-features \
    --html \
    --open \
    -- --nocapture

It's covered by the higher-level pruning API tests rather than by unit tests directly on this function. The benefit is that the test coverage won't be lost when we change the implementation to a vectorized compare_ranges_vectorized.

@2010YOUY01 (Contributor Author) commented Jan 3, 2026

Thank you for the review; the feedback makes sense to me. I'll apply the fixes in a batch later.

Some initial comments. Need to read a couple more times to actually wrap my head around how it's working.

Please let me know if anything is unclear. I’m trying to make both the implementation and the documentation clearer, but the logic and edge cases for this feature are admittedly quite tricky.

Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?

Yes — the initial milestone should be reaching coverage equivalent to the existing PruningPredicate implementation, so we can reuse the existing tests and gain more confidence.

@alamb (Contributor) commented Jan 6, 2026

I plan to review this PR carefully tomorrow. I am sorry for the delay but I have been out for a few days and I am quite backed up

@alamb (Contributor) left a comment

Thank you @2010YOUY01 -- I think this API is very nice and quite clever, and I think it is the right basic approach to generalized, vectorized range-based analysis.

I have a bunch of API suggestions I think we can iterate on and make the code better / cleaner.

The biggest challenge in this project, in my mind, is that it proposes to add (yet) another API for some sort of range/statistics propagation.

Even before this PR, we already have 4 APIs on the PhysicalExpr

Some of these methods seem to be unused in the core codebase (e.g. the "propagate" variants). Some of the new V2 statistics API was added in #14699 by @Fly-Style, but I don't see any uses of it in the code (https://github.com/apache/datafusion/blob/998f534aafbf55c83daaa6fd4985ba143954b0e0/datafusion/physical-expr/src/statistics/stats_solver.rs#L39-L38). It also has provisions for various statistical distributions for which I still don't understand the use case.

If we are going to add a new API, I think we should deprecate some/all of the others.

//! about one container and may track richer distribution details.
//! Pruning must reason about *all* containers (potentially thousands) to decide
//! which to skip, so it favors a vectorized, array-backed representation with
//! lighter-weight stats. These are intentionally separate interfaces.
Contributor:

Is there any fundamental reason they need to be separate interfaces?

Like I am thinking, is there some potential future where we are able to rewrite [PhysicalExpr::evaluate_bounds] to use the new API in this PR?

That way having multiple APIs would be only a temporary, intermediate state as we worked to fill out the rest of the functionality 🤔

Contributor:

After reading this code more, I don't think there is any fundamental difference (other than being vectorized) with evaluate_bounds

false
}

/// Evaluates pruning statistics via propagation. See the pruning module
Contributor:

I realize the primary usecase for this evaluation is pruning, but I think it is a more general concept -- basically propagating statistical information through this expression

What would you think of calling this more like propagate_ranges? (I realize it is getting very similar to evaluate_ranges and propagate_constraints...)

Contributor Author:

How about propagate_pruning_statistics()? Since it covers a richer set of statistics than just ranges, and we can use _pruning_ to make it less ambiguous.

Contributor:

If the only thing it will ever be used for is pruning then putting pruning in the name makes sense

I still have (not so) secret hopes, that we can somehow unify these range / expression analysis APIs as they all seem so similar in theory

Contributor Author:

I changed it to evaluate_statistics_vectorized in 9d256a5, it makes sense to make it more general, and potentially unify those APIs.

/// implemented pruning; returning `None` signals that no pruning statistics
/// are available.
///
/// In the future, propagation may expose dedicated APIs such as:
Contributor:

Rather than different APIs, I would recommend a single API propagate_ranges and add the different types of information in the object that is propagated

Contributor Author:

This is an excellent idea, we should do this in the future.

}
}

/// Pruning intermediate type propagated through `PhysicalExpr` nodes.
Contributor:

I am not sure we need to distinguish between intermediate statistics and intermediate results

Specifically, I think ColumnStats for boolean expressions will be trivially convertible to PruningResults (if the min/max are both true then we know the boolean value is always true. If the min/max are both false then we know the value is always false, etc)

This is similar to how Interval works: https://github.com/apache/datafusion/blob/81512da2b0aaa474f6c4ba205b05eea7b3095176/datafusion/expr-common/src/interval_arithmetic.rs#L182-L181
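The conversion described here could, as a rough sketch (names hypothetical, not the PR's actual types), look like:

```rust
/// Tri-state pruning outcome (illustrative name).
#[derive(Debug, PartialEq)]
pub enum PruningResult {
    AlwaysTrue,
    AlwaysFalse,
    Unknown,
}

/// Collapse boolean min/max stats into a pruning outcome, analogous to
/// how an Interval of [true, true] or [false, false] is certainly known.
pub fn bool_stats_to_result(min: bool, max: bool) -> PruningResult {
    match (min, max) {
        (true, true) => PruningResult::AlwaysTrue,    // all values are true
        (false, false) => PruningResult::AlwaysFalse, // all values are false
        _ => PruningResult::Unknown,                  // mixed true/false
    }
}
```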

Contributor Author:

Yes, this is doable. To encode PruningResult, we only need a single inner Option<BooleanArray>. The key consideration, I think, is which approach leads to a simpler implementation.

The answer is not clear to me yet; I will build a small prototype to confirm.

Contributor:

Using BooleanArray I think will make handling all other expressions easier -- implementations of each expression will not have to pick between results or ranges, they will all use statistics

With some wrappers to interpret BooleanArray I think the APIs could be quite nice

Contributor Author:

I thought about it carefully again, and I don’t think it is actually simpler.

It does reduce the number of lines of code for most arithmetic expressions. However, for predicate expressions, we still have to inspect the semantic meaning inside column statistics. This is not really “eliminating a special case by definition”; rather, it is trying to encode multiple concepts into a single struct. I think the downside of this approach is that it weakens type safety and also makes predicate nodes harder to reason about.

To simplify the implementation and avoid repetitively checking enum variants, we can probably add some utility functions.

#[derive(Debug, Clone)]
pub struct ColumnStats {
    pub range_stats: Option<RangeStats>,
    pub null_stats: Option<NullStats>,
    pub evaluate_results: Option<BooleanArray>, // <--- Change here. We are adding an implicit constraint that
                                                // if this is Some, then this node no longer represents
                                                // statistics, and all other stat fields must be None.
    /// Number of containers. Needed to infer result if all stats types are `None`.
    pub num_containers: usize,
}

Another consideration is that we might want a parent enum in the future for extensibility. It could be used to pass control information if we want to implement certain optimizations, or if we want to unify this PR’s stat propagation API with the existing ones, where additional intermediate stat variants may be needed.
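For contrast, the parent-enum shape could look roughly like this (a hedged sketch; the empty structs are placeholders for the PR's actual array-backed statistics types and arrow's BooleanArray):

```rust
/// Placeholders for the PR's array-backed statistics types.
#[derive(Debug)]
pub struct RangeStats {}
#[derive(Debug)]
pub struct NullStats {}
/// Placeholder for arrow's BooleanArray.
#[derive(Debug)]
pub struct BooleanArray {}

/// Parent enum: statistics and predicate results stay separate,
/// type-safe variants, so a node carrying evaluated results cannot be
/// confused with one carrying statistics.
#[derive(Debug)]
pub enum PruningIntermediate {
    IntermediateStats {
        range_stats: Option<RangeStats>,
        null_stats: Option<NullStats>,
        num_containers: usize,
    },
    IntermediateResult(BooleanArray),
}
```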

};
match child {
PruningIntermediate::IntermediateStats(stats) => {
if let Some(null_stats) = stats.null_stats() {
Contributor:

I swear that @pepijnve recently implemented very similar logic (for the bounds evaluation of IsNull, albeit one row at a time) but now I can't find it...

@alamb (Contributor) commented Jan 8, 2026

My suggested next step is to decide if we have the effort / motivation to see it through (I am willing to help review, organize, and build consensus, as I think it is a foundational piece of DataFusion)

If we do want to proceed, I think the first thing we should do is figure out how we will eventually unify the existing APIs that have overlap (specifically PruningPredicate and PhysicalExpr::evaluate_bounds, possibly the propagate statistics code too)

I think as long as there is a realistic way towards a unified framework we can then start pounding out the code

@alamb (Contributor) commented Jan 8, 2026

FYI @ozankabak and @berkaysynnada as you have been instrumental in previous versions of statistics and range analysis

@2010YOUY01 (Contributor Author)

Let's work through the API design first. After we agree to move forward, I'll apply all minor code review suggestions.

My suggested next step is to decide if we have the effort / motivation to see it through (I am willing to help review, organize, and build consensus, as I think it is a foundational piece of DataFusion)

I do plan to spend significant effort on the future implementation, as I believe this is a very important optimization feature. Thank you for your help, @alamb!

I have a bunch of API suggestions I think we can iterate on and make the code better / cleaner.

The biggest challenge in this project, in my mind, is that it proposes to add (yet) another API for some sort of range/statistics propagation.

Even before this PR, we already have 4 APIs on the PhysicalExpr

Some of these methods seem to be unused in the core codebase (e.g. the "propagate" variants). Some of the new V2 statistics API was added in #14699 by @Fly-Style, but I don't see any uses of it in the code (https://github.com/apache/datafusion/blob/998f534aafbf55c83daaa6fd4985ba143954b0e0/datafusion/physical-expr/src/statistics/stats_solver.rs#L39-L38). It also has provisions for various statistical distributions for which I still don't understand the use case.

If we are going to add a new API, I think we should deprecate some/all of the others.

I took a quick look at those APIs; their high-level ideas are:

  • evaluate_bounds() propagates bounds using the Interval type.
  • propagate_constraints() performs inverse propagation on Intervals. For example, given the output range of an expression and the original input ranges for its child expressions, it performs inverse propagation to refine the children’s ranges (its comment includes a specific example).
  • evaluate_statistics() is similar to evaluate_bounds(), but operates on Distribution inputs to capture richer statistical distribution information.
  • propagate_statistics() is similar to propagate_constraints(), but operates on Distribution inputs.

The API proposed in this PR can replace evaluate_bounds(), because it covers the full functionality of evaluate_bounds(). The only difference is that it is vectorized; we can use a simple adaptor for conversion, and evaluate_bounds() can be deprecated in the future.
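As a rough illustration of the vectorized difference (purely hypothetical names and signature, not this PR's API), bounds for all containers can be evaluated in one pass instead of one Interval at a time:

```rust
/// Vectorized bounds propagation for `a + b`: instead of a single
/// Interval, operate on per-container min/max slices (hypothetical
/// sketch; real code would handle overflow and nulls).
pub fn add_bounds(
    a_min: &[i64],
    a_max: &[i64],
    b_min: &[i64],
    b_max: &[i64],
) -> (Vec<i64>, Vec<i64>) {
    // For addition, min and max bounds combine element-wise.
    let out_min = a_min.iter().zip(b_min).map(|(x, y)| x + y).collect();
    let out_max = a_max.iter().zip(b_max).map(|(x, y)| x + y).collect();
    (out_min, out_max)
}
```

An adaptor for the scalar case would simply pass slices of length one.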

propagate_constraints() cannot be replaced, since the statistics propagation for pruning (this PR) does not require inverse propagation.
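For context, inverse propagation refines child ranges from a known output range; for `x + y` a minimal scalar sketch (illustrative, not the Interval API) is:

```rust
/// Intersect two closed integer ranges.
fn intersect(a: (i64, i64), b: (i64, i64)) -> (i64, i64) {
    (a.0.max(b.0), a.1.min(b.1))
}

/// Given `out = x + y` with a known output range, refine the ranges of
/// the children: x ⊆ out - y and y ⊆ out - x (range subtraction).
pub fn propagate_add(
    out: (i64, i64),
    x: (i64, i64),
    y: (i64, i64),
) -> ((i64, i64), (i64, i64)) {
    let x_new = intersect(x, (out.0 - y.1, out.1 - y.0));
    let y_new = intersect(y, (out.0 - x.1, out.1 - x.0));
    (x_new, y_new)
}
```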

evaluate_statistics() and propagate_statistics() cannot be replaced by this PR, because their representations do not overlap. Even more advanced pruning will not need the statistical information inside the Distribution struct.

I did some archaeology, but I am not very sure. If evaluate_statistics() and propagate_statistics() are intended to replace evaluate_bounds() and propagate_constraints(), and the latter two are no longer used inside the DataFusion core, should we simply deprecate them?

@2010YOUY01 (Contributor Author)

If we do want to proceed, I think the first thing we should do is figure out how we will eventually unify the existing APIs that have overlap (specifically PruningPredicate and PhysicalExpr::evaluate_bounds, possibly the propagate statistics code too)

Regarding PruningPredicate, I think its major API can eventually remain the same, while its implementation should be replaced with this statistics-propagation-based pruning. Some minor API changes are inevitable.

@ozankabak (Contributor) commented Jan 9, 2026

If evaluate_statistics() and propagate_statistics() are intended to replace evaluate_bounds() and propagate_constraints(), and the latter two are no longer used inside the DataFusion core, should we simply deprecate them?

They are not. The latter are the simplest building blocks that calculate ranges, the former are designed to use them to generate richer statistical information. Indeed evaluate_statistics() and propagate_statistics() are not used yet in the codebase, because we unfortunately didn't have time to work on it over the last 6 months or so, but have an intention to come back and finish the job.

I didn't have time yet to look at this big PR, but I looked at the issue and design. As a general thought, I think replacing the evaluate_bounds/propagate_constraints duo will not be easy. The Interval library makes very careful calculations w.r.t. things like rounding (with floats etc.), because their results are used to take branches in contexts like pruning join hash tables that may operate on data like floats. Support for such rounding-aware calculations was not present in arrow-rs at the time when we created these APIs (and is still not there, if I'm not missing something).
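The rounding concern can be made concrete: to preserve containment with floats, each computed bound can be widened by one ULP in the outward direction. A hedged standalone sketch (the Interval library's actual mechanism may differ):

```rust
/// Step to the next representable f64 toward +infinity.
fn next_up(x: f64) -> f64 {
    if x.is_nan() || x == f64::INFINITY {
        return x;
    }
    if x == 0.0 {
        // +0.0 or -0.0: next value up is the smallest positive subnormal.
        return f64::from_bits(1);
    }
    let bits = x.to_bits();
    if x > 0.0 {
        f64::from_bits(bits + 1) // positive: larger bit pattern is larger value
    } else {
        f64::from_bits(bits - 1) // negative: smaller magnitude is larger value
    }
}

/// Step to the next representable f64 toward -infinity.
fn next_down(x: f64) -> f64 {
    -next_up(-x)
}

/// Interval addition with outward rounding: the lower bound is rounded
/// down and the upper bound up, so floating-point rounding error can
/// never shrink the interval and break containment.
pub fn add_intervals(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
    (next_down(a.0 + b.0), next_up(a.1 + b.1))
}
```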

So, we have three facts about our needs:

  1. Rigor (w.r.t. containment) in bounds is a real need in some use cases (like join pruning).
  2. Vectorization is a real need in other use cases (like Parquet pruning).
  3. Some use cases only need bounds evaluation (upwards traversal of the expression graph) while others also need propagation of constraints (downwards traversal).

A good first step is to add vector_evaluate_bounds, and note these nuances in the docs. Then, if you think it is possible to somehow achieve containment rigor w.r.t. rounding in the vectorized version too, we can try our hand at that, and then implement its propagate_constraints counterpart. Only at that stage can we remove the evaluate_bounds/propagate_constraints duo. Then, evaluate_statistics() and propagate_statistics() can migrate to the new API.

However, my hunch is that this won't be possible for the time being, and we will need a new API that does vectorized bounds, making it clear in the docs that we need the two APIs to cater to 1 and 2.

I hope this context helps @2010YOUY01 and @alamb

@2010YOUY01 (Contributor Author)

I didn't have time yet to look at this big PR, but I looked at the issue and design. As a general thought, I think replacing the evaluate_bounds/propagate_constraints duo will not be easy. The Interval library makes very careful calculations w.r.t. things like rounding (with floats etc.), because their results are used to take branches in contexts like pruning join hash tables that may operate on data like floats. Support for such rounding-aware calculations was not present in arrow-rs at the time when we created these APIs (and is still not there, if I'm not missing something).

This is a really good point. I hadn't considered rounding safety so far; I'll make sure to include it in the vectorized version as well.

By the way, do you have any references on the high-level ideas behind “join pruning,” and why we need the inverse path (propagate_constraints())? @ozankabak Just out of my curiosity.

Thanks for the context. For now, I think we shouldn’t touch the existing statistics propagation APIs and should introduce a vectorized one in this work. I’ll add more documentation to explain the rationale.

@ozankabak (Contributor)

As an example, when one implements a join on an ordered table with a "sliding window" condition, propagate_constraints becomes instrumental in finding the upper (or lower) bounds to prune the join hash map.

IMO this blog post explains the ideas well.

@alamb (Contributor) commented Jan 9, 2026

If we do want to proceed, I think the first thing we should do is figure out how we will eventually unify the existing APIs that have overlap (specifically PruningPredicate and PhysicalExpr::evaluate_bounds, possibly the propagate statistics code too)

Regarding PruningPredicate, I think its major API can eventually remain the same, while its implementation should be replaced with this statistics-propagation-based pruning. Some minor API changes are inevitable.

Yes this sounds like a great plan

I still really feel that we can unify these APIs somehow. Starting with vectorized_evaluate_bounds as suggested by @ozankabak seems like a good step in that direction

Also I should be clear: my concern isn't just that multiple implementations are harder to maintain, it is also that we already have significant code and test coverage for single-row range analysis -- so replicating it again in vectorized fashion entirely separately won't leverage the past experience, and may result in different behaviors between the two paths

@2010YOUY01 2010YOUY01 marked this pull request as draft January 12, 2026 10:05
@2010YOUY01 (Contributor Author)

All review feedback has been addressed (except for #19609 (comment), which might need further discussion).

It’s ready for another look.

@2010YOUY01 2010YOUY01 marked this pull request as ready for review January 12, 2026 10:33