feat: Prune complex/nested predicates via statistics propagation #19609
adriangb left a comment:
Some initial comments. Need to read a couple more times to actually wrap my head around how it's working.
Is the plan to make multiple subsequent PRs to add more handling, e.g. for Like expressions, UDFs, etc., and then eventually, once we reach feature parity, replace the current system?
pub range_stats: Option<RangeStats>,
pub null_stats: Option<NullStats>,
I think it's important to point out that if the null stats are missing (NullPresence::UnknownOrMixed), we cannot make any inferences from the min/max values; they should be treated as missing as well.
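A minimal sketch of the rule as described here (the PR's actual inference logic may differ); the presence field name is an assumption, while ColumnStats, RangeStats, and NullPresence::UnknownOrMixed come from the PR:
fn range_usable_for_inference(stats: &ColumnStats) -> Option<&RangeStats> {
    match &stats.null_stats {
        // Null info missing or unknown/mixed: min/max cannot be trusted,
        // so treat the range statistics as missing as well.
        None => None,
        Some(n) if matches!(n.presence, NullPresence::UnknownOrMixed) => None,
        // Null picture is known: the range statistics may be used.
        Some(_) => stats.range_stats.as_ref(),
    }
}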
The actual inference logic is more aggressive than the algorithm you have described; it's implemented in https://github.com/apache/datafusion/pull/19609/changes#diff-32f7f18dcd86a268e7e1e0134eae6ae002bd42e61180cfabd60944566b10f6d8R660
I'll add more comments here also.
///
/// # Errors
/// Returns Internal Error if unsupported operator is provided.
fn compare_ranges(
Some unit tests for this method specifically that ensure 100% coverage would be great
I have checked; it has reached 100% test line coverage via:
cargo +nightly llvm-cov \
--package datafusion-physical-expr \
--test pruning \
--all-features \
--html \
--open \
-- --nocapture
Though it's covered by the higher-level pruning API tests rather than unit tests directly on it. The benefit is that the test coverage won't be lost when we change the implementation to a vectorized compare_ranges_vectorized.
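For reference, a direct unit test might look roughly like the following; the compare_ranges signature, RangeStats::new, and RangeCompareResult are assumptions for illustration, not the PR's actual API:
#[test]
fn compare_ranges_lt_disjoint() {
    // lhs entirely below rhs: lhs < rhs must hold for every possible row pair.
    let lhs = RangeStats::new(ScalarValue::from(1i32), ScalarValue::from(5i32));
    let rhs = RangeStats::new(ScalarValue::from(10i32), ScalarValue::from(20i32));
    let result = compare_ranges(Operator::Lt, &lhs, &rhs).unwrap();
    assert_eq!(result, RangeCompareResult::AlwaysTrue);
}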
Thank you for the review; that feedback makes sense to me. I'll address it in a batch later.
Please let me know if anything is unclear. I’m trying to make both the implementation and the documentation clearer, but the logic and edge cases for this feature are admittedly quite tricky.
Yes, the initial milestone should be reaching coverage equivalent to the existing system.
I plan to review this PR carefully tomorrow. I am sorry for the delay, but I have been out for a few days and I am quite backed up.
alamb left a comment:
Thank you @2010YOUY01 -- I think this API is very nice and quite clever and I think it is the right basic approach to generalized vectorized range based analysis.
I have a bunch of API suggestions I think we can iterate on and make the code better / cleaner.
The biggest challenge in this project, in my mind, is that it proposes to add (yet) another API for some sort of range/statistics propagation.
Even before this PR, we already have 4 APIs on PhysicalExpr:
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_statistics
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_constraints
- https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_statistics
Some of these methods seem to be unused in the core codebase (e.g. the "propagate" variants, and some of the new V2 statistics API added in #14699 by @Fly-Style -- I don't see any uses of it in the code: https://github.com/apache/datafusion/blob/998f534aafbf55c83daaa6fd4985ba143954b0e0/datafusion/physical-expr/src/statistics/stats_solver.rs#L39-L38). It also has provisions for various statistical distributions for which I still don't understand the use case.
If we are going to add a new API, I think we should deprecate some/all of the others.
//! about one container and may track richer distribution details.
//! Pruning must reason about *all* containers (potentially thousands) to decide
//! which to skip, so it favors a vectorized, array-backed representation with
//! lighter-weight stats. These are intentionally separate interfaces.
Is there any fundamental reason they need to be separate interfaces?
Like I am thinking, is there some potential future where we are able to rewrite [PhysicalExpr::evaluate_bounds] to use the new API in this PR?
That way having multiple APIs would be only a temporary, intermediate state as we worked to fill out the rest of the functionality 🤔
After reading this code more, I don't think there is any fundamental difference (other than being vectorized) with evaluate_bounds
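For comparison, a rough sketch of the two shapes side by side; the evaluate_bounds signature mirrors the existing PhysicalExpr method, while the vectorized signature is only an illustration (not this PR's exact API):
// Sketch for comparison only.
trait RangeAnalysisSketch {
    // Existing scalar API: one Interval per child, one Interval out.
    fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval>;

    // Illustrative vectorized counterpart: each ColumnStats holds arrays
    // with one entry per container, and the output is again per-container.
    fn evaluate_statistics_vectorized(
        &self,
        children: &[&ColumnStats],
    ) -> Result<ColumnStats>;
}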
    false
}

/// Evaluates pruning statistics via propagation. See the pruning module
I realize the primary use case for this evaluation is pruning, but I think it is a more general concept -- basically propagating statistical information through this expression.
What would you think of calling this something more like propagate_ranges? (I realize it is getting very similar to evaluate_ranges and propagate_constraints...)
How about propagate_pruning_statistics()? It covers a rich set of statistics beyond just ranges, and we can use _pruning_ to make it less ambiguous.
If the only thing it will ever be used for is pruning then putting pruning in the name makes sense
I still have (not so) secret hopes that we can somehow unify these range / expression analysis APIs, as they all seem so similar in theory.
I changed it to evaluate_statistics_vectorized in 9d256a5; it makes sense to make it more general, and potentially unify those APIs.
/// implemented pruning; returning `None` signals that no pruning statistics
/// are available.
///
/// In the future, propagation may expose dedicated APIs such as:
Rather than different APIs, I would recommend a single API, propagate_ranges, and adding the different types of information to the object that is propagated.
This is an excellent idea, we should do this in the future.
}
}

/// Pruning intermediate type propagated through `PhysicalExpr` nodes.
I am not sure we need to distinguish between intermediate statistics and intermediate results
Specifically, I think ColumnStats for boolean expressions will be trivially convertible to PruningResults (if the min/max are both true then we know the boolean value is always true. If the min/max are both false then we know the value is always false, etc)
This is similar to how Interval works: https://github.com/apache/datafusion/blob/81512da2b0aaa474f6c4ba205b05eea7b3095176/datafusion/expr-common/src/interval_arithmetic.rs#L182-L181
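For illustration, a small sketch of the convertibility being described; the function and type names are hypothetical:
// If a boolean expression's per-container min and max are both known,
// the pruning verdict for that container follows directly.
fn verdict_from_bool_minmax(min: Option<bool>, max: Option<bool>) -> Option<bool> {
    match (min, max) {
        (Some(true), Some(true)) => Some(true),    // predicate always true for this container
        (Some(false), Some(false)) => Some(false), // predicate always false: container can be pruned
        _ => None,                                 // mixed or unknown: cannot decide
    }
}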
Yes, this is doable. To encode PruningResult, we only need a single inner Option<BooleanArray>. The key consideration, I think, is which approach leads to a simpler implementation.
The answer is not clear to me yet; I will build a small prototype to confirm.
Using BooleanArray I think will make handling all other expressions easier -- implementations of each expression will not have to pick between results or ranges, they will all use statistics
With some wrappers to interpret BooleanArray I think the APIs could be quite nice.
I thought about it carefully again, and I don’t think it is actually simpler.
It does reduce the number of lines of code for most arithmetic expressions. However, for predicate expressions, we still have to inspect the semantic meaning inside column statistics. This is not really “eliminating a special case by definition”; rather, it is trying to encode multiple concepts into a single struct. I think the downside of this approach is that it weakens type safety and also makes predicate nodes harder to reason about.
To simplify the implementation and avoid repetitive enum-variant checking, we can probably add some utility functions.
#[derive(Debug, Clone)]
pub struct ColumnStats {
    pub range_stats: Option<RangeStats>,
    pub null_stats: Option<NullStats>,
    // <--- Change here. We are adding an implicit constraint that if this is
    // Some, then this node no longer represents statistics, and all other
    // stat fields must be None.
    pub evaluate_results: Option<BooleanArray>,
    /// Number of containers. Needed to infer the result if all stats types are `None`.
    pub num_containers: usize,
}

Another consideration is that we might want a parent enum in the future for extensibility. It could be used to pass control information if we want to implement certain optimizations, or if we want to unify this PR's stat propagation API with the existing ones, where additional intermediate stat variants may be needed.
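A rough sketch of what such a parent enum might look like; PruningIntermediate and its IntermediateStats variant appear in the PR diff below, while the second variant here is only a guess at the extensibility idea:
enum PruningIntermediate {
    // Statistics propagated up through the expression tree (as in the PR).
    IntermediateStats(ColumnStats),
    // Hypothetical: a per-container verdict once an expression's outcome is
    // fully known, e.g. a BooleanArray of keep/prune flags.
    IntermediateResult(BooleanArray),
    // Future variants could carry control information or richer stats when
    // unifying with the existing propagation APIs.
}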
};
match child {
    PruningIntermediate::IntermediateStats(stats) => {
        if let Some(null_stats) = stats.null_stats() {
I swear that @pepijnve recently implemented very similar logic (for the bounds evaluation of IsNull, albeit one row at a time) but now I can't find it...
My suggested next step is to decide if we have the effort / motivation to see it through (I am willing to help review, organize, and build consensus, as I think it is a foundational piece of DataFusion). If we do want to proceed, I think the first thing we should do is figure out how we will eventually unify the existing APIs that have overlap (specifically, I think as long as there is a realistic way towards a unified framework, we can then start pounding out the code).
FYI @ozankabak and @berkaysynnada as you have been instrumental in previous versions of statistics and range analysis.
Let's work through the API design first. After we agree to move forward, I'll apply all minor code review suggestions.
I do plan to spend significant effort on the future implementation, as I believe this is a very important optimization feature. Thank you for your help, @alamb!
I took a quick look at those APIs; their high-level ideas are:
The API proposed in this PR can replace
I did some archaeology, but I am not very sure. If
Regarding
They are not. The latter are the simplest building blocks that calculate ranges; the former are designed to use them to generate richer statistical information.
Indeed, I didn't have time yet to look at this big PR, but I looked at the issue and design. As a general thought, I think replacing the
So, we have two facts about our needs:
A good first step is to add
However, my hunch is that this won't be possible for the time being, and we will need a new API that does vectorized bounds, and make it clear in the docs that we need the two APIs to cater to 1 and 2. I hope this context helps @2010YOUY01 and @alamb
This is a really good point. I didn't consider rounding safety so far; I'll make sure to include it in the vectorized version also. By the way, do you have any references on the high-level ideas behind "join pruning", and why we need the inverse path (
Thanks for the context. For now, I think we shouldn't touch the existing statistics propagation APIs and should introduce a vectorized one in this work. I'll add more documentation to explain the rationale.
As an example, when one implements a join on an ordered table with a "sliding window" condition, the
IMO this blog post explains the ideas well.
Yes, this sounds like a great plan. I still really feel that we can unify these APIs somehow, starting with
Also, I should be clear: my concern isn't just that the multiple implementations are harder to maintain; it is also that we already have significant code and test coverage for single-row range analysis -- so replicating it again in vectorized fashion entirely separately won't leverage the past experience, and may result in different behaviors between the two paths.
All review feedback has been addressed (except for #19609 (comment), which might need further discussion). It's ready for another look.
Which issue does this PR close?
Rationale for this change
See the issue for the rationale, and design considerations.
For the PR structure, start with the module-level comment in datafusion/physical-expr-common/src/physical_expr/pruning.rs and follow along.
What changes are included in this PR?
The core change in this PR is only a few hundred LoC by estimation; the PR diff is mainly tests and docs.
And we now support pruning for expressions like
The issue also includes some thoughts on future implementation plans.
Are these changes tested?
Unit tests.
Are there any user-facing changes?
No