-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add short circuit evaluation for AND
and OR
#15462
base: main
Are you sure you want to change the base?
Conversation
Some tests failed, so let me take a look at what exactly is going on. |
I think one issue is that the short-circuit logic is not handling cases where the the |
Thank you very much for your hint, it will be very helpful for me to fix these tests! |
After taking a closer look, in fact, the situation you mentioned does not actually lead to the short-circuit optimization logic. |
You're absolutely right, I got my logic wrong there. Embarrasing! |
It's okay. You've also taught me a lot. When I first started writing this, I really didn't consider the case of null |
Hello @alamb, the optimization SQL and documentation related to this PR have been completed, and all tests have passed. We may need to formally verify the performance, but I'm not quite sure how to do that (I can only run it locally). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @acking-you -- this looks really nice.
I think this needs some tests but otherwise it is looking quite nice.
Also, could you please add the new Q6 benchmark in a separate PR so I can more easily run my benchmark scripts before/after your code change?
|
||
### Q6: How many social shares meet complex multi-stage filtering criteria? | ||
**Question**: What is the count of sharing actions from iPhone mobile users on specific social networks, within common timezones, participating in seasonal campaigns, with high screen resolutions and closely matched UTM parameters? | ||
**Important Query Properties**: Simple filter with high-selectivity, Costly string matching, A large number of filters with high overhead are positioned relatively later in the process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { | |||
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { | |||
use arrow::compute::kernels::numeric::*; | |||
|
|||
fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @acking-you -- this looks great
Is there any reason to have this function defined in the evaluate
method? I think we could just make it a normal function and reduce the nesting level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count
/ true_count
-- like for example if the rhs arg is "complex" (not a Column for example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to have this function defined in the evaluate method?
There was no particular reason. Maybe I couldn't find a suitable place to write it at the time, haha. Where do you think this function should be placed?
If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)
I also agree that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the function outside and added some comments.
Okey,I got it.Do you mean that Q6 and its related description in the current branch need to be completed with a separate PR? But this way, it seems that you would still need to cherry-pick Q6 to the corresponding branch when testing. |
027a772
to
5de0f36
Compare
…ultiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option
* Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]>
…ultiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option
* Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]>
5de0f36
to
575c3f3
Compare
match arg { | ||
ColumnarValue::Array(array) => { | ||
if let Ok(array) = as_boolean_array(&array) { | ||
return array.false_count() == array.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember this had some overhead (for calculating the counts) from a previous try.
I wonder if it helps to short optimize this expression (e.g. match until we get a chunk of the bitmap != 0
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it helps to short optimize this expression (e.g. match until we get a chunk of the bitmap != 0)
I think the overhead added here should be very small (the compiler optimization should work well), and the test results we discussed before were sometimes fast and sometimes slow (maybe noise).
Your suggestion of making an early judgment and returning false seems like a good idea, but I'm not sure if it will be effective.
The concern I have with this approach is that it requires adding an if
condition inside the for
loop, which will most likely disable the compiler's SIMD instruction optimization (I've encountered a similar situation before, and I had to manually unroll the SIMD...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, we can use bool_and
(https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) and bool_or
which operates on u64
values to test performance changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, we can use
bool_and
(https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) andbool_or
which operates onu64
values to test performance changes.
Thank you for your suggestion. I will try it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be overkill, but one could try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.
Almost seems like something the compiler could automagically do...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be overkill, but one could try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.
Thank you for your suggestion, but if we're only applying conditional checks to the first few blocks, then I feel this optimization might not be meaningful. If nearly all blocks can be filtered out by the preceding filter, the optimization will no longer be effective.
If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)
I tend to agree with @alamb's point that if the overhead of verification is somewhat unacceptable, adopting some heuristic approaches would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked more carefully at bool_or and I do think it would be faster than this implementation on the case where there are some true values (as it stops as soon as it finds a single non zero): https://docs.rs/arrow/latest/arrow/compute/fn.bool_or.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @acking-you -- I am running the benchmarks now and assuming this shows now performance regressions I think we should merge it in
We can perhaps try @Dandandan 's suggestion for bool_and
etc as a follow on PR / issue (I can file a follow on issue).
I looked for a binary expression and/or micro benchmark and could not find one: https://github.com/apache/datafusion/tree/c929a1cd133590e4944bc2c7900611220450335a/datafusion/physical-expr/benches
Thank you for your suggestions and help. I also noticed this part of the code yesterday. I've had a cold and fever in the past few days, so I haven't tried the method mentioned by @Dandandan for benchmarking yet. I should be able to work on it today. |
I hope you feel better soon I have been running some benchmark numbers (below) and I have gotten some strange results. I definitely see this PR improves performance on the newly added clickbench extended benchmark (almost 3x faster!) However Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is good to go - thank you for the nice improvement @acking-you
There is a very nice 3x improvement in the extended clickbench benchmarks
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ add_short_circuit ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 1914.62ms │ 1954.11ms │ no change │
│ QQuery 1 │ 757.11ms │ 760.32ms │ no change │
│ QQuery 2 │ 1478.24ms │ 1495.54ms │ no change │
│ QQuery 3 │ 700.15ms │ 705.43ms │ no change │
│ QQuery 4 │ 1489.20ms │ 1457.77ms │ no change │
│ QQuery 5 │ 17258.06ms │ 17048.63ms │ no change │
│ QQuery 6 │ 6872.16ms │ 2318.32ms │ +2.96x faster │
└──────────────┴────────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base) │ 30469.53ms │
│ Total Time (add_short_circuit) │ 25740.13ms │
│ Average Time (main_base) │ 4352.79ms │
│ Average Time (add_short_circuit) │ 3677.16ms │
│ Queries Faster │ 1 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 6 │
└──────────────────────────────────┴────────────┘
And I didn't find any reproduceable slowdown for other queries.
I do think it is worth switching this to use bool_or
/ bool_and
to avoid having to count bits (only detect if there is one) but we could do it as a follow on PR as well
match arg { | ||
ColumnarValue::Array(array) => { | ||
if let Ok(array) = as_boolean_array(&array) { | ||
return array.false_count() == array.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked more carefully at bool_or and I do think it would be faster than this implementation on the case where there are some true values (as it stops as soon as it finds a single non zero): https://docs.rs/arrow/latest/arrow/compute/fn.bool_or.html
Which issue does this PR close?
BinaryExpr
evaluate lacks optimization forOr
andAnd
scenarios #11212.Rationale for this change
What changes are included in this PR?
BinaryExpr
evaluate lacks optimization forOr
andAnd
scenarios #11212 (comment) .)Below is the performance comparison of running the extended SQL locally. It seems there is also some improvement in Q4 (maybe noise).
At the same time, while creating this SQL, I also discovered a bug — one of the filter caused a panic: #15461
Are these changes tested?
Are there any user-facing changes?