add specialized InList implementations for common scalar types #18832
base: main
Pull Request Overview
This PR adds specialized StaticFilter implementations for common scalar types to optimize IN LIST operations in DataFusion. Previously, only Int32 had a specialized filter; now Int8, Int16, Int64, UInt8, UInt16, UInt32, UInt64, Boolean, Utf8, LargeUtf8, Utf8View, Binary, LargeBinary, and BinaryView all have optimized implementations.
Key changes:
- Introduced two macros (primitive_static_filter! and define_static_filter!) to generate specialized filter implementations, eliminating code duplication
- Extended instantiate_static_filter to route 15 additional data types to their specialized implementations
- Refactored the in_list function to use instantiate_static_filter instead of defaulting to the generic ArrayStaticFilter
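As a rough illustration of the macro approach described above, here is a minimal std-only sketch. It assumes a hash-set-backed filter per primitive type; the `SetFilter` trait, the `primitive_set_filter!` macro, and the generated type names are hypothetical stand-ins and do not reproduce the PR's `StaticFilter` trait or its actual macros.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a per-type filter trait.
/// This is NOT DataFusion's actual `StaticFilter` signature.
trait SetFilter<T> {
    /// Returns one bool per input value: membership, flipped when `negated`.
    fn contains(&self, values: &[T], negated: bool) -> Vec<bool>;
}

/// Generates one hash-set-backed filter type per primitive type,
/// so the lookup is monomorphized instead of going through a generic path.
macro_rules! primitive_set_filter {
    ($name:ident, $ty:ty) => {
        struct $name {
            set: HashSet<$ty>,
        }

        impl $name {
            /// Builds the filter once from the literal IN list.
            fn new(list: &[$ty]) -> Self {
                Self { set: list.iter().copied().collect() }
            }
        }

        impl SetFilter<$ty> for $name {
            fn contains(&self, values: &[$ty], negated: bool) -> Vec<bool> {
                values
                    .iter()
                    .map(|v| self.set.contains(v) != negated)
                    .collect()
            }
        }
    };
}

// One line per supported type replaces a hand-written impl for each.
primitive_set_filter!(Int8Filter, i8);
primitive_set_filter!(Int64Filter, i64);
primitive_set_filter!(UInt32Filter, u32);
```

With this sketch, `Int64Filter::new(&[1, 5, 9]).contains(&[1, 2, 9], false)` evaluates `x IN (1, 5, 9)` for each input value. Null handling is deliberately left out here.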
@alamb maybe let's run benchmarks?
Here's what I'm seeing so far:
I think we'd need to add benchmarks for other primitive types.
Thinking about it the trick is probably to avoid the extra |
martin-g left a comment
LGTM!
Just some nits.
    }
    (false, false, false) => {
        // no nulls anywhere, not negated
        BooleanArray::from_iter(
BooleanBuffer::collect_bool is faster
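For context, arrow-rs provides `BooleanBuffer::collect_bool(len, f)`, which builds a bitmap by calling a closure once per index. The std-only sketch below models the idea under the assumption that the gain comes from accumulating up to 64 results in a single word before storing it, rather than updating the bitmap one bit at a time. `collect_bool_words` and `get_bit` are hypothetical names, and this is not the arrow-rs implementation.

```rust
/// Packs `len` predicate results into 64-bit words, least significant bit first.
/// Each word is assembled in a local variable and written to memory once.
fn collect_bool_words(len: usize, mut f: impl FnMut(usize) -> bool) -> Vec<u64> {
    let mut words = Vec::with_capacity((len + 63) / 64);
    let mut start = 0;
    while start < len {
        // The final chunk may hold fewer than 64 bits.
        let chunk = (len - start).min(64);
        let mut packed = 0u64;
        for bit in 0..chunk {
            packed |= (f(start + bit) as u64) << bit;
        }
        words.push(packed);
        start += chunk;
    }
    words
}

/// Reads bit `i` back out of the packed words.
fn get_bit(words: &[u64], i: usize) -> bool {
    (words[i / 64] >> (i % 64)) & 1 == 1
}
```

In the no-nulls arm quoted above, the closure would be the set lookup for the value at index `i`, for example `|i| set.contains(&values[i])`.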
Thank you. I know you or some other reviewer had pointed this out to me before. I am making a mental note not to forget again and to keep an eye out for it. Thanks for your patience.
I do wonder why we don't have faster high-level APIs if this is really important. E.g. BooleanArray::new_false, BooleanArray::new_nulls, BooleanArray::new_true and BooleanArray::collect_bool(size, iterator) or something like that.
We have been discussing various improvements:
This comment was marked as off-topic.
🤖: Benchmark completed
alamb left a comment
Thanks @adriangb -- this seems like an improvement to me
It would be nice if we could reduce some of the duplication in the tests, but I don't think that is a deal breaker 👍
I do think we should cover the no null cases with tests
Do you also plan to make special InList implementation for Utf8/Utf8View/LargeUtf8?
    }

    #[test]
    fn in_list_int8() -> Result<()> {
Can we please reduce the duplication in tests here? It seems like there are about 16 copies of the same test
Reducing the duplication will make it easier to understand what is being covered
        BooleanArray::new(builder.finish(), None)
    }
    (false, false, true) => {
        let values = v.values();
This code appears to be uncovered by tests. I tested using
cargo llvm-cov test --html -p datafusion-physical-expr --lib -- in_lis
Here is the whole report in case that is useful llvm-cov.zip
    }
    fn contains(&self, v: &dyn Array, negated: bool) -> Result<BooleanArray> {
        // Handle dictionary arrays by recursing on the values
        downcast_dictionary_array! {
I didn't see any tests for dictionaries 🤔
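To make concrete what a dictionary test would exercise, here is a minimal std-only sketch of "recursing on the values". It assumes the filter is evaluated once per distinct dictionary value and the result is then looked up through the keys. `in_list_dictionary` is a hypothetical helper using plain slices rather than Arrow's `DictionaryArray`, and nulls are ignored.

```rust
use std::collections::HashSet;

/// Evaluates `value IN set` for a dictionary-encoded column.
/// `keys[i]` is an index into `dict_values` for row `i`.
fn in_list_dictionary(
    keys: &[usize],
    dict_values: &[&str],
    set: &HashSet<&str>,
) -> Vec<bool> {
    // Step 1: run the filter once per distinct dictionary value.
    let per_value: Vec<bool> = dict_values.iter().map(|v| set.contains(v)).collect();
    // Step 2: expand to one result per row by following the keys.
    keys.iter().map(|&k| per_value[k]).collect()
}
```

A test along these lines would want repeated keys and at least one dictionary value that no key references, since both differ from the plain-array path.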
    }

    #[test]
    fn in_list_utf8_view() -> Result<()> {
this PR has tests for utf8 but no changes for those types. Is that your intention?
    }
    (true, _, true) | (false, true, true) => {
        // Either needle or haystack has nulls, negated
        BooleanArray::from_iter(v.iter().map(|value| {
It probably would be faster to handle the nulls separately or using set_indices rather than using BooleanArray::from_iter and v.iter etc.
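One way to read "handle the nulls separately" is to compute membership for every slot first and derive the validity mask in a second pass, following SQL three-valued logic for `IN` and `NOT IN`. The sketch below is std-only and uses `Option<i32>` and `Vec<bool>` in place of Arrow arrays and null buffers. `in_list_with_nulls` is a hypothetical name, and this is not the code in the PR.

```rust
use std::collections::HashSet;

/// SQL `IN` / `NOT IN` over a column of optional values.
/// Returns `(values, validity)`; `validity[i] == false` means row `i` is NULL,
/// in which case `values[i]` carries no meaning.
fn in_list_with_nulls(
    needles: &[Option<i32>],
    set: &HashSet<i32>,
    list_has_null: bool,
    negated: bool,
) -> (Vec<bool>, Vec<bool>) {
    // Pass 1: plain membership, treating a null needle as "not found".
    let found: Vec<bool> = needles
        .iter()
        .map(|n| n.map_or(false, |v| set.contains(&v)))
        .collect();

    // Pass 2: validity. The result is NULL when the needle is NULL, or when
    // there was no match and the list itself contains a NULL.
    let validity: Vec<bool> = needles
        .iter()
        .zip(&found)
        .map(|(n, &hit)| n.is_some() && (hit || !list_has_null))
        .collect();

    // Negation flips the values only; NULL rows stay NULL.
    let values: Vec<bool> = found.iter().map(|&hit| hit != negated).collect();
    (values, validity)
}
```

Splitting the work this way keeps the membership pass free of per-row `Option` branching, which is the part a bitmap-oriented API can pack efficiently.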
    let values = v.values();
    let mut builder = BooleanBufferBuilder::new(values.len());
    for value in values.iter() {
        builder.append(self.values.contains(value));
This unfortunately is slower than collect_bool. I see there is some good discussion on better APIs on apache/arrow-rs#8561
Dandandan left a comment
Nice to get some performance back.
Results seem a bit mixed?
Closes #18824