Description
Is your feature request related to a problem? Please describe.
In the Parquet reader, we can use cuco::bloom_filter_ref with a custom cuco::bloom_filter_policy to filter row groups when we have an equality predicate. This would allow us to potentially reduce I/O.
The custom cuco::bloom_filter_policy would need to implement Arrow's logic for generating the bit pattern and for selecting the bloom filter block for a given key. It could also be used in the future to write our own bloom filters to Parquet on the writer side.
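For reference, the block selection and bit pattern logic that such a policy has to reproduce can be sketched on the host as below. This is a minimal sketch based on the split-block Bloom filter layout in the Parquet format specification, which Arrow follows. It is not the cuco API: the names `sbbf_block`, `make_mask`, `block_index`, `sbbf_insert` and `sbbf_contains` are hypothetical, and the 64-bit hash is assumed to be computed elsewhere (the specification uses xxHash64 with seed 0).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One filter block: 8 x 32-bit words = 256 bits (hypothetical type name).
using sbbf_block = std::array<std::uint32_t, 8>;

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr std::array<std::uint32_t, 8> kSalt = {
  0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
  0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Bit pattern: exactly one bit set in each of the eight words.
inline sbbf_block make_mask(std::uint32_t key)
{
  sbbf_block mask{};
  for (std::size_t i = 0; i < mask.size(); ++i) {
    // The multiplication wraps modulo 2^32; the top 5 bits pick the bit position.
    mask[i] = std::uint32_t{1} << ((key * kSalt[i]) >> 27);
  }
  return mask;
}

// Block selection: the upper 32 bits of the hash are scaled into [0, num_blocks).
inline std::size_t block_index(std::uint64_t hash, std::size_t num_blocks)
{
  assert(num_blocks > 0);
  return static_cast<std::size_t>(((hash >> 32) * static_cast<std::uint64_t>(num_blocks)) >> 32);
}

// Insert: OR the mask (built from the lower 32 bits of the hash) into the block.
inline void sbbf_insert(std::vector<sbbf_block>& filter, std::uint64_t hash)
{
  auto& block     = filter[block_index(hash, filter.size())];
  auto const mask = make_mask(static_cast<std::uint32_t>(hash));
  for (std::size_t i = 0; i < block.size(); ++i) { block[i] |= mask[i]; }
}

// Query: the key may be present only if every mask bit is set in the block.
inline bool sbbf_contains(std::vector<sbbf_block> const& filter, std::uint64_t hash)
{
  auto const& block = filter[block_index(hash, filter.size())];
  auto const mask   = make_mask(static_cast<std::uint32_t>(hash));
  for (std::size_t i = 0; i < block.size(); ++i) {
    if ((block[i] & mask[i]) == 0) { return false; }
  }
  return true;
}
```

A GPU policy would perform the same three steps per key (hash, pick a block, build and test the mask), so bitsets produced this way should be comparable against those read from Parquet files.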
Describe the solution you'd like
Use cuco::bloom_filter with a custom cuco::bloom_filter_policy to implement Arrow's BF logic in the Parquet reader to filter row groups.
Additional context
The 1:1 Arrow BF policy may be implemented directly in cuco or upstreamed later on from cudf for exposure to broader RAPIDS.
Associated Subtasks
Task | PRs | Notes |
---|---|---|
Implement a cuco::bloom_filter_policy to mimic Arrow BF policy | ✅ NVIDIA/cuCollections#625 | ✅ NVIDIA/cuCollections#633 adds bitset validation against Arrow impl |
Add support to read and deserialize BF bitset from Parquet files | ✅ #17289 | ✅ NVIDIA/cuCollections#642 and ✅ #17393 to support cudf types in Bloom Filter |
Use cuco::bloom_filter with the read BF bitset and policy in Parquet reader: check min/max stats and bloom filter simultaneously to prune column chunks; identify which columns have equality conditions; read the bloom filters only for the relevant column chunks | ✅ #17289 | ✅ #17587 simplifies Stats and Bloomfilter AST expression transformers using ast::tree; ✅ NVIDIA/cuCollections#654 updates arrow_filter_policy to not rely on xxhash64's member types to be consistent with STL |
Measure number of filtered row groups and return as a part of table_with_metadata | ✅ #17594 | ✅ #17587 simplifies AST expression converter using ast::tree; ✅ rapidsai/rapids-cmake#735 bumps cuco to include changes from NVIDIA/cuCollections#654 |
Minor improvements | ✅ #17753 | Only instantiate bloom_filter_query functor for required types |
Fix for compiler seg fault due to bloom filter alignment issues | ✅ #17758 | ✅ #17758 Use aligned_resource_adaptor to allocate bloom filter buffers and use new bloom_filter_ref ctors; ✅ NVIDIA/cuCollections#660 New cuco constructors that avoid __trap leading to seg fault (thanks @sleeepyjack and @PointKernel) |
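
To illustrate the pruning task in the table (checking min/max stats and the bloom filter together for an equality predicate), here is a minimal host-side sketch. It does not reflect the cudf implementation: `column_chunk_info`, `may_match_equality` and `surviving_row_groups` are hypothetical names, the column is assumed to be a 64-bit integer, and the bloom filter is modeled as an optional callback that returns false only when the value is definitely absent.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical per-row-group metadata for one column chunk (illustrative only).
struct column_chunk_info {
  std::int64_t min_value;
  std::int64_t max_value;
  // Empty if the chunk has no bloom filter; otherwise returns false only
  // when the value is definitely not in the chunk.
  std::function<bool(std::int64_t)> bloom_might_contain;
};

// A row group can be skipped for `col == literal` if either the min/max
// stats or the bloom filter rules the literal out.
inline bool may_match_equality(column_chunk_info const& chunk, std::int64_t literal)
{
  if (literal < chunk.min_value || literal > chunk.max_value) { return false; }
  if (chunk.bloom_might_contain && !chunk.bloom_might_contain(literal)) { return false; }
  return true;
}

// Returns the indices of the row groups that still have to be read.
inline std::vector<std::size_t> surviving_row_groups(
  std::vector<column_chunk_info> const& chunks, std::int64_t literal)
{
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < chunks.size(); ++i) {
    if (may_match_equality(chunks[i], literal)) { kept.push_back(i); }
  }
  return kept;
}
```

Because a bloom filter can report false positives but never false negatives, a surviving row group may still contain no matching rows; the filter only reduces the amount of data read.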