[FEA] Use bloom filters in Parquet reader to filter row groups with equality predicates #17164

@mhaseeb123

Description

Is your feature request related to a problem? Please describe.
In the Parquet reader, we can use cuco::bloom_filter_ref with a custom cuco::bloom_filter_policy to filter row groups when the query has an equality predicate. Row groups whose bloom filter shows that the literal is definitely absent can be skipped, which would potentially reduce I/O.
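
As a rough illustration of the idea, here is a minimal sketch of pruning row groups for a predicate of the form `col == literal`. It uses a toy k-hash Bloom filter built on `hashlib`, not the Parquet/Arrow split-block layout and not cuco; the names `ToyBloomFilter` and `row_groups_to_read` are hypothetical and exist only for this example.

```python
import hashlib


class ToyBloomFilter:
    """Plain k-hash Bloom filter (assumption: NOT the Parquet/Arrow layout)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def row_groups_to_read(filters, literal):
    """Indices of row groups that may contain `literal` for `col == literal`."""
    return [i for i, f in enumerate(filters) if f.might_contain(literal)]


# Three hypothetical row groups with disjoint values.
row_group_values = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
filters = []
for values in row_group_values:
    bloom = ToyBloomFilter()
    for v in values:
        bloom.add(v)
    filters.append(bloom)

# Row group 1 is always selected; the others are skipped unless the
# filter reports a false positive.
selected = row_groups_to_read(filters, 20)
```

A Bloom filter never yields a false negative, so a row group that holds the literal is always read. The I/O savings come from the row groups the filter rules out.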

The custom cuco::bloom_filter_policy would need to implement Arrow's logic for generating the bit pattern and for selecting the filter block for a given key. The same policy could also be used in the future to write our own bloom filters to Parquet on the writer side.
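
The block selection and bit pattern steps can be sketched as follows, assuming Arrow's logic matches the split-block Bloom filter described in the Parquet format specification: 256-bit blocks of eight 32-bit words, with one bit set per word. The salt constants and the two formulas are taken from that specification as I understand it. The hash is a loud assumption: Parquet specifies xxHash64 with seed 0, which is not in the Python standard library, so `stand_in_hash64` below is a placeholder, and bitsets built with it will not match real Parquet files. All function names are hypothetical.

```python
import hashlib
import struct

# Salt constants from the Parquet split-block Bloom filter specification.
SALT = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
        0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)
MASK32 = 0xFFFFFFFF


def stand_in_hash64(value):
    """Placeholder 64-bit hash (assumption: real filters use xxHash64, seed 0)."""
    data = struct.pack("<q", value)  # little-endian int64, as in plain encoding
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def block_index(hash64, num_blocks):
    """Pick a block from the upper 32 bits of the hash."""
    return ((hash64 >> 32) * num_blocks) >> 32


def block_mask(hash64):
    """Bit pattern: one bit per 32-bit word, from the lower 32 bits of the hash."""
    x = hash64 & MASK32
    return [1 << (((x * salt) & MASK32) >> 27) for salt in SALT]


def insert(blocks, hash64):
    block = blocks[block_index(hash64, len(blocks))]
    for i, bit in enumerate(block_mask(hash64)):
        block[i] |= bit


def check(blocks, hash64):
    block = blocks[block_index(hash64, len(blocks))]
    return all(block[i] & bit for i, bit in enumerate(block_mask(hash64)))


# A filter with four blocks of eight 32-bit words each.
blocks = [[0] * 8 for _ in range(4)]
for v in (7, 42, 1000):
    insert(blocks, stand_in_hash64(v))
```

Because each key touches exactly one block, a lookup reads a single 32-byte block. That layout is the part a custom policy has to reproduce bit for bit so that the filter agrees with bitsets written by other Parquet implementations.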

Describe the solution you'd like
Use cuco::bloom_filter with a custom cuco::bloom_filter_policy to implement Arrow's bloom filter (BF) logic in the Parquet reader to filter row groups.

Additional context
The 1:1 Arrow BF policy may be implemented directly in cuco, or implemented in cudf first and upstreamed later for exposure to the broader RAPIDS ecosystem.

Associated Subtasks

| Task | PRs | Notes |
|---|---|---|
| Implement a cuco::bloom_filter_policy to mimic Arrow BF policy | NVIDIA/cuCollections#625 | NVIDIA/cuCollections#633 adds bitset validation against Arrow impl |
| Add support to read and deserialize BF bitset from Parquet files | #17289 | NVIDIA/cuCollections#642 and ✅ #17393 to support cudf types in Bloom Filter |
| Use cuco::bloom_filter with the read BF bitset and policy in Parquet reader: check min/max stats and bloom filter simultaneously to prune column chunks; identify which columns have equality conditions; read the bloom filters only for the relevant column chunks | #17289 | #17587 simplifies Stats and Bloomfilter AST expression transformers using ast::tree; NVIDIA/cuCollections#654 updates arrow_filter_policy to not rely on xxhash64's member types to be consistent with STL |
| Measure number of filtered row groups and return as a part of table_with_metadata | #17594 | #17587 simplifies AST expression converter using ast::tree; rapidsai/rapids-cmake#735 bumps cuco to include changes from NVIDIA/cuCollections#654 |
| Minor improvements | #17753 | Only instantiate bloom_filter_query functor for required types |
| Fix for compiler seg fault due to bloom filter alignment issues | #17758 | #17758 Use aligned_resource_adaptor to allocate bloom filter buffers and use new bloom_filter_ref ctors; NVIDIA/cuCollections#660 New cuco constructors that avoid __trap leading to seg fault (thanks @sleeepyjack and @PointKernel) |
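
The reader-side subtasks (checking min/max stats together with the bloom filter, and reporting how many row groups were filtered) can be sketched as below. This is a simplified model, not the libcudf implementation: `ColumnChunkMeta` and `prune_row_groups` are hypothetical names, and an exact Python set stands in for the bloom filter membership test, so unlike a real filter it never reports false positives.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ColumnChunkMeta:
    """Hypothetical per-row-group metadata for the predicate column."""
    min_value: int
    max_value: int
    # Membership test backed by a bloom filter; None if the file has no filter.
    might_contain: Optional[Callable[[int], bool]]


def prune_row_groups(chunks, literal):
    """Return (kept row group indices, number filtered) for `col == literal`."""
    kept = []
    for index, chunk in enumerate(chunks):
        if not (chunk.min_value <= literal <= chunk.max_value):
            continue  # min/max stats prove the literal is absent
        if chunk.might_contain is not None and not chunk.might_contain(literal):
            continue  # bloom filter proves the literal is absent
        kept.append(index)
    return kept, len(chunks) - len(kept)


chunks = [
    ColumnChunkMeta(0, 9, {1, 5, 9}.__contains__),      # pruned by stats
    ColumnChunkMeta(10, 99, {10, 60, 99}.__contains__),  # pruned by bloom filter
    ColumnChunkMeta(40, 70, {40, 50, 70}.__contains__),  # kept
    ColumnChunkMeta(0, 100, None),                       # no filter, so kept
]
kept, num_filtered = prune_row_groups(chunks, 50)
```

Checking the stats first means a filter only needs to be consulted for row groups whose value range actually covers the literal. The filtered count is the kind of number the subtask above proposes returning as part of table_with_metadata.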

Metadata

Labels

* cuIO: cuIO issue
* cuco: cuCollections related issue
* feature request: New feature or request
* improvement: Improvement / enhancement to an existing function
* libcudf: Affects libcudf (C++/CUDA) code.
