8693: Add FilterPredicate::filter_record_batch #7

martin-augment wants to merge 3 commits into main from …

Conversation
Walkthrough

The filter functionality in arrow-select is refactored to centralize batch-level filtering logic. A new public method on FilterPredicate applies column filtering and reconstructs RecordBatch objects. The optimizer trigger is broadened to activate when the first column contains multiple internal arrays, not just with multi-column batches.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Code Review for PR #7: Add FilterPredicate::filter_record_batch
Actionable comments posted: 0
🧹 Nitpick comments (2)
arrow-select/src/filter.rs (2)
176-183: Update the comment to reflect the broadened optimization condition.

The optimization condition now triggers not only when there are multiple columns, but also when a single column contains multiple internal arrays (e.g., a Struct with multiple fields). The comment on lines 180-181 is outdated and should be revised to accurately describe the new logic.
Apply this diff to update the comment:
```diff
 let mut filter_builder = FilterBuilder::new(predicate);
 let num_cols = record_batch.num_columns();
 if num_cols > 1
     || (num_cols > 0 && multiple_arrays(record_batch.schema_ref().field(0).data_type()))
 {
-    // Only optimize if filtering more than one column
+    // Only optimize if filtering more than one column or if the column contains multiple internal arrays
     // Otherwise, the overhead of optimization can be more than the benefit
     filter_builder = filter_builder.optimize();
 }
```
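For orientation, here is a hypothetical sketch of what a `multiple_arrays` helper like the one referenced in this condition could look like; the actual helper in the PR may differ:

```rust
use arrow_schema::DataType;

// Hypothetical sketch, not the PR's implementation: report whether a single
// column of this type is backed by more than one child array, in which case
// predicate optimization can pay off even for a one-column batch.
fn multiple_arrays(data_type: &DataType) -> bool {
    match data_type {
        // A struct column holds one child array per field; recurse into a
        // sole field so nested structs are also detected
        DataType::Struct(fields) => {
            fields.len() > 1
                || fields.first().is_some_and(|f| multiple_arrays(f.data_type()))
        }
        // A union column likewise carries one child array per variant
        DataType::Union(fields, _) => fields.len() > 1,
        _ => false,
    }
}
```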
300-323: LGTM! Consider enhancing the safety comment for clarity.

The new public API provides a convenient way to filter entire RecordBatches. The use of `unsafe` with `RecordBatch::new_unchecked` is justified since:
- The filtered arrays maintain the same types as the original schema
- All filtered arrays have length `self.count` (produced by `filter_array`)
- The number of columns matches the schema
Optionally, you could enhance the safety comment to be more explicit about all invariants:
```diff
-        // SAFETY: we know that the set of filtered arrays will match the schema of the original
-        // record batch
+        // SAFETY: The filtered arrays satisfy all RecordBatch invariants:
+        // - Column count matches schema fields
+        // - Each array's data type matches the corresponding schema field
+        // - All arrays have length equal to self.count
         unsafe {
```
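Putting the review notes together, the new method plausibly has roughly this shape — a sketch reconstructed from the comments above, assuming the crate-private `filter_array` helper and a `count()` accessor on the predicate, not the verbatim PR code:

```rust
// Sketch of the method in arrow-select/src/filter.rs (assumes the surrounding
// module's FilterPredicate type and its per-column `filter_array` helper).
use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::ArrowError;

impl FilterPredicate {
    /// Filter every column of `record_batch` with this predicate and rebuild
    /// the batch without re-validating the schema.
    pub fn filter_record_batch(&self, record_batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
        let filtered: Vec<ArrayRef> = record_batch
            .columns()
            .iter()
            .map(|column| filter_array(column, self))
            .collect::<Result<_, ArrowError>>()?;
        // SAFETY: each filtered array keeps its original data type, the
        // column count matches the schema, and every array has length
        // self.count(), so the RecordBatch invariants hold
        Ok(unsafe { RecordBatch::new_unchecked(record_batch.schema(), filtered, self.count()) })
    }
}
```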
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
arrow-select/src/filter.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
arrow-select/src/filter.rs (3)
arrow-flight/src/sql/metadata/sql_info.rs (1)
- record_batch (412-429)

arrow-flight/src/sql/metadata/xdbc_info.rs (1)
- record_batch (98-106)

arrow-array/src/record_batch.rs (1)
- new_unchecked (261-271)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: Rustdocs are clean
- GitHub Check: Clippy
- GitHub Check: Build wasm32
- GitHub Check: Clippy
- GitHub Check: Clippy
- GitHub Check: Test on Windows
- GitHub Check: Lint (cargo fmt)
- GitHub Check: MIRI
- GitHub Check: Test
- GitHub Check: Test on Mac
- GitHub Check: Verify MSRV (Minimum Supported Rust Version)
- GitHub Check: Test
- GitHub Check: Build wasm32
- GitHub Check: Test
- GitHub Check: Check Compilation
- GitHub Check: PySpark Integration Test (stable)
- GitHub Check: claude-review
🔇 Additional comments (1)
arrow-select/src/filter.rs (1)
186-186: LGTM! Good refactor to centralize filtering logic.

Delegating to the new `FilterPredicate::filter_record_batch` method reduces code duplication and provides a cleaner API.
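For readers following along, a minimal usage sketch of the new entry point (the batch contents and mask here are illustrative; the exact call shape is assumed from the review):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, BooleanArray, Int32Array, RecordBatch};
use arrow_schema::ArrowError;
use arrow_select::filter::FilterBuilder;

fn main() -> Result<(), ArrowError> {
    // An illustrative single-column batch
    let batch = RecordBatch::try_from_iter([
        ("a", Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef),
    ])?;
    // Keep rows 0 and 2
    let mask = BooleanArray::from(vec![true, false, true]);

    // Build the predicate once, then filter the whole batch with it
    let predicate = FilterBuilder::new(&mask).optimize().build();
    let filtered = predicate.filter_record_batch(&batch)?;
    assert_eq!(filtered.num_rows(), 2);
    Ok(())
}
```

The appeal of the centralized method is that the predicate is built (and optionally optimized) once and then applied uniformly across all columns of the batch.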
augment review
```rust
if num_cols > 1
    || (num_cols > 0 && multiple_arrays(record_batch.schema_ref().field(0).data_type()))
{
    // Only optimize if filtering more than one column
```
The comment says to optimize only when filtering >1 column, but the condition also triggers for a single column when the first column’s type spans multiple arrays; consider updating the comment to reflect this (also applies to the following line).
value: good-to-have; category: documentation; feedback: The AI agent is correct that the old comment became incomplete with the new changes in this PR, and it would be good to update it.
8693: To review by AI
Summary by CodeRabbit

Release Notes

- New Features
- Improvements
- Refactor