@oarap oarap commented Jan 21, 2026

What problem does this PR solve?

Issue Number: close #164

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Refactor HiveDataSource to apply pushed-down filters and map key pruning to the scan result using the Expr framework.

  • Replace ad-hoc evaluateRemainingFilter with exec::ExprSet to standardize post-scan evaluation.
  • Introduce postScanExprSet_ to handle both map key pruning (projections) and residual filtering in a single pass.
  • Add rowGroupFilter_ to ScanSpec to explicitly separate filters used for row group pruning from logical filters applied post-scan.
  • Update DwrfData and ParquetData to use rowGroupFilter for IO optimization.
  • Integrate dynamic filters to automatically rebuild the post-scan expression set.
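As a rough illustration of the "single pass" idea in the list above, the sketch below applies a residual predicate and map key pruning to each row in one loop. It deliberately does not use the real `exec::ExprSet` API: `Row`, `PostScanExprs`, and `applyPostScan` are hypothetical stand-ins invented for this example, and the row-at-a-time loop is only an assumption about the shape of the logic, not how `postScanExprSet_` is actually evaluated.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type: one scalar column and one map column.
struct Row {
  int64_t id;
  std::map<std::string, int64_t> features;
};

// Hypothetical stand-in for a compiled post-scan expression set:
// one projection (map key pruning) plus one residual predicate.
struct PostScanExprs {
  std::set<std::string> requiredKeys;        // map keys the query reads
  std::function<bool(const Row&)> residual;  // filter left over after the scan
};

// Applies the residual filter and the key pruning in one pass over the batch.
std::vector<Row> applyPostScan(
    const std::vector<Row>& batch,
    const PostScanExprs& exprs) {
  assert(exprs.residual != nullptr);
  std::vector<Row> out;
  for (const Row& row : batch) {
    if (!exprs.residual(row)) {
      continue;  // row rejected by the residual filter
    }
    Row pruned{row.id, {}};
    for (const auto& [key, value] : row.features) {
      if (exprs.requiredKeys.count(key) > 0) {
        pruned.features.emplace(key, value);  // keep only requested keys
      }
    }
    out.push_back(std::move(pruned));
  }
  return out;
}
```

In this sketch, a dynamic filter arriving mid-scan would correspond to building a new `PostScanExprs` with a stricter `residual` and using it for subsequent batches, which is the assumed analogue of rebuilding the post-scan expression set.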

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

- Fixed a crash in `substr` when input is null.
- Optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@oarap oarap requested review from frankobe and markjin1990 January 21, 2026 17:17
@oarap oarap force-pushed the oarap_apply_all_filters_post_scan branch 2 times, most recently from 5d557c5 to 5823ac9 Compare January 23, 2026 00:57
…amework

Refactor HiveDataSource to apply pushed down filters and map key pruning to the scan result using the Expr framework.

- Replace ad-hoc `evaluateRemainingFilter` with `exec::ExprSet` to standardize post-scan evaluation.
- Introduce `postScanExprSet_` to handle both map key pruning (projections) and residual filtering in a single pass.
- Add `rowGroupFilter_` to `ScanSpec` to explicitly separate filters used for row group pruning from logical filters applied post-scan.
- Update `DwrfData` and `ParquetData` to use `rowGroupFilter` for IO optimization.
- Integrate dynamic filters to automatically rebuild the post-scan expression set.
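The split between a row group filter and a post-scan filter described in the commit message can be pictured with the sketch below. It is a conceptual model only: `RowGroupStats`, `RangeFilter`, `RowGroup`, and `scan` are hypothetical names for this example and are not the `ScanSpec`, `DwrfData`, or `ParquetData` interfaces. The assumption is that the row group filter is tested against min/max statistics to skip IO, while the logical filter is applied to the values that were actually read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical min/max statistics for one row group.
struct RowGroupStats {
  int64_t min;
  int64_t max;
};

// Hypothetical inclusive range filter: lo <= x <= hi.
struct RangeFilter {
  int64_t lo;
  int64_t hi;

  // Conservative test against statistics: false means no value can match.
  bool mayMatch(const RowGroupStats& stats) const {
    assert(lo <= hi);
    return stats.max >= lo && stats.min <= hi;
  }

  // Exact test against a single value.
  bool matches(int64_t value) const {
    return value >= lo && value <= hi;
  }
};

struct RowGroup {
  RowGroupStats stats;
  std::vector<int64_t> values;
};

struct ScanResult {
  std::vector<int64_t> rows;
  std::size_t rowGroupsRead = 0;
};

// rowGroupFilter only decides which groups to read; postScanFilter decides
// which of the values that were read survive.
ScanResult scan(
    const std::vector<RowGroup>& groups,
    const RangeFilter& rowGroupFilter,
    const RangeFilter& postScanFilter) {
  ScanResult result;
  for (const RowGroup& group : groups) {
    if (!rowGroupFilter.mayMatch(group.stats)) {
      continue;  // skipped without reading any values
    }
    ++result.rowGroupsRead;
    for (int64_t value : group.values) {
      if (postScanFilter.matches(value)) {
        result.rows.push_back(value);
      }
    }
  }
  return result;
}
```

Passing the two filters as separate arguments mirrors the stated intent of keeping pruning and logical filtering distinct: the statistics check may keep a group that contains no matching value, so the post-scan filter is still needed for correctness.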
@oarap oarap force-pushed the oarap_apply_all_filters_post_scan branch from 5823ac9 to a628218 Compare January 23, 2026 01:00

Development

Successfully merging this pull request may close these issues.

[Feature] Post-Scan Filter Application using Expr framework
