Support reverse parquet scan and fast parquet order inversion at row group level #18817
Conversation
Also cc @suremarc. We're finally contributing our reversed parquet optimization upstream; I thought you might be interested in seeing it.
Thank you @xudong963 @suremarc. I made a lot of changes in this PR compared to our internal implementation, but the major design is similar to our internal version: row-group-level reversal. More follow-up PRs are needed to improve it further; for example, we should support custom output-order sources so it can be integrated with an ordered-partition source, etc.
Thanks -- I'll try and review this tomorrow.
Thank you @alamb ! |
datafusion/common/src/config.rs
Outdated
```rust
/// are read in reverse order to eliminate sort operations.
/// Note: This buffers one row group at a time (typically ~128MB).
/// Default: true
pub enable_reverse_scan: bool, default = true
```
Note: I default this to true for the reverse optimization; we can default to false if you think it's risky for some cases.
The key risk is memory overhead. Because the reversal happens at row-group level, we need to cache one row group's batches at a time, so setting a large max row group size leads to high memory usage.
```rust
/// Remove unnecessary sort based on the logic from EnforceSorting::analyze_immediate_sort_removal
fn remove_unnecessary_sort(
```
Note: I added this to the reverse-order rule because, after reversing the order, we can often remove the sort as well, so we don't need to run the enforce-sorting optimization again after this one.
How about putting pushdown_sort before enforce_sorting?
Thanks @xudong963. I tried this before, but it seemed to cause problems for other optimizer rules; I can test again.
The best solution may be that we don't need a new pushdown_sort optimizer at all: we could just enhance the existing optimizer to support this. I will try that later.
```
----
physical_plan
01)SortExec: TopK(fetch=3), expr=[number@0 ASC NULLS LAST], preserve_partitioning=[false]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/topk/partial_sorted/1.parquet]]}, projection=[number, letter, age], output_ordering=[number@0 DESC, letter@1 ASC NULLS LAST], file_type=parquet, predicate=DynamicFilter [ empty ]
```
With reverse scan, no sort is needed here, and it's very fast.
Supporting scanning Parquet files in reverse order is an absolutely great idea. I have a few questions. Let me first rephrase, to make sure I understand correctly what this PR does:

This implementation is quite aggressive; I think it can get tricky to tune it right, to avoid excessive caching or batch-by-batch row reversal becoming too expensive. What if we limit the initial implementation to only reversing the row-group order, similar to what @adriangb is planning to do at the file level in #17271?
Thank you @2010YOUY01 for the review and the valid concern. We've been running the full implementation (row-group plus row-level reversal) in production for a very long time with excellent results: 10-100x speedups for time-series queries and well-controlled memory usage (roughly one row group cached at a time). Note that we should not make the row group size large if we enable this feature, and with a very small limit the period of high memory usage is very short. The reversal time is also very small compared to the benefit of removing the sort entirely.

If we want to improve the original scan to only reverse the row-group order, I think we can add follow-up PRs, because that is a separate optimization which cannot remove the sort, so it needs its own PR.

Regarding native arrow-rs support for page-level reversal: as discussed in arrow-rs#3922, implementing true page-level reverse decoding is
While arrow-rs may eventually support this (as proposed in #17172), it requires
Once arrow-rs implements native page-level reversal, we can easily migrate to it. What's your opinion?
I haven't looked into all of this discussion and code (I just got tagged). I've been looking into optimizing sorted scanning in DataFusion and IMO where we should land is:
I hope that is helpful.
Thank you @adriangb, that's helpful for future optimization. I think these approaches are complementary: my PR handles the reverse-scan optimization, while your vision provides a framework for broader sorted-scan optimizations using file-level statistics and metadata. It would be great to build toward that architecture incrementally.
Did you mean 'cannot eliminate the SortExec(TopK)'? Just to confirm: there is no global sort, but it is true that we have to do a
I have an intuition that for this kind of workload the bottleneck is the parquet decoding speed, and an extra
It makes a lot of sense that it's very hard to implement page/row-level reversal in
Summary: perhaps we can start by adding a few end-to-end benchmarks that reflect your typical production workload. If this PR's approach shows a clear improvement over the naive approach in #18817 (comment) (I'm happy to do a quick prototype), we should definitely move forward.

Nice point @2010YOUY01. I agree that most of the time will be spent decoding pages. I can change this PR to add a config option implementing #18817 (comment), or create another PR for it, so we have more options to compare; I agree the simpler solution is better. And a benchmark would be really helpful, thanks!
FYI, I'll start reviewing the PR tomorrow.
Thanks @xudong963 ! |
My point is that instead of
I'm not opposed to this as a step towards that, but I'm not sure how helpful it is. Seeing something more concrete w.r.t. how this interacts with the bigger picture would be helpful IMO.
Having high-level sort pushdown is a great idea, @adriangb, and reverse scan is one of its policies. I will refactor this PR to use that approach, thanks!

Update: I have already changed to high-level sort pushdown in this commit:
```rust
// Successfully pushed down sort, now handle the limit
let total_fetch = limit_exec.skip() + limit_exec.fetch().unwrap_or(0);

// Try to push limit down as well if the source supports it
```
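As a side note on the arithmetic above: when a limit is pushed below an operator that skips rows, the source must produce `skip + fetch` rows so the skip still has enough input. A minimal self-contained sketch (hypothetical helper name, not DataFusion's API):

```rust
// A source feeding an operator that skips `skip` rows must emit
// skip + fetch rows; otherwise the skip would consume rows that
// the fetch still needs.
fn total_fetch(skip: usize, fetch: Option<usize>) -> usize {
    skip + fetch.unwrap_or(0)
}

fn main() {
    // OFFSET 10 LIMIT 5: the scan must produce 15 rows.
    assert_eq!(total_fetch(10, Some(5)), 15);
    // No explicit fetch: only the skipped rows are required.
    assert_eq!(total_fetch(10, None), 10);
    println!("ok");
}
```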
I think the current limit_pushdown physical optimizer rule can do this. So do we still need to distinguish the sort and limit + sort pattern?
I added this logic here because I found that, with the current optimizer ordering, we would always need to run some of the rules more than once if this logic were removed.
So I will try to fold our optimizer into the existing optimizers instead.
```rust
/// Try to create a new execution plan that satisfies the given sort ordering.
///
/// Default implementation returns `Ok(None)`.
fn try_pushdown_sort(
```
Do we need to add this API to ExecutionPlan? Is it possible to push the sort down within the pushdown-sort optimizer itself? Since the rule already traverses the plan, it seems possible to find the target node and hand it the ordering directly.
This is a good point.

I plan to review this PR carefully tomorrow.
Which issue does this PR close?
Closes #17172
Overview
This PR implements reverse scanning for Parquet files to optimize `ORDER BY ... DESC LIMIT N` queries on sorted data. When DataFusion detects that reversing the scan order would eliminate the need for a separate sort operation, it can now directly read Parquet files in reverse order.

Implementation Note: This PR implements Part 1 of the vision outlined in #17172 (Order Inversion at the DataFusion level).
Current implementation:
Future improvements (requires arrow-rs changes):
(`take` kernel overhead)

These enhancements would further optimize memory usage and latency, but the current implementation already provides substantial benefits for most workloads.
Rationale for this change
Motivation
Currently, queries like `SELECT * FROM table ORDER BY sorted_column DESC LIMIT 100` require DataFusion to:

For files that are already sorted in ascending order, this is inefficient. With this optimization, DataFusion can:
Performance Benefits:
Scope and Limitations
This optimization applies to:
- `SortPreservingMerge` is still required
- `ORDER BY ... DESC` on pre-sorted columns
- `LIMIT` clauses (most beneficial for single-partition)

This optimization does NOT apply to:
Single-partition vs Multi-partition:
`SortPreservingMergeExec` is needed to combine streams. Limit cannot be pushed to individual partitions.

Performance comparison:
- Single-partition `ORDER BY DESC LIMIT N` → Direct reverse scan with limit pushed down to DataSource
- Multi-partition `ORDER BY DESC LIMIT N` → Reverse scan per partition + `LocalLimitExec` + `SortPreservingMergeExec`

While multi-partition scans still require a merge operation, they benefit significantly from:
- `LocalLimitExec`

Configuration
This optimization is enabled by default but can be controlled via:
SQL:
Rust API:
When to disable:
Default: Enabled (true)
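The option introduced by this PR can be toggled per session; a minimal SQL sketch using the config key from this PR:

```sql
-- Disable row-group-level reverse scanning for the current session
SET datafusion.execution.parquet.enable_reverse_scan = false;

-- Re-enable it (the default)
SET datafusion.execution.parquet.enable_reverse_scan = true;
```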
Implementation Details
Architecture
The implementation consists of four main components:
1. ParquetSource API (`source.rs`)
- Added `reverse_scan: bool` field to `ParquetSource`
- Added `with_reverse_scan()` and `reverse_scan()` methods

2. ParquetOpener (`opener.rs`)
- Added `reverse_scan: bool` field
- Reverses the row group order via `row_group_indexes.reverse()`
- Depending on the `reverse_scan` flag, creates either a plain `RecordBatchStreamAdapter` or a `ReversedParquetStream` with row-group-level buffering

3. ReversedParquetStream (`opener.rs`)
A custom stream implementation that performs two-stage reversal with optional limit support:
- Stage 1 - Row Reversal: Reverse rows within each batch using Arrow's `take` kernel
- Stage 2 - Batch Reversal: Reverse the order of batches within each row group

Key Properties:
- Reports `row_groups_reversed`, `batches_reversed`, and `reverse_time` metrics

4. Physical Optimizer (`reverse_order.rs`)
- New `ReverseOrder` optimization rule
- Detects `SortExec` with a reversible input ordering
- Handles `GlobalLimitExec -> SortExec` patterns (the most beneficial case)
- Uses a `TreeNodeRewriter` to push the reverse flag down to `ParquetSource`
- Applies only to single-partition `DataSourceExec` to avoid correctness issues with multi-partition scans

Why Row-Group-Level Buffering?
Row group buffering is necessary for correctness:
This is the minimal buffering granularity that ensures correct results while still being compatible with arrow-rs's existing parquet reader architecture.
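The two-stage reversal can be illustrated with a self-contained sketch, using plain `Vec<i32>` values to stand in for Arrow record batches (hypothetical helper name; the real stream operates on `RecordBatch`es and uses the `take` kernel for stage 1):

```rust
// Sketch of the two-stage reversal performed on one buffered row group.
// A `Vec<i32>` stands in for a RecordBatch; a row group is a Vec of batches.
fn reverse_row_group(batches: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    batches
        .into_iter()
        // Stage 1: reverse the rows within each batch (the real code uses
        // Arrow's `take` kernel with descending indices).
        .map(|mut batch| {
            batch.reverse();
            batch
        })
        // Stage 2: reverse the order of batches within the row group.
        .rev()
        .collect()
}

fn main() {
    // A row group sorted ascending, split into two batches.
    let rg = vec![vec![1, 2, 3], vec![4, 5]];
    let reversed = reverse_row_group(rg);
    // Concatenated output is now in fully descending order.
    assert_eq!(reversed, vec![vec![5, 4], vec![3, 2, 1]]);
    println!("{:?}", reversed);
}
```

Concatenating the output batches yields the row group's rows in fully descending order, which is why buffering a whole row group is sufficient for correctness.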
Memory Characteristics:
Why this is necessary:
Future Optimization: Page-level reverse scanning in arrow-rs could further reduce memory usage and improve latency by eliminating row-group buffering entirely.
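The memory characteristics above can be sanity-checked with a back-of-envelope sketch (assumed numbers; actual usage depends on the writer's max row group size and the decoded batch width):

```rust
// Peak extra memory of the reversed stream is roughly one decoded row
// group, since batches are buffered until the row group is complete.
fn peak_buffer_bytes(row_group_rows: usize, avg_row_bytes: usize) -> usize {
    row_group_rows * avg_row_bytes
}

fn main() {
    // e.g. 1M rows per row group at ~128 bytes per decoded row = 128 MB,
    // consistent with the "typically ~128MB" note in the config docs.
    let bytes = peak_buffer_bytes(1_000_000, 128);
    assert_eq!(bytes, 128_000_000);
    println!("~{} MB buffered", bytes / 1_000_000);
}
```

This is why the discussion above recommends keeping the row group size modest when the feature is enabled.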
What changes are included in this PR?
Core Implementation:
- `ParquetSource`: Added reverse scan flag and methods
- `ParquetOpener`: Row group reversal and stream creation logic
- `ReversedParquetStream`: Unified stream implementation with optional limit support

Physical Optimization:
- `ReverseOrder`: New optimizer rule for detecting and applying the reverse scan optimization
- Handles `SortExec` and `GlobalLimitExec -> SortExec`

Configuration:
- `enable_reverse_scan` config option (default: true)

Metrics:
- `row_groups_reversed`: Count of reversed row groups
- `batches_reversed`: Count of reversed batches
- `reverse_time`: Time spent reversing data

Are these changes tested?
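The plan patterns the optimizer rule rewrites can be sketched with a toy plan tree (hypothetical types; the real rule works on `ExecutionPlan` nodes via `TreeNodeRewriter` and checks that the scan's ordering is the exact inverse of the sort):

```rust
// Toy plan tree illustrating the Sort and Limit -> Sort patterns.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan { reversed: bool },
    Sort(Box<Plan>),
    Limit(Box<Plan>),
}

// Rewrite `Sort -> Scan` (optionally under a Limit) into a reversed scan,
// dropping the sort; everything else is left unchanged.
fn apply_reverse_order(plan: Plan) -> Plan {
    match plan {
        Plan::Limit(inner) => Plan::Limit(Box::new(apply_reverse_order(*inner))),
        Plan::Sort(inner) => match *inner {
            // The sort is redundant once the scan runs in reverse.
            Plan::Scan { .. } => Plan::Scan { reversed: true },
            other => Plan::Sort(Box::new(apply_reverse_order(other))),
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::Limit(Box::new(Plan::Sort(Box::new(Plan::Scan {
        reversed: false,
    }))));
    let rewritten = apply_reverse_order(plan);
    // The sort is gone; the reverse flag was pushed into the scan.
    assert_eq!(
        rewritten,
        Plan::Limit(Box::new(Plan::Scan { reversed: true }))
    );
    println!("{:?}", rewritten);
}
```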
Yes, comprehensive tests added:
Unit Tests (`opener.rs`):

Integration Tests (`reverse_order.rs`):

SQL Logic Tests (`.slt` files):

Are there any user-facing changes?
New Configuration Option:
- `datafusion.execution.parquet.enable_reverse_scan` (default: true)

Behavioral Changes:
- `ORDER BY ... DESC LIMIT N` on sorted single-partition Parquet files will automatically use reverse scanning when beneficial

Breaking Changes: