Conversation


@zhuqi-lucas zhuqi-lucas commented Nov 19, 2025

Which issue does this PR close?

Closes #17172

Overview

This PR implements reverse scanning for Parquet files to optimize ORDER BY ... DESC LIMIT N queries on sorted data. When DataFusion detects that reversing the scan order would eliminate the need for a separate sort operation, it can now directly read Parquet files in reverse order.

Implementation Note: This PR implements Part 1 of the vision outlined in #17172 (Order Inversion at the DataFusion level).

Current implementation:

  • ✅ Row-group-level reversal (this PR)
  • ✅ Memory bounded by row group size (typically ~128MB)
  • ✅ Significant performance improvements for common use cases

Future improvements (requires arrow-rs changes):

  • Page-level reverse decoding (would reduce memory to ~1MB per page)
  • In-place array flipping (would eliminate take kernel overhead)
  • See issue Fast parquet order inversion #17172 for full details

These enhancements would further optimize memory usage and latency, but the current implementation already provides substantial benefits for most workloads.

Rationale for this change

Motivation

Currently, queries like SELECT * FROM table ORDER BY sorted_column DESC LIMIT 100 require DataFusion to:

  1. Read the entire file in forward order
  2. Sort/reverse all results in memory
  3. Apply the limit

For files that are already sorted in ascending order, this is inefficient. With this optimization, DataFusion can:

  1. Read row groups in reverse order
  2. Reverse individual batches progressively
  3. Stream results directly without a separate sort operation
  4. Stop reading early when the limit is reached (single-partition case)
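The four steps above can be sketched as a toy model in plain Rust, with each row group standing in as an ascending Vec<i64> (an illustrative simplification, not the actual stream implementation):

```rust
/// Toy model of the reverse read path: visit row groups last-to-first,
/// emit each group's rows in reverse, and stop once the limit is met.
fn reverse_scan_with_limit(row_groups: &[Vec<i64>], limit: usize) -> Vec<i64> {
    let mut out = Vec::new();
    for group in row_groups.iter().rev() {
        // Reverse rows within the group (the real code reverses buffered
        // batches with Arrow's take kernel after the row group completes).
        out.extend(group.iter().rev().copied());
        // Early termination: earlier row groups are never read at all.
        if out.len() >= limit {
            out.truncate(limit);
            break;
        }
    }
    out
}

fn main() {
    // Two ascending row groups; ORDER BY v DESC LIMIT 4 touches only the
    // last group plus one row of the first.
    let groups = vec![vec![1, 2, 3], vec![4, 5, 6]];
    assert_eq!(reverse_scan_with_limit(&groups, 4), vec![6, 5, 4, 3]);
}
```

The early break is what makes the single-partition case stop I/O as soon as the limit is satisfied.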

Performance Benefits:

  • Eliminates memory-intensive sort operations for large files
  • Enables streaming for reverse-order queries with limits
  • Reduces query latency significantly for sorted data
  • Reduces I/O by stopping early when limit is satisfied (single-partition)

Scope and Limitations

This optimization applies to:

  • ✅ Single-partition scans (most common case for sorted Parquet files) - Full optimization: sort completely eliminated
  • ✅ Multi-partition scans - Partial optimization: each partition scans in reverse, but SortPreservingMerge is still required
  • ✅ Queries with ORDER BY ... DESC on pre-sorted columns
  • ✅ Queries with LIMIT clauses (most beneficial for single-partition)

This optimization does NOT apply to:

  • ❌ Unsorted files - No benefit from reverse scanning
  • ❌ Complex sort expressions that don't match file ordering

Single-partition vs Multi-partition:

Scenario         | Optimization Effect                                      | Why
-----------------|----------------------------------------------------------|----
Single-partition | Sort operation completely eliminated                     | No merge needed; direct reverse streaming with limit pushdown
Multi-partition  | Per-partition sorts eliminated, but merge still required | Each partition reads in reverse (eliminating per-partition sorts), but SortPreservingMergeExec is needed to combine streams. The limit cannot be pushed to individual partitions.

Performance comparison:

  • Single-partition: ORDER BY DESC LIMIT N → Direct reverse scan with limit pushed down to DataSource
  • Multi-partition: ORDER BY DESC LIMIT N → Reverse scan per partition + LocalLimitExec + SortPreservingMergeExec

While multi-partition scans still require a merge operation, they benefit significantly from:

  • Elimination of per-partition sort operations
  • Parallel reverse scanning across partitions
  • Reduced data volume entering the merge stage via LocalLimitExec

Configuration

This optimization is enabled by default but can be controlled via:

SQL:

SET datafusion.execution.parquet.enable_reverse_scan = false;  -- or true (the default)

Rust API:

let ctx = SessionContext::new()
    .with_config(
        SessionConfig::new()
            .with_parquet_reverse_scan(false)  // Disable optimization
    );

When to disable:

  • If your Parquet files have very large row groups (> 256MB) and memory is constrained (row group buffering required for correctness)
  • For debugging or performance comparison purposes
  • If you observe unexpected behavior (please report as a bug!)

Default: Enabled (true)

Implementation Details

Architecture

The implementation consists of four main components:

1. ParquetSource API (source.rs)

  • Added reverse_scan: bool field to ParquetSource
  • Added with_reverse_scan() and reverse_scan() methods
  • The flag is propagated through the file scan configuration

2. ParquetOpener (opener.rs)

  • Added reverse_scan: bool field
  • Row Group Reversal: Before building the stream, row group indices are reversed: row_group_indexes.reverse()
  • Stream Selection: Based on reverse_scan flag, creates either:
    • Normal stream: RecordBatchStreamAdapter
    • Reverse stream: ReversedParquetStream with row-group-level buffering

3. ReversedParquetStream (opener.rs)

A custom stream implementation that performs two-stage reversal with optional limit support:

Stage 1 - Row Reversal: Reverse rows within each batch using Arrow's take kernel

// Build descending indices [n-1, n-2, ..., 0] and gather with the take kernel
let indices = UInt32Array::from_iter_values((0..num_rows as u32).rev());
let reversed = take(column, &indices, None)?;

Stage 2 - Batch Reversal: Reverse the order of batches within each row group

reversed_batches.into_iter().rev()
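The two stages can be modeled in plain Rust, with a Vec<i64> standing in for a record batch (a toy sketch, not the Arrow-based implementation):

```rust
/// Toy model of the two-stage reversal for one buffered row group:
/// Stage 1 reverses rows inside each batch (the take-kernel step),
/// Stage 2 reverses the order of the batches themselves.
fn reverse_row_group(batches: Vec<Vec<i64>>) -> Vec<Vec<i64>> {
    let stage1: Vec<Vec<i64>> = batches
        .into_iter()
        .map(|batch| batch.into_iter().rev().collect())
        .collect();
    stage1.into_iter().rev().collect()
}

fn main() {
    // One row group decoded as two ascending batches.
    let group = vec![vec![1, 2, 3], vec![4, 5]];
    // The whole group comes out in descending order, batch by batch.
    assert_eq!(reverse_row_group(group), vec![vec![5, 4], vec![3, 2, 1]]);
}
```

Both stages are needed: reversing only the batch order would leave rows ascending inside each batch, and reversing only rows would leave the batches in ascending block order.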

Key Properties:

  • Bounded Memory: Buffers at most one row group at a time (typically ~128MB with default Parquet writer settings)
  • Progressive Output: Outputs reversed batches immediately after each row group completes
  • Limit Support: Unified implementation handles both limited and unlimited scans
    • With limit: Stops processing when limit is reached, avoiding unnecessary I/O
    • Without limit: Processes entire file in reverse order
  • Metrics: Tracks row_groups_reversed, batches_reversed, and reverse_time

4. Physical Optimizer (reverse_order.rs)

  • New ReverseOrder optimization rule
  • Detects patterns where reversing the input satisfies sort requirements:
    • SortExec with reversible input ordering
    • GlobalLimitExec -> SortExec patterns (most beneficial case)
  • Uses TreeNodeRewriter to push reverse flag down to ParquetSource
  • Single-partition check: Only pushes limit to single-partition DataSourceExec to avoid correctness issues with multi-partition scans
  • Preserves correctness by checking:
    • Input ordering compatibility
    • Required input ordering constraints
    • Ordering preservation properties
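The patterns the rule targets can be illustrated with a toy plan tree (a hypothetical enum, not DataFusion's actual ExecutionPlan types; the correctness checks listed above are assumed to have passed):

```rust
/// Minimal stand-in for a physical plan tree.
#[derive(Debug, PartialEq)]
enum Plan {
    Sort(Box<Plan>),
    Limit(usize, Box<Plan>),
    Scan { reverse: bool, limit: Option<usize> },
}

/// Rewrite Limit(Sort(Scan)) and Sort(Scan) into a reversed scan, assuming
/// the sort is exactly the inverse of the scan's declared output ordering.
fn reverse_order(plan: Plan) -> Plan {
    match plan {
        // Most beneficial case: push both the reversal and the limit down.
        Plan::Limit(n, inner) => match *inner {
            Plan::Sort(sort_input) => match *sort_input {
                Plan::Scan { .. } => Plan::Scan { reverse: true, limit: Some(n) },
                other => Plan::Limit(n, Box::new(Plan::Sort(Box::new(other)))),
            },
            other => Plan::Limit(n, Box::new(other)),
        },
        // Plain sort over a reversible scan: drop the sort entirely.
        Plan::Sort(inner) => match *inner {
            Plan::Scan { .. } => Plan::Scan { reverse: true, limit: None },
            other => Plan::Sort(Box::new(other)),
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::Limit(
        100,
        Box::new(Plan::Sort(Box::new(Plan::Scan { reverse: false, limit: None }))),
    );
    assert_eq!(
        reverse_order(plan),
        Plan::Scan { reverse: true, limit: Some(100) }
    );
}
```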

Why Row-Group-Level Buffering?

Row group buffering is necessary for correctness:

  1. Parquet Structure: Files are organized into independent row groups (typically ~128MB with default settings)
  2. Batch Boundaries: The parquet reader's batches may not align with row group boundaries
  3. Correct Ordering: We must ensure complete row groups are reversed to maintain semantic correctness

This is the minimal buffering granularity that ensures correct results while still being compatible with arrow-rs's existing parquet reader architecture.

Memory Characteristics:

  • Maximum memory: Size of largest row group (typically ~128MB with default Parquet writer settings)
  • Not O(file_size), but O(row_group_size)
  • Acceptable trade-off for elimination of full-file sort operation


Future Optimization: Page-level reverse scanning in arrow-rs could further reduce memory usage and improve latency by eliminating row-group buffering entirely.

What changes are included in this PR?

  1. Core Implementation:

    • ParquetSource: Added reverse scan flag and methods
    • ParquetOpener: Row group reversal and stream creation logic
    • ReversedParquetStream: Unified stream implementation with optional limit support
  2. Physical Optimization:

    • ReverseOrder: New optimizer rule for detecting and applying reverse scan optimization
    • Pattern matching for SortExec and GlobalLimitExec -> SortExec
    • Single-partition validation to ensure optimization is beneficial
  3. Configuration:

    • Added enable_reverse_scan config option (default: true)
    • SQL and Rust API support
  4. Metrics:

    • row_groups_reversed: Count of reversed row groups
    • batches_reversed: Count of reversed batches
    • reverse_time: Time spent reversing data

Are these changes tested?

Yes, comprehensive tests added:

Unit Tests (opener.rs):

  • Single batch reversal
  • Multiple batch reversal
  • Multiple row group handling
  • Limit enforcement
  • Null value handling
  • ParquetSource flag propagation

Integration Tests (reverse_order.rs):

  • Sort removal optimization
  • Limit + Sort pattern optimization
  • Multi-partition handling (partial optimization with merge)
  • Nested sort patterns
  • Edge cases (empty exec, multiple columns, etc.)

SQL Logic Tests (.slt files):

  • End-to-end query validation
  • Single-partition reverse scan with multiple row groups
  • Multi-partition reverse scan with file reversal
  • Uneven partition handling
  • Performance comparisons
  • Correctness verification across various scenarios

Are there any user-facing changes?

New Configuration Option:

  • datafusion.execution.parquet.enable_reverse_scan (default: true)

Behavioral Changes:

  • Queries with ORDER BY ... DESC LIMIT N on sorted single-partition Parquet files will automatically use reverse scanning when beneficial
  • Multi-partition queries benefit from per-partition reverse scanning, though merge is still required
  • No changes to query results - only performance improvements
  • New metrics available in query execution metrics

Breaking Changes:

  • None. This is a purely additive optimization that maintains backward compatibility.

@github-actions github-actions bot added optimizer Optimizer rules datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt) physical-expr Changes to the physical-expr crates labels Nov 19, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Nov 21, 2025
@zhuqi-lucas zhuqi-lucas changed the title Draft Reverse parquet Support reverse Parquet Scan and fast parquet order inversion at row group level Nov 21, 2025
@zhuqi-lucas zhuqi-lucas changed the title Support reverse Parquet Scan and fast parquet order inversion at row group level Support reverse parquet scan and fast parquet order inversion at row group level Nov 21, 2025
@github-actions github-actions bot added common Related to common crate execution Related to the execution crate proto Related to proto crate labels Nov 22, 2025
@xudong963 (Member)

Also cc @suremarc: we're finally contributing our reversed-parquet optimization upstream; I guess you may be interested in seeing it.

zhuqi-lucas commented Nov 22, 2025

Thank you @xudong963 @suremarc. I made a lot of changes in this PR compared to our internal implementation, but in general the major design is similar to our internal version: the row-group-level reverse. We still need follow-up PRs to improve it further; for example, we should support custom output-order sources so it can be integrated with an ordered-partition source, etc.


alamb commented Nov 23, 2025

Thanks -- I'll try and review this tomorrow

@zhuqi-lucas (Contributor, Author)

> Thanks -- I'll try and review this tomorrow

Thank you @alamb !

/// are read in reverse order to eliminate sort operations.
/// Note: This buffers one row group at a time (typically ~128MB).
/// Default: true
pub enable_reverse_scan: bool, default = true
@zhuqi-lucas (Contributor, Author)

Note: I default this to true for the reverse optimization; we can default to false if you think it's risky for some cases.

The key risk is memory overhead: because the reversal happens at the row-group level, we must cache each row group's batches, so if the configured max row group size is large, memory usage will be high.

}

/// Remove unnecessary sort based on the logic from EnforceSorting::analyze_immediate_sort_removal
fn remove_unnecessary_sort(
@zhuqi-lucas zhuqi-lucas Nov 23, 2025

Note: I added this to the reverse-order rule because, after reversing the order, we can optimize further and remove the sort, so we don't need to run the enforce-sorting optimization again after this one.

@xudong963 (Member)

How about putting pushdown_sort before enforce_sorting?

@zhuqi-lucas (Contributor, Author)

Thanks @xudong963, I tried this before, but it seemed to cause problems for other optimizer rules; I can test again.

The best solution may be to not need a new pushdown_sort optimizer at all, and instead enhance the existing optimizer to support this. I will try that later.

----
physical_plan
01)SortExec: TopK(fetch=3), expr=[number@0 ASC NULLS LAST], preserve_partitioning=[false]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/topk/partial_sorted/1.parquet]]}, projection=[number, letter, age], output_ordering=[number@0 DESC, letter@1 ASC NULLS LAST], file_type=parquet, predicate=DynamicFilter [ empty ]
@zhuqi-lucas (Contributor, Author)

The reverse scan means no sort is needed here, and it's very fast.

@2010YOUY01 (Contributor)

Supporting scanning Parquet files in reverse order is an absolutely great idea. I have a few questions.

Let me first rephrase it to make sure I understand correctly, this PR does:

  1. For applicable query patterns (topK that has reverse order to the parquet existing order), reverse the row-group scanning order
  2. For each row group, first cache all the result, then reverse the row-level order batch by batch.

This implementation is quite aggressive; I think it could get tricky to tune right, to avoid excessive caching or batch-by-batch row reversal becoming too expensive.

What if we limit the initial implementation to only reversing the row-group order, similar to what @adriangb is planning to do at the file level in #17271? After scanning the last row group, the topk dynamic filter will automatically get updated and skip the preceding row groups.

  • The benefits are simplicity and lower risk of regressions
  • The downside is it's too conservative and can't get the optimal performance. But once we have native reverse parquet decoding support in arrow-rs (that is described in the original issue Fast parquet order inversion #17172), we can implement the reverse scan at the row level as follow-ups.


zhuqi-lucas commented Nov 24, 2025

> Supporting scanning Parquet files in reverse order is an absolutely great idea. I have a few questions.
>
> Let me first rephrase it to make sure I understand correctly, this PR does:
>
>   1. For applicable query patterns (topK that has reverse order to the parquet existing order), reverse the row-group scanning order
>   2. For each row group, first cache all the result, then reverse the row-level order batch by batch.
>
> This implementation is quite aggressive, I think it can get a bit tricky to tune it right, to avoid excessive caching, or reversing rows batch by batch become too expensive.
>
> What if we limit the initial implementation only to reverse the row-group order, similar to what @adriangb is planning to do at file level in #17271 After scanning the last row-group, the topk dynamic filter will automatically get updated and skip the preceding row groups.
>
>   • The benefits are simplicity and lower risk of regressions
>   • The downside is it's too conservative and can't get the optimal performance. But once we have native reverse parquet decoding support in arrow-rs (that is described in the original issue Fast parquet order inversion #17172), we can implement the reverse scan at the row level as follow-ups.

Thank you @2010YOUY01 for the review and the valid concern:
You raise a valid concern about memory overhead, which is the key risk I mentioned for this approach.
However, I want to clarify that row group reversal alone cannot eliminate the SortExec - it only provides TopK filtering benefits. Without reversing rows within each row group, the data remains in the original order (e.g., ASC when we need DESC), so the sort must stay. I propose we keep the complete optimization but default enable_reverse_scan to false. Once we implement page-level caching in arrow-rs (which will reduce memory overhead significantly), we can consider enabling it by default.

We've been running the full implementation (row-group + row-level reversal) in production for a long time with excellent results: 10-100x speedups for time-series queries and well-controlled memory usage (~one row group cached at a time). Note that the row group size should not be set too large when this feature is enabled. With a very small limit, the period of high memory usage is short, and the reversal time is small compared to the benefit of removing the sort entirely.

If we want to improve the original scan so that the initial implementation only reverses the row-group order, I think we can do that in follow-up PRs: it is a different optimization that cannot remove the sort, so it belongs in another PR.

Regarding native arrow-rs support for page-level reversal:

As discussed in arrow-rs#3922, implementing true page-level reverse decoding is
technically challenging due to:

  1. Most encodings (except PLAIN) use length-prefixed blocks that can't be decoded backwards
  2. Dremel record shredding for nested types is order-sensitive
  3. Requires Offset Index (Parquet 2.0+) to locate pages

While arrow-rs may eventually support this (as proposed in #17172), it requires significant work. Our current implementation (row-group-level caching) is the most practical solution available today and has been proven in production over a long period.

Once arrow-rs implements native page-level reversal, we can migrate to it without changing the DataFusion API, but that is likely a long way off.

What's your opinion?

@adriangb (Contributor)

I haven't looked into all of this discussion and code (I just got tagged). I've been looking into optimizing sorted scanning in DataFusion and IMO where we should land is:

  1. Via metadata (FileScanConfig / ORDERED BY ... in SQL) users declare a known sort order of their files.
  2. The planner uses statistics from the files + any ORDER BY clauses in the query to arrange file ordering to best match the query. The FileSource implementation can also receive the ORDER BY information and optimize scan order within a file (e.g. reversing the order of reads which is what I think this PR is doing).
  3. If the planner is able to deduce from file level stats that the files can be ordered and the FileSource reports that it is able to produce batches in sorted order then the optimizer can optimize away the sort completely.

I hope that is helpful.


zhuqi-lucas commented Nov 24, 2025

> I haven't looked into all of this discussion and code (I just got tagged). I've been looking into optimizing sorted scanning in DataFusion and IMO where we should land is:
>
>   1. Via metadata (FileScanConfig / ORDERED BY ... in SQL) users declare a known sort order of their files.
>   2. The planner uses statistics from the files + any ORDER BY clauses in the query to arrange file ordering to best match the query. The FileSource implementation can also receive the ORDER BY information and optimize scan order within a file (e.g. reversing the order of reads which is what I think this PR is doing).
>   3. If the planner is able to deduce from file level stats that the files can be ordered and the FileSource reports that it is able to produce batches in sorted order then the optimizer can optimize away the sort completely.
>
> I hope that is helpful.

Thank you @adriangb, that's helpful for future optimization.
My current implementation focuses on a specific optimization case: when data is already sorted and we need the reverse order, we can flip the scan direction instead of reading everything and sorting. The reverse_scan in FileSource handles both the file-order and within-file ordering reversal.

I think these approaches are complementary - my PR handles the reverse-scan optimization, while your vision provides a framework for broader sorted-scan optimizations using file-level statistics and metadata. It would be great to build toward that architecture incrementally.

@2010YOUY01 (Contributor)

> Thank you @2010YOUY01 for review and valid concern:
> You raise valid concerns about memory overhead is what i mentioned the key risk for this approach.
> However, I want to clarify that row group reversal alone cannot eliminate the SortExec - it only provides TopK filtering benefits. Without reversing rows within each row group, the data remains in the original order (e.g., ASC when we need DESC), so the sort must stay. I propose we keep the complete optimization but default enable_reverse_scan to false. Once we implement page-level caching in arrow-rs (which will reduce memory overhead significantly), we can consider enabling it by default.

Did you mean 'cannot eliminate the SortExec(TopK)'? Just to confirm: there is no global sort, but it is true that we would have to do a TopK over a whole row group with the naive approach.

My intuition is that for this kind of workload the bottleneck is parquet decoding speed, and an extra TopK won't introduce much additional overhead, so the naive approach could also be pretty fast.

It makes a lot of sense that page/row-level reversal is very hard to implement on the arrow-rs side, so we have to figure out how to do this at the row-group level.

Summary: perhaps we can start by adding a few end-to-end benchmarks that reflect your typical production workload. If this PR's approach shows a clear improvement over the naive approach in #18817 (comment) (I'm happy to do a quick prototype), we should definitely move forward.


zhuqi-lucas commented Nov 24, 2025

> Thank you @2010YOUY01 for review and valid concern:
> You raise valid concerns about memory overhead is what i mentioned the key risk for this approach.
> However, I want to clarify that row group reversal alone cannot eliminate the SortExec - it only provides TopK filtering benefits. Without reversing rows within each row group, the data remains in the original order (e.g., ASC when we need DESC), so the sort must stay. I propose we keep the complete optimization but default enable_reverse_scan to false. Once we implement page-level caching in arrow-rs (which will reduce memory overhead significantly), we can consider enabling it by default.
>
> Did you mean 'cannot eliminate the SortExec(TopK)'? Just to confirm there is no global sort, but it is true that we have do a topK on a whole row group for this naive approach.
>
> I have a intuition that for this kind of workload, the bottleneck is on the parquet decoding speed, and an extra TopK won't introduce much additional overhead, so this naive approach can also get pretty fast.
>
> It makes a lot of sense that it's very hard to implement page/row level reversal in arrow-rs side, so we have to figure out how to do this at row-group level.
>
> Summary: Perhaps we can start by adding a few end-to-end benchmarks that reflect your typical production workload. If this PR’s approach shows a clear improvement over the naive approach in #18817 (comment) (I'm happy to do a quick prototype), we should definitely move forward.

Nice point @2010YOUY01, I agree most of the time will be spent decoding pages. I can change this PR to add a config implementing #18817 (comment), or create another PR for it, so we have more options to compare; I agree the easier solution is better.

And a benchmark would be really helpful, thanks!

@xudong963 (Member)

FYI, I'll start reviewing the PR tomorrow.

@zhuqi-lucas (Contributor, Author)

> FYI, I'll start reviewing the PR tomorrow.

Thanks @xudong963 !


adriangb commented Nov 24, 2025

> I haven't looked into all of this discussion and code (I just got tagged). I've been looking into optimizing sorted scanning in DataFusion and IMO where we should land is:
>
>   1. Via metadata (FileScanConfig / ORDERED BY ... in SQL) users declare a known sort order of their files.
>   2. The planner uses statistics from the files + any ORDER BY clauses in the query to arrange file ordering to best match the query. The FileSource implementation can also receive the ORDER BY information and optimize scan order within a file (e.g. reversing the order of reads which is what I think this PR is doing).
>   3. If the planner is able to deduce from file level stats that the files can be ordered and the FileSource reports that it is able to produce batches in sorted order then the optimizer can optimize away the sort completely.
>
> I hope that is helpful.
>
> Thank you @adriangb , it's helpful for future optimization. My current implementation focuses on a specific optimization case: when data is already sorted and we need the reverse order, we can flip the scan direction instead of reading everything and sorting. The reverse_scan in FileSource handles the files/ and within-file ordering reversal.
>
> I think these approaches are complementary - my PR handles the reverse scan optimization, while your vision provides a framework for broader sorted-scan optimizations using file-level statistics and metadata. Would be great to build toward that architecture incrementally.

My point is that instead of enable_reverse_scan: bool we might want to consider a more holistic approach e.g. try_pushdown_sort(&self, order: LexOrdering) -> Result<Option<Arc<dyn ExecutionPlan>>> either at the ExecutionPlan level or at the DataSource level.

I'm not opposed to this as a step towards that but I'm not sure how helpful it is. Seeing something more concrete w.r.t. how this interacts with the bigger picture would be helpful IMO.
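One possible shape for such an API, sketched with simplified stand-in types (a SortOrder enum in place of DataFusion's LexOrdering, a bare DataSource trait in place of the real one; entirely hypothetical):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum SortOrder { Asc, Desc }

/// Hypothetical, simplified source trait illustrating sort pushdown.
trait DataSource {
    fn output_order(&self) -> Option<SortOrder>;
    /// Return a new source that natively produces `order`, or None if the
    /// source cannot satisfy it (the optimizer then keeps the SortExec).
    fn try_pushdown_sort(&self, order: SortOrder) -> Option<Box<dyn DataSource>>;
}

struct ParquetScan { reverse: bool }

impl DataSource for ParquetScan {
    fn output_order(&self) -> Option<SortOrder> {
        Some(if self.reverse { SortOrder::Desc } else { SortOrder::Asc })
    }
    fn try_pushdown_sort(&self, order: SortOrder) -> Option<Box<dyn DataSource>> {
        // Reverse scan is one pushdown policy: flip the read direction when
        // the requested order is the inverse of the native one.
        let reverse = if self.output_order() == Some(order) {
            self.reverse
        } else {
            !self.reverse
        };
        Some(Box::new(ParquetScan { reverse }))
    }
}

fn main() {
    let scan = ParquetScan { reverse: false };
    let pushed = scan.try_pushdown_sort(SortOrder::Desc).unwrap();
    assert_eq!(pushed.output_order(), Some(SortOrder::Desc));
}
```

Under this shape, enable_reverse_scan would merely gate whether the ParquetScan implementation attempts the flip, rather than being the API itself.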


zhuqi-lucas commented Nov 24, 2025

> I haven't looked into all of this discussion and code (I just got tagged). I've been looking into optimizing sorted scanning in DataFusion and IMO where we should land is:
>
>   1. Via metadata (FileScanConfig / ORDERED BY ... in SQL) users declare a known sort order of their files.
>   2. The planner uses statistics from the files + any ORDER BY clauses in the query to arrange file ordering to best match the query. The FileSource implementation can also receive the ORDER BY information and optimize scan order within a file (e.g. reversing the order of reads which is what I think this PR is doing).
>   3. If the planner is able to deduce from file level stats that the files can be ordered and the FileSource reports that it is able to produce batches in sorted order then the optimizer can optimize away the sort completely.
>
> I hope that is helpful.
>
> Thank you @adriangb , it's helpful for future optimization. My current implementation focuses on a specific optimization case: when data is already sorted and we need the reverse order, we can flip the scan direction instead of reading everything and sorting. The reverse_scan in FileSource handles the files/ and within-file ordering reversal.
> I think these approaches are complementary - my PR handles the reverse scan optimization, while your vision provides a framework for broader sorted-scan optimizations using file-level statistics and metadata. Would be great to build toward that architecture incrementally.
>
> My point is that instead of enable_reverse_scan: bool we might want to consider a more holistic approach e.g. try_pushdown_sort(&self, order: LexOrdering) -> Result<Option<Arc<dyn ExecutionPlan>>> either at the ExecutionPlan level or at the DataSource level.
>
> I'm not opposed to this as a step towards that but I'm not sure how helpful it is. Seeing something more concrete w.r.t. how this interacts with the bigger picture would be helpful IMO.

High-level sort pushdown is a great idea @adriangb, and reverse scan is one of its policies. I will refactor this PR to use this approach, thanks!

Update: I have already changed to high-level sort pushdown in this commit:
45089c5

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 24, 2025
// Successfully pushed down sort, now handle the limit
let total_fetch = limit_exec.skip() + limit_exec.fetch().unwrap_or(0);

// Try to push limit down as well if the source supports it
@xudong963 xudong963 Nov 25, 2025

I think the current limit_pushdown physical optimizer rule can do this. So do we still need to distinguish between the sort and limit + sort patterns?

@zhuqi-lucas (Contributor, Author)

I added this logic here because I found that, due to the optimizer ordering, some rules would otherwise have to run more than once if this logic were removed.

So I will try to fold our rule into the existing optimizer.

/// Try to create a new execution plan that satisfies the given sort ordering.
///
/// Default implementation returns `Ok(None)`.
fn try_pushdown_sort(
@xudong963 (Member)

Do we need to add this API to ExecutionPlan? Is it possible to push the sort down within the sort-pushdown optimizer rule itself? Since we already traverse the plan in the rule, it looks possible to find the target node and give it the order directly.

@zhuqi-lucas (Contributor, Author)

This is a good point.


alamb commented Nov 25, 2025

I plan to review this PR carefully tomorrow
