Record sort order when writing Parquet with WITH ORDER #19595
+185
−4
Which issue does this PR close?
Part of #19433
Rationale for this change
When writing data to a table created with `CREATE EXTERNAL TABLE ... WITH ORDER`, the sorting columns should be recorded in the Parquet file's row group metadata. This allows downstream readers to know the data is sorted and potentially skip sorting operations.
What changes are included in this PR?
- Added `sort_expr_to_sorting_column()` and `lex_ordering_to_sorting_columns()` functions in `metadata.rs` to convert DataFusion ordering to Parquet `SortingColumn`
- Added a `sorting_columns` field to `ParquetSink` with a `with_sorting_columns()` builder method
- Updated `create_writer_physical_plan()` to pass order requirements to `ParquetSink`
- Updated `create_writer_props()` to set sorting columns on `WriterProperties`
- Added a test verifying that `sorting_columns` metadata is written correctly
Are these changes tested?
Yes, added `test_create_table_with_order_writes_sorting_columns` that:
- Creates a table with `WITH ORDER (a ASC NULLS FIRST, b DESC NULLS LAST)`
- Verifies that the `sorting_columns` metadata matches the expected order
Are there any user-facing changes?
No user-facing API changes. Parquet files written via `INSERT INTO` or `COPY` for tables with `WITH ORDER` will now contain `sorting_columns` metadata in the row group.
🤖 Generated with Claude Code
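For reviewers who want a feel for the ordering-to-metadata mapping described above, here is a minimal, dependency-free sketch. It is not the code in this PR: `SortExpr`, `to_sorting_column`, and `to_sorting_columns` are hypothetical stand-ins, and the local `SortingColumn` struct only mirrors the three fields of the Parquet metadata entry (column index, descending flag, nulls-first flag). The assumption is that each sort expression refers to a plain column that can be resolved by name against the file schema.

```rust
use std::convert::TryFrom;

/// Local stand-in mirroring the fields of Parquet's `SortingColumn` entry.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

/// Hypothetical, simplified sort expression: a column name plus sort options.
#[derive(Debug, Clone)]
struct SortExpr {
    column: String,
    descending: bool,
    nulls_first: bool,
}

/// Resolve one sort expression against the schema (a list of column names)
/// and build the matching metadata entry. Fails if the column is unknown.
fn to_sorting_column(expr: &SortExpr, schema: &[&str]) -> Result<SortingColumn, String> {
    let idx = schema
        .iter()
        .position(|name| *name == expr.column)
        .ok_or_else(|| format!("column '{}' not found in schema", expr.column))?;
    // The metadata stores the index as a 32-bit integer.
    let column_idx = i32::try_from(idx).map_err(|e| e.to_string())?;
    Ok(SortingColumn {
        column_idx,
        descending: expr.descending,
        nulls_first: expr.nulls_first,
    })
}

/// Convert a full lexicographic ordering, preserving the order of the keys.
/// The first error (if any) aborts the conversion.
fn to_sorting_columns(ordering: &[SortExpr], schema: &[&str]) -> Result<Vec<SortingColumn>, String> {
    ordering
        .iter()
        .map(|expr| to_sorting_column(expr, schema))
        .collect()
}
```

Under these assumptions, an ordering equivalent to `WITH ORDER (a ASC NULLS FIRST, b DESC NULLS LAST)` over a schema `[a, b, c]` maps to two entries: index 0 with `descending = false, nulls_first = true`, then index 1 with `descending = true, nulls_first = false`. Returning an error for an unresolvable column is a choice made for this sketch only; the PR may handle that case differently.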