@adriangb commented Jan 1, 2026

Which issue does this PR close?

Part of #19433

Rationale for this change

When writing data to a table created with CREATE EXTERNAL TABLE ... WITH ORDER, the sorting columns should be recorded in the Parquet file's row group metadata. This allows downstream readers to know the data is sorted and potentially skip sorting operations.
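
As an illustration of how a downstream reader might use this metadata, here is a minimal, std-only sketch. The `SortingColumn` struct is a stand-in that mirrors the three fields of Parquet's `SortingColumn` (`column_idx`, `descending`, `nulls_first`); the function `all_row_groups_satisfy` is hypothetical and not part of this PR. It only checks that every row group records an ordering that begins with the required one. Since the metadata describes rows within a row group, a real reader would presumably need additional guarantees (for example, about ordering across row groups and files) before dropping a sort.

```rust
/// Stand-in mirroring the fields of Parquet's `SortingColumn` (assumption:
/// a real reader would use the type from the `parquet` crate instead).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical helper: returns true only if there is at least one row group
/// and every row group records sorting columns that start with `required`.
/// A row group without sorting metadata (`None`) never satisfies the check.
pub fn all_row_groups_satisfy(
    row_groups: &[Option<Vec<SortingColumn>>],
    required: &[SortingColumn],
) -> bool {
    !row_groups.is_empty()
        && row_groups.iter().all(|rg| match rg {
            // A recorded ordering on (a, b) also satisfies a requirement on (a).
            Some(recorded) => recorded.starts_with(required),
            None => false,
        })
}
```

The prefix comparison reflects the usual lexicographic-ordering rule: data sorted by `(a, b)` is also sorted by `(a)`, but not necessarily by `(b)`.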

What changes are included in this PR?

  • Add sort_expr_to_sorting_column() and lex_ordering_to_sorting_columns() functions in metadata.rs to convert DataFusion ordering to Parquet SortingColumn
  • Add sorting_columns field to ParquetSink with with_sorting_columns() builder method
  • Update create_writer_physical_plan() to pass order requirements to ParquetSink
  • Update create_writer_props() to set sorting columns on WriterProperties
  • Add test verifying sorting_columns metadata is written correctly
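
The conversion described in the first bullet can be sketched roughly as follows. This is a simplified, std-only illustration, not the code from this PR: `SortExpr` is a hypothetical stand-in for a DataFusion sort expression, the schema is reduced to a slice of column names, and the function signatures are assumptions. `SortingColumn` mirrors the fields of the Parquet type of the same name.

```rust
/// Stand-in mirroring the fields of Parquet's `SortingColumn`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical, simplified sort expression: a column name plus sort options.
#[derive(Debug, Clone)]
pub struct SortExpr {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Resolve one sort expression to a column index in `schema` and carry over
/// its direction and null placement. Fails if the column is not in the schema.
pub fn sort_expr_to_sorting_column(
    expr: &SortExpr,
    schema: &[&str],
) -> Result<SortingColumn, String> {
    let idx = schema
        .iter()
        .position(|name| *name == expr.column.as_str())
        .ok_or_else(|| format!("column '{}' not found in schema", expr.column))?;
    let column_idx = i32::try_from(idx)
        .map_err(|_| format!("column index {idx} does not fit in i32"))?;
    Ok(SortingColumn {
        column_idx,
        descending: expr.descending,
        nulls_first: expr.nulls_first,
    })
}

/// Convert a whole lexicographic ordering, stopping at the first error.
pub fn lex_ordering_to_sorting_columns(
    ordering: &[SortExpr],
    schema: &[&str],
) -> Result<Vec<SortingColumn>, String> {
    ordering
        .iter()
        .map(|expr| sort_expr_to_sorting_column(expr, schema))
        .collect()
}
```

In this sketch an unresolvable column is reported as an error; whether the actual implementation fails or silently writes no sorting metadata in that case is not stated in this description.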

Are these changes tested?

Yes, added test_create_table_with_order_writes_sorting_columns that:

  1. Creates an external table with WITH ORDER (a ASC NULLS FIRST, b DESC NULLS LAST)
  2. Inserts data
  3. Reads the Parquet file and verifies the sorting_columns metadata matches the expected order
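
The verification in step 3 amounts to comparing each row group's recorded sorting columns against the expected ones. The following std-only sketch shows that comparison with stand-in types; it is not the test from this PR. It assumes `a` and `b` are the first two columns of the table (indices 0 and 1), and `expected_sorting_columns` and `first_mismatch` are hypothetical helper names.

```rust
/// Stand-in mirroring the fields of Parquet's `SortingColumn`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Expected metadata for `WITH ORDER (a ASC NULLS FIRST, b DESC NULLS LAST)`,
/// assuming `a` is column 0 and `b` is column 1.
pub fn expected_sorting_columns() -> Vec<SortingColumn> {
    vec![
        SortingColumn { column_idx: 0, descending: false, nulls_first: true },
        SortingColumn { column_idx: 1, descending: true, nulls_first: false },
    ]
}

/// Returns the index of the first row group whose sorting metadata is missing
/// or differs from `expected`, or `None` if every row group matches exactly.
pub fn first_mismatch(
    row_groups: &[Option<Vec<SortingColumn>>],
    expected: &[SortingColumn],
) -> Option<usize> {
    row_groups
        .iter()
        .position(|rg| rg.as_deref() != Some(expected))
}
```

A test built on this would collect the per-row-group metadata from the written file and assert that `first_mismatch` returns `None`.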

Are there any user-facing changes?

No user-facing API changes. Parquet files written via INSERT INTO or COPY for tables with WITH ORDER will now contain sorting_columns metadata in the row group.

🤖 Generated with Claude Code

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Jan 1, 2026

When writing data to a table created with `CREATE EXTERNAL TABLE ... WITH ORDER`,
this change records the sorting columns in the Parquet file's row group metadata.

Changes:
- Add `sort_expr_to_sorting_column()` and `lex_ordering_to_sorting_columns()`
  functions in metadata.rs to convert DataFusion ordering to Parquet SortingColumn
- Add `sorting_columns` field to ParquetSink with `with_sorting_columns()` builder
- Update `create_writer_physical_plan()` to pass order requirements to ParquetSink
- Update `create_writer_props()` to set sorting columns on WriterProperties
- Add test verifying sorting_columns metadata is written correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Copilot AI left a comment

Pull request overview

This PR implements the recording of sort order metadata in Parquet files when writing data with WITH ORDER clauses. When an external table is created with an ordering specification, subsequent INSERT INTO or COPY operations will now embed sorting column information in the Parquet row group metadata, enabling downstream readers to potentially skip redundant sort operations.

  • Adds conversion functions to translate DataFusion ordering expressions to Parquet SortingColumn metadata
  • Updates ParquetSink to accept and propagate sorting column information through the writer pipeline
  • Includes comprehensive test coverage to verify metadata is correctly written

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| datafusion/datasource-parquet/src/metadata.rs | Adds sort_expr_to_sorting_column() and lex_ordering_to_sorting_columns() helper functions to convert DataFusion ordering to Parquet sorting metadata |
| datafusion/datasource-parquet/src/file_format.rs | Integrates sorting column conversion into create_writer_physical_plan() and updates ParquetSink with builder pattern support for sorting columns; modifies create_writer_props() to set sorting columns on writer properties |
| datafusion/core/tests/parquet/ordering.rs | Adds new test file with test_create_table_with_order_writes_sorting_columns to verify sorting metadata is correctly written to Parquet files |
| datafusion/core/tests/parquet/mod.rs | Registers the new ordering test module |
