fix(rust/sedona): Strip schema metadata when input uses RecordBatchReaderProvider #517

Merged
paleolimbot merged 7 commits into apache:main from paleolimbot:strip-metadata-from-arrow-input
Jan 16, 2026

@paleolimbot
Member

Foreign record batch readers and Arrow array streams often carry schema metadata attached by Pandas or R. That metadata can keep DataFusion internals from propagating the schema unchanged, which then fails the assertions that check an optimizer rule recreated an identical schema.

The true fix is somewhere in DataFusion; however, as a workaround we can strip schema metadata on input since we aren't using it anyway.
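
To illustrate the failure mode and the workaround, here is a minimal sketch using only the standard library. `Schema` is a hypothetical, simplified stand-in and not the arrow-rs type; `rebuild_from_fields` is an assumed model of a rewrite that drops metadata, since the exact place where DataFusion alters it is not identified here. The `pandas` metadata key is the one Pandas-backed Arrow data carries; its value is shortened for illustration.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for an Arrow schema: (name, type) pairs
// plus schema-level key/value metadata. This is NOT the arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<(String, String)>,
    metadata: HashMap<String, String>,
}

// Return a copy of the schema with the same fields and no schema-level metadata.
fn strip_schema_metadata(schema: &Schema) -> Schema {
    Schema {
        fields: schema.fields.clone(),
        metadata: HashMap::new(),
    }
}

// The kind of schema a Pandas-backed reader might report: ordinary fields
// plus a "pandas" metadata entry (value shortened for illustration).
fn schema_from_pandas() -> Schema {
    let mut metadata = HashMap::new();
    metadata.insert("pandas".to_string(), "{\"index_columns\": []}".to_string());
    Schema {
        fields: vec![
            ("id".to_string(), "int64".to_string()),
            ("geometry".to_string(), "binary".to_string()),
        ],
        metadata,
    }
}

// Assumed model of a plan rewrite that rebuilds a schema from its fields
// alone and therefore loses any schema-level metadata.
fn rebuild_from_fields(schema: &Schema) -> Schema {
    Schema {
        fields: schema.fields.clone(),
        metadata: HashMap::new(),
    }
}
```

In this model, `rebuild_from_fields(&schema_from_pandas())` no longer equals its input, which is the shape of the schema mismatch. After `strip_schema_metadata`, the rebuilt schema compares equal because there is no metadata left to lose.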

Closes #477

@paleolimbot paleolimbot marked this pull request as ready for review January 14, 2026 21:25
@paleolimbot paleolimbot requested a review from Copilot January 14, 2026 21:25
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a schema metadata issue when using RecordBatchReaderProvider with foreign record batch readers (from Pandas or R). The solution strips schema metadata on input to work around DataFusion's schema propagation issues that cause optimizer rule failures.

Changes:

  • Added a helper function to strip schema metadata from SchemaRef objects
  • Updated RecordBatchReaderProvider constructor to strip metadata from input schemas
  • Added a regression test using GeoPandas data to verify the fix
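
The constructor change listed above can be sketched roughly as follows. Every name here (`Schema`, `BatchReader`, `FixedSchemaReader`, `SketchReaderProvider`, `strip_metadata`) is hypothetical and stands in for the real Arrow and Sedona types, whose actual signatures are not shown in this PR summary. The point is only the pattern: strip once at construction so everything downstream sees a metadata-free schema.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for an Arrow schema (field names + metadata only).
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

type SchemaRef = Arc<Schema>;

// Hypothetical stand-in for a foreign record batch reader.
trait BatchReader {
    fn schema(&self) -> SchemaRef;
}

// A reader that always reports the schema it was given.
struct FixedSchemaReader {
    schema: SchemaRef,
}

impl BatchReader for FixedSchemaReader {
    fn schema(&self) -> SchemaRef {
        Arc::clone(&self.schema)
    }
}

// Helper: same fields, empty schema-level metadata.
fn strip_metadata(schema: &SchemaRef) -> SchemaRef {
    Arc::new(Schema {
        fields: schema.fields.clone(),
        metadata: HashMap::new(),
    })
}

// Hypothetical provider that exposes the stripped schema to the query engine.
struct SketchReaderProvider {
    reader: Box<dyn BatchReader>,
    schema: SchemaRef,
}

impl SketchReaderProvider {
    // Strip metadata once, at construction time.
    fn new(reader: Box<dyn BatchReader>) -> Self {
        let schema = strip_metadata(&reader.schema());
        Self { reader, schema }
    }

    // Schema seen by the planner: fields only, no metadata.
    fn schema(&self) -> SchemaRef {
        Arc::clone(&self.schema)
    }

    // Schema as originally reported by the wrapped reader (left untouched).
    fn reader_schema(&self) -> SchemaRef {
        self.reader.schema()
    }
}
```

Stripping in the constructor rather than at each use keeps the planner-facing schema stable for the lifetime of the provider, while the wrapped reader itself is left unmodified.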

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
rust/sedona/src/record_batch_reader_provider.rs | Implements schema metadata stripping logic and updates the constructor to use it
python/sedonadb/tests/test_sjoin.py | Adds regression test for spatial joins with Pandas metadata

paleolimbot and others added 2 commits January 14, 2026 16:29
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@paleolimbot paleolimbot merged commit 2535b29 into apache:main Jan 16, 2026
15 checks passed
@paleolimbot paleolimbot deleted the strip-metadata-from-arrow-input branch January 16, 2026 15:23
Development

Successfully merging this pull request may close these issues.

PhysicalOptimizer rule 'join_selection' failed: Schema mismatch in ST_Intersects query (Python)

2 participants