perf: Use file_metadata_cache in geoparquet #294
petern48 wants to merge 2 commits into apache:main
Conversation
Well, that was suspiciously simple...
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
with_file_metadata_cache() is called for each iteration of the loop (.map()), so we need a clone for each separate iteration. get_file_metadata_cache() already returns a cloned Arc, so there's no need to call another .clone().
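To make the pattern concrete, here is a minimal, self-contained sketch of the per-iteration Arc clone. The types below (FileMetadataCache, CacheManager, FileOpenerBuilder) are hypothetical stand-ins, not the actual sedona-geoparquet or DataFusion definitions; only the cloning pattern is the point.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the real types; simplified for illustration.
struct FileMetadataCache;

struct CacheManager {
    cache: Arc<FileMetadataCache>,
}

impl CacheManager {
    // Mirrors the accessor used in this PR: each call hands back a fresh
    // Arc clone, so callers inside a loop don't need an extra .clone().
    fn get_file_metadata_cache(&self) -> Arc<FileMetadataCache> {
        Arc::clone(&self.cache)
    }
}

#[derive(Default)]
struct FileOpenerBuilder {
    metadata_cache: Option<Arc<FileMetadataCache>>,
}

impl FileOpenerBuilder {
    fn with_file_metadata_cache(mut self, cache: Arc<FileMetadataCache>) -> Self {
        self.metadata_cache = Some(cache);
        self
    }
}

fn main() {
    let manager = CacheManager {
        cache: Arc::new(FileMetadataCache),
    };

    // One builder per file: each .map() iteration gets its own Arc handle
    // straight from the accessor, with no additional .clone() needed.
    let openers: Vec<FileOpenerBuilder> = (0..3)
        .map(|_| {
            FileOpenerBuilder::default()
                .with_file_metadata_cache(manager.get_file_metadata_cache())
        })
        .collect();

    assert_eq!(openers.len(), 3);
}
```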
It would be nice to see how effective this is with some benchmarks involving GeoParquet reading. Those don't seem to exist yet, do they?
paleolimbot left a comment
Thank you!
Can you also apply the with_file_metadata_cache() change to the line here?
sedona-db/rust/sedona-geoparquet/src/file_opener.rs
Lines 116 to 120 in f49016d
This seems a bit slower than both main and 0.1 at the moment, and I don't see an impact on multiple calls to read_parquet(). Querying a big table with lots of Parquet files is probably a good way to check (though it might be better to list files individually for this particular benchmark, since there's also a cost to querying S3 to list the files).
import os

os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"

import sedona.db

sd = sedona.db.connect()

# 16s on main, 22s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings")

# Second time: 16s on main, 19s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

From https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/ , I wonder if there's something in the ListingTable we need to port over to SedonaContext::read_parquet() as well (or something we need to propagate in the FileConfig).
Seems like you're right. Here's what I got in terms of numbers (4 trials each, before adding the extra changes). I need more time to investigate.
Found this neat DataFusion feature for checking the metadata cache. I'll give it a try once I find the time to circle back to this.
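For reference, a hedged sketch of one way to inspect the cache from Rust: read the cache's entry count before and after a read. It assumes CacheManager::get_file_metadata_cache() (the accessor this PR uses) and a len() method inherited from datafusion-execution's CacheAccessor trait, plus a locally available Parquet file; verify both against the DataFusion version in use, and note that whether a plain read populates the cache depends on configuration.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Assumption: same accessor as in this PR; len() counts cached
    // footer entries via the CacheAccessor supertrait.
    let cache = ctx.runtime_env().cache_manager.get_file_metadata_cache();
    println!("cached entries before read: {}", cache.len());

    // "example.parquet" is a placeholder path.
    let df = ctx
        .read_parquet("example.parquet", ParquetReadOptions::default())
        .await?;
    df.collect().await?;

    println!("cached entries after read: {}", cache.len());
    Ok(())
}
```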
I tried this before and ran into some complications when trying to pass the state variable I need there. DataFusion's opener doesn't seem to use the metadata cache state either. I haven't yet looked deeper into what they're doing, but it's probably relevant. (As a note to myself) I next need to look into what's happening here with
I just noticed that DataFusion 51 updated the default metadata cache hint. It may be worth trying some of this with sd.sql("SET datafusion.execution.parquet.metadata_size_hint = 524288") |
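As background (an assumption worth verifying against the DataFusion docs for the version in use): metadata_size_hint is the number of trailing bytes speculatively fetched per Parquet file, so a well-sized hint lets the reader pull the whole footer in one request instead of first fetching the 8-byte footer tail to learn the metadata length. A sketch of the equivalent setting from Rust, assuming SET statements are accepted through SessionContext::sql():

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // 524288 = 512 KiB fetched from the end of each Parquet file.
    ctx.sql("SET datafusion.execution.parquet.metadata_size_hint = 524288")
        .await?
        .collect()
        .await?;
    Ok(())
}
```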
Force-pushed from 804154f to 07cfbbe
Use file metadata cache for geoparquet
closes #250