perf: Use file_metadata_cache in geoparquet #294
petern48 wants to merge 2 commits into apache:main
Conversation
Well, that was suspiciously simple...
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
with_file_metadata_cache() is called for each iteration of the loop (.map()), so we need a clone for each separate iteration. get_file_metadata_cache() already returns a cloned Arc, so there's no need to call another .clone().
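To make the pattern concrete, here is a minimal, self-contained sketch of the per-iteration Arc clone. The types below (FileMetadataCache, CacheManager, FileOpenerBuilder) are hypothetical stand-ins, not the actual sedona-geoparquet or DataFusion definitions; only the cloning pattern is the point.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the real types; simplified for illustration.
struct FileMetadataCache;

struct CacheManager {
    cache: Arc<FileMetadataCache>,
}

impl CacheManager {
    // Mirrors the accessor used in this PR: each call hands back a fresh
    // Arc clone, so callers inside a loop don't need an extra .clone().
    fn get_file_metadata_cache(&self) -> Arc<FileMetadataCache> {
        Arc::clone(&self.cache)
    }
}

#[derive(Default)]
struct FileOpenerBuilder {
    metadata_cache: Option<Arc<FileMetadataCache>>,
}

impl FileOpenerBuilder {
    fn with_file_metadata_cache(mut self, cache: Arc<FileMetadataCache>) -> Self {
        self.metadata_cache = Some(cache);
        self
    }
}

fn main() {
    let manager = CacheManager {
        cache: Arc::new(FileMetadataCache),
    };

    // One builder per file: each .map() iteration gets its own Arc handle
    // straight from the accessor, with no additional .clone() needed.
    let openers: Vec<FileOpenerBuilder> = (0..3)
        .map(|_| {
            FileOpenerBuilder::default()
                .with_file_metadata_cache(manager.get_file_metadata_cache())
        })
        .collect();

    assert_eq!(openers.len(), 3);
}
```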
It would be nice to see how effective this is with some benchmarks involving GeoParquet reading. Those don't seem to exist yet, do they?
paleolimbot left a comment
Thank you!
Can you also apply the with_file_metadata_cache() change to the line here?
sedona-db/rust/sedona-geoparquet/src/file_opener.rs
Lines 116 to 120 in f49016d
This seems a bit slower than both main and 0.1 at the moment, and I don't see an impact on multiple calls to read_parquet(). Querying a big table with lots of Parquet files is probably a good way to check (though it might be better to list files individually for this particular benchmark, since there's also a cost to querying S3 to list the files).
import os

os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"

import sedona.db

sd = sedona.db.connect()

# 16s on main, 22s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings")

# Second time: 16s on main, 19s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

From https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/ , I wonder if there's something in the ListingTable we need to port over to SedonaContext::read_parquet() as well (or something we need to propagate in the FileConfig).
Seems like you're right. Here's what I got in terms of numbers (4 trials each, before adding the extra changes). I need more time to investigate.
Found this neat DataFusion feature for checking the metadata cache. I'll give it a try once I find the time to circle back to this.
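For reference, a hedged sketch of one way to inspect the cache from Rust: read the cache's entry count before and after a read. It assumes CacheManager::get_file_metadata_cache() (the accessor this PR uses) and a len() method inherited from datafusion-execution's CacheAccessor trait, plus a locally available Parquet file; verify both against the DataFusion version in use, and note that whether a plain read populates the cache depends on configuration.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Assumption: same accessor as in this PR; len() counts cached
    // footer entries via the CacheAccessor supertrait.
    let cache = ctx.runtime_env().cache_manager.get_file_metadata_cache();
    println!("cached entries before read: {}", cache.len());

    // "example.parquet" is a placeholder path.
    let df = ctx
        .read_parquet("example.parquet", ParquetReadOptions::default())
        .await?;
    df.collect().await?;

    println!("cached entries after read: {}", cache.len());
    Ok(())
}
```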
I tried this before and ran into some complications when trying to pass the state variable I need there. DataFusion's opener doesn't seem to use the metadata cache state either. I haven't yet looked deeper into what they're doing, but it's probably relevant. (As a note to myself) I next need to look into what's happening here with
I just noticed that DataFusion 51 updated the default metadata cache hint. It may be worth trying some of this with sd.sql("SET datafusion.execution.parquet.metadata_size_hint = 524288") |
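As background (an assumption worth verifying against the DataFusion docs for the version in use): metadata_size_hint is the number of trailing bytes speculatively fetched per Parquet file, so a well-sized hint lets the reader pull the whole footer in one request instead of first fetching the 8-byte footer tail to learn the metadata length. A sketch of the equivalent setting from Rust, assuming SET statements are accepted through SessionContext::sql():

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // 524288 = 512 KiB fetched from the end of each Parquet file.
    ctx.sql("SET datafusion.execution.parquet.metadata_size_hint = 524288")
        .await?
        .collect()
        .await?;
    Ok(())
}
```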
Force-pushed from 804154f to 07cfbbe
Use file metadata cache for geoparquet
closes #250