feat(rust/sedona-geoparquet): Ensure metadata cache is used in GeoParquet wrapper code (#646)
Conversation
Pull request overview
This PR implements metadata caching for GeoParquet reads to reduce redundant metadata fetches from remote object stores. It addresses a performance issue where each GeoParquet query fetched the same metadata multiple times instead of using DataFusion's built-in metadata cache.
Changes:
- Pipes the file metadata cache through to `DFParquetMetadata` operations in both schema inference and file opening
- Adds a `metadata_cache` field to the `GeoParquetFileSource` and `GeoParquetFileOpener` structs, with proper propagation through all transformation methods
- Updates the Overture Buildings documentation to demonstrate cache configuration and modern SedonaDB API patterns
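The propagation requirement in the second bullet can be illustrated with a small Python sketch: when a source struct is rebuilt by a transformation method, the cache handle must be copied along with every other field, or downstream opens silently lose the cache. The class and field names here are hypothetical stand-ins, not the actual Rust API.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class FileSource:
    """Illustrative stand-in for a GeoParquetFileSource-like struct."""
    predicate: Optional[str] = None
    metadata_cache: Optional[dict] = None  # stand-in for a shared cache handle

    def with_predicate(self, predicate: str) -> "FileSource":
        # dataclasses.replace copies all other fields, so the cache
        # handle survives the transformation. The bug class fixed by
        # this PR is a transformation that forgets to carry the field.
        return replace(self, predicate=predicate)

cache = {}
source = FileSource(metadata_cache=cache).with_predicate("geometry IS NOT NULL")
assert source.metadata_cache is cache  # cache survived the transformation
```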
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| rust/sedona-geoparquet/src/format.rs | Adds metadata cache support to GeoParquetFormat, including cache retrieval in infer_schema and create_physical_plan, and proper propagation through GeoParquetFileSource methods |
| rust/sedona-geoparquet/src/file_opener.rs | Adds metadata_cache field to GeoParquetFileOpener and uses it when fetching parquet metadata |
| docs/overture-examples.md | Comprehensive rewrite demonstrating cache configuration, parameterized queries, and modern SedonaDB patterns for Overture data |
| docs/overture-examples.ipynb | Corresponding Jupyter notebook with consistent examples and output |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Thanks for this! Sorry for a noob question. Even if I include Benchmark results (seconds):
|
|
If the benchmark is running a fresh process each time, then We can do that if we want...it roughly involves reimplementing the default cache: ...backing it with a SQLite database or files in a temporary directory. It can be overridden when we set up the runtime environment here: sedona-db/rust/sedona/src/context_builder.rs Lines 194 to 223 in bed9151 |
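A persistent cache of the kind described above (backed by files in a temporary directory, so it survives a fresh process) might be sketched as follows. This is a hypothetical illustration, not SedonaDB code; the key scheme and API are assumptions.

```python
import hashlib
import os
import tempfile

class DiskMetadataCache:
    """Sketch of a metadata cache persisted as files in a directory.

    Keys are object-store paths; values are raw metadata bytes.
    Hashing the key gives a safe, fixed-length file name.
    """

    def __init__(self, root: str):
        self.root = root

    def _path(self, key: str) -> str:
        digest = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.root, digest)

    def get(self, key: str):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None  # miss: caller fetches from the object store

    def put(self, key: str, value: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(value)

with tempfile.TemporaryDirectory() as root:
    cache = DiskMetadataCache(root)
    cache.put("s3://bucket/part-0.parquet", b"footer-bytes")
    hit = cache.get("s3://bucket/part-0.parquet")
    miss = cache.get("s3://bucket/missing.parquet")
assert hit == b"footer-bytes"
assert miss is None
```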
zhangfengcdt left a comment
Looks good to me! I would suggest we add some documentation to clarify:
- What the recommended cache size is for very large datasets like Overture
- Whether we support cache invalidation; if not, we should clarify the limitation and the risk of inconsistency (though that might not be the main concern for a slowly updating dataset)
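One common way to address the invalidation concern above is to key each cache entry on a version token (an ETag or last-modified timestamp), so that a lookup against a changed object misses rather than serving stale metadata. This is a hypothetical sketch of that approach, not a description of what SedonaDB does.

```python
class VersionedCache:
    """Sketch of version-token invalidation for a metadata cache.

    Each entry stores the object's version (e.g. an ETag). A lookup
    with a different version misses, so metadata for a rewritten file
    is never served stale.
    """

    def __init__(self):
        self._entries = {}

    def get(self, key, version):
        entry = self._entries.get(key)
        if entry is None or entry[0] != version:
            return None  # miss: never cached, or the file changed upstream
        return entry[1]

    def put(self, key, version, value):
        self._entries[key] = (version, value)

cache = VersionedCache()
cache.put("part-0.parquet", "etag-v1", "metadata-v1")
assert cache.get("part-0.parquet", "etag-v1") == "metadata-v1"
# After the object is rewritten upstream, the old entry is ignored:
assert cache.get("part-0.parquet", "etag-v2") is None
```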
```python
def test_udf_sedonadb_registry_function_to_datafusion(con):
    datafusion = pytest.importorskip("datafusion")
    udf_impl = udf.arrow_udf(pa.binary(), [udf.STRING, udf.NUMERIC])(some_udf)

    # Register with our session
    con.register_udf(udf_impl)

    # Create a datafusion session, fetch our udf and register with the other session
    datafusion_ctx = datafusion.SessionContext()
    datafusion_ctx.register_udf(
        datafusion.ScalarUDF.from_pycapsule(con._impl.scalar_udf("some_udf"))
    )
```
I added #655 to track this...you have to try pretty hard to trigger this failing functionality so I just removed the tests for now.
In #251 we tried to use the file metadata cache and found that it actually slowed down queries. @yutannihilation kindly benchmarked the effect of the cache against DuckDB to demonstrate that the file cache there is effective for queries against large tables. @b4l kindly showed how to do this in #604.
This PR pipes through the requisite options to ensure the cache is used for GeoParquet reads. This is especially important because we need to pull two extra copies of the metadata after DataFusion has already pulled it: if we don't use the cached version, we issue three requests where we could have issued one. For most queries this happens in parallel/asynchronously in a non-blocking way and is hard to notice; however, remote tables with large numbers of Parquet files perform very badly.
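The three-requests-versus-one effect can be demonstrated with a toy memoizing wrapper around a request-counting store. The names are illustrative, not the SedonaDB/DataFusion API; the point is only that a shared cache collapses the repeated footer fetches into a single remote request.

```python
class CountingStore:
    """Stand-in for a remote object store that counts metadata requests."""

    def __init__(self):
        self.requests = 0

    def fetch_metadata(self, path):
        self.requests += 1
        return f"metadata for {path}"

def read_metadata(store, cache, path):
    # With a shared cache, the second and third reads of the same
    # footer hit the cache instead of going back to the object store.
    if path not in cache:
        cache[path] = store.fetch_metadata(path)
    return cache[path]

# Three reads of the same file's metadata (schema inference plus the
# two extra pulls described above), with and without a cache:
cached_store, cache = CountingStore(), {}
for _ in range(3):
    read_metadata(cached_store, cache, "s3://bucket/part-0.parquet")
assert cached_store.requests == 1  # one remote request instead of three

uncached_store = CountingStore()
for _ in range(3):
    uncached_store.fetch_metadata("s3://bucket/part-0.parquet")
assert uncached_store.requests == 3
```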
A secondary issue is that the default cache size is not well equipped to deal with Overture buildings, which we were using to benchmark this. The buildings data requires almost 900 megabytes of cache space, and because it is a least-recently-used cache being queried roughly in order three times, if the cache size is even slightly smaller than the full size of the dataset it is 0% useful. The increase we see in time is probably due to contention on the mutex guarding the in-memory cache.
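The 0% figure is a classic property of LRU under repeated sequential scans: every lookup evicts exactly the entry that will be needed next. A minimal simulation (illustrative only, not the DataFusion cache implementation) shows that a cache one entry too small gets zero hits across three in-order passes, while a cache that fits the whole dataset misses only on the first pass.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache for the thrashing demo."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = 0
        self._data = OrderedDict()

    def get_or_insert(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            if len(self._data) >= self.capacity:
                self._data.popitem(last=False)  # evict least recently used
            self._data[key] = True

# 100 "files" scanned in order three times, with a cache one entry too
# small: each insertion evicts the key that the scan needs next.
small = LRUCache(capacity=99)
for _ in range(3):
    for key in range(100):
        small.get_or_insert(key)
assert small.hits == 0  # 0% useful, exactly as described above

# A cache that holds the full dataset misses only on the first pass:
big = LRUCache(capacity=100)
for _ in range(3):
    for key in range(100):
        big.get_or_insert(key)
assert big.hits == 200
```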
I took the opportunity to redo the Overture buildings documentation page to include this and a few other improvements we added in the last few months.
Closes #250.