
feat(rust/sedona-geoparquet): Ensure metadata cache is used in GeoParquet wrapper code#646

Merged
paleolimbot merged 13 commits into apache:main from paleolimbot:all-the-metadata-caches
Feb 23, 2026

Conversation

@paleolimbot (Member) commented Feb 19, 2026

In #251 we tried to use the file metadata cache and found that it actually slowed down queries. @yutannihilation kindly benchmarked the effect of the cache against DuckDB to demonstrate that the file cache there is effective for queries against large tables. @b4l kindly showed how to do this in #604.

This PR pipes through the requisite options to ensure the cache is used for GeoParquet reads. This is especially important because we need to pull two extra copies of the metadata after DataFusion has already pulled it: if we don't use the cached version, we issue three requests where we could have issued one. For most queries this happens in parallel/async in a non-blocking way and is hard to notice; however, queries against remote tables with large numbers of Parquet files perform very badly.
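The cost of the redundant fetches is easy to sketch. The snippet below uses a hypothetical `fetch_metadata` stand-in for one object-store request (not SedonaDB's actual code path) to show why sharing a cache across the three call sites turns three requests per file into one:

```python
from functools import lru_cache

calls = {"n": 0}

def fetch_metadata(path):
    """Stand-in for one object-store request for a Parquet footer (hypothetical)."""
    calls["n"] += 1
    return {"path": path, "schema": "..."}

# Without a shared cache: schema inference, planning, and file opening
# each re-fetch the same footer -- three requests per file.
for _ in range(3):
    fetch_metadata("s3://bucket/part-0.parquet")
uncached_requests = calls["n"]  # 3

# With a cache shared across the call sites, only the first call
# touches the object store.
cached_fetch = lru_cache(maxsize=None)(fetch_metadata)
calls["n"] = 0
for _ in range(3):
    cached_fetch("s3://bucket/part-0.parquet")
cached_requests = calls["n"]  # 1

print(uncached_requests, cached_requests)
```

Multiply the uncached behavior across thousands of remote Parquet files and the extra round trips dominate, which matches the slowdown described above.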

A secondary issue is that the default size of the cache is not well-equipped to deal with Overture buildings, which we were using to benchmark this. The buildings data requires almost 900 megabytes of cache space, and because it is a least-recently-used cache being queried roughly in order three times, if the cache size is even a little bit smaller than the full size of the dataset then it is 0% useful. The increase in time we previously observed is probably due to contention on the mutex guarding the in-memory cache.
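The eviction pathology above can be reproduced in miniature with a toy LRU cache (an illustration only, not DataFusion's implementation): with capacity just one entry short of the working set, an in-order scan always evicts the entry that will be needed next, so repeated scans never hit.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache (toy illustration)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = loader(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return value

# 100 "files" but room for only 99: each in-order access evicts the
# entry the scan will need next, so three passes produce zero hits.
cache = LRUCache(capacity=99)
for _ in range(3):
    for f in range(100):
        cache.get_or_load(f, lambda k: f"metadata-{k}")

print(cache.hits, cache.misses)  # 0 hits, 300 misses
```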

import os
os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
import sedona.db

sd = sedona.db.connect()

sd.sql("SET datafusion.runtime.metadata_cache_limit = '900M'").execute()

# 16s on main, 10s on this PR with a big enough cache
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2026-02-18.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

# Second time: 16s on main, 0s with this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2026-02-18.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

I took the opportunity to redo the Overture buildings documentation page to include this and a few other improvements we added in the last few months.

Closes #250.

Copilot AI (Contributor) left a comment


Pull request overview

This PR implements metadata caching for GeoParquet reads to reduce redundant metadata fetches from remote object stores. The PR addresses a performance issue where each GeoParquet query would fetch the same metadata multiple times without using DataFusion's built-in metadata cache.

Changes:

  • Pipes through file metadata cache to DFParquetMetadata operations in both schema inference and file opening
  • Adds metadata_cache field to GeoParquetFileSource and GeoParquetFileOpener structs with proper propagation through all transformation methods
  • Updates Overture Buildings documentation to demonstrate cache configuration and modern SedonaDB API patterns

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| rust/sedona-geoparquet/src/format.rs | Adds metadata cache support to GeoParquetFormat, including cache retrieval in infer_schema and create_physical_plan, and proper propagation through GeoParquetFileSource methods |
| rust/sedona-geoparquet/src/file_opener.rs | Adds metadata_cache field to GeoParquetFileOpener and uses it when fetching parquet metadata |
| docs/overture-examples.md | Comprehensive rewrite demonstrating cache configuration, parameterized queries, and modern SedonaDB patterns for Overture data |
| docs/overture-examples.ipynb | Corresponding Jupyter notebook with consistent examples and output |


paleolimbot and others added 2 commits February 19, 2026 17:44
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@paleolimbot paleolimbot marked this pull request as ready for review February 19, 2026 23:44
@yutannihilation (Contributor)

Thanks for this!

Sorry for a noob question. Even if I include sd.sql("SET datafusion.runtime.metadata_cache_limit = '900M'").execute(), the benchmark still doesn't improve much. Is this expected at the moment (until the new version of DataFusion...?)? Maybe the difference is that DuckDB somehow holds a persistent cache among sessions while SedonaDB doesn't?

Benchmark results (seconds):

| Engine | Median | Mean | Min | Max | Runs | row_count | max_confidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DuckDB | 0.591740 | 2.654925 | 0.575550 | 6.797485 | 3 | 7471 | 0.999544084072 |
| SedonaDB | 9.257312 | 8.987796 | 8.034394 | 9.671681 | 3 | 7471 | 0.999544084072 |

@paleolimbot (Member, Author)

If the benchmark is running a fresh process each time, then sd.sql("SET datafusion.runtime.metadata_cache_limit = '900M'").execute() won't help (the cache is in-memory only). I'm not sure exactly how DuckDB does it but having a persistent cache would be great.

We can do that if we want...it roughly involves reimplementing the default cache:

https://github.com/apache/datafusion/blob/1736fd2a40b64c6e39fb12090a2dbe8be07ac5ac/datafusion/execution/src/cache/file_metadata_cache.rs#L143-L205

...backing it with a SQLite database or files in a temporary directory. It can be overridden when we set up the runtime environment here:

https://github.com/apache/datafusion/blob/1736fd2a40b64c6e39fb12090a2dbe8be07ac5ac/datafusion/execution/src/runtime_env.rs#L379-L383

/// Build a [`RuntimeEnv`] from the current configuration.
///
/// This constructs the memory pool and disk manager based on the
/// builder settings and returns the resulting runtime environment.
pub fn build_runtime_env(&self) -> Result<Arc<RuntimeEnv>> {
    let mut rt_builder = RuntimeEnvBuilder::new();

    if let Some(memory_limit) = self.memory_limit {
        let track_capacity = NonZeroUsize::new(10).expect("track capacity must be non-zero");
        let pool: Arc<dyn MemoryPool> = match self.pool_type {
            PoolType::Fair => Arc::new(TrackConsumersPool::new(
                SedonaFairSpillPool::new(memory_limit, self.unspillable_reserve_ratio),
                track_capacity,
            )),
            PoolType::Greedy => Arc::new(TrackConsumersPool::new(
                GreedyMemoryPool::new(memory_limit),
                track_capacity,
            )),
        };
        rt_builder = rt_builder.with_memory_pool(pool);
    }

    if let Some(ref temp_dir) = self.temp_dir {
        let dm_builder = DiskManagerBuilder::default()
            .with_mode(DiskManagerMode::Directories(vec![PathBuf::from(temp_dir)]));
        rt_builder = rt_builder.with_disk_manager_builder(dm_builder);
    }

    rt_builder.build_arc()
}

@zhangfengcdt (Member) left a comment


Looks good to me! I would suggest we add some documentation to clarify:

  • What the recommended cache size is for very large datasets like Overture
  • Whether we support cache invalidation; if not, we should clarify the limitation and the risk of inconsistency (though that is probably not a main concern for slowly updating datasets)

Comment on lines -242 to -253
def test_udf_sedonadb_registry_function_to_datafusion(con):
    datafusion = pytest.importorskip("datafusion")
    udf_impl = udf.arrow_udf(pa.binary(), [udf.STRING, udf.NUMERIC])(some_udf)

    # Register with our session
    con.register_udf(udf_impl)

    # Create a datafusion session, fetch our udf and register with the other session
    datafusion_ctx = datafusion.SessionContext()
    datafusion_ctx.register_udf(
        datafusion.ScalarUDF.from_pycapsule(con._impl.scalar_udf("some_udf"))
    )
@paleolimbot (Member, Author)

I added #655 to track this...you have to try pretty hard to trigger this failing functionality so I just removed the tests for now.

@paleolimbot paleolimbot merged commit a788960 into apache:main Feb 23, 2026
17 checks passed
@paleolimbot paleolimbot deleted the all-the-metadata-caches branch February 23, 2026 22:14


Development

Successfully merging this pull request may close these issues.

Use metadata cache when fetching metadata in geoparquet reader

4 participants