
Conversation


@adriangb adriangb commented Jan 1, 2026

Which issue does this PR close?

Part of #19433

Rationale for this change

In preparation for ordering inference from Parquet metadata, the cache system needs refactoring to:

  1. Support storing ordering information alongside statistics
  2. Simplify the CacheAccessor trait by removing the Extra associated type and *_with_extra methods
  3. Move validation logic into typed wrapper structs with explicit is_valid_for() methods

What changes are included in this PR?

Simplify CacheAccessor trait

Before:

pub trait CacheAccessor<K, V>: Send + Sync {
    type Extra: Clone;
    fn get(&self, k: &K) -> Option<V>;
    fn get_with_extra(&self, k: &K, e: &Self::Extra) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    fn put_with_extra(&self, key: &K, value: V, e: &Self::Extra) -> Option<V>;
    // ... other methods
}

After:

pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    // ... other methods (no Extra type, no *_with_extra methods)
}

Introduce typed wrapper structs for cached values

Instead of passing validation metadata separately via Extra, embed it in the cached value type:

  • CachedFileMetadata - contains meta: ObjectMeta, statistics: Arc<Statistics>, ordering: Option<LexOrdering>
  • CachedFileList - contains files: Arc<Vec<ObjectMeta>> with filter_by_prefix() helper
  • CachedFileMetadataEntry - contains meta: ObjectMeta, file_metadata: Arc<dyn FileMetadata>

Each wrapper has an is_valid_for(&ObjectMeta) method that checks if the cached entry is still valid (size and last_modified match).

New validation pattern

The typical usage pattern changes from:

// Before: validation hidden in get_with_extra
if let Some(stats) = cache.get_with_extra(&path, &object_meta) {
    // use stats
}

To:

// After: explicit validation
if let Some(cached) = cache.get(&path) {
    if cached.is_valid_for(&object_meta) {
        // use cached.statistics
    }
}

Add ordering support

  • CachedFileMetadata has new ordering: Option<LexOrdering> field
  • FileStatisticsCacheEntry has new has_ordering: bool field for introspection

Are these changes tested?

Yes. Existing cache tests pass, and new tests were added for ordering support.

Are there any user-facing changes?

Breaking change to cache traits. Users with custom cache implementations will need to:

  1. Update CacheAccessor impl to remove Extra type and *_with_extra methods
  2. Update cached value types to the new wrappers (CachedFileMetadata, etc.)
  3. Update callsites to use the new validation pattern with is_valid_for()

🤖 Generated with Claude Code

@github-actions github-actions bot added catalog Related to the catalog crate execution Related to the execution crate datasource Changes to the datasource crate labels Jan 1, 2026
@adriangb adriangb force-pushed the cache-ordering-refactor branch 2 times, most recently from 3697e42 to 12e3efe on January 1, 2026 at 21:39
@github-actions github-actions bot added the core Core DataFusion crate label Jan 1, 2026
Refactor the cache system to support storing both statistics and ordering
information together, in preparation for ordering inference from Parquet
metadata.

Changes to cache_manager.rs:
- Add `CachedFileMetadata` struct with `meta`, `statistics`, and `ordering` fields
- Refactor `FileStatisticsCache` trait to use `CachedFileMetadata` and Path keys
- Add `has_ordering` field to `FileStatisticsCacheEntry`
- Add `CachedFileList` for list files cache
- Refactor `FileMetadataCache` trait to use `CachedFileMetadataEntry` and Path keys

Changes to cache implementations:
- Update `DefaultFileStatisticsCache` to use new trait methods
- Update `DefaultFilesMetadataCache` to use new trait methods
- Simplify list files cache implementation

Changes to callsites:
- Update `ListingTable::do_collect_statistics` to use new cache API
- Update `DFParquetMetadata::fetch_metadata` to use new cache API
- Update `ListingTableUrl` to use new cache API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@adriangb adriangb force-pushed the cache-ordering-refactor branch from 12e3efe to aa3f29c on January 1, 2026 at 22:12
adriangb and others added 2 commits January 1, 2026 16:23
Replace unused FileStatisticsCache import with CacheAccessor
which provides the len() method used in tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions github-actions bot removed the core Core DataFusion crate label Jan 1, 2026
@adriangb adriangb requested a review from Copilot January 1, 2026 22:24

Copilot AI left a comment


Pull request overview

This PR refactors the cache API trait hierarchy to prepare for ordering inference from Parquet metadata. The refactoring eliminates the awkward Extra generic parameter from CacheAccessor, introduces wrapper types for cached data (CachedFileMetadata, CachedFileList, CachedFileMetadataEntry), and establishes a cleaner trait hierarchy where specific cache traits extend the base CacheAccessor trait.

Key changes:

  • Removed Extra associated type from CacheAccessor, simplifying the trait with unified get/put methods
  • Introduced CachedFileMetadata struct to store statistics and ordering information together
  • Changed cache key types from ObjectMeta to Path for consistency, with validation now handled by wrapper structs

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated no comments.

Show a summary per file

  • datafusion/execution/src/cache/mod.rs: Refactored CacheAccessor trait to remove the Extra type and the get_with_extra/put_with_extra methods; improved documentation
  • datafusion/execution/src/cache/cache_manager.rs: Added CachedFileMetadata, CachedFileList, and CachedFileMetadataEntry wrapper types with validation methods; updated trait definitions to extend CacheAccessor
  • datafusion/execution/src/cache/cache_unit.rs: Updated DefaultFileStatisticsCache to use the new CachedFileMetadata type and implement the split trait hierarchy; added ordering support tests
  • datafusion/execution/src/cache/file_metadata_cache.rs: Changed cache key from ObjectMeta to Path; moved validation to CachedFileMetadataEntry::is_valid_for; updated all tests
  • datafusion/execution/src/cache/list_files_cache.rs: Removed inline prefix filtering from cache internals; moved it to the CachedFileList::filter_by_prefix helper method; simplified the cache API
  • datafusion/datasource/src/url.rs: Updated list_with_cache to use CachedFileList and apply prefix filtering after cache retrieval
  • datafusion/datasource-parquet/src/metadata.rs: Updated to use CachedFileMetadataEntry with Path keys and explicit validation checks
  • datafusion/catalog-listing/src/table.rs: Updated do_collect_statistics to use the new cache API with the CachedFileMetadata wrapper
  • datafusion/execution/Cargo.toml: Added the datafusion-physical-expr-common dependency for LexOrdering support
Comments suppressed due to low confidence (1)

datafusion/datasource/src/url.rs:402

  • The prefix filtering logic is duplicated in lines 370-382 and lines 393-402. Consider using the CachedFileList::filter_by_prefix helper method instead. Replace the inline filtering with cached.filter_by_prefix(&Some(full_prefix)) and CachedFileList::new(vec.clone()).filter_by_prefix(&Some(full_prefix)) respectively.
                if prefix.is_some() {
                    let full_prefix_str = full_prefix.as_ref();
                    cached
                        .files
                        .iter()
                        .filter(|meta| {
                            meta.location.as_ref().starts_with(full_prefix_str)
                        })
                        .cloned()
                        .collect()
                } else {
                    cached.files.as_ref().clone()
                }
            } else {
                // Cache miss - always list and cache the full table
                // This ensures we have complete data for future prefix queries
                let vec = store
                    .list(Some(table_base_path))
                    .try_collect::<Vec<ObjectMeta>>()
                    .await?;
                cache.put(table_base_path, CachedFileList::new(vec.clone()));

                // If a prefix filter was requested, apply it to the results
                if prefix.is_some() {
                    let full_prefix_str = full_prefix.as_ref();
                    vec.into_iter()
                        .filter(|meta| {
                            meta.location.as_ref().starts_with(full_prefix_str)
                        })
                        .collect()
                } else {
                    vec
                }


Address review feedback to remove duplicated prefix filtering logic.
Now both cache hit and cache miss paths use the filter_by_prefix helper.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
