
Conversation


@adriangb adriangb commented Jan 1, 2026

Which issue does this PR close?

Part of #19433

Rationale for this change

In preparation for ordering inference from Parquet metadata, the cache system needs refactoring to:

  1. Support storing ordering information alongside statistics
  2. Simplify the CacheAccessor trait by removing the Extra associated type and *_with_extra methods
  3. Move validation logic into typed wrapper structs with explicit is_valid_for() methods

What changes are included in this PR?

Simplify CacheAccessor trait

Before:

pub trait CacheAccessor<K, V>: Send + Sync {
    type Extra: Clone;
    fn get(&self, k: &K) -> Option<V>;
    fn get_with_extra(&self, k: &K, e: &Self::Extra) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    fn put_with_extra(&self, key: &K, value: V, e: &Self::Extra) -> Option<V>;
    // ... other methods
}

After:

pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    // ... other methods (no Extra type, no *_with_extra methods)
}

Introduce typed wrapper structs for cached values

Instead of passing validation metadata separately via Extra, embed it in the cached value type:

  • CachedFileMetadata - contains meta: ObjectMeta, statistics: Arc<Statistics>, ordering: Option<LexOrdering>
  • CachedFileList - contains files: Arc<Vec<ObjectMeta>> with filter_by_prefix() helper
  • CachedFileMetadataEntry - contains meta: ObjectMeta, file_metadata: Arc<dyn FileMetadata>

Each wrapper has an is_valid_for(&ObjectMeta) method that checks if the cached entry is still valid (size and last_modified match).

New validation pattern

The typical usage pattern changes from:

// Before: validation hidden in get_with_extra
if let Some(stats) = cache.get_with_extra(&path, &object_meta) {
    // use stats
}

To:

// After: explicit validation
if let Some(cached) = cache.get(&path) {
    if cached.is_valid_for(&object_meta) {
        // use cached.statistics
    }
}

Add ordering support

  • CachedFileMetadata has new ordering: Option<LexOrdering> field
  • FileStatisticsCacheEntry has new has_ordering: bool field for introspection

Are these changes tested?

Yes. Existing cache tests pass, and new tests were added for ordering support.

Are there any user-facing changes?

Breaking change to cache traits. Users with custom cache implementations will need to:

  1. Update CacheAccessor impl to remove Extra type and *_with_extra methods
  2. Update cached value types to the new wrappers (CachedFileMetadata, etc.)
  3. Update callsites to use the new validation pattern with is_valid_for()

🤖 Generated with Claude Code

@github-actions github-actions bot added catalog Related to the catalog crate execution Related to the execution crate datasource Changes to the datasource crate labels Jan 1, 2026
@adriangb adriangb force-pushed the cache-ordering-refactor branch 2 times, most recently from 3697e42 to 12e3efe on January 1, 2026 at 21:39
@github-actions github-actions bot added the core Core DataFusion crate label Jan 1, 2026
Refactor the cache system to support storing both statistics and ordering
information together, in preparation for ordering inference from Parquet
metadata.

Changes to cache_manager.rs:
- Add `CachedFileMetadata` struct with `meta`, `statistics`, and `ordering` fields
- Refactor `FileStatisticsCache` trait to use `CachedFileMetadata` and Path keys
- Add `has_ordering` field to `FileStatisticsCacheEntry`
- Add `CachedFileList` for list files cache
- Refactor `FileMetadataCache` trait to use `CachedFileMetadataEntry` and Path keys

Changes to cache implementations:
- Update `DefaultFileStatisticsCache` to use new trait methods
- Update `DefaultFilesMetadataCache` to use new trait methods
- Simplify list files cache implementation

Changes to callsites:
- Update `ListingTable::do_collect_statistics` to use new cache API
- Update `DFParquetMetadata::fetch_metadata` to use new cache API
- Update `ListingTableUrl` to use new cache API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@adriangb adriangb force-pushed the cache-ordering-refactor branch from 12e3efe to aa3f29c on January 1, 2026 at 22:12
adriangb and others added 2 commits January 1, 2026 16:23
Replace unused FileStatisticsCache import with CacheAccessor
which provides the len() method used in tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions github-actions bot removed the core Core DataFusion crate label Jan 1, 2026
@adriangb adriangb requested a review from Copilot January 1, 2026 22:24

Copilot AI left a comment


Pull request overview

This PR refactors the cache API trait hierarchy to prepare for ordering inference from Parquet metadata. The refactoring eliminates the awkward Extra generic parameter from CacheAccessor, introduces wrapper types for cached data (CachedFileMetadata, CachedFileList, CachedFileMetadataEntry), and establishes a cleaner trait hierarchy where specific cache traits extend the base CacheAccessor trait.

Key changes:

  • Removed Extra associated type from CacheAccessor, simplifying the trait with unified get/put methods
  • Introduced CachedFileMetadata struct to store statistics and ordering information together
  • Changed cache key types from ObjectMeta to Path for consistency, with validation now handled by wrapper structs

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated no comments.

Show a summary per file

  • datafusion/execution/src/cache/mod.rs: Refactored CacheAccessor trait to remove the Extra type and the get_with_extra/put_with_extra methods; improved documentation
  • datafusion/execution/src/cache/cache_manager.rs: Added CachedFileMetadata, CachedFileList, and CachedFileMetadataEntry wrapper types with validation methods; updated trait definitions to extend CacheAccessor
  • datafusion/execution/src/cache/cache_unit.rs: Updated DefaultFileStatisticsCache to use the new CachedFileMetadata type and implement the split trait hierarchy; added ordering support tests
  • datafusion/execution/src/cache/file_metadata_cache.rs: Changed cache key from ObjectMeta to Path; moved validation to CachedFileMetadataEntry::is_valid_for; updated all tests
  • datafusion/execution/src/cache/list_files_cache.rs: Removed inline prefix filtering from cache internals; moved it to the CachedFileList::filter_by_prefix helper method; simplified the cache API
  • datafusion/datasource/src/url.rs: Updated list_with_cache to use CachedFileList and apply prefix filtering after cache retrieval
  • datafusion/datasource-parquet/src/metadata.rs: Updated to use CachedFileMetadataEntry with Path keys and explicit validation checks
  • datafusion/catalog-listing/src/table.rs: Updated do_collect_statistics to use the new cache API with the CachedFileMetadata wrapper
  • datafusion/execution/Cargo.toml: Added the datafusion-physical-expr-common dependency for LexOrdering support
Comments suppressed due to low confidence (1)

datafusion/datasource/src/url.rs:402

  • The prefix filtering logic is duplicated in lines 370-382 and lines 393-402. Consider using the CachedFileList::filter_by_prefix helper method instead. Replace the inline filtering with cached.filter_by_prefix(&Some(full_prefix)) and CachedFileList::new(vec.clone()).filter_by_prefix(&Some(full_prefix)) respectively.
                if prefix.is_some() {
                    let full_prefix_str = full_prefix.as_ref();
                    cached
                        .files
                        .iter()
                        .filter(|meta| {
                            meta.location.as_ref().starts_with(full_prefix_str)
                        })
                        .cloned()
                        .collect()
                } else {
                    cached.files.as_ref().clone()
                }
            } else {
                // Cache miss - always list and cache the full table
                // This ensures we have complete data for future prefix queries
                let vec = store
                    .list(Some(table_base_path))
                    .try_collect::<Vec<ObjectMeta>>()
                    .await?;
                cache.put(table_base_path, CachedFileList::new(vec.clone()));

                // If a prefix filter was requested, apply it to the results
                if prefix.is_some() {
                    let full_prefix_str = full_prefix.as_ref();
                    vec.into_iter()
                        .filter(|meta| {
                            meta.location.as_ref().starts_with(full_prefix_str)
                        })
                        .collect()
                } else {
                    vec
                }


Address review feedback to remove duplicated prefix filtering logic.
Now both cache hit and cache miss paths use the filter_by_prefix helper.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
