Refactor cache APIs to support ordering information #19597
Refactor the cache system to support storing both statistics and ordering information together, in preparation for ordering inference from Parquet metadata.
Changes to cache_manager.rs:
- Add `CachedFileMetadata` struct with `meta`, `statistics`, and `ordering` fields
- Refactor `FileStatisticsCache` trait to use `CachedFileMetadata` and Path keys
- Add `has_ordering` field to `FileStatisticsCacheEntry`
- Add `CachedFileList` for list files cache
- Refactor `FileMetadataCache` trait to use `CachedFileMetadataEntry` and Path keys
Changes to cache implementations:
- Update `DefaultFileStatisticsCache` to use new trait methods
- Update `DefaultFilesMetadataCache` to use new trait methods
- Simplify list files cache implementation
Changes to callsites:
- Update `ListingTable::do_collect_statistics` to use new cache API
- Update `DFParquetMetadata::fetch_metadata` to use new cache API
- Update `ListingTableUrl` to use new cache API
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5
Replace the unused `FileStatisticsCache` import with `CacheAccessor`, which provides the `len()` method used in tests.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5
Pull request overview
This PR refactors the cache API trait hierarchy to prepare for ordering inference from Parquet metadata. The refactoring eliminates the awkward Extra generic parameter from CacheAccessor, introduces wrapper types for cached data (CachedFileMetadata, CachedFileList, CachedFileMetadataEntry), and establishes a cleaner trait hierarchy where specific cache traits extend the base CacheAccessor trait.
Key changes:
- Removed `Extra` associated type from `CacheAccessor`, simplifying the trait with unified `get`/`put` methods
- Introduced `CachedFileMetadata` struct to store statistics and ordering information together
- Changed cache key types from `ObjectMeta` to `Path` for consistency, with validation now handled by wrapper structs
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| datafusion/execution/src/cache/mod.rs | Refactored CacheAccessor trait to remove Extra type and get_with_extra/put_with_extra methods; improved documentation |
| datafusion/execution/src/cache/cache_manager.rs | Added CachedFileMetadata, CachedFileList, and CachedFileMetadataEntry wrapper types with validation methods; updated trait definitions to extend CacheAccessor |
| datafusion/execution/src/cache/cache_unit.rs | Updated DefaultFileStatisticsCache to use new CachedFileMetadata type and implement split trait hierarchy; added ordering support tests |
| datafusion/execution/src/cache/file_metadata_cache.rs | Changed cache key from ObjectMeta to Path; validation moved to CachedFileMetadataEntry::is_valid_for; updated all tests |
| datafusion/execution/src/cache/list_files_cache.rs | Removed inline prefix filtering from cache internals; moved to CachedFileList::filter_by_prefix helper method; simplified cache API |
| datafusion/datasource/src/url.rs | Updated list_with_cache to use CachedFileList and apply prefix filtering after cache retrieval |
| datafusion/datasource-parquet/src/metadata.rs | Updated to use CachedFileMetadataEntry with Path keys and explicit validation checks |
| datafusion/catalog-listing/src/table.rs | Updated do_collect_statistics to use new cache API with CachedFileMetadata wrapper |
| datafusion/execution/Cargo.toml | Added datafusion-physical-expr-common dependency for LexOrdering support |
Comments suppressed due to low confidence (1)
datafusion/datasource/src/url.rs:402
- The prefix filtering logic is duplicated in lines 370-382 and lines 393-402. Consider using the `CachedFileList::filter_by_prefix` helper method instead. Replace the inline filtering with `cached.filter_by_prefix(&Some(full_prefix))` and `CachedFileList::new(vec.clone()).filter_by_prefix(&Some(full_prefix))` respectively.
        if prefix.is_some() {
            let full_prefix_str = full_prefix.as_ref();
            cached
                .files
                .iter()
                .filter(|meta| {
                    meta.location.as_ref().starts_with(full_prefix_str)
                })
                .cloned()
                .collect()
        } else {
            cached.files.as_ref().clone()
        }
    } else {
        // Cache miss - always list and cache the full table
        // This ensures we have complete data for future prefix queries
        let vec = store
            .list(Some(table_base_path))
            .try_collect::<Vec<ObjectMeta>>()
            .await?;
        cache.put(table_base_path, CachedFileList::new(vec.clone()));
        // If a prefix filter was requested, apply it to the results
        if prefix.is_some() {
            let full_prefix_str = full_prefix.as_ref();
            vec.into_iter()
                .filter(|meta| {
                    meta.location.as_ref().starts_with(full_prefix_str)
                })
                .collect()
        } else {
            vec
        }
Address review feedback to remove duplicated prefix filtering logic. Now both cache hit and cache miss paths use the `filter_by_prefix` helper.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5
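For illustration, a helper of this kind could look roughly like the sketch below. It is not the code from the PR: `FileMeta` is a hypothetical, simplified stand-in for `object_store::ObjectMeta`, paths are plain `String`s rather than `Path`, and the body only mirrors the string-prefix comparison shown in the review excerpt.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `object_store::ObjectMeta`; only the fields
/// needed for this sketch are included.
#[derive(Clone, Debug, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
}

/// Simplified stand-in for the PR's `CachedFileList`: a shared file listing.
#[derive(Clone, Debug)]
pub struct CachedFileList {
    pub files: Arc<Vec<FileMeta>>,
}

impl CachedFileList {
    pub fn new(files: Vec<FileMeta>) -> Self {
        Self { files: Arc::new(files) }
    }

    /// Returns the files whose location starts with `prefix`, or a copy of
    /// the whole listing when no prefix is given.
    pub fn filter_by_prefix(&self, prefix: &Option<String>) -> Vec<FileMeta> {
        match prefix {
            Some(p) => self
                .files
                .iter()
                .filter(|meta| meta.location.starts_with(p.as_str()))
                .cloned()
                .collect(),
            None => self.files.to_vec(),
        }
    }
}
```

With a helper like this, the cache-hit path filters the cached listing and the cache-miss path filters the freshly listed files through the same method, so the comparison logic lives in one place.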
Which issue does this PR close?
Part of #19433
Rationale for this change
In preparation for ordering inference from Parquet metadata, the cache system needs refactoring to:
- Simplify the `CacheAccessor` trait by removing the `Extra` associated type and `*_with_extra` methods
- … `is_valid_for()` methods
What changes are included in this PR?
Simplify `CacheAccessor` trait
Before:
After:
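As a rough sketch of the shape of this change (not the actual DataFusion definitions), the two traits below contrast an accessor that carries an `Extra` associated type with one that exposes only `get`/`put`. The names `CacheAccessorBefore`, `CacheAccessorAfter`, and `MapCache` are hypothetical, and the real trait has additional methods that are omitted here.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical "before" shape: validation data travels in a separate
/// `Extra` argument, which doubles the number of get/put methods.
pub trait CacheAccessorBefore<K, V> {
    type Extra: Clone;
    fn get(&self, k: &K) -> Option<V>;
    fn get_with_extra(&self, k: &K, e: &Self::Extra) -> Option<V>;
    fn put(&self, k: &K, v: V) -> Option<V>;
    fn put_with_extra(&self, k: &K, v: V, e: &Self::Extra) -> Option<V>;
}

/// Hypothetical "after" shape: a single get/put pair; anything needed for
/// validation is stored inside the value type `V` itself.
pub trait CacheAccessorAfter<K, V> {
    fn get(&self, k: &K) -> Option<V>;
    fn put(&self, k: &K, v: V) -> Option<V>;
}

/// Toy in-memory cache implementing the simplified trait.
#[derive(Default)]
pub struct MapCache {
    entries: Mutex<HashMap<String, i32>>,
}

impl CacheAccessorAfter<String, i32> for MapCache {
    fn get(&self, k: &String) -> Option<i32> {
        self.entries.lock().unwrap().get(k).copied()
    }

    /// Inserts the value and returns the previous one, if any.
    fn put(&self, k: &String, v: i32) -> Option<i32> {
        self.entries.lock().unwrap().insert(k.clone(), v)
    }
}
```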
Introduce typed wrapper structs for cached values
Instead of passing validation metadata separately via `Extra`, embed it in the cached value type:
- `CachedFileMetadata` - contains `meta: ObjectMeta`, `statistics: Arc<Statistics>`, `ordering: Option<LexOrdering>`
- `CachedFileList` - contains `files: Arc<Vec<ObjectMeta>>` with `filter_by_prefix()` helper
- `CachedFileMetadataEntry` - contains `meta: ObjectMeta`, `file_metadata: Arc<dyn FileMetadata>`
Each wrapper has an `is_valid_for(&ObjectMeta)` method that checks if the cached entry is still valid (size and last_modified match).
New validation pattern
The typical usage pattern changes from:
To:
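A minimal sketch of the get-then-validate pattern described in this section, under stated assumptions: `FileMeta`, `CachedStats`, `StatsCache`, and `stats_for` are hypothetical stand-ins rather than the DataFusion types, `num_rows` is a placeholder for `Arc<Statistics>`, and `Vec<String>` is a placeholder for `LexOrdering`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `object_store::ObjectMeta`.
#[derive(Clone, Debug, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
    pub last_modified: u64, // seconds since epoch in this sketch
}

/// Stand-in for a typed wrapper: the cached value carries the metadata it
/// was computed from, so it can validate itself.
#[derive(Clone, Debug)]
pub struct CachedStats {
    pub meta: FileMeta,
    pub num_rows: Arc<u64>,            // placeholder for Arc<Statistics>
    pub ordering: Option<Vec<String>>, // placeholder for Option<LexOrdering>
}

impl CachedStats {
    /// Valid only if size and last_modified still match the current file.
    pub fn is_valid_for(&self, current: &FileMeta) -> bool {
        self.meta.size == current.size
            && self.meta.last_modified == current.last_modified
    }
}

/// Toy cache keyed by path, with plain get/put and no validation logic.
#[derive(Default)]
pub struct StatsCache {
    entries: Mutex<HashMap<String, CachedStats>>,
}

impl StatsCache {
    pub fn get(&self, path: &str) -> Option<CachedStats> {
        self.entries.lock().unwrap().get(path).cloned()
    }

    pub fn put(&self, path: &str, value: CachedStats) -> Option<CachedStats> {
        self.entries.lock().unwrap().insert(path.to_string(), value)
    }
}

/// Caller-side pattern: look up by path, check validity explicitly, and
/// recompute and store on a miss or a stale entry.
pub fn stats_for(
    cache: &StatsCache,
    current: &FileMeta,
    compute: impl FnOnce() -> u64,
) -> Arc<u64> {
    if let Some(cached) = cache.get(&current.location) {
        if cached.is_valid_for(current) {
            return Arc::clone(&cached.num_rows);
        }
    }
    let num_rows = Arc::new(compute());
    let entry = CachedStats {
        meta: current.clone(),
        num_rows: Arc::clone(&num_rows),
        ordering: None,
    };
    cache.put(&current.location, entry);
    num_rows
}
```

The point of the design is that the cache itself stays a plain key-value store, while the staleness check is explicit at the call site and lives on the value type.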
Add ordering support
- `CachedFileMetadata` has new `ordering: Option<LexOrdering>` field
- `FileStatisticsCacheEntry` has new `has_ordering: bool` field for introspection
Are these changes tested?
Yes, existing cache tests pass plus new tests for ordering support.
Are there any user-facing changes?
Breaking change to cache traits. Users with custom cache implementations will need to:
- … `CacheAccessor` impl to remove `Extra` type and `*_with_extra` methods
- … `CachedFileMetadata`, etc.)
- … `is_valid_for()`
🤖 Generated with Claude Code