
19597: Refactor cache APIs to support ordering information #160

Open

martin-augment wants to merge 4 commits into main from pr-19597-2026-01-02-13-34-31

Conversation

@martin-augment
Owner

19597: To review by AI

adriangb and others added 4 commits January 1, 2026 16:12
Refactor the cache system to support storing both statistics and ordering
information together, in preparation for ordering inference from Parquet
metadata.

Changes to cache_manager.rs:
- Add `CachedFileMetadata` struct with `meta`, `statistics`, and `ordering` fields
- Refactor `FileStatisticsCache` trait to use `CachedFileMetadata` and Path keys
- Add `has_ordering` field to `FileStatisticsCacheEntry`
- Add `CachedFileList` for list files cache
- Refactor `FileMetadataCache` trait to use `CachedFileMetadataEntry` and Path keys

Changes to cache implementations:
- Update `DefaultFileStatisticsCache` to use new trait methods
- Update `DefaultFilesMetadataCache` to use new trait methods
- Simplify list files cache implementation

Changes to callsites:
- Update `ListingTable::do_collect_statistics` to use new cache API
- Update `DFParquetMetadata::fetch_metadata` to use new cache API
- Update `ListingTableUrl` to use new cache API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Replace unused FileStatisticsCache import with CacheAccessor
which provides the len() method used in tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address review feedback to remove duplicated prefix filtering logic.
Now both cache hit and cache miss paths use the filter_by_prefix helper.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@coderabbitai

coderabbitai bot commented Jan 2, 2026

Warning

Rate limit exceeded

@martin-augment has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 7 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 3d3d7a6 and 362f417.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • datafusion/catalog-listing/src/table.rs
  • datafusion/datasource-parquet/src/metadata.rs
  • datafusion/datasource/src/url.rs
  • datafusion/execution/Cargo.toml
  • datafusion/execution/src/cache/cache_manager.rs
  • datafusion/execution/src/cache/cache_unit.rs
  • datafusion/execution/src/cache/file_metadata_cache.rs
  • datafusion/execution/src/cache/list_files_cache.rs
  • datafusion/execution/src/cache/mod.rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Jan 2, 2026

PR Review: Refactor cache APIs to support ordering information

Summary

This PR refactors the cache API to embed validation metadata (ObjectMeta) directly into cached entries, and adds support for caching file ordering information. The changes simplify the cache interface by removing the Extra type parameter and moving validation logic to the cached values themselves.
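For orientation, here is a hedged sketch of the refactored cache value, using the field names described in this PR; the wrapper types and the exact location of is_valid_for are assumptions, not the actual definitions in cache_manager.rs:

// Sketch only: field names from the PR, exact types assumed.
pub struct CachedFileMetadata {
    pub meta: ObjectMeta,              // embedded validation metadata
    pub statistics: Arc<Statistics>,   // per-file statistics
    pub ordering: Option<LexOrdering>, // newly added ordering information
}

impl CachedFileMetadata {
    /// A cached value is only usable if the file has not changed since it was cached.
    pub fn is_valid_for(&self, meta: &ObjectMeta) -> bool {
        self.meta.size == meta.size && self.meta.last_modified == meta.last_modified
    }
}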

Architecture & Design

✅ Strengths

  1. Cleaner API Design: Removing the Extra type parameter from CacheAccessor significantly simplifies the trait interface. The old get_with_extra() and put_with_extra() methods were confusing - the new approach of embedding metadata in cached values is more intuitive.

  2. Consistent Validation Pattern: The is_valid_for() method on cached entries provides a clear, consistent validation pattern across all cache types (file statistics, file metadata, and list files).

  3. Ordering Support: Adding ordering: Option<LexOrdering> to CachedFileMetadata is well-designed for future enhancements without breaking the API.

  4. Comprehensive Test Coverage: Excellent test coverage with ~350 lines of new tests covering:

    • Cache validation on file modification
    • Ordering caching and updates
    • Multiple cache invalidation scenarios
    • LRU eviction with TTL
    • Memory tracking

🔍 Code Quality Issues

1. Inconsistent Cache Key Usage (datafusion/datasource-parquet/src/metadata.rs:127-172)

Issue: The code switches between using object_meta.location (Path) and the full object_meta as the cache key.

// Line 130: Uses object_meta.location as key
&& let Some(cached) = file_metadata_cache.get(&object_meta.location)
// Line 164: Uses object_meta.location as key again  
file_metadata_cache.put(
    &object_meta.location,
    CachedFileMetadataEntry::new(

Concern: While this appears consistent in the current code, the trait signature change from CacheAccessor<ObjectMeta, ...> to CacheAccessor<Path, ...> is a significant breaking change. The old code used the full ObjectMeta as the key which included size and last_modified for automatic validation. Now validation is manual.

Recommendation: Add a comment explaining why Path is sufficient as a key even though validation needs the full ObjectMeta.
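One possible shape for that comment (wording is illustrative, not taken from the code):

// The cache is keyed by `Path` only. Size/last-modified validation is no longer
// implicit in the key; it is performed explicitly by calling
// `is_valid_for(&object_meta)` on the cached entry before the entry is used.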

2. Potential Race Condition (datafusion/catalog-listing/src/table.rs:711-743)

Issue: There's a TOCTOU (time-of-check to time-of-use) pattern:

// Check cache first
if let Some(cached) = self.collected_statistics.get(&part_file.object_meta.location) {
    // Validate that cached entry is still valid
    if cached.is_valid_for(&part_file.object_meta) {
        return Ok(cached.statistics);
    }
}

// Cache miss or invalid - infer statistics
let statistics = self.options.format.infer_stats(...).await?;
// ... 
self.collected_statistics.put(...)

Risk: Between the get() call and the put() call, another thread could have already computed and cached the same statistics. While this won't cause correctness issues (both threads compute the same result), it wastes resources.

Impact: Low - The DashMap implementation provides atomic operations, so at worst you get redundant computation, not data corruption.

Recommendation: Consider documenting this behavior or using a compute-if-absent pattern if performance becomes an issue.
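A minimal sketch of the compute-if-absent idea over a DashMap-backed cache; the key/value types and the synchronous compute closure are placeholders (the real statistics computation is async), so this only illustrates the shape:

use std::sync::Arc;
use dashmap::DashMap;

// Concurrent callers for the same key serialize on the entry's shard lock,
// so only one of them runs `compute`.
fn get_or_compute<K, T>(cache: &DashMap<K, Arc<T>>, key: K, compute: impl FnOnce() -> T) -> Arc<T>
where
    K: std::hash::Hash + Eq,
{
    Arc::clone(&cache.entry(key).or_insert_with(|| Arc::new(compute())))
}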

3. Silent Failure on Cache Rejection (datafusion/execution/src/cache/file_metadata_cache.rs:71-73)

// no point in trying to add this value to the cache if it cannot fit entirely
if value_size > self.memory_limit {
    return None;  // Silent failure
}

Issue: When an entry is too large to cache, the method silently returns None. Callers can't distinguish "the entry was too large" from "an old entry was evicted".

Impact: Medium - This could lead to confusion during debugging when large files are never cached.

Recommendation: Consider logging a debug message when entries are rejected due to size, especially for the file metadata cache where individual Parquet footers could exceed limits.
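For example (a sketch; assumes the log crate, which DataFusion already uses elsewhere):

// no point in trying to add this value to the cache if it cannot fit entirely
if value_size > self.memory_limit {
    log::debug!(
        "Not caching entry of {value_size} bytes: exceeds cache memory limit of {} bytes",
        self.memory_limit
    );
    return None;
}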

🐛 Potential Bugs

4. Empty Vector Handling (datafusion/execution/src/cache/list_files_cache.rs:99-110)

fn try_new(
    cached_file_list: CachedFileList,
    ttl: Option<Duration>,
    now: Instant,
) -> Option<Self> {
    let size_bytes = (cached_file_list.files.capacity() * size_of::<ObjectMeta>())
        + cached_file_list
            .files
            .iter()
            .map(meta_heap_bytes)
            .reduce(|acc, b| acc + b)?;  // Returns None if empty!

Bug: For an empty file list, .reduce() returns None, causing try_new() to return None, which prevents caching empty directory listings.

Test Gap: The test at line 740 validates this behavior but doesn't document whether it's intentional.

Impact: Medium - Empty partitions won't be cached, leading to repeated object store calls.

Recommendation:

.reduce(|acc, b| acc + b)
.unwrap_or(0);  // Empty list has 0 heap bytes
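In context the size computation becomes (same names as the snippet above, shown only to illustrate where the fix lands):

let size_bytes = (cached_file_list.files.capacity() * size_of::<ObjectMeta>())
    + cached_file_list
        .files
        .iter()
        .map(meta_heap_bytes)
        .reduce(|acc, b| acc + b)
        .unwrap_or(0); // an empty list contributes 0 heap bytes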

5. Missing Validation on Cached Updates (datafusion/datasource/src/url.rs:370-382)

let vec = if let Some(cached) = cache.get(table_base_path) {
    debug!("Hit list files cache");
    cached.filter_by_prefix(&filter_prefix)
} else {
    // Cache miss - always list and cache the full table

Issue: The code doesn't validate cached file lists against the current table state. If files are added/removed from the table, the cache won't know until TTL expires.

Impact: Low to Medium - This is by design (noted in documentation), but worth calling out that stale data is possible.

Note: The PR description mentions "19597: To review by AI" but doesn't link to an issue explaining the motivation for these changes.

⚡ Performance Considerations

6. Clone Overhead in filter_by_prefix (datafusion/execution/src/cache/cache_manager.rs:130-140)

pub fn filter_by_prefix(&self, prefix: &Option<Path>) -> Vec<ObjectMeta> {
    match prefix {
        Some(prefix) => self
            .files
            .iter()
            .filter(|meta| meta.location.as_ref().starts_with(prefix.as_ref()))
            .cloned()  // Clones ObjectMeta for each matching file
            .collect(),
        None => self.files.as_ref().clone(),  // Clones entire Vec
    }
}

Impact: When no prefix is provided, this clones the entire Vec<ObjectMeta>. For tables with thousands of files, this could be expensive.

Recommendation: Consider returning Arc<Vec<ObjectMeta>> or &[ObjectMeta] to avoid cloning when possible. The caller at datasource/src/url.rs:385 immediately converts to a stream, so returning an iterator would work:

pub fn iter_with_prefix<'a>(
    &'a self,
    prefix: &'a Option<Path>,
) -> impl Iterator<Item = &'a ObjectMeta> + 'a {
    // Borrow the cached Vec instead of cloning it; the iterator borrows from
    // the cache entry for as long as it is used.
    self.files.iter().filter(move |meta| {
        prefix
            .as_ref()
            .map_or(true, |p| meta.location.as_ref().starts_with(p.as_ref()))
    })
}
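Borrowing from the cached entry avoids both clones; the tradeoff is that the iterator's lifetime is tied to the cache read, so a caller that needs owned values (for example to build a stream) would still clone, but only the matching subset.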

7. Memory Accounting for Ordering (datafusion/execution/src/cache/cache_unit.rs:91)

statistics_size_bytes: 0, // TODO: set to the real size in the future

Issue: The ordering field in CachedFileMetadata is not included in memory accounting.

Impact: Low - Ordering info is small relative to statistics, but unaccounted entries mean the configured memory limit can be exceeded slightly.

Recommendation: Implement proper size calculation for LexOrdering or document that it's intentionally excluded.
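A rough sketch of such an estimate, assuming LexOrdering dereferences to a slice of sort expressions; the per-expression constant is a made-up placeholder:

// Coarse accounting for the ordering field (sketch only).
fn ordering_size_estimate(ordering: &Option<LexOrdering>) -> usize {
    // dyn PhysicalExpr has no exact size, so use a fixed per-expression guess.
    const PER_EXPR_ESTIMATE: usize = 64;
    ordering.as_ref().map_or(0, |o| o.len() * PER_EXPR_ESTIMATE)
}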

🔒 Security Considerations

No security issues identified. The changes don't introduce new attack vectors:

  • No user input validation issues
  • No unsafe code added
  • Cache eviction properly handles memory limits
  • No sensitive data exposure

📝 Documentation

Good:

  • Comprehensive doc comments on new types
  • Clear usage patterns documented in trait docs
  • Test names are descriptive

Needs Improvement:

  • The PR description is very minimal ("19597: To review by AI")
  • No migration guide for users of the old Extra-based API
  • CachedFileList::filter_by_prefix doesn't document the performance implications of cloning

🎯 Overall Assessment

Quality Score: 8/10

Strengths:

  • Well-designed API improvements that remove complexity
  • Excellent test coverage with realistic scenarios
  • Clean separation of concerns
  • Proper handling of edge cases (TTL, LRU, eviction)

Weaknesses: the issues listed under Recommendations below, chiefly the empty-list caching bug and the cloning overhead in filter_by_prefix().

Recommendations

Must Fix:

  1. Fix the empty vector bug in ListFilesEntry::try_new() (Issue #4 above)

Should Fix:
2. Optimize filter_by_prefix() to avoid cloning large vectors (Issue #6)
3. Add debug logging for cache rejections (Issue #3)

Nice to Have:
4. Document the TOCTOU pattern in do_collect_statistics() (Issue #2)
5. Complete the TODO for statistics_size_bytes calculation (Issue #7)
6. Add a migration guide in the PR description


Verdict: Approve with minor fixes. The API improvements are solid and the test coverage is excellent. The empty vector bug should be fixed before merging, and the performance optimization for prefix filtering would be valuable for production use.

@augmentcode

augmentcode bot commented Jan 2, 2026

🤖 Augment PR Summary

Summary: Refactors DataFusion’s runtime cache APIs to carry file-ordering information alongside existing cached data, enabling future planning optimizations.

Changes:

  • Introduces new cache value types: CachedFileMetadata (stats + optional ordering), CachedFileMetadataEntry (validated file metadata), and CachedFileList (cached listings with post-filtering).
  • Simplifies CacheAccessor by removing the Extra parameter; validation metadata is now embedded in the cached value and checked via is_valid_for.
  • Updates callers (ListingTable stats collection, Parquet footer metadata caching, and URL listing) to use the new APIs and explicit validation.
  • Refactors default cache implementations (LRU list-files + file-metadata caches, DashMap statistics cache) to store the new entry structs.
  • Adds/updates unit tests to cover validation, ordering presence, TTL expiration, and prefix filtering behavior.

Technical Notes: Ordering is plumbed through the cache layer via LexOrdering, but this PR does not yet populate ordering for table statistics (stored as None for now).

🤖 Was this summary useful? React with 👍 or 👎


@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.
