
19597: Refactor cache APIs to support ordering information #160

Open

martin-augment wants to merge 4 commits into main from pr-19597-2026-01-02-13-34-31

Conversation

@martin-augment
Owner

19597: To review by AI

adriangb and others added 4 commits January 1, 2026 16:12
Refactor the cache system to support storing both statistics and ordering
information together, in preparation for ordering inference from Parquet
metadata.

Changes to cache_manager.rs:
- Add `CachedFileMetadata` struct with `meta`, `statistics`, and `ordering` fields
- Refactor `FileStatisticsCache` trait to use `CachedFileMetadata` and Path keys
- Add `has_ordering` field to `FileStatisticsCacheEntry`
- Add `CachedFileList` for list files cache
- Refactor `FileMetadataCache` trait to use `CachedFileMetadataEntry` and Path keys

Changes to cache implementations:
- Update `DefaultFileStatisticsCache` to use new trait methods
- Update `DefaultFilesMetadataCache` to use new trait methods
- Simplify list files cache implementation

Changes to callsites:
- Update `ListingTable::do_collect_statistics` to use new cache API
- Update `DFParquetMetadata::fetch_metadata` to use new cache API
- Update `ListingTableUrl` to use new cache API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Replace unused FileStatisticsCache import with CacheAccessor
which provides the len() method used in tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address review feedback to remove duplicated prefix filtering logic.
Now both cache hit and cache miss paths use the filter_by_prefix helper.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@coderabbitai

coderabbitai bot commented Jan 2, 2026

Warning

Rate limit exceeded

@martin-augment has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 7 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 3d3d7a6 and 362f417.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • datafusion/catalog-listing/src/table.rs
  • datafusion/datasource-parquet/src/metadata.rs
  • datafusion/datasource/src/url.rs
  • datafusion/execution/Cargo.toml
  • datafusion/execution/src/cache/cache_manager.rs
  • datafusion/execution/src/cache/cache_unit.rs
  • datafusion/execution/src/cache/file_metadata_cache.rs
  • datafusion/execution/src/cache/list_files_cache.rs
  • datafusion/execution/src/cache/mod.rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Jan 2, 2026

PR Review: Refactor cache APIs to support ordering information

Summary

This PR refactors the cache API to embed validation metadata (ObjectMeta) directly into cached entries, and adds support for caching file ordering information. The changes simplify the cache interface by removing the Extra type parameter and moving validation logic to the cached values themselves.
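For orientation, here is a hedged sketch of the refactored cache value, using the field names described in this PR; the wrapper types and the exact location of is_valid_for are assumptions, not the actual definitions in cache_manager.rs:

// Sketch only: field names from the PR, exact types assumed.
pub struct CachedFileMetadata {
    pub meta: ObjectMeta,              // embedded validation metadata
    pub statistics: Arc<Statistics>,   // per-file statistics
    pub ordering: Option<LexOrdering>, // newly added ordering information
}

impl CachedFileMetadata {
    /// A cached value is only usable if the file has not changed since it was cached.
    pub fn is_valid_for(&self, meta: &ObjectMeta) -> bool {
        self.meta.size == meta.size && self.meta.last_modified == meta.last_modified
    }
}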

Architecture & Design

✅ Strengths

  1. Cleaner API Design: Removing the Extra type parameter from CacheAccessor significantly simplifies the trait interface. The old get_with_extra() and put_with_extra() methods were confusing - the new approach of embedding metadata in cached values is more intuitive.

  2. Consistent Validation Pattern: The is_valid_for() method on cached entries provides a clear, consistent validation pattern across all cache types (file statistics, file metadata, and list files).

  3. Ordering Support: Adding ordering: Option<LexOrdering> to CachedFileMetadata is well-designed for future enhancements without breaking the API.

  4. Comprehensive Test Coverage: Excellent test coverage with ~350 lines of new tests covering:

    • Cache validation on file modification
    • Ordering caching and updates
    • Multiple cache invalidation scenarios
    • LRU eviction with TTL
    • Memory tracking

🔍 Code Quality Issues

1. Inconsistent Cache Key Usage (datafusion/datasource-parquet/src/metadata.rs:127-172)

Issue: The code switches between using object_meta.location (Path) and the full object_meta as the cache key.

// Line 130: Uses object_meta.location as key
&& let Some(cached) = file_metadata_cache.get(&object_meta.location)
// Line 164: Uses object_meta.location as key again  
file_metadata_cache.put(
    &object_meta.location,
    CachedFileMetadataEntry::new(

Concern: While this appears consistent in the current code, the trait signature change from CacheAccessor<ObjectMeta, ...> to CacheAccessor<Path, ...> is a significant breaking change. The old code used the full ObjectMeta as the key which included size and last_modified for automatic validation. Now validation is manual.

Recommendation: Add a comment explaining why Path is sufficient as a key even though validation needs the full ObjectMeta.
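One possible shape for that comment (wording is illustrative, not taken from the code):

// The cache is keyed by `Path` only. Size/last-modified validation is no longer
// implicit in the key; it is performed explicitly by calling
// `is_valid_for(&object_meta)` on the cached entry before the entry is used.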

2. Potential Race Condition (datafusion/catalog-listing/src/table.rs:711-743)

Issue: There's a TOCTOU (time-of-check to time-of-use) pattern:

// Check cache first
if let Some(cached) = self.collected_statistics.get(&part_file.object_meta.location) {
    // Validate that cached entry is still valid
    if cached.is_valid_for(&part_file.object_meta) {
        return Ok(cached.statistics);
    }
}

// Cache miss or invalid - infer statistics
let statistics = self.options.format.infer_stats(...).await?;
// ... 
self.collected_statistics.put(...)

Risk: Between the get() call and the put() call, another thread could have already computed and cached the same statistics. While this won't cause correctness issues (both threads compute the same result), it wastes resources.

Impact: Low - The DashMap implementation provides atomic operations, so at worst you get redundant computation, not data corruption.

Recommendation: Consider documenting this behavior or using a compute-if-absent pattern if performance becomes an issue.
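A minimal sketch of the compute-if-absent idea over a DashMap-backed cache; the key/value types and the synchronous compute closure are placeholders (the real statistics computation is async), so this only illustrates the shape:

use std::sync::Arc;
use dashmap::DashMap;

// Concurrent callers for the same key serialize on the entry's shard lock,
// so only one of them runs `compute`.
fn get_or_compute<K, T>(cache: &DashMap<K, Arc<T>>, key: K, compute: impl FnOnce() -> T) -> Arc<T>
where
    K: std::hash::Hash + Eq,
{
    Arc::clone(&cache.entry(key).or_insert_with(|| Arc::new(compute())))
}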

3. Silent Failure on Cache Rejection (datafusion/execution/src/cache/file_metadata_cache.rs:71-73)

// no point in trying to add this value to the cache if it cannot fit entirely
if value_size > self.memory_limit {
    return None;  // Silent failure
}

Issue: When an entry is too large to cache, the method silently returns None. Callers can't distinguish "the entry was too large" from "an old entry was evicted".

Impact: Medium - This could lead to confusion during debugging when large files are never cached.

Recommendation: Consider logging a debug message when entries are rejected due to size, especially for the file metadata cache where individual Parquet footers could exceed limits.
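For example (a sketch; assumes the log crate, which DataFusion already uses elsewhere):

// no point in trying to add this value to the cache if it cannot fit entirely
if value_size > self.memory_limit {
    log::debug!(
        "Not caching entry of {value_size} bytes: exceeds cache memory limit of {} bytes",
        self.memory_limit
    );
    return None;
}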

🐛 Potential Bugs

4. Empty Vector Handling (datafusion/execution/src/cache/list_files_cache.rs:99-110)

fn try_new(
    cached_file_list: CachedFileList,
    ttl: Option<Duration>,
    now: Instant,
) -> Option<Self> {
    let size_bytes = (cached_file_list.files.capacity() * size_of::<ObjectMeta>())
        + cached_file_list
            .files
            .iter()
            .map(meta_heap_bytes)
            .reduce(|acc, b| acc + b)?;  // Returns None if empty!

Bug: For an empty file list, .reduce() returns None, causing try_new() to return None, which prevents caching empty directory listings.

Test Gap: The test at line 740 validates this behavior but doesn't document whether it's intentional.

Impact: Medium - Empty partitions won't be cached, leading to repeated object store calls.

Recommendation:

.reduce(|acc, b| acc + b)
.unwrap_or(0);  // Empty list has 0 heap bytes
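In context the size computation becomes (same names as the snippet above, shown only to illustrate where the fix lands):

let size_bytes = (cached_file_list.files.capacity() * size_of::<ObjectMeta>())
    + cached_file_list
        .files
        .iter()
        .map(meta_heap_bytes)
        .reduce(|acc, b| acc + b)
        .unwrap_or(0); // an empty list contributes 0 heap bytes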

5. Missing Validation on Cached Updates (datafusion/datasource/src/url.rs:370-382)

let vec = if let Some(cached) = cache.get(table_base_path) {
    debug!("Hit list files cache");
    cached.filter_by_prefix(&filter_prefix)
} else {
    // Cache miss - always list and cache the full table

Issue: The code doesn't validate cached file lists against the current table state. If files are added/removed from the table, the cache won't know until TTL expires.

Impact: Low to Medium - This is by design (noted in documentation), but worth calling out that stale data is possible.

Note: The PR description mentions "19597: To review by AI" but doesn't link to an issue explaining the motivation for these changes.

⚡ Performance Considerations

6. Clone Overhead in filter_by_prefix (datafusion/execution/src/cache/cache_manager.rs:130-140)

pub fn filter_by_prefix(&self, prefix: &Option<Path>) -> Vec<ObjectMeta> {
    match prefix {
        Some(prefix) => self
            .files
            .iter()
            .filter(|meta| meta.location.as_ref().starts_with(prefix.as_ref()))
            .cloned()  // Clones ObjectMeta for each matching file
            .collect(),
        None => self.files.as_ref().clone(),  // Clones entire Vec
    }
}

Impact: When no prefix is provided, this clones the entire Vec<ObjectMeta>. For tables with thousands of files, this could be expensive.

Recommendation: Consider returning Arc<Vec<ObjectMeta>> or &[ObjectMeta] to avoid cloning when possible. The caller at datasource/src/url.rs:385 immediately converts to a stream, so returning an iterator would work:

pub fn iter_with_prefix<'a>(
    &'a self,
    prefix: &'a Option<Path>,
) -> impl Iterator<Item = &'a ObjectMeta> + 'a {
    // Borrow the cached Vec instead of cloning it; the iterator borrows from
    // the cache entry for as long as it is used.
    self.files.iter().filter(move |meta| {
        prefix
            .as_ref()
            .map_or(true, |p| meta.location.as_ref().starts_with(p.as_ref()))
    })
}
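Borrowing from the cached entry avoids both clones; the tradeoff is that the iterator's lifetime is tied to the cache read, so a caller that needs owned values (for example to build a stream) would still clone, but only the matching subset.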

7. Memory Accounting for Ordering (datafusion/execution/src/cache/cache_unit.rs:91)

statistics_size_bytes: 0, // TODO: set to the real size in the future

Issue: The ordering field in CachedFileMetadata is not included in memory accounting.

Impact: Low - Ordering info is small relative to statistics, but unaccounted entries mean the configured memory limit can be exceeded slightly.

Recommendation: Implement proper size calculation for LexOrdering or document that it's intentionally excluded.
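A rough sketch of such an estimate, assuming LexOrdering dereferences to a slice of sort expressions; the per-expression constant is a made-up placeholder:

// Coarse accounting for the ordering field (sketch only).
fn ordering_size_estimate(ordering: &Option<LexOrdering>) -> usize {
    // dyn PhysicalExpr has no exact size, so use a fixed per-expression guess.
    const PER_EXPR_ESTIMATE: usize = 64;
    ordering.as_ref().map_or(0, |o| o.len() * PER_EXPR_ESTIMATE)
}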

🔒 Security Considerations

No security issues identified. The changes don't introduce new attack vectors:

  • No user input validation issues
  • No unsafe code added
  • Cache eviction properly handles memory limits
  • No sensitive data exposure

📝 Documentation

Good:

  • Comprehensive doc comments on new types
  • Clear usage patterns documented in trait docs
  • Test names are descriptive

Needs Improvement:

  • The PR description is very minimal ("19597: To review by AI")
  • No migration guide for users of the old Extra-based API
  • CachedFileList::filter_by_prefix doesn't document the performance implications of cloning

🎯 Overall Assessment

Quality Score: 8/10

Strengths:

  • Well-designed API improvements that remove complexity
  • Excellent test coverage with realistic scenarios
  • Clean separation of concerns
  • Proper handling of edge cases (TTL, LRU, eviction)

Weaknesses: the issues listed under Recommendations below, chiefly the empty-list caching bug and the cloning overhead in filter_by_prefix().

Recommendations

Must Fix:

  1. Fix the empty vector bug in ListFilesEntry::try_new() (Issue #4 above)

Should Fix:
2. Optimize filter_by_prefix() to avoid cloning large vectors (Issue #6)
3. Add debug logging for cache rejections (Issue #3)

Nice to Have:
4. Document the TOCTOU pattern in do_collect_statistics() (Issue #2)
5. Complete the TODO for statistics_size_bytes calculation (Issue #7)
6. Add a migration guide in the PR description


Verdict: Approve with minor fixes. The API improvements are solid and the test coverage is excellent. The empty vector bug should be fixed before merging, and the performance optimization for prefix filtering would be valuable for production use.

@augmentcode

augmentcode bot commented Jan 2, 2026

🤖 Augment PR Summary

Summary: Refactors DataFusion’s runtime cache APIs to carry file-ordering information alongside existing cached data, enabling future planning optimizations.

Changes:

  • Introduces new cache value types: CachedFileMetadata (stats + optional ordering), CachedFileMetadataEntry (validated file metadata), and CachedFileList (cached listings with post-filtering).
  • Simplifies CacheAccessor by removing the Extra parameter; validation metadata is now embedded in the cached value and checked via is_valid_for.
  • Updates callers (ListingTable stats collection, Parquet footer metadata caching, and URL listing) to use the new APIs and explicit validation.
  • Refactors default cache implementations (LRU list-files + file-metadata caches, DashMap statistics cache) to store the new entry structs.
  • Adds/updates unit tests to cover validation, ordering presence, TTL expiration, and prefix filtering behavior.

Technical Notes: Ordering is plumbed through the cache layer via LexOrdering, but this PR does not yet populate ordering for table statistics (stored as None for now).

🤖 Was this summary useful? React with 👍 or 👎


@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.
