
Enable Foyer disk based cache for Parquet warm reads #1

Open
vishwasgarg18 wants to merge 1 commit into datafusion from
parquet/foyer-integeration

Conversation

@vishwasgarg18
Collaborator

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -0,0 +1,32 @@
/*
Collaborator Author

Not the right place; placed here to avoid a compilation issue. Update the location after open-sourcing Tiered Storage.

Collaborator Author

@vishwasgarg18 left a comment

Foyer cache integration.

@vishwasgarg18 changed the title from "Enable Foyer disk based cache for Parquet reads" to "Enable Foyer disk based cache for Parquet warm reads" on Mar 31, 2026
use vectorized_exec_spi::{log_info, log_error, log_debug};

// Default page cache budgets — overridden by Java settings via createCache()
const DEFAULT_PAGE_CACHE_MEMORY_BYTES: usize = 256 * 1024 * 1024; // 256 MB L1 memory
Collaborator Author

Discuss and update.

Foyer provides an out-of-the-box memory cache. This can be set to -1 to make the cache disk-only.
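The "-1 for disk-only" idea floated in this comment could be handled when mapping the Java-side setting to Foyer's memory capacity. A minimal sketch, assuming a sentinel convention that is not part of this PR (the function name and the "0 means unset" rule are illustrative assumptions):

```rust
/// Map the Java-side memory setting to a capacity in bytes.
/// Assumed convention: a negative value (e.g. -1) means "disk-only",
/// so the L1 memory tier is sized to zero; zero means "unset" and
/// falls back to the built-in default.
fn memory_capacity_bytes(java_setting: i64, default_bytes: usize) -> usize {
    if java_setting < 0 {
        0 // disk-only: disable the L1 memory tier entirely
    } else if java_setting == 0 {
        default_bytes // unset: use the compiled-in default
    } else {
        java_setting as usize
    }
}

fn main() {
    const DEFAULT: usize = 256 * 1024 * 1024; // mirrors DEFAULT_PAGE_CACHE_MEMORY_BYTES
    assert_eq!(memory_capacity_bytes(-1, DEFAULT), 0);
    assert_eq!(memory_capacity_bytes(0, DEFAULT), DEFAULT);
    assert_eq!(memory_capacity_bytes(1024, DEFAULT), 1024);
}
```

The result would then be passed to the builder's `.memory(...)` call in place of the hard-coded default.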

let inner: HybridCache<String, CachedBytes> = rt.block_on(async move {
HybridCacheBuilder::new()
.with_name("foyer-parquet-page-cache")
.memory(memory_capacity_bytes)
Owner

We only need the disk cache; let's not use memory.


/// An [`ObjectStore`] wrapper that caches `get_range` / `get_ranges` results
/// in the Foyer hybrid (memory + disk) page cache.
pub struct CachingObjectStore {
Owner

This should be part of the tiered object store, correct? The cache information is passed there, and then the cache is used.
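The wrapper pattern under discussion — consult the page cache before delegating a ranged read, and populate it on a miss — can be sketched without the `object_store` or `foyer` crates. The trait and type names below are illustrative stand-ins, not the PR's actual API:

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;
use std::ops::Range;

/// Minimal stand-in for an object store's ranged-read API.
trait RangeReader {
    fn get_range(&self, path: &str, range: Range<usize>) -> Vec<u8>;
}

/// Wrapper that serves `get_range` from a page cache when possible,
/// falling back to (and populating from) the inner store on a miss.
struct CachingReader<R: RangeReader> {
    inner: R,
    cache: RefCell<HashMap<String, Vec<u8>>>,
}

impl<R: RangeReader> CachingReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, cache: RefCell::new(HashMap::new()) }
    }

    fn get_range(&self, path: &str, range: Range<usize>) -> Vec<u8> {
        // Key mirrors the "path:start-end" scheme described in the PR.
        let key = format!("{path}:{}-{}", range.start, range.end);
        if let Some(bytes) = self.cache.borrow().get(&key) {
            return bytes.clone(); // warm read: served from cache
        }
        let bytes = self.inner.get_range(path, range);
        self.cache.borrow_mut().insert(key, bytes.clone());
        bytes
    }
}

/// Backing store that counts how often it is actually hit.
struct CountingStore(Cell<usize>);

impl RangeReader for CountingStore {
    fn get_range(&self, _path: &str, range: Range<usize>) -> Vec<u8> {
        self.0.set(self.0.get() + 1);
        vec![0u8; range.len()]
    }
}

fn main() {
    let store = CachingReader::new(CountingStore(Cell::new(0)));
    store.get_range("idx/_parquet_0.parquet", 4096..8192);
    store.get_range("idx/_parquet_0.parquet", 4096..8192);
    // The second, identical read never reached the backing store.
    assert_eq!(store.inner.0.get(), 1);
}
```

Folding this behavior into the tiered object store, as the reviewer suggests, would mean the wrapper lives there and receives the cache handle rather than constructing it.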

// Default page cache budgets — overridden by Java settings via createCache()
const DEFAULT_PAGE_CACHE_MEMORY_BYTES: usize = 256 * 1024 * 1024; // 256 MB L1 memory
const DEFAULT_PAGE_CACHE_DISK_BYTES: usize = 10 * 1024 * 1024 * 1024; // 10 GB L2 disk
const DEFAULT_PAGE_CACHE_DIR: &str = "/tmp/foyer-page-cache";
Owner

All of this should come from settings; nothing should be hard-coded as a default.

* combined with the byte range, e.g. {@code "data/nodes/0/.../parquet/_parquet_0.parquet:4096-8192"}.
* The exact key format is an implementation detail of the provider.
*/
public interface PageCacheProvider {
Owner

Why do we need a page cache provider? We have a cache strategy provider, correct? For Parquet we can have a pass-through on the Java side and Foyer in Rust, correct?
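The key scheme documented above the interface — the file path combined with a byte range — amounts to a one-line helper. A sketch with a hypothetical helper name and a made-up path (the PR calls the exact format an implementation detail of the provider):

```rust
/// Build a page-cache key from a file path and a byte range,
/// following the "<path>:<start>-<end>" convention documented on
/// PageCacheProvider. The helper name here is illustrative only.
fn page_cache_key(path: &str, start: u64, end: u64) -> String {
    format!("{path}:{start}-{end}")
}

fn main() {
    assert_eq!(
        page_cache_key("idx/_parquet_0.parquet", 4096, 8192),
        "idx/_parquet_0.parquet:4096-8192"
    );
}
```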

/** Whether this input has been closed */
private boolean closed = false;

public CachedParquetIndexInput(
Owner

DataFusion is taking care of Parquet file reads; why is this required?

/// Parse the eviction_type string for PAGES cache type.
/// Expected format: "<disk_capacity_bytes>|<disk_dir>"
/// Falls back to defaults if the string is malformed (e.g. plain "LRU" from old Java code).
fn parse_page_cache_params(eviction_str: &str) -> (usize, String) {
Owner

Let's not make changes to LiquidCache.

* by this plugin, without any classloader visibility issues.
*/
public class DataFusionPlugin extends Plugin implements ActionPlugin, SearchEnginePlugin, AnalyticsBackEndPlugin, ExtensiblePlugin, SearchAnalyticsBackEndPlugin {
public class DataFusionPlugin extends Plugin implements ActionPlugin, SearchEnginePlugin, AnalyticsBackEndPlugin, ExtensiblePlugin, SearchAnalyticsBackEndPlugin, PageCacheProvider {
Owner

Let's not add more implementation to the main plugin. We need to wrap things around the plugin instead.

// Called by CachedParquetCacheStrategy in the tiered-storage module.

@Override
public byte[] getPageRange(String path, int start, int end) {
Owner

The DataFusion plugin should not be aware of caches.

/// which causes `JoinError::Cancelled` panics in `foyer-storage`. We therefore keep
/// the runtime as an `Arc` field so it is dropped only after the `HybridCache` itself.
#[derive(Clone)]
pub struct FoyerDiskPageCache {
Owner

This should be part of the tiered object store.
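The drop-order guarantee the `FoyerDiskPageCache` doc comment relies on — Rust drops struct fields in declaration order, so a runtime field declared after the cache field outlives the cache — can be demonstrated with plain std types. The PR keeps the runtime behind an `Arc`; these stand-in types show only the ordering property:

```rust
use std::cell::RefCell;

thread_local! {
    // Records the order in which our stand-in types are dropped.
    static DROP_LOG: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

struct Cache;   // stand-in for foyer's HybridCache
struct Runtime; // stand-in for the tokio Runtime

impl Drop for Cache {
    fn drop(&mut self) {
        DROP_LOG.with(|l| l.borrow_mut().push("cache"));
    }
}
impl Drop for Runtime {
    fn drop(&mut self) {
        DROP_LOG.with(|l| l.borrow_mut().push("runtime"));
    }
}

/// Fields drop in declaration order, so listing the cache first
/// guarantees the runtime is still alive while the cache shuts
/// down — the property the PR's comment relies on to avoid
/// `JoinError::Cancelled` panics in foyer-storage.
#[allow(dead_code)]
struct FoyerDiskPageCache {
    cache: Cache,
    runtime: Runtime, // must be declared after `cache`
}

fn main() {
    drop(FoyerDiskPageCache { cache: Cache, runtime: Runtime });
    DROP_LOG.with(|l| assert_eq!(*l.borrow(), ["cache", "runtime"]));
}
```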


2 participants