Merged
44 commits
abb764e
Add _cache.py first attempt
ruaridhg Jul 31, 2025
d72078f
test.py ran without error, creating test.zarr/
ruaridhg Jul 31, 2025
e1266b4
Added testing for cache.py LRUStoreCache for v3
ruaridhg Aug 4, 2025
40e6f46
Fix ruff errors
ruaridhg Aug 4, 2025
eadc7bb
Add working example comparing LocalStore to LRUStoreCache
ruaridhg Aug 4, 2025
5f90a71
Delete test.py to clean-up
ruaridhg Aug 4, 2025
ae51d23
Added lrustorecache to changes and user-guide docs
ruaridhg Aug 7, 2025
e58329a
Fix linting issues
ruaridhg Aug 7, 2025
26bd3fc
Implement dual store cache
ruaridhg Aug 8, 2025
5c92d48
Fixed failing tests
ruaridhg Aug 8, 2025
f0c302c
Fix linting errors
ruaridhg Aug 8, 2025
11f17d6
Add logger info
ruaridhg Aug 11, 2025
a7810dc
Delete unnecessary extra functionality
ruaridhg Aug 11, 2025
a607ce0
Rename to caching_store
ruaridhg Aug 11, 2025
8e79e3e
Add test_storage.py
ruaridhg Aug 11, 2025
d31e565
Fix logic in _caching_store.py
ruaridhg Aug 11, 2025
92cd63c
Update tests to match caching_store implemtation
ruaridhg Aug 11, 2025
aa38def
Delete LRUStoreCache files
ruaridhg Aug 11, 2025
86dda09
Update __init__
ruaridhg Aug 11, 2025
bb807d0
Add functionality for max_size
ruaridhg Aug 11, 2025
ed4b284
Add tests for cache_info and clear_cache
ruaridhg Aug 11, 2025
0fe580b
Delete test.py
ruaridhg Aug 11, 2025
1d9a1f7
Fix linting errors
ruaridhg Aug 11, 2025
16ae3bd
Update feature description
ruaridhg Aug 11, 2025
62b739f
Fix errors
ruaridhg Aug 11, 2025
f51fdb8
Fix cachingstore.rst errors
ruaridhg Aug 11, 2025
ffa9822
Fix cachingstore.rst errors
ruaridhg Aug 11, 2025
cda4767
Merge branch 'main' into rmg/cache_remote_stores_locally
ruaridhg Aug 11, 2025
d20843a
Fixed eviction key logic with proper size tracking
ruaridhg Aug 11, 2025
4b8d0a6
Increase code coverage to 98%
ruaridhg Aug 11, 2025
84a87e2
Fix linting errors
ruaridhg Aug 11, 2025
f3b6b3e
Merge branch 'main' into rmg/cache_remote_stores_locally
d-v-b Aug 21, 2025
114a29a
Merge branch 'main' into rmg/cache_remote_stores_locally
d-v-b Aug 28, 2025
39cb6b1
Merge branch 'main' into rmg/cache_remote_stores_locally
d-v-b Sep 17, 2025
1f200ed
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Sep 30, 2025
f9c8c09
move cache store to experimental, fix bugs
d-v-b Sep 30, 2025
6861490
update changelog
d-v-b Sep 30, 2025
41d182c
remove logging config override, remove dead code, adjust evict_key lo…
d-v-b Oct 1, 2025
83539d3
add docs
d-v-b Oct 1, 2025
56db161
add tests for relaxed cache coherency
d-v-b Oct 1, 2025
3d21514
adjust code examples (but we don't know if they work, because we don'…
d-v-b Oct 1, 2025
923cc53
apply changes based on AI code review, and move tests into tests/test…
d-v-b Oct 2, 2025
e319ce3
update changelog
d-v-b Oct 2, 2025
af81c17
fix exception log
d-v-b Oct 2, 2025
1 change: 1 addition & 0 deletions changes/3366.feature.md
@@ -0,0 +1 @@
Adds `zarr.experimental.cache_store.CacheStore`, a `Store` that implements caching by combining two other `Store` instances. See the [docs page](https://zarr.readthedocs.io/en/latest/user-guide/experimental#cachestore) for more information about this feature.
272 changes: 272 additions & 0 deletions docs/user-guide/experimental.md
@@ -0,0 +1,272 @@
# Experimental features

This section contains documentation for experimental Zarr Python features. The features described here are exciting and potentially useful, but also volatile -- we might change them at any time. Take this into account if you consider depending on these features.

## `CacheStore`

Zarr Python 3.1.4 adds `zarr.experimental.cache_store.CacheStore`, a dual-store caching implementation
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP), where network
latency can significantly impact data access speed.

The CacheStore uses a separate Store instance as its cache backend, providing persistent
caching with time-based expiration, size-based eviction, and flexible cache storage options.
It automatically evicts the least recently used items when the cache reaches its maximum size.

Because the `CacheStore` uses an ordinary Zarr `Store` object as the caching layer, you can reuse the data stored in the cache later.

> **Note:** The CacheStore is a wrapper store that maintains compatibility with the full
> `zarr.abc.store.Store` API while adding transparent caching functionality.

## Basic Usage

Creating a CacheStore requires both a source store and a cache store. The cache store
can be any Store implementation, providing flexibility in cache persistence:

```python exec="true" session="experimental" source="above" result="ansi"
import zarr
from zarr.storage import LocalStore
import numpy as np
from tempfile import mkdtemp
from zarr.experimental.cache_store import CacheStore

# Create a local store and a separate cache store
local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore() # In-memory cache
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024  # 256MB cache
)

# Create an array using the cached store
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')

# Write some data to force chunk creation
zarr_array[:] = np.random.random((100, 100))
```

The dual-store architecture allows you to use different store types for source and cache,
such as a remote store for source data and a local store for persistent caching.

## Performance Benefits

The CacheStore provides significant performance improvements for repeated data access:

```python exec="true" session="experimental" source="above" result="ansi"
import time

# Benchmark reading with cache
start = time.time()
for _ in range(100):
    _ = zarr_array[:]
elapsed_cache = time.time() - start

# Compare with direct store access (without cache)
zarr_array_nocache = zarr.open(local_store_path, mode='r')
start = time.time()
for _ in range(100):
    _ = zarr_array_nocache[:]
elapsed_nocache = time.time() - start

# Cache provides speedup for repeated access
speedup = elapsed_nocache / elapsed_cache
print(f"Speedup from caching: {speedup:.2f}x")
```

Cache effectiveness is particularly pronounced with repeated access to the same data chunks.


## Cache Configuration

The CacheStore can be configured with several parameters:

**max_size**: Controls the maximum size of cached data in bytes

```python exec="true" session="experimental" source="above" result="ansi"
# 256MB cache with size limit
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)

# Unlimited cache size (use with caution)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=None
)
```

**max_age_seconds**: Controls time-based cache expiration

```python exec="true" session="experimental" source="above" result="ansi"
# Cache expires after 1 hour
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds=3600
)

# Cache never expires
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds="infinity"
)
```

**cache_set_data**: Controls whether written data is cached

```python exec="true" session="experimental" source="above" result="ansi"
# Cache data when writing (default)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=True
)

# Don't cache written data (read-only cache)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=False
)
```
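
The effect of `cache_set_data` on the write path can be sketched with plain dicts standing in for the two stores (the helper `set_value` is illustrative, not part of the zarr API):

```python
# Write-path sketch: with cache_set_data=True the written bytes are also
# copied into the cache store; with False, writes go only to the source.

def set_value(source: dict, cache: dict, key: str, value: bytes, cache_set_data: bool) -> None:
    source[key] = value        # the source store always receives the write
    if cache_set_data:
        cache[key] = value     # optionally warm the cache with written data

src: dict[str, bytes] = {}
ch: dict[str, bytes] = {}
set_value(src, ch, "c/0/0", b"data", cache_set_data=True)
print("c/0/0" in ch)  # True: written data was cached
set_value(src, ch, "c/0/1", b"data", cache_set_data=False)
print("c/0/1" in ch)  # False: write bypassed the cache
```

Disabling `cache_set_data` keeps write-heavy workloads from flushing useful read entries out of a size-limited cache.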

## Cache Statistics

The CacheStore provides statistics to monitor cache performance and state:

```python exec="true" session="experimental" source="above" result="ansi"
# Access some data to generate cache activity
data = zarr_array[0:50, 0:50] # First access - cache miss
data = zarr_array[0:50, 0:50] # Second access - cache hit

# Get comprehensive cache information
info = cached_store.cache_info()
print(info['cache_store_type']) # e.g., 'MemoryStore'
print(info['max_age_seconds'])
print(info['max_size'])
print(info['current_size'])
print(info['tracked_keys'])
print(info['cached_keys'])
print(info['cache_set_data'])
```

The `cache_info()` method returns a dictionary with detailed information about the cache state.

## Cache Management

The CacheStore provides methods for manual cache management:

```python exec="true" session="experimental" source="above" result="ansi"
# Clear all cached data and tracking information
import asyncio
asyncio.run(cached_store.clear_cache())

# Check cache info after clearing
info = cached_store.cache_info()
assert info['tracked_keys'] == 0
assert info['current_size'] == 0
```

The `clear_cache()` method is an async method that clears both the cache store
(if it supports the `clear` method) and all internal tracking data.

## Best Practices

1. **Choose appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
2. **Size the cache appropriately**: Set `max_size` based on available storage and expected data access patterns
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
4. **Monitor cache statistics**: Use `cache_info()` to tune cache size and access patterns
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
6. **Set appropriate expiration**: Use `max_age_seconds` for time-sensitive data or "infinity" for static data

## Working with Different Store Types

The CacheStore can wrap any store that implements the `zarr.abc.store.Store` interface
and use any store type for the cache backend:

### Local Store with Memory Cache

```python exec="true" session="experimental-memory-cache" source="above" result="ansi"
from zarr.storage import LocalStore, MemoryStore
from zarr.experimental.cache_store import CacheStore
from tempfile import mkdtemp

local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=128*1024*1024
)
```

### Memory Store with Persistent Cache

```python exec="true" session="experimental-local-cache" source="above" result="ansi"
from tempfile import mkdtemp
from zarr.storage import MemoryStore, LocalStore
from zarr.experimental.cache_store import CacheStore

memory_store = MemoryStore()
local_store_path = mkdtemp(suffix='.zarr')
persistent_cache = LocalStore(local_store_path)
cached_store = CacheStore(
    store=memory_store,
    cache_store=persistent_cache,
    max_size=256*1024*1024
)
```

The dual-store architecture provides flexibility in choosing the best combination
of source and cache stores for your specific use case.

## Examples from Real Usage

Here's a complete example demonstrating cache effectiveness:

```python exec="true" session="experimental-final" source="above" result="ansi"
import numpy as np
import time
from tempfile import mkdtemp
import zarr
import zarr.storage
from zarr.experimental.cache_store import CacheStore

# Create test data with dual-store cache
local_store_path = mkdtemp(suffix='.zarr')
source_store = zarr.storage.LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
zarr_array[:] = np.random.random((100, 100))

# Demonstrate cache effectiveness with repeated access
start = time.time()
data = zarr_array[20:30, 20:30] # First access (cache miss)
first_access = time.time() - start

start = time.time()
data = zarr_array[20:30, 20:30] # Second access (cache hit)
second_access = time.time() - start

# Check cache statistics
info = cached_store.cache_info()
assert info['cached_keys'] > 0 # Should have cached keys
assert info['current_size'] > 0 # Should have cached data
print(f"Cache contains {info['cached_keys']} keys with {info['current_size']} bytes")
```

This example shows how the CacheStore can significantly reduce access times for repeated
data reads, particularly important when working with remote data sources. The dual-store
architecture allows for flexible cache persistence and management.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -25,6 +25,7 @@ nav:
- user-guide/extending.md
- user-guide/gpu.md
- user-guide/consolidated_metadata.md
- user-guide/experimental.md
- API Reference:
- api/index.md
- api/array.md
1 change: 1 addition & 0 deletions pyproject.toml
@@ -399,6 +399,7 @@ filterwarnings = [
"ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
]
markers = [
"asyncio: mark test as asyncio test",
"gpu: mark a test as requiring CuPy and GPU",
"slow_hypothesis: slow hypothesis tests",
]