diff --git a/dev-docs/specs/2026-06-16-api-roadmap.md b/dev-docs/specs/2026-06-16-api-roadmap.md
new file mode 100644
index 0000000..2978da3
--- /dev/null
+++ b/dev-docs/specs/2026-06-16-api-roadmap.md
@@ -0,0 +1,113 @@
+# zarrista — API roadmap & next areas
+
+**Date:** 2026-06-16
+**Status:** Draft
+
+## Positioning
+
+zarrista is **not** the same kind of project as the official
+[`zarrs-python`](https://github.com/zarrs/zarrs-python). That binding is small and
+narrow: it injects the `zarrs` codec pipeline *behind* zarr-python, accelerating
+encode/decode while zarr-python owns the API, stores, indexing, and metadata.
+
+zarrista is being explored as a **low-level Zarr API in its own right** — one that
+could *replace* zarr-python, or that zarr-python could *depend on* for its core in
+the medium term. That ambition sets the design constraints below.
+
+### Design mindset vs. shipping order
+
+These are deliberately different:
+
+- **Design mindset: zarr-python replacement.** Every API decision should be made as
+  if this library will eventually need writing, full indexing semantics, a
+  pluggable store abstraction, groups-with-creation, and consolidated metadata.
+  Don't paint ourselves into a read-only or numerics-only corner with the type
+  signatures, class hierarchy, or store traits.
+- **Shipping order: fast standalone cloud reader first.** The immediate goal is to
+  get something real working end-to-end so we can **benchmark** against zarr-python
+  on cloud reads. The reader path (async + obstore) is where zarrs should already
+  beat zarr-python, so it's both the fastest path to a demo and the most
+  differentiated.
+
+The litmus test for any near-term decision: *does it move us toward a benchmarkable
+cloud reader, without foreclosing the replacement-grade API later?*
+
+## Current surface (as of this doc)
+
+Read-only, async-first metadata + raw-chunk reader:
+
+- `Array` / `AsyncArray`: metadata properties (`shape`, `dtype`, `ndim`, `attrs`,
+  `metadata`, `chunk_grid`, `codecs`, `dimension_names`, `path`) plus
+  `retrieve_chunk(chunk_indices)`.
+- `Group` / `AsyncGroup`: `attrs`, `array_keys()`, `group_keys()`, child navigation.
+- `Data`: zero-copy numpy via the buffer protocol.
+- Stores: sync `FilesystemStore` / `MemoryStore`; async = any `obstore.ObjectStore`.
+- Dtypes: fixed-width numerics only (bool, int/uint 8–64, float16/32/64).
+- No writing, no array-coordinate indexing, no fill values, no var-length dtypes.
+
+The key gap: `retrieve_chunk` is a *chunk-coordinate* primitive. Users think in
+*array coordinates*. Closing that gap is what turns this from a chunk inspector into
+an array library.
+
+## Tier 1 — makes it usable (do first)
+
+1. **Array indexing / `__getitem__`.** Map Python `slice`/`int`/`Ellipsis`/`None`
+   to `zarrs::array_subset::ArraySubset`, call `retrieve_array_subset_opt`, return
+   an ndarray. Start with **basic indexing** (slices + ints + ellipsis); defer
+   orthogonal/vectorized/boolean to a later pass (mirror zarr-python's `.oindex` /
+   `.vindex` split). The `retrieve_array_subset` path is already stubbed/commented
+   in `array/sync.rs`. This is the single highest-impact change.
+
+2. **Fill values + edge chunks.** Indexing forces this: subsets spanning the array
+   boundary or hitting missing chunks need the fill value. Extraction code already
+   exists commented-out in `dtype.rs`. Expose as `Array.fill_value` *and* wire into
+   the subset read path — without it, partial-edge-chunk reads are wrong.
+
+3. **Complete the dtype story.** Add **variable-length strings** and **fixed-width
+   bytes** (target numpy 2 `StringDType`).
+   Structured/complex dtypes can wait.
+
+## Tier 2 — where zarrista should beat zarr-python
+
+The differentiation, and the thing to benchmark:
+
+4. **Parallel multi-chunk / subset reads.** Lean on zarrs's concurrent codec+I/O
+   pipeline. Expose `retrieve_chunks(list_of_indices)` and make `__getitem__` over a
+   multi-chunk region fan out internally with a configurable concurrency limit
+   (`zarrs` `CodecOptions`/concurrency knobs). Headline: a single
+   `await arr[big_slice]` pulling hundreds of chunks concurrently from S3.
+
+5. **`retrieve_*_into` / preallocated output.** Decode into a caller-provided
+   buffer to avoid an allocation and integrate cleanly with xarray/dask block
+   fetching. Builds on the existing `Data` buffer-protocol work.
+
+## Tier 3 — larger projects (replacement-grade, sequence deliberately)
+
+6. **Writing.** New axis: writable stores (obstore PUT), `store_chunk` /
+   `store_array_subset`, `create_array` / `create_group`, resize. Required for the
+   replacement goal; not required for the first benchmark.
+
+7. **Store extensibility.** (a) Let obstore back the **sync** path too via
+   `block_on`, so we don't maintain two store worlds. (b) A Python-implementable
+   `Store` protocol for custom backends, mirroring zarr-python's `Store` ABC.
+
+8. **Consolidated metadata + group creation** — replacement-grade parity items.
+
+## Cross-cutting: testing
+
+Currently ~one smoke test. Before Tier 1 lands, stand up **round-trip tests against
+zarr-python**: write with zarr-python, read with zarrista, assert equality across
+dtypes/codecs/sharding. Cheapest way to buy correctness confidence and a prerequisite
+for trustworthy benchmarks. Do this in parallel with Tier 1.
+
+## Recommended sequence
+
+1. Round-trip test harness vs. zarr-python (parallel, ongoing).
+2. Tier 1: indexing → fill values/dtypes.
+3. Tier 2: parallel bulk reads → benchmark vs. zarr-python on a real cloud dataset.
+4.
+   Reassess with benchmark numbers in hand before committing to Tier 3 (writing).
+
+## Out of scope (for now)
+
+Writing; full fancy/boolean indexing; consolidated metadata; group creation; custom
+Python stores. All are in-scope for the *design* (don't foreclose them) but not for
+the first benchmarkable milestone.
diff --git a/dev-docs/specs/2026-06-18-string-dtype-design.md b/dev-docs/specs/2026-06-18-string-dtype-design.md
new file mode 100644
index 0000000..30a8660
--- /dev/null
+++ b/dev-docs/specs/2026-06-18-string-dtype-design.md
@@ -0,0 +1,130 @@
+# zarrista — variable-length string dtype
+
+**Date:** 2026-06-18
+**Status:** Approved
+
+## Goal
+
+Let zarrista read Zarr arrays whose dtype is **variable-length UTF-8 string**,
+returning them as a numpy 2 `StringDType` array via `Data.to_numpy()`. This is the
+first slice of Tier 1, item 3 of the [API roadmap](2026-06-16-api-roadmap.md)
+("complete the dtype story"). Fixed-width raw bytes, complex, and fixed UTF-32 are
+explicitly deferred to follow-up tasks.
+
+## Guiding principle (settled in brainstorming)
+
+**Rust produces the decoded payload; Python owns numpy-dtype construction.** The
+buffer protocol carries *bytes*, not type semantics, for anything richer than a
+buffer-native scalar. For variable-length strings this is not just a preference —
+it is forced: rust-numpy has no safe API for numpy 2 `StringDType`
+([PyO3/rust-numpy#505](https://github.com/PyO3/rust-numpy/issues/505)). So the
+numpy array is always built on the Python side, and Rust never touches
+`StringDType`.
+
+## Dtype categories
+
+This frames the broader "dtype story" so the string work slots in cleanly. Only
+**category 3** is implemented in this round.
+
+| Category | Members | Held as | Buffer protocol | `to_numpy` |
+|---|---|---|---|---|
+| 1. Buffer-native | existing 12 numerics | `ArrayD<T>` | typed, strided (unchanged) | `np.asarray(self)` — zero-copy typed view |
+| 2. Raw fixed-width | `r*N`, complex, UTF-32 (future) | bytes + numpy-dtype string + shape | flat `B` buffer | `np.frombuffer(self, dt).reshape(shape)` |
+| 3. Variable-length | **`string` (this round)**, bytes (future) | `ArrayD<String>` | none | build `list[str]` → `np.array(lst, StringDType()).reshape(shape)` |
+
+Category 2 infrastructure is **not** built in this round — string needs none of it.
+
+## What zarrs gives us
+
+- `DataType::String` is `StringDataType`; `dtype.is::<StringDataType>()` identifies it.
+- `String: ElementOwned` — zarrs does the UTF-8 decode for us.
+- With the `ndarray` feature (already enabled), `retrieve_array_subset_ndarray::<String>()`
+  and `retrieve_chunk_ndarray::<String>()` return `ArrayD<String>`.
+- The `vlen_utf8` codec (and `vlen`, `vlen_v2`) is implemented in zarrs, so it can
+  decode what zarr-python writes for a string array.
+- Edge/missing chunks are filled with the array's string fill value automatically
+  during retrieval — no special handling here.
+
+## Changes
+
+### `src/data.rs` — `DataInner` and `PyData`
+
+1. **Add a variant:** `DataInner::String(ArrayD<String>)`.
+2. **`with_array!` gains a `String($a) => $body` arm.** This compiles because the
+   only remaining `with_array!` call-site bodies are `a.shape()`, `a.strides()`, and
+   `a.as_ptr()` — all valid for `ArrayD<String>`. (See deletion below for why no
+   body needs `numpy::Element`.)
+3. **Delete `to_numpy_with_copy`.** It is the `buffer_format == None` fallback and is
+   currently dead (all 12 numerics have a format). String is handled by an explicit
+   branch in `to_numpy` instead, so the fallback — the one body that needed
+   `PyArray::from_array` (and thus `numpy::Element`, which `String` is not) — is
+   removed. This is what lets `with_array!` accept a `String` arm.
+4. **`buffer_format`:** add `String(_) => None`. `__getbuffer__` already rejects the
+   `None` case with `PyBufferError` ("no buffer-protocol representation") — strings
+   are not buffer-exportable, by design.
+5.
+   **`itemsize` / `data_ptr`:** add `String(_)` arms. These are only reached from the
+   buffer-protocol path, which `String` never enters, so the arms exist solely for
+   match exhaustiveness (`itemsize` may return the size of a `String` element;
+   `data_ptr` returns the `ArrayD<String>` pointer). The stored `strides` are unused
+   for strings.
+6. **`to_numpy` becomes a 2-way branch:**
+   - `DataInner::String(arr)` → build the numpy array (below).
+   - everything else → `np.asarray(self)` (zero-copy typed view, unchanged).
+
+   The `StringDType` builder: import `numpy`, construct `np.dtypes.StringDType()`,
+   build a flat `list[str]` by iterating `arr` in C-order (`arr.iter()` follows
+   zarrs's C-order layout), then
+   `np.array(flat_list, dtype=string_dtype).reshape(arr.shape())`. Empty arrays
+   (a zero-length axis) reshape correctly from an empty list.
+
+### `src/array/sync.rs` and `src/array/async.rs` — read dispatch
+
+In both `retrieve_array_subset` and `retrieve_chunk`, after the existing
+`for_each_dtype!` macro loop, add an explicit string arm:
+
+```rust
+if dtype.is::<StringDataType>() {
+    let data = self.inner.retrieve_array_subset_ndarray::<String>(&array_subset)?;
+    return Ok(PyData::from(DataInner::String(data)));
+}
+```
+
+(and the `retrieve_chunk_ndarray::<String>` analogue for `retrieve_chunk`). On
+`AsyncArray` the same arm uses the async method names —
+`async_retrieve_array_subset_ndarray::<String>` and
+`async_retrieve_chunk_ndarray::<String>` — matching how the existing numeric arms
+call `async_retrieve_array_subset`. The trailing
+`NotImplementedError` for unsupported dtypes stays as the fallback for everything
+still unimplemented.
+
+## Testing
+
+- **Python round-trip vs. zarr-python** (extends the harness seeded by the indexing
+  work): write variable-length string arrays with zarr-python across a few
+  shapes / chunkings, including a partial-edge-chunk case, read them back with
+  zarrista's `FilesystemStore`, and assert `Data.to_numpy()` equals the zarr-python
+  array.
+  Cover both `retrieve_array_subset`/`__getitem__` and `retrieve_chunk`.
+  Include an array with an empty selection (zero-length axis).
+- **Rust unit test:** a small `ArrayD<String>` → `DataInner::String` → `to_numpy`
+  check is hard without a Python interpreter; rely on the Python round-trip for
+  end-to-end coverage and keep any Rust-side test limited to construction.
+- **Tooling:** rebuild with `maturin develop` after Rust changes; run tests with
+  `uv run --no-project pytest` so uv does not rebuild on every invocation.
+
+## Risks
+
+- **zarr-python's string encoding.** The round-trip assumes zarr-python emits a
+  string array zarrs can decode via `vlen_utf8`/`vlen`. If zarr-python uses an
+  encoding zarrs does not recognize, the round-trip test will surface it; resolving
+  any codec mismatch is part of this task.
+
+## Out of scope (deferred)
+
+- **Fixed-width raw bytes** (`r*N` → `|V`) and the category-2 `frombuffer`
+  infrastructure — next task.
+- **complex64/128 and fixed UTF-32** — the "fold in" follow-up task.
+- **bfloat16** (drags in an `ml_dtypes` runtime dependency) and **variable-length
+  bytes** (numpy object array, different semantics).
+- **String fill-value scalar exposure** (`fill_value_to_py` / `Array.fill_value`) —
+  the separate roadmap item 2.
+- **Writing** string arrays.