
Zarrista overview & goals #36

Human written by @kylebarron

Background

Zarr is the pre-eminent data format for storing N-dimensional data. Zarr-Python is the primary Python library for reading/writing this data; Zarrs is the primary Rust library for reading/writing Zarr, and is potentially much faster than Zarr-Python.

@d-v-b recently prototyped vibe-coded Zarrs bindings for Zarr-Python in zarr-developers/zarr-python#4064. After some discussion, I decided to start prototyping a standalone Zarrs binding for Python, with the goal of providing an alternative to zarr-developers/zarr-python#4064. That is, Zarrista should be a low-level binding that Zarr-Python could potentially wrap in its higher-level APIs in the future.

@d-v-b found that zarr-developers/zarr-python#4064 had vastly improved performance:

"I'm seeing ~15x throughput improvement, looks good."

This is strong motivation for building a Python Zarr library on Zarrs: the potential performance improvements are substantial.

Goals

Zarrista should be directly usable by intermediate-to-advanced users, and should also be built so that Zarr-Python could build on top of it in the future.

Zarrista should not add logic beyond what already exists in Zarrs. Similar to Obstore, the scope should be limited to only what is already implemented upstream. This keeps maintainability high.

Zarrista should expose as many APIs as possible from Zarrs. Array and Group will be medium-to-high-level APIs, but ideally all lower-level APIs (if stable) should also be exposed, so that downstream libraries can choose the most performant route for them.
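To make the layering idea concrete, here is a minimal pure-Python sketch of a chunk-level API with an array-level API built on top of it. Everything in it is hypothetical: `MemoryStore`, `ChunkedArray`, the method names, and the `c/<index>` key layout are invented for illustration and are not the Zarrs or Zarrista API. The point is only that a downstream library could call either layer.

```python
import math
from array import array


class MemoryStore:
    """Hypothetical in-memory key/value store, standing in for a real store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value


class ChunkedArray:
    """Hypothetical 1-D int array split into fixed-size chunks."""

    def __init__(self, store, length, chunk_len):
        self.store = store
        self.length = length
        self.chunk_len = chunk_len
        self.n_chunks = math.ceil(length / chunk_len)

    # Lower-level API: read or write exactly one chunk.
    def store_chunk(self, index, values):
        chunk = array("i", values)
        if len(chunk) != self.chunk_len:
            raise ValueError("chunk has the wrong length")
        self.store.set(f"c/{index}", chunk.tobytes())

    def retrieve_chunk(self, index):
        chunk = array("i")
        chunk.frombytes(self.store.get(f"c/{index}"))
        return chunk

    # Higher-level API: implemented purely in terms of the chunk API.
    def store_array(self, values):
        values = list(values)
        if len(values) != self.length:
            raise ValueError("array has the wrong length")
        # Pad the final chunk with zeros (a stand-in for a fill value).
        padded = values + [0] * (self.n_chunks * self.chunk_len - self.length)
        for i in range(self.n_chunks):
            start = i * self.chunk_len
            self.store_chunk(i, padded[start:start + self.chunk_len])

    def retrieve_array(self):
        out = array("i")
        for i in range(self.n_chunks):
            out.extend(self.retrieve_chunk(i))
        return out[: self.length]


arr = ChunkedArray(MemoryStore(), length=10, chunk_len=4)
arr.store_array(range(10))
whole = arr.retrieve_array()      # array-level read
second = arr.retrieve_chunk(1)    # chunk-level read of the same data
```

A caller that only needs one chunk can skip the array-level path entirely, which is the kind of choice exposing both layers is meant to allow.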

  • Create a minimal but complete Python binding of Zarrs
    • Include parallel sync and async stores, arrays, and groups
    • Support chunk reading
    • Support array reading
    • Support chunk writing
    • Support array writing
  • Various store support
    • Icechunk integration
    • ObjectStore/Obstore integration
    • Filesystem integration
    • Memory store integration (for testing only)
  • Zero copy data exchange between Rust and Python
    • Primitive, fixed width types:
      • Buffer protocol/numpy
      • Arrow
      • DLPack
    • Variable width types:
      • Buffer protocol/numpy
      • Arrow
      • DLPack
    • Masked types:
      • Buffer protocol/numpy
      • Arrow
      • DLPack
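As a rough illustration of what zero-copy exchange means for the primitive, fixed-width case, the sketch below uses only the Python standard library. An `array.array` stands in for memory owned by the Rust side (an assumption made purely for the demo), and `memoryview` consumes it through the buffer protocol. This is not Zarrista code; it only shows the sharing behaviour the goal refers to.

```python
from array import array

# Stand-in for a buffer owned by the Rust side: any object that exports
# the buffer protocol behaves the same way from Python's point of view.
owner = array("i", [1, 2, 3, 4, 5, 6])

# A memoryview borrows the owner's memory; no bytes are copied.
view = memoryview(owner)

# Reinterpret the same memory as a 2x3 grid, still without copying.
# (memoryview.cast requires going through a byte format to reshape.)
grid = view.cast("B").cast("i", shape=[2, 3])

# For contrast, constructing a new array makes an independent copy.
copied = array("i", owner)

# Writing through the view is visible to the owner and the reshaped grid,
# but not to the copy.
view[0] = 100
shared_rows = grid.tolist()
```

NumPy consumes the same protocol (for example via `np.frombuffer` or `np.asarray`), and `np.from_dlpack` plays the analogous role for DLPack producers.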

Partners and Stakeholders

Partner with @d-v-b as needed for general project direction, and to ensure that the public-facing API is usable.

Keeping the scope limited to what is already implemented upstream keeps this project focused on my (@kylebarron's) strengths. Given my experience with obstore, async-tiff, and similar projects, I'm really good at creating high-performance, Pythonic APIs from Rust libraries. I have considerably less experience with Zarr itself. This means that I'm ill-equipped to design a higher-level Zarr API myself; that task should be left to @d-v-b and others in higher-level wrapper libraries (Zarr-Python or others).

Milestones

cc @developmentseed/cng-island
