Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrowRecordBatchCodec and vlen string support #2031

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rabernat
Copy link
Contributor

@rabernat rabernat commented Jul 12, 2024

The discussion in zarr-developers/zeps#47 got me thinking: what if, instead of turning numpy arrays into bytes, we turn them into self-describing Arrow Record Batches and serialize them using the Arrow IPC format.

This would be a new type of Array -> Bytes codec. The beautiful thing about this is that it gives us variable-length string encoding for free (as well as potentially many other benefits) -- xref zarr-developers/zarr-specs#83.

This PR is a proof of concept that this is feasible and in fact very easy.

There is a lot more to explore here, but I thought I would just through this up for discussion.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@rabernat
Copy link
Contributor Author

This experiment also suggests another interesting possibility: returning Arrow Arrays and Tables from a Zarr Array or Group. If the Zarr Arrays are all 1D, they can be represented as Arrow Arrays all the way through, and there are potentially opportunities to reduce memory copies. We could have ArrowBuffer / ArrowArrayBuffer types.

@jeromekelleher
Copy link
Member

This sounds really interesting and potentially very powerful @rabernat!

Would you mind commenting on the implications of a pyarrow dependency?

@rabernat
Copy link
Contributor Author

Would you mind commenting on the implications of a pyarrow dependency?

I feel like it is becoming as ubiquitous as numpy in the ecosystem, so I don't consider this a major blocker. Or it could be an optional dependency if you want to read data encoded this way. But I'd be curious to hear opinions on that.

@TomAugspurger
Copy link
Contributor

There's lots of feedback in pandas-dev/pandas#54466 on pandas adopting pyarrow as a required dependency. The primary concern raised is the size of the package, especially in serverless contexts (though it seems like AWS Lambda has some built-in support to make this not so much of an issue?).

There's some work being done in pyarrow to make core pieces available without having to bring in everything.

@@ -109,6 +109,7 @@ dependencies = [
"universal_pathlib"
]
extra-dependencies = [
"pyarrow",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunk_spec: ArraySpec,
) -> Buffer | None:
assert isinstance(chunk_array, NDBuffer)
arrow_array = pa.array(chunk_array.as_ndarray_like().ravel())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be chunk_array.as_numpy_array() here since pa.array() doesn't recognize CuPy arrays? Would be good to add a GPU test here for safety.

In theory, it would be possible to do zero-copy transfers for CuPy arrays too but would need to go from CuPy -> Numba first and then Numba -> Arrow.

@jhamman jhamman changed the base branch from v3 to main October 14, 2024 20:59
@jhamman jhamman added this to the After 3.0.0 milestone Oct 17, 2024
@dstansby dstansby removed the V3 label Dec 12, 2024
@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

7 participants