-
-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ArrowRecordBatchCodec and vlen string support #2031
base: main
Are you sure you want to change the base?
Conversation
This experiment also suggests another interesting possibility: returning Arrow Arrays and Tables from a Zarr Array or Group. If the Zarr Arrays are all 1D, they can be represented as Arrow Arrays all the way through, and there are potentially opportunities to reduce memory copies. We could have ArrowBuffer / ArrowArrayBuffer types. |
This sounds really interesting and potentially very powerful @rabernat! Would you mind commenting on the implications of a pyarrow dependency? |
I feel like it is becoming as ubiquitous as numpy in the ecosystem, so I don't consider this a major blocker. Or it could be an optional dependency if you want to read data encoded this way. But I'd be curious to hear opinions on that. |
There's lots of feedback in pandas-dev/pandas#54466 on pandas adopting pyarrow as a required dependency. The primary concern raised is the size of the package, especially in serverless contexts (though it seems like AWS Lambda has some built-in support to make this not so much of an issue?). There's some work being done in pyarrow to make core pieces available without having to bring in everything. |
@@ -109,6 +109,7 @@ dependencies = [ | |||
"universal_pathlib" | |||
] | |||
extra-dependencies = [ | |||
"pyarrow", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunk_spec: ArraySpec, | ||
) -> Buffer | None: | ||
assert isinstance(chunk_array, NDBuffer) | ||
arrow_array = pa.array(chunk_array.as_ndarray_like().ravel()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be chunk_array.as_numpy_array()
here since pa.array()
doesn't recognize CuPy arrays? Would be good to add a GPU test here for safety.
In theory, it would be possible to do zero-copy transfers for CuPy arrays too but would need to go from CuPy -> Numba first and then Numba -> Arrow.
The discussion in zarr-developers/zeps#47 got me thinking: what if, instead of turning numpy arrays into bytes, we turn them into self-describing Arrow Record Batches and serialize them using the Arrow IPC format.
This would be a new type of Array -> Bytes codec. The beautiful thing about this is that it gives us variable-length string encoding for free (as well as potentially many other benefits) -- xref zarr-developers/zarr-specs#83.
This PR is a proof of concept that this is feasible and in fact very easy.
There is a lot more to explore here, but I thought I would just through this up for discussion.
TODO: