custom arrow table & dataset formats for targets storage #805
-
Glad you like it! Thanks for the kind words.
In general, databases and larger-than-memory objects are tricky because of
Just to clear up a misconception: when Jared claimed that
-
I suspect that my knowledge may be insufficient to convey the arrow system and how it interacts with duckdb, but I can try! I have been aware of the built-in arrow storage format, and in particular I've been using it. So I write code like:

```r
tar_target(
  processed_data,
  open_dataset("data-raw/pq") %>%
    filter(country == "US") %>%
    select(state, country, etc) %>% # "etc" stands in for the other columns I keep
    compute(),
  format = format_arrow_table()
)
```

The results stay as active bindings to the arrow table, both when the target runs and when it is reloaded, whether for downstream targets or via tar_load().

In addition to providing various in-memory and file formats, arrow provides query engines that can work either on in-memory arrow objects (arrow tables) or on larger-than-memory arrow datasets. Arrow datasets are stored on disk as folders full of parquet files (possibly with subfolders). On top of that, duckdb and arrow can work together, so duckdb can perform SQL operations on parquet files, or on folders full of parquet files, via arrow. The result is pretty different from other databases: duckdb ends up performing the queries, but I am not persisting any data within duckdb. That aligns decently with the targets approach of immutable objects.

After another day of working with the custom formats, I think the arrow table storage format has the potential to really change the code I am writing. The dataset format doesn't work so well. In particular, arrow wants to persist datasets as folders, but targets wants the contents of _targets/objects to be one file per target. I think there is a swapping operation that doesn't work properly for folders full of parquet, and the overall size of all the files makes everything pretty slow.
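For concreteness, here is a minimal sketch of the kind of duckdb-on-arrow query I mean, assuming a folder of parquet files at data-raw/pq with a state column (the path and column are just illustrative):

```r
library(arrow)
library(dplyr)

# Query a folder of parquet files with duckdb's engine, without loading
# the data into R memory or persisting anything inside duckdb.
open_dataset("data-raw/pq") %>% # larger-than-memory arrow dataset
  to_duckdb() %>%               # register with duckdb; no data is copied into R
  group_by(state) %>%
  summarise(n = n()) %>%
  collect()                     # only the small summary lands in R memory
```

And because arrow datasets are folders, one workaround for the one-file-per-target mismatch (not a custom format at all) is to have the target write the dataset itself and return the directory path under the built-in format = "file", which tracks files and directories by hash. A hedged sketch, with a hypothetical target name and output path:

```r
tar_target(
  us_dataset,
  {
    ds <- open_dataset("data-raw/pq") %>% filter(country == "US")
    write_dataset(ds, "data/us_dataset") # writes a folder of parquet files
    "data/us_dataset"                    # return the path for targets to track
  },
  format = "file"
)
```

This sidesteps the swapping issue because targets never has to move the folder into _targets/objects, at the cost of managing the output location yourself.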
-
Even though I am a daily user of targets, I only noticed today the (relatively) new custom storage formats feature with tar_format(). This is an awesome feature; I just wanted to share my experience and say thanks!

I have been working on a project that involves larger-than-my-laptop-memory data, so I've been trying to use a combination of targets, arrow, and duckdb. Wherever possible, I have been trying to avoid loading the data into R memory, keeping it on the arrow/duckdb side instead.
In any case, I came across the conversation between @wlandau and @jaredlander on the PR for the custom formats feature. I don't know if Jared ever got around to a tarchetypes-like package for arrow storage formats, but thanks to the simplicity of the targets design, I was impressed by how easy it was to put together a working prototype.
Gist of custom formats for loading as Arrow Tables and Arrow Datasets: https://gist.github.com/jameelalsalam/8e83a458e6eeecd92906860128ec9d95
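For anyone skimming without clicking through: the gist has the actual implementations, but the basic shape of a tar_format()-based arrow table format is roughly the following. This is a minimal sketch rather than the gist verbatim, using parquet as the on-disk format; the read/write pair is the essential part:

```r
format_arrow_table <- function() {
  targets::tar_format(
    # Read the stored file back as an arrow Table, not a data frame.
    read = function(path) arrow::read_parquet(path, as_data_frame = FALSE),
    # Write the arrow Table to the single file targets assigns to the target.
    write = function(object, path) arrow::write_parquet(object, path)
  )
}
```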
I haven't tested them much (and I don't ever use parallel workers), but so far they seem to be working.
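On the parallel workers point: arrow tables wrap external pointers, so if anyone does use workers, I expect the marshal/unmarshal hooks of tar_format() would be needed to move objects between processes. A hedged, untested sketch using arrow's IPC stream serialization:

```r
targets::tar_format(
  read = function(path) arrow::read_parquet(path, as_data_frame = FALSE),
  write = function(object, path) arrow::write_parquet(object, path),
  # Serialize the Table to a raw vector so it can cross process boundaries...
  marshal = function(object) arrow::write_to_raw(object),
  # ...and rebuild the Table from that raw vector on the other side.
  unmarshal = function(object) arrow::read_ipc_stream(object, as_data_frame = FALSE)
)
```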
Thanks so much to you, Will, for all your work on the targets package, and to you, Jared, for putting so much R knowledge out into the world through your events and talks.