custom arrow table & dataset formats for targets storage #805
-
Glad you like it! Thanks for the kind words.
In general, databases and larger-than-memory objects are tricky because of
Just to clear up a misconception: when Jared claimed that
-
I suspect that my knowledge may be insufficient to convey the arrow system and how it interacts with duckdb, but I can try! I have been aware of the built-in arrow storage format, and in particular I've been using it. So I write code like:

```r
tar_target(
  processed_data,
  open_dataset("data-raw/pq") %>%
    filter(country == "US") %>%
    select(state, country, etc) %>% # "etc" stands in for the other columns I keep
    compute(),
  format = format_arrow_table()
)
```

The results stay as active bindings to the arrow table, both when the target runs and when it is reloaded, whether for downstream targets or via tar_load().

In addition to providing various in-memory and file formats, arrow provides query engines that can work either on in-memory arrow objects (arrow tables) or on larger-than-memory arrow datasets. Arrow datasets are stored on disk as folders full of parquet files (possibly with subfolders). On top of that, duckdb and arrow can work together, so duckdb can perform SQL operations on parquet files, or on folders full of parquet files, via arrow. The result is pretty different from other databases: duckdb ends up performing the queries, but I am not persisting any data within duckdb. That aligns decently with the targets approach of immutable objects.

After another day of working with the custom formats, I think the arrow table storage format has the potential to really change the code I am writing. The dataset format doesn't work so well. In particular, arrow wants to persist datasets as folders, but targets wants the contents of _targets/objects to be one file per target. I think there is a swapping operation that doesn't work properly for folders full of parquet, and the overall size of all the files makes everything pretty slow.
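For concreteness, here is a minimal sketch of the kind of duckdb-on-arrow query I mean, assuming a folder of parquet files at data-raw/pq with a state column (the path and column are just illustrative):

```r
library(arrow)
library(dplyr)

# Query a folder of parquet files with duckdb's engine, without loading
# the data into R memory or persisting anything inside duckdb.
open_dataset("data-raw/pq") %>% # larger-than-memory arrow dataset
  to_duckdb() %>%               # register with duckdb; no data is copied into R
  group_by(state) %>%
  summarise(n = n()) %>%
  collect()                     # only the small summary lands in R memory
```

And because arrow datasets are folders, one workaround for the one-file-per-target mismatch (not a custom format at all) is to have the target write the dataset itself and return the directory path under the built-in format = "file", which tracks files and directories by hash. A hedged sketch, with a hypothetical target name and output path:

```r
tar_target(
  us_dataset,
  {
    ds <- open_dataset("data-raw/pq") %>% filter(country == "US")
    write_dataset(ds, "data/us_dataset") # writes a folder of parquet files
    "data/us_dataset"                    # return the path for targets to track
  },
  format = "file"
)
```

This sidesteps the swapping issue because targets never has to move the folder into _targets/objects, at the cost of managing the output location yourself.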
-
Even though I am a daily user of targets, I only noticed today the (relatively) new custom storage formats feature with tar_format(). This is an awesome feature; I just wanted to share my experience and say thanks!

I have been working on a project that involves larger-than-my-laptop-memory data, so I've been trying to use a combination of targets, arrow, and duckdb. Wherever possible, I have been trying to avoid loading the data into R memory, keeping it on the arrow/duckdb side instead.
In any case, I came across the conversation between @wlandau and @jaredlander on the PR for the custom formats feature. I don't know if Jared ever got around to a tarchetypes-like package for arrow storage formats, but thanks to the simplicity of the targets design, I was impressed by how easy it was to put together a working prototype.
Gist of custom formats for loading as Arrow Tables and Arrow Datasets: https://gist.github.com/jameelalsalam/8e83a458e6eeecd92906860128ec9d95
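For anyone skimming without clicking through: the gist has the actual implementations, but the basic shape of a tar_format()-based arrow table format is roughly the following. This is a minimal sketch rather than the gist verbatim, using parquet as the on-disk format; the read/write pair is the essential part:

```r
format_arrow_table <- function() {
  targets::tar_format(
    # Read the stored file back as an arrow Table, not a data frame.
    read = function(path) arrow::read_parquet(path, as_data_frame = FALSE),
    # Write the arrow Table to the single file targets assigns to the target.
    write = function(object, path) arrow::write_parquet(object, path)
  )
}
```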
I haven't tested them much (and I don't ever use parallel workers), but so far they seem to be working.
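On the parallel workers point: arrow tables wrap external pointers, so if anyone does use workers, I expect the marshal/unmarshal hooks of tar_format() would be needed to move objects between processes. A hedged, untested sketch using arrow's IPC stream serialization:

```r
targets::tar_format(
  read = function(path) arrow::read_parquet(path, as_data_frame = FALSE),
  write = function(object, path) arrow::write_parquet(object, path),
  # Serialize the Table to a raw vector so it can cross process boundaries...
  marshal = function(object) arrow::write_to_raw(object),
  # ...and rebuild the Table from that raw vector on the other side.
  unmarshal = function(object) arrow::read_ipc_stream(object, as_data_frame = FALSE)
)
```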
Thanks so much to you, Will, for all your work on the targets package, and to you, Jared, for putting so much R knowledge out into the world through your events and talks.