Replies: 7 comments 28 replies
-
I did not realize that the versioning system in [...] For context, the primary reason I made the switch was to eliminate the egregious performance penalties from calling the AWS API in large pipelines. In [...] I solved this problem by caching the results of a paginated LIST request, which reduced AWS HTTPS calls by 1000x and the time spent by more than 60x. But unfortunately, ListObjectsV2 is not version-aware. The only alternative that might work is ListObjectVersions, which would return a listing of every single version of every single object in the pipeline, an unmanageably large payload. So as a side effect, [...]
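To illustrate the kind of caching involved, here is a rough sketch (not the actual targets internals): one paginated ListObjectsV2 call per bucket/prefix, with per-target existence checks answered from the in-memory result. It assumes paws.storage as the S3 client; `list_bucket_cached()` and `target_exists()` are made-up helper names.

```r
# Rough sketch of the caching idea (not the actual targets internals).
library(paws.storage)

list_bucket_cached <- local({
  cache <- new.env(parent = emptyenv())
  function(bucket, prefix) {
    id <- paste(bucket, prefix, sep = "/")
    if (exists(id, envir = cache, inherits = FALSE)) {
      return(get(id, envir = cache, inherits = FALSE))
    }
    client <- s3()
    objects <- list()
    token <- NULL
    repeat {
      # One paginated ListObjectsV2 request per bucket/prefix.
      page <- client$list_objects_v2(
        Bucket = bucket,
        Prefix = prefix,
        ContinuationToken = token
      )
      objects <- c(objects, page$Contents)
      if (!isTRUE(page$IsTruncated)) break
      token <- page$NextContinuationToken
    }
    assign(id, objects, envir = cache)
    objects
  }
})

# Thousands of per-target existence checks then become in-memory lookups:
target_exists <- function(key, bucket, prefix) {
  keys <- vapply(list_bucket_cached(bucket, prefix), `[[`, "", "Key")
  key %in% keys
}
```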
-
Hmm, this is definitely going to require us to consider different strategies. I wonder if it is possible for us to implement a version where targets are named by hash in the S3 store, rather than by target name, and drop the versioning feature in the S3 bucket entirely. (This would have the added benefit of allowing the store to move between S3 providers, which is not possible for S3 object version history.)
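In other words, the object's content hash would become its name in the bucket. A toy illustration of the idea (not targets code; digest is just a convenient hashing package here):

```r
# Derive the S3 key from the content hash, so identical content always maps
# to the same key and bucket versioning becomes unnecessary.
library(digest)

cas_key <- function(object, prefix = "cas") {
  file.path(prefix, digest(object, algo = "xxhash64"))
}

cas_key(mtcars)           # same content -> same key, every time
cas_key(head(mtcars, 5))  # different content -> different key
```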
-
I'm not going to argue for you to change direction, but I must say this change surprised me because of how well suited the S3 versioning feature was to this workflow. I really thought it was one of the primary design goals for it. I don't know how widespread the user base of this feature is, but I would be curious what their use cases are and whether this will be an issue for others.
-
I was about to ask whether, in the long term, you would be open to custom repository back ends other than "aws" and "gcp" that might allow for custom behaviors. However, it occurs to me that this might be entirely possible to implement using custom formats, where the read and write functions access a local or cloud key-value store and just save the keys to disk. I'll toy with this a bit, but I'm sure it is also an "off-label" use case, so I'll check before trying anything serious with it.
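A rough sketch of that idea, assuming the `tar_format()` callbacks take roughly this shape (write receives the object and a path, read receives the path). `kv_put()` and `kv_get()` are hypothetical helpers for whatever key-value store is in play; they are not part of targets.

```r
# Hedged sketch only: a custom format whose write function pushes the value to
# a key-value store and writes just the key to disk, and whose read function
# fetches the value back by key.
library(targets)

format_kv <- tar_format(
  write = function(object, path) {
    key <- digest::digest(object, algo = "xxhash64")
    kv_put(key, object)     # hypothetical: store the serialized value under key
    writeLines(key, path)   # targets then tracks only the small key file
  },
  read = function(path) {
    kv_get(readLines(path)) # hypothetical: retrieve the value by key
  }
)

# Usage: tar_target(big_result, expensive_step(), format = format_kv)
```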
-
We're also using S3 versioning with a shared bucket for multiple users on a project, so this is an interesting discussion to me. Our server is running locally (MinIO) and we're also doing [...] As I understand it, the change was made to speed up working with AWS. But is the side effect that targets now determines whether a pipeline is up to date differently? Before it was comparing the local metadata against the AWS store, and now it ignores the local metadata?
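For reference, a setup like the one described here would look roughly like the following in _targets.R. The bucket, prefix, and endpoint are placeholders, and the argument names reflect my reading of the tar_resources_aws() documentation rather than a tested configuration.

```r
# Sketch of a shared-bucket setup against a local MinIO server.
library(targets)

tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket   = "shared-project-bucket",  # placeholder bucket name
      prefix   = "targets-store",
      endpoint = "http://localhost:9000"   # local MinIO endpoint
    )
  )
)

list(
  tar_target(shared_model, fit_model())    # fit_model() is hypothetical
)
```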
-
Looking back on this now: the approach you describe is apparently called "content-addressable storage" (CAS). Like I mentioned, it would be better to be able to opt into CAS regardless of the other storage settings (e.g. AWS vs local). I am still not sure this is feasible for [...]
-
An update: as of #1322, content-addressable storage (CAS) is now part of [...] I think the only tricky problems left to the user are (1) garbage collection / retention policy, and (2) an efficient [...]

targets/R/tar_repository_cas.R, lines 1 to 238 at 4fe9da7
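For anyone landing here later, here is a minimal sketch of what a plug-in CAS back end can look like, using a plain directory as the store. The callback names (upload, download, exists) follow my reading of the #1322 interface, so double-check the tar_repository_cas() documentation before relying on this; if I recall correctly, targets also ships a built-in local variant (tar_repository_cas_local(), with tar_repository_cas_local_gc() for garbage collection), which is probably the easier starting point.

```r
# Minimal sketch: a directory-backed CAS plug-in. Callback names follow my
# reading of the tar_repository_cas() interface; verify against the manual.
library(targets)

repository_dir_cas <- tar_repository_cas(
  upload = function(key, path) {
    dir.create("cas_store", showWarnings = FALSE, recursive = TRUE)
    file.copy(path, file.path("cas_store", key), overwrite = TRUE)
  },
  download = function(key, path) {
    file.copy(file.path("cas_store", key), path, overwrite = TRUE)
  },
  exists = function(key) {
    file.exists(file.path("cas_store", key))
  }
)

# In _targets.R: tar_option_set(repository = repository_dir_cas)
```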
-
Description
My team recently realized that this change in targets 1.4 may have significant consequences for our cloud-based workflows:
We use S3 versioning with a shared bucket for multiple users on a project. It has been excellent for letting us share compute-intensive targets and avoid downloading large targets that can be skipped, even as each user works on different branches of a project where some targets may diverge. As long as a version of the target in the bucket exists that matches the local metafile, it could be skipped, and overwriting the latest version wouldn't interfere with others' state. This also means CI builds of targets can take advantage of already-built targets without interfering with development.

Under targets 1.3, this requires setting `repository_meta = "local"` (a sketch of that configuration follows below), which is fine. However, does this mean that in 1.4+ this approach to collaboration will no longer work? Is there a way to recover the old behavior? Being able to collaborate this way has been the primary benefit of cloud-based versioning for us.
@emmamendelsohn
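A sketch of the targets 1.3-era configuration referred to above, as I understand it; the bucket and prefix are placeholders.

```r
# Sketch of the 1.3-era setup: targets live in the shared versioned bucket,
# while repository_meta = "local" keeps the metadata in _targets/meta/ local
# rather than uploading it to the bucket.
library(targets)

tar_option_set(
  repository = "aws",
  repository_meta = "local",
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "shared-project-bucket",  # placeholder
      prefix = "targets-store"
    )
  )
)
```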