Please have a look at https://books.ropensci.org/targets/performance.html, I think most of the suggestions apply to your case. Also, you will probably see a speedup if you use […]
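One of that chapter's main suggestions that likely applies here is batching: with ~50,000 dynamic branches, per-branch bookkeeping overhead dominates, so grouping the work into a few hundred batches and mapping over the batches cuts it dramatically. A rough sketch; `discover_urls()` and `scrape_one()` are hypothetical stand-ins for the project's own functions:

```r
library(targets)

list(
  tar_target(all_urls, discover_urls()),  # hypothetical: ~50,000 URLs
  tar_target(
    url_batches,
    # Group the URLs into 500 batches so each dynamic branch does
    # ~100 units of work instead of one.
    split(all_urls, cut(seq_along(all_urls), 500, labels = FALSE)),
    iteration = "list"
  ),
  tar_target(
    pages,
    lapply(url_batches, scrape_one),  # hypothetical: scrape one URL
    pattern = map(url_batches)
  )
)
```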
---
I roll with this kind of multiple-plan setup in projects all the time. In fact it's the only way I set up projects now in my current job. This is partly because each project usually has a couple of billing milestones, and for record keeping it's good to have a plan and cache of results that represents the work signed off at each milestone.

We use the […]. There's some stuff in […]. These homebrewed targets factories for sharing data between plans have also been useful: https://gist.github.com/MilesMcBain/4b7dec96bb1d721ab6038e213c1fdafe - I think sharing data between plans tends to be the main stumbling block?

Another issue that may pop up is that you want to lock the 'downstream' plan from being affected by updates in the 'upstream' plan. E.g. someone conducted an investigation into some client question and accidentally modified the cache. We implement this with custom cues for some of the important datasets. They depend on a global variable that can disable them rebuilding from changes in the upstream plan, roughly as in the sketch below.
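Something like the following captures the idea (a minimal sketch, not the exact code; `FREEZE_UPSTREAM`, `upstream_data`, `scraped_data`, and the store path are hypothetical):

```r
library(targets)

# Global switch: set TRUE to freeze important downstream datasets so
# accidental changes in the upstream plan can't trigger rebuilds.
FREEZE_UPSTREAM <- TRUE

list(
  tar_target(
    upstream_data,  # hypothetical name
    # Read a result straight out of the upstream plan's store.
    tar_read(scraped_data, store = "upstream/_targets"),
    # mode = "never" only builds the target if it is missing,
    # so the cached value survives upstream churn.
    cue = tar_cue(mode = if (FREEZE_UPSTREAM) "never" else "thorough")
  )
)
```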
---
When approaching a sizable web scraping project, it's wise to break it down into smaller, more manageable pipelines. Dividing the project into distinct stages or components improves organization and simplifies debugging, maintenance, and scaling.

A structured directory layout for your scraped data also helps: organizing files into logical categories or subdirectories streamlines data management and lets you feed the files into downstream pipelines as a manageable number of targets, which reduces complexity and aids reproducibility (see the sketch below). Considering your use of […]
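As one illustration of that directory-driven pattern, `tarchetypes::tar_files_input()` can batch a large set of existing files into a small number of tracked file targets. A sketch under assumed names; the `data/scraped` path and `parse_pages()` helper are hypothetical:

```r
library(targets)
library(tarchetypes)

list(
  # Batch ~50,000 downloaded files into 100 file targets, so the
  # pipeline tracks batches of files rather than one target per file.
  tar_files_input(
    scraped_files,
    list.files("data/scraped", recursive = TRUE, full.names = TRUE),
    batches = 100
  ),
  tar_target(
    parsed,
    parse_pages(scraped_files),  # hypothetical: parse one batch of files
    pattern = map(scraped_files)
  )
)
```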
---
Description
In #255 (comment), Will wrote […]. I'm wondering if the cleaner and more reproducible patterns are documented somewhere.
I'm finding myself very tempted to break a rather large project into smaller targets pipelines, and there are natural breakpoints in my project where I could do that. My use case is a large web scraping project. There is an initial set of targets that results in several dynamic branches with about 50,000 branches each. If everything is up-to-date, it takes `tar_make()` 20 minutes to an hour to work through skipping all the up-to-date branches. What I'd like to do is start a new pipeline that picks up from the downloaded files. I can use the directory structure to simplify the file inputs into a more manageable number of targets.

What patterns would others recommend following in this case?