Please have a look at https://books.ropensci.org/targets/performance.html, I think most of the suggestions apply to your case. Also, you will probably see a speedup if you use […]
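One of that chapter's main suggestions that likely applies here is batching: with ~50,000 dynamic branches, per-branch bookkeeping overhead dominates, so grouping the work into a few hundred batches and mapping over the batches cuts it dramatically. A rough sketch; `discover_urls()` and `scrape_one()` are hypothetical stand-ins for the project's own functions:

```r
library(targets)

list(
  tar_target(all_urls, discover_urls()),  # hypothetical: ~50,000 URLs
  tar_target(
    url_batches,
    # Group the URLs into 500 batches so each dynamic branch does
    # ~100 units of work instead of one.
    split(all_urls, cut(seq_along(all_urls), 500, labels = FALSE)),
    iteration = "list"
  ),
  tar_target(
    pages,
    lapply(url_batches, scrape_one),  # hypothetical: scrape one URL
    pattern = map(url_batches)
  )
)
```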
---
I roll with this kind of multiple-plan setup in projects all the time. In fact it's the only way I set up projects now in my current job. This is partly because each project usually has a couple of billing milestones, and for record keeping it's good to have a plan and cache of results that represents the work signed off at each milestone.

We use the […]. There's some stuff in […]. These homebrewed targets factories for sharing data between plans have also been useful: https://gist.github.com/MilesMcBain/4b7dec96bb1d721ab6038e213c1fdafe - I think sharing data between plans tends to be the main stumbling block?

Another issue that may pop up is that you want to lock the 'downstream' plan from being affected by updates in the 'upstream' plan. E.g. someone conducted an investigation into some client question and accidentally modified the cache. We implement this with custom cues for some of the important datasets. They depend on a global variable that can disable them rebuilding from changes in the upstream plan, roughly as in the sketch below.
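Something like the following captures the idea (a minimal sketch, not the exact code; `FREEZE_UPSTREAM`, `upstream_data`, `scraped_data`, and the store path are hypothetical):

```r
library(targets)

# Global switch: set TRUE to freeze important downstream datasets so
# accidental changes in the upstream plan can't trigger rebuilds.
FREEZE_UPSTREAM <- TRUE

list(
  tar_target(
    upstream_data,  # hypothetical name
    # Read a result straight out of the upstream plan's store.
    tar_read(scraped_data, store = "upstream/_targets"),
    # mode = "never" only builds the target if it is missing,
    # so the cached value survives upstream churn.
    cue = tar_cue(mode = if (FREEZE_UPSTREAM) "never" else "thorough")
  )
)
```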
---
When approaching a sizable web scraping project, it's wise to break it down into smaller, more manageable pipelines. Dividing the project into distinct stages or components improves organization and simplifies debugging, maintenance, and scaling.

A structured directory layout for your scraped data also helps: organizing files into logical categories or subdirectories streamlines data management and lets you feed the files into downstream pipelines as a manageable number of targets, which reduces complexity and aids reproducibility (see the sketch below). Considering your use of […]
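As one illustration of that directory-driven pattern, `tarchetypes::tar_files_input()` can batch a large set of existing files into a small number of tracked file targets. A sketch under assumed names; the `data/scraped` path and `parse_pages()` helper are hypothetical:

```r
library(targets)
library(tarchetypes)

list(
  # Batch ~50,000 downloaded files into 100 file targets, so the
  # pipeline tracks batches of files rather than one target per file.
  tar_files_input(
    scraped_files,
    list.files("data/scraped", recursive = TRUE, full.names = TRUE),
    batches = 100
  ),
  tar_target(
    parsed,
    parse_pages(scraped_files),  # hypothetical: parse one batch of files
    pattern = map(scraped_files)
  )
)
```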
---
Description
In #255 (comment), Will wrote […]. I'm wondering if the cleaner and more reproducible patterns are documented somewhere.
I'm finding myself very tempted to break a rather large project into smaller targets pipelines, and there are natural breakpoints in my project where I could do that. My use case is a large web scraping project. There is an initial set of targets that results in several dynamic branches with about 50,000 branches each. If everything is up-to-date, it takes `tar_make()` 20 minutes to an hour to work through skipping all the up-to-date branches. What I'd like to do is start a new pipeline that picks up from the downloaded files. I can use the directory structure to simplify the file inputs into a more manageable number of targets.

What patterns would others recommend following in this case?