@@ -25,7 +25,7 @@ library(epidatr)
2525
2626To get a better handle on custom ` epi_workflow() ` s, lets recreate and then
2727modify the example of ` four_week_ahead ` from the [ landing
28- page] ( ../index.html#motivating-example ) ,
28+ page] ( ../index.html#motivating-example )
2929
3030``` {r make-four-forecasts, warning=FALSE}
3131training_data <- covid_case_death_rates |>
@@ -43,7 +43,7 @@ four_week_ahead <- arx_forecaster(
4343four_week_ahead$epi_workflow
4444```
4545# Anatomy of an ` epi_workflow `
46- An ` epi_workflow ` is an extension of a ` workflows::workflow() ` to handle panel
46+ An ` epi_workflow() ` is an extension of a ` workflows::workflow() ` to handle panel
4747data and post-processing.
4848It consists of 3 components, shown above:
4949
@@ -52,17 +52,21 @@ It consists of 3 components, shown above:
5252 steps] ( https://recipes.tidymodels.org/reference/index.html ) .
5353 Think of it as a more flexible ` formula ` that you would pass to ` lm() ` : `y ~
5454 x1 + log(x2) + lag(x1, 5)`.
55+ The above model has 6 of these steps.
5556 In general, there are 2 broad classes of transformation that ` {recipes} `
5657 handles:
57- - Transforms of both training and test data that are always applied.
58+ - Transforms of both training and test data that are always applied.
5859 Examples include taking the log of a variable, leading or lagging,
59- filtering out rows, handling dummy variables, etc.
60+ filtering out rows, handling dummy variables, calculating growth rates,
61+ etc.
6062 - Operations that fit parameters during training to apply during prediction,
6163 such as centering by the mean.
6264 This is a major benefit of ` {recipes} ` , since it prevents data leakage,
6365 where information about the test/predict time data "leaks" into the
6466 parameters. <!-- TODO unsure if worth even keeping, as we effectively
6567 can't have data leakage. -->
68+ However, the main mechanism we rely on to prevent data leakage is proper
69+ [ backtesting] ( backtesting.html ) .
6670 For the case of centering, we need to store the mean of the predictor from
6771 the training data and use that value on the prediction data rather than
6872 accidentally calculating the mean of the test predictor for centering.
@@ -74,24 +78,27 @@ It consists of 3 components, shown above:
7478 between a wide collection of statistical models.
75793 . Postprocessor: unique to this package, and used to format and modify the
7680 prediction after the model has been fit.
77- Each operation is a layer , and the stack of layers is known as frosting,
81+ Each operation is a ` layer_ ` , and the stack of layers is known as ` frosting() ` ,
7882 continuing the metaphor of baking a cake established in the recipe.
7983 Some example operations include:
80- - generate quantiles from purely point-prediction models,
81- - undo operations done in the steps, such as convert back to counts from rates
82- - threshold so the forecast doesn't include negative values
83- - generally adapt the format of the prediction to it's eventual use.
84+ - generate quantiles from purely point-prediction models,
85+ - undo operations done in the steps, such as convert back to counts from rates
86+ - threshold so the forecast doesn't include negative values
87+ - generally adapt the format of the prediction to it's eventual use.
8488
8589# Recreating ` four_week_ahead ` in an ` epi_workflow() `
8690To be able to extend this beyond what ` arx_forecaster() ` itself will let us do,
8791we first need to understand how to recreate it using a custom ` epi_workflow() ` .
8892
8993To use a custom workflow, there are a couple of steps:
94+
90951 . define the ` epi_recipe() ` , which contains the preprocessing steps
91962 . define the ` frosting() ` which contains the post-processing layers
92- 3 . Combine these with a trainer such as ` quantile_reg() ` into an ` epi_workflow ` , which we can then fit on the training data
93- 4 . fit the workflow on some data
94- 5 . grab the right prediction data using ` get_test_data() ` and apply the fit data to generate a prediction
97+ 3 . Combine these with a trainer such as ` quantile_reg() ` into an
98+ ` epi_workflow() ` , which we can then fit on the training data
99+ 4 . ` fit() ` the workflow on some data
100+ 5 . grab the right prediction data using ` get_test_data() ` and apply the fit data
101+ to generate a prediction
95102
96103## Define the ` epi_recipe() `
97104To do this, we'll first take a look at the steps as they're found in
@@ -103,9 +110,9 @@ hardhat::extract_recipe(four_week_ahead$epi_workflow)
103110
104111So there are 6 steps we will need to recreate.
105112One thing to note about the extracted recipe is that it has already been
106- trained; for steps such as ` recipes::step_BoxCox() ` , this means that their
113+ trained; for steps such as ` recipes::step_BoxCox() ` which have parameters , this means that their
107114parameters have been calculated.
108- Recreating this:
115+ Before defining steps, we need to create an ` epi_recipe() ` to hold them
109116``` {r make_recipe}
110117four_week_recipe <- epi_recipe(
111118 covid_case_death_rates |>
@@ -120,7 +127,7 @@ or `other_keys`; it is typically easiest to just use the training data itself.
120127
121128
122129Then we add each step via pipes; in principle the order matters, though for this
123- recipe only ` step_epi_naomit ` and ` step_training_window ` depend on the steps
130+ recipe only ` step_epi_naomit() ` and ` step_training_window() ` depend on the steps
124131before them:
125132
126133``` {r make_steps}
@@ -132,20 +139,21 @@ four_week_recipe <- four_week_recipe |>
132139 step_training_window()
133140```
134141
135- one thing to note: we only added 5 steps here because ` step_epi_naomit() ` is
136- actually a wrapper around adding 2 base ` step_naomit() ` s, on for
142+ One thing to note: we only added 5 steps here because ` step_epi_naomit() ` is
143+ actually a wrapper around adding 2 base ` step_naomit() ` s, one for
137144` all_predictors() ` and one for ` all_outcomes() ` , differing in their treatment at
138145predict time.
139146
140- ` step_epi_lag ` and ` step_epi_ahead ` both accept tidy syntax, so if we
141- wanted the same lags for both ` case_rate ` and ` death_rate ` , we could have done
142- ` step_epi_lag(ends_with("rate"), lag = c(0, 7, 14)) ` .
147+ ` step_epi_lag() ` and ` step_epi_ahead() ` both accept tidy syntax, so if for
148+ example we wanted the same lags for both ` case_rate ` and ` death_rate ` , we could
149+ have done ` step_epi_lag(ends_with("rate"), lag = c(0, 7, 14)) ` .
143150
144151In general for the ` {recipes} ` package, steps assign roles, such as ` predictor `
145- and ` outcome ` to columns, either by adding new ones or adjusting existing ones.
146- ` step_epi_lag() ` for example, creates a new column per lag with the name
152+ and ` outcome ` to columns, either by adding new columns or adjusting existing
153+ ones.
154+ ` step_epi_lag() ` for example, creates a new column for each lag with the name
147155` lag_x_column_name ` and assigns them as predictors, while ` step_epi_ahead() `
148- creates ` ahead_column_name ` columns and assigns them as outcomes.[ ^ 2 ]
156+ creates ` ahead_x_column_name ` columns and assigns them as outcomes.
149157
150158One way to inspect the roles assigned is to use ` prep() ` :
151159
@@ -163,9 +171,14 @@ four_week_recipe |>
163171 bake(training_data)
164172```
165173
174+ This is also useful for debugging malfunctioning pipelines; if you define
175+ ` four_week_recipe ` only up to the step that is misbehaving, you can get a
176+ partial evaluation to see the exact data the step is being applied to.
177+ It also allows you to see the exact data that the parsnip model is fitting on.
178+
166179## Define the ` frosting() `
167- Since the post-processing frosting[ ^ 1 ] layers are unique to this package, to
168- inspect them we use ` extract_frosting ` from ` {epipredict} ` :
180+ Since the post-processing frosting layers [ ^ 1 ] are unique to this package, to
181+ inspect them we use ` extract_frosting() ` from ` {epipredict} ` :
169182
170183``` {r inspect_fwa_layers, warning=FALSE}
171184epipredict::extract_frosting(four_week_ahead$epi_workflow)
@@ -174,7 +187,7 @@ epipredict::extract_frosting(four_week_ahead$epi_workflow)
174187The above gives us detailed descriptions of the arguments to the functions named
175188above in the postprocessor of ` four_week_ahead$epiworkflow ` .
176189Creating the layers is a similar process, except with frosting instead of a
177- ` recipe ` :
190+ ` recipe ` [ ^ 2 ] :
178191
179192``` {r make_frosting}
180193four_week_layers <- frosting() |>
@@ -185,26 +198,31 @@ four_week_layers <- frosting() |>
185198 layer_threshold()
186199```
187200
201+
188202Most layers will work for any engine or steps; ` layer_predict() ` you will want
189203to call in every case.
190204There are a couple of layers, however, which depend on whether the engine used predicts quantiles or point estimates.
191205
192- The layers that are only supported by point estimate engines (such as ` linear_reg() ` ):
206+ The layers that are only supported by point estimate engines (such as
207+ ` linear_reg() ` ) are
193208
194- - ` layer_residual_quantiles() ` : the preferred method of generating quantiles, it
195- uses the error residuals of the engine. This will work for most parsnip engines.
209+ - ` layer_residual_quantiles() ` : the preferred method of generating quantiles for
210+ non-quantile models, it uses the error residuals of the engine.
211+ This will work for most parsnip engines.
196212- ` layer_predictive_distn() ` : alternate method of generating quantiles, it uses
197213 an approximate parametric distribution. This will work for linear regression
198214 specifically.
199215
200216TODO check this
201217
202218On the other hand, the layers that are only supported by quantile estimating
203- engines (such as ` quantile_reg() ` ):
219+ engines (such as ` quantile_reg() ` ) are
204220
205- - ` layer_quantile_distn() `
206- - ` layer_point_from_distn() ` : this adds the median quantile as an point
207- estimate[ ^ 3 ] .
221+ - ` layer_quantile_distn() ` : adds the specified quantiles.
222+ If they differ from the ones actually fit, they will be interpolated and/or
223+ extrapolated.
224+ - ` layer_point_from_distn() ` : this adds the median quantile as a point estimate,
225+ and should be called after ` layer_quantile_distn() ` if called at all.
208226
209227## Fitting an ` epi_workflow() `
210228
@@ -217,28 +235,23 @@ four_week_workflow <- epi_workflow(
217235)
218236```
219237
220- And fit it[ ^ 4 ] :
238+ And fit it to recreate ` four_week_ahead$epi_workflow `
221239``` {r workflow_fitting}
222240fit_workflow <- four_week_workflow |> fit(training_data)
223241```
224242
225- This has recreated ` four_week_ahead$epi_workflow `
226-
227243## Predicting
228244
229245To do a prediction, we need to first narrow the dataset down to the relevant
230246datapoints:
231247
232248``` {r grab_data}
233249relevant_data <- get_test_data(
234- extract_preprocessor(fit_workflow) ,
250+ four_week_recipe ,
235251 training_data
236252)
237253```
238254
239- Note the use of ` extract_preprocessor() ` rather than ` extract_recipe() ` ; this
240- gets the unfit version of the ` epi_recipe ` , which is necessary.
241-
242255With a fit workflow and test data in hand, we can actually make our predictions:
243256
244257``` {r workflow_pred}
@@ -252,15 +265,26 @@ predictions:
252265fit_workflow |> predict(training_data)
253266```
254267
255- However, this produces forecasts for not just the actual ` forecast_date ` , but
256- for every day in the dataset it has enough data to actually make a prediction.
268+ The resulting tibble is 800 rows long, however.
269+ This produces forecasts for not just the actual ` forecast_date ` , but for every
270+ day in the dataset it has enough data to actually make a prediction.
271+ To narrow this down, we could filter to rows where the ` time_value ` matches the ` forecast_date ` :
272+
273+ ``` {r workflow_pred_training_filter}
274+ fit_workflow |>
275+ predict(training_data) |>
276+ filter(time_value == forecast_date)
277+ ```
278+
279+ This can be useful for cases where ` get_test_data() ` doesn't pull sufficient
280+ data.
257281
258282# Extending ` four_week_ahead `
259283
260284There are many ways we could modify ` four_week_ahead ` ; one simple modification
261285would be to include a growth rate estimate as part of the model.
262- Another would be to include a time component to the prediction if we expect
263- there to be a strong seasonal variation .
286+ Another would be to include a time component to the prediction (useful if we
287+ expect there to be a strong seasonal component) .
264288Another would be to convert from rates to counts, for example if that were the
265289preferred prediction format.
266290
@@ -313,11 +337,11 @@ growth_rate_workflow <- epi_workflow(
313337)
314338
315339relevant_data <- get_test_data(
316- extract_preprocessor(fit_workflow) ,
340+ growth_rate_recipe ,
317341 training_data
318342)
319343gr_fit_workflow <- growth_rate_workflow |> fit(training_data)
320- gr_predictions <- fit_workflow |>
344+ gr_predictions <- gr_fit_workflow |>
321345 predict(relevant_data) |>
322346 filter(time_value == forecast_date)
323347```
@@ -354,3 +378,50 @@ result_plot
354378```
355379
356380TODO ` get_test_data ` isn't actually working here...
381+
382+ [ ^ 1 ] : Think of baking a cake, where adding the frosting is the last step in the process of actually baking.
383+
384+ [ ^ 2 ] : Note that the frosting doesn't require any information about the training
385+ data, since the output of the model only depends on the model used.
386+
387+ ## Population scaling
388+
389+ Suppose we're sending our predictions to someone who is looking to understand
390+ counts, rather than rates.
391+ Then we can adjust just the frosting to get a forecast for the counts from our
392+ rates forecaster:
393+
394+ ``` {r rate_scale}
395+ count_layers <-
396+ frosting() |>
397+ layer_predict() |>
398+ layer_residual_quantiles(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
399+ layer_population_scaling(
400+ .pred,
401+ .pred_distn,
402+ df = epidatasets::state_census,
403+ df_pop_col = "pop",
404+ create_new = FALSE,
405+ rate_rescaling = 1e5,
406+ by = c("geo_value" = "abbr")
407+ ) |>
408+ layer_add_forecast_date() |>
409+ layer_add_target_date() |>
410+ layer_threshold()
411+ # building the new workflow
412+ count_workflow <- epi_workflow(
413+ four_week_recipe,
414+ linear_reg(),
415+ count_layers
416+ )
417+ count_pred_data <- get_test_data(four_week_recipe, training_data)
418+ count_predictions <- count_workflow |>
419+ fit(training_data) |>
420+ predict(count_pred_data)
421+ count_predictions
422+ ```
423+
424+ which are 2-3 orders of magnitude larger than the corresponding rates above.
425+ ` df ` represents the scaling value; in this case it is the state populations,
426+ while ` rate_rescaling ` gives the denominator of the rate (our fit values were
427+ per 100,000).
0 commit comments