@@ -59,7 +59,7 @@ from JHU data.
5959Creating the dataset using ` {epidatr} `  and ` {epiprocess} ` 
6060</summary >
6161
62- This dataset can be found in the package as \< TODO DOESN’T EXIST \> ; we
62+ This dataset can be found in the package as ` covid_case_death_rates ` ; we
6363demonstrate some of the typically ubiquitous cleaning operations needed
6464to be able to forecast. First we pull both jhu-csse cases and deaths
6565from [ ` {epidatr} ` ] ( https://cmu-delphi.github.io/epidatr/ )  package:
@@ -84,26 +84,35 @@ deaths <- pub_covidcast(
8484  geo_values  =  " *" 
8585) | > 
8686  select(geo_value , time_value , death_rate  =  value )
87+ ``` 
88+ 
89+ Since visualizing the results on every geography is somewhat
90+ overwhelming, we’ll only train on a subset of 5.
91+ 
92+ ```  r 
93+ used_locations  <-  c(" ca" " ma" " ny" " tx" 
8794cases_deaths  <- 
8895  full_join(cases , deaths , by  =  c(" time_value" " geo_value" | > 
96+   filter(geo_value  %in%  used_locations ) | > 
8997  as_epi_df(as_of  =  as.Date(" 2022-01-01" 
90- plot_locations  <-  c(" ca" " ma" " ny" " tx" 
9198#  plotting the data as it was downloaded
9299cases_deaths  | > 
93-   filter(geo_value  %in%  plot_locations ) | > 
94-   pivot_longer(cols  =  c(" case_rate" " death_rate" names_to  =  " source" | > 
95-   ggplot(aes(x  =  time_value , y  =  value )) + 
96-   geom_line() + 
97-   facet_grid(source  ~  geo_value , scale  =  " free" + 
100+   autoplot(
101+     case_rate ,
102+     death_rate ,
103+     .color_by  =  " none" 
104+   ) + 
105+   facet_grid(.response_name  ~  geo_value , scale  =  " free" + 
98106  scale_x_date(date_breaks  =  " 3 months" date_labels  =  " %Y %b" + 
99107  theme(axis.text.x  =  element_text(angle  =  90 , hjust  =  1 ))
100108``` 
101109
102- <img  src =" man/figures/README-case_death -1.png "  width =" 90% "  style =" display block ; margin auto ;"  />
110+ <img  src =" man/figures/README-date -1.png "  width =" 90% "  style =" display block ; margin auto ;"  />
103111
104112As with basically any dataset, there is some cleaning that we will need
105113to do to make it actually usable; we’ll use some utilities from
106114[ ` {epiprocess} ` ] ( https://cmu-delphi.github.io/epiprocess/ )  for this.
115+ 
107116First, to eliminate some of the noise coming from daily reporting, we do
1081177 day averaging over a trailing window[ ^ 1 ] :
109118
@@ -129,10 +138,12 @@ cases_deaths <-
129138  group_by(geo_value ) | > 
130139  mutate(
131140    outlr_death_rate  =  detect_outlr_rm(
132-       time_value , death_rate , detect_negatives  =  TRUE 
141+       time_value , death_rate ,
142+       detect_negatives  =  TRUE 
133143    ),
134144    outlr_case_rate  =  detect_outlr_rm(
135-       time_value , case_rate , detect_negatives  =  TRUE 
145+       time_value , case_rate ,
146+       detect_negatives  =  TRUE 
136147    )
137148  ) | > 
138149  unnest(cols  =  starts_with(" outlr" names_sep  =  " _" | > 
@@ -142,22 +153,6 @@ cases_deaths <-
142153    case_rate  =  outlr_case_rate_replacement 
143154  ) | > 
144155  select(geo_value , time_value , case_rate , death_rate )
145- cases_deaths 
146- # > An `epi_df` object, 32,424 x 4 with metadata:
147- # > * geo_type  = state
148- # > * time_type = day
149- # > * as_of     = 2022-01-01
150- # > 
151- # > # A tibble: 32,424 × 4
152- # >   geo_value time_value case_rate death_rate
153- # > * <chr>     <date>         <dbl>      <dbl>
154- # > 1 ak        2020-06-01      2.31          0
155- # > 2 ak        2020-06-02      1.94          0
156- # > 3 ak        2020-06-03      2.63          0
157- # > 4 ak        2020-06-04      2.59          0
158- # > 5 ak        2020-06-05      2.43          0
159- # > 6 ak        2020-06-06      2.35          0
160- # > # ℹ 32,418 more rows
161156``` 
162157
163158</details >
@@ -173,18 +168,19 @@ Plot
173168```  r 
174169forecast_date_label  <- 
175170  tibble(
176-     geo_value  =  rep(plot_locations , 2 ),
177-     source  =  c(rep(" case_rate" 4 ), rep(" death_rate" 4 )),
178-     dates  =  rep(forecast_date  -  7  *  2 , 2  *  length(plot_locations )),
171+     geo_value  =  rep(used_locations , 2 ),
172+     .response_name  =  c(rep(" case_rate" 4 ), rep(" death_rate" 4 )),
173+     dates  =  rep(forecast_date  -  7  *  2 , 2  *  length(used_locations )),
179174    heights  =  c(rep(150 , 4 ), rep(1.0 , 4 ))
180175  )
181176processed_data_plot  <- 
182177  cases_deaths  | > 
183-   filter(geo_value  %in%  plot_locations ) | > 
184-   pivot_longer(cols  =  c(" case_rate" " death_rate" names_to  =  " source" | > 
185-   ggplot(aes(x  =  time_value , y  =  value )) + 
186-   geom_line() + 
187-   facet_grid(source  ~  geo_value , scale  =  " free" + 
178+   autoplot(
179+     case_rate ,
180+     death_rate ,
181+     .color_by  =  " none" 
182+   ) + 
183+   facet_grid(.response_name  ~  geo_value , scale  =  " free" + 
188184  geom_vline(aes(xintercept  =  forecast_date )) + 
189185  geom_text(
190186    data  =  forecast_date_label ,
@@ -216,7 +212,7 @@ four_week_ahead <- arx_forecaster(
216212four_week_ahead 
217213# > ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
218214# > 
219- # > This forecaster was fit on 2025-01-24 15:31:46 .
215+ # > This forecaster was fit on 2025-01-27 16:36:10 .
220216# > 
221217# > Training data was an <epi_df> with:
222218# > • Geography: state,
@@ -226,8 +222,8 @@ four_week_ahead
226222# > 
227223# > ── Predictions ──────────────────────────────────────────────────────────────
228224# > 
229- # > A total of 56  predictions are available for
230- # > • 56  unique geographic regions,
225+ # > A total of 4  predictions are available for
226+ # > • 4  unique geographic regions,
231227# > • At forecast date: 2021-08-01,
232228# > • For target date: 2021-08-29,
233229# > 
@@ -246,58 +242,34 @@ Plotting the prediction intervals on our subset above[^3]:
246242Plot
247243</summary >
248244
249- This is the same kind of plot as ` processed_data_plot `  above, but with
250- the past data narrowed somewhat
251- 
252245```  r 
253- narrow_data_plot  <- 
254-    cases_deaths   | > 
255-   filter( time_value   >   " 2021-04-01 " )  | > 
256-   filter( geo_value   %in%   plot_locations ) | > 
257-   pivot_longer( cols  =  c( " case_rate " ,  " death_rate" ,  names_to   =   " source " )  | > 
258-   ggplot(aes( x   =   time_value ,  y   =   value ))  + 
259-   geom_line()  + 
260-   facet_grid( source   ~   geo_value ,  scale   =   " free " + 
246+ epiworkflow  <-   four_week_ahead $ epi_workflow 
247+ restricted_predictions   <- 
248+   four_week_ahead $ predictions  | > 
249+   rename( time_value   =   target_date ,  value   =   .pred ) | > 
250+   mutate( .response_name  =  " death_rate" 
251+ forecast_plot   <- 
252+   four_week_ahead   | > 
253+   autoplot( plot_data   =   cases_deaths ) + 
261254  geom_vline(aes(xintercept  =  forecast_date )) + 
262255  geom_text(
263-     data  =  forecast_date_label ,
256+     data  =  forecast_date_label  % > % filter( .response_name   ==   " death_rate " ) ,
264257    aes(x  =  dates , label  =  " forecast\n date" y  =  heights ),
265258    size  =  3 , hjust  =  " right" 
266259  ) + 
267260  scale_x_date(date_breaks  =  " 3 months" date_labels  =  " %Y %b" + 
268261  theme(axis.text.x  =  element_text(angle  =  90 , hjust  =  1 ))
269262``` 
270263
271- Putting that together with a plot of the bands, and a plot of the median
272- prediction.
273- 
274- ```  r 
275- epiworkflow  <-  four_week_ahead $ epi_workflow 
276- restricted_predictions  <- 
277-   four_week_ahead $ predictions  | > 
278-   filter(geo_value  %in%  plot_locations ) | > 
279-   rename(time_value  =  target_date , value  =  .pred ) | > 
280-   mutate(source  =  " death_rate" 
281- forecast_plot  <- 
282-   narrow_data_plot  | > 
283-   epipredict ::: plot_bands(
284-     restricted_predictions ,
285-     levels  =  0.9 
286-   ) + 
287-   geom_point(
288-     data  =  restricted_predictions ,
289-     aes(y  =  .data $ value )
290-   )
291- ``` 
292- 
293264</details >
294265
295266<img  src =" man/figures/README-show-single-forecast-1.png "  width =" 90% "  style =" display block ; margin auto ;"  />
296267
297- The yellow dot gives the median prediction, while the red interval gives
298- the 5-95% inter-quantile range. For this particular day and these
299- locations, the forecasts are relatively accurate, with the true data
300- being within the 25-75% interval. A couple of things to note:
268+ The black dot gives the median prediction, while the blue intervals give
269+ the 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges. For this
270+ particular day and these locations, the forecasts are relatively
271+ accurate, with the true data being at least within the 10-90% interval.
272+ A couple of things to note:
301273
3022741 .   Our methods are primarily direct forecasters; this means we don’t
303275    need to predict 1, 2,…, 27 days ahead to then predict 28 days ahead
@@ -312,10 +284,10 @@ being within the 25-75% interval. A couple of things to note:
312284If you encounter a bug or have a feature request, feel free to file an
313285[ issue on our github
314286page] ( https://github.com/cmu-delphi/epipredict/issues ) . For other
315- questions, feel free to 
contact  [ Daniel ] ( [email protected] ) , 316- 317- [ Logan ] ( [email protected] ) , either via email or on the Insightnet 318- slack.
287+ questions, feel free to reach out to the authors, either via this 
288+ [ contact 
289+ form ] ( https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform ) , 
290+ email, or the Insightnet  slack.
319291
320292[ ^ 1 ] : This makes it so that any given day of the processed timeseries
321293    only depends on the previous week, which means that we avoid leaking
0 commit comments