-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path05-visualization.Rmd
776 lines (603 loc) · 44.2 KB
/
05-visualization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
---
output:
html_document:
toc: yes
html_notebook: default
pdf_document:
toc: yes
---
```{r echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
library(knitr)
options(scipen = 999)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
opts_chunk$set(cache=T)
```
```{r, echo = FALSE, warning=FALSE, message=FALSE}
library(Hmisc)
library(knitr)
library(ggmap)
library(ggstatsplot)
register_google(key = "AIzaSyDMxadfjNH489ueRgo9S62PuZU6PbwAyiM")
```
## Data visualization
::: {.infobox .download data-latex="{download}"}
[You can download the corresponding R-Code here](./Code/04-visualization.R)
:::
This section discusses the important topic of data visualization and how to produce appropriate graphics to describe your data visually. You should always visualize your data first. The plots we created in the previous chapters used R's in-built functions. In this section, we will be using the `ggplot2` package by Hadley Wickham. It has the advantage of being fairly straightforward to learn and being very flexible when it comes to building more complex plots. For a more in depth discussion you can refer to chapter 4 of the book "Discovering Statistics Using R" by Andy Field et al. or read the following chapter from the book <a href="http://r4ds.had.co.nz/data-visualisation.html" target="_blank">"R for Data science"</a> by Hadley Wickham as well as <a href="https://r-graphics.org/" target="_blank">"R Graphics Cookbook"</a> by Winston Chang.
ggplot2 is built around the idea of constructing plots by stacking layers on top of one another. Every plot starts with the ```ggplot(data)``` function, after which layers can be added with the "+" symbol. The following figures show the layered structure of creating plots with ggplot.
<p style="text-align:center;">
<img src="https://github.com/IMSMWU/Teaching/raw/master/MRDA2017/Graphics/ggplot2.JPG" alt="DSUR cover" height="250" />
<img src="https://github.com/IMSMWU/Teaching/raw/master/MRDA2017/Graphics/ggplot1.JPG" alt="DSUR cover" height="250" />
</p>
### Categorical variables
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/Eer3BHhS5IY" frameborder="0" allowfullscreen></iframe>
</div>
<br>
#### Bar plot
To give you an example of how the graphics are composed, let's go back to the frequency table from the previous chapter, where we created a table showing the relative frequencies of songs in the Austrian streaming charts by genre.
```{r message=FALSE, warning=FALSE}
music_data <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/music_data_at.csv",
sep = ",",
header = TRUE)
music_data$release_date <- as.Date(music_data$release_date) #convert to date
music_data$explicit <- factor(music_data$explicit, levels = 0:1, labels = c("not explicit", "explicit")) #convert to factor
music_data$label <- as.factor(music_data$label) #convert to factor
music_data$rep_ctry <- as.factor(music_data$rep_ctry) #convert to factor
music_data$genre <- as.factor(music_data$genre) #convert to factor
prop.table(table(music_data[,c("genre")])) #relative frequencies
music_data <- music_data[!is.na(music_data$valence) & !is.na(music_data$duration_ms),] # exclude cases with missing values
```
How can we plot this kind of data? Since we have a categorical variable, we will use a bar plot. However, to be able to use the table for your plot, you first need to assign it to an object as a data frame using the ```as.data.frame()```-function.
```{r message=FALSE, warning=FALSE}
table_plot_rel <- as.data.frame(prop.table(table(music_data[,c("genre")]))) #relative frequencies #relative frequencies
head(table_plot_rel)
```
Since ```Var1``` is not a very descriptive name, let's rename the variable to something more meaningful
```{r message=FALSE, warning=FALSE}
library(plyr)
table_plot_rel <- plyr::rename(table_plot_rel, c(Var1="Genre"))
head(table_plot_rel)
```
Once we have our data set we can begin constructing the plot. As mentioned previously, we start with the ```ggplot()``` function, with the argument specifying the data set to be used. Within the function, we further specify the scales to be used using the aesthetics argument, specifying which variable should be plotted on which axis. In our example, we would like to plot the categories on the x-axis (horizontal axis) and the relative frequencies on the y-axis (vertical axis).
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (step 1)"}
library(ggplot2)
bar_chart <- ggplot(table_plot_rel, aes(x = Genre,y = Freq))
bar_chart
```
You can see that the coordinate system is empty. This is because so far, we have told R only which variables we would like to plot but we haven't specified which geometric figures (points, bars, lines, etc.) we would like to use. This is done using the ```geom_xxx()``` function. ggplot includes many different geoms, for a wide range of plots (e.g., geom_line, geom_histogram, geom_boxplot, etc.). A good overview of the various geom functions can be found <a href="https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf" target="_blank">here</a>. In our case, we would like to use a bar chart for which ```geom_col``` is appropriate.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (step 2)"}
bar_chart + geom_col()
```
Now we have specified the data, the scales and the shape. Specifying this information is essential for plotting data using ggplot. Everything that follows now just serves the purpose of making the plot look nicer by modifying the appearance of the plot. How about some more meaningful axis labels? We can specify the axis labels using the ```ylab()``` and ```xlab()``` functions:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (step 3)"}
bar_chart + geom_col() +
ylab("Relative frequency") +
xlab("Genre")
```
How about adding some value labels to the bars? This can be done using ```geom_text()```. Note that the ```sprintf()``` function is not mandatory and is only added to format the numeric labels here. The function takes two arguments: the first specifies the format wrapped in two ```%``` signs. Thus, ```%.0f``` means to format the value as a fixed point value with no digits after the decimal point, and ```%%``` is a literal that prints a "%" sign. The second argument is simply the numeric value to be used. In this case, the relative frequencies multiplied by 100 to obtain the percentage values. Using the ```vjust = ``` argument, we can adjust the vertical alignment of the label. In this case, we would like to display the label slightly above the bars.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (step 4)"}
bar_chart + geom_col() +
ylab("Relative frequency") +
xlab("Genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2)
```
We could go ahead and specify the appearance of every single element of the plot now. However, there are also pre-specified themes that include various formatting steps in one singe function. For example ```theme_bw()``` would make the plot appear like this:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (step 5)"}
bar_chart + geom_col() +
ylab("Relative frequency") +
xlab("Genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2) +
theme_bw()
```
and ```theme_minimal()``` looks like this:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (options 1)"}
bar_chart + geom_col() +
ylab("Relative frequency") +
xlab("Genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2) +
theme_minimal()
```
In a next step, let's prevent the axis labels from overlapping by rotating the labels.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (options 1)"}
bar_chart + geom_col() +
ylab("Relative frequency") +
xlab("Genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2) +
theme_minimal() +
theme(axis.text.x = element_text(angle=45,vjust=0.75))
```
We could also add a title and combine all labels using the `labs` function.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (options 1)"}
bar_chart + geom_col() +
labs(x = "Genre", y = "Relative frequency", title = "Chart songs by genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2) +
theme_minimal() +
theme(axis.text.x = element_text(angle=45,vjust=0.75),
plot.title = element_text(hjust = 0.5,color = "#666666")
)
```
We could also add some color to the bars using `scale_fill_brewer`, which comes with a range of <a href="http://applied-r.com/rcolorbrewer-palettes/" target="_blank">color palettes</a>.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (options 1)"}
bar_chart + geom_col(aes(fill = Genre)) +
labs(x = "Genre", y = "Relative frequency", title = "Chart share by genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2) +
theme_minimal() +
ylim(0,0.5) +
scale_fill_brewer(palette = "Blues") +
theme(axis.text.x = element_text(angle=45,vjust=0.75),
plot.title = element_text(hjust = 0.5,color = "#666666"),
legend.title = element_blank()
)
```
These were examples of built-in formatting options of ```ggolot()```, where the default is ```theme_classic()```. For even more options, check out the ```ggthemes``` package, which includes formats for specific publications. You can check out the different themes <a href="https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html" target="_blank">here</a>. For example ```theme_economist()``` uses the formatting of the journal "The Economist":
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Bar chart (options 2)"}
library(ggthemes)
bar_chart + geom_col() +
labs(x = "Genre", y = "Relative frequency", title = "Chart songs by genre") +
geom_text(aes(label = sprintf("%.0f%%", Freq/sum(Freq) * 100)), vjust=-0.2) +
theme_economist() +
ylim(0,0.5) +
theme(axis.text.x = element_text(angle=45,vjust=0.55),
plot.title = element_text(hjust = 0.5,color = "#666666")
)
```
::: {.infobox_orange .hint data-latex="{hint}"}
There are various similar packages with pre-specified themes, like the <a href="https://github.com/cttobin/ggthemr" target="_blank">`ggthemr`</a> package, the <a href="https://github.com/ricardo-bion/ggtech" target="_blank">`ggtech`</a> package, the <a href="https://github.com/johnmackintosh/rockthemes" target="_blank">`rockthemes`</a> package, or the <a href="https://github.com/Ryo-N7/tvthemes" target="_blank">`tvthemes`</a> package.
:::
In a next step, we might want to investigate whether the relative frequencies differ between songs with explicit and song without explicit lyrics. For this purpose we first construct the conditional relative frequency table from the previous chapter again. Recall that the latter gives us the relative frequency within a group (in our case genres), as compared to the relative frequency within the entire sample.
```{r}
table_plot_cond_rel <- as.data.frame(prop.table(table(music_data[,c("genre", "explicit")]),2)) #conditional relative frequencies
table_plot_cond_rel
```
We can now take these tables to construct plots grouped by explicitness. To achieve this we simply need to add the `facet_wrap()` function, which replicates a plot multiple times, split by a specified grouping factor. Note that the grouping factor has to be supplied in R’s formula notation, hence it is preceded by a “~” symbol.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Grouped bar chart (facet_wrap)"}
ggplot(table_plot_cond_rel, aes(x = genre, y = Freq)) +
geom_col(aes(fill = genre)) +
facet_wrap(~explicit) +
labs(x = "", y = "Relative frequency", title = "Distribution of genres for explicit and non-explicit songs") +
geom_text(aes(label = sprintf("%.0f%%", Freq * 100)), vjust=-0.2) +
theme_minimal() +
ylim(0,1) +
scale_fill_brewer(palette = "Blues") +
theme(axis.text.x = element_text(angle=45,vjust=1.1,hjust=1),
plot.title = element_text(hjust = 0.5,color = "#666666"),
legend.position = "none"
)
```
Alternatively, we might be interested to investigate the relative frequencies of explicit and non-explicit lyrics for each genre. To achieve this, we can also use the fill argument, which tells ggplot to fill the bars by a specified variable (in our case “explicit”). The position = "dodge" argument causes the bars to be displayed next to each other (as opposed to stacked on top of one another).
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Grouped bar chart (fill)"}
table_plot_cond_rel_1 <- as.data.frame(prop.table(table(music_data[,c("genre", "explicit")]),1)) #conditional relative frequencies
ggplot(table_plot_cond_rel_1, aes(x = genre, y = Freq, fill = explicit)) + #use "fill" argument for different colors
geom_col(position = "dodge") + #use "dodge" to display bars next to each other (instead of stacked on top)
geom_text(aes(label = sprintf("%.0f%%", Freq * 100)),position=position_dodge(width=0.9), vjust=-0.25) +
scale_fill_brewer(palette = "Blues") +
labs(x = "Genre", y = "Relative frequency", title = "Explicit lyrics share by genre") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45,vjust=1.1,hjust=1),
plot.title = element_text(hjust = 0.5,color = "#666666"),
legend.position = "none"
)
```
#### Pie chart
We could also visualize the same information using a pie chart.
```{r}
ggplot(subset(table_plot_rel,Freq > 0), aes(x="", y=Freq, fill=Genre)) + # Create a basic bar
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) + #Convert to pie (polar coordinates)
geom_text(aes(label = paste0(round(Freq*100), "%")), position = position_stack(vjust = 0.5)) + #add labels
scale_fill_brewer(palette = "Blues") +
labs(x = NULL, y = NULL, fill = NULL, title = "Spotify tracks by Genre") + #remove labels and add title
theme_minimal() +
theme(axis.line = element_blank(), # Tidy up the theme
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5, color = "#666666"))
```
#### Covariation plots
To visualize the co-variation between categorical variables, you’ll need to count the number of observations for each combination stored in the frequency table. Say, we wanted to investigate the association between the popularity of a song and the level of 'speechiness'. For this exercise, let's assume we have both variables measured as categorical (factor) variables. We can use the `quantcut()` function to create categorical variables based on the continuous variables. All we need to do is tell the function how many categories we would like to obtain and it will divide the data based on the percentiles equally.
```{r}
library(gtools)
music_data$streams_cat <- as.numeric(quantcut(music_data$streams, 5, na.rm=TRUE))
music_data$speech_cat <- as.numeric(quantcut(music_data$speechiness, 3, na.rm=TRUE))
music_data$streams_cat <- factor(music_data$streams_cat, levels = 1:5, labels = c("low", "low-med", "medium", "med-high", "high")) #convert to factor
music_data$speech_cat <- factor(music_data$speech_cat, levels = 1:3, labels = c("low", "medium", "high")) #convert to factor
```
Now we have multiple ways to visualize a relationship between the two variables with ggplot. One option would be to use a variation of the scatterplot which counts how many points overlap at any given point and increases the dot size accordingly. This can be achieved with ```geom_count()``` as the example below shows where the `stat(prop)` argument assures that we get relative frequencies and with the `group` argument we tell R to compute the relative frequencies by speechiness.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Covariation between categorical data (1)"}
ggplot(data = music_data) +
geom_count(aes(x = speech_cat, y = streams_cat, size = stat(prop), group = speech_cat)) +
ylab("Popularity") +
xlab("Speechiness") +
labs(size = "Proportion") +
theme_bw()
```
The plot shows that there appears to be a positive association between the popularity of a song and its level of speechiness.
Another option would be to use a tile plot that changes the color of the tile based on the frequency of the combination of factors. To achieve this, we first have to create a dataframe that contains the relative frequencies of all combinations of factors. Then we can take this dataframe and pass it to ```geom_tile()```, while specifying that the fill of each tile should be dependent on the observed frequency of the factor combination, which is done by specifying the fill in the ```aes()``` function.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Covariation between categorical data (2)"}
table_plot_rel <- prop.table(table(music_data[,c("speech_cat", "streams_cat")]),1)
table_plot_rel <- as.data.frame(table_plot_rel)
ggplot(table_plot_rel, aes(x = speech_cat, y = streams_cat)) +
geom_tile(aes(fill = Freq)) +
ylab("Populartiy") +
xlab("Speechiness") +
theme_bw()
```
### Continuous variables
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/1ttdoi_QMAw" frameborder="0" allowfullscreen></iframe>
</div>
<br>
#### Histogram
Histograms can be created for continuous data using the ```geom_histogram()``` function. Note that the ```aes()``` function only needs one argument here, since a histogram is a plot of the distribution of only one variable. As an example, let's consider our data set containing the music data:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Histogram"}
head(music_data)
```
Now we can create the histogram using ```geom_histogram()```. The argument ```binwidth``` specifies the range that each bar spans, ```col = "black"``` specifies the border to be black and ```fill = "darkblue"``` sets the inner color of the bars to dark blue. For brevity, we have now also started naming the x and y axis with the single function ```labs()```, instead of using the two distinct functions ```xlab()``` and ```ylab()```.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Histogram"}
ggplot(music_data,aes(streams)) +
geom_histogram(binwidth = 4000, col = "black", fill = "darkblue") +
labs(x = "Number of streams", y = "Frequency", title = "Distribution of streams") +
theme_bw()
```
If you would like to highlight certain points of the distribution, you can use the `geom_vline` (short for vertical line) function to do this. For example, we may want to highlight the mean and the median of the distribution.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Histogram 2"}
ggplot(music_data,aes(streams)) +
geom_histogram(binwidth = 4000, col = "black", fill = "darkblue") +
labs(x = "Number of streams", y = "Frequency", title = "Distribution of streams", subtitle = "Red vertical line = mean, green vertical line = median") +
geom_vline(xintercept = mean(music_data$streams), color = "red", size = 1) +
geom_vline(xintercept = median(music_data$streams), color = "green", size = 1) +
theme_bw()
```
In this case, you should indicate what the lines refer to. In the plot above, the 'subtitle' argument was used to add this information to the plot.
::: {.infobox_orange .hint data-latex="{hint}"}
Note the difference between a bar chart and the histogram. While a bar chart is used to visualize the frequency of observations for categorical variables, the histogram shows the frequency distribution for continuous variables.
:::
#### Boxplot
Another common way to display the distribution of continuous variables is through boxplots. ggplot will construct a boxplot if given the geom ```geom_boxplot()```. In our case we might want to show the difference in streams between the genres. For this analysis, we will transform the streaming variable using a logarithmic transformation, which is common with such data (as we will see later). So let's first create a new variable by taking the logarithm of the streams variable.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Boxplot by group"}
music_data$log_streams <- log(music_data$streams)
```
Now, let's create a boxplot based on these variables and plot the log-transformed number of streams by genre.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Boxplot by group"}
ggplot(music_data,aes(x = genre, y = log_streams, fill = genre)) +
geom_boxplot(coef = 3) +
labs(x = "Genre", y = "Number of streams (log-scale)") +
theme_minimal() +
scale_fill_brewer(palette = "Blues") +
theme(axis.text.x = element_text(angle=45,vjust=1.1,hjust=1),
plot.title = element_text(hjust = 0.5,color = "#666666"),
legend.position = "none"
)
```
The following graphic shows you how to interpret the boxplot:

Note that you could also flip the boxplot. To do this, you only need to exchange the x- and y-variables. If we provide the categorical variable to the y-axis as follows, the axis will be flipped.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Boxplot by group"}
ggplot(music_data,aes(x = log_streams, y = genre, fill = genre)) +
geom_boxplot(coef = 3) +
labs(x = "Number of streams (log-scale)", y = "Genre") +
theme_minimal() +
scale_fill_brewer(palette = "Blues") +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"),
legend.position = "none"
)
```
It is often meaningful to augment the boxplot with the data points using ```geom_jitter()```. This way, differences in the distribution of the variable between the genres become even more apparent.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Boxplot by group"}
ggplot(music_data,aes(x = log_streams , y = genre)) +
geom_boxplot(coef = 3) +
labs(x = "Number of streams (log-scale)", y = "Genre") +
theme_minimal() +
geom_jitter(colour="red", alpha = 0.1)
```
In case you would like to create the boxplot on the total data (i.e., not by group), just leave the ```x = ``` argument within the ```aes()``` function empty:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Single Boxplot"}
ggplot(music_data,aes(x = log_streams, y = "")) +
geom_boxplot(coef = 3,width=0.3) +
labs(x = "Number of streams (log-scale)", y = "")
```
#### Plot of means
Another way to get an overview of the difference between two groups is to plot their respective means with confidence intervals. The mean and confidence intervals will enter the plot separately, using the geoms ```geom_bar()``` and ```geom_errorbar()```. Don't worry if you don't know exactly how to interpret the confidence interval at this stage - we will cover this topic in the next chapter. Let's assume we would like to plot the difference in streams between the HipHop & Rap genre and all other genres. For this, we first need to create a dummy variable (i.e., a categorical variable with two levels) that indicates if a song is from the HipHop & Rap genre or from any of the other genres. We can use the `ifelse()` function to do this, which takes 3 arguments, namely 1) the if-statement, 2) the returned value if this if-statement is true, and 3) the value if the if-statement is not true.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
music_data$genre_dummy <- as.factor(ifelse(music_data$genre=="HipHop & Rap","HipHop & Rap","other"))
```
To make plotting the desired comparison easier, we can compute all relevant statistics first, using the ```summarySE()``` function from the `Rmisc` package.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
library(Rmisc)
mean_data <- summarySE(music_data, measurevar="streams", groupvars=c("genre_dummy"))
mean_data
```
The output tells you how many observations there are per group, the mean number of streams per group, as well as the group-specific standard deviation, the standard error, and the confidence interval (more on this in the next chapter). You can now create the plot as follows:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Plot of means"}
ggplot(mean_data,aes(x = genre_dummy, y = streams)) +
geom_bar(position=position_dodge(.9), colour="black", fill = "#CCCCCC", stat="identity", width = 0.65) +
geom_errorbar(position=position_dodge(.9), width=.15, aes(ymin=streams-ci, ymax=streams+ci)) +
theme_bw() +
labs(x = "Genre", y = "Average number of streams", title = "Average number of streams by genre")+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
As can be seen, there is a large difference between the genres with respect to the average number of streams.
#### Grouped plot of means
We might also be interested to investigate a second factor. Say, we would like to find out if there is a difference between genres with respect to the lyrics (i.e., whether the lyrics are explicit or not). Can we find evidence that explicit lyrics affect streams of songs from the HipHop & Rap genre differently compared to other genres. We can compute the statistics using the ```summarySE()``` function by simply adding the second variable to the 'groupvars' argument.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
mean_data2 <- summarySE(music_data, measurevar="streams", groupvars=c("genre_dummy","explicit"))
mean_data2
```
Now we obtained the results for four different groups (2 genres x 2 lyric types) and we can amend the plot easily by adding the 'fill' argument to the ```ggplot()``` function. The ```scale_fill_manual()``` function is optional and specifies the color of the bars manually.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Grouped plot of means"}
ggplot(mean_data2,aes(x = genre_dummy, y = streams, fill = explicit)) +
geom_bar(position=position_dodge(.9), colour="black", stat="identity") +
geom_errorbar(position=position_dodge(.9), width=.2, aes(ymin=streams-ci, ymax=streams+ci)) +
scale_fill_manual(values=c("#CCCCCC","#FFFFFF")) +
theme_bw() +
labs(x = "Genre", y = "Average number of streams", title = "Average number of streams by genre and lyric type")+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
The results of the analysis show that also in the HipHop & Rap genre, songs with non-explicit lyrics obtain more streams on average, contrary to our expectations.
#### Scatter plot
The most common way to show the relationship between two continuous variables is a scatterplot. As an example, let's investigate if there is an association between the number of streams a song receives and the speechiness of the song. The following code creates a scatterplot with some additional components. The ```geom_smooth()``` function creates a smoothed line from the data provided. In this particular example we tell the function to draw the best possible straight line (i.e., minimizing the distance between the line and the points) through the data (via the argument ```method = "lm"```). Note that the "shape = 1" argument passed to the ```geom_point()``` function produces hollow circles (instead of solid) and the "fill" and "alpha" arguments passed to the ```geom_smooth()``` function specify the color and the opacity of the confidence interval, respectively.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Scatter plot"}
ggplot(music_data, aes(speechiness, log_streams)) +
geom_point(shape =1) +
labs(x = "Genre", y = "Relative frequency") +
geom_smooth(method = "lm", fill = "blue", alpha = 0.1) +
labs(x = "Speechiness", y = "Number of streams (log-scale)", title = "Scatterplot of streams and speechiness") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
As you can see, there appears to be a positive relationship between advertising and sales.
##### Grouped scatter plot
It could be that customers from different store respond differently to advertising. We can visually capture such differences with a grouped scatter plot. By adding the argument ```colour = store``` to the aesthetic specification, ggplot automatically treats the two stores as distinct groups and plots accordingly.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Grouped scatter plot"}
ggplot(music_data, aes(speechiness, log_streams, colour = explicit)) +
geom_point(shape =1) +
geom_smooth(method="lm", alpha = 0.1) +
labs(x = "Speechiness", y = "Number of streams (log-scale)", title = "Scatterplot of streams and speechiness by lyric type", colour="Explicit") +
scale_color_manual(values=c("lightblue","darkblue")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
It appears from the plot that the association between speechiness and the number of streams is stronger for songs without explicit lyrics.
#### Line plot
Another important type of plot is the line plot used if, for example, you have a variable that changes over time and you want to plot how it develops over time. To demonstrate this we will investigate a data set that show the development of the number of streams of the Top200 songs on a popular music streaming service for different region. Let's investigate the data first and bring all variables to the correct format.
```{r message=FALSE, warning=FALSE}
music_data_ctry <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/streaming_charts_ctry.csv",
sep = ",",
header = TRUE)
music_data_ctry$week <- as.Date(music_data_ctry$week)
music_data_ctry$region <- as.factor(music_data_ctry$region)
head(music_data_ctry)
```
In a first step, let's investigate the development for Austria, by subsetting the data to region 'at'.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE}
music_data_at <- subset(music_data_ctry, region == 'at')
```
Given the correct ```aes()``` and geom specification ggplot constructs the correct plot for us.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Line plot"}
ggplot(music_data_at, aes(x = week, y = streams)) +
geom_line() +
labs(x = "Week", y = "Total streams in Austria", title = "Weekly number of streams in Austria") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
There appears to be a positive trend in the market. Now let's compare Austria to other countries. Note that the ```%in%``` operator checks for us if any of the region names specified in the vector are included in the region column.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE}
music_data_at_compare <- subset(music_data_ctry, region %in% c('at','de','ch','se','dk','nl'))
```
We can now include the other specified countries in the plot by using the 'color' argument.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Line plot (by region)"}
ggplot(music_data_at_compare, aes(x = week, y = streams, color = region)) +
geom_line() +
labs(x = "Week", y = "Total streams", title = "Weekly number of streams by country") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
One issue in plot like this can be that the scales between countries is very different (i.e., in Germany there are many more streams because Germany is larger than the other countries). You could also use the ```facet_wrap()``` function to create one individual plot by region and specify 'scales = "free_y"' so that each individual plot has its own scale on the y-axis. We should also indicate the number of streams in millions by dividing the number of streams.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Line plot (facet wrap)"}
ggplot(music_data_at_compare, aes(x = week, y = streams/1000000)) +
geom_line() +
facet_wrap(~region, scales = "free_y") +
labs(x = "Week", y = "Total streams (in million)", title = "Weekly number of streams by country") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
Now it's easier to see that the trends are different between countries. While Sweden and Denmark appear to be saturated, the other market show strong growth.
#### Area plots
A slightly different why to plot this data is through area plot using the ```geom_area()``` function.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Line plot (facet wrap)"}
ggplot(music_data_at_compare, aes(x = week, y = streams/1000000)) +
geom_area(fill = "steelblue", color = "steelblue", alpha = 0.5) +
facet_wrap(~region, scales = "free_y") +
labs(x = "Week", y = "Total streams (in million)", title = "Weekly number of streams by country") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
If the relative share of the overall streaming volume is of interest, you could use a stacked area plot to visualize this.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Line plot (facet wrap)"}
ggplot(music_data_at_compare, aes(x = week, y = streams/1000000,group=region,fill=region,color=region)) +
geom_area(position="stack",alpha = 0.65) +
labs(x = "Week", y = "Total streams (in million)", title = "Weekly number of streams by country") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
In this type of plot it is easy to spot the relative size of the regions.
In some cases it could also make sense to add a secondary y-axis, for example, if you would like to compare two regions with very different scales in one plot. Let's assume, we would like to compare Austria and Sweden and take the corresponding subset.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE}
music_data_at_se_compare <- subset(music_data_ctry, region %in% c('at','se'))
```
In order to add the secondary y-axis, we need the data in a slightly different format where we have one column for each country. This is called the 'wide format' as opposed to the 'long format' where the data is stacked on top of each other for all regions. We can easily convert the data to the wide format by using the ```spread()``` function from the `tidyr` package.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE}
library(tidyr)
data_wide <- spread(music_data_at_se_compare, region, streams)
data_wide
```
As another intermediate step, we need to compute the ratio between the two variables we would like to plot on the two axis, since the scale of the second axis is determined based on the ratio with the other variable.
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE}
ratio <- mean(data_wide$at/1000000)/mean(data_wide$se/1000000)
```
Now we can create the plot with the secondary y-axis as follows:
```{r message=FALSE, warning=FALSE,eval=TRUE,fig.align="center", fig.cap = "Secondary y-axis"}
ggplot(data_wide) +
geom_area(aes(x = week, y = at/1000000, colour = "Austria", fill = "Austria"), alpha = 0.5) +
geom_area(aes(x = week, y = (se/1000000)*ratio, colour = "Sweden", fill = "Sweden"), alpha = 0.5) +
scale_y_continuous(sec.axis = sec_axis(~./ratio, name = "Total streams SE (in million)")) +
scale_fill_manual(values = c("#999999", "#E69F00")) +
scale_colour_manual(values = c("#999999", "#E69F00"),guide=FALSE) +
theme_minimal() +
labs(x = "Week", y = "Total streams AT (in million)", title = "Weekly number of streams by country") +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"),
legend.title = element_blank(),
legend.position = "bottom"
)
```
In this plot it is easy to see the difference in trends between the countries.
### Saving plots
To save the last displayed plot, simply use the function ```ggsave()```, and it will save the plot to your working directory. Use the arguments ```height```and ```width``` to specify the size of the file. You may also choose the file format by adjusting the ending of the file name. E.g., ```file_name.jpg``` will create a file in JPG-format, whereas ```file_name.png``` saves the file in PNG-format, etc..
```{r eval=FALSE}
ggsave("test_plot.jpg", height = 5, width = 8.5)
```
### ggplot extensions
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/X8zGovLeCrk" frameborder="0" allowfullscreen></iframe>
</div>
<br>
As the ggplot2 package became more and more popular over the past years, more and more extensions have been developed by users that can be used for specific purposes that are not yet covered by the standard functionality of ggplot2. You can find a list of the extensions <a href="https://exts.ggplot2.tidyverse.org/gallery/" target="_blank">here</a>. Below, you can find some example for the additional options.
#### Results of statistical tests (ggstatsplot)
You may use the <a href="https://indrajeetpatil.github.io/ggstatsplot/index.html" target="_blank">ggstatplot</a> package to augment your plots with the results from statistical tests, such as an ANOVA. You can find a presentation about the capabilities of this package <a href="https://indrajeetpatil.github.io/ggstatsplot_slides/slides/ggstatsplot_presentation.html#1" target="_blank">here</a>. The boxplot below shows an example.
```{r message=FALSE, warning=FALSE, echo=TRUE, fig.height=6, eval=TRUE, fig.align="center", fig.cap = "Boxplot using ggstatsplot package"}
library(ggstatsplot)
music_data_subs <- subset(music_data, genre %in% c("HipHop & Rap", "Soundtrack","Pop","Rock"))
ggbetweenstats(
data = music_data_subs,
title = "Number of streams by genre", # title for the plot
plot.type = "box",
x = genre, # 2 groups
y = log_streams,
type = "p", # default
messages = FALSE,
bf.message = FALSE,
pairwise.comparisons = TRUE # display results from pairwise comparisons
)
```
##### Combination of plots (ggExtra)
Using the ```ggExtra()``` package, you may combine two type of plots. For example, the following plot combines a scatterplot with a histogram:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Scatter plot with histogram"}
library(ggExtra)
p <- ggplot(music_data, aes(x = speechiness, y = log_streams)) +
geom_point() +
labs(x = "Speechiness", y = "Number of streams (log-scale)", title = "Scatterplot & histograms of streams and speechiness") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
ggExtra::ggMarginal(p, type = "histogram")
```
In this case, the ```type = "histogram"``` argument specifies that we would like to plot a histogram. However, you could also opt for ```type = "boxplot"``` or ```type = "density"``` to use a boxplot or density plot instead.
#### Location data (ggmap)
Now that we have covered the most important plots, we can look at what other type of data you may come across. One type of data that is increasingly available is the geo-location of customers and users (e.g., from app usage data). The following data set contains the app usage data of Shazam users from Germany. The data contains the latitude and longitude information where a music track was "shazamed".
```{r message=FALSE, warning=FALSE,echo=TRUE, eval=TRUE}
library(ggmap)
library(dplyr)
geo_data <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/geo_data.dat",
sep = "\t",
header = TRUE)
head(geo_data)
```
There is a package called "ggmap", which is an augmentation for the ggplot packages. It lets you load maps from different web services (e.g., Google maps) and maps the user location within the coordination system of ggplot. With this information, you can create interesting plots like heat maps. We won't go into detail here but you may go through the following code on your own if you are interested. However, please note that you need to register an API with Google in order to make use of this package.
```{r ggmaps, echo=TRUE, eval=TRUE,message=FALSE, warning=FALSE}
#register_google(key = "your_api_key")
# Download the base map
de_map_g_str <- get_map(location=c(10.018343,51.133481), zoom=6, scale=2) # results in below map (wohoo!)
# Draw the heat map
ggmap(de_map_g_str, extent = "device") +
geom_density2d(data = geo_data, aes(x = lon, y = lat), size = 0.3) +
stat_density2d(data = geo_data, aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 16, geom = "polygon") +
scale_fill_gradient(low = "green", high = "red") +
scale_alpha(range = c(0, 0.3), guide = FALSE)
```
## Learning check {-}
**(LC4.1) For which data types is it meaningful to compute the mean?**
- [ ] Nominal
- [ ] Ordinal
- [ ] Interval
- [ ] Ratio
**(LC4.2) How can you compute the standardized variate of a variable X?**
- [ ] $Z=\frac{X_i-\bar{X}}{s}$
- [ ] $Z=\frac{\bar{X}+X_i}{s}$
- [ ] $Z=\frac{s}{\bar{X}+X_i}$
- [ ] $Z=s*({\bar{X}+X_i)}$
- [ ] None of the above
**You wish to analyze the following data frame 'df' containing information about cars**
```{r,echo=FALSE}
head(mtcars,6)
```
**(LC4.3) How could you add a new variable containing the z-scores of the variable 'mpg' in R?**
- [ ] `df$mpg_std <- zscore(df$mpg)`
- [ ] `df$mpg_std <- stdv(df$mpg)`
- [ ] `df$mpg_std <- std.scale(df$mpg)`
- [ ] `df$mpg_std <- scale(df$mpg)`
- [ ] None of the above
**(LC4.4) How could you produce the below output?**
```{r,echo=FALSE,message=FALSE, warning=FALSE, error=FALSE}
library(psych)
as.data.frame(psych::describe(mtcars[,c("hp","mpg","qsec")]))
```
- [ ] `describe(mtcars[,c("hp","mpg","qsec")])`
- [ ] `summary(mtcars[,c("hp","mpg","qsec")])`
- [ ] `table(mtcars[,c("hp","mpg","qsec")])`
- [ ] `str(mtcars[,c("hp","mpg","qsec")])`
- [ ] None of the above
**(LC4.5) The last column "carb" indicates the number of carburetors each model has. By using a function we got to know the number of car models that have a certain number carburetors. Which function helped us to obtain this information?**
```{r,echo=FALSE}
table(mtcars$carb)
```
- [ ] `describe(mtcars$carb)`
- [ ] `table(mtcars$carb)`
- [ ] `str(mtcars$carb)`
- [ ] `prop.table(mtcars$carb)`
- [ ] None of the above
**(LC4.6) What type of data can be meaningfully depicted in a scatter plot?**
- [ ] Two categorical variables
- [ ] One categorical and one continuous variable
- [ ] Two continuous variables
- [ ] One continuous variable
- [ ] None of the above
**(LC4.7) Which statement about the graph below is true?**
```{r,echo=FALSE}
hist(mtcars$mpg,xlab="miles per gallon", main="miles per gallon")
```
- [ ] This is a bar chart
- [ ] This is a histogram
- [ ] It shows the frequency distribution of a continuous variable
- [ ] It shows the frequency distribution of a categorical variable
- [ ] None of the above
**(LC4.8) Which statement about the graph below is true?**
```{r, echo=FALSE,strip.white=TRUE, out.width="50%"}
boxplot(mtcars$mpg, outline = T, notch = F)
```
- [ ] This is a bar chart
- [ ] 50% of observations are contained in the gray area
- [ ] The horizontal black line indicates the mean
- [ ] This is a boxplot
- [ ] None of the above
**(LC4.9) Which function can help you to save a graph made with `ggplot()`?**
- [ ] `ggsave()`
- [ ] `write.plot()`
- [ ] `save.plot()`
- [ ] `export.plot()`
**(LC4.10) For a variable that follows a normal distribution, within how many standard deviations of the mean are 95% of values?**
- [ ] 1.645
- [ ] 1.960
- [ ] 2.580
- [ ] 3.210
- [ ] None of the above
## References {-}
* Field, A., Miles J., & Field, Z. (2012). Discovering Statistics Using R. Sage Publications.
* Chang, W. (2020). R Graphics Cookbook, 2nd edition (https://r-graphics.org/)
* Grolemund, G. & Wickham, H. (2020). R for Data Science (https://r4ds.had.co.nz/)