-
Notifications
You must be signed in to change notification settings - Fork 28
/
Copy path10_Boxplots_pdf.Rmd
387 lines (319 loc) · 18.7 KB
/
10_Boxplots_pdf.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
---
title: "Creating plots in R using ggplot2 - part 10: boxplots"
author:
- Jodie Burchell
- Mauricio Vargas Sepúlveda
date: "`r Sys.Date()`"
output:
pdf_document:
keep_tex: yes
toc: yes
---
```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
knitr::opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 96, fig.width=6, fig.height=5, tidy = F, cache.path = '10_Boxplots_cache_pdf/', fig.path = '10_Boxplots_pdf/', dev = 'quartz_pdf', dev.args=list(pointsize=12))
```
This is the tenth tutorial in a series on using `ggplot2` I am creating with [Mauricio Vargas Sepúlveda](http://pachamaltese.github.io/). In this tutorial we will demonstrate some of the many options the `ggplot2` package has for creating and customising boxplots. We will use R's [airquality dataset](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html) in the `datasets` package.
In this tutorial, we will work towards creating the boxplot below. We will take you from a basic boxplot and explain all the customisations we add to the code step-by-step.
```{r box_final, echo = FALSE, cache = TRUE}
library(datasets)
library(ggplot2)
library(ggthemes)
library(grid)
library(RColorBrewer)
data(airquality)
airquality$Month <- factor(airquality$Month,
labels = c("May", "Jun", "Jul", "Aug", "Sep"))
airquality_trimmed2 <- airquality[which(airquality$Month == "Jul" |
airquality$Month == "Aug" |
airquality$Month == "Sep"), ]
airquality_trimmed2$Temp.f <- factor(ifelse(airquality_trimmed2$Temp > mean(airquality_trimmed2$Temp), 1, 0),
labels = c("Low temp ", "High temp "))
p10 <- ggplot(airquality_trimmed2, aes(x = Month, y = Ozone, fill = Temp.f)) +
geom_boxplot(alpha=0.7) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_bw() +
theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
panel.border = element_rect(colour = "black", fill=NA, size=.5),
text = element_text(size = 12, family = "Tahoma"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(size = 11),
legend.position = "bottom") +
scale_fill_brewer(palette = "Accent") +
labs(fill = "Temperature ")
p10
```
The first thing to do is load in the data and the libraries, as below. We'll convert `Month` into a labelled factor in order to use it as our grouping variable.
```{r load_in_data, cache = TRUE}
library(datasets)
library(ggplot2)
library(ggthemes)
library(grid)
library(RColorBrewer)
data(airquality)
airquality$Month <- factor(airquality$Month,
labels = c("May", "Jun", "Jul", "Aug", "Sep"))
```
# Basic boxplot
In order to initialise a plot we tell ggplot that `airquality` is our data, and specify that our x-axis plots the `Month` variable and our y-axis plots the `Ozone` variable. We then instruct ggplot to render this as a boxplot by adding the `geom_boxplot()` option.
```{r box_1, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot()
p10
```
# Customising axis labels
In order to change the axis labels, we have a couple of options. In this case, we have used the `scale_x_discrete` and `scale_y_continuous` options, as these have further customisation options for the axes we will use below. In each, we add the desired name to the `name` argument as a string.
```{r box_2, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- p10 + scale_x_discrete(name = "Month") +
scale_y_continuous(name = "Mean ozone in parts per billion")
p10
```
ggplot also allows for the use of multiline names (in both axes and titles). Here, we've changed the y-axis label so that it goes over two lines using the `\n` character to break the line.
```{r box_3, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- p10 + scale_y_continuous(name = "Mean ozone in\nparts per billion")
p10
```
# Changing axis ticks
The next thing we will change is the axis ticks. Let's make the y-axis ticks appear at every 25 units rather than 50 using the `breaks = seq(0, 175, 25)` argument in `scale_y_continuous`. (The `seq` function is a base R function that indicates the start and endpoints and the units to increment by respectively. See `help(seq)` for more information.) We ensure that the y-axis begins and ends where we want by also adding the argument `limits = c(0, 175)` to `scale_y_continuous`.
```{r box_4, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- p10 + scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175))
p10
```
# Adding a title
To add a title, we include the option `ggtitle` and include the name of the graph as a string argument.
```{r box_5, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- p10 + ggtitle("Boxplot of mean ozone by month")
p10
```
# Changing the colour of the boxes
To change the line and fill colours of the box plot, we add a valid colour to the `colour` and `fill` arguments in `geom_boxplot()` (note that we assigned these colours to variables outside of the plot to make it easier to change them). A list of valid colours is [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).
```{r box_6, message = FALSE, warning = FALSE, cache = TRUE}
fill <- "gold1"; line <- "goldenrod2"
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month")
p10
```
If you want to go beyond the options in the list above, you can also specify exact HEX colours by including them as a string preceded by a hash, e.g., "#FFFFFF". Below, we have called two shades of blue for the fill and lines using their HEX codes.
```{r box_7, message = FALSE, warning = FALSE, cache = TRUE}
fill <- "#4271AE"; line <- "#1F3552"
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month")
p10
```
You can also specify the degree of transparency in the box fill area using the argument `alpha` in `geom_boxplot`. This ranges from 0 to 1.
```{r box_8, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line,
alpha = 0.7) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month")
p10
```
Finally, you can change the appearance of the outliers as well, using the arguments `outlier.colour` and `outlier.shape` in `geom_boxplot` to change the colour and shape respectively. An explanation of the allowed arguments for shape are described in [this article](http://sape.inf.usi.ch/quick-reference/ggplot2/shape), although be aware that because there is no "fill" argument for outlier, you cannot create circles with separate outline and fill colours. Here we will make the outliers small solid circles (using `outlier.shape = 20`) and make them the same colour as the box lines (using `outlier.colour = "#1F3552"`).
```{r box_9, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line, alpha = 0.7,
outlier.colour = "#1F3552", outlier.shape = 20) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month")
p10
```
# Using the white theme
As explained in the previous posts, we can also change the overall look of the plot using themes. We'll start using a simple theme customisation by adding `theme_bw() `. As you can see, we can further tweak the graph using the `theme` option, which we've used so far to change the legend.
```{r box_10, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- p10 + theme_bw()
p10
```
# Creating an XKCD style chart
Of course, you may want to create your own themes as well. `ggplot2` allows for a very high degree of customisation, including allowing you to use imported fonts. Below is an example of a theme Mauricio was able to create which mimics the visual style of [XKCD](http://xkcd.com/). In order to create this chart, you first need to import the XKCD font, and load it into R using the `extrafont` package.
```{r box_11, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(colour = "black", fill = "#56B4E9") +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme(axis.line.x = element_line(size=.5, colour = "black"),
axis.line.y = element_line(size=.5, colour = "black"),
axis.text.x=element_text(colour="black", size = 10),
axis.text.y=element_text(colour="black", size = 10),
legend.position="bottom",
legend.direction="horizontal",
legend.box = "horizontal",
legend.key = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
plot.title=element_text(family="xkcd-Regular"),
text=element_text(family="xkcd-Regular"))
p10
```
# Using 'The Economist' theme
There are a wider range of pre-built themes available as part of the `ggthemes` package (more information on these [here](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html)). Below we've applied `theme_economist()`, which approximates graphs in the Economist magazine.
```{r box_12, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_economist() + scale_fill_economist() +
theme(axis.line.x = element_line(size=.5, colour = "black"),
axis.title = element_text(size = 12),
legend.position="bottom",
legend.direction="horizontal",
legend.box = "horizontal",
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family="OfficinaSanITC-Book"))
p10
```
# Using 'Five Thirty Eight' theme
Below we've applied `theme_fivethirtyeight()`, which approximates graphs in the nice [FiveThirtyEight](http://fivethirtyeight.com/) website. Again, it is also important that the font change is optional and it's only to obtain a more similar result compared to the original. For an exact result you need 'Atlas Grotesk' and 'Decima Mono Pro' which are commercial font and are available [here](https://commercialtype.com/catalog/atlas) and [here](https://www.myfonts.com/fonts/tipografiaramis/decima-mono-pro/).
```{r box_13, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_fivethirtyeight() + scale_fill_fivethirtyeight() +
theme(axis.title = element_text(family="Atlas Grotesk Regular"),
legend.position="bottom",
legend.direction="horizontal",
legend.box = "horizontal",
legend.title=element_text(family="Atlas Grotesk Regular", size = 10),
legend.text=element_text(family="Atlas Grotesk Regular", size = 10),
plot.title=element_text(family="Atlas Grotesk Medium"),
text=element_text(family="DecimaMonoPro"))
p10
```
# Creating your own theme
As before, you can modify your plots a lot as `ggplot2` allows many customisations. Here is a custom plot where we have modified the axes, background and font.
```{r box_14, message = FALSE, warning = FALSE, cache = TRUE}
fill <- "#4271AE"; lines <- "#1F3552"
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(colour = lines, fill = fill,
size = 1) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_bw() +
theme(panel.border = element_rect(colour = "black", fill=NA, size=.5),
panel.grid.major = element_line(colour = "#d3d3d3"),
panel.grid.minor = element_blank(),
panel.border = element_blank(), panel.background = element_blank(),
plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
text=element_text(family = "Tahoma"),
axis.title = element_text(face="bold"),
axis.text.x = element_text(colour="black", size = 11),
axis.text.y = element_text(colour="black", size = 9))
p10
```
# Boxplot extras
An extra feature you can add to boxplots is to overlay all of the points for that group on each boxplot in order to get an idea of the sample size of the group. This can be achieved using by adding the `geom_jitter` option.
```{r box_15, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- p10 + geom_jitter()
p10
```
We can see that June has a pretty small sample, indicating that information based on this group may not be very reliable.
Another thing you can do with your boxplot is add a notch to the box where the median sits to give a clearer visual indication of how the data are distributed within the IQR. You achieve this by adding the argument `notch = TRUE` to the `geom_boxplot` option. You can see on our graph that the box for June looks a bit weird due to the very small gap between the 25th percentile and the median.
```{r box_16, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(colour = lines, fill = fill,
size = 1, notch = TRUE) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_bw() +
theme(panel.border = element_rect(colour = "black", fill=NA, size=.5),
panel.grid.major = element_line(colour = "#d3d3d3"),
panel.grid.minor = element_blank(),
panel.border = element_blank(), panel.background = element_blank(),
plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
text=element_text(family="Tahoma"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(colour="black", size = 11),
axis.text.y=element_text(colour="black", size = 9))
p10
```
# Grouping by another variable
You can also easily group box plots by the levels of another variable. There are two options, in separate (panel) plots, or in the same plot.
We first need to do a little data wrangling. In order to make the graphs a bit clearer, we've kept only months "July", "Aug" and "Sep" in a new dataset `airquality_trimmed`. We've also mean-split `Temp` so that this is also categorical, and made it into a new labelled factor variable called `Temp.f`.
In order to produce a panel plot by Temperature , we add the `facet_grid(. ~ Temp.f)` option to the plot.
```{r box_17, message = FALSE, warning = FALSE, cache = TRUE}
airquality_trimmed <- airquality[which(airquality$Month == "Jul" |
airquality$Month == "Aug" | airquality$Month == "Sep"), ]
airquality_trimmed$Temp.f <- factor(ifelse(airquality_trimmed$Temp > mean(airquality_trimmed$Temp), 1, 0),
labels = c("Low temp ", "High temp "))
p10 <- ggplot(airquality_trimmed, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = fill, colour = line,
alpha = 0.7) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 50), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_bw() +
theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
panel.border = element_rect(colour = "black", fill=NA, size=.5),
text = element_text(size = 12, family = "Tahoma"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(size = 11)) +
facet_grid(. ~ Temp.f)
p10
```
In order to plot the two Temperature levels in the same plot, we need to add a couple of things. Firstly, in the `ggplot` function, we add a `fill = Temp.f` argument to `aes`. Secondly, we customise the colours of the boxes by adding the `scale_fill_brewer` to the plot from the `RColorBrewer` package. [This](http://moderndata.plot.ly/create-colorful-graphs-in-r-with-rcolorbrewer-and-plotly/) blog post describes the available packages.
```{r box_18, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality_trimmed, aes(x = Month, y = Ozone, fill = Temp.f)) +
geom_boxplot(alpha=0.7) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_bw() +
theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
panel.border = element_rect(colour = "black", fill=NA, size=.5),
text = element_text(size = 12, family = "Tahoma"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(size = 11)) +
scale_fill_brewer(palette = "Accent")
p10
```
# Formatting the legend
Finally, we can format the legend. Firstly, we can change the position by adding the `legend.position = "bottom"` argument to the `theme` option, which moves the legend under the plot. Secondly, we can fix the title by adding the `labs(fill = "Temperature ")` option to the plot.
```{r box_19, message = FALSE, warning = FALSE, cache = TRUE}
p10 <- ggplot(airquality_trimmed, aes(x = Month, y = Ozone, fill = Temp.f)) +
geom_boxplot(alpha=0.7) +
scale_y_continuous(name = "Mean ozone in\nparts per billion",
breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") +
ggtitle("Boxplot of mean ozone by month") +
theme_bw() +
theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
panel.border = element_rect(colour = "black", fill=NA, size=.5),
text = element_text(size = 12, family = "Tahoma"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(size = 11),
legend.position = "bottom") +
scale_fill_brewer(palette = "Accent") +
labs(fill = "Temperature ")
p10
```