statistical_notes/ancova_example.Rmd at main · clayford/statistical_notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
title: "ANCOVA"
author: "Clay Ford"
date: "1/7/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

ANCOVA stands for Analysis of Covariance. It's simply linear regression with one numeric predictor and one categorical predictor. It can help us assess how the linear association between the numeric variable and our dependent variable changes between the groups.

## Load data

Read and prepare Prism data. I believe this data corresponds to Fig 2D.

```{r message=FALSE}
library(pzfx)
library(tidyverse)
H2SO4_d1 <- read_pzfx("21.09.21 Six Month H2SO4 Study.pzfx")

# reshape data to "long format"
d <- H2SO4_d1 %>%
  pivot_longer(cols = -ROWTITLE, names_to = "g", values_to = "hue") %>%
  rename(month = ROWTITLE) %>%
  mutate(g = factor(str_remove(g, "_[0-9]$")),
         month = as.numeric(str_remove(month, "Month ")))
head(d)
```

## Visualize data

Raw data.

```{r}
ggplot(d) +
  aes(x = month, y = hue, color = g) +
  geom_point(position = position_dodge(width = 0.4)) +
  scale_x_continuous(breaks = 1:6)
```

Raw data with smooth trend lines.

```{r message=FALSE}
ggplot(d) +
  aes(x = month, y = hue, color = g) +
  geom_point(position = position_dodge(width = 0.4)) +
  geom_smooth() +
  scale_x_continuous(breaks = 1:6)
```

Means over time +/- 1 SD.

```{r}
ggplot(d) +
  aes(x = month, y = hue, color = g, group = g) +
  stat_summary(fun.data = "mean_sdl",
               position = position_dodge(width = 0.4)) +
  scale_x_continuous(breaks = 1:6)

```

Often want to show +/- 1 _standard error_ (SE) (ie, $\frac{SD}{\sqrt n}$).

```{r}
ggplot(d) +
  aes(x = month, y = hue, color = g, group = g) +
  stat_summary(fun.data = "mean_se",
               position = position_dodge(width = 0.4)) +
  scale_x_continuous(breaks = 1:6)
```


## Run ANCOVA

We use the `lm()` function to fit a "linear model". Since month is numeric and g is categorical, this is an ANCOVA. The formula `hue ~ g * month` means model hue as a function of g, month, and their interaction.

```{r}
m <- lm(hue ~ g * month, data = d)
anova(m)
```

The interaction appears to be significant. This says the effect of month depends on the group and vice versa. How so? Let's follow up with some plots and comparisons.

## Plot model

The ggeffects package allows us to quickly visualize our ANCOVA model.

```{r}
library(ggeffects)
plot(ggpredict(m, terms = c("month", "g")))
```

This shows the fitted trajectory of the effect of month on hue for the different groups. We can see the significant interaction appears to be due to the different slopes for `(+) BPB H20` (green) and `(-) BPB H20` (red).

Are these fitted slopes a "good" fit for the data? We can add data to the plot to assess.

```{r}
plot(ggpredict(m, terms = c("month", "g")), add.data = TRUE)
```

The slope doesn't seem to fit month 3 very well. There's also _much more_ variability around the fitted line for `(-) BPB H20` (red). One of the assumptions of ANCOVA is that the variance around the fitted line is the same for each group.

## Are slopes different from 0?

The `emtrends` function in the emmeans package can help us quickly assess this.

In the **first section** ("emtrends") we get the slope for each group in the "month.trend" column as well as 95% confidence intervals for each slope. We see that `(-) BPB H2O` and `(+) BPB H2O` are the two groups with confidence intervals that _do not_ contain 0. Therefore we're confident these slopes are _different from 0_. (Are these estimated slopes practically significant?) The slopes for the acid groups are small and indistinguishable from 0.

In the **second section** ("contrasts") we get Tukey pairwise comparisons between the slopes. The slope for `(-) BPB H2O)` appears to be different from all other groups.

```{r}
library(emmeans)
emtrends(m, pairwise ~ g, var = "month")
```

We can plot each of these results if we like.

```{r}
em <- emtrends(m, pairwise ~ g, var = "month")
plot(em$emtrends)
```


```{r}
plot(em$contrasts)
```