Quanteda_Workshop/ExampleCaseSpeeches.Rmd at main · LucaVellage/Quanteda_Workshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
---
title: "Starting with Quanteda"
author: "Aranxa Marquez"
date: "2023-10-09"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
## Install quanteda package
# install.packages("quanteda")
```

```{r}

library(rvest)
library(stringr)
library(tidyverse)
library(quanteda)
library(quanteda.textplots)
library(readtext)
library(gt)

```

```{r initialise2, warning=FALSE}
library(quanteda.textstats)
```

## Try it out! Comparing political discourses

Now we're going to give you an example where you can apply this for on of the most used cases: analyzing political discourses.

**Step 1**. Scraping a discourse To practice you can use the <http://obamaspeeches.com> website to explore on his best\* speeches(according to the webpage creators). Let's try with his inaugural speech.

```{r}
#Save the link as object
obama_speeches <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
obama_inaugural <- read_html(obama_speeches)

```

With Selector Gadget we identify the structure in the html containing the text. Copy it direct from the bar, do not click on the xpath function.

```{r}
inaugural_speech_container <- obama_inaugural |>  html_nodes("br+ table font+ font")

# Extract the text from the container
inaugural_speech_text <- html_text(inaugural_speech_container)

# Print the speech text
cat(inaugural_speech_text, sep = "\n")

```

Here you can have a look at the different speeches you would like to work on later that are available on the website.

```{r}

titles_speeches <- obama_inaugural |>
  html_nodes("table p")

titles_speeches <- sapply(titles_speeches, html_text, USE.NAMES = FALSE)

cat(titles_speeches , sep = "\n")
```

Following on what you just learned, you can now explore with the 'corpus' and the 'token' functions how this speech is structured.

```{r}

inaugural_df <- rbind(inaugural_speech_text)
inaugural_speech_corpus <- corpus(inaugural_df )
inaugural_speech_tokens <- tokens(inaugural_speech_corpus)

summary(inaugural_speech_corpus)

```

Transform the speech into a document-feature matrix (dmf)

```{r}

inaugural_speech_dfm <- dfm(inaugural_speech_tokens)

print(inaugural_speech_dfm)

```

Explore which are the most frequend words in the discoure.

```{r}
topfeatures(inaugural_speech_dfm )
```

To keep on working, let's clean the speech a little.

```{r}

inaugural_speech_tokens <- tokens(inaugural_speech_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

inaugural_speech_tokens <- tokens_remove(inaugural_speech_tokens, stopwords("en"))
inaugural_speech_tokens  <-  tokens_remove(inaugural_speech_tokens, c('the','and', 'that','to', 'can', 'must', 'of', 'every', 'words', 'let', 'end', 'whether', 'things'))

inaugural_speech_dfm <- dfm(inaugural_speech_tokens)

topfeatures(inaugural_speech_dfm)
```

Let's see the wordcloud from this speech

```{r}
textplot_wordcloud(inaugural_speech_dfm, max_words = 40)

```

Let's look at the lexical diversity of this

```{r}

typeof(textstat_lexdiv(inaugural_speech_dfm))

as.data.frame(textstat_lexdiv(inaugural_speech_dfm))|>
  arrange(desc(TTR))|>  gt()

```

**Step 2** We'll compare it with another speech, this time with "An Honest Government - A Hopeful Future" which Obama hold at the University of Nairobi under the topic: Our Past, Our Future & Vision for America (August 28, 2006).

```{r}
#Save the link as object
obama_speeches_2 <- "http://obamaspeeches.com/088-An-Honest-Government-A-Hopeful-Future-Obama-Speech.htm"
obama_honest_gov <- read_html(obama_speeches_2)

```

With Selector Gadget we identify the structure in the html containing the text. Copy it direct from the bar, do not click on the xpath function.

```{r}
honest_gov_speech_container <- obama_honest_gov |>  html_nodes("font p font")

# Extract the text from the container
honest_gov_speech_text <- html_text(honest_gov_speech_container)

# Print the speech text
cat(honest_gov_speech_text, sep = "\n")

```

Following on what you just learned, you can now explore with the 'corpus' and the 'token' functions how this speech is structured.

```{r}

honest_gov_df <- rbind(honest_gov_speech_text)
honest_gov_speech_corpus <- corpus(honest_gov_df)
honest_gov_speech_tokens <- tokens(honest_gov_speech_corpus)

summary(honest_gov_speech_corpus)

```

Transform the speech into a document-feature matrix (dmf)

```{r}

honest_gov_speech_dfm <- dfm(honest_gov_speech_tokens)

print(honest_gov_speech_dfm)

```

Explore which are the most frequent words in the discourse.
```{r}
topfeatures(honest_gov_speech_dfm )
```

To keep on working, let's clean the speech a little.

```{r}

honest_gov_speech_tokens <- tokens(honest_gov_speech_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

honest_gov_speech_tokens <- tokens_remove(honest_gov_speech_tokens, stopwords("en"))
honest_gov_speech_tokens  <-  tokens_remove(honest_gov_speech_tokens, c('the','and', 'that','to', 'can', 'must', 'of', 'every', 'words', 'let', 'end', 'whether', 'kenya', 'kenyan', 'kenyans', 'also', 'sometimes'))

# We removed "Kenya" and "kenyan" to focus more on the topics around it, since we already know that's the main topic. We want to know what are other concepts associated to it.

honest_gov_speech_dfm <- dfm(honest_gov_speech_tokens)

topfeatures(honest_gov_speech_dfm)
```

Let's see the wordcloud from this speech

```{r}
textplot_wordcloud(honest_gov_speech_dfm, max_words = 40)
```