aleszu/text-mining-course

Text Mining and Sentiment Analysis in R

Instructor: Aleszu Bajak

Link: bit.ly/textcourse

Introduction

This O'Reilly course will introduce participants to the techniques and applications of text mining and sentiment analysis by training them in easy-to-use open-source tools and scalable, replicable methodologies that will make them stronger data scientists and more thoughtful communicators.

Using RStudio and several engaging and topical datasets sourced from politics, social science, and social media, this course will introduce techniques for collecting, wrangling, mining and analyzing text data.

The course will also have participants derive and communicate insights based on their textual analysis using a set of data visualization methods. The techniques that will be used include n-gram analysis, sentiment analysis, and parts-of-speech analysis.

By the end of this live, hands-on, online course, you’ll understand:

  • The techniques and applications of textual analysis
  • How to convert unstructured text from politics, social science, and social media into data
  • Techniques like n-gram analysis, sentiment analysis and parts-of-speech analysis

And you’ll be able to:

  • Ingest various text formats into RStudio
  • Wrangle and analyze that text
  • Visualize and communicate insights about those textual data

Requirements and accessing data

Ideally, participants will have the latest versions of R and RStudio and the tidytext and tidyverse packages. To access all R scripts, participants should then download this GitHub repository and set it as their working directory in RStudio.

This course can also be accessed on RStudio Cloud here, though a free account is required.

Download repo and set up RStudio

Navigate to the GitHub repo for this course, click the green "Clone or download" button and then click "Download ZIP." Unzip the file once you've downloaded it. Remember where you've placed the text-mining-course-master folder.

img

Next, in RStudio, go to RStudio -> Preferences... and click "Browse..." under Default working directory. Browse to the unzipped text-mining-course-master folder and click "Open." Then hit "Apply."

img

Note: The working directory should now be set. You may need to restart RStudio for the change to take effect. Now, we're ready to begin!

Table of contents

  1. Course outline
  2. Datasets used
  3. Text analysis in the wild
  4. Text analysis methods
  5. Sentiment analysis methods
  6. Visualization and communication
  7. Final activity

Course outline

  • We'll first explore several real-world applications of text mining and sentiment analysis.
  • Next, we'll walk through several techniques using R.
  • Finally, we'll explore techniques to visualize and communicate insights about those textual data.

Datasets used

Text as data

Text mining is all about making sense of text. That could mean counting the frequency of specific words, understanding the overall sentiment of a document, or applying statistical techniques to draw big-picture conclusions from a corpus. Whether one is analyzing social media posts, customer reviews or news articles, these techniques can be essential to understanding and deriving meaningful insights.

Note: Though there are several ways to mine data and perform sentiment analysis in R -- with packages such as tm, quanteda, udpipe, and sentimentr -- this course uses R's tidytext package, developed by Julia Silge and David Robinson, and several tidy tools found in the tidyverse package.

Text analysis in the wild

BuzzFeed's analysis of U.S. State of the Union speeches over time is a great example of text analysis. As an added bonus, journalist Peter Aldhous shared all his data and open-sourced his methodology as an Rmarkdown document. Related: New York Times science graphics editor Jonathan Corum also has a cool State of the Union visualization tool on his website.

img img img

The New York Times' Mueller Report citations article is another example of text analysis in mainstream media, used to explain which of Trump's associates appeared in the report and how often. Check out my Storybench tutorial that includes R code for mining the Mueller Report for specific keywords.

img

FiveThirtyEight published an analysis tallying the instances of the name "Trump" in 2020 candidate messaging. The dataset was 2020 candidate emails sent to subscribers.

img

The Boston Globe's Arresting Words investigation visualized transcripts of police arrests to isolate the top words uttered by those being hauled in.

img

For USA TODAY, I looked at how the antifa conspiracy theory traveled from the fringe to the floor of Congress using text analysis and visualizations of social media data that researchers had cataloged from Parler, 4chan, Gab and Twitter.

img

Text analysis in marketing

Crimson Hexagon, recently acquired by Brandwatch, delivers "actionable social insights for the enterprise," i.e. how is Under Armour clothing or 5-hour Energy Drink being discussed online?

img img

Sentiment analysis in the wild

FiveThirtyEight used sentiment analysis to help contrast presidential inauguration speeches. Do the "More positive words" annotation and x-axis ticks make this graphic easier to understand?

img

In Roll Call, I published a sentiment analysis of tweets by politicians in the run-up to the 2018 Midterms. We'll get into this data and analysis later in this course.

img

img

For USA TODAY, I looked at how calls for civil war intensified on Parler as Trump urged his followers to march on the Capitol. I used text and sentiment analysis to explore 80,000 posts from Jan. 6.

img

Sentiment analysis in finance

Bloomberg routinely analyzes Twitter sentiment surrounding keywords, companies and entities, such as this 2017 Vodafone analysis, to better inform the trading strategies of its clients.

img

J.P. Morgan has published work on sentiment analysis it applied to analyst reports and news articles to assess the relationship between stock trades and news sentiment.

img

Text mining for fun

The community of tidytext users is large and very open to sharing code. Here are some examples of informal text mining and sentiment analyses that have been popular on the Internet: craft beer reviews and Harry Potter books.

img

img

More ambitious NLP

Topic modeling

img

Cluster analysis

img

img

Principal component analysis

img

Discussion

Question 1: What real-world text analysis projects stuck out to you as memorable? Why? What was harder to get your head around and why?

Question 2: Choose one of the projects presented and write out some potential caveats, assumptions and/or problems faced with data collection, analysis or communication. Share via the group chat.

Text analysis methods

This section will introduce methods for tokenization, n-gram analysis and parts-of-speech analysis. We will then conduct a brief text analysis activity to isolate top words, top phrases and top parts of speech for a dataset.

Tokenization

First let's do some basic text ingestion and analysis using tidytext's unnest_tokens() for tokenizing and dplyr's count() for, well, counting.

# Load packages
#install.packages("tidyverse")
#install.packages("tidytext")
library(tidyverse)
library(tidytext)

# Text and tokenization
line <- c("The quick brown fox jumps over the lazy dog.") 
line 

line_tbl <- as_tibble(line) # get into Tidy format
line_tbl

line_tokenized <- line_tbl %>%
  unnest_tokens(word, value)  # tokenize!
line_tokenized

line_tokenized %>% count(word) # count
line_tokenized %>% count(word, sort=TRUE) # count and sort

Let's also remove the stop words using the anti_join() function, which is well explained here. As the slide shows, anti_join() removes everything that matches a specified column in a supplied table. We'll use left_join() later in this course. SQL and Python users will know this as a merge function.
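To make the difference between the join types concrete, here is a minimal sketch on made-up toy tables (the names tokens, stops and scores are invented for illustration and are not part of the course data):

```r
library(tidyverse)

# Toy tables, made up for illustration
tokens <- tibble(word = c("the", "quick", "fox", "the"))
stops  <- tibble(word = c("the", "over"))
scores <- tibble(word = c("quick", "fox"), score = c(1, 2))

# anti_join(): keep rows of `tokens` that have NO match in `stops`
kept <- anti_join(tokens, stops, by = "word")
kept

# inner_join(): keep only rows of `tokens` that DO have a match in `stops`
matched <- inner_join(tokens, stops, by = "word")
matched

# left_join(): keep every row of `tokens` and add columns from `scores`;
# words without a match get NA in the new column
joined <- left_join(tokens, scores, by = "word")
joined
```

Passing by = "word" explicitly avoids the message that dplyr prints when it has to guess the join column.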

img

We can always inspect this with glimpse().

line_clean <- line_tokenized %>% 
  anti_join(stop_words) # cut out stopwords 
line_clean 

glimpse(stop_words) # let's inspect stopwords. 
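The stop word list is just a table, so it can be edited like any other. As a rough sketch, assuming we decide (hypothetically) that "over" matters to our analysis, we could drop it from the list before the anti_join(). The name custom_stop_words is made up for this example.

```r
library(tidyverse)
library(tidytext)

# stop_words combines several lexicons; see how many words each contributes
stop_words %>% count(lexicon)

# Hypothetical choice: keep "over" in our analysis by removing it from the stop list
custom_stop_words <- stop_words %>%
  filter(word != "over")

line_tbl <- tibble(value = "The quick brown fox jumps over the lazy dog.")

line_custom <- line_tbl %>%
  unnest_tokens(word, value) %>%
  anti_join(custom_stop_words, by = "word")  # "over" now survives the filter
line_custom
```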

Question: Are all of these stop words really worth cutting out? Can you find one that you want to include in your analysis?

n-gram analysis

Google's n-gram viewer is probably the most well-known example of n-gram analysis.

img

We'll use President Donald Trump's 2018 State of the Union speech to explore 1-grams, 2-grams and 3-grams.

First, we'll ingest the text file, tidy the data and then tokenize the text.

trump_speech <- read_file("trump2018.txt") # alternatively, use read_file(file.choose())
trump_speech

trump_speech_tbl <- as_tibble(trump_speech) # tidy the data
trump_speech_tbl

trump_counts <- trump_speech_tbl %>%
  unnest_tokens(word, value) # tokenize
trump_counts # 5,864 words

We'll next use the count() function we previously introduced. Wow, that's a lot of and's, the's and to's. Let's remove the stop words and count the words. You can change the "n=" argument in head() to customize the output. Note: I like to save my progress by writing CSVs of my tables. You can do this with "write.csv(trump_counts, "trump_counts.csv")"

trump_counts <- trump_speech_tbl %>% 
  unnest_tokens(word, value) %>%
  count(word, sort = TRUE) %>% # count words
  glimpse()

trump_counts <- trump_speech_tbl %>%
  unnest_tokens(word, value) %>%
  anti_join(stop_words) %>% # remove stopwords
  count(word, sort=TRUE) %>%
  glimpse()

head(trump_counts)
head(trump_counts, n=15)

Ok, moving on to bigrams. Let's use unnest_tokens()'s "token" and "n" arguments. Then we'll count the bigrams and view them with head().

# Bigrams
bigrams <- trump_speech_tbl %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
  count(bigram, sort=TRUE)
head(bigrams, n=20) # too many stopwords!

This works but it's not terribly insightful because of all the stop words. Let's remove stop words according to the way Julia Silge and David Robinson suggest in Text Mining with R.

# Better bigrams
trump_bigrams <- trump_speech_tbl %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) 

bigrams_separated <- trump_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") # separate bigram by space

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% # filter out stopwords from word1 column
  filter(!word2 %in% stop_words$word) # filter out stopwords from word2 column

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE) # count new bigrams

bigram_counts 

Now we see some interesting bigrams like "MS 13," "North Korea" and "immigration system."
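If you'd rather have each bigram back in a single column (handy for tables and charts), tidyr's unite() reverses separate(). Below is a small sketch on a made-up two-sentence text, not the actual speech; toy_tbl and toy_bigrams are names invented for the example.

```r
library(tidyverse)
library(tidytext)

# Toy text standing in for the speech (made up for illustration)
toy_tbl <- tibble(
  value = "We will fix the immigration system. The immigration system is broken."
)

toy_bigrams <- toy_tbl %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%  # glue the two columns back together
  count(bigram, sort = TRUE)
toy_bigrams
```

Note that the n-gram tokenizer ignores sentence boundaries, so a bigram can span the end of one sentence and the start of the next.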

Question: How would you export this table as a CSV? Can you write the function in R?

Finally, let's change the "n" argument to "3" and count the trigrams.

# Trigrams
trump_trigrams <- trump_speech_tbl %>%
  unnest_tokens(trigram, value, token = "ngrams", n = 3) 

trigrams_separated <- trump_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") # separate trigram by space

trigrams_filtered <- trigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% # filter out stopwords from word1 column
  filter(!word2 %in% stop_words$word) %>% # filter out stopwords from word2 column
  filter(!word3 %in% stop_words$word) # filter out stopwords from word3 column

trigram_counts <- trigrams_filtered %>% 
  count(word1, word2, word3, sort = TRUE) # count new trigrams

trigram_counts 

Question 1: How would you summarize these results for a non-technical audience? Could you design a top 10 table with your exported CSV and embed it alongside your code?

Question 2: What would your headline be for this 2018 State of the Union speech, based on these unigrams, bigrams and/or trigrams?

Doing string calculations

The stringr package brings together loads of useful tools for string manipulation and calculation. Below, run through the code to see how str_length() can be used to calculate the length of strings. Note: str_length() also counts spaces and punctuation.

# Calculate length of strings 
line <- c("The quick brown fox jumps over the lazy dog.") 
line 

library(stringr)
str_length(line) 
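Because str_length() counts characters, it is worth comparing it with a word count. A quick sketch using stringr's str_count() with boundary("word"); the variable names are invented for the example:

```r
library(stringr)

line <- "The quick brown fox jumps over the lazy dog."

n_chars  <- str_length(line)                   # characters, incl. spaces and the period
n_words  <- str_count(line, boundary("word"))  # words
n_spaces <- str_count(line, "\\s")             # whitespace characters

n_chars   # 44
n_words   # 9
n_spaces  # 8
```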

Next, we'll pull in the State of the Union speeches CSV file and create a new object with a new column "length." In this new column, we'll store the length of each speech we calculate with str_length(). Using the ggplot2 package we can plot a scatterplot of date vs. length to visualize the length of the speeches over time. We'll return to more data visualization with ggplot2 soon.

sou <- read_csv("sou.csv")
glimpse(sou)

length_of_sous <- sou %>%
  mutate(length = str_length(text))
glimpse(length_of_sous)

# Plot it 
ggplot(length_of_sous, aes(date, length)) +
  geom_point()

img

We can also create something akin to the Google n-gram viewer by "searching" through the speeches using "str_count" and calculating the number of times a specific keyword or phrase appears. We can then plot the output as a line chart.

# Count occurrences of a string with "str_count"
speeches_w_keyword <- sou %>%
  group_by(text, date, president, message) %>%
  mutate(count = str_count(text, "health care")) # try "people" or "crime" 
speeches_w_keyword

# Plot it
ggplot(speeches_w_keyword, aes(date,count)) +
  geom_line(stat="identity")
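One thing to keep in mind: str_count() works on the raw text and is case-sensitive, so "Health care" at the start of a sentence would not match "health care". A hedged sketch on made-up toy speeches (not the sou.csv data) shows the difference when the pattern is wrapped in stringr's regex() with ignore_case = TRUE:

```r
library(tidyverse)

# Toy speeches, made up for illustration
toy_speeches <- tibble(
  year = c(2001, 2002, 2003),
  text = c("Health care matters. We must fix health care.",
           "Crime is down this year.",
           "Affordable health care for all.")
)

keyword_counts <- toy_speeches %>%
  mutate(
    count_exact = str_count(text, "health care"),                            # case-sensitive
    count_any   = str_count(text, regex("health care", ignore_case = TRUE))  # ignores case
  )
keyword_counts
```

In the first toy speech the case-sensitive count is 1 while the case-insensitive count is 2.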

img

Question: Could you imagine an interactive app that allows users to search a dataset for specific keywords? What dataset would you use? What would the user interface look like? Shiny apps in R can be built to do this. See this prototype of mine and this documentation on Shiny apps from RStudio.

Parts-of-speech analysis

Adjectives can drive the tone of a sentence. Or a tweet. That's where parts-of-speech analysis comes in.

Let's look at Donald Trump's adjective use on Twitter to illustrate the mechanics of parts-of-speech analysis. We'll also try to visualize the results in a couple of different ways. These tweets were collected with R's "rtweet" package and some great tutorials can be found here.

First, we'll pull in the dataset of tweets collected for the month of July, 2019. Then we'll tokenize and remove stop words. The very next operation uses inner_join() to merge in the "parts_of_speech" dataset, which comes with tidytext and includes more than 208,000 word/pos combinations.

What remains are the words from Trump's tweets that can be tagged with a specific part of speech. If we isolate adjectives, we can glimpse() the results and plot them as a bar chart.

Trump <- read_csv("Trump_tweets.csv")
glimpse(Trump)

Trump_tokenized_pos <- Trump %>%
  unnest_tokens(word, text) %>% # tokenize the tweets
  anti_join(stop_words) %>%
  inner_join(parts_of_speech) # join parts of speech dictionary

parts_of_speech

glimpse(Trump_tokenized_pos)

Trump_adj <- Trump_tokenized_pos %>%
  group_by(word) %>% 
  filter(pos == "Adjective") %>%  # filter for adjectives
  count(word, sort = TRUE) %>% 
  glimpse()

head(Trump_adj, n=10)

ggplot(head(Trump_adj, n=10), aes(reorder(word, n), n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  xlab("")+ 
  coord_flip()
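A caveat worth seeing in miniature: a dictionary lookup tags words, not usages. A word listed under several parts of speech produces one row per tag after the join, so a word counted as an adjective may have been used as a noun or verb in the tweet. The sketch below uses a hypothetical mini lookup (toy_pos) shaped like parts_of_speech; its contents are invented for illustration.

```r
library(tidyverse)

# Hypothetical mini lookup with the same columns as parts_of_speech (word, pos)
toy_pos <- tibble(
  word = c("great", "fake", "fake", "fake"),
  pos  = c("Adjective", "Adjective", "Noun", "Verb (transitive)")
)

toy_tokens <- tibble(word = c("great", "fake", "news"))

toy_tagged <- toy_tokens %>%
  inner_join(toy_pos, by = "word")  # one row per token/tag pair; "news" drops out
toy_tagged

toy_tagged %>%
  filter(pos == "Adjective") %>%    # "fake" is kept regardless of how it was used
  count(word, sort = TRUE)
```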

Q&A

Question: Based on the n-gram analysis, string-based calculations and parts-of-speech analysis we've performed, what other questions could you ask of the State of the Union speeches or a politician's tweets? What other dataset might you want to compile for one of the methodologies practiced above?

Sentiment analysis methods

Sentiment analysis is being widely applied to understand politics, finance or sports, as we saw from our examples analyzing the sentiment of social media from candidates running for Senate, news articles about a particular company or product, and Reddit comments from baseball fans.

While sentiment analysis can involve complex natural language processing models like word2vec, for the purposes of this course we'll explore its simplest form: scoring individual words based on a dictionary of word/score pairs. Let's look at some sentiment dictionaries.

# Sentiment dictionaries
afinn <- get_sentiments("afinn")
afinn 

bing <- get_sentiments("bing")
bing

nrc <- get_sentiments("nrc")
nrc

labMT<- read.csv("labMT.csv")
labMT

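Let's score the sentiment of a very simple sentence using the afinn dictionary. We'll ingest the sentence, tidy it, tokenize it and then use inner_join() to merge in the afinn word/score dictionary and leave only those words that have been scored. Finally, we'll calculate the average score of the five words that were scored. (Notice that we started out with 10 words.) Note: recent releases of the AFINN lexicon name the score column "value" rather than "score." If get_sentiments("afinn") returns a "value" column on your machine, use it in place of "score" in the code that follows.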

alexander <- c("Alexander and the terrible, horrible, no good, very bad day")
alexander <- as_tibble(alexander) # as.tibble() is deprecated in favor of as_tibble()

alexander_scored <- alexander %>%
  unnest_tokens(word, value) %>%
  inner_join(afinn, by="word") %>% 
  glimpse()

mean(alexander_scored$score)
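To see the arithmetic without downloading a lexicon, here is the same idea with a hypothetical five-word dictionary. The scores in toy_lexicon are illustrative values on an AFINN-like scale, entered by hand for this sketch rather than loaded from the real dictionary.

```r
library(tidyverse)
library(tidytext)

# Hypothetical mini dictionary; scores are illustrative, typed in by hand
toy_lexicon <- tibble(
  word  = c("terrible", "horrible", "no", "good", "bad"),
  score = c(-3, -3, -1, 3, -3)
)

sentence <- tibble(value = "Alexander and the terrible, horrible, no good, very bad day")

scored <- sentence %>%
  unnest_tokens(word, value) %>%        # 10 tokens
  inner_join(toy_lexicon, by = "word")  # only the 5 words in the dictionary survive
scored

avg <- mean(scored$score)  # (-3 - 3 - 1 + 3 - 3) / 5
avg
```

Notice that "no good" is scored as two independent words, so "good" still contributes a positive score. Word-by-word dictionaries don't handle negation.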

Ok, let's go to a bigger dataset and do this at scale: the State of the Union speeches. Let's merge in the afinn dictionary and then create a pivot table that summarizes the average sentiment score by president using the mean() function embedded in the summarise() function. Finally, we'll plot the results as a bar chart.

sentiment_sou <- sou %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") # we'll join in the AFINN dictionary
glimpse(sentiment_sou)

sentiment_by_president <- sentiment_sou %>%
  group_by(president) %>% 
  summarise(avgscore = mean(score)) %>%
  arrange(desc(avgscore)) %>%
  glimpse()

ggplot(sentiment_by_president, aes(reorder(president, avgscore), avgscore)) +
  geom_col() +
  coord_flip()

img

Question 1: Do you have any idea why a particular president is in a particular spot? Why might FDR, for example, be near the bottom?

Question 2: How would you label the x-axis? What's going to be most clear and who is your intended audience?

Let's now look at the sentiment of State of the Union speeches over time. Instead of organizing by president, we can group_by() message and date. Plotting that as a scatterplot and fitting a linear regression to the data, we see that the speeches appear to be relatively stable, in terms of sentiment, across time.

sentiment_sou_afinn <- sentiment_sou %>%
  group_by(message, date) %>% # We "group by" message and date instead of by president
  summarise(avgscore = mean(score)) %>%
  glimpse()

ggplot(sentiment_sou_afinn, aes(date, avgscore)) +
  geom_point() + 
  geom_smooth(method="lm")
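geom_smooth(method="lm") draws the fitted line, but it doesn't report the numbers behind it. One way to check whether a trend is really flat is to fit the same model with lm() and read off the slope. The sketch below uses made-up yearly averages (toy_scores), not the actual speech scores.

```r
library(tidyverse)

# Toy averages, made up for illustration: one row per speech
toy_scores <- tibble(
  year     = c(1950, 1970, 1990, 2010),
  avgscore = c(0.50, 0.55, 0.45, 0.50)
)

fit <- lm(avgscore ~ year, data = toy_scores)

slope <- coef(fit)[["year"]]  # change in average score per year
slope

summary(fit)  # includes the standard error and p-value for the slope
```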

img

Question: What labels or other insights might you want to communicate along with this chart?

Activity

You have the "labMT" dictionary. Swap it in for "afinn" in the inner_join() function and recreate the scatterplot with labMT-scored speeches.

Question: What difference do you notice when the speeches are scored with the labMT dictionary?

Bonus: Applying sentiment analysis to social media

Now, let's recreate the sentiment analysis my graduate student Floris Wu and I performed for Roll Call on Senate candidate tweets in the lead-up to the 2018 midterms.

img

First, let's pull in the tweets, which were gathered using the rtweet package. Full methodology is published at the bottom of the Roll Call article and on my Github account. For the purposes of this activity, I've added some extra columns including party affiliation and share of the vote won in the 2018 elections. This information is all in the alltweets.zip file.

Let's do some exploratory data visualization and create a histogram using ggplot2. We can add custom colors and change the theme with a couple extra lines:

candidate_tweets <- read_csv("alltweets.zip")
glimpse(candidate_tweets)

ggplot(candidate_tweets, aes(date, fill=party)) + 
  geom_histogram(stat = "count") +
  ylim(0, 500) +
  scale_fill_manual(values=c("#404f7c", "forestgreen", "#c63b3b")) +
  theme_minimal() 

img

Next, we tokenize and score the tweets with the labMT dictionary.

tokenized_tweets <- candidate_tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  glimpse()

all_sentiment <- tokenized_tweets %>%  
  inner_join(labMT, by = "word") %>%
  group_by(status_id, name, party, followers_count, percent_of_vote) %>%  
  summarise(sentiment = mean(score)) %>% 
  arrange(desc(sentiment))  %>%
  glimpse()

Next, we create a pivot table that calculates the average score per candidate and plot that as a scatterplot with custom colors, a custom scale range for the points (whose size is mapped to the candidate's number of followers), and trend lines for each party.

Your graphic should look very similar to the final visual that appeared with the article. Notice we can customize the axis labels using ggplot2's xlab() and ylab().

final_pivot <- all_sentiment %>% 
  group_by(name, party, followers_count, percent_of_vote) %>% 
  summarise(avgscore = mean(sentiment)) %>% 
  glimpse()

ggplot(final_pivot, aes(y=percent_of_vote, x=avgscore, color=party)) + 
  geom_point(aes(size=followers_count)) + 
  scale_size(name="", range = c(1.5, 8)) +
  geom_smooth(method="lm", se = FALSE) + 
  labs(title = "As they go low... we go lower?",
       subtitle = "Democrats with more negative tweets won more often in 2018",
       caption = "Source: Twitter") +
  scale_color_manual(values=c("#404f7c", "#34a35c", "#34a35c", "#c63b3b")) +
  xlab("Average sentiment of tweets") +
  ylab("Percent of vote in 2018 midterms") +
  theme_minimal()

img

Question: What caveats would you include if you were publishing with this kind of social media sentiment analysis? What tweet-level data would you want to inspect and mention that speaks to the shortcomings of these methods?

Visualization and communication

Data analysis and extracted insights are nothing if they aren't communicated effectively. This section will introduce several visual formats to help improve your data-driven storytelling.

Exporting CSVs

Whenever I reach a stage in my data analysis where I want to save a snapshot of where I am, whether to share with a colleague or to keep to myself, I write my dataframe into a CSV. Leaving the "row.names" argument at its default, TRUE, adds a column of row numbers to the file; setting "row.names=FALSE" leaves that column out.

sou <- read_csv("sou.csv")
write.csv(sou, "state-of-the-union-speeches.csv", row.names=FALSE)
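To see what row.names actually changes, the sketch below writes a tiny made-up table both ways to temporary files and reads each back. The table and file names are invented for the example.

```r
library(tidyverse)

toy <- tibble(president = c("A", "B"), n = c(10, 20))  # made-up data

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(toy, with_rows)                        # default: row.names = TRUE
write.csv(toy, without_rows, row.names = FALSE)

cols_with    <- names(read.csv(with_rows))     # extra first column "X" holds row numbers
cols_without <- names(read.csv(without_rows))  # only the original columns

cols_with
cols_without
```

readr's write_csv() is an alternative that never writes row names.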

Tables

Properly formatted and designed for comprehension, tables can go a long way in communicating data, whether that's a ranked list or just a random selection of variables. You can create publication-ready tables in RStudio with packages like DT and formattable. Here's a preview of what they look like.

#install.packages("DT")
#install.packages("formattable")

library(DT)
datatable(Trump_adj)

library(formattable)
formattable(Trump_adj,
            align =c("l","c"))
            
formattable(head(Trump_adj, n=15),
            align =c("l","c"))

formattable(Trump_adj, 
            align =c("l","r"),
            list(
  n = color_bar("lightblue")
                )
            )

img

Wordclouds

Although wordclouds have gotten their fair share of critique, they can still be powerful. I like the ggwordcloud package because it allows a lot of customization. See here and here for more documentation.

Let's take the State of the Union addresses and filter for Barack Obama's most frequently used words. Then we'll build two different wordclouds.

#install.packages("ggwordcloud")
library(ggwordcloud)

sou <- read_csv("sou.csv")
glimpse(sou)

sou_words_by_president <- sou %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, president, sort=TRUE)

glimpse(sou_words_by_president)

Obama_words <- sou_words_by_president %>%
  filter(president == "Barack Obama") %>%
  glimpse()

ggplot(head(Obama_words, n=25), aes(label = word, 
                                           size=n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 14) +
  theme_minimal()

ggwordcloud(Obama_words$word, Obama_words$n, 
            min.freq = 10, 
            max.words=50)

img

Wordtrees

I love word trees, like this one from Jason Davies visualizing Bob Dylan lyrics.

img

Using the wordtreer package, which is a wrapper for Google's Word Tree chart, we can recreate this visualization of Blowin' in the Wind. Note: I used the datapasta package to paste these lyrics into RStudio with a simple keyboard shortcut that pastes data from the clipboard as "character vector, formatted vertically, one element per line." Exactly what I needed!

#devtools::install_github("DataStrategist/wordTreeR")
library(wordtreer)

Dylan <- c("How many roads must a man walk down",
  "Before you call him a man?",
  "How many seas must a white dove sail",
  "Before she sleeps in the sand?",
  "How many times must the cannon balls fly",
  "Before they're forever banned?",
  "The answer, my friend, is blowin' in the wind",
  "The answer is blowin' in the wind",
  "How many years can a mountain exist",
  "Before it's washed to the sea?",
  "How many years can some people exist",
  "Before they're allowed to be free?",
  "How many times can a man turn his head",
  "And pretend that he just doesn't see?",
  "The answer, my friend, is blowin' in the wind",
  "The answer is blowin' in the wind",
  "How many times must a man look up",
  "Before he can see the sky?",
  "How many ears must one man have",
  "Before he can hear people cry?",
  "How many deaths will it take till he knows",
  "That too many people have died?",
  "The answer, my friend, is blowin' in the wind",
  "The answer is blowin' in the wind")

wordtree(text=Dylan,
         targetWord = "How",
         direction="suffix",
         Number_words = 20,
         fileName="dylan.html")
browseURL("dylan.html")

img

To create a similar data frame of sentences containing a specified keyword, I've found the following code useful:

sou_sentences <- sou %>%
  filter(president == "Barack Obama") %>%
  unnest_tokens(sentences, text, token="sentences", to_lower = FALSE) %>%
  filter(str_detect(sentences, "terror")) %>%
  select(sentences)
glimpse(sou_sentences)
sou_all_sentences <- unlist(sou_sentences)

Scatterplots and more

In the preceding sections we've created scatterplots, line charts and bar charts. Those were all created with the ggplot2 package. Let's refresh our memory about the code needed to build a bar chart, a scatterplot where text replaces points, and a more traditional scatterplot.

Remember we can customize the axis labels, theme, titles, captions and subtitles in ggplot2. For tips on changing the color scheme, read through this documentation and know that a little Googling goes a long way for troubleshooting your visualizations in R.

Tip: hrbrthemes is a great package with several very nice built-in themes. Other packages exist to mimic FiveThirtyEight and The Economist. I'm a big fan of the BBC's R cookbook, which walks users through creating BBC-style graphics. Having trouble with overlapping labels? Try the ggrepel package.

# A bar chart

sou <- read_csv("sou.csv")

sent_by_president <- sou %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") %>%
  group_by(president) %>% 
  summarise(avgscore = mean(score)) %>%
  arrange(desc(avgscore))

ggplot(sent_by_president, aes(reorder(president, avgscore), avgscore)) +
  geom_col() +
  coord_flip()

# A scatterplot where text replaces points 

Trump <- read_csv("Trump_tweets.csv")

Trump_adj_sent <- Trump %>%
  unnest_tokens(word, text) %>% # tokenize the tweets
  anti_join(stop_words) %>%
  inner_join(parts_of_speech) %>% # join parts of speech dictionary
  group_by(word) %>% 
  filter(pos == "Adjective") %>% 
  count(word, sort = TRUE) %>%
  inner_join(labMT, by="word") %>%  # add in sentiment dictionary
  glimpse()

ggplot(Trump_adj_sent, aes(n, score, color = score>5)) +
  geom_text(aes(label=word), check_overlap = TRUE) +
  theme_minimal() +
  scale_color_manual(values=c("#c63b3b", "#404f7c")) +
  theme(legend.position = "none")
   
# A scatterplot with regression lines and title, subtitle and caption

ggplot(final_pivot, aes(y=percent_of_vote, x=avgscore, color=party)) + 
  geom_point(aes(size=followers_count)) + 
  scale_size(name="", range = c(1.5, 8)) +
  geom_smooth(method="lm", se = FALSE) + 
  labs(title = "As they go low... we go lower?",
       subtitle = "Democrats with more negative tweets won more often in 2018",
       caption = "Source: Twitter") +
  scale_color_manual(values=c("#404f7c", "#34a35c", "#34a35c", "#c63b3b")) +
  xlab("Average sentiment of tweets") +
  ylab("Percent of vote in 2018 midterms") +
  theme_minimal()

Activity

For one final activity in text mining, let's explore a dataset of 130,000 wine reviews from WineEnthusiast.com, which I downloaded from Kaggle. Below, I've ingested the data and filtered for only French wine reviews. Next, I've kept some of the more interesting variables (like points, price, province and variety) and tokenized the reviews and removed stop words.

You now have 444,000 words organized by price, points, variety and province in France. Your mission is to use one of the methods we've learned in this course to continue analyzing the text of these wine reviews. Take 20 minutes and return to the group chat with an insight from this dataset. If you wish to share one to three bullet points and a visualization through a Google doc, you can access that here.

Note: You can access the document either with your Google Account or anonymously (likely through an Incognito or Private type of browser session). A Google Account is not required. If you choose to authenticate with your Google Account, your username and/or other identifying information could be visible to others.

reviews <- read_csv("wine-reviews.zip")
glimpse(reviews)

french_reviews <- reviews %>%
  filter(country == "France") %>% # Filter for only wines from "France"
  glimpse()

french_reviews_tokenized <- french_reviews %>%
  select(description, points, price, province, variety) %>% # keep the columns we want
  unnest_tokens(word, description) %>% # tokenize the description column
  anti_join(stop_words)  # remove stopwords 

glimpse(french_reviews_tokenized) 
