Sunday, March 25, 2018

Sentiment Analysis in R: The Tidy Way

library(tidytext)
library(ggplot2)
library(dplyr)
library(tidyr)

1. Tweets across the United States

Sentiment Analysis and tidy tools - Video

Sentiment lexicons

There are several different sentiment lexicons available for sentiment analysis. You will explore three in this course that are available in the tidytext package:
  • afinn from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.
You will see how these lexicons can be used as you work through this course. The decision about which lexicon to use often depends on what question you are trying to answer. In this exercise, you will use dplyr's count() function. If you pass count() a variable, it will count the number of rows that share each distinct value of that variable.
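As a quick, made-up illustration of how count() behaves (this toy tibble is not part of the course data):
In [ ]:
# A minimal sketch of count(); the toy data frame is invented for illustration
library(dplyr)

toy <- tibble(sentiment = c("positive", "negative", "positive", "positive"))

# One row per distinct value of sentiment, with the count in a column named n
toy %>%
  count(sentiment)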
INSTRUCTIONS 100XP
  • Load the dplyr and tidytext packages.
  • Add an argument to get_sentiments() to see what the "bing" lexicon looks like.
  • Then call get_sentiments() for the "nrc" lexicon.
  • Add an argument to count() so you can see how many words the nrc lexicon has for each sentiment category.
In [ ]:
# Load dplyr and tidytext
library(dplyr)
library(tidytext)

# Choose the bing lexicon
get_sentiments("bing")

# Choose the nrc lexicon
get_sentiments("nrc") %>%
  count(sentiment) # Count words by sentiment
In [ ]:
head(get_sentiments("bing"))
In [ ]:
# Choose the nrc lexicon
head(get_sentiments("nrc") %>% count(sentiment)) # Count words by sentiment
Great! While the "bing" lexicon classifies words into 2 sentiments, positive or negative, there are 10 sentiments conveyed in the "nrc" lexicon.

Words in lexicons

Sentiment lexicons include many words, but some words are unlikely to be found in a sentiment lexicon or dictionary. Which of the following words is most unlikely to be found in a sentiment lexicon?
ANSWER THE QUESTION
Possible Answers
  • pessimism
  • peace
  • merry
  • and (Correct)
  • repulsion
Correct! A word like “and” is neutral and unlikely to be included in a sentiment lexicon.

Sentiment analysis via inner join - Video

In [3]:
load("geocoded_tweets.rda")
head(geocoded_tweets)
state    word     freq
alabama  a        16256685.699
alabama  a-       5491.100
alabama  a-day    3991.764
alabama  aa       4739.479
alabama  aaliyah  8251.955
alabama  aamu     4305.704
alabamaaamu4305.704

Implement an inner join

In this exercise you will implement sentiment analysis using an inner join. The inner_join() function from dplyr will identify which words are in both the sentiment lexicon and the text dataset you are examining. To learn more about joining data frames using dplyr, check out Joining Data in R with dplyr.
The geocoded_tweets dataset is taken from Quartz and contains three columns:
  • state, a state in the United States,
  • word, a word used in tweets posted on Twitter, and
  • freq, the average frequency of that word in that state (per billion words).
If you look at this dataset in the console, you will notice that the word "a" has a high frequency compared to the other words near the top of the sorted data frame; this makes sense! You can use inner_join() to implement a sentiment analysis on this dataset because it is in a tidy format.
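To see what the join does before working with the real data, here is a minimal sketch with two made-up data frames: only words present in both the text data and the lexicon survive an inner join.
In [ ]:
library(dplyr)

# Toy text data and a toy lexicon, both invented for illustration
toy_tweets  <- tibble(word = c("happy", "sad", "the"), freq = c(10, 5, 100))
toy_lexicon <- tibble(word = c("happy", "sad"), sentiment = c("positive", "negative"))

# "the" is dropped because it does not appear in the lexicon
toy_tweets %>%
  inner_join(toy_lexicon, by = "word")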
INSTRUCTIONS 100XP
  • In the console, take a look at the geocoded_tweets object.
  • Use get_sentiments() to access the "bing" lexicon and assign it to bing.
  • Use an inner_join() to implement sentiment analysis on the geocoded tweet data using the bing lexicon.
In [ ]:
# geocoded_tweets has been pre-defined
geocoded_tweets

# Access bing lexicon: bing
bing <- get_sentiments("bing")

# Use data frame with text data
geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `bing`
  inner_join(bing)
Marvelous! By inner joining geocoded_tweets with the "bing" lexicon, you can see the average frequency and the sentiment associated with each word that exists in both data frames.

Using dplyr verbs to analyze sentiment analysis results - Video

What are the most common sadness words?

After you have implemented sentiment analysis using inner_join(), you can use dplyr functions such as group_by() and summarize() to understand your results. For example, what are the most common words related to sadness in this Twitter dataset?
INSTRUCTIONS
Take a look at the tweets_nrc object, the output of an inner join just like the one you did in the last exercise. Then manipulate it to find the most common words that are related to sadness.
  • Filter only the rows that have words associated with sadness.
  • Group by word to find the average across the United States.
  • Use the summarize() and arrange() verbs to find the average frequency for each word, and then sort.
  • Be aware that this is real data from Twitter and there is some use of profanity; the sentiment lexicons include profane and curse words.
In [ ]:
# Access nrc lexicon: nrc
nrc <- get_sentiments("nrc")

# Use data frame with text data
tweets_nrc <- geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `nrc`
  inner_join(nrc)
In [ ]:
head(tweets_nrc)
In [ ]:
tweets_nrc %>%
  # Filter to only choose the words associated with sadness
  filter(sentiment=="sadness") %>%
  # Group by word
  group_by(word) %>%
  # Use the summarize verb to find the mean frequency
  summarize(freq = mean(freq)) %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))
No need to be sad anymore! Next you will look at words associated with joy.

What are the most common joy words?

You can use the same approach from the last exercise to find the most common words associated with joy in these tweets. Use the same pattern of dplyr verbs to find a new result.
INSTRUCTIONS 100XP
  • First, filter to find only words associated with "joy".
  • Next, group by word.
  • Summarize for each word to find the average frequency freq across the whole United States.
  • Arrange in descending order of frequency.
Now you can make a visualization using ggplot2 to see these results.
  • Load the ggplot2 package.
  • Put words on the x-axis and frequency on the y-axis.
  • Use geom_col() to make a bar chart. (If you are familiar with geom_bar(stat = "identity"), geom_col() does the same thing.)
In [ ]:
# tweets_nrc has been pre-defined
tweets_nrc

joy_words <- tweets_nrc %>%
  # Filter to choose only words associated with joy
  filter(sentiment=="joy") %>%
  # Group by each word
  group_by(word) %>%
  # Use the summarize verb to find the mean frequency
  summarize(freq = mean(freq)) %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))

# Load ggplot2
library(ggplot2)

joy_words %>%
  top_n(20) %>%
  mutate(word = reorder(word, freq)) %>%
  # Use aes() to put words on the x-axis and frequency on the y-axis
  ggplot(aes(x=word, y=freq)) +
  # Make a bar chart with geom_col()
  geom_col() +
  coord_flip()

Looking at differences by state - Video

Do people in different states use different words?

So far you have looked at the United States as a whole, but you can use this dataset to examine differences in word use by state. In this exercise, you will examine two states and compare their use of joy words. Do they use the same words associated with joy? Do they use these words at the same rate?
INSTRUCTIONS 100XP
  • Use the correct dplyr verb to find only the rows for the state of Utah.
  • Add another condition inside the parentheses to find only the rows for the words associated with joy.
  • Use the dplyr verb that arranges a data frame to sort in order of descending frequency.
  • Repeat these steps for the state of Louisiana.
In [ ]:
# tweets_nrc has been pre-defined
tweets_nrc

tweets_nrc %>%
  # Find only the words for the state of Utah and associated with joy
  filter(state == "utah",
      sentiment == "joy") %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))

tweets_nrc %>%
  # Find only the words for the state of Louisiana and associated with joy
  filter(state == "louisiana",
      sentiment == "joy") %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))
Interesting! Words like “baby” and “money” are popular in Louisiana but not in Utah.

Which states have the most positive Twitter users?

For the last exercise in this chapter, you will determine how overall Twitter sentiment varies from state to state. You will use a dataset called tweets_bing, which is the output of an inner join created just the same way that you did earlier. Check out what tweets_bing looks like in the console.
You can use group_by() and summarize() to find which states had the highest frequency of positive and negative words, then pipe to ggplot2 (after some tidyr manipulation) to make a clear, interesting visualization.
INSTRUCTIONS 100XP
  • Choose variables in the call to group_by() so that you can summarize() by state first and then sentiment.
  • After using spread() from tidyr and ungrouping, calculate the ratio of positive to negative words for each state.
  • To make a plot, set up aes() so that states will go on the x-axis and the ratio will go on the y-axis.
  • Add the correct geom_* layer to make points on the plot.
  • The call to coord_flip() flips the axes so you can read the names of the states more easily.
In [ ]:
# Access bing lexicon: bing
bing <- get_sentiments("bing")

# Use data frame with text data
tweets_bing <- geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `bing`
  inner_join(bing)
In [ ]:
library(tidyr)
In [ ]:
?spread
In [ ]:
# tweets_bing has been pre-defined
tweets_bing

tweets_bing %>% 
  # Group by two columns: state and sentiment
  group_by(state, sentiment) %>%
  # Use summarize to calculate the mean frequency for these groups
  summarize(freq = mean(freq)) %>%
  spread(sentiment, freq) %>%
  ungroup() %>%
  # Calculate the ratio of positive to negative words
  mutate(ratio = positive / negative,
         state = reorder(state, ratio)) %>%
  # Use aes() to put state on the x-axis and ratio on the y-axis
  ggplot(aes(state, ratio)) +
  # Make a plot with points using geom_point()
  geom_point() +
  coord_flip()
Wonderful! Combining your data with a sentiment lexicon, you can do all sorts of exploratory data analysis. Looks like Missouri tops the list for this one!

2. Shakespeare gets Sentimental

Your next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.

Tidying Shakespearean plays - Video

In [5]:
load("shakespeare.rda")
In [6]:
head(shakespeare)
title                            type     text
The Tragedy of Romeo and Juliet  Tragedy  The Complete Works of William Shakespeare
The Tragedy of Romeo and Juliet  Tragedy
The Tragedy of Romeo and Juliet  Tragedy  The Tragedy of Romeo and Juliet
The Tragedy of Romeo and Juliet  Tragedy
The Tragedy of Romeo and Juliet  Tragedy  The Library of the Future Complete Works of William Shakespeare
The Tragedy of Romeo and Juliet  Tragedy  Library of the Future is a TradeMark (TM) of World Library Inc.
In [23]:
tail(shakespeare)
title                      type     text
Hamlet, Prince of Denmark  Tragedy
Hamlet, Prince of Denmark  Tragedy
Hamlet, Prince of Denmark  Tragedy
Hamlet, Prince of Denmark  Tragedy
Hamlet, Prince of Denmark  Tragedy  The End of Project Gutenberg Etext of Hamlet by Shakespeare
Hamlet, Prince of Denmark  Tragedy  PG has multiple editions of William Shakespeare's Complete Works

To be, or not to be

Let's take a look at the dataset you will use in this chapter to learn more about tidying text and sentiment analysis. The shakespeare dataset contains three columns:
  • title, the title of a Shakespearean play,
  • type, the type of play, either tragedy or comedy, and
  • text, a line from that play.
This data frame contains the entire texts of six plays.
INSTRUCTIONS 100XP
  • In the console, take a look at the shakespeare object.
  • Pipe the data frame with the Shakespeare texts to the next line.
  • Use count() with two arguments to find out which titles are in this dataset, whether they are tragedies or comedies, and how many lines they have.
In [ ]:
# The data set shakespeare is available in the workspace
shakespeare

# Pipe the shakespeare data frame to the next line
shakespeare %>% 
  # Use count to find out how many titles/types there are
  count(title, type)
In [24]:
# The data set shakespeare is available in the workspace
head(shakespeare)

# Pipe the shakespeare data frame to the next line
shakespeare %>% 
  # Use count to find out how many titles/types there are
  count(title, type)
title                            type     text
The Tragedy of Romeo and Juliet  Tragedy  The Complete Works of William Shakespeare
The Tragedy of Romeo and Juliet  Tragedy
The Tragedy of Romeo and Juliet  Tragedy  The Tragedy of Romeo and Juliet
The Tragedy of Romeo and Juliet  Tragedy
The Tragedy of Romeo and Juliet  Tragedy  The Library of the Future Complete Works of William Shakespeare
The Tragedy of Romeo and Juliet  Tragedy  Library of the Future is a TradeMark (TM) of World Library Inc.

title                            type     n
A Midsummer Night's Dream        Comedy   3459
Hamlet, Prince of Denmark        Tragedy  6776
Much Ado about Nothing           Comedy   3799
The Merchant of Venice           Comedy   4225
The Tragedy of Macbeth           Tragedy  3188
The Tragedy of Romeo and Juliet  Tragedy  4441
Amazing! Passing a dataset and a variable to count() returns the unique values in that variable and the corresponding count n, which in this case, is the number of lines in a play.

Unnesting from text to word

The shakespeare dataset is not yet compatible with tidy tools. You need to first break the text into individual tokens (the process of tokenization); a token is a meaningful unit of text for analysis, in many cases, just synonymous with a single word. You also need to transform the text to a tidy data structure with one token per row. You can use tidytext’s unnest_tokens() function to accomplish all of this at once.
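As a minimal sketch of what unnest_tokens() does, using a made-up two-line data frame rather than the Shakespeare data:
In [ ]:
library(dplyr)
library(tidytext)

toy_text <- tibble(line = 1:2,
                   text = c("To be, or not to be", "that is the question"))

# One row per token; by default words are lowercased and punctuation is stripped
toy_text %>%
  unnest_tokens(word, text)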
INSTRUCTIONS 100XP
  • Load the tidytext package.
  • Group by title to annotate the data frame by line number.
  • Define a new column using mutate() called linenumber that keeps track of which line of the play text is from. (Check out row_number() to do this!)
  • Use unnest_tokens() to transform the non-tidy text data to a tidy text dataset.
  • Pipe the tidy Shakespeare data frame to the next line.
  • Use count() to find out how many times each word is used in Shakespeare's plays.
In [ ]:
# Load tidytext
library(tidytext)

tidy_shakespeare <- shakespeare %>%
  # Group by the titles of the plays
  group_by(title) %>%
  # Define a new column linenumber
  mutate(linenumber = row_number()) %>%
  # Transform the non-tidy text data to tidy text data
  unnest_tokens(word, text) %>%
  ungroup()

# Pipe the tidy Shakespeare data frame to the next line
tidy_shakespeare %>% 
  # Use count to find out how many times each word is used
  count(word, sort = TRUE)
In [11]:
head(tidy_shakespeare)
title                            type     linenumber  word
The Tragedy of Romeo and Juliet  Tragedy  1           the
The Tragedy of Romeo and Juliet  Tragedy  1           complete
The Tragedy of Romeo and Juliet  Tragedy  1           works
The Tragedy of Romeo and Juliet  Tragedy  1           of
The Tragedy of Romeo and Juliet  Tragedy  1           william
The Tragedy of Romeo and Juliet  Tragedy  1           shakespeare
In [26]:
head(tidy_shakespeare %>% 
  # Use count to find out how many times each word is used
  count(word, sort = TRUE))
word  n
the   4651
and   4170
i     3296
to    3047
of    2645
a     2511
Great! Notice how the most common words in the data frame are words like “the”, “and”, and “i” that have no sentiments associated with them. In the next exercise, you'll join the data with a lexicon to implement sentiment analysis.

Sentiment analysis of Shakespeare

You learned how to implement sentiment analysis with a join in the first chapter of this course. After transforming the text of these Shakespearean plays to a tidy text dataset in the last exercise, the resulting data frame tidy_shakespeare is ready for sentiment analysis with such an approach. Once you have performed the sentiment analysis, you can find out how many negative and positive words each play has with just one line of code.
INSTRUCTIONS 100XP
  • Use the correct kind of join to implement sentiment analysis.
  • Add the "bing" lexicon as the argument to the join function.
  • Find how many positive and negative words each play has by using two arguments in count().
In [27]:
shakespeare_sentiment <- tidy_shakespeare %>%
  # Implement sentiment analysis with the "bing" lexicon
  inner_join(get_sentiments("bing"))

shakespeare_sentiment %>%
  # Find how many positive/negative words each play has
  count(title,sentiment)
Joining, by = "word"
title                            sentiment  n
A Midsummer Night's Dream        negative   681
A Midsummer Night's Dream        positive   773
Hamlet, Prince of Denmark        negative   1323
Hamlet, Prince of Denmark        positive   1223
Much Ado about Nothing           negative   767
Much Ado about Nothing           positive   1127
The Merchant of Venice           negative   740
The Merchant of Venice           positive   962
The Tragedy of Macbeth           negative   914
The Tragedy of Macbeth           positive   749
The Tragedy of Romeo and Juliet  negative   1235
The Tragedy of Romeo and Juliet  positive   1090
In [28]:
#Just to make it more readable, using the spread function in tidyr
shakespeare_sentiment %>%
  # Find how many positive/negative words each play has
  count(title,sentiment) %>%
    spread(sentiment, n)
title                            negative  positive
A Midsummer Night's Dream        681       773
Hamlet, Prince of Denmark        1323      1223
Much Ado about Nothing           767       1127
The Merchant of Venice           740       962
The Tragedy of Macbeth           914       749
The Tragedy of Romeo and Juliet  1235      1090
Fabulous! Passing two variables to count() returns the count n for each unique combination of the two variables. In this case, you have 6 plays and 2 sentiments, so count() returns 6 x 2 = 12 rows.

Using count and mutate - Video

In [29]:
#write.csv(shakespeare, "shakespeare.csv", row.names=FALSE)

Tragedy or comedy?

The tidy dataset you created, tidy_shakespeare, is again available in your environment. Which plays have a higher percentage of negative words? Do the tragedies have more negative words than the comedies?
INSTRUCTIONS 100XP
  • First, calculate how many negative and positive words each play used.
    • Implement sentiment analysis using the "bing" lexicon.
    • Use count() to find the number of words for each combination of title, type, and sentiment.
  • Now, find the percentage of negative words for each play.
    • Group by the titles of the plays.
    • Find the total number of words in each play using sum().
    • Calculate a percent for each play that is the number of words of each sentiment divided by the total words in that play.
    • Filter the results for only negative sentiment.
In [30]:
sentiment_counts <- tidy_shakespeare %>%
    # Implement sentiment analysis using the "bing" lexicon
    inner_join(get_sentiments("bing")) %>%
    # Count the number of words by title, type, and sentiment
    count(title,type, sentiment)

sentiment_counts %>%
    # Group by the titles of the plays
    group_by(title) %>%
    # Find the total number of words in each play
    mutate(total = sum(n),
    # Calculate the number of words divided by the total
           percent = n/total) %>%
    # Filter the results for only negative sentiment
    filter(sentiment=="negative") %>%
    arrange(percent)
Joining, by = "word"
title                            type     sentiment  n     total  percent
Much Ado about Nothing           Comedy   negative   767   1894   0.4049630
The Merchant of Venice           Comedy   negative   740   1702   0.4347826
A Midsummer Night's Dream        Comedy   negative   681   1454   0.4683631
Hamlet, Prince of Denmark        Tragedy  negative   1323  2546   0.5196386
The Tragedy of Romeo and Juliet  Tragedy  negative   1235  2325   0.5311828
The Tragedy of Macbeth           Tragedy  negative   914   1663   0.5496091

Most common positive and negative words

You found in the previous exercise that Shakespeare's tragedies use proportionally more negative words than the comedies. Now you can explore which specific words are driving these sentiment scores. Which are the most common positive and negative words in these plays?
There are three steps in the code in this exercise. The first step counts how many times each word is used, the second step takes the top 10 most-used positive and negative words, and the final step makes a plot to visualize this result.
INSTRUCTIONS 100XP
  • Implement sentiment analysis using the "bing" lexicon.
  • Use count() to find word counts by sentiment.
  • Group by sentiment so you can take the top 10 words in each sentiment.
  • Notice what the line mutate(word = reorder(word, n)) does; it converts word from a character that would be plotted in alphabetical order to a factor that will be plotted in order of n.
Now you can make a visualization of top_words using ggplot2 to see these results.
  • Put word on the x-axis and n on the y-axis.
  • Use geom_col() to make a bar chart.
In [31]:
word_counts <- tidy_shakespeare %>%
      # Implement sentiment analysis using the "bing" lexicon
      inner_join(get_sentiments("bing")) %>%
      # Count by word and sentiment
      count(word,sentiment)

top_words <- word_counts %>%
  # Group by sentiment
  group_by(sentiment) %>%
  # Take the top 10 for each sentiment
  top_n(10) %>%
  ungroup() %>%
  # Make word a factor in order of n
  mutate(word = reorder(word, n))

# Use aes() to put words on the x-axis and n on the y-axis
ggplot(top_words, aes(word, n, fill = sentiment)) +
  # Make a bar chart with geom_col()
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip()
Joining, by = "word"
Selecting by n
You did it! Death is pretty negative and love is positive, but are there words in that list that had different connotations during Shakespeare's time? Do you see a word that the lexicon has misidentified?

Which word was misidentified?

In the last exercise, you found the top 10 words that contributed most to negative sentiment in these Shakespearean plays, but lexicons are not always foolproof tools for use with all kinds of text. Which of those top 10 negative words used by Shakespeare was actually misidentified as negative by the sentiment lexicon?
ANSWER THE QUESTION
Possible Answers
- death
- wilt (Correct)
- poor
- mad
- die
Correct! The word “wilt” was used differently in Shakespeare's time and was not negative; the lexicon has misidentified it. For example, from Romeo and Juliet, “For thou wilt lie upon the wings of night”. It is important to explore the details of how words were scored when performing sentiment analyses.

Sentiment contributions by individual words - Video

Word contributions by play

You have already explored how words contribute to sentiment scores for Shakespeare's plays as a whole. In this exercise, you will look at differences between titles. You will also practice using a different sentiment lexicon, the "afinn" lexicon in which words have a score from -5 to 5. Different lexicons take different approaches to quantifying the emotion/opinion content of words.
Which words contribute to the overall sentiment in which plays? In this exercise, you will look specifically at Macbeth.
INSTRUCTIONS 100XP
  • Use count() to find how many times each word is used in each play.
  • Implement sentiment analysis with the "afinn" lexicon. (Notice that it is possible to perform sentiment analysis on count data, not only the original tidy data frame.)
  • Filter to only look at the sentiment scores for Macbeth; the title for Macbeth is "The Tragedy of Macbeth".
  • In a second argument to filter(), only examine words with negative sentiment.
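Before running the exercise, it can help to peek at the lexicon itself. A quick check (note that the numeric column is called score in the version used here; newer tidytext releases name it value instead):
In [ ]:
# Look at the first few rows of the "afinn" lexicon
head(get_sentiments("afinn"))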
In [ ]:
tidy_shakespeare %>%
  # Count by title and word
  count(title, word, sort = TRUE) %>%
  # Implement sentiment analysis using the "afinn" lexicon
  inner_join(get_sentiments("afinn")) %>%
  # Filter to only examine the scores for Macbeth that are negative
  filter(title == "The Tragedy of Macbeth", score < 0)
In [12]:
#by me - for display, adding head() at the end
tidy_shakespeare %>%
  # Count by title and word
  count(title, word, sort = TRUE) %>%
  # Implement sentiment analysis using the "afinn" lexicon
  inner_join(get_sentiments("afinn")) %>%
  # Filter to only examine the scores for Macbeth that are negative
  filter(title == "The Tragedy of Macbeth", score < 0) %>% head()
Joining, by = "word"
title                   word     n   score
The Tragedy of Macbeth  no       73  -1
The Tragedy of Macbeth  fear     35  -2
The Tragedy of Macbeth  death    20  -2
The Tragedy of Macbeth  bloody   16  -3
The Tragedy of Macbeth  poor     16  -2
The Tragedy of Macbeth  strange  16  -1
Marvelous! Notice the use of words specific to Macbeth like “bloody”.

Calculating a contribution score

In the last exercise, you saw how words in Macbeth were used a different number of times and also had different sentiment scores in the "afinn" lexicon, from -5 to 5. Since this lexicon provides these scores for each word, you can calculate a relative contribution for each word in each play. This contribution can be found by multiplying the score for each word by the number of times it is used in each play and dividing by the total words in each play.
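As a rough worked example with made-up numbers, matching the formula used in the code below:
In [ ]:
# A word with afinn score -3, used 16 times, in a play whose summed word count
# (the sum(n) in the pipeline below) is 4000
16 * -3 / 4000   # contribution of -0.012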
INSTRUCTIONS
  • Use count() to find how many times each word is used in each play.
  • Implement sentiment analysis with the "afinn" lexicon.
  • Group by the titles of the plays to get ready to calculate a total for each play in the next line.
  • Calculate a contribution for each word in each play; the contribution can be found by multiplying each word's score by the number of times it is used in the play and dividing by the total words in the play.
In [13]:
sentiment_contributions <- tidy_shakespeare %>%
  # Count by title and word
  count(title, word, sort = TRUE) %>%
  # Implement sentiment analysis using the "afinn" lexicon
  inner_join(get_sentiments("afinn")) %>%
  # Group by title
  group_by(title) %>%
  # Calculate a contribution for each word in each title
  mutate(contribution = n*score/sum(n)) %>%
  ungroup()
    
sentiment_contributions
Joining, by = "word"
title                            word  n    score  contribution
Hamlet, Prince of Denmark        no    143  -1     -0.06520748
The Tragedy of Romeo and Juliet  love  140   3      0.21319797
Much Ado about Nothing           no    132  -1     -0.07683353
Much Ado about Nothing           hero  114   2      0.13271246
A Midsummer Night's Dream        love  110   3      0.26982829
Hamlet, Prince of Denmark        good  109   3      0.14911081
Nice work! Notice that “hero” shows up in your results there; that is the name of one of the characters in “Much Ado About Nothing”.

Alas, poor Yorick!

The sentiment_contributions that you calculated in the last exercise is available in your environment. It's time to explore some of your results! Look at Hamlet and The Merchant of Venice to see what negative and positive words are important in these two plays.
INSTRUCTIONS 100XP
  • Look at sentiment_contributions in the console and use whatever strategy you like to find the exact titles for Hamlet and The Merchant of Venice. (Perhaps count()?)
  • Filter for Hamlet and arrange() in ascending order (the default order) of contribution to see the words that contributed most negatively.
  • Filter for The Merchant of Venice and arrange() in descending order of contribution to see the words that contributed most positively.
In [112]:
sentiment_contributions %>%
  # Filter for Hamlet
  filter(title == "Hamlet, Prince of Denmark") %>%
  # Arrange to see the most negative words
  arrange(contribution) 

sentiment_contributions %>%
  # Filter for The Merchant of Venice
  filter(title == "The Merchant of Venice") %>%
  # Arrange to see the most positive words
  arrange(desc(contribution))
title                      word     n    score  contribution
Hamlet, Prince of Denmark  no       143  -1     -0.06520748
Hamlet, Prince of Denmark  dead     33   -3     -0.04514364
Hamlet, Prince of Denmark  death    38   -2     -0.03465572
Hamlet, Prince of Denmark  madness  22   -3     -0.03009576
Hamlet, Prince of Denmark  mad      21   -3     -0.02872777
Hamlet, Prince of Denmark  fear     21   -2     -0.01915185

title                   word   n   score  contribution
The Merchant of Venice  good   63  3      0.12945205
The Merchant of Venice  love   60  3      0.12328767
The Merchant of Venice  fair   35  2      0.04794521
The Merchant of Venice  like   34  2      0.04657534
The Merchant of Venice  true   24  2      0.03287671
The Merchant of Venice  sweet  23  2      0.03150685
Amazing! These are definitely characteristic words for these two plays.

Which words are important in each play? - Video

Sentiment changes through a play

In the last set of exercises in this chapter, you will examine how sentiment changes through the narrative arcs of these Shakespearean plays. We will start by first implementing sentiment analysis using inner_join(), and then use count() with four arguments:
  • title,
  • type,
  • an index that will section together lines of the play, and
  • sentiment.
After these lines of code, you will have the number of positive and negative words used in each index-ed section of the play. These sections will be 70 lines long in your analysis here. You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure). In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines work well for these plays.
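Integer division with %/% is what creates these chunks; a quick sketch with made-up line numbers:
In [ ]:
# Each line number is assigned to a 70-line chunk via integer division
c(1, 69, 70, 140, 141) %/% 70
# Returns 0 0 1 2 2, i.e. lines 1-69 fall in chunk 0, line 70 starts chunk 1, and so on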
INSTRUCTIONS 100XP
Implement sentiment analysis using the "bing" lexicon.
  • Use count() to find the number of words for each sentiment used in each play in sections, using four arguments.
  • The first argument for count() maps to the plays themselves.
  • The second argument keeps track of whether the play is a comedy or tragedy.
  • The third argument is defined by you; call it index and set it equal to linenumber %/% 70. This index makes chunks of text that are 70 lines long using integer division (%/%).
  • The fourth argument maps to the different sentiment categories.
In [19]:
tidy_shakespeare %>%
  # Implement sentiment analysis using "bing" lexicon
  inner_join(get_sentiments("bing")) %>%
  # Count using four arguments
  count(title, type, index = linenumber %/% 70 , sentiment)
Joining, by = "word"
title                      type    index  sentiment  n
A Midsummer Night's Dream  Comedy  0      negative   4
A Midsummer Night's Dream  Comedy  0      positive   11
A Midsummer Night's Dream  Comedy  1      negative   7
A Midsummer Night's Dream  Comedy  1      positive   19
A Midsummer Night's Dream  Comedy  2      negative   20
A Midsummer Night's Dream  Comedy  2      positive   23
Excellent! This is the first step in looking at narrative arcs.

Calculating net sentiment

Now you will build on the code from the previous exercise and continue to move forward to see how sentiment changes through these Shakespearean plays. The next steps involve spread() from the tidyr package. After these lines of code, you will have the net sentiment in each index-ed section of the play; net sentiment is the negative sentiment subtracted from the positive sentiment.
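Here is a minimal sketch of the spread() step on a made-up data frame (not the play data): each sentiment becomes its own column, so the net sentiment is a simple subtraction.
In [ ]:
library(dplyr)
library(tidyr)

toy <- tibble(index = c(0, 0, 1),
              sentiment = c("negative", "positive", "positive"),
              n = c(4, 11, 19))

toy %>%
  spread(sentiment, n, fill = 0) %>%        # columns: index, negative, positive
  mutate(sentiment = positive - negative)   # net sentiment per index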
INSTRUCTIONS 100XP
  • Load the tidyr package.
  • Use spread() to spread sentiment and n across multiple columns.
  • Take a look at the output of the process after the spread() line in the console.
  • Make a new column using mutate() that has the net sentiment found by subtracting negative sentiment from positive.
In [20]:
# Load the tidyr package
library(tidyr)

tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  # Spread sentiment and n across multiple columns
  spread(sentiment, n, fill = 0) %>%
  # Use mutate to find net sentiment
  mutate(sentiment = positive - negative)
Joining, by = "word"
title                      type    index  negative  positive  sentiment
A Midsummer Night's Dream  Comedy  0      4         11        7
A Midsummer Night's Dream  Comedy  1      7         19        12
A Midsummer Night's Dream  Comedy  2      20        23        3
A Midsummer Night's Dream  Comedy  3      12        18        6
A Midsummer Night's Dream  Comedy  4      9         27        18
A Midsummer Night's Dream  Comedy  5      11        21        10
Fabulous! You are closer to plotting the sentiment through these plays.
In [22]:
library(tidyr)
# Load the ggplot2 package
library(ggplot2)

tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  # Put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
  ggplot(aes(index, sentiment, fill=type)) +
  # Make a bar chart with geom_col()
  geom_col() +
  # Separate panels for each title with facet_wrap()
  facet_wrap(~title,scales = "free_x")
Joining, by = "word"
Awesome! These plots show how sentiment changes through these plays. Notice how the comedies have happier endings and more positive sentiment than the tragedies.

3. Analyzing TV News

Text analysis using tidy principles can be applied to diverse kinds of text, and in this chapter, you will explore a dataset of closed captioning from television news. You will apply the skills you have learned so far to explore how different stations report on a topic with different words, and how sentiment changes with time.
In [25]:
load("climate_text.rda")
head(climate_text)
station | show | show_date | text
MSNBC | Morning Meeting | 2009-09-22 13:00:00 | the interior positively oozes class raves car magazine slick and sensuous boasts the washington times the most striking vw in recent memory declares okay i get it already i think we were in a car commercial yeah yeah we join the president at the u.n on climate change climate change is serious
MSNBC | Morning Meeting | 2009-10-23 13:00:00 | corporations have withdrawn from the chamber of commerce because of their disagreement with the chamber on the subject of climate change we asked the chamber for comment on the latest legal fight and they referred us to their original statement on the yes men which they said public relations hoax is undermine the effort to find real solutions on the challenge of climate change joining us now on the set is mike part of the activism group called the yes american what is your objective forget climate change
CNN | CNN Newsroom | 2009-12-03 20:00:00 | he says he was bumped by the greeter but cops didn't buy it and the guy is now under arrest all right you're looking at two years in the life of a glacier right there here's my question climate change
CNN | American Morning | 2009-12-07 11:00:00 | especially at at time now where the climate change conference is beginning in copenhagen a lot of the world paying very close attention to this controversy and the issue in general our president headed there next week that e mail scandal may overshadow what we've been talking about this big u.n conference on climate change
MSNBC | Morning Meeting | 2009-12-08 14:00:00 | lots more coming up quite simply here green peace activists going nuts like this is new anyway a few study claiming that global warming itself my friends could actually drive you nuts climate change
MSNBC | Countdown With Keith Olbermann | 2009-12-10 06:00:00 | so they're carrying a lot of water for john yoo judgment at nuremberg but never judgment at the white house jonathan turley of george washington university as always great thanks for making it intelligible to folks like me thanks keith sarah palin admits climate change e mails were stolen admits there is climate change

Tidying TV news

Take a look at the dataset of TV news text about climate change you will use in this chapter. The climate_text dataset contains almost 600 closed captioning snippets and four columns:
  • station, the TV news station where the text is from,
  • show, the show on that station where the text was spoken,
  • show_date, the broadcast date of the spoken text, and
  • text, the actual text spoken on TV.
Type climate_text in the console to take a look at the dataset before getting started with transforming it to a tidy format.
INSTRUCTIONS 100XP
  • Load the tidytext package.
  • Pipe the original dataset to the next line.
  • Use unnest_tokens() to transform the non-tidy text data to a tidy text dataset, with a word column in the output.
In [26]:
# Load the tidytext package
library(tidytext)

# Pipe the climate_text dataset to the next line
tidy_tv <- climate_text %>%
    # Transform the non-tidy text data to tidy text data
    unnest_tokens(word,text)
In [ ]:
#parts_of_speech         Parts of speech for English words from the Moby Project
dim(parts_of_speech)
count(parts_of_speech, pos)

Counting totals

Now that you have transformed the TV news data to a tidy data structure, you can find out what words are most common when discussing climate change on TV news, as well as the total number of words from each station. These are both helpful exploratory steps before moving on in analysis!
INSTRUCTIONS 100XP
  • Find the most common words in this dataset with count() using sort = TRUE. (The command anti_join(stop_words) removes common words like "and", "of", and "to.")
  • You will now calculate the total number of words from each station, a quantity you'll use to find proportions later.

  • Use count() with one argument to find how many words came from each station.
  • Change the name of the new column with rename() so that it is called station_total instead of n.
In [84]:
tidy_tv %>% 
    anti_join(stop_words) %>%
    # Count by word with sort = TRUE
    count(word, sort=TRUE) 
    
tidy_tv %>%
    # Count by station
    count(station) %>%
    # Rename the new column station_total
    rename(station_total=n)
Joining, by = "word"
word       n
climate    1627
change     1615
people     139
real       125
president  112
global     107

station   station_total
CNN       10713
FOX News  10876
MSNBC     19487
Great job! Notice that common words include “issue”, “global”, and “job”.
In [82]:
edx = readLines("WY-6JLNfM-s.txt")
edx = data.frame(sno = 1:length(edx), text = as.character(edx))
head(edx)

edx$text = as.character(edx$text)

edx_tidy = edx %>% unnest_tokens(word, text)

edx_count = edx_tidy %>% count(word, sort=TRUE)

edx_clean = edx_count %>% anti_join(stop_words)

head(edx_clean)
sno  text
1
2    Hi, and welcome to week 6 of 15.053x.
3    And this week we're going to talk
4    about nonlinear programming.
5
6    Non-linear optimization and non-linear programming
Joining, by = "word"
word          n
linear        8
optimization  6
programming   5
week          5
convex        3
talk          3

Sentiment analysis of TV news

After transforming the TV news texts to a tidy format in a previous exercise, the resulting data frame tidy_tv is ready for sentiment analysis using tidy data principles. Before you implement the inner join, add a new column with the total number of words from each station so you can calculate proportions soon.
INSTRUCTIONS 100XP
  • Define groups for each station in the dataset using group_by().
  • Make a new column called station_total in the dataframe that tallies the total number of words from each station; the mutate() verb will make a new column and the function n() counts the number of observations in the current group.
  • Finally, implement sentiment analysis using the correct kind of join and the "nrc" lexicon as the argument to the join function.
In [85]:
tv_sentiment <- tidy_tv %>% 
    # Group by station
    group_by(station) %>% 
    # Define a new column station_total
    mutate(station_total = n()) %>%
    ungroup() %>%
    # Implement sentiment analysis with the NRC lexicon
    inner_join(get_sentiments("nrc"))
Joining, by = "word"
In [86]:
head(tv_sentiment)
station  show             show_date            word      station_total  sentiment
MSNBC    Morning Meeting  2009-09-22 13:00:00  interior  19487          disgust
MSNBC    Morning Meeting  2009-09-22 13:00:00  interior  19487          positive
MSNBC    Morning Meeting  2009-09-22 13:00:00  interior  19487          trust
MSNBC    Morning Meeting  2009-09-22 13:00:00  sensuous  19487          joy
MSNBC    Morning Meeting  2009-09-22 13:00:00  sensuous  19487          positive
MSNBC    Morning Meeting  2009-09-22 13:00:00  striking  19487          positive
In [88]:
unique(get_sentiments("nrc")$sentiment)
  1. 'trust'
  2. 'fear'
  3. 'negative'
  4. 'sadness'
  5. 'anger'
  6. 'surprise'
  7. 'positive'
  8. 'disgust'
  9. 'joy'
  10. 'anticipation'
Fabulous! You have implemented sentiment analysis with an inner join once again.
In [89]:
filter(get_sentiments("nrc"), word == "interior")
word      sentiment
interior  disgust
interior  positive
interior  trust
In [92]:
filter(get_sentiments("nrc"), word == "congress")
word      sentiment
congress  disgust
congress  trust

Which station uses the most positive or negative words?

You performed sentiment analysis on this dataset of TV news text, and the results are available in your environment in tv_sentiment. How do the words used when discussing climate change compare across stations? Which stations use more positive words? More negative words?
INSTRUCTIONS 100XP
Start off by looking at negative words.
  • Define a new column percent using mutate() that is n divided by station_total, the proportion of words that belong to that sentiment.
  • Filter only for the negative sentiment rows.
  • Arrange by percent so you can see the results sorted by proportion of negative words.
  • Now repeat these steps to examine positive words!
In [90]:
# Which stations use the most negative words?
tv_sentiment %>% 
    count(station, sentiment, station_total) %>%
    # Define a new column percent
    mutate(percent = n/station_total) %>%
    # Filter only for negative words
    filter(sentiment == "negative") %>%
    # Arrange by percent
    arrange(percent)
    
# Now do the same but for positive words
tv_sentiment %>% 
    count(station, sentiment, station_total) %>%
    # Define a new column percent
    mutate(percent = n/station_total) %>%
    # Filter only for positive words
    filter(sentiment == "positive") %>%
    # Arrange by percent
    arrange(percent)
station   sentiment  station_total  n    percent
MSNBC     negative   19487          526  0.02699235
CNN       negative   10713          331  0.03089704
FOX News  negative   10876          403  0.03705406

station   sentiment  station_total  n    percent
FOX News  positive   10876          514  0.04726002
CNN       positive   10713          522  0.04872585
MSNBC     positive   19487          953  0.04890440
Wonderful! Notice that MSNBC used a low proportion of negative words but a high proportion of positive words; the reverse is true of FOX News, and CNN is in the middle of the pack.

Which words contribute to the sentiment scores?

It's important to understand which words specifically are driving sentiment scores, and when you use tidy data principles, it's not too difficult to check. In this exercise, you will make a plot showing which words contribute the most to the ten types of sentiment in the NRC lexicon. Look at the result, and think about which words might not be appropriate in these contexts. Are there proper names? Are there words which, used in these contexts, are neutral?
If so, you can always remove these words from your dataset (or the sentiment lexicon) using anti_join().
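For example, a minimal sketch of that kind of cleanup (the words listed here are just illustrative choices, not part of the exercise):
In [ ]:
library(dplyr)

# Hypothetical list of words to treat as neutral in this context
neutral_words <- tibble(word = c("gore", "trump"))

# anti_join() keeps only the rows whose word is NOT in neutral_words
tv_sentiment %>%
  anti_join(neutral_words, by = "word")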
INSTRUCTIONS 100XP
  • Count by word and sentiment to find which words are contributing most overall to the sentiment scores.
  • Group by sentiment.
  • Take the top 10 words for each sentiment using top_n().
  • Set up the plot using aes(), with the words on the x-axis, the number of uses n on the y-axis, and fill corresponding to sentiment.
In [91]:
tv_sentiment %>%
    # Count by word and sentiment
    count(word, sentiment) %>%
    # Group by sentiment
    group_by(sentiment) %>%
    # Take the top 10 words for each sentiment
    top_n(10) %>%
    ungroup() %>%
    mutate(word = reorder(word, n)) %>%
    # Set up the plot with aes()
    ggplot(aes(word,n, fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ sentiment, scales = "free") +
    coord_flip()
Selecting by n
Excellent! Notice that you see proper names like Gore and Trump, which should be treated as neutral, and that “change” was a strong driver of fear sentiment, even though it is by definition part of these texts on climate change. It is important to see which words contribute to your sentiment scores so you can adjust the sentiment lexicons if appropriate.

Word choice and TV station

In the last exercise, you saw which words contributed to which sentiment for this dataset of closed captioning texts about climate change from TV news stations. Now it's time to explore the different words that each station used in the context of discussing climate change. Which negative words did each station use when talking about climate change on the air?
INSTRUCTIONS 100XP
  • Filter for only negative words.
  • Count by word and station to find which words are contributing most overall to the sentiment scores.
  • Group by TV station.
  • Take the top 10 words for each station.
  • Set up the plot using aes(), with the words on the x-axis, the number of uses n on the y-axis, and fill corresponding to station.
In [93]:
tv_sentiment %>%
    # Filter for only negative words
    filter(sentiment == "negative") %>%
    # Count by word and station
    count(word, station) %>%
    # Group by station
    group_by(station) %>%
    # Take the top 10 words for each station
    top_n(10) %>%
    ungroup() %>%
    mutate(word = reorder(paste(word, station, sep = "__"), n)) %>%
    # Set up the plot with aes()
    ggplot(aes(x=word, y=n, fill=station)) +
    geom_col(show.legend = FALSE) +
    scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
    facet_wrap(~ station, nrow = 2, scales = "free") +
    coord_flip()
Selecting by n
You did it! Some words, like “threat”, are used by all three stations, but some word choices are quite different. FOX News talks about terrorism and hurricanes, while CNN discusses hoaxes.

Sentiment changes with time - Video

Visualizing sentiment over time

You have compared how TV stations use positive and negative words; now it is time to see how sentiment is changing over time. Are TV news stations using more negative words as time passes? More positive words? You will use a function called floor_date() from the lubridate package to count uses of positive and negative words over time.
The tidy_tv dataframe you created near the beginning of this chapter is available in your environment.
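As a quick sketch of what floor_date() does, on made-up dates (each date is rounded down to the start of its half-year):
In [ ]:
library(lubridate)

# Dates invented for illustration
floor_date(as.Date(c("2013-02-15", "2013-08-03", "2014-11-30")), unit = "6 months")
# "2013-01-01" "2013-07-01" "2014-07-01"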
INSTRUCTIONS 100XP
  • Load the lubridate package.
  • Define the new column with a mutate() statement using the floor_date() function, rounding each date down to the nearest 6-month unit.
  • Group by the new date column (each 6 months).
  • Implement sentiment analysis using the correct kind of join and the "nrc" sentiment lexicon.
  • Now you have a dataframe with the number of words per sentiment per 6 months, as well as the total words used in each 6 months!

  • Filter for both positive and negative words so you can plot both.
  • Count with three arguments: date, sentiment, and the total number of words.
  • Set up your plot with aes() and put date on the x-axis, percent on the y-axis, and have color correspond to sentiment.
In [95]:
# Load the lubridate package
library(lubridate)

sentiment_by_time <- tidy_tv %>%
    # Define a new column using floor_date()
    mutate(date = floor_date(show_date, unit = "6 months")) %>%
    # Group by date
    group_by(date) %>%
    mutate(total_words = n()) %>%
    ungroup() %>%
    # Implement sentiment analysis using the NRC lexicon
    inner_join(get_sentiments("nrc"))

sentiment_by_time %>%
    # Filter for positive and negative words
    filter(sentiment %in% c("positive", "negative")) %>%
    # Count by date, sentiment, and total_words
    count(date, sentiment, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words) %>%
    # Set up the plot with aes()
    ggplot(aes(date, percent, col=sentiment)) +
    geom_line(size = 1.5) +
    geom_smooth(method = "lm", se = FALSE, lty = 2) +
    expand_limits(y = 0)
Joining, by = "word"
Nice job on a complex task! The proportion of positive words looks flat, and the proportion of negative words may be increasing.

Word changes over time

You can also use tidy data principles to explore how individual words have been used over time. In the final exercise of this chapter, you will take the tidy_tv dataframe you created earlier and make a plot to see how certain words used in the context of climate change have changed in use over time. You will again use the floor_date() function, but this time to count monthly uses of words.
INSTRUCTIONS 100XP
  • Define a new column within the mutate() statement with the floor_date() function, rounding each date down to the nearest 1-month unit.
  • Count with 2 arguments: date and word.
  • Set up your plot with aes() so that date is on the x-axis, n (the monthly number of uses) is on the y-axis, and color corresponds to word.
  • Use facet_wrap to make a separate panel in your plot for each word.
In [96]:
tidy_tv %>%
    # Define a new column that rounds each date to the nearest 1 month
    mutate(date = floor_date(show_date, unit = "1 months")) %>%
    filter(word %in% c("threat", "hoax", "denier",
                       "real", "warming", "hurricane")) %>%
    # Count by date and word
    count(date, word) %>%
    ungroup() %>%
    # Set up your plot with aes()
    ggplot(aes(date, n, col=word)) +
    # Make facets by word
    facet_wrap(~word) +
    geom_line(size = 1.5, show.legend = FALSE) +
    expand_limits(y = 0)
What an interesting plot! You can see that words like “hoax” and “denier” have been used only recently, and “warming” is decreasing in monthly uses. You can see when a hurricane was being discussed as well.

Singing a Happy Song (or Sad?!) - Video

In this final chapter on sentiment analysis using tidy principles, you will explore pop song lyrics that have topped the charts from the 1960s to today. You will apply all the techniques we have explored together so far, and use linear modeling to find what the sentiment of song lyrics can predict.
In [100]:
load("song_lyrics.rda")
head(song_lyrics,3)
rank | song | artist | year | lyrics
1 | wooly bully | sam the sham and the pharaohs | 1965 | sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs domingo samudio uno dos one two tres quatro matty told hatty about a thing she saw had two big horns and a wooly jaw wooly bully wooly bully wooly bully wooly bully wooly bully hatty told matty lets dont take no chance lets not belseven come and learn to dance wooly bully wooly bully wooly bully wooly bully wooly bully matty told hatty thats the thing to do get you someone really to pull the wool with you wooly bully wooly bully wooly bully wooly bully wooly bully lseven the letter l and the number 7 when typed they form a rough square l7 so the lyrics mean lets not be square
2 | i cant help myself sugar pie honey bunch | four tops | 1965 | sugar pie honey bunch you know that i love you i cant help myself i love you and nobody elsein and out my life you come and you go leaving just your picture behind and i kissed it a thousand timeswhen you snap your finger or wink your eye i come arunning to you im tied to your apron strings and theres nothing that i can docant help myself no i cant help myselfsugar pie honey bunch im weaker than a man should be i cant help myself im a fool in love you seewanna tell you i dont love you tell you that were through and ive tried but every time i see your face i get all choked up insidewhen i call your name girl it starts the flame burning in my heart tearing it all apart no matter how i try my love i cannot hidecause sugar pie honey bunch you know that im weak for you cant help myself i love you and nobody elsesugar pie honey bunch do anything you ask me to cant help myself i want you and nobody elsesugar pie honey bunch you know that i love you i cant help myself i cant help myself
4 | you were on my mind | we five | 1965 | when i woke up this morning you were on my mind and you were on my mind i got troubles whoaoh i got worries whoaoh i got wounds to bind so i went to the corner just to ease my pains yeah just to ease my pains i got troubles whoaoh i got worries whoaoh i came home again when i woke up this morning you were on my miiiind and you were on my mind i got troubles whoaoh i got worries whoaoh i got wounds to bind and i got a feelin down in my shooooooes said way down in my shooooes yeah i got to ramble whoaoh i got to move on whoaoh i got to walk away my blues when i woke up this morning you were on my mind you were on my mind i got troubles whoaoh i got worries whoaoh i got wounds to bind

Tidying song lyrics

Let's take a look at the dataset you will use in this final chapter to practice your sentiment analysis skills. The song_lyrics dataset contains five columns:
  • rank, the rank a song achieved on the Billboard Year-End Hot 100,
  • song, the song's title,
  • artist, the artist who recorded the song,
  • year, the year the song reached the given rank on the Billboard chart, and
  • lyrics, the lyrics of the song.
This dataset contains over 5000 songs, from 1965 to the present. The lyrics are all in one column, so they are not yet in a tidy format, ready for analysis using tidy tools. It's your turn to tidy this text data!
INSTRUCTIONS 100XP
  • Load the tidytext package.
  • Pipe the song_lyrics object to the next line.
  • Use unnest_tokens() to unnest the lyrics column into a new word column.
In [101]:
# Load the tidytext package
library(tidytext)

# Pipe song_lyrics to the next line
tidy_lyrics <- song_lyrics %>% 
  # Transform the lyrics column to a word column
  unnest_tokens(word, lyrics)

# Print tidy_lyrics
tidy_lyrics
rank  song         artist                         year  word
1     wooly bully  sam the sham and the pharaohs  1965  sam
1     wooly bully  sam the sham and the pharaohs  1965  the
1     wooly bully  sam the sham and the pharaohs  1965  sham
1     wooly bully  sam the sham and the pharaohs  1965  miscellaneous
1     wooly bully  sam the sham and the pharaohs  1965  wooly
1     wooly bully  sam the sham and the pharaohs  1965  bully
Great work! The unnest_tokens() function tokenizes the input column into words by default.

Calculating total words per song

For some next steps in this analysis, you need to know the total number of words sung in each song. Use count() to count up the words per song, and then left_join() these word totals to the tidy data set. You can specify exactly which column to use when joining the two data frames if you add by = "song".
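The difference from inner_join() is that left_join() keeps every row of the left-hand data frame and simply attaches the matching total; a made-up sketch:
In [ ]:
library(dplyr)

# Toy data frames invented for illustration
toy_words  <- tibble(song = c("a", "a", "b"), word = c("hello", "world", "hi"))
toy_totals <- tibble(song = c("a", "b"), total_words = c(2, 1))

# Every row of toy_words is kept, with total_words attached by song
toy_words %>%
  left_join(toy_totals, by = "song")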
INSTRUCTIONS 100XP
  • Count by song to find the word totals.
  • With the rename() function, change the name of the new n column to total_words.
  • Use left_join() to combine total with tidy_lyrics using the song column.
In [102]:
totals <- tidy_lyrics %>%
  # Count by song to find the word totals for each song
  count(song) %>%
  # Rename the new column
  rename(total_words = n)

# Print totals    
totals

lyric_counts <- tidy_lyrics %>%
  # Combine totals with tidy_lyrics using the "song" column
  left_join(totals, by = "song")
song                   total_words
0 to 100 the catch up  894
1 2 3 4 sumpin new     670
1 2 3 red light        145
1 2 step               437
1 thing                532
100 pure love          590
Excellent! Now you have the total number of words for each song.

Sentiment analysis on song lyrics

You have been practicing how to implement sentiment analysis with a join throughout this course. After transforming the text of these songs to a tidy text dataset and preparing the data frame, the resulting data frame lyric_counts is ready for you to perform sentiment analysis once again. Once you have done the sentiment analysis, you can learn which songs have the most sentiment words from the NRC lexicon. Remember that the NRC lexicon has 10 categories of sentiment:
  • anger
  • anticipation
  • disgust
  • fear
  • joy
  • negative
  • positive
  • sadness
  • surprise
  • trust
INSTRUCTIONS 100XP
  • Use the correct kind of join to implement sentiment analysis.
  • Add the "nrc" lexicon as the argument to the join function.
  • Find the songs with the most sentiment words by using two arguments in count(), along with sort = TRUE.
In [104]:
lyric_sentiment <- lyric_counts %>%
    # Implement sentiment analysis with the "nrc" lexicon
    inner_join(get_sentiments("nrc"))

lyric_sentiment %>%
    # Find how many sentiment words each song has
    count(song, sentiment, sort = TRUE)
Joining, by = "word"
song            sentiment  n
baby            positive   264
baby            joy        255
real love       positive   213
angel           positive   193
disturbia       negative   182
live your life  positive   174
my love         positive   173
angel           joy        164
damn            negative   164
disturbia       sadness    164
Fantastic! The song “Baby” has the highest number of positive words while “Disturbia” has the highest number of negative words.

The most positive and negative songs

You have successfully implemented sentiment analysis on this dataset of song lyrics, and now you can ask questions such as, "Which songs have the highest proportion of positive words? Of negative words?" You calculated the total number of words for each song earlier, so now you need to count the number of words for each sentiment and song.
INSTRUCTIONS 100XP
  • Use count() with three arguments to find the number of sentiment words for each song and total number of words.
  • Make a new column using mutate() that is named percent, equal to n (the output of count()) divided by the total number of words.
  • Filter for only negative words.
  • Arrange by descending percent.
  • Now repeat these same steps for positive words.
In [105]:
# What songs have the highest proportion of negative words?
lyric_sentiment %>%
    # Count using three arguments
    count(song, sentiment, total_words) %>%
    ungroup() %>%
    # Make a new percent column with mutate 
    mutate(percent = n/total_words) %>%
    # Filter for only negative words
    filter(sentiment=="negative") %>%
    # Arrange by descending percent
    arrange(desc(percent))

# What songs have the highest proportion of positive words?
lyric_sentiment %>%
    # Count using three arguments
    count(song, sentiment, total_words) %>%
    ungroup() %>%
    # Make a new percent column with mutate 
    mutate(percent = n/total_words) %>%
    # Filter for only positive words
    filter(sentiment=="positive") %>%
    # Arrange by descending percent
    arrange(desc(percent))
 song                            sentiment  total_words    n    percent
 bad boy                         negative           237   77  0.3248945
 rack city                       negative           458  142  0.3100437
 ill tumble 4 ya                 negative           269   79  0.2936803
 time wont let me                negative           154   42  0.2727273
 bang bang my baby shot me down  negative           163   40  0.2453988
 the stroke                      negative           279   57  0.2043011

 song                                    sentiment  total_words    n    percent
 love to love you baby                   positive           240   78  0.3250000
 dance dance dance yowsah yowsah yowsah  positive           305   94  0.3081967
 i got the feelin                        positive           141   35  0.2482270
 i love music                            positive           252   61  0.2420635
 sweet and innocent                      positive           218   51  0.2339450
 me and baby brother                     positive           181   42  0.2320442
Fabulous! One of the songs with the highest proportion of positive words is “Dance, Dance, Dance (Yowsah, Yowsah, Yowsah)” from 1977.
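If you only want the handful of highest-proportion songs per sentiment, the two pipelines above can also be combined. Here is one possible sketch using group_by() and top_n() (not part of the exercise):
In [ ]:
# Combined sketch: top three songs by proportion for each of the two
# sentiments, in a single pipeline.
lyric_sentiment %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(song, sentiment, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words) %>%
  group_by(sentiment) %>%
  top_n(3, percent) %>%            # three highest-percent songs per sentiment
  arrange(sentiment, desc(percent))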

Connecting sentiment to other quantities - Video

Sentiment and Billboard rank

The lyric_sentiment data frame that you created earlier by using inner_join() is available in your environment. You can now explore how the sentiment score of a song is related to other aspects of that song. First, start with Billboard rank, how high on the annual Billboard chart the song reached. Do songs that use more positive or negative words achieve higher or lower ranks? Start with positive words, and make a visualization to see how these characteristics are related.
INSTRUCTIONS 100XP
  • Count with three arguments: song, Billboard rank, and the total number of words that you calculated before
  • Use the correct dplyr function to make two new columns, percent and a rounded version of rank
  • Call the correct ggplot2 geom_* to make a boxplot
In [106]:
lyric_sentiment %>%
    filter(sentiment == "positive") %>%
    # Count by song, Billboard rank, and the total number of words
    count(song, rank, total_words) %>%
    ungroup() %>%
    # Use the correct dplyr verb to make two new columns
    mutate(percent = n / total_words,
           rank = 10 * floor(rank / 10)) %>%
    ggplot(aes(as.factor(rank), percent)) +
    # Make a boxplot
    geom_boxplot()
Nice work! Notice that there is not a visible trend relating Billboard rank and positive sentiment.
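The mutate() step above bins ranks into groups of ten: 10 * floor(rank / 10) rounds each rank down to the nearest multiple of ten, so the boxplot shows one box per bin rather than one per rank. A quick illustration with made-up rank values:
In [ ]:
# Toy values (not course data) showing how the binning behaves.
ranks <- c(1, 9, 10, 37, 55, 100)
10 * floor(ranks / 10)
# Returns 0 0 10 30 50 100: every rank is mapped to the bottom of its bin.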

More on Billboard rank and sentiment scores

In the last exercise, you explored how positive sentiment and Billboard rank are related using a visualization, and you found that there was no visible trend. Songs with more positive words do not reach higher or lower ranks on the Billboard chart. Next, check the same relationship using the same visualization but for negative words. The lyric_sentiment data frame that you created earlier is still available in your environment.
INSTRUCTIONS 100XP
  • Filter for only negative words.
  • Count using three arguments: song, Billboard rank, and total number of words.
  • Define a new percent column with mutate() that is equal to n divided by total_words.
  • Then use ggplot2 to make your visualization.
  • Call aes() to put rank on the x-axis and percent on the y-axis; like in the previous exercise, treat rank as a factor with as.factor(rank). Make a boxplot.
In [108]:
lyric_sentiment %>%
    # Filter for only negative words
    filter(sentiment == "negative") %>%
    # Count by song, Billboard rank, and the total number of words
    count(song, rank, total_words) %>%
    ungroup() %>%
    # Mutate to make a percent column
     mutate(percent = n / total_words,
           rank = 10 * floor(rank / 10)) %>%
    # Use ggplot to set up a plot with rank and percent
    ggplot(aes(as.factor(rank), percent)) +
    # Make a boxplot
    geom_boxplot()
That is exactly right, but still no trend! Sentiment and Billboard rank appear to be unrelated.
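If you would rather see both sentiments at once instead of two separate plots, a faceted version is one option. This is a sketch, assuming ggplot2 is loaded and lyric_sentiment is still in your environment:
In [ ]:
# Sketch: positive and negative proportions against binned rank, side by side.
library(ggplot2)

lyric_sentiment %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(song, rank, sentiment, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words,
         rank = 10 * floor(rank / 10)) %>%
  ggplot(aes(as.factor(rank), percent)) +
  geom_boxplot() +
  facet_wrap(~ sentiment, scales = "free_y")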

Moving from song rank to year - Video

Sentiment scores by year

You are going to make two more exploratory plots in this exercise, much like the plots you made for Billboard rank. This time, you are going to explore how sentiment has been changing with time. Are songs on the Billboard chart changing in their use of negative or positive words through the decades?
INSTRUCTIONS 100XP
  • Filter for only negative words.
  • Use count() with three arguments to find the number of sentiment words for each song, year, and total number of words.
  • Use ggplot() to set up a plot with year on the x-axis (remember to treat it as a factor with as.factor(year)) and percent on the y-axis.
  • Now repeat these same steps for positive words.
In [109]:
# How is negative sentiment changing over time?
lyric_sentiment %>%
    # Filter for only negative words
    filter(sentiment == "positive") %>%
    # Count by song, year, and the total number of words
    count(song, year, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words,
           year = 10 * floor(year / 10)) %>%
    # Use ggplot to set up a plot with year and percent
    ggplot(aes(as.factor(year), percent)) +
    geom_boxplot()
    
# How is positive sentiment changing over time?
lyric_sentiment %>%
    filter(sentiment == "negative") %>%
    count(song, year, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words,
           year = 10 * floor(year / 10)) %>%
    ggplot(aes(as.factor(year), percent)) +
    geom_boxplot()
Wonderful! Notice that the proportion of negative words does not change significantly, but the proportion of positive words decreases over time.
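To put a rough number on that visual decrease before fitting a model, you could summarize the median positive proportion per decade. This is an optional sketch, not part of the exercise:
In [ ]:
# Optional sketch: median proportion of positive words per decade.
lyric_sentiment %>%
  filter(sentiment == "positive") %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words,
         decade = 10 * floor(year / 10)) %>%
  group_by(decade) %>%
  summarize(median_percent = median(percent))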

Modeling negative sentiment

You saw in your visualizations in the last exercise how positive and negative sentiment appear to be related to year. Now you can explore that relationship with linear modeling. One more time, make a data frame with one row per song that contains the proportion of negative words. Then build a linear model and see whether the relationship is statistically significant. The lyric_sentiment data frame that you created earlier is still available in your environment.
INSTRUCTIONS 100XP
  • Filter for only negative words
  • Use mutate() to define a new percent column that is n divided by total_words
  • When fitting the linear model with lm(), percent will be your response and year will be your predictor.
  • To see the results of your model fitting, call summary on your model fit object
In [110]:
negative_by_year <- lyric_sentiment %>%
    # Filter for negative words
    filter(sentiment=="negative") %>%
    count(song, year, total_words) %>%
    ungroup() %>%
    # Define a new column: percent
    mutate(percent = n / total_words)

# Specify the model with percent as the response and year as the predictor
model_negative <- lm(percent ~ year, data = negative_by_year)

# Use summary to see the results of the model fitting
summary(model_negative)
Call:
lm(formula = percent ~ year, data = negative_by_year)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.030288 -0.017205 -0.005778  0.010505  0.294194 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.809e-02  5.022e-02   0.758    0.448
year        -3.720e-06  2.523e-05  -0.147    0.883

Residual standard error: 0.02513 on 4624 degrees of freedom
Multiple R-squared:  4.702e-06, Adjusted R-squared:  -0.0002116 
F-statistic: 0.02174 on 1 and 4624 DF,  p-value: 0.8828
Great job fitting this model! Notice how high the p-values are; negative sentiment does not change significantly with year.
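If you prefer model output as a data frame rather than printed text, the broom package (an assumption here; it is not used elsewhere in this course) can tidy the same fit:
In [ ]:
# Sketch using broom (assumed to be installed): coefficient table and
# one-row model summary as data frames.
library(broom)

tidy(model_negative)     # estimate, std.error, statistic, p.value per term
glance(model_negative)   # r.squared, p.value, etc. for the whole model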

Modeling positive sentiment

Now it's time for you to build a linear model for positive sentiment in this dataset of pop songs, just like you did for negative sentiment in the last exercise. Use the same approach and see what the results are!
INSTRUCTIONS 100XP
  • Count using three arguments: song, year, and total number of words.
  • Use mutate() to define a new percent column that is n divided by total_words
  • Specify a linear model with lm() in the same way as the last exercise, but with data = positive_by_year this time.
  • Explore the results of the model fitting with summary().
In [111]:
positive_by_year <- lyric_sentiment %>%
    filter(sentiment == "positive") %>%
    # Count by song, year, and total number of words
    count(song, year, total_words) %>%
    ungroup() %>%
    # Define a new column: percent
    mutate(percent = n / total_words)

# Fit a linear model with percent as the response and year as the predictor
model_positive <- lm(percent ~ year, data = positive_by_year)

# Use summary to see the results of the model fitting
summary(model_positive)
Call:
lm(formula = percent ~ year, data = positive_by_year)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.058050 -0.024032 -0.007756  0.014774  0.269726 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.117e+00  6.859e-02   16.29   <2e-16 ***
year        -5.373e-04  3.446e-05  -15.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03495 on 4770 degrees of freedom
Multiple R-squared:  0.0485, Adjusted R-squared:  0.0483 
F-statistic: 243.1 on 1 and 4770 DF,  p-value: < 2.2e-16
You did it! Notice how low these p-values are and which direction the slope is; positive sentiment does change significantly with year, in contrast with negative sentiment.
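As a back-of-the-envelope reading of that slope (interpretation of the coefficient printed above, not a new analysis): about -5.4e-04 per year works out to roughly a 2.7 percentage point drop in positive words over 50 years.
In [ ]:
# Interpreting the slope from model_positive fitted above.
slope <- coef(model_positive)[["year"]]
slope * 50
# Roughly -0.027: about 2.7 percentage points fewer positive words over 50 years.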

How is sentiment in pop songs changing over the decades?

Based on the visualizations you made and the linear modeling you explored, how is sentiment changing in this dataset of songs over time?
ANSWER THE QUESTION 50XP
Possible Answers
  • More positive words
  • press 1
  • Fewer positive words (Correct)
  • press 2
  • More negative words
  • press 3
  • Fewer negative words
  • press 4
Correct! As years have passed, the proportion of positive words has decreased in these songs; you saw this both in the plots and the modeling.

Wrapping up

Thank You
