Foreword

  • Snippets and results.
  • Source: 'Text Mining, Bag of Words' from DataCamp fitted into Jupyter/IPython using the IRkernel.


Jumping into text mining with bag of words

Quick taste of text mining

It is always fun to jump in with a quick and easy example. Sometimes we can find out the author's intent and main ideas just by looking at the most common words.

At its heart, bag of words text mining represents a way to count terms, or n-grams, across a collection of documents. Consider the following sentences:

text <- "Text mining usually involves the process of structuring the input text. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods."

Manually counting words in the sentences above is a pain! Fortunately, the qdap package offers a better alternative. You can easily find the top 4 most frequent terms (including ties) in text by calling the freq_terms function and specifying 4.

frequent_terms <- freq_terms(text, 4)

The frequent_terms object stores all unique words and their count. You can then make a bar chart simply by calling the plot function on the frequent_terms object.

In [1]:
#install.packages('qdap') in R
#install.packages('qdapTools') in R
In [6]:
# Load qdap and qdapTools
library(qdap)
library(qdapTools)
In [7]:
text <- "Text mining usually involves the process of structuring the input text. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods."

# Print and plot text
frequent_terms <- freq_terms(text, 4)

plot(frequent_terms)

Load some text

Text mining begins with loading some text data into R, which we'll do with the read.csv() function. By default, read.csv() treats character strings as factor levels like Male/Female. To prevent this from happening, it's very important to use the argument stringsAsFactors = FALSE.

A best practice is to examine the object you read in to make sure you know which column(s) are important. The str() function provides an efficient way of doing this. You can also count the number of documents using the nrow() function on the new object. In this example, it will tell you how many coffee tweets are in the vector.

If the data frame contains columns that are not text, you may want to make a new object using only the correct column of text (e.g. some_object$column_name).

In [9]:
#install.packages('xlsx') in R
# load it
library(xlsx)
In [215]:
# Import data
# sheetIndex = 1 for coffee
tweets <- read.xlsx("Text.xls", sheetIndex = 1)
str(tweets)
'data.frame':	1000 obs. of  15 variables:
 $ num         : num  1 2 3 4 5 6 7 8 9 10 ...
 $ text        : Factor w/ 910 levels "- my brain, after coffee, right now #emojiexplanatoryemotions #idonthaveemojis",..: 63 664 526 589 767 346 462 85 704 683 ...
 $ favorited   : Factor w/ 2 levels "0     FALSE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ replyToSN   : Factor w/ 176 levels "_anaele","_Camilo30",..: 23 5 5 5 5 5 5 45 5 5 ...
 $ created     : Factor w/ 13 levels "8/9/2013 02:31:00",..: 13 13 13 13 13 13 13 13 13 13 ...
 $ truncated   : Factor w/ 1 level "FALSE": 1 1 1 1 1 1 1 1 1 1 ...
 $ replyToSID  : Factor w/ 34 levels "3.45203e+17",..: 32 34 34 34 34 34 34 32 34 34 ...
 $ id          : num  3.66e+17 3.66e+17 3.66e+17 3.66e+17 3.66e+17 ...
 $ replyToUID  : Factor w/ 176 levels "1039188956","1040337193",..: 43 176 176 176 176 176 176 13 176 176 ...
 $ statusSource: Factor w/ 66 levels "<a href=\"http://ask.fm/\" rel=\"nofollow\">Ask.fm</a>",..: 28 28 66 27 28 27 27 28 28 18 ...
 $ screenName  : Factor w/ 926 levels "__catttiie","__OneTouch__",..: 826 162 396 44 715 254 168 421 473 525 ...
 $ retweetCount: num  0 1 0 0 2 0 0 0 1 2 ...
 $ retweeted   : Factor w/ 1 level "FALSE": 1 1 1 1 1 1 1 1 1 1 ...
 $ longitude   : Factor w/ 1 level "NA": 1 1 1 1 1 1 1 1 1 1 ...
 $ latitude    : Factor w/ 1 level "NA": 1 1 1 1 1 1 1 1 1 1 ...
In [216]:
tweets$num <- as.integer(tweets$num)
tweets$text <- as.character(tweets$text)
tweets$favorited <- as.logical(tweets$favorited)
tweets$replyToSN <- as.character(tweets$replyToSN)
tweets$created <- as.character(tweets$created)
tweets$truncated <- as.logical(tweets$truncated)
tweets$replyToSID <- as.numeric(as.character(tweets$replyToSID))
tweets$replyToUID <- as.integer(tweets$replyToUID)
tweets$statusSource <- as.character(tweets$statusSource)
tweets$screenName <- as.character(tweets$screenName)
tweets$retweeted <- as.integer(tweets$retweeted)
tweets$longitude <- as.logical(tweets$longitude)
tweets$latitude <- as.logical(tweets$latitude)
Warning message in eval(expr, envir, enclos):
"NAs introduits lors de la conversion automatique"
In [217]:
# View the structure of tweets
str(tweets)
'data.frame':	1000 obs. of  15 variables:
 $ num         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ text        : chr  "@ayyytylerb that is so true drink lots of coffee" "RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early, make yo coffee/breakfast, cus this will "| __truncated__ "If you believe in #gunsense tomorrow would be a very good day to have your coffee any place BUT @Starbucks Guns+Coffee=#nosense"| __truncated__ "My cute coffee mug. http://t.co/2udvMU6XIG" ...
 $ favorited   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSN   : chr  "ayyytylerb" "<NA>" "<NA>" "<NA>" ...
 $ created     : chr  "8/9/2013 02:43:00" "8/9/2013 02:43:00" "8/9/2013 02:43:00" "8/9/2013 02:43:00" ...
 $ truncated   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID  : num  3.66e+17 NA NA NA NA ...
 $ id          : num  3.66e+17 3.66e+17 3.66e+17 3.66e+17 3.66e+17 ...
 $ replyToUID  : int  43 176 176 176 176 176 176 13 176 176 ...
 $ statusSource: chr  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "web" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
 $ screenName  : chr  "thejennagibson" "carolynicosia" "janeCkay" "AlexandriaOOTD" ...
 $ retweetCount: num  0 1 0 0 2 0 0 0 1 2 ...
 $ retweeted   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ longitude   : logi  NA NA NA NA NA NA ...
 $ latitude    : logi  NA NA NA NA NA NA ...
In [203]:
# Print out the number of rows in tweets
nrow(tweets)
1000
In [204]:
# Isolate text from tweets: coffee_tweets
coffee_tweets <- as.character(tweets$text)
str(coffee_tweets)
head(coffee_tweets)
 chr [1:1000] "@ayyytylerb that is so true drink lots of coffee" ...
  1. '@ayyytylerb that is so true drink lots of coffee'
  2. 'RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early, make yo coffee/breakfast, cus this will only happen ?'
  3. 'If you believe in #gunsense tomorrow would be a very good day to have your coffee any place BUT @Starbucks Guns+Coffee=#nosense @MomsDemand'
  4. 'My cute coffee mug. http://t.co/2udvMU6XIG'
  5. 'RT @slaredo21: I wish we had Starbucks here... Cause coffee dates in the morning sound perff!'
  6. 'Does anyone ever get a cup of coffee before a cocktail??'

Make the vector a VCorpus object (1)

Recall that you've loaded your text data as a vector called coffee_tweets. Your next step is to convert this vector containing the text data to a corpus. A corpus is a collection of documents, but it's also important to know that in the tm domain, R recognizes it as a data type.

There are two kinds of the corpus data type, the permanent corpus, PCorpus, and the volatile corpus, VCorpus. In essence, the difference between the two has to do with how the collection of documents is stored in your computer. We will use the volatile corpus, which is held in your computer's RAM rather than saved to disk, just to be more memory efficient.

To make a volatile corpus, R needs to interpret each element in our vector of text, coffee_tweets, as a document. And the tm package provides what are called Source functions to do just that! In this exercise, we'll use a Source function called VectorSource() because our text data is contained in a vector. The output of this function is called a Source object. Give it a shot!

In [15]:
#install.packages('tm') in R
# Load it
library(tm)
In [16]:
# Make a vector source: coffee_source
coffee_source <- VectorSource(coffee_tweets)

Make the vector a VCorpus object (2)

Now that we've converted our vector to a Source object, we pass it to another tm function, VCorpus(), to create our volatile corpus. Pretty straightforward, right?

The VCorpus object is a nested list. That means that at each index of the VCorpus object, there is a PlainTextDocument object, which is essentially a list that contains the actual text data, as well as some corresponding metadata. It can help to visualize a VCorpus object to conceptualize the whole thing.

To examine the contents of the second tweet in coffee_corpus, you'd need to subset twice: once to specify the second PlainTextDocument that corresponds to the second tweet and again to extract the first element of that PlainTextDocument.

In [17]:
# Make a volatile corpus: coffee_corpus
coffee_corpus <- VCorpus(coffee_source)

# Print out coffee_corpus
coffee_corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1000
In [18]:
# Print data on the 15th tweet in coffee_corpus
coffee_corpus[[15]]

# Print the text from the 15th tweet in coffee_corpus
coffee_corpus[[15]][1]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 111
$content = '@HeatherWhaley I was about 2 joke it takes 2 hands to hold hot coffee...then I read headline! #Don\'tDrinkNShoot'

Make a VCorpus from a data frame

Because another common text source is a data frame, there is a Source function called DataframeSource(). The DataframeSource() function treats the entire row as a complete document, so be careful you don't pick up non-text data like customer IDs when sourcing a document this way.

In [239]:
# Import data
# sheetIndex = 2 for example_text
example_text <- read.xlsx("Text.xls", sheetIndex = 2)

example_text$Author1 <- as.character(example_text$Author1)
example_text$Author2 <- as.character(example_text$Author2)

str(example_text)
'data.frame':	3 obs. of  3 variables:
 $ num    : num  1 2 3
 $ Author1: chr  "Text mining is a great time." "Text analysis provides insights" "qdap and tm are used in text mining"
 $ Author2: chr  "R is a great language" "R has many uses" "DataCamp is cool!"
In [20]:
# Print example_text to the console
example_text
numAuthor1Author2
1 Text mining is a great time. R is a great language
2 Text analysis provides insights R has many uses
3 qdap and tm are used in text miningDataCamp is cool!
In [21]:
# Create a DataframeSource on columns 2 and 3: df_source
df_source <- DataframeSource(example_text[,2:3])

# Convert df_source to a corpus: df_corpus
df_corpus <- VCorpus(df_source)

# Examine df_corpus
df_corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3
In [22]:
# Create a VectorSource on column 3: vec_source
vec_source <- VectorSource(example_text[,3])

# Convert vec_source to a corpus: vec_corpus
vec_corpus <- VCorpus(vec_source)

# Examine vec_corpus
vec_corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

Common cleaning functions from tm

Now that you know two ways to make a corpus, we can focus on cleaning, or preprocessing, the text. First, we'll clean a small piece of text so you can see how it works. Then we will move on to actual corpora.

In bag of words text mining, cleaning helps aggregate terms. For example, it may make sense that the words "miner", "mining" and "mine" should be considered one term. Specific preprocessing steps will vary based on the project. For example, the words used in tweets are vastly different than those used in legal documents, so the cleaning process can also be quite different.

Common preprocessing functions include:

  • tolower(): Make all characters lowercase.
  • removePunctuation(): Remove all punctuation marks.
  • removeNumbers(): Remove numbers.
  • stripWhitespace(): Remove excess whitespace.

Note that tolower() is part of base R, while the other three functions come from the tm package. Going forward, we'll load the tm and qdap for you when they are needed. Every time we introduce a new package, we'll have you load it the first time.

In [23]:
# Create the object: text
text <- "<b>She</b> woke up at       6 A.M. It\'s so early!  She was only 10% awake and began drinking coffee in front of her computer."

# from the tm package
# All lowercase
tolower(text)
'<b>she</b> woke up at 6 a.m. it\'s so early! she was only 10% awake and began drinking coffee in front of her computer.'
In [24]:
# from the gdap package
# Remove punctuation
removePunctuation(text)
'bSheb woke up at 6 AM Its so early She was only 10 awake and began drinking coffee in front of her computer'
In [25]:
# Remove numbers
removeNumbers(text)
'<b>She</b> woke up at A.M. It\'s so early! She was only % awake and began drinking coffee in front of her computer.'
In [26]:
# Remove whitespace
stripWhitespace(text)
'<b>She</b> woke up at 6 A.M. It\'s so early! She was only 10% awake and began drinking coffee in front of her computer.'

Cleaning with qdap

The qdap package offers other text cleaning functions. Each is useful in its own way and is particularly powerful when combined with the others.

  • bracketX(): Remove all text within brackets (e.g. "It's (so) cool" becomes "It's cool").
  • replace_number(): Replace numbers with their word equivalents (e.g. "2" becomes "two").
  • replace_abbreviation(): Replace abbreviations with their full text equivalents (e.g. "Sr" becomes "Senior").
  • replace_contraction(): Convert contractions back to their base words (e.g. "shouldn't" becomes "should not").
  • replace_symbol() Replace common symbols with their word equivalents (e.g. "$" becomes "dollar").
In [27]:
# from the qdap package
# Remove text within brackets
bracketX(text)
'She woke up at 6 A.M. It\'s so early! She was only 10% awake and began drinking coffee in front of her computer.'
In [28]:
# Replace numbers with words
replace_number(text)
'<b>She</b> woke up at six A.M. It\'s so early! She was only ten% awake and began drinking coffee in front of her computer.'
In [29]:
# Replace abbreviations
replace_abbreviation(text)
'<b>She</b> woke up at 6 AM It\'s so early! She was only 10% awake and began drinking coffee in front of her computer.'
In [30]:
# Replace contractions
replace_contraction(text)
'<b>She</b> woke up at 6 A.M. it is so early! She was only 10% awake and began drinking coffee in front of her computer.'
In [31]:
# Replace symbols with words
replace_symbol(text)
'<b>She</b> woke up at 6 A.M. It\'s so early! She was only 10 percent awake and began drinking coffee in front of her computer.'

All about stop words

Often there are words that are frequent but provide little information. So you may want to remove these so-called stop words. Some common English stop words include "I", "she'll", "the", etc. In the tm package, there are 174 stop words on this common list.

In fact, when you are doing an analysis you will likely need to add to this list. In our coffee tweet example, all tweets contain "coffee", so it's important to pull out that word in addition to the common stop words. Leaving it in doesn't add any insight and will cause it to be overemphasized in a frequency analysis.

Using the c() function allows you to add new words (separated by commas) to the stop words list. For example, the following would add "word1" and "word2" to the default list of English stop words:

all_stops <- c("word1", "word2", stopwords("en"))

Once you have a list of stop words that makes sense, you will use the removeWords() function on your text. removeWords() takes two arguments: the text object to which it's being applied and the list of words to remove.

In [32]:
# List standard English stop words
stopwords('en')

# Print text without standard stop words
removeWords(text, stopwords('en'))
  1. 'i'
  2. 'me'
  3. 'my'
  4. 'myself'
  5. 'we'
  6. 'our'
  7. 'ours'
  8. 'ourselves'
  9. 'you'
  10. 'your'
  11. 'yours'
  12. 'yourself'
  13. 'yourselves'
  14. 'he'
  15. 'him'
  16. 'his'
  17. 'himself'
  18. 'she'
  19. 'her'
  20. 'hers'
  21. 'herself'
  22. 'it'
  23. 'its'
  24. 'itself'
  25. 'they'
  26. 'them'
  27. 'their'
  28. 'theirs'
  29. 'themselves'
  30. 'what'
  31. 'which'
  32. 'who'
  33. 'whom'
  34. 'this'
  35. 'that'
  36. 'these'
  37. 'those'
  38. 'am'
  39. 'is'
  40. 'are'
  41. 'was'
  42. 'were'
  43. 'be'
  44. 'been'
  45. 'being'
  46. 'have'
  47. 'has'
  48. 'had'
  49. 'having'
  50. 'do'
  51. 'does'
  52. 'did'
  53. 'doing'
  54. 'would'
  55. 'should'
  56. 'could'
  57. 'ought'
  58. 'i\'m'
  59. 'you\'re'
  60. 'he\'s'
  61. 'she\'s'
  62. 'it\'s'
  63. 'we\'re'
  64. 'they\'re'
  65. 'i\'ve'
  66. 'you\'ve'
  67. 'we\'ve'
  68. 'they\'ve'
  69. 'i\'d'
  70. 'you\'d'
  71. 'he\'d'
  72. 'she\'d'
  73. 'we\'d'
  74. 'they\'d'
  75. 'i\'ll'
  76. 'you\'ll'
  77. 'he\'ll'
  78. 'she\'ll'
  79. 'we\'ll'
  80. 'they\'ll'
  81. 'isn\'t'
  82. 'aren\'t'
  83. 'wasn\'t'
  84. 'weren\'t'
  85. 'hasn\'t'
  86. 'haven\'t'
  87. 'hadn\'t'
  88. 'doesn\'t'
  89. 'don\'t'
  90. 'didn\'t'
  91. 'won\'t'
  92. 'wouldn\'t'
  93. 'shan\'t'
  94. 'shouldn\'t'
  95. 'can\'t'
  96. 'cannot'
  97. 'couldn\'t'
  98. 'mustn\'t'
  99. 'let\'s'
  100. 'that\'s'
  101. 'who\'s'
  102. 'what\'s'
  103. 'here\'s'
  104. 'there\'s'
  105. 'when\'s'
  106. 'where\'s'
  107. 'why\'s'
  108. 'how\'s'
  109. 'a'
  110. 'an'
  111. 'the'
  112. 'and'
  113. 'but'
  114. 'if'
  115. 'or'
  116. 'because'
  117. 'as'
  118. 'until'
  119. 'while'
  120. 'of'
  121. 'at'
  122. 'by'
  123. 'for'
  124. 'with'
  125. 'about'
  126. 'against'
  127. 'between'
  128. 'into'
  129. 'through'
  130. 'during'
  131. 'before'
  132. 'after'
  133. 'above'
  134. 'below'
  135. 'to'
  136. 'from'
  137. 'up'
  138. 'down'
  139. 'in'
  140. 'out'
  141. 'on'
  142. 'off'
  143. 'over'
  144. 'under'
  145. 'again'
  146. 'further'
  147. 'then'
  148. 'once'
  149. 'here'
  150. 'there'
  151. 'when'
  152. 'where'
  153. 'why'
  154. 'how'
  155. 'all'
  156. 'any'
  157. 'both'
  158. 'each'
  159. 'few'
  160. 'more'
  161. 'most'
  162. 'other'
  163. 'some'
  164. 'such'
  165. 'no'
  166. 'nor'
  167. 'not'
  168. 'only'
  169. 'own'
  170. 'same'
  171. 'so'
  172. 'than'
  173. 'too'
  174. 'very'
'<b>She</b> woke 6 A.M. It\'s early! She 10% awake began drinking coffee front computer.'
In [33]:
# Add "coffee" and "bean" to the list: new_stops
new_stops <- c("coffee", "bean", stopwords('en'))

# Remove stop words from text
removeWords(text, new_stops)
'<b>She</b> woke 6 A.M. It\'s early! She 10% awake began drinking front computer.'

Intro to word stemming and stem completion

Still another useful preprocessing step involves word stemming and stem completion. The tm package provides the stemDocument() function to get to a word's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.

For example,

stemDocument(c("computational", "computers", "computation"))

returns "comput" "comput" "comput". But because "comput" isn't a real word, we want to re-complete the words so that "computational", "computers", and "computation" all refer to the same word, say "computer", in our ongoing analysis.

We can easily do this with the stemCompletion() function, which takes in a character vector and an argument for the completion dictionary. The completion dictionary can be a character vector or a Corpus object. Either way, the completion dictionary for our example would need to contain the word "computer" for all the words to refer to it.

In [34]:
#install.packages('SnowballC') in R to complement package tm
library(SnowballC)
In [35]:
# Create complicate
complicate <- c("complicated", "complication", "complicatedly")
In [36]:
# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)
stem_doc
  1. 'complic'
  2. 'complic'
  3. 'complic'
In [37]:
# Create the completion dictionary: comp_dict
comp_dict <- "complicate"

# Perform stem completion: complete_text 
complete_text <- stemCompletion(stem_doc, comp_dict)
complete_text
complic
'complicate'
complic
'complicate'
complic
'complicate'
In [38]:
# Print complete_text
complete_text
     complic      complic      complic 
"complicate" "complicate" "complicate"
Error in parse(text = x, srcfile = src): <text>:3:19: unexpected symbol
2: complete_text
3:      complic      complic
                     ^
Traceback:

Word stemming and stem completion on a sentence

Let's consider the following sentence as our document for this exercise:

"In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

This sentence contains the same three forms of the word "complicate" that we saw in the previous exercise. The difference here is that even if you called stemDocument() on this sentence, it would return the sentence without stemming any words. Take a moment and try it out in the console. Be sure to include the punctuation marks.

This happens because stemDocument() treats the whole sentence as one word. In other words, our document is a character vector of length 1, instead of length n, where n is the number of words in the document. To solve this problem, we first remove the punctation marks with the removePunctuation() function you learned a few exercises back. We then strsplit() this character vector of length 1 to length n, unlist(), then proceed to stem and re-complete.

Don't worry if that was confusing. Let's go through the process step by step!

In [39]:
# load some text
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
In [40]:
# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)
rm_punc
'In a complicated haste Tom rushed to fix a new complication too complicatedly'
In [41]:
# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = ' '))
n_char_vec
  1. 'In'
  2. 'a'
  3. 'complicated'
  4. 'haste'
  5. 'Tom'
  6. 'rushed'
  7. 'to'
  8. 'fix'
  9. 'a'
  10. 'new'
  11. 'complication'
  12. 'too'
  13. 'complicatedly'
In [42]:
# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)

# Print stem_doc
stem_doc
  1. 'In'
  2. 'a'
  3. 'complic'
  4. 'hast'
  5. 'Tom'
  6. 'rush'
  7. 'to'
  8. 'fix'
  9. 'a'
  10. 'new'
  11. 'complic'
  12. 'too'
  13. 'complic'
In [43]:
# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)

# Print complete_doc
complete_doc
In
''
a
''
complic
'complicate'
hast
''
Tom
''
rush
''
to
''
fix
''
a
''
new
''
complic
'complicate'
too
''
complic
'complicate'

Apply preprocessing steps to a corpus

The tm package provides a special function tm_map() to apply cleaning functions to a corpus. Mapping these functions to an entire corpus makes scaling the cleaning steps very easy.

To save time (and lines of code) it's a good idea to use a custom function like the one displayed in the editor, since you may be applying the same functions over multiple corpora. You can probably guess what the clean_corpus() function does. It takes one argument, corpus, and applies a series of cleaning functions to it in order, then returns the final result.

Notice how the tm package functions do not need content_transformer(), but base R and qdap functions do.

Be sure to test your function's results. If you want to draw out currency amounts, then removeNumbers() shouldn't be used! Plus, the order of cleaning steps makes a difference. For example, if you removeNumbers() and then replace_number(), the second function won't find anything to change! Check, check, and re-check!

In [44]:
# just a little conversion
tweet_corp <- coffee_corpus
tweet_corp
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1000
In [45]:
# Alter the function code to match the instructions
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace) # tm
  corpus <- tm_map(corpus, removePunctuation) # tm
  corpus <- tm_map(corpus, tolower) # R
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee", "mug")) # tm
  return(corpus)
}

# Apply your customized function to the tweet_corp: clean_corp
clean_corp <- clean_corpus(tweet_corp)

# Print out a cleaned up tweet
clean_corp[[227]][1]
'also dogs arent smart enough dip donut eat part thats dipped ladyandthetramp'
In [46]:
# Print out the same tweet in original form
tweet_corp[[227]][1]
$content = 'Also, dogs aren\'t smart enough to dip the donut in the coffee and then eat the part that's been dipped. #ladyandthetramp'
In [47]:
tweet_corp[[1]]
clean_corp[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 48
'ayyytylerb true drink lots '

Make a document-term matrix

Beginning with the coffee.csv file, we have used common transformations to produce a clean corpus called clean_corp.

The document-term matrix is used when you want to have each document represented as a row. This can be useful if you are comparing authors within rows, or the data is arranged chronologically and you want to preserve the time series.

In [48]:
# clean coffee_tweets and create clean_tweets
clean_tweets <- stripWhitespace(coffee_tweets)
# carry on the cleaning
clean_tweets <- removePunctuation(clean_tweets)
clean_tweets <- tolower(clean_tweets)
clean_tweets <- removeWords(clean_tweets, c(stopwords("en"), "coffee"))

str(clean_tweets)
 chr [1:1000] "ayyytylerb    true drink lots  " ...
In [49]:
# create a corpus
clean_corp <- VCorpus(VectorSource(clean_tweets))
In [50]:
coffee_dtm <- DocumentTermMatrix(clean_corp)

# Print out coffee_dtm data
coffee_dtm
<<DocumentTermMatrix (documents: 1000, terms: 3076)>>
Non-/sparse entries: 7391/3068609
Sparsity           : 100%
Maximal term length: 27
Weighting          : term frequency (tf)
In [51]:
# Convert coffee_dtm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_dtm)

# Print the dimensions of coffee_m
dim(coffee_m)

# Review a portion of the matrix
coffee_m[148:150, 2587:2590]
  1. 1000
  2. 3076
stalkedstampedebluestandstar
1480000
1490000
1500000

Make a term-document matrix

In this case, the term-document matrix has terms in the first column and documents across the top as individual column names.

The TDM is often the matrix used for language analysis. This is because you likely have more terms than authors or documents and life is generally easier when you have more rows than columns. An easy way to start analyzing the information is to change the matrix into a simple matrix using as.matrix() on the TDM.

In [52]:
# Create a TDM from clean_corp: coffee_tdm
coffee_tdm <- TermDocumentMatrix(clean_corp)

# Print coffee_tdm data
coffee_tdm
<<TermDocumentMatrix (terms: 3076, documents: 1000)>>
Non-/sparse entries: 7391/3068609
Sparsity           : 100%
Maximal term length: 27
Weighting          : term frequency (tf)
In [53]:
# Convert coffee_tdm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)

# Print the dimensions of the matrix
dim(coffee_m)

# Review a portion of the matrix
coffee_m[2587:2590, 148:150]
  1. 3076
  2. 1000
148149150
stalked000
stampedeblue000
stand000
star000


Word clouds and more interesting visuals

Frequent terms with tm**

Calling rowSums() on your newly made matrix aggregates all the terms used in a passage. Once you have the rowSums(), you can sort() them with decreasing = TRUE, so you can focus on the most common terms.

In [54]:
# Create a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)

# Calculate the rowSums: term_frequency (rows = terms)
term_frequency <- rowSums(coffee_m)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)

# View the top 10 most common words
head(term_frequency, 10)
like
111
cup
103
shop
69
just
66
get
62
morning
57
want
49
drinking
47
can
45
looks
45
In [55]:
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)

Frequent terms with qdap

A fast way to get frequent terms is with freq_terms() from qdap.

The function accepts a text variable, which in our case is the tweets$text vector. You can specify the top number of terms to show with the top argument, a vector of stop words to remove with the stopwords argument, and the minimum character length of a word to be included with the at.least argument. qdap has its own list of stop words that differ from those in tm. Our exercise will show you how to use either and compare their results.

Making a basic plot of the results is easy. Just call plot() on the freq_terms() object.

In [56]:
# from the original variable 'tweets', not the corpus
# Create frequency
frequency <- freq_terms(tweets$text, top = 10, at.least = 3, stopwords = "Top200Words")

# Make a frequency barchart
plot(frequency)
In [57]:
# Create frequency2
frequency2 <- freq_terms(tweets$text, top = 10, at.least = 3, tm::stopwords("english"))

# Make a frequency2 barchart
plot(frequency2)

A simple word cloud

Let's try our hand on another batch of 1000 tweets. Let's see if you can figure it out using a word cloud.

A word cloud is a visualization of terms. In a word cloud, size is often scaled to frequency and in some cases the colors may indicate another measurement. For now, we're keeping it simple: size is related to individual word frequency and we are just selecting a single color.

The wordcloud() function works like this:

wordcloud(words, frequencies, max.words = 500, colors = "blue")

Text mining analyses often include simple word clouds. In fact, they are probably over used, but can still be useful for quickly understanding a body of text!

In [62]:
# install the wordcloud package
# load it
library(wordcloud)
In [59]:
# Print the first 10 entries in term_frequency
head(term_frequency, 10)
like
111
cup
103
shop
69
just
66
get
62
morning
57
want
49
drinking
47
can
45
looks
45
In [64]:
# Create word_freqs
word_freqs <- data.frame(term = names(term_frequency), num = term_frequency)
head(word_freqs)
termnum
likelike 111
cupcup 103
shopshop 69
justjust 66
getget 62
morningmorning 57
In [65]:
# Create a wordcloud for the values in word_freqs
wordcloud(word_freqs$term, word_freqs$num, max.words = 100, colors = "red")

Recap (example)

Code.

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "amp"))
  return(corpus)
}

# if you added new stop words to clean_corpus()
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, 
                   c(stopwords("en"), "amp", "chardonnay", "wine", "glass"))
  return(corpus)
}

# Create clean_chardonnay
clean_chardonnay <- clean_corpus(chardonnay_corp)

# Create chardonnay_tdm
chardonnay_tdm <- TermDocumentMatrix(clean_chardonnay)

# Create chardonnay_m
chardonnay_m <- as.matrix(chardonnay_tdm)

# Create chardonnay_words
chardonnay_words <- rowSums(chardonnay_m)

Wordcloud code.

# Sort the chardonnay_words in descending order
chardonnay_words <- sort(chardonnay_words, decreasing = TRUE)

# Print the 6 most frequent chardonnay terms
head(chardonnay_words, 6)

# Create chardonnay_freqs
chardonnay_freqs <- data.frame(term = names(chardonnay_words), num = chardonnay_words)

# Create a wordcloud for the values in word_freqs
wordcloud(chardonnay_freqs$$term, chardonnay_freqs$num, max.words = 50, colors = "red")

Improve word cloud colors

So far, only a single hexidecimal color makes the word clouds. Instead of the #AD1DA5 in the code below, specify a vector of colors to make certain words stand out or to fit an existing color scheme.

wordcloud(chardonnay_freqs$term, 
          chardonnay_freqs$num, 
          max.words = 100, 
          colors = "#AD1DA5")

To change the colors argument of the wordcloud() function, you can use a vector of named colors like c("chartreuse", "cornflowerblue", "darkorange"). The function colors() will list all 657 named colors. You can also use this PDF as a reference.

In [72]:
# Print the list of colors
head(colors())
length(colors())
  1. 'white'
  2. 'aliceblue'
  3. 'antiquewhite'
  4. 'antiquewhite1'
  5. 'antiquewhite2'
  6. 'antiquewhite3'
657
In [73]:
# Print the wordcloud with the specified colors
wordcloud(word_freqs$term, 
          word_freqs$num, 
          max.words = 100, 
          colors = c("grey80", "darkgoldenrod1", "tomato"))

Use prebuilt color palettes

Use the RColorBrewer package to help. RColorBrewer color schemes are organized into three categories:

  • Sequential: Colors ascend from light to dark in sequence.
  • Qualitative: Colors are chosen for their pleasing qualities together.
  • Diverging: Colors have two distinct color spectra with lighter colors in between.

To change the colors parameter of the wordcloud() function you can use a select a palette from RColorBrewer such as "Greens". The function display.brewer.all() will list all predefined color palettes. More information on ColorBrewer (the framework behind RColorBrewer) is available on its website.

The function brewer.pal() allows you to select colors from a palette. Specify the number of distinct colors needed (e.g. 8) and the predefined palette to select from (e.g. "Greens"). Often in word clouds, very faint colors are washed out so it may make sense to remove the first couple from a brewer.pal() selection, leaving only the darkest.

Here's an example:

green_pal <- brewer.pal(8, "Greens")
green_pal <- green_pal[-(1:2)]

Then just add that object to the wordcloud() function.

wordcloud(chardonnay_freqs$term, chardonnay_freqs$num, max.words = 100, colors = green_pal)
In [75]:
# List the available colors
display.brewer.all()

Or this website

In [76]:
# Create purple_orange
purple_orange <- brewer.pal(10, "PuOr")

# Drop 2 faintest colors
purple_orange <- purple_orange[-(1:2)]

# Create a wordcloud with purple_orange palette
wordcloud(word_freqs$term, 
          word_freqs$num, 
          max.words = 100, 
          colors = purple_orange)

Note: for the remaining part of section 2, the blocks were computed outside IPython. Pieces (codes, results, images) were then imported into IPython. Reason: the data were not all available; it was impossible to load IPython with the data and fully reproduce the results.

Find common words

Say you want to visualize common words across multiple documents. You can do this with commonality.cloud().

Each of our coffee and chardonnay corpora is composed of many individual tweets. To treat the coffee tweets as a single document and likewise for chardonnay, you paste() together all the tweets in each corpus along with the parameter collapse = " ". This collapses all tweets (separated by a space) into a single vector. Then you can create a vector containing the two collapsed documents.

Code.

all_coffee <- paste(coffee$tweets, collapse = " ")
all_chardonnay <- paste(chardonnay$$tweets, collapse = " ")
all_tweets <- c(all_coffee, all_chardonnay)

Once you're done with these steps, you can take the same approach you've seen before to create a VCorpus() based on a VectorSource from the all_tweets object

# Create all_coffee
all_coffee <- paste(coffee_tweets$text, collapse = " ")

# Create all_chardonnay
all_chardonnay <- paste(chardonnay_tweets$text, collapse = " ")

# Create all_tweets
all_tweets <- c(all_coffee, all_chardonnay)

# Convert to a vector source
all_tweets <- VectorSource(all_tweets)

# Create all_corpus
all_corpus <- VCorpus(all_tweets)

Visualize common words

Now that you have a corpus filled with words used in both the chardonnay and coffee tweets files, you can clean the corpus, convert it into a TermDocumentMatrix, and then a matrix to prepare it for a commonality.cloud()

The commonality.cloud() function accepts this matrix object, plus additional arguments like max.words and colors to further customize the plot.

commonality.cloud(tdm_matrix, max.words = 100, colors = "springgreen")

Code.

# Clean the corpus
all_clean <- clean_corpus(all_corpus)

# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)

# Create all_m
all_m <- as.matrix(all_tdm)

# Print a commonality cloud
commonality.cloud(all_m, colors = "steelblue1", max.words = 100)

Result.

Visualize dissimilar words

Say you want to visualize the words not in common. To do this, you can also use comparison.cloud() and the steps are quite similar with one main difference.

Like when you were searching for words in common, you start by unifying the tweets into distinct corpora and combining them into their own VCorpus() object. Next apply a clean_corpus() function and organize it into a TermDocumentMatrix.

To keep track of what words belong to coffee versus chardonnay, you can set the column names of the TDM like this:

colnames(all_tdm) <- c("chardonnay", "coffee")

Lastly, convert the object to a matrix using as.matrix() for use in comparison.cloud(). For every distinct corpora passed to the comparison.cloud() you can specify a color as in colors = c("red", "yellow", "green") to make the sections distinguishable.

Code.

# Clean the corpus
all_clean <- clean_corpus(all_corpus)

# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)

# Give the columns distinct names
colnames(all_tdm) <- c("coffee", "chardonnay")

# Create all_m
all_m <- as.matrix(all_tdm)

# Create comparison cloud
comparison.cloud(all_m, colors = c("orange", "blue"), max.words = 50)

Result.

Polarized tag cloud (Pyramid Plot)

A commonality.cloud() may be misleading since words could be represented disproportionately in one corpus or the other, even if they are shared. In the commonality cloud, they would show up without telling you which one of the corpora has more term occurrences. To solve this problem, we can create a pyramid.plot() from the plotrix package.

Building on what you already know, we have created a simple matrix from the coffee and chardonnay tweets using all_tdm_m <- as.matrix(all_tdm). Recall that this matrix contains two columns: one for term frequency in the chardonnay corpus, and another for term frequency in the coffee corpus. So we can use the subset() function in the following way to get terms that appear one or more times in both corpora:

same_words <- subset(all_tdm_m, all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0)

Once you have the terms that are common to both corpora, you can create a new column in same_words that contains the absolute difference between how often each term is used in each corpus.

To identify the words that differ the most between documents, we must order() the rows of same_words by the absolute difference column with decreasing = TRUE like this:

same_words <- same_words[order(same_words[, 3], decreasing = TRUE), ]

Now that same_words is ordered by the absolute difference, let's create a small data.frame() of the 20 top terms so we can pass that along to pyramid.plot():

top_words <- data.frame(
  x = same_words[1:20, 1],
  y = same_words[1:20, 2],
  labels = rownames(same_words[1:20, ])
)

Note that top_words contains columns x and y for the frequency of the top words for each of the documents, and a third column, labels, that contains the words themselves.

Finally, create your pyramid.plot() and get a better feel for how the word usages differ by topic!

pyramid.plot(top_words$x, top_words$$y,
             labels = top_words$labels, gap = 8,
             top.labels = c("Chardonnay", "Words", "Coffee"),
             main = "Words in Common", laxlab = NULL, 
             raxlab = NULL, unit = NULL)

Code.

# Create common_words
# get terms that appear one or more times in both corpora
#                  Docs
# Terms            chardonnay coffee
#   aaliyahmaxwell          4      0
common_words <- subset(all_tdm_m, all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0)

# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])

# Combine common_words and difference
common_words <- cbind(common_words, difference)

#          chardonnay coffee difference
# actually          4      7          3
# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]

#      chardonnay coffee difference
# cup           3    103        100
# Create top25_df
top25_df <- data.frame(x = common_words[1:25, 1], 
                       y = common_words[1:25, 2], 
                       labels = rownames(common_words[1:25, ]))

# Create the pyramid plot
pyramid.plot(top25_df$x, top25_df$y,
                labels = top25_df$labels,
                main = "Words in Common",
                gap = 8, laxlab = NULL,
                raxlab = NULL, unit = NULL,
                top.labels = c("Chardonnay", "Words", "Coffee"))

Result.

Visualize word networks

Another way to view word connections is to treat them as a network, similar to a social network. Word networks show term association and cohesion. A word of caution: these visuals can become very dense and hard to interpret visually.

In a network graph, the circles are called nodes and represent individual terms, while the lines connecting the circles are called edges and represent the connections between the terms.

For the over-caffeinated text miner, qdap provides a shorcut for making word networks. The word_network_plot() and word_associate() functions both make word networks easy!

The sample code constructs a word network for words associated with "Marvin".

Code.

# Word association
word_associate(coffee_tweets$text, match.string = c("barista"), 
               stopwords = c(Top200Words, "coffee", "amp"), 
               network.plot = TRUE, cloud.colors = c("gray85", "darkred"))

# Add title
title(main = "Barista Coffee Tweets Associations")

Result.

Teaser: simple word clustering

Let's simply create a new visual called a dendrogram from our coffee_tweets. Hierarchical clustering for a dendrogram reduces information.

Code.

plot(hc)

Result.



Adding to your tm skills

Distance matrix and dendrogram

A simple way to do word cluster analysis is with a dendrogram on your term-document matrix. Once you have a TDM, you can call dist() to compute the differences between each row of the matrix.

Next, you call hclust() to perform cluster analysis on the dissimilarities of the distance matrix. Lastly, you can visualize the word frequency distances using a dendrogram and plot(). Often in text mining, you can tease out some interesting insights or word clusters based on a dendrogram.

Consider the table of annual rainfall . Cleveland and Portland have the same amount of rainfall, so their distance is 0. You might expect the two cities to be a cluster and for New Orleans to be on its own since it gets vastly more rain.

city rainfall
  Cleveland    39.14
   Portland    39.14
     Boston    43.77
New Orleans    62.45
In [242]:
# Import
# sheetIndex = 3 for
rain <- read.xlsx("Text.xls", sheetIndex = 3)

rain$city <- as.character(rain$city)

str(rain)
'data.frame':	4 obs. of  2 variables:
 $ city    : chr  "Cleveland" "Portland" "Boston" "New Orleans"
 $ rainfall: num  39.1 39.1 43.8 62.5
In [89]:
# Create dist_rain
dist_rain <- dist(rain[,2])

# View the distance matrix
dist_rain
      1     2     3
2  0.00            
3  4.63  4.63      
4 23.31 23.31 18.68
In [90]:
# Create hc
hc <- hclust(dist_rain)

# Plot hc
plot(hc, labels = rain$city)

Make a distance matrix and dendrogram from a TDM

First, limit the number of words in your TDM using removeSparseTerms() from tm. Why would you want to adjust the sparsity of the TDM/DTM?

TDMs and DTMs are sparse, meaning they contain mostly zeros. 1000 tweets can become a TDM with over 3000 terms!

A good TDM has between 25 and 70 terms. The lower the sparse value, the more terms are kept. The closer it is to 1, the fewer are kept. This value is a percentage cutoff of zeros for each term in the TDM.

In [135]:
# Print the dimensions of coffee_tdm
dim(coffee_tdm)
  1. 3076
  2. 1000
In [94]:
# Create tdm1
tdm1 <- removeSparseTerms(coffee_tdm, sparse = 0.95) # closer to 1, the fewer are kept

# Create tdm2
tdm2 <- removeSparseTerms(coffee_tdm, sparse = 0.975)

# Print tdm1
tdm1
<<TermDocumentMatrix (terms: 6, documents: 1000)>>
Non-/sparse entries: 418/5582
Sparsity           : 93%
Maximal term length: 7
Weighting          : term frequency (tf)
In [95]:
# Print tdm2
tdm2
<<TermDocumentMatrix (terms: 40, documents: 1000)>>
Non-/sparse entries: 1646/38354
Sparsity           : 96%
Maximal term length: 13
Weighting          : term frequency (tf)

Put it all together: a text based dendrogram

Dendrograms reduce information to help you make sense of the data. This is much like how an average tells you something, but not everything, about a population. Both can be misleading. With text, there are often a lot of nonsensical clusters, but some valuable clusters may also appear.

A peculiarity of TDM and DTM objects is that you have to convert them first to matrices (with as.matrix()), then to data frames (with as.data.frame()), before using them with the dist() function.

In [96]:
# Create coffee_tdm2
coffee_tdm2 <- removeSparseTerms(coffee_tdm, sparse = 0.975)

# Create coffee_m
coffee_m <- as.matrix(coffee_tdm2)
# terms decrease, sparsity decrease, max term length decreases

# Create coffee_df
coffee_df <- as.data.frame(coffee_m)

# Create coffee_dist
coffee_dist <- dist(coffee_df)

# Create hc
hc <- hclust(coffee_dist)

# Plot the dendrogram
plot(hc)

Dendrogram aesthetics

The dendextend package can help your audience by coloring branches and outlining clusters. dendextend is designed to operate on dendrogram objects, so you'll have to change the hierarchical cluster from hclust using as.dendrogram().

A good way to review the terms in your dendrogram is with the labels() function. It will print all terms of the dendrogram. To highlight specific branches, use branches_attr_by_labels(). First, pass in the dendrogram object, then a vector of terms as in c("data", "camp"). Lastly add a color such as "blue".

After you make your plot, you can call out clusters with rect.dendrogram(). This adds rectangles for each cluster. The first argument to rect.dendrogram() is the dendrogram, followed by the number of clusters (k). You can also pass a border argument specifying what color you want the rectangles to be (e.g. "green").

In [98]:
#install the dendextend package
# Load it
library(dendextend)
In [102]:
hc <- hclust(coffee_dist)

# Create hcd
hcd <- as.dendrogram(hc)

# Print the labels in hcd
labels(hcd)
  1. 'cup'
  2. 'like'
  3. 'shop'
  4. 'looks'
  5. 'show'
  6. 'hgtv'
  7. 'renovation'
  8. 'charlie'
  9. 'hosting'
  10. 'working'
  11. 'portland'
  12. 'movethesticks'
  13. 'whitehurst'
  14. 'just'
  15. 'get'
  16. 'good'
  17. 'morning'
  18. 'want'
  19. 'tea'
  20. 'drinking'
  21. 'can'
  22. 'starbucks'
  23. 'think'
  24. 'iced'
  25. 'half'
  26. 'chemicals'
  27. 'cancer'
  28. 'tested'
  29. '1000'
  30. 'single'
  31. 'need'
  32. 'ice'
  33. 'much'
  34. 'amp'
  35. 'now'
  36. 'right'
  37. 'love'
  38. 'make'
  39. 'dont'
  40. 'drink'
In [103]:
# Change the branch color to red
hcd <- branches_attr_by_labels(hcd, c("good", "morning"), col = 'red')

# Plot hcd
plot(hcd, main = "Better Dendrogram")

# Add cluster rectangles (k for cluster = 2 clusters with grey borders)
rect.dendrogram(hcd, k = 2, border = 'grey50')

Using word association

Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together, while a score of 0 means that they never appear together.

To use findAssocs() pass in a TDM or DTM, the search term, and a minimum correlation. The function will return a list of all other terms that meet or exceed the minimum threshold.

findAssocs(tdm, "word", 0.25)

Minimum correlation values are often relatively low because of word diversity. Don't be surprised if 0.10 demonstrates a strong pairwise term association.

For plotting, more ggplot2 themes are available on GitHub. Here is a list:

Geoms

  • geom_rangeframe : Tufte's range frame
  • geom_tufteboxplot: Tufte's box plot

Themes

  • theme_base: a theme resembling the default base graphics in R. See also theme_par.
  • theme_calc: a theme based on LibreOffice Calc.
  • theme_economist: a theme based on the plots in the The Economist magazine.
  • theme_excel: a theme replicating the classic ugly gray charts in Excel
  • theme_few: theme from Stephen Few's "Practical Rules for Using Color in Charts".
  • theme_fivethirtyeight: a theme based on the plots at fivethirtyeight.com.
  • theme_gdocs: a theme based on Google Docs.
  • theme_hc: a theme based on Highcharts JS.
  • theme_par: a theme that uses the current values of the base graphics parameters in par.
  • theme_pander: a theme to use with the pander package.
  • theme_solarized: a theme using the solarized color palette.
  • theme_stata: themes based on Stata graph schemes.
  • theme_tufte: a minimal ink theme based on Tufte's The Visual Display of Quantitative Information.
  • theme_wsj: a theme based on the plots in the The Wall Street Journal.

Scales

  • scale_colour_calc, scale_shape_calc: color and shape palettes from LibreOffice Calc.
  • scale_colour_colorblind: Colorblind safe palette from http://jfly.iam.u-tokyo.ac.jp/color/.
  • scale_colour_economist: colors used in plots in plots in The Economist.
  • scale_colour_excel: colors from new and old Excel.
  • scale_colour_few: color palettes from Stephen Few's "Practical Rules for Using Color in Charts".
  • scale_colour_gdocs: color palette from Google Docs.
  • scale_colour_hc: a theme based on Highcharts JS.
  • scale_colour_solarized: Solarized colors
  • scale_colour_stata, scale_shapes_stata, scale_linetype_stata: color, shape, and linetype palettes from Stata graph schemes.
  • scale_colour_tableau, scale_shape_tableau: color and shape palettes from Tableau.
  • scale_colour_pander, scale_fill_pander: scales to use with the pander package.
  • scale_colour_ptol, scale_fill_ptol: color palettes from Paul Tol's Colour Schemes
  • scale_shape_cleveland, scale_shape_tremmel, scale_shape_circlefill: shape scales from classic works in visual perception: Cleveland, Tremmel (1995), and Lewandowsky and Spence (1989).

Most of these scales also have associates palettes, as used in the scales package.

In [104]:
# Create associations (minimum correlations)
associations <- findAssocs(coffee_tdm, 'venti', 0.20)

# View the venti associations
associations
$venti =
breve
0.58
drizzle
0.58
entire
0.58
pumps
0.58
extra
0.47
cuz
0.41
forget
0.41
okay
0.41
hyper
0.33
mocha
0.33
vanilla
0.33
wtf
0.29
always
0.26
asleep
0.26
get
0.25
starbucks
0.25
white
0.23
In [111]:
# install, load ggplot2
library(ggplot2)

# install, load themes
library(ggthemes)
In [112]:
# Create associations_df (col 2 & 3 of the d.f)
associations_df <- list_vect2df(associations)[,2:3]

# Plot the associations_df values
# the gdocs theme is similar to Google Docs
ggplot(associations_df, 
        aes(y = associations_df[, 1])) + 
        geom_point(aes(x = associations_df[, 2]), 
             data = associations_df, size = 3) +
        theme_gdocs()

Changing n-grams

So far, we have only made TDMs and DTMs using single words. The default is to make them with unigrams, but you can also focus on tokens containing two or more words. This can help extract useful phrases which lead to some additional insights or provide improved predictive attributes for a machine learning algorithm.

The function below uses the RWeka package to create trigram (three word) tokens: min and max are both set to 3.

tokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 3, max = 3))

Then the customized tokenizer() function can be passed into the TermDocumentMatrix or DocumentTermMatrix functions as an additional parameter:

tdm <- TermDocumentMatrix(
  corpus, 
  control = list(tokenize = tokenizer)
)
In [119]:
# install and load the RWeka package
library(RWeka)
In [120]:
# Make tokenizer function 
tokenizer <- function(x) NGramTokenizer(
    x, Weka_control(min = 2, max = 2))
In [121]:
# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(clean_corp)

# Examine unigram_dtm
unigram_dtm
<<DocumentTermMatrix (documents: 1000, terms: 3076)>>
Non-/sparse entries: 7391/3068609
Sparsity           : 100%
Maximal term length: 27
Weighting          : term frequency (tf)
In [122]:
# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(clean_corp,
                                 control = list(tokenize = tokenizer))

# Examine bigram_dtm
bigram_dtm
<<DocumentTermMatrix (documents: 1000, terms: 5561)>>
Non-/sparse entries: 7226/5553774
Sparsity           : 100%
Maximal term length: 41
Weighting          : term frequency (tf)

How do bigrams affect word clouds?

The new tokenization method affects not only the matrices, but also any visuals or modeling based on the matrices.

Using bigram tokenization grabs all two word combinations. Observe what happens to the word cloud in this exercise.

In [124]:
# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)

# Create freq (colSums: i a dtm, terms are in columns, you want the sum of the columns)
freq <- colSums(bigram_dtm_m)

# Create bi_words (2-grams)
bi_words <- names(freq)

# Examine part of bi_words (pairs of words)
bi_words[2577:2587]
  1. 'jive loves'
  2. 'joannnakatarina quit'
  3. 'job samshearer1'
  4. 'job sound'
  5. 'joe take'
  6. 'john 1026'
  7. 'johnbirmingham damiansharry'
  8. 'join silent'
  9. 'join us'
  10. 'joke takes'
  11. 'jolenejjfo mean'
In [140]:
# Plot a wordcloud
wordcloud(bi_words, freq, max.words = 10)

In another program, removing the maximum words gives this result (way more contrasting!):

Changing frequency weights

So far you have used term frequency to make the DocumentTermMatrix or TermDocumentMatrix. There are other term weights that can be helpful. The most popular weight is TfIdf, which stands for 'term frequency-inverse document frequency'.

The TfIdf score increases by term occurrence but is penalized by the frequency of appearance among all documents.

From a common sense perspective, if a term appears often it must be important. This attribute is represented by term frequency (i.e. Tf), which is normalized by the length of the document. However, if the term appears in all documents, it is not likely to be insightful. This is captured in the inverse document frequency (i.e. Idf).

The wiki page on TfIdf contains the mathematical explanation behind the score, but the exercise will demonstrate the practical difference.

In [141]:
# Create tf_tdm
tf_tdm <- TermDocumentMatrix(clean_corp)

# Create tfidf_tdm
tfidf_tdm <- TermDocumentMatrix(clean_corp, 
                                control = list(weighting = weightTfIdf))

# Create tf_tdm_m
tf_tdm_m <- as.matrix(tf_tdm)

# Examine part of tf_tdm_m
tf_tdm_m[508:518, 5:10]
Warning message in weighting(x):
"empty document(s): 92 413 627 894"
5678910
coachbayergc000000
coast000000
cocktail010000
cocoa000000
cocobear20000000
coconut000000
codagogy000000
codealan000000
coffeeaddict000000
coffeeboy25000000
coffeebreakfast000000
In [130]:
# Create tfidf_tdm_m 
tfidf_tdm_m <- as.matrix(tfidf_tdm)

# Examine part of tfidf_tdm_m
tfidf_tdm_m[508:518, 5:10]
5678910
coachbayergc0 0.0000000 0 0 0
coast0 0.0000000 0 0 0
cocktail0 1.9931570 0 0 0
cocoa0 0.0000000 0 0 0
cocobear200 0.0000000 0 0 0
coconut0 0.0000000 0 0 0
codagogy0 0.0000000 0 0 0
codealan0 0.0000000 0 0 0
coffeeaddict0 0.0000000 0 0 0
coffeeboy250 0.0000000 0 0 0
coffeebreakfast0 0.0000000 0 0 0

Capturing metadata in tm

Depending on what you are trying to accomplish, you may want to keep metadata about the document when you create a TDM or DTM. This metadata can be incorporated into the corpus fairly easily by creating a readerControl list and applying it to a DataframeSource when calling VCorpus().

The data frame contains the metadata to be captured. The names() function is helpful for this.

To capture the text column of the coffee tweets text along with a metadata column of unique numbers called num you would use the code below.

custom_reader <- readTabular(
  mapping = list(content = "text", id = "num")
)
text_corpus <- VCorpus(
  DataframeSource(tweets), 
  readerControl = list(reader = custom_reader)
)
In [219]:
str(tweets)
'data.frame':	1000 obs. of  15 variables:
 $ num         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ text        : chr  "@ayyytylerb that is so true drink lots of coffee" "RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early, make yo coffee/breakfast, cus this will "| __truncated__ "If you believe in #gunsense tomorrow would be a very good day to have your coffee any place BUT @Starbucks Guns+Coffee=#nosense"| __truncated__ "My cute coffee mug. http://t.co/2udvMU6XIG" ...
 $ favorited   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSN   : chr  "ayyytylerb" "<NA>" "<NA>" "<NA>" ...
 $ created     : chr  "8/9/2013 02:43:00" "8/9/2013 02:43:00" "8/9/2013 02:43:00" "8/9/2013 02:43:00" ...
 $ truncated   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID  : num  3.66e+17 NA NA NA NA ...
 $ id          : num  3.66e+17 3.66e+17 3.66e+17 3.66e+17 3.66e+17 ...
 $ replyToUID  : int  43 176 176 176 176 176 176 13 176 176 ...
 $ statusSource: chr  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "web" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
 $ screenName  : chr  "thejennagibson" "carolynicosia" "janeCkay" "AlexandriaOOTD" ...
 $ retweetCount: num  0 1 0 0 2 0 0 0 1 2 ...
 $ retweeted   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ longitude   : logi  NA NA NA NA NA NA ...
 $ latitude    : logi  NA NA NA NA NA NA ...
In [227]:
# Add author to custom reading list
custom_reader <- readTabular(mapping = list(content = "text", 
                                            id = "num",
                                            author = "screenName",
                                            date = "created"))
custom_reader
function (elem, language, id) 
{
    meta <- lapply(mapping[setdiff(names(mapping), "content")], 
        function(m) elem$content[, m])
    if (is.null(meta$id)) 
        meta$id <- as.character(id)
    if (is.null(meta$language)) 
        meta$language <- as.character(language)
    PlainTextDocument(elem$content[, mapping$content], meta = meta)
}
In [231]:
# Make corpus with custom reading
text_corpus <- VCorpus(DataframeSource(tweets), 
                       readerControl = list(reader = custom_reader))
In [234]:
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
  return(corpus)
}
In [235]:
# Clean corpus
text_corpus <- clean_corpus(text_corpus)

# Print data
text_corpus[[1]][1]
$content = 'ayyytylerb true drink lots coffee'
In [236]:
text_corpus[[1]]
<<PlainTextDocument>>
Metadata:  4
Content:  chars: 37
In [237]:
# Print metadata
text_corpus[[1]][2]
$meta
  id      : 1
  author  : thejennagibson
  date    : 8/9/2013 02:43:00
  language: en


Battle of the tech giants for talent (case)

Step 1: Problem definition

Does Amazon or Google have a better perceived pay according to online reviews? Does Amazon or Google have a better work-life balance according to current employees?

Step 2: Identifying the text sources

Employee reviews can come from various sources. If your human resources department had the resources, you could have a third party administer focus groups to interview employees both internally and from your competitor.

Forbes and others publish articles about the "best places to work", which may mention Amazon and Google. Another source of information might be anonymous online reviews from websites like Indeed, Glassdoor or CareerBliss.

Here, we'll focus on a collection of anonymous online reviews.

In [247]:
# Import data
# sheetIndex = 5 and 6 for amzn and goog
amzn <- read.xlsx("Text.xls", sheetIndex = 5)
goog <- read.xlsx("Text.xls", sheetIndex = 6)
In [245]:
# Print the structure of amzn
amzn$url <- as.character(amzn$url)
amzn$pros <- as.character(amzn$pros)
amzn$cons <- as.character(amzn$cons)
str(amzn)
'data.frame':	500 obs. of  4 variables:
 $ pg_num: num  1 2 3 4 5 6 7 8 9 10 ...
 $ url   : chr  "50 https://www.glassdoor.com/Reviews/Amazon-com-Reviews-E6036_P50.htm" "50 https://www.glassdoor.com/Reviews/Amazon-com-Reviews-E6036_P50.htm" "50 https://www.glassdoor.com/Reviews/Amazon-com-Reviews-E6036_P50.htm" "50 https://www.glassdoor.com/Reviews/Amazon-com-Reviews-E6036_P50.htm" ...
 $ pros  : chr  "You're surrounded by smart people and the projects are interesting, if a little daunting." "Brand name is great. Have yet to meet somebody who is unfamiliar with Amazon. Hours weren't as bad as I had previously heard. B"| __truncated__ "Good money.Interaction with some great minds in the world during internal conferences and sessions.Of course the pride of being"| __truncated__ "nice pay and overtime and different shifts" ...
 $ cons  : chr  "Internal tools proliferation has created a mess for trying to get to basic information. Most people are required to learn/under"| __truncated__ "not the most stimulating work. Good brand name to work for but the work itself is mundane as it can get. As a financial analyst"| __truncated__ "No proper growth plan for employees.Difficult promotion process requiring a lot more documentation than your actual deliverable"| __truncated__ "didn't last quite long enough" ...
In [250]:
# Create amzn_pros
amzn_pros <- amzn$pros

# Create amzn_cons
amzn_cons <- amzn$cons
In [255]:
# Print the structure of goog
goog$url <- as.character(goog$url)
goog$pros <- as.character(goog$pros)
goog$cons <- as.character(goog$cons)
str(goog)
'data.frame':	501 obs. of  4 variables:
 $ pg_num: num  1 2 3 4 5 6 7 8 9 10 ...
 $ url   : chr  "1  https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P1.htm" "1  https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P1.htm" "1  https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P1.htm" "1  https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P1.htm" ...
 $ pros  : chr  "* If you're a software engineer, you're among the kings of the hill at Google. It's an engineer-driven company without a doubt "| __truncated__ "1) Food, food, food. 15+ cafes on main campus (MTV) alone. Mini-kitchens, snacks, drinks, free breakfast/lunch/dinner, all day,"| __truncated__ "You can't find a more well-regarded company that actually deserves the hype it gets." "- You drive yourself here. If you want to grow, you have to seek out opportunities and prove that your worth. This keeps you mo"| __truncated__ ...
 $ cons  : chr  "* It *is* becoming larger, and with it comes growing pains: bureaucracy, slow to respond to market threats, bloated teams, cros"| __truncated__ "1) Work/life balance. What balance? All those perks and benefits are an illusion. They keep you at work and they help you to be"| __truncated__ "I live in SF so the commute can take between 1.5 hours to 1.75 hours each way on the shuttle - sometimes 2 hours each way on a "| __truncated__ "- Google is a big company. So there are going to be winners and losers when it comes to career growth. Due to the high hiring b"| __truncated__ ...
In [265]:
# Create goog_pros
goog_pros <- goog$pros

# Create goog_cons
goog_cons <- goog$cons
'character'

Step 3: Text organization

Now that you have selected the exact text sources, you are ready to clean them up.

In [267]:
clean_stuff <- function(x){
    x <- replace_abbreviation(x)
    x <- replace_contraction(x)
    x <- replace_number(x)
    x <- replace_ordinal(x)
    x <- replace_ordinal(x)
    x <- replace_symbol(x)
    x <- tolower(x)
    return(x)
}

Working with Amazon reviews

In [268]:
# Alter amzn_pros
# replace abbreviations, contractions, numbers, ordinals, symbols...
amzn_pros <- clean_stuff(amzn_pros)

# Alter amzn_cons
amzn_cons <- clean_stuff(amzn_cons)
In [310]:
# Create az_p_corp 
az_p_corp <- VCorpus(VectorSource(amzn_pros))

# Create az_c_corp
az_c_corp <- VCorpus(VectorSource(amzn_cons))
In [273]:
tm_clean <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, 
                   c(stopwords("en"), "Google", "Amazon", "company"))
  return(corpus)
}
In [274]:
# Create amzn_pros_corp
# remove punctuation, strip white spaces, remove words and stopwords
amzn_pros_corp <- tm_clean(az_p_corp)

# Create amzn_cons_corp
amzn_cons_corp <- tm_clean(az_c_corp)

Working with Google reviews

Now that the Amazon reviews have been cleaned, the same must be done for the Google reviews.

In [275]:
# Alter amzn_pros
# replace abbreviations, contractions, numbers, ordinals, symbols...
goog_pros <- clean_stuff(goog_pros)

# Alter amzn_cons
goog_cons <- clean_stuff(goog_cons)
In [276]:
# Create az_p_corp 
goog_p_corp <- VCorpus(VectorSource(goog_pros))

# Create az_c_corp
goog_c_corp <- VCorpus(VectorSource(goog_cons))
In [277]:
# Create amzn_pros_corp
# remove punctuation, strip white spaces, remove words and stopwords
goog_pros_corp <- tm_clean(goog_p_corp)

# Create amzn_cons_corp
goog_cons_corp <- tm_clean(goog_c_corp)

Steps 4 & 5 : Feature extraction & analysis

amzn_pros_corp, amzn_cons_corp, goog_pros_corp and goog_cons_corp have all been preprocessed, so now you can extract the features you want to examine. Since you are using the bag of words approach, you decide to create a bigram TermDocumentMatrix for Amazon's positive reviews corpus, amzn_pros_corp. From this, you can quickly create a wordcloud() to understand what phrases people positively associate with working at Amazon.

The function below uses RWeka to tokenize two terms.

In [282]:
tokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 2, max = 2))

Feature extraction & analysis: amzn_cons

In [283]:
# Create amzn_p_tdm
# bigrams
amzn_p_tdm <- TermDocumentMatrix(amzn_pros_corp, control = list(tokenize = tokenizer))

# Create amzn_p_tdm_m
amzn_p_tdm_m <- as.matrix(amzn_p_tdm)

# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_tdm_m)

# Plot a wordcloud using amzn_p_freq values
wordcloud(names(amzn_p_freq), amzn_p_freq, max.words = 25, col = "blue")

Feature extraction & analysis: amzn_cons

In [284]:
# Create amzn_c_tdm
# bigrams
amzn_c_tdm <- TermDocumentMatrix(amzn_cons_corp, control = list(tokenize = tokenizer))

# Create amzn_c_tdm_m
amzn_c_tdm_m <- as.matrix(amzn_c_tdm)

# Create amzn_c_freq
amzn_c_freq <- rowSums(amzn_c_tdm_m)

# Plot a wordcloud of negative Amazon bigrams
wordcloud(names(amzn_c_freq), amzn_c_freq, max.words = 25, col = "red")

Feature extraction & analysis: goog_pros

In [285]:
# Create goog_p_tdm
# bigrams
goog_p_tdm <- TermDocumentMatrix(goog_pros_corp, control = list(tokenize = tokenizer))

# Create amzn_p_tdm_m
goog_p_tdm_m <- as.matrix(goog_p_tdm)

# Create amzn_p_freq
goog_p_freq <- rowSums(goog_p_tdm_m)

# Plot a wordcloud using amzn_p_freq values
wordcloud(names(goog_p_freq), goog_p_freq, max.words = 25, col = "blue")

Feature extraction & analysis: goog_cons

In [287]:
# Create goog_c_tdm
# bigrams
goog_c_tdm <- TermDocumentMatrix(goog_cons_corp, control = list(tokenize = tokenizer))

# Create amzn_c_tdm_m
goog_c_tdm_m <- as.matrix(goog_c_tdm)

# Create amzn_c_freq
goog_c_freq <- rowSums(goog_c_tdm_m)

# Plot a wordcloud of negative Amazon bigrams
wordcloud(names(goog_c_freq), goog_c_freq, max.words = 25, col = "red")

amzn_cons dendrogram

It seems there is a strong indication of long working hours and poor work-life balance in the reviews. As a simple clustering technique, you decide to perform a hierarchical cluster and create a dendrogram to see how connected these phrases are.

In [291]:
# Create amzn_c_tdm
# bigrams
amzn_c_tdm <- TermDocumentMatrix(amzn_cons_corp, control = list(tokenize = tokenizer))

# Print amzn_c_tdm to the console
amzn_c_tdm

# Create amzn_c_tdm2 by removing sparse terms 
amzn_c_tdm2 <- removeSparseTerms(amzn_c_tdm, sparse = 0.993)

# Create hc as a cluster of distance values
hc <- hclust(dist(amzn_c_tdm2, method = "euclidean"), method = "complete")

# Produce a plot of hc
plot(hc)
<<TermDocumentMatrix (terms: 4778, documents: 500)>>
Non-/sparse entries: 5220/2383780
Sparsity           : 100%
Maximal term length: 31
Weighting          : term frequency (tf)

amzn_pros dendrogram

In [292]:
# Create amzn_c_tdm
# bigrams
amzn_p_tdm <- TermDocumentMatrix(amzn_pros_corp, control = list(tokenize = tokenizer))

# Print amzn_c_tdm to the console
amzn_p_tdm

# Create amzn_c_tdm2 by removing sparse terms 
amzn_p_tdm2 <- removeSparseTerms(amzn_p_tdm, sparse = 0.993)

# Create hc as a cluster of distance values
hc <- hclust(dist(amzn_p_tdm2, method = "euclidean"), method = "complete")

# Produce a plot of hc
plot(hc)
<<TermDocumentMatrix (terms: 4091, documents: 500)>>
Non-/sparse entries: 4824/2040676
Sparsity           : 100%
Maximal term length: 32
Weighting          : term frequency (tf)

goog_cons dendrogram

In [293]:
# Create amzn_c_tdm
# bigrams
goog_c_tdm <- TermDocumentMatrix(goog_cons_corp, control = list(tokenize = tokenizer))

# Print amzn_c_tdm to the console
goog_c_tdm

# Create amzn_c_tdm2 by removing sparse terms 
goog_c_tdm2 <- removeSparseTerms(goog_c_tdm, sparse = 0.993)

# Create hc as a cluster of distance values
hc <- hclust(dist(goog_c_tdm2, method = "euclidean"), method = "complete")

# Produce a plot of hc
plot(hc)
<<TermDocumentMatrix (terms: 3952, documents: 501)>>
Non-/sparse entries: 4509/1975443
Sparsity           : 100%
Maximal term length: 29
Weighting          : term frequency (tf)

goog_pros dendrogram

In [294]:
# Create amzn_c_tdm
# bigrams
goog_p_tdm <- TermDocumentMatrix(goog_pros_corp, control = list(tokenize = tokenizer))

# Print amzn_c_tdm to the console
goog_p_tdm

# Create amzn_c_tdm2 by removing sparse terms 
goog_p_tdm2 <- removeSparseTerms(goog_p_tdm, sparse = 0.993)

# Create hc as a cluster of distance values
hc <- hclust(dist(goog_p_tdm2, method = "euclidean"), method = "complete")

# Produce a plot of hc
plot(hc)
<<TermDocumentMatrix (terms: 3245, documents: 501)>>
Non-/sparse entries: 4159/1621586
Sparsity           : 100%
Maximal term length: 33
Weighting          : term frequency (tf)

Word association

As expected, you see similar topics throughout the dendrogram. Switching back to positive comments, you decide to examine top phrases that appeared in the word clouds. You hope to find associated terms using the findAssocs() function from tm. You want to check for something surprising now that you have learned of long hours and a lack of work-life balance.

amzn_c_tdm

In [299]:
# Create amzn_p_tdm
# bigrams
amzn_c_tdm <- TermDocumentMatrix(amzn_cons_corp, 
    control = list(tokenize = tokenizer))

# Create amzn_p_m
amzn_c_m <- as.matrix(amzn_c_tdm)

# Create amzn_p_freq
amzn_c_freq <- rowSums(amzn_c_m)

# Create term_frequency
term_frequency <- sort(amzn_c_freq, decreasing = TRUE)

# Print the 5 most common terms
term_frequency[1:5]

# Find associations with fast paced
findAssocs(amzn_p_tdm, "fast paced", 0.20)
long hours
29
work life
21
worklife balance
21
life balance
20
can get
9
$`fast paced` =
paced environment
0.49
environments ever
0.35
learn fast
0.35
paced friendly
0.35
paced work
0.35
able excel
0.25
activity ample
0.25
advance one
0.25
also well
0.25
amazon fast
0.25
amazon noting
0.25
amazon one
0.25
amount time
0.25
ample opportunity
0.25
assistance ninety
0.25
benefits including
0.25
break computer
0.25
call activity
0.25
can choose
0.25
catchy cheers
0.25
center things
0.25
challenging expect
0.25
cheers opportunity
0.25
choose success
0.25
combined encouragement
0.25
competitive environments
0.25
computer room
0.25
cool things
0.25
deliver results
0.25
dock makes
0.25
driven deliver
0.25
easy learn
0.25
emphasis shipping
0.25
encouragement innovation
0.25
environment benefits
0.25
environment catchy
0.25
environment center
0.25
environment fast
0.25
environment help
0.25
environment smart
0.25
ever known
0.25
ever witnessed
0.25
everchanging fast
0.25
everyones preferences
0.25
excel advance
0.25
excel everchanging
0.25
exciting environment
0.25
expect learn
0.25
extremely fast
0.25
facility top
0.25
fail successful
0.25
fantastic able
0.25
fired part
0.25
five percent
0.25
freindly place
0.25
friendly atmosphere
0.25
friendly management
0.25
full medical
0.25
get fired
0.25
go extremely
0.25
great plenty
0.25
great teamwork
0.25
happening technology
0.25
hassle benefits
0.25
help get
0.25
help workers
0.25
high quality
0.25
high volume
0.25
including full
0.25
innovation owning
0.25
job requirements
0.25
leader can
0.25
line break
0.25
lot responsibility
0.25
maintaining high
0.25
makes time
0.25
management nice
0.25
nice facility
0.25
ninety five
0.25
noting short
0.25
offers opportunity
0.25
one competitive
0.25
one fast
0.25
opportunity overtime
0.25
opportunity yell
0.25
ownership fast
0.25
owning work
0.25
paced emphasis
0.25
paced exciting
0.25
paced high
0.25
paced never
0.25
paced rewarding
0.25
paced ship
0.25
paced software
0.25
paid upfront
0.25
people focused
0.25
percent paid
0.25
plenty shifts
0.25
position fast
0.25
possible still
0.25
preferences fast
0.25
products quickly
0.25
quality bar
0.25
quickly possible
0.25
readily available
0.25
requirements easy
0.25
responsibility ownership
0.25
results great
0.25
results team
0.25
rewarding people
0.25
shifts everyones
0.25
ship dock
0.25
shipping products
0.25
short amount
0.25
short fantastic
0.25
smart coworkers
0.25
still maintaining
0.25
success fail
0.25
successful also
0.25
team driven
0.25
technology today
0.25
things happening
0.25
things lot
0.25
time fast
0.25
time go
0.25
top line
0.25
upfront experience
0.25
vision well
0.25
volume call
0.25
well rewarded
0.25
well tuition
0.25
witnessed combined
0.25
work can
0.25
work cool
0.25
work environments
0.25
work fast
0.25
work job
0.25
workers readily
0.25
yell leader
0.25

amzn_p_tdm

In [302]:
# Create amzn_p_tdm
# bigrams
amzn_p_tdm <- TermDocumentMatrix(amzn_pros_corp, 
    control = list(tokenize = tokenizer))

# Create amzn_p_m
amzn_p_m <- as.matrix(amzn_p_tdm)

# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_m)

# Create term_frequency
term_frequency <- sort(amzn_p_freq, decreasing = TRUE)

# Print the 5 most common terms
term_frequency[1:5]

# Find associations with fast paced
# let's skip it to be brief
#findAssocs(amzn_p_tdm, "fast paced", 0.20)
good pay
25
great benefits
24
smart people
20
place work
17
fast paced
16

goog_c_tdm

In [303]:
# Create amzn_p_tdm
# bigrams
goog_c_tdm <- TermDocumentMatrix(goog_cons_corp, 
    control = list(tokenize = tokenizer))

# Create amzn_p_m
goog_c_m <- as.matrix(goog_c_tdm)

# Create amzn_p_freq
goog_c_freq <- rowSums(goog_c_m)

# Create term_frequency
term_frequency <- sort(goog_c_freq, decreasing = TRUE)

# Print the 5 most common terms
term_frequency[1:5]

# Find associations with fast paced
# let's skip it to be brief
#findAssocs(amzn_p_tdm, "fast paced", 0.20)
wo hundred
40
hree hundred
28
hundred forty
28
two wo
28
hundred two
16

goog_p_tdm

In [304]:
# Create amzn_p_tdm
# bigrams
goog_p_tdm <- TermDocumentMatrix(goog_pros_corp, 
    control = list(tokenize = tokenizer))

# Create amzn_p_m
goog_p_m <- as.matrix(goog_p_tdm)

# Create amzn_p_freq
goog_p_freq <- rowSums(goog_p_m)

# Create term_frequency
term_frequency <- sort(goog_p_freq, decreasing = TRUE)

# Print the 5 most common terms
term_frequency[1:5]

# Find associations with fast paced
# let's skip it to be brief
#findAssocs(amzn_p_tdm, "fast paced", 0.20)
smart people
42
free food
41
place work
26
great benefits
22
great perks
20

Quick review of Google reviews

Create a comparison.cloud() of Google's positive and negative reviews for comparison to Amazon. This will give you a quick understanding of top terms without having to spend as much time as you did examining the Amazon reviews in the previous exercises.

all_amzn_corpus and all_goog_corpus both have the 500 positive and 500 negative reviews. Clean the corpus and create a comparison cloud comparing the common words in both pro and con reviews.

Load Amazon data

And inspect them.

In [345]:
all_amzn_corpus <- Corpus(DirSource("Amzn/", encoding = "UTF-8", mode = 'text'))
inspect(all_amzn_corpus)
inspect(all_amzn[1])
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 59576

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 53188

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 59576

In [344]:
str(all_amzn_corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 59576

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 53188

List of 2
 $ Amzn_cons.txt:List of 2
  ..$ content: chr [1:500] "Internal tools proliferation has created a mess for trying to get to basic information. Most people are required to learn/under"| __truncated__ "not the most stimulating work. Good brand name to work for but the work itself is mundane as it can get. As a financial analyst"| __truncated__ "No proper growth plan for employees.Difficult promotion process requiring a lot more documentation than your actual deliverable"| __truncated__ "didn't last quite long enough" ...
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-11-16 14:10:16"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "Amzn_cons.txt"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 $ Amzn_pros.txt:List of 2
  ..$ content: chr [1:500] "You're surrounded by smart people and the projects are interesting, if a little daunting." "Brand name is great. Have yet to meet somebody who is unfamiliar with Amazon. Hours weren't as bad as I had previously heard. B"| __truncated__ "Good money.Interaction with some great minds in the world during internal conferences and sessions.Of course the pride of being"| __truncated__ "nice pay and overtime and different shifts" ...
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-11-16 14:10:16"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "Amzn_pros.txt"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"

Load Amazon data

And inspect them.

In [346]:
all_goog_corpus <- Corpus(DirSource("Goog/", encoding = "UTF-8", mode = 'text'))
inspect(all_goog_corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 52083

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 44933

Clean the data

In [336]:
# Back with the cleaning functions
clean_stuff <- function(x){
    x <- replace_abbreviation(x)
    x <- replace_contraction(x)
    x <- replace_number(x)
    x <- replace_ordinal(x)
    x <- replace_ordinal(x)
    x <- replace_symbol(x)
    x <- tolower(x)
    return(x)
}

tm_clean <- function(x){
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, removeWords, 
                   c(stopwords("en"), "Google", "Amazon", "company"))
  return(x)
}

Amazon

In [377]:
# Clean all_amzn_corp
all_amzn_corp <- tm_clean(all_amzn_corpus)

# Create all_tdm
all_amzn_tdm <- TermDocumentMatrix(all_amzn_corp)

str(all_amzn_tdm)
all_amzn_tdm[[6]][2]
List of 6
 $ i       : int [1:3464] 1 2 3 4 5 6 8 10 12 16 ...
 $ j       : int [1:3464] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:3464] 2 2 1 1 1 1 1 2 1 1 ...
 $ nrow    : int 2688
 $ ncol    : int 2
 $ dimnames:List of 2
  ..$ Terms: chr [1:2688] "100" "1012" "10h" "10hour" ...
  ..$ Docs : chr [1:2] "Amzn_cons.txt" "Amzn_pros.txt"
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
$Docs =
  1. 'Amzn_cons.txt'
  2. 'Amzn_pros.txt'
In [378]:
# Name the tdm columns
colnames(all_amzn_tdm) <- c("Amzn_Cons", "Amzn_Pros")

str(all_amzn_tdm)
List of 6
 $ i       : int [1:3464] 1 2 3 4 5 6 8 10 12 16 ...
 $ j       : int [1:3464] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:3464] 2 2 1 1 1 1 1 2 1 1 ...
 $ nrow    : int 2688
 $ ncol    : int 2
 $ dimnames:List of 2
  ..$ Terms: chr [1:2688] "100" "1012" "10h" "10hour" ...
  ..$ Docs : chr [1:2] "Amzn_Cons" "Amzn_Pros"
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
In [379]:
# Create all_m
all_amzn_m <- as.matrix(all_amzn_tdm)

head(all_amzn_m)
Amzn_ConsAmzn_Pros
10021
101220
10h10
10hour10
125010
12hrsday10
In [380]:
# Build a comparison cloud
comparison.cloud(all_amzn_m, colors = c("#2196f3", "#F44336"), max.words =100)

Google

In [381]:
# Clean all_goog_corp
all_goog_corp <- tm_clean(all_goog_corpus)

# Create all_tdm
all_goog_tdm <- TermDocumentMatrix(all_goog_corp)

str(all_goog_tdm)
all_goog_tdm[[6]][2]
List of 6
 $ i       : int [1:2896] 2 3 5 6 7 8 11 12 13 15 ...
 $ j       : int [1:2896] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:2896] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2309
 $ ncol    : int 2
 $ dimnames:List of 2
  ..$ Terms: chr [1:2309] "100" "1000" "100k" "106" ...
  ..$ Docs : chr [1:2] "Goog_cons.txt" "Goog_pros.txt"
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
$Docs =
  1. 'Goog_cons.txt'
  2. 'Goog_pros.txt'
In [383]:
# Name the tdm columns
colnames(all_goog_tdm) <- c("Goog_Cons", "Goog_Pros")

str(all_goog_tdm)
List of 6
 $ i       : int [1:2896] 2 3 5 6 7 8 11 12 13 15 ...
 $ j       : int [1:2896] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:2896] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2309
 $ ncol    : int 2
 $ dimnames:List of 2
  ..$ Terms: chr [1:2309] "100" "1000" "100k" "106" ...
  ..$ Docs : chr [1:2] "Goog_Cons" "Goog_Pros"
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
In [384]:
# Create all_m
all_goog_m <- as.matrix(all_goog_tdm)

head(all_goog_m)
Goog_ConsGoog_Pros
10001
100010
100k10
10601
17510
20010
In [385]:
# Build a comparison cloud
comparison.cloud(all_goog_m, colors = c("#2196f3", "#F44336"), max.words =100)

Cage match!

Amazon vs. Google PRO reviews // CON reviews

Positive Amazon reviews appear to mention "good benefits" while the negative reviews focus on "work load" and "work-life balance" issues.

In contrast, Google's positive reviews mention "great food", "perks", "smart people", and "fun culture", among other things. The Google negative reviews discuss "politics", "getting big", "bureaucracy" and "middle management".

Make a pyramid plot lining up positive reviews for Amazon and Google to see the differences between any shared bigrams.

In [488]:
all_pro_corpus <- Corpus(DirSource("Pros/", encoding = "UTF-8", mode = 'text'))
In [518]:
all_con_corpus <- Corpus(DirSource("Cons/", encoding = "UTF-8", mode = 'text'))
In [487]:
tm_clean
function (x) 
{
    x <- tm_map(x, removePunctuation)
    x <- tm_map(x, stripWhitespace)
    x <- tm_map(x, removeWords, c(stopwords("en"), "Google", 
        "Amazon", "company"))
    return(x)
}
In [519]:
all_pro_corp <- tm_clean(all_pro_corpus)
all_con_corp <- tm_clean(all_con_corpus)
In [490]:
tokenizer
function (x) 
NGramTokenizer(x, Weka_control(min = 2, max = 2))
In [498]:
all_pro_tdm_bigram <- TermDocumentMatrix(all_pro_corp, 
                                          control = list(tokenize = tokenizer))

colnames(all_pro_tdm_bigram) <- c("AmznPros", "GoogPros")

str(all_pro_tdm_bigram)
List of 6
 $ i       : int [1:7766] 2 8 10 11 12 13 15 17 19 21 ...
 $ j       : int [1:7766] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:7766] 1 1 2 1 1 1 1 1 1 1 ...
 $ nrow    : int 7359
 $ ncol    : int 2
 $ dimnames:List of 2
  ..$ Terms: chr [1:7359] "1 countless" "1 employees" "1 food" "1 good" ...
  ..$ Docs : chr [1:2] "AmznPros" "GoogPros"
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
In [520]:
all_con_tdm_bigram <- TermDocumentMatrix(all_con_corp, 
                                          control = list(tokenize = tokenizer))

colnames(all_con_tdm_bigram) <- c("AmznCons", "GoogCons")
In [499]:
all_pro_m_bigram <- as.matrix(all_pro_tdm_bigram)

str(all_pro_m_bigram)
 num [1:7359, 1:2] 0 1 0 0 0 0 0 1 0 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ Terms: chr [1:7359] "1 countless" "1 employees" "1 food" "1 good" ...
  ..$ Docs : chr [1:2] "AmznPros" "GoogPros"
In [521]:
all_con_m_bigram <- as.matrix(all_con_tdm_bigram)
In [502]:
head(all_pro_m_bigram, 5)
AmznProsGoogPros
1 countless01
1 employees10
1 food01
1 good01
1 open01
In [525]:
# commonality cloud
commonality.cloud(all_pro_m_bigram, colors = "steelblue1", max.words = 100)
commonality.cloud(all_con_m_bigram, colors = "indianred1", max.words = 100)
Warning message in wordcloud(rownames(term.matrix)[freq > 0], freq[freq > 0], min.freq = 0, :
"worklife balance could not be fit on page. It will not be plotted."
In [526]:
# comparison cloud
comparison.cloud(all_pro_m_bigram, colors = c("orange", "blue"), max.words = 50)
comparison.cloud(all_con_m_bigram, colors = c("orange", "blue"), max.words = 50)
Warning message in comparison.cloud(all_con_m_bigram, colors = c("orange", "blue"), :
"nothing nothing could not be fit on page. It will not be plotted."Warning message in comparison.cloud(all_con_m_bigram, colors = c("orange", "blue"), :
"talented people could not be fit on page. It will not be plotted."
In [504]:
# pyramid plot
# install and load the plotrix package
library(plotrix)
In [527]:
# Create common_words
common_words_p <- subset(all_pro_m_bigram, 
                       all_pro_m_bigram[,1] > 0 & all_pro_m_bigram[,2] > 0)
common_words_c <- subset(all_con_m_bigram, 
                       all_con_m_bigram[,1] > 0 & all_con_m_bigram[,2] > 0)

# Create difference
difference_p <- abs(common_words_p[,1] - common_words_p[,2])
difference_c <- abs(common_words_c[,1] - common_words_c[,2])

# Add difference to common_words
common_words_p <- cbind(common_words_p, difference_p)
common_words_c <- cbind(common_words_c, difference_c)

# Order the data frame from most differences to least
common_words_p <- common_words_p[order(common_words_p[,3], decreasing = TRUE), ]
common_words_c <- common_words_c[order(common_words_c[,3], decreasing = TRUE), ]

# Create top15_df
top15_df_p <- data.frame(x = common_words_p[1:15,1], 
                       y = common_words_p[1:15,2], 
                       labels = rownames(common_words_p[1:15,]))
top15_df_c <- data.frame(x = common_words_c[1:15,1], 
                       y = common_words_c[1:15,2], 
                       labels = rownames(common_words_c[1:15,]))
In [529]:
# Create the pyramid plot
pyramid.plot(top15_df_p$x, 
             top15_df_p$y, 
             labels = top15_df_p$labels, 
             gap = 12, 
             top.labels = c("Amzn", "Pro Words", "Google"), 
             main = "Words in Common", 
             unit = NULL)
  1. 5.1
  2. 4.1
  3. 4.1
  4. 2.1
In [530]:
pyramid.plot(top15_df_c$x, 
             top15_df_c$y, 
             labels = top15_df_c$labels, 
             gap = 12, 
             top.labels = c("Amzn", "Con Words", "Google"), 
             main = "Words in Common", 
             unit = NULL)
  1. 5.1
  2. 4.1
  3. 4.1
  4. 2.1

Conclusions

Interestingly, some Amazon employees discussed "work-life balance" as a negative. In both organizations, people mentioned "culture" and "smart people", so there are some similar positive aspects between the two companies.

Neg...

Step 6: Reach a conclusion

Based on the visual, does Amazon or Google have a better work-life balance according to current employee reviews?

Google.

Draw another conclusion, insight, or recommendation.

Earlier you were surprised to see "fast paced" in the pros despite the other reviews mentioning "work-life balance". Recall that you used findAssocs() to get a named vector of phrases. These may lead you to a conclusion about the type of person who favorably views an intense workload.

Given the abbreviated results of the associated phrases, what would you recommend Amazon HR recruiters look for in candidates? (You can use the snippet below to gain insight on phrases associated with "fast paced".)

In [531]:
findAssocs(amzn_p_tdm, "fast paced", 0.2)[[1]][1:15]
paced environment
0.49
environments ever
0.35
learn fast
0.35
paced friendly
0.35
paced work
0.35
able excel
0.25
activity ample
0.25
advance one
0.25
also well
0.25
amazon fast
0.25
amazon noting
0.25
amazon one
0.25
amount time
0.25
ample opportunity
0.25
assistance ninety
0.25

Final word

Text mining is useful everywhere legal, marketing, HR, academia, etc.