Foreword
The text inside a document is unstructured data, like photo, video, or sound files. Tweets are text files. However, their metadata, such as the tweet creation date and time, are structured data.
‘Text’ bears other names. Text inside a computer is called Unicode, or simply character strings. ‘Text’ is also called ‘natural language’. R, on the other hand, is also a language, but a programming one.
We can manipulate natural languages. Searching, analyzing, and transforming strings is called Natural Language Processing, or NLP. In R, we can perform NLP with the stringr package.
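As a minimal sketch of what stringr offers (assuming the package is installed; the string s is made up):

```r
library(stringr)

s <- "Natural Language Processing in R"

str_length(s)              # number of characters: 32
str_detect(s, "Language")  # TRUE: the pattern is present
str_count(s, " ")          # number of spaces: 4
str_to_lower(s)            # a lowercase copy
```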
We are going to work with tweets. We could also work with another text corpus: e-mails, chats, SMS, website comments, logs or other records, TV/radio verbatims, court transcripts, etc.
After loading the necessary packages and logging into the Twitter API with setup_twitter_oauth…
We load in the new package.
library(stringr)
We pull tweets using the TweetFrame function.
TweetFrame <- function(searchTerm, maxTweets, langTweets)
{
  twtList <- searchTwitter(searchTerm, n = maxTweets, lang = langTweets)
  return(do.call("rbind", lapply(twtList, as.data.frame)))
}
#earth
tweetDF <- TweetFrame("#earth", 100, "en")
head(tweetDF$text, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! #rareearthgallerycc #amethyst #crystal… https://t.co/EaSNd7Bz14"
## [2] "Invigorating & Peaceful Sunday morning hiking Griffith Park!\n\n#DTLA #hiking #peace #MotherEarth #clouds #coyote… https://t.co/3nYeOE8j6g"
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on #sale for $1.99 via @AceRocBooks Grab it today!… https://t.co/NBlm5Ou0a4"
We attach the data frame and verify it is done.
attach(tweetDF)
search()
## [1] ".GlobalEnv"        "tweetDF"           "package:stringr"
## [4] "package:httr" "package:bit64" "package:bit"
## [7] "package:rjson" "package:devtools" "package:ROAuth"
## [10] "package:twitteR" "package:RJSONIO" "package:RCurl"
## [13] "package:bitops" "package:stats" "package:graphics"
## [16] "package:grDevices" "package:utils" "package:datasets"
## [19] "package:methods" "Autoloads" "package:base"
We can see the tweetDF object in the .GlobalEnv.
Let’s examine the data frame.
# number of tweets (length of the text vector)
length(text)
## [1] 100
# length of each tweet in the d.f.
str_length(text)
## [1] 139 140 129 139 144 144 139 140 144 143 95 144 143 140 140 140 144
## [18] 144 144 NA 96 93 140 117 140 122 97 122 96 93 140 140 144 144
## [35] 64 96 93 144 140 136 140 144 144 139 96 96 93 93 93 140 56
## [52] 140 96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
## [69] 137 144 144 96 144 144 103 140 144 115 140 144 144 144 144 NA 128
## [86] 96 128 134 140 140 77 134 118 96 96 73 96 140 144 140
We get the number of tweets and the length in characters of each tweet. Let's add the latter to the data frame.
tweetDF$textlen <- str_length(text)
tweetDF$textlen
## [1] 139 140 129 139 144 144 139 140 144 143 95 144 143 140 140 140 144
## [18] 144 144 NA 96 93 140 117 140 122 97 122 96 93 140 140 144 144
## [35] 64 96 93 144 140 136 140 144 144 139 96 96 93 93 93 140 56
## [52] 140 96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
## [69] 137 144 144 96 144 144 103 140 144 115 140 144 144 144 144 NA 128
## [86] 96 128 134 140 140 77 134 118 96 96 73 96 140 144 140
We cannot access the new field without the $ notation unless we detach and re-attach the data frame.
detach(tweetDF)
search()
## [1] ".GlobalEnv"        "package:stringi"   "package:stringr"
## [4] "package:httr" "package:bit64" "package:bit"
## [7] "package:rjson" "package:devtools" "package:ROAuth"
## [10] "package:twitteR" "package:RJSONIO" "package:RCurl"
## [13] "package:bitops" "package:stats" "package:graphics"
## [16] "package:grDevices" "package:utils" "package:datasets"
## [19] "package:methods" "Autoloads" "package:base"
attach(tweetDF)
search()
## [1] ".GlobalEnv"        "tweetDF"           "package:stringi"
## [4] "package:stringr" "package:httr" "package:bit64"
## [7] "package:bit" "package:rjson" "package:devtools"
## [10] "package:ROAuth" "package:twitteR" "package:RJSONIO"
## [13] "package:RCurl" "package:bitops" "package:stats"
## [16] "package:graphics" "package:grDevices" "package:utils"
## [19] "package:datasets" "package:methods" "Autoloads"
## [22] "package:base"
textlen
## [1] 139 140 129 139 144 144 139 140 144 143 95 144 143 140 140 140 144
## [18] 144 144 NA 96 93 140 117 140 122 97 122 96 93 140 140 144 144
## [35] 64 96 93 144 140 136 140 144 144 139 96 96 93 93 93 140 56
## [52] 140 96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
## [69] 137 144 144 96 144 144 103 140 144 115 140 144 144 144 144 NA 128
## [86] 96 128 134 140 140 77 134 118 96 96 73 96 140 144 140
Better.
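Aside: the detach/attach cycle is easy to forget; with() evaluates an expression inside a data frame without touching the search path. A sketch on a made-up data frame:

```r
df <- data.frame(msg = c("a b", "c d e"), msglen = c(3, 5))

# the columns are visible inside with(); no attach() or $ needed
with(df, mean(msglen))   # 4
```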
We count the tweets with more than 140 characters.
length(tweetDF[textlen > 140, "text"])
## [1] 26
There are tweets longer than 140 characters, which suggests some tweets contain extra characters such as doubled spaces. If we want to count words, we need to clean up a bit: we substitute a single space wherever two consecutive spaces are found. We then calculate the new length and store it in a new variable.
tweetDF$modtext <- str_replace_all(text, "  ", " ")
tweetDF$textlen2 <- str_length(tweetDF$modtext)
tweetDF$textlen2
## [1] 139 140 129 139 144 144 139 139 144 143 95 144 143 140 139 140 144
## [18] 144 144 112 96 93 140 117 140 122 97 122 96 93 140 140 144 144
## [35] 64 96 93 144 139 136 140 144 144 139 96 96 93 93 93 140 56
## [52] 138 96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
## [69] 137 144 144 96 144 144 103 140 144 113 140 144 144 144 144 71 128
## [86] 96 126 134 140 140 76 134 118 96 96 73 96 140 144 140
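The substitution can be checked on a toy string; the pattern " +" (one or more spaces) is a regex alternative that collapses any run of spaces in a single pass:

```r
library(stringr)

x <- "too   many    spaces"
str_replace_all(x, " +", " ")   # "too many spaces"
```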
detach(tweetDF)
attach(tweetDF)
We count the number of differences between the former variable and the new one. We compute this difference tweet by tweet.
# caution: length() on a data frame returns the number of columns, not rows;
# sum(textlen != textlen2, na.rm = TRUE) would count the differing tweets
length(tweetDF[textlen != textlen2, ])
## [1] 19
textlen2 - textlen
## [1] 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 -1 0 0 0 0 NA 0 0 0
## [24] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0
## [47] 0 0 0 0 0 -2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [70] 0 0 0 0 0 0 0 0 -2 0 0 0 0 0 NA 0 0 -2 0 0 0 -1 0
## [93] 0 0 0 0 0 0 0 0
A negative number indicates the clean tweets are shorter. We have a cleaner data frame.
We count the number of words per tweet and the overall average number of words per tweet.
tweetDF$wordCount <- (str_count(modtext, " ") + 1)
detach(tweetDF)
attach(tweetDF)
wordCount
## [1] 18 14 20 18 22 19 16 20 19 16 9 21 20 20 22 19 20 22 19 14 9 10 16
## [24] 22 24 14 10 14 9 10 16 22 19 19 9 9 10 22 20 14 24 22 19 22 9 9
## [47] 10 14 10 23 5 21 9 15 15 15 15 15 15 15 15 15 15 18 15 15 15 15 15
## [70] 19 22 9 19 22 10 15 22 12 19 20 22 19 19 8 13 9 20 16 19 22 11 15
## [93] 14 9 9 8 9 22 20 18
mean(wordCount)
## [1] 15.97
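The space-counting heuristic is easy to verify on toy strings (it over-counts when stray double spaces remain, which is why we cleaned first):

```r
library(stringr)

count_words <- function(s) str_count(s, " ") + 1

count_words("one")            # 1
count_words("one two three")  # 3
```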
Now, let's process the corpus of tweets. We can use regular expressions, or regex.
We want to find the retweets (RT): tweets rebroadcast by other users.
The retweet regex is RT @[a-z,A-Z]*: — the literal characters RT @, then any run of upper- or lowercase letters (the * repeats the bracketed character class), ending with a colon and a space. (The commas inside the brackets are literal characters; [a-zA-Z] is the stricter spelling, but it makes no difference for screen names.)
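On a made-up tweet, the pattern isolates the retweet header:

```r
library(stringr)

str_match("RT @someone: hello world", "RT @[a-z,A-Z]*: ")[1, 1]
# "RT @someone: "
str_match("no retweet here", "RT @[a-z,A-Z]*: ")[1, 1]
# NA: the pattern is absent
```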
tweetDF$rt <- str_match(modtext, "RT @[a-z,A-Z]*: ")
detach(tweetDF)
attach(tweetDF)
We review the NA (no results) and remove these unnecessary strings (clean the tweets).
head(rt, 10)
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] "RT @MAHAMOSA: "
## [5,] "RT @earthtokens: "
## [6,] "RT @earthtokens: "
## [7,] NA
## [8,] NA
## [9,] "RT @earthtokens: "
## [10,] "RT @earthtokens: "
We simplify the structure of rt; we make it a vector.
tweetDF$rt <- as.vector(tweetDF$rt)
head(tweetDF$rt, 10)
## [1] NA NA NA
## [4] "RT @MAHAMOSA: " "RT @earthtokens: " "RT @earthtokens: "
## [7] NA NA "RT @earthtokens: "
## [10] "RT @earthtokens: "
We clean a little bit more.
# remove strings: "RT @"
tweetDF$rt <- str_replace(tweetDF$rt, "RT @","")
# remove strings: ": "
tweetDF$rt <- str_replace(tweetDF$rt,": ","")
detach(tweetDF)
attach(tweetDF)
# check it out
head(rt, 10)
## [1] NA NA NA "MAHAMOSA" "earthtokens"
## [6] "earthtokens" NA NA "earthtokens" "earthtokens"
We want to break down the corpus and discriminate among tweets; create classes or levels. We can then find the number of instances for each level.
We first coerce the vector into a factor and build a contingency table. Factors are a collection of descriptive labels, categories, or levels.
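On a toy character vector, for instance:

```r
colours <- c("red", "blue", "red", NA, "blue", "red")
f <- as.factor(colours)

levels(f)   # "blue" "red": the unique labels, sorted; NA is not a level
table(f)    # counts per level: blue 2, red 3
```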
table(as.factor(rt))
##
## ajrjacksonart earthtokens EnjoyNature MAHAMOSA
## 1 28 1 1
## Marileopardi NewWorldLibrary Satellogic sgerendaskiss
## 1 1 1 16
We find the levels: the unique labels, without duplicates. The table presents the count for each level.
str(as.factor(rt))
## Factor w/ 8 levels "ajrjacksonart",..: NA NA NA 4 2 2 NA NA 2 2 ...
There are 8 unique levels, each appearing one or more times.
We save the first level and process it.
level_1 <- levels(as.factor(rt))[1]
level_1
## [1] "ajrjacksonart"
# use level_1 to find instances in the vector
as.vector(str_match(tweetDF$rt, level_1))
## [1] NA NA NA NA
## [5] NA NA NA NA
## [9] NA NA NA NA
## [13] NA NA NA NA
## [17] NA NA NA "ajrjacksonart"
## [21] NA NA NA NA
## [25] NA NA NA NA
## [29] NA NA NA NA
## [33] NA NA NA NA
## [37] NA NA NA NA
## [41] NA NA NA NA
## [45] NA NA NA NA
## [49] NA NA NA NA
## [53] NA NA NA NA
## [57] NA NA NA NA
## [61] NA NA NA NA
## [65] NA NA NA NA
## [69] NA NA NA NA
## [73] NA NA NA NA
## [77] NA NA NA NA
## [81] NA NA NA NA
## [85] NA NA NA NA
## [89] NA NA NA NA
## [93] NA NA NA NA
## [97] NA NA NA NA
# confirm if there is at least one instance (at least 1 TRUE)
as.vector(!is.na(str_match(tweetDF$rt, level_1)))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
# count the number of instances; TRUE=1, FALSE=0
sum(!is.na(str_match(tweetDF$rt, level_1)))
## [1] 1
We can find 1 instance(s) with level_1.
level_2 <- levels(as.factor(rt))[2]
sum(!is.na(str_match(tweetDF$rt, level_2)))
## [1] 28
We can find 28 instance(s) with level_2.
level_5 <- levels(as.factor(rt))[5]
sum(!is.na(str_match(tweetDF$rt, level_5)))
## [1] 1
We can find 1 instance(s) with level_5.
We create a new variable, longtext, that will be TRUE if the original tweet was longer than 140 characters. We ‘flag’ the attributes we are looking for. We can check out the results in a two-way contingency table.
# filter
tweetDF$longtext <- (textlen2 > 140)
detach(tweetDF)
attach(tweetDF)
# check out
# row by col
table(as.factor(rt), as.factor(longtext))
##
## FALSE TRUE
## ajrjacksonart 1 0
## earthtokens 4 24
## EnjoyNature 1 0
## MAHAMOSA 1 0
## Marileopardi 1 0
## NewWorldLibrary 1 0
## Satellogic 1 0
## sgerendaskiss 16 0
Level by level, we can see the number of instances (TRUE). We can locate where the longest tweets are.
We create another flag variable, hasrt, that indicates whether a tweet is a retweet.
We simply flag the entries where rt is not NA.
# filter
tweetDF$hasrt <- !(is.na(rt))
tweetDF$hasrt
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [12] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [34] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [45] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [78] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## [89] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## [100] FALSE
detach(tweetDF)
attach(tweetDF)
hasrt
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [12] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [34] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [45] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [78] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## [89] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## [100] FALSE
Or use View(hasrt).
We count the occurrences (TRUE=1, FALSE=0).
sum(hasrt)
## [1] 50
We print contingency tables of short/long tweets and retweets.
Absolute
table(hasrt, longtext)
## longtext
## hasrt FALSE TRUE
## FALSE 50 0
## TRUE 26 24
Proportional or relative
prop.table(table(hasrt, longtext), 2)
## longtext
## hasrt FALSE TRUE
## FALSE 0.6578947 0.0000000
## TRUE 0.3421053 1.0000000
In absolute numbers, more short tweets (longtext=FALSE) are retweets (hasrt=TRUE, lower left) than long tweets (lower right).
However, a higher proportion of the long tweets are retweets.
In other words, long tweets are proportionally more retweeted, even though short tweets are more numerous.
We extract URLs from the tweet texts using another regex: https://t.co/[a-z,A-Z,0-9]{8}.
The regular expression begins with the 13 literal characters https://t.co/, ending with a forward slash. Then comes a pattern to match: the material within the square brackets matches any upper- or lowercase letter and any digit, and the 8 between the curly braces says to match that class exactly eight times. (The t.co codes in our sample tweets are 10 characters long, so {8} truncates the match slightly.)
Instead of the function str_match, we use str_match_all. The first returns a matrix with a single column; the second returns a list of small matrices, one per string.
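The difference shows on a toy vector (the URLs are made up):

```r
library(stringr)

v <- c("see https://t.co/abcd1234 and https://t.co/wxyz9876", "no link")

str_match(v, "https://t.co/[a-z,A-Z,0-9]{8}")
# a 2x1 matrix: the first match per string, NA where there is none
str_match_all(v, "https://t.co/[a-z,A-Z,0-9]{8}")
# a list of matrices: all matches per string (2 rows, then 0 rows)
```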
tweetDF$urlist <- str_match_all(tweetDF$text, "https://t.co/[a-z,A-Z,0-9]{8}")
detach(tweetDF)
attach(tweetDF)
head(urlist, 5)
## [[1]]
## [,1]
## [1,] "https://t.co/EaSNd7Bz"
##
## [[2]]
## [,1]
## [1,] "https://t.co/3nYeOE8j"
##
## [[3]]
## [,1]
## [1,] "https://t.co/NBlm5Ou0"
##
## [[4]]
## [,1]
##
## [[5]]
## [,1]
We get a multidimensional object: a list of matrices. Some matrices are empty (no value); others hold 1 value or more.
We need an apply function to roll through each matrix and count the number of URLs per tweet. The recursive apply, rapply, dives down into the complex, nested structure of urlist and repeatedly runs the length function.
tweetDF$numurls <- rapply(urlist, length)
detach(tweetDF)
attach(tweetDF)
head(numurls, 10)
## [1] 1 1 1 0 0 0 1 1 0 1
We now have a new variable that counts the number of URLs per tweet.
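rapply's behavior is easy to see on a small nested list shaped like urlist (the contents are made up):

```r
# one leaf matrix per tweet, as str_match_all returns
nested <- list(matrix(c("u1", "u2"), ncol = 1),  # two matches
               matrix(character(0), ncol = 1))   # no match

rapply(nested, length)   # one count per leaf: 2 0
```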
Let’s look at a proportion contingency table crossing two attributes: the number of URLs per tweets vs. long tweets.
prop.table(table(numurls, longtext))
## longtext
## numurls FALSE TRUE
## 0 0.22 0.23
## 1 0.51 0.01
## 2 0.03 0.00
We have 100 tweets, so 10%, for example, means 10 tweets.
We can read off the proportion of long tweets (longtext=TRUE) with 0, 1, or 2 URLs, for example.
We can also build three-way contingency tables mixing retweets and URLs with long tweets.
table(numurls, hasrt, longtext)
## , , longtext = FALSE
##
## hasrt
## numurls FALSE TRUE
## 0 18 4
## 1 29 22
## 2 3 0
##
## , , longtext = TRUE
##
## hasrt
## numurls FALSE TRUE
## 0 0 23
## 1 0 1
## 2 0 0
prop.table(table(numurls, hasrt, longtext))
## , , longtext = FALSE
##
## hasrt
## numurls FALSE TRUE
## 0 0.18 0.04
## 1 0.29 0.22
## 2 0.03 0.00
##
## , , longtext = TRUE
##
## hasrt
## numurls FALSE TRUE
## 0 0.00 0.23
## 1 0.00 0.01
## 2 0.00 0.00
We can extract some statistics.
# average length (number of characters) of retweets AND long tweets
mean(textlen2[hasrt & longtext])
## [1] 143.9167
# average length of retweets AND short tweets (! means NOT)
mean(textlen2[hasrt & !longtext])
## [1] 110.8077
In other words, retweets tend to be on the long side.
We can examine the results of text mining with visualization tools.
We will focus on word clouds. We could push the analysis further with methods such as 'bag of words' or 'sentiment analysis'. These methods involve treemaps and dendrograms.
Check out the ending note for visual examples.
We use the Twitter API package (twitteR) and NLP packages (wordcloud, tm) to explore word frequencies.
Dendrograms come from the stats package (class dendrogram); a dendrogram is what we get when we plot the results of hierarchical cluster analysis, with hclust, on a corpus.
Related packages: dendextend, for visualizing, adjusting, and comparing dendrograms (with roots in bioinformatics); ggdendro, which can be used with ggplot2; and FactoMineR, which does hierarchical clustering on principal components.
There is also the phyloseq package, designed for biology and ecology; consult its demo about phylogenetic trees. The package does more than dendrograms: network analyses, heatmaps, grid or facet plots, radial trees, etc. It combines with ggplot2 to draw arborescences and radial trees.
The ape package is related to phyloseq; consult its website.
We build on the previous extraction: #earth.
head(tweetDF$text, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! #rareearthgallerycc #amethyst #crystal… https://t.co/EaSNd7Bz14"
## [2] "Invigorating & Peaceful Sunday morning hiking Griffith Park!\n\n#DTLA #hiking #peace #MotherEarth #clouds #coyote… https://t.co/3nYeOE8j6g"
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on #sale for $1.99 via @AceRocBooks Grab it today!… https://t.co/NBlm5Ou0a4"
We need to clean up the data:
We can clean the tweets using regex and the str_replace_all function from the stringr package. We build a custom function to automate things.
CleanTweets will take the junk out of a vector of tweet texts.
library(stringr)
CleanTweets <- function(tweets) {
# Remove redundant spaces
tweets <- str_replace_all(tweets, "  ", " ")
# Get rid of URLs
tweets <- str_replace_all(tweets,
"http://t.co/[a-z,A-Z,0-9]{10}", "")
tweets <- str_replace_all(tweets,
"https://t.co/[a-z,A-Z,0-9]{10}", "")
# Take out retweet header, there is only one
tweets <- str_replace(tweets, "RT @[a-z,A-Z,0-9]*: ", "")
# Get rid of hashtags
tweets <- str_replace_all(tweets, "#[a-z,A-Z,0-9]*", "")
# Get rid of references to other screennames
tweets <- str_replace_all(tweets, "@[a-z,A-Z,0-9]*", "")
# more
tweets <- str_replace_all(tweets,"…","")
tweets <- str_replace_all(tweets,"�","")
# collapse any remaining runs of spaces in one regex pass
tweets <- str_replace_all(tweets, " +", " ")
tweets <- str_replace_all(tweets,
"httpstco[a-z,A-Z,0-9]{10}", "")
tweets <- str_replace_all(tweets,
"httpstco[a-z,A-Z,0-9]{2}", "")
return(tweets)
}
The original text.
Text <- tweetDF$text
We apply CleanTweets.
cleanText <- CleanTweets(Text)
We show the original text and the clean text side by side.
head(Text, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! #rareearthgallerycc #amethyst #crystal… https://t.co/EaSNd7Bz14"
## [2] "Invigorating & Peaceful Sunday morning hiking Griffith Park!\n\n#DTLA #hiking #peace #MotherEarth #clouds #coyote… https://t.co/3nYeOE8j6g"
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on #sale for $1.99 via @AceRocBooks Grab it today!… https://t.co/NBlm5Ou0a4"
head(cleanText, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! "
## [2] "Invigorating & Peaceful Sunday morning hiking Griffith Park!\n\n "
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on for $1.99 via Grab it today! "
Before using word clouds or other visualization tools, we need to convert the body of tweets into a text corpus using the tm package and its text mining functions (like data mining, but for unstructured data).
library(tm)
Let's coerce the body of tweets into class Corpus. Here is a simple example: a body of 2 texts.
docs <- c("This is a text.", "This another one.")
docsCorpus <- VCorpus(VectorSource(docs))
docsCorpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
A ‘corpus’ is a ‘body’ of texts. The above corpus has 2 documents. Whether we deal with 2 small strings, 10 full-page articles, 50 multipage doctoral theses, or 1,000 tweets, they can all be converted into a corpus of documents (texts) to be analyzed.
class(docsCorpus)
## [1] "VCorpus" "Corpus"
A class not only contains definitions about the structure of data, it also contains references to functions that can work on that class.
We can compare the simple body of texts with the corpus, once created.
str(docs)
## chr [1:2] "This is a text." "This another one."
str(docsCorpus)
## List of 2
## $ 1:List of 2
## ..$ content: chr "This is a text."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2018-01-15 19:03:49"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr "This another one."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2018-01-15 19:03:49"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
The corpus has more metadata. We could have implemented the metadata manually.
ptd1 <- PlainTextDocument("This is a text.",
heading = "Plain text document",
id = basename(tempfile()),
language = "en")
meta(ptd1)
## author : character(0)
## datetimestamp: 2018-01-15 19:03:49
## description : character(0)
## heading : Plain text document
## id : file236616f2ea7b
## language : en
## origin : character(0)
ptd2 <- PlainTextDocument("This another one.",
heading = "Plain text document",
id = basename(tempfile()),
language = "en")
meta(ptd2)
## author : character(0)
## datetimestamp: 2018-01-15 19:03:49
## description : character(0)
## heading : Plain text document
## id : file23666c5c2d33
## language : en
## origin : character(0)
Let’s work with our cleanText (body of tweets).
tweetCorpus <- VCorpus(VectorSource(cleanText))
tweetCorpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 100
The body of tweets is now a corpus of 100 (mini) documents.
The first thing we want to do is simplify the corpus: remove capital letters, remove punctuation, remove unnecessary words, etc.
# this step is optional
tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)
tweetCorpus <- tm_map(tweetCorpus, content_transformer(tolower))
tweetCorpus <- tm_map(tweetCorpus, content_transformer(removePunctuation))
Unnecessary words can be stop words (the, a, at, etc.) and repetitive expressions (common names, brand names, tag words, etc.).
In text mining, we only want data (words) that bring value.
In a sentiment analysis, about a brand for example, some words simply cloud the analysis; we do not need the brand name or ‘the’ showing up in ngrams, dendrograms or word clouds.
Since most tweets here are in English, we remove the English stop words (https://en.wikipedia.org/wiki/Stop_words).
tweetCorpus <- tm_map(tweetCorpus,content_transformer(removeWords),
stopwords('english'))
We could go further.
We then convert the corpus into a text matrix: a rectangular data structure, or contingency table, where each term is on one axis, the documents are on the other, and a tally (the term frequency) fills the matrix. The matrix can take two forms:
# create a term-document matrix or TDM
tweetTDM <- TermDocumentMatrix(tweetCorpus)
tweetTDM
## <<TermDocumentMatrix (terms: 282, documents: 100)>>
## Non-/sparse entries: 668/27532
## Sparsity : 98%
## Maximal term length: 15
## Weighting : term frequency (tf)
t(tweetTDM) # now a document-term matrix or DTM
## <<DocumentTermMatrix (documents: 100, terms: 282)>>
## Non-/sparse entries: 668/27532
## Sparsity : 98%
## Maximal term length: 15
## Weighting : term frequency (tf)
A term may be a single word, "biology", or a compound, "data analysis". A term like "data" can appear once in the first document, twice in the second, and not at all in the third; in the TDM, the row for "data" then tallies 1, 2, 0.
'Sparse' refers to the overwhelming number of cells that contain zero, indicating that a particular term does not appear in a given document.
A sparse matrix occurs when the vocabulary is rich and the documents clearly differ from each other. Customer satisfaction messages generate denser matrices because of the repetitive style of each message ("great product", "bad product", and the like).
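A tiny corpus makes the tally concrete (assuming tm is installed; the three documents are made up):

```r
library(tm)

docs <- c("data analysis", "data data mining", "text mining")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

as.matrix(tdm)["data", ]   # the row for "data" tallies 1 2 0
```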
We can further process TDM and DTM. Text mining is a field of its own. However, our goal is to draw word clouds. We need to coerce the TDM into a plain matrix.
class(tweetTDM)
## [1] "TermDocumentMatrix"    "simple_triplet_matrix"
dim(tweetTDM)
## [1] 282 100
str(tweetTDM)
## List of 6
## $ i : int [1:668] 24 86 127 162 174 221 223 241 275 20 ...
## $ j : int [1:668] 1 1 1 1 1 1 1 1 1 2 ...
## $ v : num [1:668] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 282
## $ ncol : int 100
## $ dimnames:List of 2
## ..$ Terms: chr [1:282] "100" "1365x2048" "199" "2017" ...
## ..$ Docs : chr [1:100] "character(0)" "character(0)" "character(0)" "character(0)" ...
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
tdMatrix <- as.matrix(tweetTDM)
class(tdMatrix)
## [1] "matrix"
dim(tdMatrix)
## [1] 282 100
str(tdMatrix)
## num [1:282, 1:100] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ Terms: chr [1:282] "100" "1365x2048" "199" "2017" ...
## ..$ Docs : chr [1:100] "character(0)" "character(0)" "character(0)" "character(0)" ...
tdMatrix[1:5, 1:5]
## Docs
## Terms character(0) character(0) character(0) character(0)
## 100 0 0 0 0
## 1365x2048 0 0 0 0
## 199 0 0 1 0
## 2017 0 0 0 0
## 2048 0 0 0 0
## Docs
## Terms character(0)
## 100 0
## 1365x2048 0
## 199 0
## 2017 0
## 2048 0
We sum each term's frequency across all documents and sort the totals in descending order, most frequent terms first.
sortedMatrix <- sort(rowSums(tdMatrix), decreasing=TRUE)
class(sortedMatrix)
## [1] "numeric"
length(sortedMatrix) # it is now a numeric vector with names
## [1] 282
str(sortedMatrix)
## Named num [1:282] 39 25 21 19 18 18 18 18 18 16 ...
## - attr(*, "names")= chr [1:282] "hotel" "amp" "supporter" "impactchoice" ...
head(sortedMatrix, 5)
## hotel amp supporter impactchoice client
## 39 25 21 19 18
We now have a vector where names are the terms and integers are the frequencies (the total word tally for all documents).
We extract the names, bind them with frequencies in 2 columns, and remove row names.
cloudFrame <- data.frame(word = names(sortedMatrix),
freq = sortedMatrix)
row.names(cloudFrame) <- NULL
head(cloudFrame, 5)
## word freq
## 1 hotel 39
## 2 amp 25
## 3 supporter 21
## 4 impactchoice 19
## 5 client 18
We can now visualize the results with word clouds.
We load the wordcloud package.
library(wordcloud)
We create the first word cloud.
wordcloud(cloudFrame$word, cloudFrame$freq)
There are more options.
library(RColorBrewer)
display.brewer.all()
pal1 <- brewer.pal(8, "Dark2") # qualitative palette
pal1b <- brewer.pal(8, "Accent") # qualitative palette
pal2 <- brewer.pal(9, "GnBu") # sequential palette
pal3 <- brewer.pal(11, "Spectral") # diverging palette
Some palettes have more colours than others. With 8 colours, we are limited to 8 word-frequency bins.
Each colour maps to a frequency range, just like the font size; together, size and colour encode the frequency.
We can adjust the word cloud parameters to fit the colour palette.
Here are a couple of examples:
par(mfrow = c(1,2))
# qualitative palette
wordcloud(cloudFrame$word, cloudFrame$freq, colors = pal1) # with 8 colours max; we let the word cloud self-adjust
# sequential palette
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 4, colors = pal2) # min.freq = 4 skips the palest colours at the low end of the palette, which would apply to frequencies 1, 2, 3
# diverging palette
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, colors = pal3) # we force lower frequencies in; reddish colors are low frequencies
# qualitative palette
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, random.order = TRUE, colors = pal1b, rot.per = 0.33) # placement is no longer ordered by frequency; one third of the terms are rotated 90 degrees
# idem
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, random.order = TRUE, colors = pal1b, rot.per = 0) # no terms are rotated 90 degrees
# idem
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, random.order = FALSE, colors = pal1b, rot.per = 0) # without random placement, the most frequent terms appear in the center
We save the word data into a CSV file and a flat file.
write.csv uses ',' separators by default and write.csv2 uses ';' separators by default. Encode the text to capture foreign characters; UTF-16 handles them more reliably than UTF-8 here; consult https://en.wikipedia.org/wiki/UTF-16.
write.csv2(cloudFrame, file = "cloudFrame.csv", row.names = FALSE, fileEncoding = "UTF-16LE")
write.table(cloudFrame, file = "cloudFrame.txt", sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")
With the original text or the cloudFrame files, we can use other software to generate word clouds; some of these tools are available online.
Word clouds are one way to visualize word frequency. Other visualizations serve other purposes:
stats: class dendrogram, from hierarchical cluster analysis with hclust
dendextend
ggdendro (with ggplot2)
FactoMineR
phyloseq (with ggplot2)
ape (related to phyloseq)