Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets and results.
  • Compiled on Labour Day 2017 (to put the tweets in context).


String theory – A word on ‘text’

The text inside a document is unstructured data, like photo, video or sound files. Tweets are text files. However, the metadata, such as the ‘tweet creation date & time’, are structured data.

‘Text’ bears other names. Inside a computer, text is stored as Unicode or simply as character strings. ‘Text’ is also called ‘natural language’. R, on the other hand, is also a language, but a programming language.

We can manipulate ‘natural languages’. Searching, analyzing, and transforming strings is called Natural Language Processing, or NLP. In R, we can perform NLP with the stringr package.
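
As a first taste, two basic stringr calls, detecting and then replacing a pattern in a string:

library(stringr)
str_detect("I tweet, therefore I am.", "tweet")  # search for a pattern
str_replace("I tweet, therefore I am.", "tweet", "post")  # transform it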

We are going to work with tweets. We could also work with another text corpus: e-mails, chats, SMS, website comments, logs or other records, TV/radio verbatims, court transcripts, etc.

Notes on the main packages used in this case



Preparing the analysis

After loading the necessary packages, we log into the Twitter API with setup_twitter_oauth.
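
A sketch of the login step; the four credentials come from the Twitter application page (the values below are placeholders, not real keys):

library(twitteR)
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")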

We load the new package.

library(stringr)


Pulling some tweets (the ‘text’)

We pull tweets using the TweetFrame function below.

TweetFrame <- function(searchTerm, maxTweets, langTweets)
{
  # query the Twitter API for matching tweets
  twtList <-
    searchTwitter(searchTerm, n = maxTweets, lang = langTweets)
  # stack the results into one data frame, one row per tweet
  return(do.call("rbind",
                 lapply(twtList, as.data.frame)))
}

#earth

tweetDF <- TweetFrame("#earth", 100, "en")

head(tweetDF$text, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! #rareearthgallerycc #amethyst #crystal… https://t.co/EaSNd7Bz14"   
## [2] "Invigorating &amp; Peaceful Sunday morning hiking Griffith Park!\n\n#DTLA #hiking #peace #MotherEarth #clouds #coyote… https://t.co/3nYeOE8j6g"
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on #sale for $1.99 via @AceRocBooks Grab it today!… https://t.co/NBlm5Ou0a4"

We attach the data frame and verify it is done.

attach(tweetDF)
search()
##  [1] ".GlobalEnv"        "tweetDF"           "package:stringr"  
##  [4] "package:httr"      "package:bit64"     "package:bit"      
##  [7] "package:rjson"     "package:devtools"  "package:ROAuth"   
## [10] "package:twitteR"   "package:RJSONIO"   "package:RCurl"    
## [13] "package:bitops"    "package:stats"     "package:graphics" 
## [16] "package:grDevices" "package:utils"     "package:datasets" 
## [19] "package:methods"   "Autoloads"         "package:base"

We can see the tweetDF object in the .GlobalEnv.



Preprocessing the tweets

Let’s examine the data frame.

# number of tweets (length of the text vector)
length(text)
## [1] 100
# length of each tweet in the d.f.
str_length(text)
##   [1] 139 140 129 139 144 144 139 140 144 143  95 144 143 140 140 140 144
##  [18] 144 144  NA  96  93 140 117 140 122  97 122  96  93 140 140 144 144
##  [35]  64  96  93 144 140 136 140 144 144 139  96  96  93  93  93 140  56
##  [52] 140  96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
##  [69] 137 144 144  96 144 144 103 140 144 115 140 144 144 144 144  NA 128
##  [86]  96 128 134 140 140  77 134 118  96  96  73  96 140 144 140

We get the number of tweets and the length in characters of each tweet. We add the latter to the data frame.

tweetDF$textlen <- str_length(text)
tweetDF$textlen
##   [1] 139 140 129 139 144 144 139 140 144 143  95 144 143 140 140 140 144
##  [18] 144 144  NA  96  93 140 117 140 122  97 122  96  93 140 140 144 144
##  [35]  64  96  93 144 140 136 140 144 144 139  96  96  93  93  93 140  56
##  [52] 140  96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
##  [69] 137 144 144  96 144 144 103 140 144 115 140 144 144 144 144  NA 128
##  [86]  96 128 134 140 140  77 134 118  96  96  73  96 140 144 140

We cannot access the new field without the $ notation unless we detach and reattach the data frame.

detach(tweetDF)
search()
##  [1] ".GlobalEnv"        "package:stringi"   "package:stringr"  
##  [4] "package:httr"      "package:bit64"     "package:bit"      
##  [7] "package:rjson"     "package:devtools"  "package:ROAuth"   
## [10] "package:twitteR"   "package:RJSONIO"   "package:RCurl"    
## [13] "package:bitops"    "package:stats"     "package:graphics" 
## [16] "package:grDevices" "package:utils"     "package:datasets" 
## [19] "package:methods"   "Autoloads"         "package:base"
attach(tweetDF)
search()
##  [1] ".GlobalEnv"        "tweetDF"           "package:stringi"  
##  [4] "package:stringr"   "package:httr"      "package:bit64"    
##  [7] "package:bit"       "package:rjson"     "package:devtools" 
## [10] "package:ROAuth"    "package:twitteR"   "package:RJSONIO"  
## [13] "package:RCurl"     "package:bitops"    "package:stats"    
## [16] "package:graphics"  "package:grDevices" "package:utils"    
## [19] "package:datasets"  "package:methods"   "Autoloads"        
## [22] "package:base"
textlen
##   [1] 139 140 129 139 144 144 139 140 144 143  95 144 143 140 140 140 144
##  [18] 144 144  NA  96  93 140 117 140 122  97 122  96  93 140 140 144 144
##  [35]  64  96  93 144 140 136 140 144 144 139  96  96  93  93  93 140  56
##  [52] 140  96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
##  [69] 137 144 144  96 144 144 103 140 144 115 140 144 144 144 144  NA 128
##  [86]  96 128 134 140 140  77 134 118  96  96  73  96 140 144 140

Better.
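
As an aside, we could avoid the detach/attach cycle altogether with the $ notation or with(); a quick sketch:

with(tweetDF, mean(textlen, na.rm = TRUE))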

We count the tweets with more than 140 characters.

length(tweetDF[textlen > 140, "text"])
## [1] 26

There are tweets with lengths greater than 140 characters, which indicates that some tweets carry extra characters such as extra spaces. If we want to count the number of words, we need to clean up a bit. We substitute a single space wherever two spaces are found, calculate the new length, and create a new variable.

tweetDF$modtext <- str_replace_all(text,"  "," ")

tweetDF$textlen2 <- str_length(tweetDF$modtext)
tweetDF$textlen2
##   [1] 139 140 129 139 144 144 139 139 144 143  95 144 143 140 139 140 144
##  [18] 144 144 112  96  93 140 117 140 122  97 122  96  93 140 140 144 144
##  [35]  64  96  93 144 139 136 140 144 144 139  96  96  93  93  93 140  56
##  [52] 138  96 140 140 140 140 140 140 140 140 140 140 131 140 140 140 140
##  [69] 137 144 144  96 144 144 103 140 144 113 140 144 144 144 144  71 128
##  [86]  96 126 134 140 140  76 134 118  96  96  73  96 140 144 140
detach(tweetDF)
attach(tweetDF)

We count the tweets where the two length variables differ, and we compute the difference tweet by tweet.

length(tweetDF[textlen != textlen2, ])
## [1] 19
textlen2 - textlen
##   [1]  0  0  0  0  0  0  0 -1  0  0  0  0  0  0 -1  0  0  0  0 NA  0  0  0
##  [24]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0
##  [47]  0  0  0  0  0 -2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##  [70]  0  0  0  0  0  0  0  0 -2  0  0  0  0  0 NA  0  0 -2  0  0  0 -1  0
##  [93]  0  0  0  0  0  0  0  0
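
Note that length() applied to a data frame returns its number of columns (the data frame holds 19 columns at this point), not a row count. A sketch of the direct count of changed tweets:

sum(textlen != textlen2, na.rm = TRUE)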

A negative number indicates the clean tweets are shorter. We have a cleaner data frame.

We count the number of words per tweet and the overall average number of words per tweet.

tweetDF$wordCount <- (str_count(modtext, " ") + 1)

detach(tweetDF)
attach(tweetDF)

wordCount
##   [1] 18 14 20 18 22 19 16 20 19 16  9 21 20 20 22 19 20 22 19 14  9 10 16
##  [24] 22 24 14 10 14  9 10 16 22 19 19  9  9 10 22 20 14 24 22 19 22  9  9
##  [47] 10 14 10 23  5 21  9 15 15 15 15 15 15 15 15 15 15 18 15 15 15 15 15
##  [70] 19 22  9 19 22 10 15 22 12 19 20 22 19 19  8 13  9 20 16 19 22 11 15
##  [93] 14  9  9  8  9 22 20 18
mean(wordCount)
## [1] 15.97
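
Note that counting separators assumes single spaces between words. A hedged alternative is to count the words themselves, as runs of non-space characters:

str_count(modtext, "\\S+")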


Processing the tweets

Now, let’s process the corpus of tweets. We can use regular expressions, or regex, to do so.

We want to find the retweets (rt): tweets that are rebroadcast on a larger scale.

The retweet regex is RT @[a-z,A-Z]*: ; it matches a tweet beginning with RT, then @, then any run of lower- or uppercase letters (the * repeats the bracketed character class zero or more times), ending with a colon and a space. Note that the commas inside the square brackets are themselves matched literally; [a-zA-Z] is the stricter form.

tweetDF$rt <- str_match(modtext, "RT @[a-z,A-Z]*: ")

detach(tweetDF)
attach(tweetDF)

We review the NA (no results) and remove these unnecessary strings (clean the tweets).

head(rt, 10)
##       [,1]               
##  [1,] NA                 
##  [2,] NA                 
##  [3,] NA                 
##  [4,] "RT @MAHAMOSA: "   
##  [5,] "RT @earthtokens: "
##  [6,] "RT @earthtokens: "
##  [7,] NA                 
##  [8,] NA                 
##  [9,] "RT @earthtokens: "
## [10,] "RT @earthtokens: "

We simplify the structure of rt; we make it a vector.

tweetDF$rt <- as.vector(tweetDF$rt)
head(tweetDF$rt, 10)
##  [1] NA                  NA                  NA                 
##  [4] "RT @MAHAMOSA: "    "RT @earthtokens: " "RT @earthtokens: "
##  [7] NA                  NA                  "RT @earthtokens: "
## [10] "RT @earthtokens: "

We clean a little bit more.

# remove strings: "RT @"
tweetDF$rt <- str_replace(tweetDF$rt, "RT @","")
# remove strings: ": "
tweetDF$rt <- str_replace(tweetDF$rt,": ","")

detach(tweetDF)
attach(tweetDF)

# check it out
head(rt, 10)
##  [1] NA            NA            NA            "MAHAMOSA"    "earthtokens"
##  [6] "earthtokens" NA            NA            "earthtokens" "earthtokens"


Ventilating the tweets

We want to break down the corpus and discriminate among tweets; create classes or levels. We can then find the number of instances for each level.

We first coerce the vector into a factor and tabulate it in a contingency table. Factors are a collection of descriptive labels, categories or levels.

table(as.factor(rt))
## 
##   ajrjacksonart     earthtokens     EnjoyNature        MAHAMOSA 
##               1              28               1               1 
##    Marileopardi NewWorldLibrary      Satellogic   sgerendaskiss 
##               1               1               1              16

We find the number of levels: the unique values, without duplicates. The table presents the count for each level.

str(as.factor(rt))
##  Factor w/ 8 levels "ajrjacksonart",..: NA NA NA 4 2 2 NA NA 2 2 ...

There are 8 unique levels, each appearing a certain number of times.

We save the first level and process it.

level_1 <- levels(as.factor(rt))[1]
level_1
## [1] "ajrjacksonart"
# use level_1 to find instances in the vector
as.vector(str_match(tweetDF$rt, level_1))
##   [1] NA              NA              NA              NA             
##   [5] NA              NA              NA              NA             
##   [9] NA              NA              NA              NA             
##  [13] NA              NA              NA              NA             
##  [17] NA              NA              NA              "ajrjacksonart"
##  [21] NA              NA              NA              NA             
##  [25] NA              NA              NA              NA             
##  [29] NA              NA              NA              NA             
##  [33] NA              NA              NA              NA             
##  [37] NA              NA              NA              NA             
##  [41] NA              NA              NA              NA             
##  [45] NA              NA              NA              NA             
##  [49] NA              NA              NA              NA             
##  [53] NA              NA              NA              NA             
##  [57] NA              NA              NA              NA             
##  [61] NA              NA              NA              NA             
##  [65] NA              NA              NA              NA             
##  [69] NA              NA              NA              NA             
##  [73] NA              NA              NA              NA             
##  [77] NA              NA              NA              NA             
##  [81] NA              NA              NA              NA             
##  [85] NA              NA              NA              NA             
##  [89] NA              NA              NA              NA             
##  [93] NA              NA              NA              NA             
##  [97] NA              NA              NA              NA
# confirm if there is at least one instance (at least 1 TRUE)
as.vector(!is.na(str_match(tweetDF$rt, level_1)))
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
# count the number of instances; TRUE=1, FALSE=0
sum(!is.na(str_match(tweetDF$rt, level_1)))
## [1] 1

We can find 1 instance(s) with level_1.

level_2 <- levels(as.factor(rt))[2]
sum(!is.na(str_match(tweetDF$rt, level_2)))
## [1] 28

We can find 28 instance(s) with level_2.

level_5 <- levels(as.factor(rt))[5]
sum(!is.na(str_match(tweetDF$rt, level_5)))
## [1] 1

We can find 1 instance(s) with level_5.
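
Rather than extracting the levels one by one, a sketch that loops over all of them and reproduces the counts of the contingency table above:

sapply(levels(as.factor(rt)),
       function(lev) sum(!is.na(str_match(rt, lev))))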



Tagging new attributes to the tweets

We create a new variable, longtext, that is TRUE if the cleaned tweet is longer than 140 characters. We ‘flag’ the attributes we are looking for. We can check out the results in a two-way contingency table.

# filter
tweetDF$longtext <- (textlen2 > 140)

detach(tweetDF)
attach(tweetDF)

# check out
# row by col
table(as.factor(rt), as.factor(longtext))
##                  
##                   FALSE TRUE
##   ajrjacksonart       1    0
##   earthtokens         4   24
##   EnjoyNature         1    0
##   MAHAMOSA            1    0
##   Marileopardi        1    0
##   NewWorldLibrary     1    0
##   Satellogic          1    0
##   sgerendaskiss      16    0

Level by level, we can see the number of instances (TRUE). We can locate where the longest tweets are.

We create another flag variable, hasrt, that indicates whether a tweet is a retweet (it contains a retweet header).

We build the flag by testing the rt field for NA.

# filter
tweetDF$hasrt <- !(is.na(rt))
tweetDF$hasrt
##   [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
##  [12]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
##  [34]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [45]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
##  [78] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
##  [89]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## [100] FALSE
detach(tweetDF)
attach(tweetDF)

hasrt 
##   [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
##  [12]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
##  [34]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [45]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
##  [78] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
##  [89]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## [100] FALSE

Or use View(hasrt).

We count the occurrences (TRUE=1, FALSE=0).

sum(hasrt)
## [1] 50

We print contingency tables of short/long tweets and retweets.

Absolute

table(hasrt, longtext)
##        longtext
## hasrt   FALSE TRUE
##   FALSE    50    0
##   TRUE     26   24

Proportional or relative

prop.table(table(hasrt, longtext), 2)
##        longtext
## hasrt       FALSE      TRUE
##   FALSE 0.6578947 0.0000000
##   TRUE  0.3421053 1.0000000

It seems there are more short tweets (longtext=FALSE) that are retweets (hasrt=TRUE, lower left) than long tweets (lower right).

However, a higher proportion of long tweets are retweeted.

In other words, long tweets are proportionally more retweeted, although short tweets dominate in volume.



Tagging another attribute

Extract URLs from the tweet texts using another regex: https://t.co/[a-z,A-Z,0-9]{8}.

The regular expression begins with the 13 literal characters https://t.co/, ending with a forward slash. Then we have a pattern to match: the material within the square brackets matches any upper- or lowercase letter and any digit. The numeral 8 between the curly braces at the end says to match the bracketed pattern exactly eight times.

Instead of the function str_match, we use str_match_all. The first returns a long matrix of 1 column; the second returns a list of small matrices.

tweetDF$urlist <- str_match_all(tweetDF$text, "https://t.co/[a-z,A-Z,0-9]{8}")

detach(tweetDF)
attach(tweetDF)

head(urlist, 5)
## [[1]]
##      [,1]                   
## [1,] "https://t.co/EaSNd7Bz"
## 
## [[2]]
##      [,1]                   
## [1,] "https://t.co/3nYeOE8j"
## 
## [[3]]
##      [,1]                   
## [1,] "https://t.co/NBlm5Ou0"
## 
## [[4]]
##      [,1]
## 
## [[5]]
##      [,1]

We get a multidimensional object: a list of matrices. Some matrices are empty (no value), others hold 1 value or more.

We need an apply function to roll through each matrix and count the number of URLs per tweet. The recursive apply, rapply, dives down into the complex, nested structure of urlist and repeatedly runs the length function.

tweetDF$numurls <- rapply(urlist, length)

detach(tweetDF)
attach(tweetDF)

head(numurls, 10)
##  [1] 1 1 1 0 0 0 1 1 0 1

We now have a new variable that counts the number of URLs per tweet.
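
As an aside, base R offers a one-liner that returns the same counts here; a hedged alternative:

lengths(urlist)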



Combining the attributes

Let’s look at a proportion contingency table crossing two attributes: the number of URLs per tweet vs. long tweets.

prop.table(table(numurls,longtext))
##        longtext
## numurls FALSE TRUE
##       0  0.22 0.23
##       1  0.51 0.01
##       2  0.03 0.00

We have 100 tweets, so a proportion of 0.10, for example, means 10 tweets.

We can read, for example, the proportion of long tweets (longtext=TRUE) with 0, 1 or 2 URLs.

We can also build three-way contingency tables mixing retweets and URLs with long tweets.

table(numurls, hasrt, longtext)
## , , longtext = FALSE
## 
##        hasrt
## numurls FALSE TRUE
##       0    18    4
##       1    29   22
##       2     3    0
## 
## , , longtext = TRUE
## 
##        hasrt
## numurls FALSE TRUE
##       0     0   23
##       1     0    1
##       2     0    0
prop.table(table(numurls, hasrt, longtext))
## , , longtext = FALSE
## 
##        hasrt
## numurls FALSE TRUE
##       0  0.18 0.04
##       1  0.29 0.22
##       2  0.03 0.00
## 
## , , longtext = TRUE
## 
##        hasrt
## numurls FALSE TRUE
##       0  0.00 0.23
##       1  0.00 0.01
##       2  0.00 0.00

We can extract some statistics.

# average length (number of characters) of retweets AND long tweets 
mean(textlen2[hasrt & longtext])
## [1] 143.9167
# average length of retweets AND short tweets (! means NOT)
mean(textlen2[hasrt & !longtext])
## [1] 110.8077

In other words, retweets tend to be the longer tweets.



An image is worth…

We can examine the results of text mining with visualization tools.

  • Word clouds.
  • Treemaps.
  • Radio treemaps.
  • Dendrograms.

We will focus on word clouds. We can push the analysis further with methods such as ‘bag of words’ or ‘sentiment analysis’. These methods involve treemaps and dendrograms.

Notes on text visualization packages and more

Check out the ending note for visual examples.

  • An example from a blog article using Twitter (twitteR) and NLP packages (wordcloud, tm) to explore word frequencies.
  • Another example from a blog focusing on NLP packages (wordcloud, tm).
  • Other than word frequency within a text or a corpus, we may want to explore word associations and ngrams (frequent groups of words) with arborescences or dendrograms. The dendrogram is a built-in object in the R base package stats (class dendrogram). Dendrograms are what we get when we plot the results of hierarchical cluster analysis, with hclust, on a corpus.
  • There are complementary packages for plotting dendrograms. They work for structured and unstructured data. Consult packages and search for examples: dendextend, a package for visualizing, adjusting, and comparing dendrograms (based on bioinformatics), ggdendro, that can be used with ggplot2, and FactoMineR that does hierarchical clustering on principal components.
  • We can also use the phyloseq package. The package was designed for biology and ecology. Consult this demo about Phylogenetic Tree. The package does more than dendrograms: network analyses, heatmaps, grid or facets plots, radial trees, etc.
  • An example showing nice visualizations using the phyloseq package with the ggplot2 package to draw arborescences and radial trees.
  • The ape package is related to the phyloseq one. Consult the website.


Building a (clean) corpus

Based on the previous extractions: #earth.

head(tweetDF$text, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! #rareearthgallerycc #amethyst #crystal… https://t.co/EaSNd7Bz14"   
## [2] "Invigorating &amp; Peaceful Sunday morning hiking Griffith Park!\n\n#DTLA #hiking #peace #MotherEarth #clouds #coyote… https://t.co/3nYeOE8j6g"
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on #sale for $1.99 via @AceRocBooks Grab it today!… https://t.co/NBlm5Ou0a4"

We need to clean up the data:

  • strip out extra spaces,
  • get rid of all URL strings,
  • take out the retweet header if one exists in the tweet,
  • remove hashtags,
  • eliminate references to other people’s Twitter handles.

We can clean the tweets using regex and the str_replace_all function from the stringr package. We build a custom function to automate things.

CleanTweets takes the junk out of a vector of tweet texts.

library(stringr)

CleanTweets <- function(tweets) {
  # Collapse runs of spaces into a single space
  tweets <- str_replace_all(tweets, " +", " ")
  # Get rid of URLs
  tweets <- str_replace_all(tweets,
    "http://t.co/[a-z,A-Z,0-9]{10}", "")
  tweets <- str_replace_all(tweets,
    "https://t.co/[a-z,A-Z,0-9]{10}", "")
  # Take out the retweet header; there is at most one
  tweets <- str_replace(tweets, "RT @[a-z,A-Z,0-9]*: ", "")
  # Get rid of hashtags
  tweets <- str_replace_all(tweets, "#[a-z,A-Z,0-9]*", "")
  # Get rid of references to other screen names
  tweets <- str_replace_all(tweets, "@[a-z,A-Z,0-9]*", "")
  # Drop truncation ellipses and replacement characters
  tweets <- str_replace_all(tweets, "…", "")
  tweets <- str_replace_all(tweets, "�", "")
  # Collapse any space runs left behind by the removals
  tweets <- str_replace_all(tweets, " +", " ")
  # Catch URL fragments that survive without punctuation
  tweets <- str_replace_all(tweets,
    "httpstco[a-z,A-Z,0-9]{10}", "")
  tweets <- str_replace_all(tweets,
    "httpstco[a-z,A-Z,0-9]{2}", "")
  return(tweets)
}

The original text.

Text <- tweetDF$text

We apply CleanTweets.

cleanText <- CleanTweets(Text)

We show the original text and the clean text side by side.

head(Text, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! #rareearthgallerycc #amethyst #crystal… https://t.co/EaSNd7Bz14"   
## [2] "Invigorating &amp; Peaceful Sunday morning hiking Griffith Park!\n\n#DTLA #hiking #peace #MotherEarth #clouds #coyote… https://t.co/3nYeOE8j6g"
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on #sale for $1.99 via @AceRocBooks Grab it today!… https://t.co/NBlm5Ou0a4"
head(cleanText, 3)
## [1] "Throw backs! Some one of a kind specimens sold but will never be forgotten! "            
## [2] "Invigorating &amp; Peaceful Sunday morning hiking Griffith Park!\n\n "                   
## [3] "\"Stranger in a Strange Land\" by Robert A. Heinlein is on for $1.99 via Grab it today! "


Examining and manipulating text

Before using word clouds or other visualization tools, we need to convert the body of tweets into a text corpus using the tm package and its text mining functions (similar to data mining, but for unstructured data).

library(tm)

First, a simple example

Let’s coerce the body of tweets into class Corpus. Here is a simple example: a body of 2 texts.

docs <- c("This is a text.", "This another one.")
docsCorpus <- VCorpus(VectorSource(docs))
docsCorpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2

A ‘corpus’ is a ‘body’ of texts. The above corpus has 2 documents. Whether we deal with 2 small strings, 10 full-page articles, 50 multipage doctoral theses, or 1000 tweets, they can all be converted into a corpus of documents (texts) to be analyzed.

class(docsCorpus)
## [1] "VCorpus" "Corpus"

A class not only contains definitions about the structure of data, it also contains references to functions that can work on that class.
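
As an aside, we can list the functions (S3 methods) registered for a class; a quick sketch:

methods(class = "VCorpus")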

We can compare the simple body of texts with the corpus, once created.

str(docs)
##  chr [1:2] "This is a text." "This another one."
str(docsCorpus)
## List of 2
##  $ 1:List of 2
##   ..$ content: chr "This is a text."
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2018-01-15 19:03:49"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "1"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  $ 2:List of 2
##   ..$ content: chr "This another one."
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2018-01-15 19:03:49"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "2"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"

The corpus has more metadata. We could have set the metadata manually.

ptd1 <- PlainTextDocument("This is a text.",
                          heading = "Plain text document",
                          id = basename(tempfile()),
                          language = "en")
meta(ptd1)
##   author       : character(0)
##   datetimestamp: 2018-01-15 19:03:49
##   description  : character(0)
##   heading      : Plain text document
##   id           : file236616f2ea7b
##   language     : en
##   origin       : character(0)
ptd2 <- PlainTextDocument("This another one.",
                          heading = "Plain text document",
                          id = basename(tempfile()),
                          language = "en")
meta(ptd2)
##   author       : character(0)
##   datetimestamp: 2018-01-15 19:03:49
##   description  : character(0)
##   heading      : Plain text document
##   id           : file23666c5c2d33
##   language     : en
##   origin       : character(0)

Back to the main corpus

Let’s work with our cleanText (body of tweets).

tweetCorpus <- VCorpus(VectorSource(cleanText))
tweetCorpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 100

The body of tweets is now a corpus of 100 (mini) documents.

The first thing we want to do is simplify the corpus: fold everything to lowercase, remove punctuation, remove unnecessary words, etc.

# this step is optional
tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)

tweetCorpus <- tm_map(tweetCorpus, content_transformer(tolower))
tweetCorpus <- tm_map(tweetCorpus, content_transformer(removePunctuation))

Unnecessary words can be stop words (the, a, at, etc.) and repetitive expressions (common names, brand names, tag words, etc.).

In text mining, we only want data (words) that bring value.

In a sentiment analysis, about a brand for example, some words simply cloud the analysis; we do not need the brand name or ‘the’ showing up in ngrams, dendrograms or word clouds.

Since the tweets are mostly in English, we remove the English stop words (https://en.wikipedia.org/wiki/Stop_words).

tweetCorpus <- tm_map(tweetCorpus,content_transformer(removeWords),
  stopwords('english'))

We could go further.
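
For example, a hedged sketch removing project-specific words (the word list is illustrative; we store the result under a new name so the corpus above stays untouched):

tweetCorpus2 <- tm_map(tweetCorpus, content_transformer(removeWords),
  c("earth", "amp"))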

We then convert the corpus into a text matrix. The matrix is a rectangular data structure or a contingency table where we can see each term on one axis, the documents on the other axis, and a tally inside the matrix (the term frequency). The matrix can take two forms:

  • Term-document, with terms as the rows and documents as the columns.
  • Document-term, vice-versa.
  • In both cases, we can transpose the matrix.

# create a term-document matrix or TDM
tweetTDM <- TermDocumentMatrix(tweetCorpus)
tweetTDM
## <<TermDocumentMatrix (terms: 282, documents: 100)>>
## Non-/sparse entries: 668/27532
## Sparsity           : 98%
## Maximal term length: 15
## Weighting          : term frequency (tf)
t(tweetTDM) # now a document-term matrix or DTM
## <<DocumentTermMatrix (documents: 100, terms: 282)>>
## Non-/sparse entries: 668/27532
## Sparsity           : 98%
## Maximal term length: 15
## Weighting          : term frequency (tf)

A term may be a single word, “biology,” or a compound word, “data analysis”. A term like “data” can appear once in the first document, twice in the second document, and not at all in the third. The entry for the term data then tallies 1, 2, 0 across those three documents.
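
A minimal sketch of that tally with three tiny documents (hypothetical data, just to show the matrix):

miniDocs <- c("data analysis", "data data mining", "text mining")
miniTDM <- TermDocumentMatrix(VCorpus(VectorSource(miniDocs)))
inspect(miniTDM) # the row for the term "data" reads 1 2 0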

‘Sparse’ refers to the overwhelming number of cells that contain zero, indicating that the particular term does not appear in a given document.

A sparse matrix occurs when the vocabulary is rich and the documents clearly differ from each other. Customer satisfaction messages generate denser matrices because of the repetitive style of each message (“great product, bad product” and the likes).

Carry on or swerve off

We can further process the TDM and DTM. Text mining is a field of its own. However, our goal is to draw word clouds, so we need to coerce the TDM into a plain matrix.

class(tweetTDM)
## [1] "TermDocumentMatrix"    "simple_triplet_matrix"
dim(tweetTDM)
## [1] 282 100
str(tweetTDM)
## List of 6
##  $ i       : int [1:668] 24 86 127 162 174 221 223 241 275 20 ...
##  $ j       : int [1:668] 1 1 1 1 1 1 1 1 1 2 ...
##  $ v       : num [1:668] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 282
##  $ ncol    : int 100
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:282] "100" "1365x2048" "199" "2017" ...
##   ..$ Docs : chr [1:100] "character(0)" "character(0)" "character(0)" "character(0)" ...
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
tdMatrix <- as.matrix(tweetTDM)
class(tdMatrix)
## [1] "matrix"
dim(tdMatrix)
## [1] 282 100
str(tdMatrix)
##  num [1:282, 1:100] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Terms: chr [1:282] "100" "1365x2048" "199" "2017" ...
##   ..$ Docs : chr [1:100] "character(0)" "character(0)" "character(0)" "character(0)" ...
tdMatrix[1:5, 1:5]
##            Docs
## Terms       character(0) character(0) character(0) character(0)
##   100                  0            0            0            0
##   1365x2048            0            0            0            0
##   199                  0            0            1            0
##   2017                 0            0            0            0
##   2048                 0            0            0            0
##            Docs
## Terms       character(0)
##   100                  0
##   1365x2048            0
##   199                  0
##   2017                 0
##   2048                 0

We sort the matrix in descending order to show the most frequent terms at the top.

sortedMatrix <- sort(rowSums(tdMatrix), decreasing=TRUE)

class(sortedMatrix)
## [1] "numeric"
length(sortedMatrix) # it is now a numeric vector with names
## [1] 282
str(sortedMatrix)
##  Named num [1:282] 39 25 21 19 18 18 18 18 18 16 ...
##  - attr(*, "names")= chr [1:282] "hotel" "amp" "supporter" "impactchoice" ...
head(sortedMatrix, 5)
##        hotel          amp    supporter impactchoice       client 
##           39           25           21           19           18

We now have a vector where names are the terms and integers are the frequencies (the total word tally for all documents).

We extract the names, bind them with frequencies in 2 columns, and remove row names.

cloudFrame <- data.frame(word = names(sortedMatrix),
                         freq = sortedMatrix)
row.names(cloudFrame) <- NULL

head(cloudFrame, 5)
##           word freq
## 1        hotel   39
## 2          amp   25
## 3    supporter   21
## 4 impactchoice   19
## 5       client   18

We can now visualize the results with word clouds.



Word clouds

We load the wordcloud functions from the wordcloud package.

library(wordcloud)

We create the first word cloud.

wordcloud(cloudFrame$word, cloudFrame$freq)

There are more options.

library(RColorBrewer)

display.brewer.all()

pal1 <- brewer.pal(8, "Dark2") # qualitative palette
pal1b <- brewer.pal(8, "Accent") # qualitative palette

pal2 <- brewer.pal(9, "GnBu") # sequential palette
pal3 <- brewer.pal(11, "Spectral") # diverging palette

Some palettes have more colours. With 8 colours, we are limited to 8 word-frequency bins.

Each colour maps to a frequency, just like the font size: the size and the colour reflect the frequency.

We can adjust the word cloud parameters to fit the colour palette.

  • Minimum frequency: any word showing less than the number will not show up in the cloud.
  • Maximum words in the cloud.
  • Random order: by default, higher frequencies can either show up in the middle or on the edges of the cloud; we can add some randomness.
  • Rotation: we can align all the words or set a percentage to appear at 90 degrees.

Here are a couple of examples:

par(mfrow = c(1,2))

# qualitative palette
wordcloud(cloudFrame$word, cloudFrame$freq, colors = pal1) # with 8 colours max; we let the word cloud self-adjust

# sequential palette
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 4, colors = pal2) # min.freq starts at the 4th color on the palette; lower ranks are too pale to show up and they apply to 1, 2, 3 frequencies

# diverging palette
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, colors = pal3) # we force lower frequencies in; reddish colors are low frequencies

# qualitative palette
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, random.order = TRUE, colors = pal1b, rot.per = 0.33) # the placement is no longer organized by frequencies; only 1/3 of the terms are flipped at 90 degrees

# idem
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, random.order = TRUE, colors = pal1b, rot.per = 0) # no terms are flipped at 90 degrees

# idem
wordcloud(cloudFrame$word, cloudFrame$freq, min.freq = 2, max.words = 50, random.order = FALSE, colors = pal1b, rot.per = 0) # remove random placement, most frequent terms appear in the center



Word clouds outside R

We save the word data into a CSV and a flat file.

write.csv uses ‘,’ separators by default and write.csv2 uses ‘;’ separators by default. Encode the text to capture the foreign characters; UTF-16 is better suited than UTF-8 here; consult https://en.wikipedia.org/wiki/UTF-16

write.csv2(cloudFrame, file = "cloudFrame.csv", row.names = FALSE, fileEncoding = "UTF-16LE")

write.table(cloudFrame, file = "cloudFrame.txt", sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")
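
To load the table back later, the matching reader is read.csv2 with the same encoding; a sketch:

cloudFrame2 <- read.csv2("cloudFrame.csv", fileEncoding = "UTF-16LE")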

With the original text or the cloudFrame files, we can use other software to generate word clouds. Some of these tools are online:

  • Wordle (and change the angles).
  • tagCloud generator.
  • ImageChef (choice of fonts, shapes).
  • ABVya.
  • Tagul.
  • WordItOut.
  • Tagxedo (choice of shapes).
  • TagCrowd (like the blog word clouds).


Notes: examples of other visualizations

Word clouds are one way to visualize word frequency. Other visualizations serve other purposes.

Basic functions

class dendrogram; hierarchical cluster analysis, with hclust.


Packages

dendextend


ggdendro (with ggplot2)


FactoMineR


phyloseq (with ggplot2)


ape (related to phyloseq)