Foreword
Twitter generates unstructured data, as opposed to structured data such as numeric, ordinal or binary data.
We can capture this unstructured data, build a corpus and run analyses with the twitteR package. Before we can download tweets, or scrape the web, we need access to the Twitter API.
First of all, we create a Twitter account and/or log in.
Here is a function that takes the name of a package as input. It tests whether the package has been downloaded – ‘installed’ – from the R package repository. If it has not yet been installed, the function installs it.
Then the function uses require to prepare the package (it works like library):
EnsurePackage <- function(x)
{
x <- as.character(x)
if (!require(x, character.only = TRUE))
{
install.packages(pkgs = x,
repos = "http://cran.r-project.org")
require(x, character.only = TRUE)
}
}
require does the same thing as library, but it also returns ‘FALSE’ if the requested package has not yet been downloaded.
We use the function and load all packages:
PrepareTwitter <- function()
{
EnsurePackage("bitops")
EnsurePackage("RCurl")
EnsurePackage("RJSONIO")
EnsurePackage("twitteR")
EnsurePackage("ROAuth")
}
We execute the function and install the necessary packages: bitops, RCurl, RJSONIO, twitteR and ROAuth.
PrepareTwitter()
Alternatively, we can ready R with the following.
library(bitops)
library(RCurl)
library(RJSONIO)
library(twitteR)
library(ROAuth)
These steps might be tedious. It might not work the first time: the above packages have dependencies. Find help online (Stack Overflow, for example) to fix these issues. In the end, we should be able to load all 5 packages properly.
Other installations might be needed: install.packages(c("devtools", "rjson", "bit64", "httr")). Restarting the R session after an installation helps. Do not forget to load the packages in R as well: restart the R session, then load all the needed packages.
library(devtools)
library(rjson)
library(bit64)
library(httr)
Depending on the OS version, it may be necessary to provide new SSL certificates. Certificates help maintain secure communications across the Internet; most computers keep an up-to-date copy on file, but not all of them do.
On Windows:
download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile = 'cacert.pem')
twitteR uses RCurl, which in turn employs SSL security whenever ‘https’ appears in a URL.
With the Twitter account, go to the settings, open Applications and create a new application. Create it and/or log in to the apps portal. You will be redirected to a screen with all the OAuth settings of your new app (the authentication process).
We need to perform the ‘handshake’ every time R interacts with Twitter.
Down the line, we need 4 codes: ‘consumerKey’, ‘consumerSecret’, ‘access_token’ and ‘access_secret’. How?
The wording and the procedure change from time to time. Check the latest twitteR documentation for instructions, consult the Twitter Developer Documentation website, or search online for help on retrieving data from Twitter.
Once we have the 4 codes, we create 4 variables (api_key, api_secret, access_token, access_secret) to link R with the Twitter account (my snippet remains secret with echo=FALSE).
api_key <- "YOUR API KEY" # or consumer key
api_secret <- "YOUR API SECRET" # or consumer secret
access_token <- "YOUR ACCESS TOKEN"
access_secret <- "YOUR ACCESS TOKEN SECRET"
We wrap up the OAuth authentication process with the handshake functions from the httr package and we open a twitteR session.
setup_twitter_oauth(api_key,
api_secret,
access_token,
access_secret)
## [1] "Using direct authentication"
When the API is on, R should now be linked to Twitter.
We test it with the searchTwitter function.
searchTwitter(searchString,
n=3,
lang=NULL,
since=NULL,
until=NULL,
locale=NULL,
geocode=NULL,
sinceID=NULL,
maxID=NULL,
resultType=NULL,
retryOnRateLimit=120)
#sun
searchTwitter("sun",
n = 3)
## [[1]]
## [1] "MLB37167: RT @Breaking911: UPDATE: At least 17 people wounded in MLK Day weekend shootings across Chicago - Sun Times https://t.co/qoqIPz9w65"
##
## [[2]]
## [1] "marianathesnake: Glow from @ABHcosmetics Sun dipped ☀️ ✨☀️✨☀️\xed\xa0\xbd\xed\xb2\x84#AnastasiaBeverlyHills @norvina1 #ABHGlow #sundipped #Highlighter… https://t.co/eJGndsanJ7"
##
## [[3]]
## [1] "Kizzezzleepy: RT @SiblingsKisses: If @delavinkisses is the sun, then Kissables are like sunflowers, all looking towards the sun, drawn to her radiant, wa…"
On Windows, if you had to download the new certificate (‘cacert.pem’), you may also have to pass it along: searchTwitter("#hashtag", n = 3, cainfo = "cacert.pem").
We change the language (lang=NULL is the default setting). Consult Wikipedia for the list of languages; pick the proper ISO 639-1 code. For example, French is ‘fr’.
#soleil
searchTwitter("#soleil",
n = 3,
lang = 'fr')
## [[1]]
## [1] "JimmyD5976: RT @lavoixdunord: Les destinations les plus en vue à l’aéroport de Lille-Lesquin #soleil #vacances https://t.co/2p7e9O8DDd https://t.co/ytK…"
##
## [[2]]
## [1] "Fleming1186: RT @VALBERGAlpesSud: Aujourd’hui c’est #lundi au #soleil \xed\xa0\xbd\xed\xb8\x8e\xed\xa0\xbc\xed\xbe\xbfet vous ? ⛄️\xed\xa0\xbd\xed\xb8\x9c #ski #winter #CotedAzurFrance #lovevalberg https://t.co/U8iZANPXMd"
##
## [[3]]
## [1] "lavoixdunord: Les destinations les plus en vue à l’aéroport de Lille-Lesquin #soleil #vacances https://t.co/2p7e9O8DDd https://t.co/ytKjO0OZLp"
The worst is done; now the fun can begin!
twitteR, the functions
directMessage-class, class “directMessage”: a class to represent Twitter Direct Messages.
dmGet, dmSent, dmDestroy, dmSend: functions to manipulate Twitter direct messages.
getTrends, availableTrendLocations, closestTrendLocations: functions to view Twitter trends.
getUser, lookupUsers: functions to manage Twitter users.
import_statuses, import_trends, import_users, import_obj, json_to_users, json_to_statuses, json_to_trends: functions to import twitteR objects from various sources.
load_tweets_db, store_tweets_db, store_users_db, load_users_db: functions to persist/load twitteR data to a database.
registerTwitterOAuth: register OAuth credentials to a twitteR session.
searchTwitter: search Twitter.
search_twitter_and_store: a function to store searched tweets to a database.
register_db_backend, register_sqlite_backend, register_mysql_backend: functions to register database backends.
setup_twitter_oauth: sets up the OAuth credentials for a twitteR session.
status-class: a class to contain a Twitter status.
taskStatus: a function to send a Twitter DM after completion of a task.
userTimeline, homeTimeline, mentions, retweetsOfMe: functions to view Twitter timelines.
twListToDF: a function to convert twitteR lists to data.frames.
updateStatus, tweet, deleteStatus: functions to manipulate Twitter statuses.
use_oauth_token: sets up the OAuth credentials for a twitteR session from an existing Token object.
userFactory: a container object to model Twitter users.
stats::HoltWinters: Holt-Winters filtering of time series.
stats::plot.HoltWinters: plot function for Holt-Winters objects.
stats::predict.HoltWinters: prediction function for a fitted Holt-Winters model.
To pull data from Twitter, we log into the Twitter account (open the app), load all the R packages and run the setup_twitter_oauth function to begin.
We pull the top trends in two cities, near two bridges.
San Francisco, Golden Bridge
SF_woeid <- closestTrendLocations(37.781157, -122.39720) # one of the accesses of the Golden Bridge
SF_woeid
## name country woeid
## 1 San Francisco United States 2487956
SF_trends <- getTrends(as.numeric(SF_woeid[3]))
head(SF_trends) # a data frame
## name url
## 1 #MLKDay http://twitter.com/search?q=%23MLKDay
## 2 Dolores O'Riordan http://twitter.com/search?q=%22Dolores+O%27Riordan%22
## 3 #MotivationMonday http://twitter.com/search?q=%23MotivationMonday
## 4 Happy MLK http://twitter.com/search?q=%22Happy+MLK%22
## 5 #Heathers http://twitter.com/search?q=%23Heathers
## 6 Birmingham Jail http://twitter.com/search?q=%22Birmingham+Jail%22
## query woeid
## 1 %23MLKDay 2487956
## 2 %22Dolores+O%27Riordan%22 2487956
## 3 %23MotivationMonday 2487956
## 4 %22Happy+MLK%22 2487956
## 5 %23Heathers 2487956
## 6 %22Birmingham+Jail%22 2487956
Montréal, Jacques-Cartier
Mtl_woeid <- closestTrendLocations(45.522660, -73.546428) # one of the accesses of the Jacques-Cartier bridge
Mtl_woeid
## name country woeid
## 1 Montreal Canada 3534
Mtl_trends <- getTrends(as.numeric(Mtl_woeid[3]))
head(Mtl_trends) # a data frame
## name url
## 1 Dolores O'Riordan http://twitter.com/search?q=%22Dolores+O%27Riordan%22
## 2 #MLKDay http://twitter.com/search?q=%23MLKDay
## 3 #BlueMonday http://twitter.com/search?q=%23BlueMonday
## 4 Andrew Shaw http://twitter.com/search?q=%22Andrew+Shaw%22
## 5 Kevin Glenn http://twitter.com/search?q=%22Kevin+Glenn%22
## 6 #NationalHatDay http://twitter.com/search?q=%23NationalHatDay
## query woeid
## 1 %22Dolores+O%27Riordan%22 3534
## 2 %23MLKDay 3534
## 3 %23BlueMonday 3534
## 4 %22Andrew+Shaw%22 3534
## 5 %22Kevin+Glenn%22 3534
## 6 %23NationalHatDay 3534
#climate
We get some data (recent tweets).
# worldwide tweets
tweetList <- searchTwitter("#climate", n = 500)
mode(tweetList); length(tweetList)
## [1] "list"
## [1] 500
The object is a unidimensional structure: a list of 500 entries.
# the first tweet structure (data + metadata)
str(head(tweetList, 1))
## List of 1
## $ :Reference class 'status' [package "twitteR"] with 17 fields
## ..$ text : chr "RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjH"| __truncated__
## ..$ favorited : logi FALSE
## ..$ favoriteCount: num 0
## ..$ replyToSN : chr(0)
## ..$ created : POSIXct[1:1], format: "2018-01-15 18:58:29"
## ..$ truncated : logi FALSE
## ..$ replyToSID : chr(0)
## ..$ id : chr "952978313047191552"
## ..$ replyToUID : chr(0)
## ..$ statusSource : chr "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad</a>"
## ..$ screenName : chr "MonterioJulio"
## ..$ retweetCount : num 7
## ..$ isRetweet : logi TRUE
## ..$ retweeted : logi FALSE
## ..$ longitude : chr(0)
## ..$ latitude : chr(0)
## ..$ urls :'data.frame': 1 obs. of 5 variables:
## .. ..$ url : chr "https://t.co/M8S0nwjHzH"
## .. ..$ expanded_url: chr "https://www.sciencedaily.com/releases/2018/01/180103160202.htm"
## .. ..$ display_url : chr "sciencedaily.com/releases/2018/…"
## .. ..$ start_index : num 88
## .. ..$ stop_index : num 111
## ..and 53 methods, of which 39 are possibly relevant:
## .. getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,
## .. getLatitude, getLongitude, getReplyToSID, getReplyToSN,
## .. getReplyToUID, getRetweetCount, getRetweeted, getRetweeters,
## .. getRetweets, getScreenName, getStatusSource, getText, getTruncated,
## .. getUrls, initialize, setCreated, setFavoriteCount, setFavorited,
## .. setId, setIsRetweet, setLatitude, setLongitude, setReplyToSID,
## .. setReplyToSN, setReplyToUID, setRetweetCount, setRetweeted,
## .. setScreenName, setStatusSource, setText, setTruncated, setUrls,
## .. toDataFrame, toDataFrame#twitterObj
# the first 3 tweets (data alone)
head(tweetList, 3)
## [[1]]
## [1] "MonterioJulio: RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjHzH\n\n#climate #ClimateChangeIsR…"
##
## [[2]]
## [1] "geddes_anna: RT @iddrilefil: #Development and #climate #finance is not about labelled infrastructure but about good economic planning and setting the ri…"
##
## [[3]]
## [1] "SteliosYiatros: RT @WaterJPI: Applications now open! Join The @ClimateKIC Journey, Europe's largest #climate #innovation summer school. https://t.co/bTG7v0…"
We convert the object into a data frame. The data frame, another object, is a 2D structure (rows and columns).
tweetDF <- twListToDF(tweetList)
mode(tweetDF); dim(tweetDF)
## [1] "list"
## [1] 500 16
head(tweetDF, 3)
## text
## 1 RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjHzH\n\n#climate #ClimateChangeIsR…
## 2 RT @iddrilefil: #Development and #climate #finance is not about labelled infrastructure but about good economic planning and setting the ri…
## 3 RT @WaterJPI: Applications now open! Join The @ClimateKIC Journey, Europe's largest #climate #innovation summer school. https://t.co/bTG7v0…
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 <NA> 2018-01-15 18:58:29 FALSE
## 2 FALSE 0 <NA> 2018-01-15 18:55:11 FALSE
## 3 FALSE 0 <NA> 2018-01-15 18:55:03 FALSE
## replyToSID id replyToUID
## 1 <NA> 952978313047191552 <NA>
## 2 <NA> 952977481765711872 <NA>
## 3 <NA> 952977446604890113 <NA>
## statusSource
## 1 <a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>
## 2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 MonterioJulio 7 TRUE FALSE NA NA
## 2 geddes_anna 7 TRUE FALSE NA NA
## 3 SteliosYiatros 3 TRUE FALSE NA NA
Alternatively, we can run this code.
tweetDF <- do.call('rbind', lapply(tweetList, as.data.frame))
mode(tweetDF)
dim(tweetDF)
head(tweetDF, 3)
This pattern is useful in many other situations:
as.data.frame coerces each element of tweetList, the list, into a data frame.
lapply applies the coercing function to all of the elements of the list.
rbind binds all rows together.
do.call() allows a flexible number of arguments to be supplied to the functions (rbind and lapply): lapply first, followed by rbind in quotes.
We implement a function to automate the extraction and convert the results into a data frame.
TweetFrame <- function(searchTerm, maxTweets)
{
twtList <-
searchTwitter(searchTerm, n = maxTweets)
return(do.call("rbind", lapply(twtList, as.data.frame)))
}
twtList is a temporary variable created within the function; because of variable scoping, it disappears once the function returns.
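The do.call pattern itself is easy to verify on synthetic data; here is a minimal sketch, with a made-up list of one-row data frames standing in for the list of tweets:

```r
# a list of one-row data frames, mimicking the output of lapply(twtList, as.data.frame)
fakeList <- list(data.frame(text = "tweet A", retweetCount = 2),
                 data.frame(text = "tweet B", retweetCount = 7),
                 data.frame(text = "tweet C", retweetCount = 0))
# bind all rows into a single data frame, as TweetFrame does
fakeDF <- do.call("rbind", lapply(fakeList, as.data.frame))
dim(fakeDF)  # 3 rows, 2 columns
```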
We create a new dataset.
#earth
tweetDF <- TweetFrame("#earth", 250)
We start the analysis.
We print the tweetDF object attributes and extract a vector from the d.f.
# attributes(tweetDF)$row.names is skipped because it is a long enumeration; the number of tweets
# object attributes
attributes(tweetDF)$names
## [1] "text" "favorited" "favoriteCount" "replyToSN"
## [5] "created" "truncated" "replyToSID" "id"
## [9] "replyToUID" "statusSource" "screenName" "retweetCount"
## [13] "isRetweet" "retweeted" "longitude" "latitude"
# one attribute
attributes(tweetDF)$class
## [1] "data.frame"
# metadata: time, POSIXct
head(tweetDF$created, 3)
## [1] "2018-01-15 18:59:26 UTC" "2018-01-15 18:58:59 UTC"
## [3] "2018-01-15 18:58:44 UTC"
# or attach the object to simplify the code
attach(tweetDF)
head(created, 3)
## [1] "2018-01-15 18:59:26 UTC" "2018-01-15 18:58:59 UTC"
## [3] "2018-01-15 18:58:44 UTC"
We plot the data.
library(ggplot2)
ggplot(tweetDF, aes(x=created)) +
geom_histogram(bins=15, fill="white", colour="black") +
xlab('time') + ylab('frequency')
We compute the time range (earliest tweet to latest tweet in the d.f.).
max(created); min(created)
## [1] "2018-01-15 18:59:26 UTC"
## [1] "2018-01-15 16:23:04 UTC"
timerange <- max(created) - min(created)
timerange
## Time difference of 2.606111 hours
We compute the time covered by each histogram bar: the time range in hours × 60 minutes, divided by the 15 bars, i.e. about 10.4 minutes per bar.
as.numeric(timerange) * 60 / 15 # minutes per bar
## [1] 10.42444
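The arithmetic checks out on synthetic timestamps; a sketch, with made-up times:

```r
# ten fake tweets, 18 minutes apart, spanning 2.7 hours
fake_times <- as.POSIXct("2018-01-15 12:00:00", tz = "UTC") + (0:9) * 18 * 60
fake_range <- difftime(max(fake_times), min(fake_times), units = "hours")
as.numeric(fake_range) * 60 / 15             # minutes covered by each of 15 bars: 10.8
length(fake_times) / as.numeric(fake_range)  # posting rate: ~3.7 tweets per hour
```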
Time introduces a discrete element. Tweets occur at specific times.
Can we predict tweets with the Poisson discrete distribution?
We will cover these notions, further down.
We order the tweets by time.
sortweetDF <- tweetDF[order(as.integer(created)), ]
# detach the unsorted d.f.
detach(tweetDF)
# attach the sorted d.f.
attach(sortweetDF)
We compute the time difference between the tweets in seconds. Seconds are the smallest time measure in YYYY-MM-DD HH:MM:SS.
diff(created)
## Time differences in secs
## [1] 5 231 130 45 11 83 15 38 13 16 0 142 22 133 46 6 27
## [18] 94 29 11 28 24 4 8 34 13 1 78 89 7 3 11 19 9
## [35] 8 8 127 12 19 19 11 1 0 35 3 48 15 26 91 79 68
## [52] 30 22 40 11 44 7 41 8 18 8 33 16 5 11 11 14 12
## [69] 52 140 2 87 18 26 33 12 100 60 41 37 10 57 2 54 2
## [86] 2 20 4 34 41 28 10 275 31 48 119 2 6 6 20 4 9
## [103] 11 45 94 182 66 16 5 35 4 106 4 8 94 74 117 103 56
## [120] 30 53 12 6 28 9 71 32 34 19 19 153 22 155 9 38 49
## [137] 41 68 8 43 75 11 47 9 16 25 3 7 13 12 21 32 18
## [154] 175 17 48 37 14 20 47 70 13 37 7 3 20 1 97 27 12
## [171] 98 7 6 7 166 29 28 24 32 177 9 33 3 16 22 7 2
## [188] 4 15 14 22 7 9 87 6 44 5 52 62 47 16 5 9 76
## [205] 19 15 3 20 11 81 82 4 17 5 40 7 19 17 3 25 148
## [222] 52 35 25 76 10 8 95 54 21 86 57 35 2 137 47 5 7
## [239] 6 25 5 5 8 172 25 49 60 15 27
diff <- as.numeric(diff(created))
diff <- as.data.frame(diff)
We plot the differences.
library(ggplot2)
ggplot(diff, aes(x=diff)) +
geom_histogram(bins=15, fill="white", colour="black") +
xlab('seconds') + ylab('frequency')We average the time difference.
mean(as.integer(diff(created)))
## [1] 37.67871
On average, there is one tweet every \(\approx\) 38s.
Is this the most frequent value?
library("modeest")
mfv(as.integer(diff(created)))## [1] 7 11
median(as.integer(diff(created)))## [1] 21
The mfv function shows that the most commonly occurring time intervals between neighboring tweets are 7 and 11 seconds.
The median is 21s. The distribution is skewed towards 0, with high outliers inflating the mean.
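A tiny made-up sample of delays shows the same pattern — under positive skew, the mode is below the median, which is below the mean:

```r
# skewed sample of delays in seconds: many small values, one large outlier
delays <- c(5, 5, 5, 7, 9, 12, 20, 35, 60, 240)
mean(delays)    # 39.8, inflated by the outlier
median(delays)  # 10.5
# the most frequent value, without the modeest package
as.numeric(names(which.max(table(delays))))  # 5
```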
We count the number of tweets with certain time intervals; under 60, 30, and 10 seconds difference. We plot the results.
seconds <- c(sum((as.integer(diff(created))) < 60),
sum((as.integer(diff(created))) < 30),
sum((as.integer(diff(created))) < 10))
tranches <- c('60s', '30s', '10s')
time <- data.frame(tranches = tranches, seconds = seconds)
library(ggplot2)
ggplot(time, aes(x=tranches, y=seconds)) +
geom_bar(stat='identity', fill="white", colour="black") +
xlab('tranche') + ylab('count')
Expressed as a ratio or proportion, this is the probability that the next tweet will arrive in x seconds or less.
pseconds <- c(sum((as.integer(diff(created))) < 60) / 500,
sum((as.integer(diff(created))) < 30) / 500,
sum((as.integer(diff(created))) < 10) / 500)
tranches <- c('60s', '30s', '10s')
time <- data.frame(tranches = tranches, seconds = pseconds)
library(ggplot2)
ggplot(time, aes(x=tranches, y=seconds)) +
geom_bar(stat='identity', fill="white", colour="black") +
xlab('tranche') + ylab('probability')
The probability that the next tweet will arrive in 60 seconds or less is 0.4; in 10 seconds or less, 0.138.
Can we build a prediction around this concept? Let’s create a function.
Given a list of tweet arrival times, we can calculate the delays between the tweets using time differences. Then, we can compute an ordered list of cumulative probabilities of arrival for the sequential list of time increments, for plotting.
ArrivalProbability <- function(times, increment, max)
{
# Initialize an empty vector
plist <- NULL
# Probability is defined over the size of this sample
# of arrival times
timeLen <- length(times)
# May not be necessary, but checks for input mistake
if (increment > max) {return(NULL)}
for (i in seq(increment, max, by = increment))
{
# diff() requires a sorted list of times
# diff() calculates the delays between neighboring times
# the logical test <i provides a list of TRUEs and FALSEs
# of length = timeLen, then sum() counts the TRUEs.
# Divide by timeLen to calculate a proportion
plist <- c(plist, (sum(as.integer(diff(times)) < i)) / timeLen)
}
return(plist)
}
times, a sorted, ascending list of arrival times, in POSIXct.
increment, the time increment for each new slot, e.g. 10s.
max, the highest time increment, e.g. 240s.
incr = 10
maxi = 60
created_plist <- ArrivalProbability(created, incr, maxi)
length_plist <- 1:length(created_plist) * incr
simtime <- data.frame(probability = created_plist, tranches = length_plist)
library(ggplot2)
ggplot(simtime, aes(x=tranches, y=probability)) +
geom_point(shape=1, size=2) +
xlab('tranche') + ylab('probability')We read the graph as the y probability that the next tweet will arrive in x seconds or less.
# detach the sorted d.f.
detach(sortweetDF)
#climate
head(tweetList, 3)
## [[1]]
## [1] "MonterioJulio: RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjHzH\n\n#climate #ClimateChangeIsR…"
##
## [[2]]
## [1] "geddes_anna: RT @iddrilefil: #Development and #climate #finance is not about labelled infrastructure but about good economic planning and setting the ri…"
##
## [[3]]
## [1] "SteliosYiatros: RT @WaterJPI: Applications now open! Join The @ClimateKIC Journey, Europe's largest #climate #innovation summer school. https://t.co/bTG7v0…"
The most basic storage solutions are CSV and flat files.
tweetDF_climate <- twListToDF(tweetList)
write.csv uses ‘,’ separators by default and write.csv2 uses ‘;’ separators by default. Encode the text to capture the foreign characters; UTF-16 is better than UTF-8 here; consult https://en.wikipedia.org/wiki/UTF-16
write.csv2(tweetDF_climate, 'tweetDF_climate.csv', row.names = FALSE, fileEncoding = "UTF-16LE")
In a flat file.
write.table(tweetDF_climate, 'tweetDF_climate.txt', sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")
#earth
head(tweetDF, 1)
## text
## 1 RT @EnjoyNature: #Sunset at #YaquinaHead #Lighthouse #Newport #Oregon\n\n#Nature #Relax #Beauty #Photo #Vacation #Ocean #Earth #Travel\n#Color…
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 <NA> 2018-01-15 18:59:26 FALSE
## replyToSID id replyToUID
## 1 <NA> 952978551984345088 <NA>
## statusSource
## 1 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 EnjoyNature 64 TRUE FALSE <NA> <NA>
head(sortweetDF, 1)
## text
## 250 RT @earthtokens: Site inspection at Hotel Verde (Africa Greenest Hotel) - impactChoice client & #EARTH #Token supporter, take a look at the…
## favorited favoriteCount replyToSN created truncated
## 250 FALSE 0 <NA> 2018-01-15 16:23:04 FALSE
## replyToSID id replyToUID
## 250 <NA> 952939197328838657 <NA>
## statusSource
## 250 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 250 NekonyunN 171 TRUE FALSE <NA> <NA>
In a CSV.
write.csv2(tweetDF, 'tweetDF.csv', row.names = FALSE, fileEncoding = "UTF-16LE")
write.csv2(sortweetDF, 'sortweetDF.csv', row.names = FALSE, fileEncoding = "UTF-16LE")
In a flat file.
write.table(tweetDF, 'tweetDF.txt', sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")
write.table(sortweetDF, 'sortweetDF.txt', sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")
Consider these packages for spreadsheet formats:
xlsx, xlsxjars.
XLConnect, XLConnectjars.
xlsReadWrite.
readODS for LibreOffice, OpenOffice.
gdata, a general purpose package; read.xls reads Excel files on Mac and Linux.
For more massive storage, we can consider SQL databases. The twitteR package offers the appropriate functions:
search_twitter_and_store(searchString, table_name = "tweets", lang = NULL, locale = NULL, geocode = NULL, retryOnRateLimit = 120).
register_db_backend(db_handle); db_handle is a DBI connection.
register_sqlite_backend(sqlite_file, ...); the file path for a SQLite file.
register_mysql_backend(db_name, host, user, password, ...); the hostname the database is on, the username and the password to connect to the database with.
Also:
RMySQL connects to MySQL.
ROracle connects with the widely used Oracle commercial database package.
RPostgreSQL connects with the well-developed, full-featured PostgreSQL (sometimes just called Postgres) database system.
RSQLite connects with SQLite, another open source, independently developed database system.
RMongo connects with the MongoDB system, which is a NoSQL database. MongoDB uses JavaScript to access data; as such, it is well suited for web development applications.
RODBC connects with ODBC-compliant databases, which include Microsoft’s SQL Server, Microsoft Access, and Microsoft Excel, among others. Note that these applications are native to Windows and Windows Server, and as such the support for Linux and Mac OS is limited.
RHadoop for Hadoop, HDFS, MapReduce.
SparkR for Spark.
# open a connection with the db
library(RMySQL) # provides the MySQL DBI driver
con <- dbConnect(dbDriver("MySQL"), dbname = "test")
dbListTables(con)
# create
dbWriteTable(con, "census", testFrame, overwrite = TRUE)
dbListTables(con)
# run an SQL query
dbGetQuery(con, "SELECT region, july11pop FROM census WHERE july11pop < 1000000")
# close the connection; ALWAYS!
dbDisconnect(con)
We found that arrival times of tweets on a given topic seem to mimic the Poisson distribution.
The time difference between tweets.
library(ggplot2)
ggplot(diff, aes(x=diff)) +
geom_histogram(bins=15, fill="white", colour="black") +
xlab('seconds') + ylab('frequency')
We also covered the mean, median, mfv and skewness of these time differences.
Here are 4 streams of random numbers that roughly fit the Poisson distribution.
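They can be reproduced along these lines (a sketch; the seed and the four \(\lambda\) values are assumptions, chosen only to show the shapes):

```r
set.seed(1)  # arbitrary seed for reproducibility
par(mfrow = c(2, 2))
for (lambda in c(2, 5, 10, 30)) {
  # 1000 random Poisson draws per panel
  hist(rpois(1000, lambda), breaks = 15, col = "white",
       main = bquote(lambda == .(lambda)), xlab = "value")
}
par(mfrow = c(1, 1))
```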
Tweet time differences resemble something between the upper-right and lower-left graphs. Tweets behave like the Poisson distribution: positively skewed, with a long tail of outliers (a few large time differences) that makes the average higher than the median and the mfv.
If we assign parameters n=1000 and \(\lambda\)=10, we get a mean and variance of:
mean(rpois(1000,10)); var(rpois(1000,10))
## [1] 10.047
## [1] 9.861637
The Poisson distribution has a property that the variance equals the mean (“equidispersion”).
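A reproducible check of equidispersion (the seed is arbitrary):

```r
set.seed(123)  # arbitrary seed
x <- rpois(100000, 10)  # a large Poisson sample with lambda = 10
mean(x)  # close to 10
var(x)   # also close to 10
```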
We also saw the time difference (count and) probability.
library(ggplot2)
ggplot(time, aes(x=tranches, y=seconds)) +
geom_bar(stat='identity', fill="white", colour="black") +
xlab('tranche') + ylab('probability')
We can do the same with the random numbers (n = 1000 iterations, \(\lambda\)=30).
pseconds <- c(sum(rpois(1000,30) < 10) / 1000,
sum(rpois(1000,30) < 30) / 1000,
sum(rpois(1000,30) < 60) / 1000)
tranches <- factor(c('10s', '30s', '60s'))
tranches <- factor(tranches, levels = c('10s', '30s', '60s'))
time <- data.frame(tranches = tranches, seconds = pseconds)
library(ggplot2)
ggplot(time, aes(x=tranches, y=seconds)) +
geom_bar(stat='identity', fill="white", colour="black") +
xlab('tranche') + ylab('probability')
Or with other tranches (n = 1000 iterations, \(\lambda\)=10).
pseconds <- c(sum(rpois(1000,10) < 5) / 1000,
sum(rpois(1000,10) < 10) / 1000,
sum(rpois(1000,10) < 20) / 1000)
tranches <- factor(c('5s', '10s', '20s'))
tranches <- factor(tranches, levels = c('5s', '10s', '20s'))
time <- data.frame(tranches = tranches, seconds = pseconds)
library(ggplot2)
ggplot(time, aes(x=tranches, y=seconds)) +
geom_bar(stat='identity', fill="white", colour="black") +
xlab('tranche') + ylab('probability')
According to the Poisson distribution, if the curve is steeper, the topic is more active, more popular, because the delays between tweets are shorter.
If we want to compare two different topics, the more popular is likely to have a higher posting rate.
We can now develop a test to compare two different topics to see which one is more popular (or at least which one has a higher posting rate).
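Under a Poisson model, the comparison can be framed with ppois. Suppose (illustrative values only) topic A's delays have mean \(\lambda\) = 3 seconds and topic B's \(\lambda\) = 10: the probability that the next tweet arrives within 5 seconds is far higher for topic A:

```r
# P(delay <= 5 s) under each assumed mean delay
ppois(5, lambda = 3)   # about 0.916
ppois(5, lambda = 10)  # about 0.067
```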
The next function, DelayProbability, is similar to the above function, ArrivalProbability, but works with an unsorted list of delay times.
DelayProbability <- function(delays, increment, max)
{
# Initialize an empty vector
plist <- NULL
# Probability is defined over the size of this sample
# of arrival times
delayLen <- length(delays)
# May not be necessary, but checks for input mistake
if (increment > max) {return(NULL)}
for (i in seq(increment, max, by = increment))
{
# logical test <=i provides list of TRUEs and FALSEs
# of length = timeLen, then sum() counts the TRUEs
plist<-c(plist,(sum(delays <= i) / delayLen))
}
return(plist)
}
delays, an unsorted list of delays between tweets, numeric, in seconds.
increment, the time increment for each new slot, e.g. 10s.
max, the highest time increment, e.g. 240s.
Let’s simulate two Poisson distributions (as if comparing the posting rates of two topics) from 1s to 20s with 1s increments.
redf <- data.frame(colour = 'red',
time = seq(1,20,1),
probability = DelayProbability(rpois(100, 10), 1, 20))
greenf <- data.frame(colour = 'green',
time = seq(1,20,1),
probability = DelayProbability(rpois(100, 3), 1, 20))
redgreenf <- rbind(redf, greenf)
library(ggplot2)
ggplot(redgreenf, aes(x=time, y=probability, colour=colour)) +
geom_point(shape=1, size=2) +
xlab('seconds') + ylab('probability')
The green curve is steeper (its delays are shorter): a higher posting rate. There is a more than 85% probability that the next green tweet will arrive within 5s. The same probability for the red topic is less than 15%. The green topic is ‘hotter’.
However, this is just one sample. Let’s run multiple samples (bootstrapping) of the green topic.
par(mfrow = c(1,2))
plot(DelayProbability(rpois(100, 10), 1, 20), col = "green3", ylab = 'probability', xlab = 'seconds', main = '15 samples')
grid()
for (i in 1:15) {
points(DelayProbability(rpois(100, 10), 1, 20), col = "green3")
}
plot(DelayProbability(rpois(100, 10), 1, 20), col = "green3", ylab = '', xlab = 'seconds', main = '100 samples')
for (i in 1:100) {
points(DelayProbability(rpois(100, 10), 1, 20), col = "green3")
}
grid()
From a Poisson distribution with \(\lambda\) = 10, the probability of a delay of 3 seconds or less between two tweets is:
ppois(3, lambda = 10)
## [1] 0.01033605
The \(\lambda\) parameter is the mean of the distribution; it falls close to the median, roughly splitting the distribution into two halves (the dashed line in the plot).
greenf <- data.frame(colour = 'green',
time = seq(1,20,1),
probability = DelayProbability(rpois(100, 10), 1, 20))
par(mfrow = c(1,1))
library(ggplot2)
ggplot(greenf, aes(x=time, y=probability)) +
geom_point(shape=1, size=2) +
xlab('seconds') + ylab('probability') +
geom_vline(xintercept=20/2, linetype='dashed', col='darkgray')
If we randomly simulate one very large sample…
mean(rpois(100000, 10)); var(rpois(100000, 10))
## [1] 10.00753
## [1] 10.04851
…the mean approaches the variance; both are close to \(\lambda\) = 10.
From above, we remember the probability that the next tweet will come in the next 3s:
sum(rpois(100000, 10) <= 3) / 100000
# or simply
ppois(3, lambda = 10)
## [1] 0.01082
## [1] 0.01033605
The probability the next tweet will come in the next 10s?
ppois(10, lambda = 10)
# the other way around
qpois(0.58303, lambda = 10)
# in and out
qpois(ppois(10, lambda = 10), lambda=10)
## [1] 0.5830398
## [1] 10
## [1] 10
The empirical proportion and the theoretical probability are very close:
(58638 / 100000); ppois(10, lambda = 10)
## [1] 0.58638
## [1] 0.5830398
How much variation is there around one of these probabilities?
poisson.test(58638, 100000)$conf.int
## [1] 0.5816434 0.5911456
## attr(,"conf.level")
## [1] 0.95
The answer is in the confidence interval, i.e. the lower and upper bounds at a 95% confidence level. Roughly speaking, about 95% of samples of this size produce estimates that fall within such an interval.
# lower limit
poisson.test(58638, 100000)$conf.int[1]
poisson.test(58638, 100000)$conf.int[1] < ppois(10, lambda = 10)
## [1] 0.5816434
## [1] TRUE
# upper limit
poisson.test(58638, 100000)$conf.int[2]
poisson.test(58638, 100000)$conf.int[2] > ppois(10, lambda = 10)
## [1] 0.5911456
## [1] TRUE
# other samples, other intervals
poisson.test(5863, 10000)$conf.int
poisson.test(586, 1000)$conf.int
poisson.test(58, 100)$conf.int
## [1] 0.5713874 0.6015033
## attr(,"conf.level")
## [1] 0.95
## [1] 0.5395084 0.6354261
## attr(,"conf.level")
## [1] 0.95
## [1] 0.4404183 0.7497845
## attr(,"conf.level")
## [1] 0.95
The smaller the sample, the larger the confidence interval, and the greater the variability or imprecision.
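The widening is easy to quantify; the counts below reuse the samples above, and the helper function width is ours, not part of stats:

```r
# width of the 95% confidence interval for each sample size
width <- function(x, n) as.numeric(diff(poisson.test(x, n)$conf.int))
width(58638, 100000)  # narrowest
width(5863, 10000)
width(586, 1000)
width(58, 100)        # widest
```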
Let’s apply this knowledge to a practical case. We want to compare two sets of arrival rates.
The next function grabs a number of tweets, maxTweets, on a topic, searchTerm, and converts the list into a data frame.
TweetFrame <- function(searchTerm, maxTweets)
{
twtList <-
searchTwitter(searchTerm, n = maxTweets)
return(do.call("rbind",
lapply(twtList, as.data.frame)))
}
What are the current hot trends around Montréal?
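Note that searchTwitter() and the trend functions below only work after authenticating against the Twitter API with setup_twitter_oauth(); a minimal sketch, where the four credential strings are placeholders for the values of your own Twitter app:

```r
# authenticate first; the four strings below are placeholders
setup_twitter_oauth(consumer_key    = "your_consumer_key",
                    consumer_secret = "your_consumer_secret",
                    access_token    = "your_access_token",
                    access_secret   = "your_access_secret")

# afterwards the helper can be called, e.g.
# test_DF <- TweetFrame("#rstats", 100)
```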
# coordinates of one of the approaches to the Jacques-Cartier bridge
Mtl_woeid <- closestTrendLocations(45.522660, -73.546428)
Mtl_woeid## name country woeid
## 1 Montreal Canada 3534
Mtl_trends <- getTrends(as.numeric(Mtl_woeid[3]))
# top names
top <- 10
Mtl_trends[1:top, 'name']## [1] "Dolores O'Riordan" "#MLKDay" "#BlueMonday"
## [4] "Andrew Shaw" "Kevin Glenn" "#NationalHatDay"
## [7] "#iaccmm" "Young Canadians" "#TOTY"
## [10] "Toronto Police"
# pick two distinct topics at random
hot_two <- sample(top, 2)
# extract the name
hot_two_1 <- Mtl_trends[hot_two[1], 'name']
hot_two_2 <- Mtl_trends[hot_two[2], 'name']
hot_two_1; hot_two_2## [1] "Andrew Shaw"
## [1] "Dolores O'Riordan"
Let’s scrape Twitter for the two hot topics and compare them.
no_tweets <- 500
hot_two_1_DF <- TweetFrame(hot_two_1, no_tweets)
hot_two_2_DF <- TweetFrame(hot_two_2, no_tweets)
hot_two_1_DF[1:2, 'text']## [1] "RT @cultureoflosing: With both Andrew Shaw and now Logan Shaw, the Habs should acquire Henrik Sedin and play them all on the same like for…"
## [2] "RT @StuCowan1: My updated Game Day Notebook setting up tonight's matchup between the #Habs and #Islanders at the Bell Centre (7:30 p.m., TS…"
hot_two_2_DF[1:2, 'text']## [1] "RT @whereisMUNA: gone too soon. \nrest in power to one of our biggest musical inspirations, dolores o'riordan. \nthank you. https://t.co/Zghv…"
## [2] "RT @CNNEE: Muere la cantante de The Cranberries Dolores O'Riordan\nhttps://t.co/3B0Gu3qp77"
We sort the data frames by the time they were released ($created).
sort_hot_two_1_DF <- hot_two_1_DF[order(as.integer(hot_two_1_DF$created)), ]
sort_hot_two_2_DF <- hot_two_2_DF[order(as.integer(hot_two_2_DF$created)), ]
sort_hot_two_1_DF[1:2, 'text']## [1] "RT @DanyAllstar15: I can’t stand even looking at Andrew Shaw. Like yeah, I’m definitely a huge loser but wow is this guy a fucking bone job…"
## [2] "RT @DanyAllstar15: I can’t stand even looking at Andrew Shaw. Like yeah, I’m definitely a huge loser but wow is this guy a fucking bone job…"
sort_hot_two_2_DF[1:2, 'text']## [1] "RT @FoxNews: Dolores O'Riordan, beloved Cranberries singer, dies suddenly at 46 https://t.co/8hCrXA2rU5"
## [2] "Dolores O'Riordan, Cranberries lead singer, dead at 46 https://t.co/gfaElr7ecT"
We extract two vectors of time differences.
sort_hot_two_1_delays <- as.integer(diff(sort_hot_two_1_DF$created))
sort_hot_two_2_delays <- as.integer(diff(sort_hot_two_2_DF$created))
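One caution: diff() on a date-time vector returns a difftime whose units are picked automatically (seconds, minutes, hours), so as.integer() can silently return minutes instead of seconds. Forcing seconds is safer:

```r
# force the inter-arrival times into seconds, whatever units diff() picked
sort_hot_two_1_delays <- as.numeric(diff(sort_hot_two_1_DF$created), units = "secs")
sort_hot_two_2_delays <- as.numeric(diff(sort_hot_two_2_DF$created), units = "secs")
```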
# data class
class(sort_hot_two_1_delays)
# first 10; measured in seconds
sort_hot_two_1_delays[1:10]## [1] "integer"
## [1] 299 177 88 141 40 136 48 123 157 423
We compute descriptive statistics; mfv(), the most frequent value, comes from the modeest package.
mean(sort_hot_two_1_delays); mean(sort_hot_two_2_delays) # seconds## [1] 288.3347
## [1] 0.04408818
median(sort_hot_two_1_delays); median(sort_hot_two_2_delays) # seconds## [1] 16
## [1] 0
mfv(sort_hot_two_1_delays); mfv(sort_hot_two_2_delays) # most frequent value in seconds## [1] 1
## [1] 0
sqrt(var(sort_hot_two_1_delays)); sqrt(var(sort_hot_two_2_delays)) # std deviation in seconds## [1] 1193.898
## [1] 0.205497
sort_hot_two_1_delays_count <- sum(sort_hot_two_1_delays <= 30)
sort_hot_two_2_delays_count <- sum(sort_hot_two_2_delays <= 30)
sort_hot_two_1_delays_count; sort_hot_two_2_delays_count # tweets count under 30s## [1] 307
## [1] 499
sort_hot_two_1_delays_prob <- sum(sort_hot_two_1_delays <= 30) / no_tweets
sort_hot_two_2_delays_prob <- sum(sort_hot_two_2_delays <= 30) / no_tweets
sort_hot_two_1_delays_prob; sort_hot_two_2_delays_prob # probability the next tweet is in 30s or less## [1] 0.614
## [1] 0.998
# topic 1 confidence interval
sort_hot_two_1_delays_CI <- poisson.test(sort_hot_two_1_delays_count, no_tweets)$conf.int
sort_hot_two_1_delays_CI[1]; sort_hot_two_1_delays_prob; sort_hot_two_1_delays_CI[2]## [1] 0.5472308
## [1] 0.614
## [1] 0.6866687
stripchart(c(sort_hot_two_1_delays_CI[1],
sort_hot_two_1_delays_prob,
sort_hot_two_1_delays_CI[2]),
main = 'topic 1 confidence interval', xlab = 'lower-mean-upper')
grid(ny = NA)
# topic 2 confidence interval
sort_hot_two_2_delays_CI <- poisson.test(sort_hot_two_2_delays_count, no_tweets)$conf.int
sort_hot_two_2_delays_CI[1]; sort_hot_two_2_delays_prob; sort_hot_two_2_delays_CI[2]## [1] 0.9123449
## [1] 0.998
## [1] 1.089531
stripchart(c(sort_hot_two_2_delays_CI[1],
sort_hot_two_2_delays_prob,
sort_hot_two_2_delays_CI[2]),
main = 'topic 2 confidence interval', xlab = 'lower-mean-upper')
grid(ny = NA)
Let’s visualize all this.
sort_hot_two_delays_CI <- data.frame(topic = factor(c(1,1,1,2,2,2)),
confint = c(sort_hot_two_1_delays_CI, sort_hot_two_1_delays_prob, sort_hot_two_2_delays_CI, sort_hot_two_2_delays_prob))
sort_hot_two_delays_CI## topic confint
## 1 1 0.5472308
## 2 1 0.6866687
## 3 1 0.6140000
## 4 2 0.9123449
## 5 2 1.0895309
## 6 2 0.9980000
library(ggplot2)
ggplot(sort_hot_two_delays_CI, aes(x=topic, y=confint)) +
geom_boxplot() +
xlab('topic') + ylab('confidence interval')
The test is a comparison of Poisson rates: we test the null hypothesis that the arrival rate for topic 1 equals the arrival rate for topic 2 (a rate ratio of 1).
poisson.test(c(sort_hot_two_1_delays_count,
sort_hot_two_2_delays_count),
c(no_tweets,
no_tweets))##
## Comparison of Poisson rates
##
## data: c(sort_hot_two_1_delays_count, sort_hot_two_2_delays_count) time base: c(no_tweets, no_tweets)
## count1 = 307, expected count1 = 403, p-value = 1.388e-11
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.5319483 0.7106438
## sample estimates:
## rate ratio
## 0.6152305
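As a sanity check, with identical time bases the rate ratio reduces to the ratio of the two counts:

```r
# rate ratio = (count1/T) / (count2/T) = count1/count2 when time bases match
sort_hot_two_1_delays_count / sort_hot_two_2_delays_count  # 307/499, about 0.615
```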
The p-value is far below 5%. The test rejects the null hypothesis in favour of the alternative hypothesis of unequal rates (as the boxplot above also suggests).