Foreword

  • Output options: the ‘tango’ syntax-highlighting style and the ‘readable’ theme.
  • Snippets and results.
  • Compiled on Martin Luther King Jr. Day 2018 (to put the tweets in context).


A word on Twitter

Twitter generates unstructured data, as opposed to structured data (numeric, ordinal, or binary data, for example).

We can capture this unstructured data, build a corpus, and run analyses with the twitteR package. Beforehand, to download tweets (or scrape the web in general), we need access to an API.

Notes on the main packages used in this case



The initial setup

Twitter: creating an account

Above all, we create a Twitter account and/or log in to Twitter.

R: installing and loading packages

Here is a function that takes as input the name of a package. It tests whether the package has been downloaded – ‘installed’ – from the R code repository. If it has not yet been downloaded/installed, the function does it.

Then the function uses require to prepare the package (it works like library):

EnsurePackage <- function(x)
{
  x <- as.character(x)
  if (!require(x, character.only = TRUE))
  {
    install.packages(pkgs = x,
       repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}

require does the same thing as library, but it also returns the value ‘FALSE’ if the requested package has not yet been downloaded.
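A quick illustration of that return value, with a made-up package name (an assumption; any name not installed on the machine behaves the same):

```r
# require() warns and returns FALSE instead of stopping with an error,
# which is what lets EnsurePackage test for a missing package
ok <- require("notInstalledPkg123", character.only = TRUE)  # hypothetical name
ok
## [1] FALSE
```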

We use the function and load all packages:

PrepareTwitter <- function()
{
  EnsurePackage("bitops")
  EnsurePackage("RCurl")
  EnsurePackage("RJSONIO")
  EnsurePackage("twitteR")
  EnsurePackage("ROAuth")
}

We execute the function and install the necessary packages: bitops, RCurl, RJSONIO, twitteR and ROAuth.

PrepareTwitter()

Alternatively, we can load the packages directly.

library(bitops)
library(RCurl)
library(RJSONIO)
library(twitteR)
library(ROAuth)

These steps might be tedious. It might not work the first time: dependencies are needed for the above packages. Find help online (Stack Overflow, for example) to fix these issues. In the end, we should be able to properly load all 5 packages.

Other installations might be needed: install.packages(c("devtools", "rjson", "bit64", "httr")). Restarting the R session after an installation helps. Do not forget to load the packages in R as well: restart the R session, then load all the needed packages.

library(devtools)
library(rjson)
library(bit64)
library(httr)

R: getting new SSL tokens

Depending on the operating system and its version, it may be necessary to provide new SSL certificates. Certificates help maintain secure communications across the Internet; most computers keep an up-to-date copy on file, but not all of them do.

On Windows:

download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile = 'cacert.pem')

twitteR uses RCurl which in turn employs SSL security whenever ‘https’ appears in a URL.

The Twitter API: using your OAuth tokens

With the Twitter account, go to the settings, then to Applications, and create a new application. Create it and/or log in to apps. You will get redirected to a screen with all the OAuth settings of your new app (the authentication process).

We need to perform this ‘handshake’ every time R interacts with Twitter.

Down the line, we need 4 codes: ‘consumerKey’, ‘consumerSecret’, ‘access_token’ and ‘access_secret’. How do we get them?

The wording and the procedure can change from time to time. Check out the latest twitteR documentation for more instructions. Consult the Twitter Developer Documentation website. Search online for help on ‘retrieving data from Twitter’. Here is a good example.

Once we have the 4 codes, we create 4 variables (api_key, api_secret, access_token, access_secret) to link R with the Twitter account (my snippet remains secret with echo=FALSE).

api_key <- "YOUR API KEY" # or consumer key
api_secret <- "YOUR API SECRET" # or consumer secret
access_token <- "YOUR ACCESS TOKEN"
access_secret <- "YOUR ACCESS TOKEN SECRET"

We wrap up the OAuth authentication process with the handshake functions from the httr package and we open a twitteR session.

setup_twitter_oauth(api_key,
                    api_secret,
                    access_token,
                    access_secret)
## [1] "Using direct authentication"

When the API is on, R should now be linked to Twitter.

We test it with the searchTwitter function.

searchTwitter(searchString, 
              n=3, 
              lang=NULL,
              since=NULL, 
              until=NULL,
              locale=NULL, 
              geocode=NULL, 
              sinceID=NULL, 
              maxID=NULL,
              resultType=NULL, 
              retryOnRateLimit=120)

#sun

searchTwitter("sun",
              n = 3)
## [[1]]
## [1] "MLB37167: RT @Breaking911: UPDATE: At least 17 people wounded in MLK Day weekend shootings across Chicago - Sun Times https://t.co/qoqIPz9w65"
## 
## [[2]]
## [1] "marianathesnake: Glow from @ABHcosmetics Sun dipped ☀️ ✨☀️✨☀️\xed\xa0\xbd\xed\xb2\x84#AnastasiaBeverlyHills @norvina1 #ABHGlow #sundipped #Highlighter… https://t.co/eJGndsanJ7"
## 
## [[3]]
## [1] "Kizzezzleepy: RT @SiblingsKisses: If @delavinkisses is the sun, then Kissables are like sunflowers, all looking towards the sun, drawn to her radiant, wa…"

On Windows, if you had to get a new certificate (‘cacert.pem’), you may also have to pass it explicitly: searchTwitter("#hashtag", n = 3, cainfo = "cacert.pem").

We change the language (lang=NULL is the default setting). Consult Wikipedia for the list of languages; pick the proper ISO 639-1 code. For example, French is ‘fr’.

#soleil

searchTwitter("#soleil",
              n = 3,
              lang = 'fr') 
## [[1]]
## [1] "JimmyD5976: RT @lavoixdunord: Les destinations les plus en vue à l’aéroport de Lille-Lesquin #soleil #vacances https://t.co/2p7e9O8DDd https://t.co/ytK…"
## 
## [[2]]
## [1] "Fleming1186: RT @VALBERGAlpesSud: Aujourd’hui c’est #lundi au #soleil \xed\xa0\xbd\xed\xb8\x8e\xed\xa0\xbc\xed\xbe\xbfet vous ? ⛄️\xed\xa0\xbd\xed\xb8\x9c #ski #winter #CotedAzurFrance #lovevalberg https://t.co/U8iZANPXMd"
## 
## [[3]]
## [1] "lavoixdunord: Les destinations les plus en vue à l’aéroport de Lille-Lesquin #soleil #vacances https://t.co/2p7e9O8DDd https://t.co/ytKjO0OZLp"

The worst is done; now the fun can begin!



twitteR, the functions

  • directMessage-class, class “directMessage”: a class to represent Twitter Direct Messages.
    • dmGet, dmSent, dmDestroy, dmSend, functions to manipulate Twitter direct messages.
  • getTrends, availableTrendLocations, closestTrendLocations, functions to view Twitter trends.
  • getUser, lookupUsers, functions to manage Twitter users.
  • import_statuses, import_trends, import_users, import_obj, json_to_users, json_to_statuses, json_to_trends, functions to import twitteR objects from various sources.
  • load_tweets_db, store_tweets_db, store_users_db, load_users_db, functions to persist/load twitteR data to/from a database.
  • registerTwitterOAuth, register OAuth credentials to a twitteR session.
  • searchTwitter, search Twitter.
  • search_twitter_and_store, a function to store searched tweets to a database.
    • register_db_backend, register_sqlite_backend, register_mysql_backend.
  • setup_twitter_oauth, sets up the OAuth credentials for a twitteR session.
  • status-class, class to contain a Twitter status.
    • taskStatus, a function to send a Twitter DM after completion of a task.
  • userTimeline, homeTimeline, mentions, retweetsOfMe, functions to view Twitter timelines.
  • twListToDF, a function to convert twitteR lists to data.frames
  • updateStatus, tweet, deleteStatus, functions to manipulate Twitter status.
  • use_oauth_token, sets up the OAuth credentials for a twitteR session from an existing Token object.
  • userFactory, a container object to model Twitter users.
  • stats::HoltWinters, Holt-Winters filtering of times series.
  • stats::plot.HoltWinters, plot function for Holt-Winters objects.
  • stats::predict.HoltWinters, prediction function for fitted Holt-Winters model.


Pulling data, pulling basic metadata

To pull data from Twitter, we log into the Twitter account (open the app), load in all the R packages and run the setup_twitter_oauth function to begin.

Location-based

We pull the top trends in two cities, near two bridges.

San Francisco, Golden Gate Bridge

SF_woeid <- closestTrendLocations(37.781157, -122.39720) # one of the accesses of the Golden Gate Bridge
SF_woeid
##            name       country   woeid
## 1 San Francisco United States 2487956
SF_trends <- getTrends(as.numeric(SF_woeid[3]))
head(SF_trends) # a data frame
##                name                                                   url
## 1           #MLKDay                 http://twitter.com/search?q=%23MLKDay
## 2 Dolores O'Riordan http://twitter.com/search?q=%22Dolores+O%27Riordan%22
## 3 #MotivationMonday       http://twitter.com/search?q=%23MotivationMonday
## 4         Happy MLK           http://twitter.com/search?q=%22Happy+MLK%22
## 5         #Heathers               http://twitter.com/search?q=%23Heathers
## 6   Birmingham Jail     http://twitter.com/search?q=%22Birmingham+Jail%22
##                       query   woeid
## 1                 %23MLKDay 2487956
## 2 %22Dolores+O%27Riordan%22 2487956
## 3       %23MotivationMonday 2487956
## 4           %22Happy+MLK%22 2487956
## 5               %23Heathers 2487956
## 6     %22Birmingham+Jail%22 2487956

Montréal, Jacques-Cartier

Mtl_woeid <- closestTrendLocations(45.522660, -73.546428) # one of the accesses of the Jacques-Cartier bridge
Mtl_woeid
##       name country woeid
## 1 Montreal  Canada  3534
Mtl_trends <- getTrends(as.numeric(Mtl_woeid[3]))
head(Mtl_trends) # a data frame
##                name                                                   url
## 1 Dolores O'Riordan http://twitter.com/search?q=%22Dolores+O%27Riordan%22
## 2           #MLKDay                 http://twitter.com/search?q=%23MLKDay
## 3       #BlueMonday             http://twitter.com/search?q=%23BlueMonday
## 4       Andrew Shaw         http://twitter.com/search?q=%22Andrew+Shaw%22
## 5       Kevin Glenn         http://twitter.com/search?q=%22Kevin+Glenn%22
## 6   #NationalHatDay         http://twitter.com/search?q=%23NationalHatDay
##                       query woeid
## 1 %22Dolores+O%27Riordan%22  3534
## 2                 %23MLKDay  3534
## 3             %23BlueMonday  3534
## 4         %22Andrew+Shaw%22  3534
## 5         %22Kevin+Glenn%22  3534
## 6         %23NationalHatDay  3534

Topic-based

#climate

We get some data (recent tweets).

# worldwide tweets
tweetList <- searchTwitter("#climate", n = 500)
mode(tweetList); length(tweetList)
## [1] "list"
## [1] 500

The object is a unidimensional structure: a list of 500 entries.

# the first tweet structure (data + metadata)
str(head(tweetList, 1))
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjH"| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2018-01-15 18:58:29"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "952978313047191552"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad</a>"
##   ..$ screenName   : chr "MonterioJulio"
##   ..$ retweetCount : num 7
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame':    1 obs. of  5 variables:
##   .. ..$ url         : chr "https://t.co/M8S0nwjHzH"
##   .. ..$ expanded_url: chr "https://www.sciencedaily.com/releases/2018/01/180103160202.htm"
##   .. ..$ display_url : chr "sciencedaily.com/releases/2018/…"
##   .. ..$ start_index : num 88
##   .. ..$ stop_index  : num 111
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,
##   ..  getLatitude, getLongitude, getReplyToSID, getReplyToSN,
##   ..  getReplyToUID, getRetweetCount, getRetweeted, getRetweeters,
##   ..  getRetweets, getScreenName, getStatusSource, getText, getTruncated,
##   ..  getUrls, initialize, setCreated, setFavoriteCount, setFavorited,
##   ..  setId, setIsRetweet, setLatitude, setLongitude, setReplyToSID,
##   ..  setReplyToSN, setReplyToUID, setRetweetCount, setRetweeted,
##   ..  setScreenName, setStatusSource, setText, setTruncated, setUrls,
##   ..  toDataFrame, toDataFrame#twitterObj
# the first 3 tweets (data alone)
head(tweetList, 3)
## [[1]]
## [1] "MonterioJulio: RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjHzH\n\n#climate #ClimateChangeIsR…"
## 
## [[2]]
## [1] "geddes_anna: RT @iddrilefil: #Development and #climate #finance is not about labelled infrastructure but about good economic planning and setting the ri…"
## 
## [[3]]
## [1] "SteliosYiatros: RT @WaterJPI: Applications now open! Join The @ClimateKIC Journey, Europe's largest #climate #innovation summer school. https://t.co/bTG7v0…"

We convert the object into a data frame. The data frame, another object, is a 2D structure (rows and columns).

tweetDF <- twListToDF(tweetList)
mode(tweetDF); dim(tweetDF)
## [1] "list"
## [1] 500  16
head(tweetDF, 3)
##                                                                                                                                               text
## 1 RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjHzH\n\n#climate #ClimateChangeIsR…
## 2     RT @iddrilefil: #Development and #climate #finance is not about labelled infrastructure but about good economic planning and setting the ri…
## 3     RT @WaterJPI: Applications now open! Join The @ClimateKIC Journey, Europe's largest #climate #innovation summer school. https://t.co/bTG7v0…
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE             0      <NA> 2018-01-15 18:58:29     FALSE
## 2     FALSE             0      <NA> 2018-01-15 18:55:11     FALSE
## 3     FALSE             0      <NA> 2018-01-15 18:55:03     FALSE
##   replyToSID                 id replyToUID
## 1       <NA> 952978313047191552       <NA>
## 2       <NA> 952977481765711872       <NA>
## 3       <NA> 952977446604890113       <NA>
##                                                                         statusSource
## 1  <a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>
## 2                 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##       screenName retweetCount isRetweet retweeted longitude latitude
## 1  MonterioJulio            7      TRUE     FALSE        NA       NA
## 2    geddes_anna            7      TRUE     FALSE        NA       NA
## 3 SteliosYiatros            3      TRUE     FALSE        NA       NA

Alternatively, we can run this code.

tweetDF <- do.call('rbind', lapply(tweetList, as.data.frame))
mode(tweetDF)
dim(tweetDF)
head(tweetDF, 3)

Useful functions

…for many other situations:

  • as.data.frame coerces each element of the tweetList list into a data frame.
  • lapply applies the coercing function to all the elements of the list.
  • rbind binds all rows together.
  • do.call() allows a flexible number of arguments to be supplied to a function: lapply runs first, and all the rows it produces are passed to "rbind" (quoted).
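A minimal, self-contained illustration of the same pattern on a toy list of records (no Twitter connection needed; the field names are made up):

```r
# two records mimicking parsed tweets
recs <- list(list(id = 1, txt = "a"), list(id = 2, txt = "b"))
# coerce each record to a 1-row data frame, then stack all the rows
df <- do.call("rbind", lapply(recs, as.data.frame))
dim(df)
## [1] 2 2
```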


Exploring the extractions

A little automation

We implement a function to automate the extraction and convert the results into a data frame.

TweetFrame <- function(searchTerm, maxTweets)
{
  twtList <-
    searchTwitter(searchTerm, n = maxTweets)
  return(do.call("rbind", lapply(twtList, as.data.frame)))
}

twtList is a temporary variable created within the function; because of variable scoping, it does not exist outside of it.
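A small sketch of that scoping behavior (assuming no global variable named tmp already exists):

```r
f <- function() {
  tmp <- 42  # tmp lives only inside the function
  tmp
}
f()
## [1] 42
exists("tmp")  # gone once the function returns
## [1] FALSE
```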

Tweet timing (metadata)

We create a new dataset.

#earth

tweetDF <- TweetFrame("#earth", 250)

We start the analysis.

Attributes and time created

We print the tweetDF object attributes and extract a vector from the d.f.

# attributes(tweetDF)$row.names   is skipped because it is a long enumeration; the number of tweets
# object attributes
attributes(tweetDF)$names
##  [1] "text"          "favorited"     "favoriteCount" "replyToSN"    
##  [5] "created"       "truncated"     "replyToSID"    "id"           
##  [9] "replyToUID"    "statusSource"  "screenName"    "retweetCount" 
## [13] "isRetweet"     "retweeted"     "longitude"     "latitude"
# one attribute
attributes(tweetDF)$class
## [1] "data.frame"
# metadata: time, POSIXct
head(tweetDF$created, 3)
## [1] "2018-01-15 18:59:26 UTC" "2018-01-15 18:58:59 UTC"
## [3] "2018-01-15 18:58:44 UTC"
# or attach the object to simplify the code
attach(tweetDF)
head(created, 3)
## [1] "2018-01-15 18:59:26 UTC" "2018-01-15 18:58:59 UTC"
## [3] "2018-01-15 18:58:44 UTC"

Visualization

We plot the data.

library(ggplot2)

ggplot(tweetDF, aes(x=created)) +
  geom_histogram(bins=15, fill="white", colour="black") +
  xlab('time') + ylab('frequency')

We compute the time range (earliest tweet to last tweet in the d.f.).

max(created); min(created)
## [1] "2018-01-15 18:59:26 UTC"
## [1] "2018-01-15 16:23:04 UTC"
timerange <- max(created) - min(created)
timerange
## Time difference of 2.606111 hours

We compute the time span covered by each histogram bar: the time range, converted to minutes, divided by the 15 bins, i.e., about 10.4 minutes per bar.

as.numeric(timerange) * 60 / 15 # minutes per histogram bar
## [1] 10.42444
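For the posting rate itself, divide the number of tweets by the time range in hours (a small sketch, plugging in the values from above):

```r
n_tweets <- 250           # tweets in the sample
timerange_h <- 2.606111   # time range, in hours (from above)
n_tweets / timerange_h    # roughly 96 tweets per hour
```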

Time introduces a discrete element. Tweets occur at specific times.

Can we predict tweets with the Poisson discrete distribution?

We will cover these notions, further down.

Sorting and extracting more time statistics (metadata)

We order the tweets by time.

sortweetDF <- tweetDF[order(as.integer(created)), ]

# detach the unsorted d.f.
detach(tweetDF)
# attach the sorted d.f.
attach(sortweetDF)

We compute the time difference between the tweets in seconds. Seconds are the smallest time measure in YYYY-MM-DD HH:MM:SS.

diff(created)
## Time differences in secs
##   [1]   5 231 130  45  11  83  15  38  13  16   0 142  22 133  46   6  27
##  [18]  94  29  11  28  24   4   8  34  13   1  78  89   7   3  11  19   9
##  [35]   8   8 127  12  19  19  11   1   0  35   3  48  15  26  91  79  68
##  [52]  30  22  40  11  44   7  41   8  18   8  33  16   5  11  11  14  12
##  [69]  52 140   2  87  18  26  33  12 100  60  41  37  10  57   2  54   2
##  [86]   2  20   4  34  41  28  10 275  31  48 119   2   6   6  20   4   9
## [103]  11  45  94 182  66  16   5  35   4 106   4   8  94  74 117 103  56
## [120]  30  53  12   6  28   9  71  32  34  19  19 153  22 155   9  38  49
## [137]  41  68   8  43  75  11  47   9  16  25   3   7  13  12  21  32  18
## [154] 175  17  48  37  14  20  47  70  13  37   7   3  20   1  97  27  12
## [171]  98   7   6   7 166  29  28  24  32 177   9  33   3  16  22   7   2
## [188]   4  15  14  22   7   9  87   6  44   5  52  62  47  16   5   9  76
## [205]  19  15   3  20  11  81  82   4  17   5  40   7  19  17   3  25 148
## [222]  52  35  25  76  10   8  95  54  21  86  57  35   2 137  47   5   7
## [239]   6  25   5   5   8 172  25  49  60  15  27
diff <- as.numeric(diff(created))
diff <- as.data.frame(diff)

We plot the differences.

library(ggplot2)

ggplot(diff, aes(x=diff)) +
  geom_histogram(bins=15, fill="white", colour="black") +
  xlab('seconds') + ylab('frequency')

We average the time difference.

mean(as.integer(diff(created)))
## [1] 37.67871

On average, there is one tweet every \(\approx\) 38 s.

Is this the most frequent value?

library("modeest")

mfv(as.integer(diff(created)))
## [1]  7 11
median(as.integer(diff(created)))
## [1] 21

The mfv function shows that the most commonly occurring time intervals between neighboring tweets are 7 and 11 s!

The median: 21 s. The distribution is skewed towards 0, but high outliers inflate the mean.

We count the number of tweets with certain time intervals; under 60, 30, and 10 seconds difference. We plot the results.

seconds <- c(sum((as.integer(diff(created))) < 60),
             sum((as.integer(diff(created))) < 30),
             sum((as.integer(diff(created))) < 10))

tranches <- c('60s', '30s', '10s')

time <- data.frame(tranches = tranches, seconds = seconds)
library(ggplot2)

ggplot(time, aes(x=tranches, y=seconds)) +
  geom_bar(stat='identity', fill="white", colour="black") +
  xlab('tranche') + ylab('count')

Such a ratio, or proportion, estimates the probability that the next tweet will arrive in x seconds or less.

pseconds <- c(sum((as.integer(diff(created))) < 60) / length(diff(created)),
              sum((as.integer(diff(created))) < 30) / length(diff(created)),
              sum((as.integer(diff(created))) < 10) / length(diff(created)))

tranches <- c('60s', '30s', '10s')

time <- data.frame(tranches = tranches, seconds = pseconds)
library(ggplot2)

ggplot(time, aes(x=tranches, y=seconds)) +
  geom_bar(stat='identity', fill="white", colour="black") +
  xlab('tranche') + ylab('probability')

The denominator is the number of intervals (249 for our 250 tweets), not an arbitrary constant. The estimated probability that the next tweet will arrive in 60 seconds or less is about 0.80; in 10 seconds or less, about 0.28.

Can we build a prediction around this concept? Let’s create a function.

Given a list of tweet arrival times, we can calculate the delays between the tweets using time differences. Then, we can compute an ordered list of cumulative probabilities of arrival over a sequential list of time increments, for plotting.

ArrivalProbability <- function(times, increment, max)
{
  # Initialize an empty vector
  plist <- NULL
  
  # Probability is defined over the size of this sample
  # of arrival times
  timeLen <- length(times)
  
  # May not be necessary, but checks for input mistake
  if (increment > max) {return(NULL)}

  for (i in seq(increment, max, by = increment))
  {
  # diff() requires a sorted list of times
  # diff() calculates the delays between neighboring times
  # the logical test <i provides a list of TRUEs and FALSEs
  # of length = timeLen, then sum() counts the TRUEs.
  # Divide by timeLen to calculate a proportion
  plist <- c(plist, (sum(as.integer(diff(times)) < i)) / timeLen)
  }
  return(plist)
}

  • times, a sorted, ascending list of arrival times, in POSIXct.
  • increment, the time increment for each new slot, e.g., 10s.
  • max, the highest time increment, e.g., 240s.

incr = 10
maxi = 60
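A quick sanity check of ArrivalProbability with synthetic arrival times (a hypothetical series with delays of 5, 15 and 25 seconds); the function definition above is assumed to be loaded:

```r
t0 <- as.POSIXct("2018-01-15 12:00:00", tz = "UTC")
arrivals <- t0 + c(0, 5, 20, 45)       # delays: 5, 15, 25 seconds
ArrivalProbability(arrivals, 10, 30)   # thresholds: 10, 20, 30 s
## [1] 0.25 0.50 0.75
```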

created_plist <- ArrivalProbability(created, incr, maxi)
length_plist <- 1:length(created_plist) * incr

simtime <- data.frame(probability = created_plist, tranches = length_plist)
library(ggplot2)

ggplot(simtime, aes(x=tranches, y=probability)) +
  geom_point(shape=1, size=2) +
  xlab('tranche') + ylab('probability')

We read the graph as follows: y is the probability that the next tweet will arrive in x seconds or less.

# detach the sorted d.f.
detach(sortweetDF)


Storing tweets – Possibilities

CSV & flat files

#climate

head(tweetList, 3)
## [[1]]
## [1] "MonterioJulio: RT @PaulHBeckwith: Scientists find surprising evidence of rapid changes in the #Arctic\n\nhttps://t.co/M8S0nwjHzH\n\n#climate #ClimateChangeIsR…"
## 
## [[2]]
## [1] "geddes_anna: RT @iddrilefil: #Development and #climate #finance is not about labelled infrastructure but about good economic planning and setting the ri…"
## 
## [[3]]
## [1] "SteliosYiatros: RT @WaterJPI: Applications now open! Join The @ClimateKIC Journey, Europe's largest #climate #innovation summer school. https://t.co/bTG7v0…"

The most basic storage solutions are CSV and flat files.

tweetDF_climate <- twListToDF(tweetList)

write.csv uses ‘,’ separators by default and write.csv2 uses ‘;’ separators by default (in both, the separator is fixed, so there is no need to pass sep). Encode the text to capture the foreign characters; UTF-16 works better than UTF-8 here; consult https://en.wikipedia.org/wiki/UTF-16

write.csv2(tweetDF_climate, 'tweetDF_climate.csv', row.names = FALSE, fileEncoding = "UTF-16LE")

In a flat file.

write.table(tweetDF_climate, 'tweetDF_climate.txt', sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")
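A minimal round-trip sketch with a toy data frame and a temporary file (the column names are made up), mirroring the encoding used above:

```r
df_toy <- data.frame(id = 1:2, txt = c("été", "café"), stringsAsFactors = FALSE)
f <- tempfile(fileext = ".csv")
write.csv2(df_toy, f, row.names = FALSE, fileEncoding = "UTF-16LE")
back <- read.csv2(f, fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)
nrow(back)  # the two rows survive the round trip
```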

#earth

head(tweetDF, 1)
##                                                                                                                                              text
## 1 RT @EnjoyNature: #Sunset at #YaquinaHead #Lighthouse #Newport #Oregon\n\n#Nature #Relax #Beauty #Photo #Vacation #Ocean #Earth #Travel\n#Color…
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE             0      <NA> 2018-01-15 18:59:26     FALSE
##   replyToSID                 id replyToUID
## 1       <NA> 952978551984345088       <NA>
##                                                         statusSource
## 1 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##    screenName retweetCount isRetweet retweeted longitude latitude
## 1 EnjoyNature           64      TRUE     FALSE      <NA>     <NA>
head(sortweetDF, 1)
##                                                                                                                                                 text
## 250 RT @earthtokens: Site inspection at Hotel Verde (Africa Greenest Hotel) - impactChoice client &amp; #EARTH #Token supporter, take a look at the…
##     favorited favoriteCount replyToSN             created truncated
## 250     FALSE             0      <NA> 2018-01-15 16:23:04     FALSE
##     replyToSID                 id replyToUID
## 250       <NA> 952939197328838657       <NA>
##                                                                             statusSource
## 250 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##     screenName retweetCount isRetweet retweeted longitude latitude
## 250  NekonyunN          171      TRUE     FALSE      <NA>     <NA>

In a CSV.

write.csv2(tweetDF, 'tweetDF.csv', row.names = FALSE, fileEncoding = "UTF-16LE")

write.csv2(sortweetDF, 'sortweetDF.csv', row.names = FALSE, fileEncoding = "UTF-16LE")

In a flat file.

write.table(tweetDF, 'tweetDF.txt', sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")

write.table(sortweetDF, 'sortweetDF.txt', sep = "\t", row.names = FALSE, fileEncoding = "UTF-16LE")

Spreadsheets

Consider packages:

  • xlsx.
  • xlsxjars.
  • XLConnect.
  • XLConnectjars.
  • xlsReadWrite.
  • xlsc.
  • readODS for LibreOffice, OpenOffice.
  • gdata, a general-purpose package; its read.xls function reads Excel files on Mac and Linux.

Databases

For more massive storage, we can consider SQL databases. The twitteR package offers the appropriate functions:

  • search_twitter_and_store(searchString, table_name = "tweets", lang = NULL, locale = NULL, geocode = NULL, retryOnRateLimit = 120).
  • register_db_backend(db_handle).
  • db_handle, a DBI connection.
  • register_sqlite_backend(sqlite_file, ...), file path for a SQLite file.
  • register_mysql_backend(db_name, host, user, password, ...), hostname the database is on, username to connect to the database with, password to connect to the database.

Also:

  • RMySQL connects to MySQL.
  • ROracle connects with the widely used Oracle commercial database package.
  • RPostgreSQL connects with the well-developed, full featured PostgreSQL (sometimes just called Postgres) database system.
  • RSQLite connects with SQLite, another open source, independently developed database system.
  • RMongo connects with the MongoDB system, which is a NoSQL database. MongoDB uses JavaScript to access data. As such it is well suited for web development applications.
  • RODBC connects with ODBC compliant databases, which include Microsoft’s SQLserver, Microsoft Access, and Microsoft Excel, among others. Note that these applications are native to Windows and Windows server, and as such the support for Linux and Mac OS is limited.
  • RHadoop, Hadoop, HDFS, MapReduce.
  • SparkR Spark.

A code snippet

# load the DBI driver and open a connection with the db
library(RMySQL)
con <- dbConnect(dbDriver("MySQL"), dbname = "test")

dbListTables(con)
# create
dbWriteTable(con, "census", testFrame, overwrite = TRUE)
dbListTables(con)

# run an SQL query
dbGetQuery(con, "SELECT region, july11pop FROM census WHERE july11pop < 1000000")

# close the connection; ALWAYS!
dbDisconnect(con)


Tweet timing – Continued

We found that arrival times of tweets on a given topic seem to mimic the Poisson distribution.

The time difference between tweets.

library(ggplot2)

ggplot(diff, aes(x=diff)) +
  geom_histogram(bins=15, fill="white", colour="black") +
  xlab('Seconds') + ylab('frequency')

We also covered the mean, median, mfv and skewness of these time differences.

Here are 4 streams of random numbers that roughly fit the Poisson distribution.

Tweet time differences resemble something between the upper-right and lower-left graphs. Tweets behave like the Poisson distribution: the delays are positively skewed, with a long tail of outliers (a few large time differences) that makes the average higher than the median and the mfv.

If we assign parameters n=1000 and \(\lambda\)=10, we get a mean and variance of:

mean(rpois(1000,10)); var(rpois(1000,10))
## [1] 10.047
## [1] 9.861637

The Poisson distribution has a property that the variance equals the mean (“equidispersion”).
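The property can also be checked exactly from the probability mass function (dpois), truncating the sum far into the tail:

```r
k <- 0:200
m <- sum(k * dpois(k, 10))           # E[X]
v <- sum((k - m)^2 * dpois(k, 10))   # Var[X]
c(m, v)
## [1] 10 10
```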

We also saw the time difference (count and) probability.

library(ggplot2)

ggplot(time, aes(x=tranches, y=seconds)) +
  geom_bar(stat='identity', fill="white", colour="black") +
  xlab('tranche') + ylab('probability')

We can do the same with the random numbers (n = 1000 iterations, \(\lambda\)=30).

pseconds <- c(sum(rpois(1000,30) < 10) / 1000,
              sum(rpois(1000,30) < 30) / 1000,
              sum(rpois(1000,30) < 60) / 1000)

tranches <- factor(c('10s', '30s', '60s'))
tranches <- factor(tranches, levels = c('10s', '30s', '60s'))

time <- data.frame(tranches = tranches, seconds = pseconds)
library(ggplot2)

ggplot(time, aes(x=tranches, y=seconds)) +
  geom_bar(stat='identity', fill="white", colour="black") +
  xlab('tranche') + ylab('probability')

Or other tranches (n = 1000 iterations, \(\lambda\)=10).

pseconds <- c(sum(rpois(1000,10) < 5) / 1000,
              sum(rpois(1000,10) < 10) / 1000,
              sum(rpois(1000,10) < 20) / 1000)

tranches <- factor(c('5s', '10s', '20s'))
tranches <- factor(tranches, levels = c('5s', '10s', '20s'))

time <- data.frame(tranches = tranches, seconds = pseconds)
library(ggplot2)

ggplot(time, aes(x=tranches, y=seconds)) +
  geom_bar(stat='identity', fill="white", colour="black") +
  xlab('tranche') + ylab('probability')



Tweet prediction – In theory

According to the Poisson distribution, if the curve is steeper, the topic is more active, more popular, because the delays between tweets are shorter.

If we want to compare two different topics, the more popular is likely to have a higher posting rate.

We can now develop a test to compare two different topics to see which one is more popular (or at least which one has a higher posting rate).

The next function, DelayProbability, is similar to the above function, ArrivalProbability, but works with an unsorted list of delay times.

DelayProbability <- function(delays, increment, max)
{
  # Initialize an empty vector
  plist <- NULL
  # Probability is defined over the size of this sample
  # of arrival times
  delayLen <- length(delays)
  # May not be necessary, but checks for input mistake
  if (increment > max) {return(NULL)}
  for (i in seq(increment, max, by = increment))
  {
    # logical test <=i provides list of TRUEs and FALSEs
    # of length = timeLen, then sum() counts the TRUEs
    plist<-c(plist,(sum(delays <= i) / delayLen))
  }
  return(plist)
}

  • delays, an unsorted list of delays between arrivals (time differences), in seconds.
  • increment, the time increment for each new slot, e.g., 10s.
  • max, the highest time increment, e.g., 240s.
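A quick check of DelayProbability with a hypothetical vector of delays (2, 4, 6 and 12 seconds) and 5 s increments up to 15 s; the function definition above is assumed to be loaded:

```r
DelayProbability(c(2, 4, 6, 12), 5, 15)
## [1] 0.50 0.75 1.00
```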

Let’s simulate two Poisson distributions (as we would compare two topics posting rates) from 1s to 20s with 1s increments.

redf <- data.frame(colour = 'red',
                   time = seq(1,20,1),
                   probability = DelayProbability(rpois(100, 10), 1, 20))
greenf <- data.frame(colour = 'green',
                     time = seq(1,20,1),
                     probability = DelayProbability(rpois(100, 3), 1, 20))
redgreenf <- rbind(redf, greenf)
library(ggplot2)

ggplot(redgreenf, aes(x=time, y=probability, colour=colour)) +
  geom_point(shape=1, size=2) +
  xlab('seconds') + ylab('probability')

The green curve is steeper (its mean delay \(\lambda\) is lower): a higher posting rate. There is a more than 85% probability that the next green tweet will arrive within 5 s. The same probability for the red topic is less than 15%. The green topic is ‘hotter’.

However, this is just one sample. Let’s draw repeated random samples (here with \(\lambda\) = 10) to see the sampling variability.

par(mfrow = c(1,2))

plot(DelayProbability(rpois(100, 10), 1, 20), col = "green3", ylab = 'probability', xlab = 'seconds', main = '15 samples')
grid()
for (i in 1:15) {
  points(DelayProbability(rpois(100, 10), 1, 20), col = "green3")
}

plot(DelayProbability(rpois(100, 10), 1, 20), col = "green3", ylab = '', xlab = 'seconds', main = '100 samples')
for (i in 1:100) {
  points(DelayProbability(rpois(100, 10), 1, 20), col = "green3")
}
grid()

Under a Poisson distribution with \(\lambda\) = 10, the probability of a delay of 3 seconds or less between two tweets is:

ppois(3, lambda = 10)
## [1] 0.01033605
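
ppois(3, lambda = 10) is simply the Poisson cumulative distribution function, a sum of probability-mass terms \(e^{-\lambda}\lambda^k/k!\). A quick cross-check of the value, sketched in standard-library Python:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) for a Poisson(lam) random variable
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    # P(X <= k), the analogue of R's ppois(k, lambda = lam)
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

print(round(poisson_cdf(3, 10), 8))  # matches ppois(3, lambda = 10) ≈ 0.01033605
```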

The \(\lambda\) parameter is the mean of the distribution; since the mean and the median of a Poisson distribution are close, a vertical line at \(\lambda\) roughly splits the distribution into two halves.

greenf <- data.frame(colour = 'green',
                     time = seq(1,20,1),
                     probability = DelayProbability(rpois(100, 10), 1, 20))
par(mfrow = c(1,1))

library(ggplot2)

ggplot(greenf, aes(x=time, y=probability)) +
  geom_point(shape=1, size=2) +
  xlab('seconds') + ylab('probability') +
  geom_vline(xintercept=20/2, linetype='dashed', col='darkgray')

If we randomly simulate one very large sample…

mean(rpois(100000, 10)); var(rpois(100000, 10))
## [1] 10.00753
## [1] 10.04851

…the mean approaches the variance; both are close to \(\lambda\) = 10.
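
The same mean-equals-variance property can be reproduced with any correct Poisson generator. Below is a sketch using Knuth’s classic sampler (an assumption on my part; R’s rpois uses a different, faster algorithm):

```python
import math
import random

def rpois_knuth(lam, rng):
    # Knuth's algorithm: multiply uniform draws until the running product
    # falls below exp(-lam); the number of draws minus one is Poisson(lam)
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k - 1

rng = random.Random(42)
sample = [rpois_knuth(10, rng) for _ in range(100000)]
n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
print(mean, var)  # both close to lambda = 10
```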

Recall from above the probability that the next tweet will arrive within the next 3 s:

sum(rpois(100000, 10) <= 3) / 100000

# or simply
ppois(3, lambda = 10)
## [1] 0.01082
## [1] 0.01033605

The probability the next tweet will come in the next 10s?

ppois(10, lambda = 10)

# the other way around
qpois(0.58303, lambda = 10)

# in and out
qpois(ppois(10, lambda = 10), lambda=10)
## [1] 0.5830398
## [1] 10
## [1] 10
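
qpois is the generalized inverse of ppois: the smallest count k whose cumulative probability reaches the requested p. A minimal sketch of the ‘in and out’ round trip in standard-library Python:

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    # P(X <= k) for Poisson(lam), like R's ppois
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

def poisson_quantile(p, lam):
    # smallest k with P(X <= k) >= p, like R's qpois
    k = 0
    while poisson_cdf(k, lam) < p:
        k += 1
    return k

# round trip, as in the R snippet above
print(poisson_quantile(poisson_cdf(10, 10), lam=10))  # 10
```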

The simulated proportion and the theoretical probability are very close:

(58638 / 100000); ppois(10, lambda = 10)
## [1] 0.58638
## [1] 0.5830398

How much variation is there around one of these probabilities?

poisson.test(58638, 100000)$conf.int
## [1] 0.5816434 0.5911456
## attr(,"conf.level")
## [1] 0.95

The answer is in the confidence interval; the interesting parts are the lower and upper bounds of a 95% confidence interval. Across repeated samples, about 95% of the estimates would fall within such an interval.

# lower limit
poisson.test(58638, 100000)$conf.int[1]
poisson.test(58638, 100000)$conf.int[1] < ppois(10, lambda = 10)
## [1] 0.5816434
## [1] TRUE
# upper limit
poisson.test(58638, 100000)$conf.int[2]
poisson.test(58638, 100000)$conf.int[2] > ppois(10, lambda = 10)
## [1] 0.5911456
## [1] TRUE
# other samples, other intervals
poisson.test(5863, 10000)$conf.int
poisson.test(586, 1000)$conf.int
poisson.test(58, 100)$conf.int
## [1] 0.5713874 0.6015033
## attr(,"conf.level")
## [1] 0.95
## [1] 0.5395084 0.6354261
## attr(,"conf.level")
## [1] 0.95
## [1] 0.4404183 0.7497845
## attr(,"conf.level")
## [1] 0.95

The smaller the sample, the larger the confidence interval, and the greater the variability or imprecision.
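
The widening follows a \(1/\sqrt{n}\) law. poisson.test returns an exact interval; the normal (‘Wald’) approximation below is only an approximation of it, but it already shows the scaling across the four samples above:

```python
from math import sqrt

def approx_rate_ci(count, time_base, z=1.96):
    # Wald approximation to a 95% CI for a Poisson rate:
    # rate ± z * sqrt(count) / time_base
    # (poisson.test uses an exact interval; this is close only for large counts)
    rate = count / time_base
    half = z * sqrt(count) / time_base
    return rate - half, rate + half

for count, t in [(58638, 100000), (5863, 10000), (586, 1000), (58, 100)]:
    lo, hi = approx_rate_ci(count, t)
    print(f"time base = {t:>6}: ({lo:.4f}, {hi:.4f}), width = {hi - lo:.4f}")
```

Each tenfold reduction in sample size widens the interval by roughly \(\sqrt{10}\).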



Tweet prediction – In practice

Let’s apply this knowledge to a practical case. We want to compare two sets of arrival rates.

The next function grabs a number of tweets, maxTweets, on a topic, searchTerm, and converts the resulting list into a data frame.

TweetFrame <- function(searchTerm, maxTweets)
{
  twtList <-
    searchTwitter(searchTerm, n = maxTweets)
  return(do.call("rbind",
                 lapply(twtList, as.data.frame)))
}

What are the current hot trends around Montréal?

# coordinates near one of the entrances to the Jacques-Cartier Bridge
Mtl_woeid <- closestTrendLocations(45.522660, -73.546428)
Mtl_woeid
##       name country woeid
## 1 Montreal  Canada  3534
Mtl_trends <- getTrends(as.numeric(Mtl_woeid[3]))

# top names
top <- 10
Mtl_trends[1:top, 'name']
##  [1] "Dolores O'Riordan" "#MLKDay"           "#BlueMonday"      
##  [4] "Andrew Shaw"       "Kevin Glenn"       "#NationalHatDay"  
##  [7] "#iaccmm"           "Young Canadians"   "#TOTY"            
## [10] "Toronto Police"
# pick two distinct trends at random
hot_two <- sample(top, 2)

# extract the name
hot_two_1 <- Mtl_trends[hot_two[1], 'name']
hot_two_2 <- Mtl_trends[hot_two[2], 'name']
hot_two_1; hot_two_2
## [1] "Andrew Shaw"
## [1] "Dolores O'Riordan"

Let’s query Twitter for the two hot topics. We want to compare them.

no_tweets <- 500
hot_two_1_DF <- TweetFrame(hot_two_1, no_tweets)
hot_two_2_DF <- TweetFrame(hot_two_2, no_tweets)
hot_two_1_DF[1:2, 'text']
## [1] "RT @cultureoflosing: With both Andrew Shaw and now Logan Shaw, the Habs should acquire Henrik Sedin and play them all on the same like for…" 
## [2] "RT @StuCowan1: My updated Game Day Notebook setting up tonight's matchup between the #Habs and #Islanders at the Bell Centre (7:30 p.m., TS…"
hot_two_2_DF[1:2, 'text']
## [1] "RT @whereisMUNA: gone too soon. \nrest in power to one of our biggest musical inspirations, dolores o'riordan. \nthank you. https://t.co/Zghv…"
## [2] "RT @CNNEE: Muere la cantante de The Cranberries Dolores O'Riordan\nhttps://t.co/3B0Gu3qp77"

We sort the data frames by the time the tweets were posted ($created).

sort_hot_two_1_DF <- hot_two_1_DF[order(as.integer(hot_two_1_DF$created)), ]
sort_hot_two_2_DF <- hot_two_2_DF[order(as.integer(hot_two_2_DF$created)), ]
sort_hot_two_1_DF[1:2, 'text']
## [1] "RT @DanyAllstar15: I can’t stand even looking at Andrew Shaw. Like yeah, I’m definitely a huge loser but wow is this guy a fucking bone job…"
## [2] "RT @DanyAllstar15: I can’t stand even looking at Andrew Shaw. Like yeah, I’m definitely a huge loser but wow is this guy a fucking bone job…"
sort_hot_two_2_DF[1:2, 'text']
## [1] "RT @FoxNews: Dolores O'Riordan, beloved Cranberries singer, dies suddenly at 46 https://t.co/8hCrXA2rU5"
## [2] "Dolores O'Riordan, Cranberries lead singer, dead at 46 https://t.co/gfaElr7ecT"

We extract two vectors of time differences.

sort_hot_two_1_delays <- as.integer(diff(sort_hot_two_1_DF$created))
sort_hot_two_2_delays <- as.integer(diff(sort_hot_two_2_DF$created))

# data class
class(sort_hot_two_1_delays)
# first 10; measured in seconds
sort_hot_two_1_delays[1:10]
## [1] "integer"
##  [1] 299 177  88 141  40 136  48 123 157 423
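
The extraction is simply a pairwise difference of sorted timestamps. A self-contained sketch of the same operation, with hypothetical timestamps (not the scraped data):

```python
from datetime import datetime

# hypothetical 'created' timestamps, already sorted ascending
created = [
    datetime(2017, 9, 4, 12, 0, 0),
    datetime(2017, 9, 4, 12, 0, 40),
    datetime(2017, 9, 4, 12, 2, 16),
    datetime(2017, 9, 4, 12, 2, 20),
]

# analogue of as.integer(diff(sorted$created)): delays in whole seconds
delays = [int((b - a).total_seconds()) for a, b in zip(created, created[1:])]
print(delays)  # [40, 96, 4]
```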

Descriptive statistics

We compute descriptive statistics.

mean(sort_hot_two_1_delays); mean(sort_hot_two_2_delays) # seconds
## [1] 288.3347
## [1] 0.04408818
median(sort_hot_two_1_delays); median(sort_hot_two_2_delays) # seconds
## [1] 16
## [1] 0
mfv(sort_hot_two_1_delays); mfv(sort_hot_two_2_delays)  # most frequent value in seconds (mfv comes from the modeest package)
## [1] 1
## [1] 0
sqrt(var(sort_hot_two_1_delays)); sqrt(var(sort_hot_two_2_delays)) # std deviation in seconds
## [1] 1193.898
## [1] 0.205497
sort_hot_two_1_delays_count <- sum(sort_hot_two_1_delays <= 30)
sort_hot_two_2_delays_count <- sum(sort_hot_two_2_delays <= 30)
sort_hot_two_1_delays_count; sort_hot_two_2_delays_count # tweets count under 30s
## [1] 307
## [1] 499
sort_hot_two_1_delays_prob <- sum(sort_hot_two_1_delays <= 30) / no_tweets
sort_hot_two_2_delays_prob <- sum(sort_hot_two_2_delays <= 30) / no_tweets 
sort_hot_two_1_delays_prob; sort_hot_two_2_delays_prob # probability the next tweet is in 30s or less
## [1] 0.614
## [1] 0.998
# topic 1 confidence interval
sort_hot_two_1_delays_CI <- poisson.test(sort_hot_two_1_delays_count, no_tweets)$conf.int 
sort_hot_two_1_delays_CI[1]; sort_hot_two_1_delays_prob; sort_hot_two_1_delays_CI[2]
## [1] 0.5472308
## [1] 0.614
## [1] 0.6866687
stripchart(c(sort_hot_two_1_delays_CI[1],
             sort_hot_two_1_delays_prob,
             sort_hot_two_1_delays_CI[2]),
             main = 'topic 1 confidence interval', xlab = 'lower-mean-upper')
grid(ny = NA)

# topic 2 confidence interval
sort_hot_two_2_delays_CI <- poisson.test(sort_hot_two_2_delays_count, no_tweets)$conf.int 
sort_hot_two_2_delays_CI[1]; sort_hot_two_2_delays_prob; sort_hot_two_2_delays_CI[2]
## [1] 0.9123449
## [1] 0.998
## [1] 1.089531
stripchart(c(sort_hot_two_2_delays_CI[1],
             sort_hot_two_2_delays_prob,
             sort_hot_two_2_delays_CI[2]),
             main = 'topic 2 confidence interval', xlab = 'lower-mean-upper')
grid(ny = NA)

Let’s visualize all this.

sort_hot_two_delays_CI <- data.frame(topic = factor(c(1,1,1,2,2,2)),
                                     confint = c(sort_hot_two_1_delays_CI, sort_hot_two_1_delays_prob, sort_hot_two_2_delays_CI, sort_hot_two_2_delays_prob))

sort_hot_two_delays_CI
##   topic   confint
## 1     1 0.5472308
## 2     1 0.6866687
## 3     1 0.6140000
## 4     2 0.9123449
## 5     2 1.0895309
## 6     2 0.9980000
library(ggplot2)

ggplot(sort_hot_two_delays_CI, aes(x=topic, y=confint)) +
  geom_boxplot() +
  xlab('topic') + ylab('confidence interval')

Poisson test

The test is a comparison of two Poisson rates. We test the null hypothesis that the rate for topic 1 equals the rate for topic 2 (rate ratio = 1).

poisson.test(c(sort_hot_two_1_delays_count,
               sort_hot_two_2_delays_count),
             c(no_tweets,
               no_tweets))
## 
##  Comparison of Poisson rates
## 
## data:  c(sort_hot_two_1_delays_count, sort_hot_two_2_delays_count) time base: c(no_tweets, no_tweets)
## count1 = 307, expected count1 = 403, p-value = 1.388e-11
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.5319483 0.7106438
## sample estimates:
## rate ratio 
##  0.6152305
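
When the two observation windows are equal, the comparison of Poisson rates reduces to an exact binomial test: conditional on the total count, count1 follows a Binomial(total, 1/2) distribution under the null. A standard-library Python sketch of the symmetric two-sided p-value, using the counts 307 and 499 from above:

```python
from math import comb

def poisson_rate_pvalue(count1, count2):
    # Under H0 (equal rates, equal time bases), count1 | total ~ Binomial(total, 0.5).
    # By symmetry, the two-sided p-value is 2 * P(X <= min(count1, count2)).
    total = count1 + count2
    k = min(count1, count2)
    tail = sum(comb(total, i) for i in range(k + 1)) * 0.5 ** total
    return min(1.0, 2 * tail)

print(poisson_rate_pvalue(307, 499))  # tiny, on the order of 1e-11
```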

The p-value is far below 5%. The test rejects the null hypothesis and accepts the alternative hypothesis of unequal rates (as the boxplot above suggests).