Exploring Polling Data in R

Reading the polls

What types of conclusions can we derive from the polls?

Election polls are surveys designed to estimate voter preferences and generally rely on a large, representative sample of a given population. However, polling data can often be biased, unrepresentative of the population, or missing important variables related to election outcomes (for example, whether someone will actually vote).

Even the best polls are only an approximation of voter preferences. A poll that was taken three months, three weeks, or even three hours before the election still can’t tell you exactly how someone will behave in the voting booth.

The plot below shows the percentage of voters supporting each candidate across all states in the 2016 Republican primaries.

Donald Trump is generally the most popular candidate, but may lag behind other candidates in certain states.

Looking under the hood

The data are a sample of polls from both the Republican and Democratic primaries.

# Import the data (readWorksheetFromFile() comes from the XLConnect package)
library(XLConnect)
polls <- readWorksheetFromFile('polls.xls', sheet = 'polls', header = TRUE, startCol = 1, startRow = 1)

# Check out the structure of polls
str(polls)

## 'data.frame':    1396 obs. of  14 variables:
##  $ location         : chr  "IA" "IA" "IA" "IA" ...
##  $ pollster_partisan: chr  "Gravis Marketing" "Iowa State University" "Loras College" "CNN/Opinion Research Corp." ...
##  $ polldate         : POSIXct, format: "2016-01-12" "2016-01-14" ...
##  $ samplesize       : num  461 356 500 280 570 258 423 490 606 400 ...
##  $ margin_poll      : num  21 2.4 29 -8 9 9 6 -1 -4 -3 ...
##  $ electiondate     : POSIXct, format: "2016-02-01" "2016-02-01" ...
##  $ cand1_actual     : num  49.8 49.8 49.8 49.8 49.8 ...
##  $ cand2_actual     : num  49.6 49.6 49.6 49.6 49.6 ...
##  $ margin_actual    : num  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ...
##  $ error            : num  20.8 2.2 28.8 8.2 8.8 8.8 5.8 1.2 4.2 3.2 ...
##  $ rightcall        : num  1 1 1 0 1 1 1 0 0 0 ...
##  $ candidate        : chr  "clinton" "clinton" "clinton" "clinton" ...
##  $ percent          : chr  "57" "47,4" "59" "43" ...
##  $ type             : chr  "dem" "dem" "dem" "dem" ...
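
The source does not document the columns, but the ten rows printed by str() are consistent with error being the absolute gap between the polled and actual margins, and with rightcall flagging polls whose margin has the same sign as the actual result. A minimal sketch of that check, using only the values shown above (the names error_check and rightcall_check are ours, and the definitions are an inference rather than a documented fact):

```r
# First ten values of margin_poll and margin_actual, copied from the str() output
margin_poll   <- c(21, 2.4, 29, -8, 9, 9, 6, -1, -4, -3)
margin_actual <- rep(0.2, 10)

# Assumed definition: absolute difference between polled and actual margin
error_check <- abs(margin_poll - margin_actual)

# Assumed definition: 1 when the poll put the eventual winner ahead
rightcall_check <- as.numeric(sign(margin_poll) == sign(margin_actual))

error_check
rightcall_check
```

Both vectors match the error and rightcall values printed above for these rows.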

# Clean
polls$polldate <- as.Date(polls$polldate)
polls$candidate <- as.factor(polls$candidate)
# percent is stored as text and some values use a decimal comma (e.g. "47,4")
polls$percent <- as.numeric(sub(',', '.', polls$percent, fixed = TRUE))
polls$type <- as.factor(polls$type)
polls$location <- as.factor(polls$location)
# The source column is pollster_partisan; name it in full rather than rely on $ partial matching
polls$pollster <- as.factor(polls$pollster_partisan)
polls$samplesize <- as.integer(polls$samplesize)
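
The str() output shows percent stored as character, with at least one value written with a decimal comma ("47,4"). as.numeric() turns such strings into NA, so it is worth seeing the difference on a small vector. A minimal sketch, using the four values printed above (raw_percent and clean_percent are illustrative names):

```r
# The first four percent values as they appear in the str() output
raw_percent <- c('57', '47,4', '59', '43')

# Plain conversion: the decimal-comma value becomes NA (warning suppressed here)
naive_percent <- suppressWarnings(as.numeric(raw_percent))

# Replace the comma with a period before converting
clean_percent <- as.numeric(sub(',', '.', raw_percent, fixed = TRUE))

naive_percent
clean_percent
```

This assumes the comma is only ever a decimal separator in this column; a thousands separator would need different handling.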

Democrats

# Select polls for the Democratic primaries only: dem_polls
dem_polls <- subset(polls, polls$type == 'dem')

# Using dem_polls, plot polldate on the x-axis, percent (i.e. percent of voters supporting a candidate) on the y-axis, and set the color using candidate
plot(dem_polls$polldate, dem_polls$percent, col =  dem_polls$candidate, xlab = '2016', ylab = '%', main = 'Democrats - Polls')

Visualizing polling trends

Use the ggplot2 package to produce a more intuitive plot of Democratic primary polls.

# Load the ggplot2 package
library(ggplot2)

# Create a plot of Democratic candidate support (percent) over time (polldate), setting color to candidate
dem_plot <- ggplot(data = subset(polls, polls$type == 'dem'), aes(y = percent, x = polldate, col = candidate)) + 
  geom_point(alpha = 0.5) 

# View your new plot
dem_plot

# Add a trend line for each candidate using geom_smooth()
dem_plot + 
  geom_smooth(span = 0.5, se = FALSE)

The new plot of Democratic primary polls is crisp and easy to understand.
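
The trend lines depend on the span argument. When no method is given, geom_smooth() falls back to a loess fit for groups with fewer than 1,000 observations, and span sets the fraction of points used in each local fit. A minimal sketch of that effect with base R's loess() on synthetic data (the series below is made up and is not the polling data):

```r
set.seed(1)

# Synthetic "support over time" series: a slow wave plus noise
day     <- 1:60
support <- 50 + 10 * sin(day / 10) + rnorm(60, sd = 3)

# A small span follows local wiggles; a large span gives a smoother curve
fit_local  <- loess(support ~ day, span = 0.3)
fit_global <- loess(support ~ day, span = 0.9)

plot(day, support, xlab = 'Day', ylab = '%', main = 'Effect of span')
lines(day, fitted(fit_local), col = 'red')
lines(day, fitted(fit_global), col = 'blue')
```

A span of 0.5, as used above, sits between these two extremes.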

Improving poll data quality

The plot shows support for Hillary Clinton and Bernie Sanders over the course of the 2016 Democratic primaries.

Before we draw any conclusions from this plot, we’ll want to address issues of data quality. Even the best polls can suffer from inaccuracy caused by low sample size or pollster bias.
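
To see why sample size matters, consider the textbook 95% margin of error for a proportion. The sketch below assumes a simple random sample, which real polls only approximate, and the function name margin_of_error is ours:

```r
# 95% margin of error for a proportion p estimated from n respondents
# (assumption: simple random sampling, no weighting or design effects)
margin_of_error <- function(p, n, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

# Worst case is p = 0.5; compare a small, a medium, and a large poll
sizes <- c(115, 400, 1000)
moe   <- margin_of_error(0.5, sizes)

data.frame(samplesize = sizes, moe_pct_points = round(100 * moe, 1))
```

Under these assumptions a poll of 400 people carries a margin of roughly 4.9 percentage points, and the margin shrinks only with the square root of the sample size.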

# Summarize samplesize (contained in the polls data frame)
summary(polls$samplesize)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   115.0   403.0   502.5   680.5   721.0 14201.0

# Create a histogram showing the distribution of polls with a sample size below 1000, with breaks every 100
hist(polls$samplesize[polls$samplesize < 1000], breaks = seq(0, 1000, by = 100))

# Create a new data frame that contains only polls with a sample size greater than 400: polls2
polls2 <- polls[polls$samplesize > 400, ]

Dropping the smallest polls reduces one source of noise in the data. With polls2, we are ready to produce more reliable plots.

Visualizing the refined data

Next, we redraw the plots using polls2, this time for both parties.

To view two plots at once, we’ll need to use the grid.arrange() command from the gridExtra package.

library(gridExtra)

# Recreate your plot of the Democratic polling data using polls2: dem_plot2
dem_plot2 <- ggplot(data = subset(polls2, polls2$type == 'dem'), aes(y = percent, x = polldate, col = candidate)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(span = 0.5, se = FALSE, na.rm = TRUE)

# Create a similar plot for the Republican primaries: rep_plot2
rep_plot2 <- ggplot(data = subset(polls2, polls2$type == 'rep'), aes(y = percent, x = polldate, col = candidate)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(span = 0.5, se = FALSE, na.rm = TRUE)

# View both plots together
grid.arrange(dem_plot2, rep_plot2, nrow = 2)

These plots, built from larger samples, provide a better basis for drawing conclusions about the popularity of each candidate.

The fifty-state strategy

During the primaries, campaigns tend to focus on winning individual states, rather than maintaining popularity across the country. Instead of averaging across all states, it might make sense to view polling data state-by-state.
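
One simple way to look at the data state by state is to average support per candidate within each state. A minimal sketch with base R's aggregate(), on made-up numbers that only mimic the column names of the polls data (toy_state_polls and state_means are illustrative names):

```r
# Made-up polls: two states, two polls per candidate in each
toy_state_polls <- data.frame(
  location  = c('IA', 'IA', 'IA', 'IA', 'NH', 'NH', 'NH', 'NH'),
  candidate = c('trump', 'cruz', 'trump', 'cruz', 'trump', 'kasich', 'trump', 'kasich'),
  percent   = c(28, 26, 30, 28, 33, 14, 35, 16)
)

# Mean support for each candidate within each state
state_means <- aggregate(percent ~ location + candidate, data = toy_state_polls, FUN = mean)
state_means
```

A national average would hide the fact that the runner-up differs between the two states in this toy example.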

Compare polls in the Republican primaries across three important early states: Iowa, New Hampshire, and South Carolina. This time, we’ll keep the confidence intervals to help see how the candidates compare.

# Create a plot that includes only polling data for Republicans in Iowa: ia_plot
ia_plot <- ggplot(data = subset(polls2, polls2$type == 'rep' & polls2$location == 'IA'), aes(y = percent, x = polldate, col = candidate)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(span = 0.7, na.rm = TRUE) +
  labs(title = 'Iowa')

# Create another plot that includes only polling data for Republicans in New Hampshire: nh_plot
nh_plot <- ggplot(data = subset(polls2, polls2$type == 'rep' & polls2$location == 'NH'), aes(y = percent, x = polldate, col = candidate)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(span = 0.7, na.rm = TRUE) +
  labs(title = 'New Hampshire')

# Create another plot that includes only polling data for Republicans in South Carolina: sc_plot
sc_plot <- ggplot(data = subset(polls2, polls2$type == 'rep' & polls2$location == 'SC'), aes(y = percent, x = polldate, col = candidate)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  geom_smooth(span = 0.7, na.rm = TRUE) +
  labs(title = 'South Carolina')

# Take a look at all three plots together
grid.arrange(ia_plot, nh_plot, sc_plot, nrow = 3)

It looks like Iowa was a tight race, but Donald Trump had large leads in both New Hampshire and South Carolina.
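
The three state plots differ only in the state code and the title, so the shared code could be wrapped in a small function. The sketch below is one possible refactoring, not part of the original analysis; state_plot() and demo_polls are hypothetical names, and the demo rows are made up.

```r
library(ggplot2)

# Hypothetical helper: smoothed poll plot for one party in one state
state_plot <- function(data, state, title, party = 'rep') {
  # which() drops rows where type or location is NA
  rows <- which(data$type == party & data$location == state)
  ggplot(data = data[rows, ], aes(x = polldate, y = percent, col = candidate)) +
    geom_point(alpha = 0.5, na.rm = TRUE) +
    geom_smooth(span = 0.7, na.rm = TRUE) +
    labs(title = title)
}

# Made-up rows with the same column names as polls2, only to show the call;
# they are too few to smooth, so the plot is built here but not drawn
demo_polls <- data.frame(
  type      = 'rep',
  location  = rep(c('IA', 'NH'), each = 4),
  candidate = rep(c('trump', 'cruz'), times = 4),
  polldate  = as.Date('2016-01-04') + rep(c(0, 0, 7, 7), times = 2),
  percent   = c(28, 26, 30, 28, 33, 12, 35, 11)
)

ia_demo <- state_plot(demo_polls, 'IA', 'Iowa')
```

With the real data, state_plot(polls2, 'IA', 'Iowa') would stand in for the longer ia_plot definition, and likewise for 'NH' and 'SC'.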

Stronger predictions

Donald Trump is not certain to win New Hampshire, because Kasich is trending upward.
Marco Rubio has a chance of winning South Carolina, because he is also trending upward.
Always keep the confidence intervals in mind: if the upper bound for a trailing candidate reaches the lower bound for the leading candidate, we cannot conclude that the leader has a clear advantage.
For the same reason, we can’t be sure who will win Iowa.
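
The overlap rule can be checked numerically. The sketch below uses a hypothetical single poll and the textbook interval for a proportion under simple random sampling. The shaded bands drawn by geom_smooth() are confidence bands for the smoothed trend, which is a different quantity, so this only illustrates the logic.

```r
# Hypothetical poll: 500 respondents, leader at 35%, challenger at 30%
n          <- 500
leader     <- 0.35
challenger <- 0.30

# 95% interval for one proportion (assumption: simple random sample)
ci <- function(p, n, z = 1.96) {
  half_width <- z * sqrt(p * (1 - p) / n)
  c(lower = p - half_width, upper = p + half_width)
}

leader_ci     <- ci(leader, n)
challenger_ci <- ci(challenger, n)

# TRUE means the challenger's upper bound reaches the leader's lower bound
intervals_overlap <- challenger_ci[['upper']] >= leader_ci[['lower']]
intervals_overlap
```

Here a five-point lead in a poll of 500 is not enough to separate the two intervals.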

Mapping polling data

A valuable way to visualize state-to-state variation is to attach polling data to a map of the United States.

The maps package contains coordinates for important geographic and political units worldwide which R can use to generate maps. Before you can generate a map from your data, you’ll need to merge your polling data with geographic information for each state.
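
The merge step is easier to follow on a tiny example. The sketch below uses made-up stand-ins for the state outline data and the polling data (toy_outline, toy_support, and all numbers are illustrative). It shows two things: all = TRUE keeps states that have no poll, and merge() sorts by the key, so the rows must be put back in drawing order.

```r
# Stand-in for the state outlines: two polygon vertices per state, in drawing order
toy_outline <- data.frame(
  region = c('iowa', 'iowa', 'vermont', 'vermont', 'ohio', 'ohio'),
  order  = 1:6,
  long   = c(-96, -91, -73, -72, -84, -81),
  lat    = c(42, 42, 44, 44, 40, 40)
)

# Stand-in for the polling data: no poll for Ohio (made-up percentages)
toy_support <- data.frame(region = c('vermont', 'iowa'), percent = c(80, 45))

# all = TRUE keeps Ohio, with percent set to NA
toy_map <- merge(toy_outline, toy_support, by = 'region', all = TRUE)

# merge() returns rows sorted by region; restore the polygon drawing order
toy_map <- toy_map[order(toy_map$order), ]
toy_map
```

Without the reordering, geom_polygon() would connect vertices in the wrong sequence and the outlines would be scrambled.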

# Import the data
sanders <- readWorksheetFromFile('polls.xls', sheet = 'sanders', header = TRUE, startCol = 1, startRow = 1)

# Clean
sanders$pollno <- as.integer(sanders$pollno)
sanders$race <- as.factor(sanders$race)
sanders$type_detail <- as.factor(sanders$type_detail)
sanders$pollster <- as.factor(sanders$pollster)
sanders$location <- as.factor(sanders$location)
sanders$partisan <- as.logical(sanders$partisan)
sanders$samplesize <- as.integer(sanders$samplesize)
sanders$cand3_pct <- as.integer(sanders$cand3_pct)
sanders$bias <- as.logical(sanders$bias)
sanders$rightcall <- as.integer(sanders$rightcall)
sanders$comment <- as.logical(sanders$comment)
sanders$region <- as.factor(sanders$region)

# Check out the structure of sanders
str(sanders)

## 'data.frame':    40 obs. of  21 variables:
##  $ pollno       : int  15380942 15380981 15381005 15381018 15381032 15381037 15381069 15381071 15381072 15381113 ...
##  $ race         : Factor w/ 35 levels "2016_Pres-D_AL",..: 9 20 2 33 4 21 12 28 28 29 ...
##  $ location     : Factor w/ 36 levels "AL","AR","AZ",..: 10 21 2 34 5 22 13 29 29 30 ...
##  $ type_detail  : Factor w/ 1 level "Pres-D": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pollster     : Factor w/ 19 levels "American Research Group",..: 5 1 16 16 18 15 6 2 5 17 ...
##  $ partisan     : logi  NA NA NA NA NA NA ...
##  $ polldate     : POSIXct, format: "2016-01-30" "2016-02-08" ...
##  $ samplesize   : int  300 409 525 693 1144 305 123 650 266 533 ...
##  $ cand3_pct    : int  4 NA NA NA NA NA NA NA NA NA ...
##  $ margin_poll  : num  8 9 25 76 5 -2 -10 50 23.6 17 ...
##  $ electiondate : POSIXct, format: "2016-02-01" "2016-02-09" ...
##  $ cand1_actual : num  49.8 60.1 66.1 86 59 ...
##  $ cand2_actual : num  49.6 37.7 30 13.6 40.3 ...
##  $ margin_actual: num  0.2 22.5 36.1 72.3 18.7 ...
##  $ error        : num  7.8 13.48 11.11 3.65 13.68 ...
##  $ bias         : logi  NA NA NA NA NA NA ...
##  $ rightcall    : int  1 1 1 1 1 0 0 1 1 1 ...
##  $ comment      : logi  NA NA NA NA NA NA ...
##  $ candidate    : chr  "sanders" "sanders" "sanders" "sanders" ...
##  $ percent      : num  43 53 32 86 49 49 23 14 36.5 37 ...
##  $ region       : Factor w/ 36 levels "alabama","arizona",..: 12 21 3 33 5 20 13 29 29 30 ...

# Load the ggthemes package
library(ggthemes)

# Load the maps package
library(maps)
states <- map_data('state')

# Merge states and sanders by region: sanders_map
sanders_map <- merge(states, sanders, by = 'region', all = TRUE)

# Reorder Sanders map according to the maps package order column
sanders_map <- sanders_map[order(sanders_map$order), ]

# Use ggplot() to produce a map of Bernie Sanders polling data
ggplot() +
  geom_polygon(data = sanders_map, aes(x = long, y = lat, group = group, fill = percent)) +
  labs(title = 'Support for Bernie Sanders at Last Poll Before Primary') + 
  theme_map()

It looks like Bernie Sanders was polling very well before the Vermont primary. No surprise there. Sanders’ polling numbers across the South are much lower.

Interactive maps using googleVis

Attaching polling data to a map allows for easy and intuitive identification of regional trends for each candidate.

However, a color gradient alone makes it difficult to identify specific polling numbers in each state. Ideally, you want your map to display general trends and specific information without being too cluttered. To accomplish this, you’ll use the googleVis package to generate an interactive map from your Sanders polling data.

The googleVis package allows you to create interactive charts and maps in R by providing a direct interface to the Google Charts API. Unlike the maps package, maps in googleVis do not require you to attach geographic coordinates to your data as long as they contain relevant geographic names (in this case, states). The gvisGeoChart command requires you to specify your data, a location variable, and a color variable.

# Load the googleVis package
library(googleVis)

# Generate a googleVis map object: sanders_gvis
sanders_gvis <- gvisGeoChart(data = sanders, locationvar = 'location', colorvar = 'percent', options = list(region = 'US', displayMode = 'regions', resolution = 'provinces'))

# Plot your googleVis object
plot(sanders_gvis)

The interactive map can only be rendered online. This is a .gif capture of it: