Foreword



The attendance-membership models

The dataset

Import the dataset from this source.

head(attendance)
## # A tibble: 6 x 8
##      MCG  Other  Temp Members Top50    Date  Home  Away
##    <dbl>  <dbl> <int>   <dbl> <int>   <chr> <chr> <chr>
## 1  8.653 72.921    24  12.601     5 27/3/93    NM  Bris
## 2 49.856 60.848    21  25.991     7  3/4/93   Ess  Carl
## 3 24.362 59.842    24  16.948     5 17/4/93    NM  Melb
## 4 46.588  9.272    22  27.046     8  1/5/93   Ess   Gee
## 5 29.296 74.798    17  22.874     7  8/5/93  Rich   StK
## 6 34.372 61.938    16  51.646     8 22/5/93   Ess  Adel
  • MCG, Melbourne Cricket Ground, attendance at the MCG in thousands.
  • Temp, temperature (C). The forecast maximum temperature on the day of the match.
  • Other, attendance at other matches in thousands. The sum of the attendances at other AFL matches in Melbourne and Geelong on the same day as the match in question.
  • Members, membership. The sum of the memberships of the two clubs playing the match in question in thousands.
  • Top50, number of players from the top fifty. The number of players in the top 50 in the AFL who happened to be playing in the match in question.
  • Date, date of the match in the format dd/mm/yy.
  • Home, abbreviation for home club.
  • Away, abbreviation for away club.

The clubs

Abbr. City Abbr. City
NM North Melbourne Haw Hawthorn
Ess Essendon Foot Footscray
Carl Carlton Coll Collingwood
Melb Melbourne Bris Brisbane
Gee Geelong Adel Adelaide
Rich Richmond Syd Sydney
StK St Kilda WC West Coast

WC (for West Coast) refers to Perth; the two names are used interchangeably.

A first exploration

Draw a scatterplot.

plot(attendance$MCG ~ attendance$Members, xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = 'blue3', las = 1)

There is a trend in the observations, but we notice outliers on the right side.

summary(attendance$Members)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   20.58   25.99   26.57   31.94   51.65

Indeed, the maximum is far above the third quartile.
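As a quick check, a common rule of thumb flags values above Q3 + 1.5 × IQR as outliers; here is a minimal sketch applied to the Members variable:

```r
# Rule-of-thumb threshold: values above Q3 + 1.5 * IQR are flagged
q <- quantile(attendance$Members, c(0.25, 0.75))
threshold <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Memberships beyond the threshold
attendance$Members[attendance$Members > threshold]
```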

Outliers

Extract the outliers.

# Sort, descending
high_memberships <- sort(attendance$Members, decreasing = TRUE)

# Count
length(high_memberships)
## [1] 41
# Extract the top 2 values
high_memberships[1:2]
## [1] 51.646 43.253
# Create new variables
attendance_high_memberships <- subset(attendance, attendance$Members >= high_memberships[2])

Redraw the scatterplot with the two outliers.

plot(MCG ~ Members, data = attendance, xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = 'gray1', las = 1)

points(attendance_high_memberships$MCG ~ attendance_high_memberships$Members, pch = 16, col = 'blue3')

The models

We want to build a model in which club membership is the main driver of match attendance: no members, no attendance.

This rationale leads us to remove the intercept, i.e., to force the regression through the origin.

Model 1

Run a regression with the intercept at the origin (0).

attendance_model1 <- lm(MCG ~ 0 + Members, data = attendance)

# Alternatively
# MCG ~ Members - 1

summary(attendance_model1)
## 
## Call:
## lm(formula = MCG ~ 0 + Members, data = attendance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.961  -6.550   2.954   9.931  29.252 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## Members  1.53610    0.08768   17.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.65 on 40 degrees of freedom
## Multiple R-squared:  0.8847, Adjusted R-squared:  0.8818 
## F-statistic: 306.9 on 1 and 40 DF,  p-value: < 2.2e-16

The \(R^2\) is good and the p-value is significant at the 1 % level.

The coefficient is not exactly 1, but rather 1.5361. Membership boosts attendance, but it does not seem to be the sole driver, since the model explains only 88 % of the variance.

Plot the diagnostics.

par(mfrow = c(2,2))
plot(attendance_model1)

par(mfrow = c(1,1))

In the top-left and bottom-right plots, we can see the effect of outliers on residuals.

Plot the data with the regression line.

plot(attendance$MCG ~ attendance$Members, xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = 'blue3', xlim = c(0,60), ylim = c(0,100), las = 1)

abline(attendance_model1, col = 'red3', lwd = 2, lty = 3)

Now, for every 1,000 additional members, we can expect attendance to increase by about 1,536 people (both variables are in thousands).
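To make this concrete, we can compare predictions one thousand members apart (the membership values below are illustrative):

```r
# Predicted MCG attendance (thousands) for two hypothetical memberships
new_members <- data.frame(Members = c(20, 21))
predict(attendance_model1, newdata = new_members)

# The gap between the two predictions is the slope, about 1.536 (thousand)
```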

Geographic Bias

The Commonwealth of Australia is a federation of six states (New South Wales, Queensland, South Australia, Tasmania, Victoria, and Western Australia) and federal territories (the capital, maritime zones, islands, Antarctic claims).

library(ggmap)

Australia <- get_map(location = c(lon = 133, lat = -27), zoom = 4, maptype = 'terrain')

plot(Australia)

Most AFL clubs, such as Melbourne, are located in the state of Victoria (in the south-east), facing the island of Tasmania (the picture below displays Greater Melbourne).

The supporter base around Melbourne can easily access the Melbourne Cricket Ground (MCG). The other clubs are from other states. Supporters need to make an interstate journey to see their club play at the MCG.

For example, a trip from Sydney (in New South Wales, the adjacent state to the north) to Melbourne takes around eight hours by car or two by plane. From Melbourne, Perth (on the west coast) is close to five hours by air and two time zones away. In all, four clubs are interstate: Brisbane (north-east), Sydney, Adelaide (south-central), and West Coast (‘Bris’, ‘Syd’, ‘Adel’, and ‘WC’).

Differentiate the interstate clubs

We add a dummy variable to the attendance dataset.

# | means OR
# WC = Perth
attendance$away_inter <- ifelse(attendance$Away == 'WC' |
  attendance$Away == 'Adel' |
  attendance$Away == 'Syd' |
  attendance$Away == 'Bris',
  1, 0)

head(attendance, 3)
## # A tibble: 3 x 9
##      MCG  Other  Temp Members Top50    Date  Home  Away away_inter
##    <dbl>  <dbl> <int>   <dbl> <int>   <chr> <chr> <chr>      <dbl>
## 1  8.653 72.921    24  12.601     5 27/3/93    NM  Bris          1
## 2 49.856 60.848    21  25.991     7  3/4/93   Ess  Carl          0
## 3 24.362 59.842    24  16.948     5 17/4/93    NM  Melb          0

The code above checks the values in the variable labeled Away. If it finds an exact match with an interstate club, it enters a 1 in the new variable away_inter, otherwise a 0.
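As an aside, the chain of ORs can be written more compactly with the %in% operator, which gives the same result:

```r
# %in% tests membership in a set of values
attendance$away_inter <- ifelse(attendance$Away %in% c('WC', 'Adel', 'Syd', 'Bris'), 1, 0)
```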

Model 2

Run a new regression with the dummy variable.

attendance_model2 <- lm(MCG ~ 0 + Members + away_inter, data = attendance)

summary(attendance_model2)
## 
## Call:
## lm(formula = MCG ~ 0 + Members + away_inter, data = attendance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.2003  -8.5061   0.0862   8.5411  23.5687 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## Members      1.69255    0.07962  21.257  < 2e-16 ***
## away_inter -22.84122    5.02583  -4.545  5.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.82 on 39 degrees of freedom
## Multiple R-squared:  0.9246, Adjusted R-squared:  0.9208 
## F-statistic: 239.2 on 2 and 39 DF,  p-value: < 2.2e-16

The \(R^2\) is improved and the p-values are significant at the 1 % level.

The new dummy variable distinguishes two subsets in the dataset: one subset predicting attendance when fans have a moderate car ride, and another when fans are farther away (an interstate journey by car or plane). The away_inter coefficient is negative: when away_inter = 1, attendance is reduced by the size of the coefficient.
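Here is a short sketch of the two regimes implied by the coefficients (the combined membership of 30,000 is illustrative):

```r
coefs <- coef(attendance_model2)

# Expected MCG attendance (thousands) for a combined membership of 30 (thousands)
local_game      <- coefs[['Members']] * 30                          # away_inter = 0
interstate_game <- coefs[['Members']] * 30 + coefs[['away_inter']]  # away_inter = 1

c(local = local_game, interstate = interstate_game)
```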

Model 2 vs. model 1

Plot the data and compare the new model to the previous model (with the regression lines).

plot(attendance$MCG ~ attendance$Members, xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = 'blue3', xlim = c(0,60), ylim = c(0,100), las = 1)

abline(attendance_model1, col = 'red3', lwd = 2, lty = 2)
abline(attendance_model2, col = 'darkgreen', lwd = 2, lty = 3)

legend(title = 'Model', 'topleft', inset = 0.05, c('1', '2'), col = c('red3', 'darkgreen'), lty = c(2, 3), lwd = c(2, 2), horiz = FALSE)

The new regression line is steeper; the dummy variable keeps the outliers from skewing the fit rightwards.

Model 2 subsets

Using the last model, plot two regression lines: with away_inter = 1, and with away_inter = 0.

plot(attendance$MCG ~ attendance$Members, xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = 'blue3', xlim = c(0,60), ylim = c(0,100), las = 1)

lines(summary(attendance_model2)$coefficients[1]*attendance$Members ~ attendance$Members, col = 'red3')

lines(summary(attendance_model2)$coefficients[1]*attendance$Members + summary(attendance_model2)$coefficients[2] ~ attendance$Members, col = 'orange1')

legend(title = 'away_inter', 'topleft', inset = 0.05, c('1', '0'), col = c('red3', 'orange1'), lty = c(1, 1), lwd = c(2, 2), horiz = FALSE)

The interaction.plot

Interactions can take many forms. In our case, the second model has two intercepts with a common slope; each data point contributes to one of the two levels.

interaction.plot(attendance$Members, attendance$away_inter, attendance$MCG, fixed = TRUE, type = 'p', xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = c('blue3', 'red3'), legend = FALSE, pch = c(1,2), ylim = c(0,100), las = 1)

legend(title = 'away_inter', 'topleft', inset = 0.05, c('0', '1'), col = c('blue3', 'red3'), pch = c(1, 2), horiz = FALSE)



Mapping the AFL clubs

So far, we have managed to address the problem with classical regression models. Nonetheless, we are dealing with spatial data. Just as some regression models are designed specifically for time series, other models are designed for spatial econometrics and geostatistics.

Before anything else, we can begin by plotting maps. Just like any plot, a map can tell a lot by itself.

Positional data

head(club_location)
## # A tibble: 6 x 6
##          city           state      lat     long  timezone `2006population`
##         <chr>           <chr>    <dbl>    <dbl>     <chr>            <int>
## 1    Adelaide South Australia -34.9333 138.6000  Adelaide          1040719
## 2    Brisbane      Queensland -27.5000 153.0167  Brisbane          1676389
## 3     Carlton        Victoria -37.8000 144.9669 Melbourne          3371888
## 4 Collingwood        Victoria -37.8031 144.9881 Melbourne          3371888
## 5    Essendon        Victoria -37.7556 144.9169 Melbourne          3371888
## 6   Footscray        Victoria -37.7989 144.9000 Melbourne          3371888

Number of clubs.

nrow(club_location)
## [1] 14

We have geographic details; most importantly, we have the geocodes (lat, long).

Preprocess the data

An EventData object is a data frame with at least three fields named EID, X, and Y. Each row specifies an event that occurs at a specific point.

library(PBSmapping)

EID <- row(club_location)[,1]
EID
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
aflPoints <- data.frame(EID = EID, X = club_location$long, Y = club_location$lat)

aflEvents <- as.EventData(aflPoints, projection = NA)

This operation is also known as geocoding points.

Mapping with PBSmapping

Create a map with PBSmapping (and its extensions). We can generate the map from a ‘shapefile’.

library(PBSmapping)
library(maptools)

ozShape <- importShapefile('aust_shapefiles/STE11aAust', readDBF = TRUE)

Next, we plot the map and add the club positions.

library(PBSmapping)
library(maptools)

plotPolys(ozShape,
  main = "AFL Clubs",
  xlab = "Longitude",
  ylab = "Latitude",
  xlim = c(105, 160),
  ylim = c(-45, -5),
  col = "antiquewhite",
  border = "antiquewhite3",
  lty = 3,
  bg = "aliceblue")

addPoints(aflEvents, col = "coral4", cex = 1, pch = 15)

PBSmapping requires lots of calculations to generate maps (it can take some time to see the results). The advantage is that we can map out anything by creating polygons to highlight boundaries, regions, roads, habitats, rivers, etc., and layer the shapes on top of a base map.

Since most clubs are located in or around Melbourne, we zoom in on the Melbourne metropolitan area by changing some of the parameters.

plotPolys(ozShape,
  main = "AFL Clubs - State of Victoria",
  xlab = "Longitude",
  ylab = "Latitude",
  xlim = c(142, 148),
  ylim = c(-39.5, -37.5),
  col = "antiquewhite",
  border = "antiquewhite3",
  lty = 3,
  bg = "aliceblue")

addPoints(aflEvents, col = "coral4", cex = 2, pch = 'O')

Most clubs are located in Greater Melbourne, with the exception of Geelong, to the south-west of the bay.

There are alternative packages for drawing maps.

Mapping with ggplot2

ggplot2 (and its extensions) is fast and versatile. We can render many map formats. Here we choose the stylish "watercolor" style.

library(ggplot2)
library(ggmap)
library(RColorBrewer)

Australia_map <- get_map(location = "Australia", zoom = 4, scale = 'auto', color = 'color', maptype = 'watercolor')

ggmap(Australia_map) +
  geom_point(data = club_location, aes(x = long, y = lat, shape = state), color = 'gray3', cex = 4, pch = 16) +
  labs(x = 'Longitude', y = 'Latitude', title = 'AFL Clubs') +
  theme(legend.position = 'none')

We change some parameters and zoom in on the Melbourne metropolitan area.

Australia_map <- get_map(location = c(lon = 145, lat = -38), zoom = 9, scale = 'auto', color = 'color', maptype = 'terrain')

ggmap(Australia_map) +
  geom_point(data = club_location, aes(x = long, y = lat), color = 'coral4', cex = 3, pch = 16) +
  labs(x = 'Longitude', y = 'Latitude', title = 'AFL Clubs') +
  theme(legend.position = 'none')

Geocoding with ggmap

Before drawing maps with PBSmapping and ggplot2, we geocoded the club positions and created two objects: aflPoints and aflEvents.

With ggmap, geocoding is much simpler.

head(club_location, 3)
## # A tibble: 3 x 6
##       city           state      lat     long  timezone `2006population`
##      <chr>           <chr>    <dbl>    <dbl>     <chr>            <int>
## 1 Adelaide South Australia -34.9333 138.6000  Adelaide          1040719
## 2 Brisbane      Queensland -27.5000 153.0167  Brisbane          1676389
## 3  Carlton        Victoria -37.8000 144.9669 Melbourne          3371888

We simply input the coordinates and ggmap does the rest with geom_point(data = aflplus, aes(x = long, y = lat).... Each location is an Internet request (be sure to be connected, and be advised that every query is recorded like any Google query).

We could also feed in landmarks (e.g., cities) and the geocode function takes care of the rest (just like googling the names).

library(ggmap)

# Input some landmarks
# The more details, the more precise
# Some requests may fail: misspelling, lack of details, etc.
stopovers <- c("SFO",
             "Chennai",
             "London",
             "Melbourne",
             "Johannesburg, SA") # we added a country

# Geocode the landmarks
geocoded_stopovers <- geocode(stopovers)

# Compute positions
stopovers.x <- geocoded_stopovers$lon
stopovers.y <- geocoded_stopovers$lat

# Create an object
mp <- NULL

# Plot the base map
mapWorld <- borders("world", colour = "gray50", fill = "gray50")

# Enrich the object
mp <- ggplot() + mapWorld

# Lay the cities on top
mp <- mp + 
  geom_point(aes(x = stopovers.x, y = stopovers.y) ,color = "red", size = 3) 

# Plot the results
mp

We can also reverse geocode a longitude/latitude location using Google Maps with revgeocode.

Mapping with maps

maps (and its extensions) is a nice and simple package for mapping.

We generate a map and add the club positions.

library(maps)
library(mapdata)
library(maptools)
library(scales)

map(database = 'world', regions = 'australia', fill = TRUE, col = 'antiquewhite', bg = 'aliceblue', mar = c(1,1,1,1))

points(aflPoints$X, aflPoints$Y, pch = 16, col = 'red', cex = 1.5)

title('AFL Clubs')

legend('bottom', inset = 0.05, title = 'Sites',  c('club'), col = c('red'), pch = c(16), horiz = TRUE)

However, zooming in is much more difficult than with the two previous packages. maps displays countries, countries with political subdivisions (e.g., the US states or counties), or regions-continents (e.g., the European Union).

We can also opt for the 'world' map…

library(maps)

map("world", fill = TRUE, col = 'antiquewhite', bg = 'aliceblue', mar = c(1,1,1,1))

points(aflPoints$X, aflPoints$Y, pch = 16, col = 'red', cex = 1.5)

title('The World')

…and focus on the region of 'australia'.

library(maps)

map(database = 'world', regions = 'australia', fill = TRUE, col = 'antiquewhite', bg = 'aliceblue', mar = c(1,1,1,1))

points(aflPoints$X, aflPoints$Y, pch = 16, col = 'red', cex = 1.5)

title('Australia')

The maps package offers different projections (e.g., projection = 'sinusoidal'). However, not every projection will line up with the plotted points, since ‘alternative’ projections distort the map.

Mapping with rworldmap

The rworldmap package displays the whole planet by default.

library(rworldmap)

newmap <- getMap(resolution = "low")

plot(newmap, asp = 1.5, col = 'antiquewhite', bg = 'aliceblue', mar = c(1,1,1,1))

title('The World')

However, we can delimit an area of the planet.

We generate a local map and add the club positions.

library(rworldmap)

newmap <- getMap(resolution = "low")

plot(newmap, xlim = c(110, 160), ylim = c(-30, -25), asp = 1, col = 'antiquewhite', bg = 'aliceblue', mar = c(1,1,1,1))

box()

points(aflPoints$X, aflPoints$Y, pch = 16, col = 'red', cex = 1.5)

title('AFL Clubs')

legend('bottomleft', inset = 0.05, title = 'Sites',  c('club'), col = c('red'), pch = c(16), horiz = TRUE)

Well, maps are the foundation of geointelligence. Geointelligence comprises geographic information systems (GIS), spatial analysis, and geospatial statistics (or geostatistics). It is useful in many situations; we can find applications everywhere.

  • Business.
    • Retail research.
    • Hospitality and tourism research.
    • Competitor location analysis.
    • Market segmentation and clustering.
  • Public health and lifestyle.
    • Patient health profiling.
    • Income disparities and public health.
  • Urbanism.
    • Public transit and commuting.
    • Socioeconomic surveys.
    • Housing markets.
    • City planning.
  • Political science.
    • Voting patterns.
  • And so on.

Takeaway

After visualizing all these maps, we can conclude that geographic location might play a role as an explanatory variable in a regression, whatever model we use.

  1. We already know that 88 % of match attendance is explained by club membership (the fan base).
  2. Including a dummy variable for distant clubs increases the explanatory power (92 %), as there are two regimes in the dataset.

However, a dummy variable is categorical: either we have a distant club or we do not. What if we wanted to incorporate the impact of distance on match attendance as a continuous variable?



The geoadditive models

We could implement a geoadditive model that uses distance. A simple geoadditive model takes this form.

\[Y = \beta_0 + \beta_iX_i + ... + \beta_j(Distance) + \varepsilon\]

We must first calculate the distance between the clubs.

First option – Calculating the distance

The simplest method would be the mapdist function from the ggmap package.

Second option – Calculating the geodesic distance

The geodesic distance could be another option. It is a more sophisticated approach. Consult Appendix A for more explanations.

We already have the coordinates in the club_location data frame.

Abbr. City Abbr. City
NM North Melbourne Haw Hawthorn
Ess Essendon Foot Footscray
Carl Carlton Coll Collingwood
Melb Melbourne Bris Brisbane
Gee Geelong Adel Adelaide
Rich Richmond Syd Sydney
StK St Kilda WC West Coast

WC = Perth.

Let’s assume we don’t have the data frame. We first have to geocode each club to find out about their coordinates.

library(ggmap)

# Create a vector to feed the function (max details)
club_location2 <- paste0(club_location$city, ", ", club_location$state, ", Australia")

head(club_location2)
## [1] "Adelaide, South Australia, Australia"
## [2] "Brisbane, Queensland, Australia"     
## [3] "Carlton, Victoria, Australia"        
## [4] "Collingwood, Victoria, Australia"    
## [5] "Essendon, Victoria, Australia"       
## [6] "Footscray, Victoria, Australia"
# Geocode the landmarks
geocoded_club <- geocode(club_location2)

head(geocoded_club)
##        lon      lat
## 1 138.6007 -34.9285
## 2       NA       NA
## 3       NA       NA
## 4       NA       NA
## 5 144.9120 -37.7490
## 6 144.8990 -37.7980
# Compute positions
club.x <- geocoded_club$lon
club.y <- geocoded_club$lat

geocoded_club2 <- as.data.frame(cbind(club = club_location$city, club.x, club.y))

geocoded_club2$club.x <- as.numeric(as.character((geocoded_club2$club.x)))
geocoded_club2$club.y <- as.numeric(as.character((geocoded_club2$club.y)))

head(geocoded_club2)
##          club   club.x   club.y
## 1    Adelaide 138.6007 -34.9285
## 2    Brisbane       NA       NA
## 3     Carlton       NA       NA
## 4 Collingwood       NA       NA
## 5    Essendon 144.9120 -37.7490
## 6   Footscray 144.8990 -37.7980
str(geocoded_club2)
## 'data.frame':    14 obs. of  3 variables:
##  $ club  : Factor w/ 14 levels "Adelaide","Brisbane",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ club.x: num  139 NA NA NA 145 ...
##  $ club.y: num  -34.9 NA NA NA -37.7 ...

Let’s check how well the geocoding function performed compared to the actuals from club_location.

Keep in mind that when we provide a city to the geocoding function, we are not providing a 3 m x 3 m area but one that can span several kilometers. Results will always vary a little.

In addition, many clubs are located in Greater Melbourne, and these ‘suburbs’ have coordinates very similar to Melbourne’s. One exception is Geelong: the city stands apart.

Melbourne and the suburbs.

library(ggmap)

Greater_Melbourne <- get_map(location = c(lon = 144.8, lat = -38), zoom = 9, maptype = 'terrain')

plot(Greater_Melbourne)

Greater Melbourne, Geelong, and St Kilda (from a picture).


We create a common variable so we can compare the geocoded_club2 data frame (the geocoded values) to the club_location data frame (the actuals).

We merge the two data frames and we compare them.

library(dplyr)

# Combine both data frames
check <- inner_join(club_location, geocoded_club2, by = c('city' = 'club'))
str(check)
## Classes 'tbl_df', 'tbl' and 'data.frame':    14 obs. of  8 variables:
##  $ city          : chr  "Adelaide" "Brisbane" "Carlton" "Collingwood" ...
##  $ state         : chr  "South Australia" "Queensland" "Victoria" "Victoria" ...
##  $ lat           : num  -34.9 -27.5 -37.8 -37.8 -37.8 ...
##  $ long          : num  139 153 145 145 145 ...
##  $ timezone      : chr  "Adelaide" "Brisbane" "Melbourne" "Melbourne" ...
##  $ 2006population: int  1040719 1676389 3371888 3371888 3371888 3371888 137220 3371888 3371888 3371888 ...
##  $ club.x        : num  139 NA NA NA 145 ...
##  $ club.y        : num  -34.9 NA NA NA -37.7 ...
# Compare longitude
check[, c(1,4,7)]
## # A tibble: 14 x 3
##               city     long   club.x
##              <chr>    <dbl>    <dbl>
##  1        Adelaide 138.6000 138.6007
##  2        Brisbane 153.0167       NA
##  3         Carlton 144.9669       NA
##  4     Collingwood 144.9881       NA
##  5        Essendon 144.9169 144.9120
##  6       Footscray 144.9000 144.8990
##  7         Geelong 144.3500 144.3617
##  8        Hawthorn 145.0358 145.0340
##  9       Melbourne 144.9667       NA
## 10 North Melbourne 144.9508       NA
## 11        Richmond 144.9981 144.9980
## 12        St Kilda 144.9786       NA
## 13          Sydney 151.2167 151.2093
## 14           Perth 115.8333 115.8605
# Percentage difference
round((check$club.x / check$long) - 1, 3)
##  [1]  0 NA NA NA  0  0  0  0 NA NA  0 NA  0  0
# Compare latitude
check[, c(1,3,8)]
## # A tibble: 14 x 3
##               city      lat    club.y
##              <chr>    <dbl>     <dbl>
##  1        Adelaide -34.9333 -34.92850
##  2        Brisbane -27.5000        NA
##  3         Carlton -37.8000        NA
##  4     Collingwood -37.8031        NA
##  5        Essendon -37.7556 -37.74900
##  6       Footscray -37.7989 -37.79800
##  7         Geelong -38.1583 -38.14992
##  8        Hawthorn -37.8200 -37.82600
##  9       Melbourne -37.8167        NA
## 10 North Melbourne -37.8031        NA
## 11        Richmond -37.8197 -37.82300
## 12        St Kilda -37.8675        NA
## 13          Sydney -33.8831 -33.86882
## 14           Perth -31.9333 -31.95053
# Percentage difference
round((check$club.y / check$lat) - 1, 3)
##  [1] 0.000    NA    NA    NA 0.000 0.000 0.000 0.000    NA    NA 0.000
## [12]    NA 0.000 0.001

In most cases, the percentage difference is less than 0.1%.

Visually, the geocoded values are located exactly where the actuals are.

library(ggmap)

# Australia
Australia_map <- get_map(location = "Australia", zoom = 4, scale = 'auto', color = 'color', maptype = 'terrain')

ggmap(Australia_map) +
  geom_point(data = geocoded_club2, aes(x = club.x, y = club.y), color = 'coral4', cex = 3, pch = 16) +
  labs(x = 'Longitude', y = 'Latitude', title = 'AFL Clubs') +
  theme(legend.position = 'right')

# Zoom
Australia_map <- get_map(location = c(lon = 145, lat = -38), zoom = 9, scale = 'auto', color = 'color', maptype = 'terrain')

ggmap(Australia_map) +
  geom_point(data = geocoded_club2, aes(x = club.x, y = club.y), color = 'coral4', cex = 3, pch = 16) +
  labs(x = 'Longitude', y = 'Latitude', title = 'AFL Clubs') +
  theme(legend.position = 'right')

Down the line, we will have to settle on a single reference set of city coordinates.

Once we have the coordinates, we can compute the geodesic distances between clubs.

We want to load an enriched version of the attendance dataset.

head(attendance)
## # A tibble: 6 x 9
##      MCG  Other  Temp Members Top50    Date  Home  Away away_inter
##    <dbl>  <dbl> <int>   <dbl> <int>   <chr> <chr> <chr>      <dbl>
## 1  8.653 72.921    24  12.601     5 27/3/93    NM  Bris          1
## 2 49.856 60.848    21  25.991     7  3/4/93   Ess  Carl          0
## 3 24.362 59.842    24  16.948     5 17/4/93    NM  Melb          0
## 4 46.588  9.272    22  27.046     8  1/5/93   Ess   Gee          0
## 5 29.296 74.798    17  22.874     7  8/5/93  Rich   StK          0
## 6 34.372 61.938    16  51.646     8 22/5/93   Ess  Adel          1

Check out the attendance2 dataset.

attendance2[1:5, c(9, 10)]
## # A tibble: 5 x 2
##              rec1      vis2
##             <chr>     <chr>
## 1 North Melbourne  Brisbane
## 2        Essendon   Carlton
## 3 North Melbourne Melbourne
## 4        Essendon   Geelong
## 5        Richmond  St Kilda

attendance2 has the full names of the clubs. Now, we need to add the coordinates from club_location.

library(dplyr)

# A left join keeps all rows on the left and complements the data frame with columns from the right, given a common key
# Receiving club
attendance3 <- left_join(attendance2, club_location, by = c("rec1" = "city"))

# Rename the coordinates; they belong to the receiving club, rec1
# Remove the original coordinates
attendance3$lat1 <- attendance3$lat; attendance3$lat <- NULL
attendance3$long1 <- attendance3$long; attendance3$long <- NULL
attendance3$`2006population1` <- attendance3$`2006population`; attendance3$`2006population` <- NULL
attendance3$state <- NULL
attendance3$timezone <- NULL


# Visiting club
attendance3 <- left_join(attendance3, club_location, by = c("vis2" = "city"))

# Rename...
# Remove...
attendance3$lat2 <- attendance3$lat; attendance3$lat <- NULL
attendance3$long2 <- attendance3$long; attendance3$long <- NULL
attendance3$`2006population2` <- attendance3$`2006population`; attendance3$`2006population` <- NULL

We feed these details into a geodesic function provided by the geosphere package. There are many functions for computing distances; we choose the Haversine method because we remain within Australia.

The Haversine method gives the shortest distance between two points, i.e., the ‘great-circle distance’ or distance ‘as the crow flies’; it assumes no obstacles and ignores ellipsoidal effects.

The Earth is not a perfect sphere. However, we are not calculating intercontinental distances, so the calculation error is tolerable. Consult Appendix A for more explanations.
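For reference, the haversine formula for two points with latitudes \(\varphi_1, \varphi_2\) and longitudes \(\lambda_1, \lambda_2\), on a sphere of radius \(r\), is:

\[d = 2r\arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1\cos\varphi_2\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)\]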

library(geosphere)

club1_coordinates <- data.frame(long1 = attendance3$long1, lat1 = attendance3$lat1)

club2_coordinates <- data.frame(long2 = attendance3$long2, lat2 = attendance3$lat2)

club_distance <- distHaversine(club1_coordinates, club2_coordinates) # in meters

head(club_distance)
## [1] 1372323.564    6616.874    2060.939   66973.298    5590.374  647481.790
attendance3 <- cbind(attendance3, distance = club_distance)

We now have a continuous measurement of distance, and not just a categorical variable.
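As a sanity check, a minimal hand-rolled haversine (using the same default Earth radius, 6378137 m, as geosphere's distHaversine) should reproduce the figures above:

```r
# Minimal haversine implementation (coordinates in degrees, radius in meters)
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Melbourne to Sydney, in kilometers
haversine(144.9667, -37.8167, 151.2167, -33.8831) / 1000
```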

Here is a comparison for the 41 games.

bigrams <- paste0(attendance3$rec1, "-", attendance3$vis2)
attendance4 <- attendance3
attendance4$bigrams <- bigrams

pairs <- data.frame(bigrams = attendance4$bigrams, distance = attendance4$distance)

library(dplyr)

pairs <- pairs %>% arrange(desc(distance))

par(mai = (c(0.2,2,0.2,0.2)))
barplot(pairs$distance/1000, horiz = TRUE, names.arg = pairs$bigrams, las = 1, cex.names = 0.8, cex.axis = 0.8)

par(mai = (c(0.2,0.2,0.2,0.2)))

library(ggmap)

plot(get_map(location = c(lon = 133, lat = -27), zoom = 4, maptype = 'terrain'))

Model 3

We can implement a simple geoadditive model.

attendance_model3 <- lm(MCG ~ 0 + Members + distance, data = attendance3)

summary(attendance_model3)
## 
## Call:
## lm(formula = MCG ~ 0 + Members + distance, data = attendance3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.156  -4.121   0.643   7.480  23.453 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## Members   1.696e+00  7.933e-02  21.385  < 2e-16 ***
## distance -1.248e-05  2.699e-06  -4.625 4.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.74 on 39 degrees of freedom
## Multiple R-squared:  0.9255, Adjusted R-squared:  0.9217 
## F-statistic: 242.4 on 2 and 39 DF,  p-value: < 2.2e-16

The \(R^2\) is good and the p-values are significant at the 1 % level. According to the model, greater distances have a negative effect on attendance.

par(mfrow = c(2,2))
plot(attendance_model3)

par(mfrow = c(1,1))

Plot the data with the regression lines.

plot(attendance3$MCG ~ attendance3$Members, xlab = 'Combined membership of clubs playing (000)', ylab = 'MCG Match Day Attendance (000)', col = 'blue3', xlim = c(0,60), ylim = c(0,100))

# 1
abline(attendance_model1, col = 'red2', lwd = 2, lty = 3)
# 2
abline(attendance_model2, col = 'darkgreen', lwd = 2, lty = 2)
# 3
abline(attendance_model3, col = 'magenta', lwd = 2, lty = 4)

legend(title = 'Model', 'topleft', inset = 0.05, c('1', '2', '3'), col = c('red3', 'darkgreen', 'magenta'), lty = c(3, 2, 4), lwd = c(2, 2, 2), horiz = FALSE)

We notice that, as we go from the first model to the last, the outliers (further right) weigh less and the regression line gets steeper.

However, the continuous distance (Model 3) has much the same effect as the categorical (binary) distance (Model 2). In this situation, the continuous distance adds little value. Perhaps with more observations, a continuous distance could capture variability that a dummy variable cannot.

In both cases (Models 2 and 3), the distance term dampens the influence of the Members variable (i.e., of its extreme values).

We can compare the \(R^2\).

# Model 1
round(summary(attendance_model1)$r.squared, 4)
# Model 2
round(summary(attendance_model2)$r.squared, 4)
# Model 3
round(summary(attendance_model3)$r.squared, 4)
## [1] 0.8847
## [1] 0.9246
## [1] 0.9255
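Since Models 2 and 3 are nearly tied on \(R^2\), an information criterion gives another angle; a quick sketch (lower AIC is better):

```r
# AIC trades goodness of fit against model complexity; lower is better
AIC(attendance_model1, attendance_model2, attendance_model3)
```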

Going further

There are more advanced geoadditive models (geostatistical models or spatial econometrics).

These models include spatial autocorrelation: the fact that adjacent areas are correlated just like with autocorrelation in time series (a.k.a. autoregressive, distributed-lag, and ARIMA models).

Spatial Lag and Spatial Error models might do the trick.
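As a hedged sketch of what such models look like in R — on simulated coordinates and data, since the attendance dataset carries no coordinates here, and assuming the spdep and spatialreg packages are installed:

```r
# Sketch only: spatial lag and spatial error models on simulated data.
library(spdep)       # neighbour lists and spatial weights
library(spatialreg)  # spatial lag / spatial error estimation

set.seed(42)
n      <- 50
coords <- cbind(lon = runif(n), lat = runif(n))  # hypothetical locations
x      <- rnorm(n)
y      <- 2 + 0.5 * x + rnorm(n)                 # hypothetical response

# k-nearest-neighbour spatial weights (k = 4)
lw <- nb2listw(knn2nb(knearneigh(coords, k = 4)))

# Spatial lag model: y depends on x and on neighbouring values of y
fit_lag <- lagsarlm(y ~ x, listw = lw)
# Spatial error model: spatial autocorrelation sits in the error term
fit_err <- errorsarlm(y ~ x, listw = lw)

summary(fit_lag)
```

With real match data we would need coordinates for each pair of clubs (or venues) to build the weights matrix `lw`.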



The gravity models

We could use a gravity model of migration (since fans migrate or transit from city to city). Gravity models originate from international trade models and take the following form.

\[T_{ij} = \alpha + \beta_1 \log(GDP_i) + \beta_2 \log(GDP_j) + \beta_3 \log\tau_{ij} + \varepsilon_{ij}\]

and

\[\log\tau_{ij} = \log(distance_{ij})\]

We can add explanatory variables.

\[T_{ij} = \alpha + \beta_1 \log(GDP_i) + \beta_2 \log(GDP_j) + \beta_3 \log\tau_{ij} + \beta_4(Adj_{ij}) + \beta_5(Bloc_{ij}) + \ldots + \varepsilon_{ij}\]

Where we have categorical variables for adjacency, member or not of a trade bloc, etc.

Consult Appendix B for more explanations.

Since we have fans (or tourists) and we want to model attendance (MCG), we can replace GDP with population (larger cities are more likely to attract tourists and provide larger audiences for events), and use distance along with Members and other additional regressors. The games took place in 1993, but the population figures come from the 2006 census. In addition, we apply the Greater Melbourne population to Melbourne and all suburban clubs.

attendance_model4 <- lm(log(MCG) ~ log(`2006population1`) + log(`2006population2`) + log(distance) + Members, data = attendance3)

summary(attendance_model4)
## 
## Call:
## lm(formula = log(MCG) ~ log(`2006population1`) + log(`2006population2`) + 
##     log(distance) + Members, data = attendance3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5557 -0.2187  0.0000  0.2041  0.5223 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7.059642   1.640759   4.303 0.000124 ***
## log(`2006population1`) -0.083294   0.088695  -0.939 0.353932    
## log(`2006population2`) -0.114711   0.047126  -2.434 0.020017 *  
## log(distance)          -0.138236   0.020964  -6.594 1.13e-07 ***
## Members                 0.032344   0.005327   6.072 5.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2771 on 36 degrees of freedom
## Multiple R-squared:  0.6963, Adjusted R-squared:  0.6626 
## F-statistic: 20.64 on 4 and 36 DF,  p-value: 6.522e-09

The \(R^2\) is worse than in the previous models. Moreover, the home-city population (`2006population1`) is clearly not significant. According to the model, greater distances have a negative effect on attendance (as we would expect, since large distances also have detrimental effects on trade).

We could run different variations of the model, substituting population with membership or other attendance drivers.



Conclusion

Models 2 (with a dummy for distance) and 3 (with continuous distance) appear to be the best for this situation. According to the law of parsimony (Occam’s razor), we should opt for Model 2 alone. With more observations, Model 3 might outperform Model 2. As for Model 4 (the gravity model), we would need more observations and more explanatory variables to try out.

Appendix A – Calculating Geodesic Distances

The earth is a three-dimensional object. We cannot simply pretend that we are in flatland when we calculate the great-circle distance between two points.

The geodesic distance is the shortest distance over the earth’s surface – giving an ‘as-the-crow-flies’ distance between the points (ignoring any hills).

The geosphere package provides functions to compute geodesic distances.

Spherical Law of Cosines

Many of the formulas for calculating the distance between longitude/latitude points assume a spherical earth. It simplifies things greatly and for most purposes is quite an adequate approximation. The simple formula for calculating the great-circle distance is the Spherical Law of Cosines.
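Writing \(\phi_1, \phi_2\) for the two latitudes, \(\Delta\lambda\) for the difference in longitude, and \(R\) for the Earth’s radius, the formula is:

\[d = R \cdot \arccos\bigl(\sin\phi_1 \sin\phi_2 + \cos\phi_1 \cos\phi_2 \cos\Delta\lambda\bigr)\]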

distCosine.

Haversine

However, at very small distances, inverting the cosine magnifies rounding errors.

An alternative formulation that is more robust at small distances is the Haversine formula.
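With the same notation and \(\Delta\phi\) for the difference in latitude, the Haversine formula is:

\[d = 2R \cdot \arcsin\left(\sqrt{\sin^2\tfrac{\Delta\phi}{2} + \cos\phi_1 \cos\phi_2 \sin^2\tfrac{\Delta\lambda}{2}}\right)\]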

distHaversine.

Vincenty inverse formula for ellipsoids

While both the Spherical Law of Cosines and the Haversine formula are simple, they both assume a spherical Earth.

Taking into account that Earth is not perfectly spherical makes things a bit messier. Vincenty inverse formula for ellipsoids is an iterative method used to calculate the ellipsoidal distance between two points on the surface of a spheroid.

distVincentyEllipsoid and distVincentySphere.

For more explanations, consult this website.
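As a quick comparison of the three geosphere functions, using approximate coordinates for Melbourne and Perth (longitude/latitude order, in degrees — the coordinates are my own approximations, not from the dataset):

```r
# Comparing geosphere's distance functions; results are in metres,
# divided by 1000 to get kilometres.
library(geosphere)

melbourne <- c(144.96, -37.81)  # approximate lon/lat
perth     <- c(115.86, -31.95)  # approximate lon/lat

distCosine(melbourne, perth) / 1000             # spherical law of cosines
distHaversine(melbourne, perth) / 1000          # haversine
distVincentyEllipsoid(melbourne, perth) / 1000  # Vincenty on the WGS84 ellipsoid
```

For a continental separation like Melbourne–Perth (on the order of 2,700 km), the three estimates agree to within a few kilometres; the differences matter more at very small or very precise scales.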

Appendix B – The Gravity Model

The gravity model of migration or trade is a model in urban geography derived from Newton’s law of gravity and used to predict the degree of interaction between two locations.

Newton’s law states that: “Any two bodies attract one another with a force that is proportional to the product of their masses and inversely proportional to the square of the distance between them.”

\[F = G~\frac{m_1 \times m_2}{r^2}\]

where \(F\), the force of attraction, is related to \(G\), a gravitational constant, \(m_1\) and \(m_2\), the masses of the two bodies, and \(r\), the distance between them.

In migration, trade, or tourism, ‘bodies’ and ‘masses’ are replaced by ‘locations’ and ‘importance’. ‘Importance’ can be measured in terms of population numbers, gross domestic product (GDP), or other appropriate variables.

\[T_{ij} = C~\frac{GDP_i \times GDP_j}{Dist_{ij}^2}\]

Where \(T_{ij}\), trade, is related to \(C\), a constant, the GDPs of the trading partners, and the distance between them. A similar rationale applies to migration or tourism models; we simply adapt the explanatory variables.

  1. As the importance of one or both of the locations increases, trade, migration or traffic between them increases.
  2. The farther apart the two locations are, the less movement there is between them (distance decay, or the friction of distance).

Distance decay, or the friction of distance, \(I = \frac{\alpha}{d^{2}}\), can be graphically represented by a curve that falls steeply at first and then flattens as it approaches the x-axis.
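As a toy illustration of this decay (the numbers are invented, not taken from the data), doubling the "masses" multiplies the interaction, while multiplying the distance by ten divides it by a hundred:

```r
# Toy gravity function: interaction proportional to the product of masses,
# inversely proportional to the squared distance.
gravity <- function(m1, m2, dist, C = 1) C * m1 * m2 / dist^2

gravity(100, 100, 10)   # nearby pair        -> 100
gravity(100, 100, 100)  # ten times farther  -> 1 (100x less interaction)
```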

We can observe distance decay in other models, where it explains the density of pedestrian traffic, the height of buildings, or the price of land.

The linear gravity model takes this form.

\[T_{ij} = \alpha + \beta_1 \log(GDP_i) + \beta_2 \log(GDP_j) + \beta_3 \log\tau_{ij} + \varepsilon_{ij}\]

and

\[\log\tau_{ij} = \log(distance_{ij})\]

We can add explanatory variables.

\[T_{ij} = \alpha + \beta_1 \log(GDP_i) + \beta_2 \log(GDP_j) + \beta_3 \log\tau_{ij} + \beta_4(Adj_{ij}) + \beta_5(Bloc_{ij}) + \ldots + \varepsilon_{ij}\]

Where we have categorical variables for adjacency (yes or no) and membership (or not) of a trade bloc.