Foreword

  • Output options: ‘pygments’ syntax, the ‘readable’ theme.
  • Snippets and results.
  • Source: ‘A Primer for Spatial Econometrics with Applications in R, Palgrave Macmillan, 2014’.


Cases from the book

The cases are taken from the book examples and exercises.

  • Case 1, Italy macroeconomics
    • 1a, The Barro and Sala-i-Martin model of regional convergence
    • 1b, Okun’s Law for the 20 Italian regions
    • 1c, Phillips curve for the 20 Italian regions
  • Case 2: UK macroeconomics
  • Case 3: US and spatial analyses (used car price and taxes in 48 states)
  • Case 4: House price determinants in Boston
  • Case 5: The determinants of educational achievement in Georgia
  • Case 6: EU, the impact of education & Hi-tech exports
  • Case 7: The determinants of crime in Columbus, Ohio
  • Case 8: Luxury houses in Baltimore, Maryland
  • Case 9: Health in Mexico


Case 1: Italy macroeconomics

1a, The Barro and Sala-i-Martin model of regional convergence

We use a model…

\[log\frac{y_{it}}{y_{i0}} = \alpha + \beta_0 y_{i0} + \varepsilon_i~~~~~~~~~~t = 1, 2, ..., T\]

…where the growth rate of GDP per capita, per region, at a certain moment of time (expressed as the logarithm of the ratio) is modelled as a function of the GDP per capita at the beginning of the period.

\[log\frac{(GDP~per~Cap)_{it}}{(GDP~per~Cap)_{i0}} = \alpha + \beta_0 (GDP~per~Cap)_{i0} + \varepsilon_i\]

where, for year \(t\) and region \(i\), the parameter \(b = -\tfrac{ln(1+\beta)}{T}\) represents the so-called ‘speed of convergence’.

  • If the slope of this linear model is negative, then the regions that are poorer at the beginning of the period experience higher growth rates and,
  • conversely, the regions with the higher GDP per capita at the beginning of the period experience lower growth rates.

This indicates convergence of the regions towards the same level of GDP per capita.

The dataset

We use macroeconomic indicators from 20 Italian regions.

head(barro, 3)
##          region per_capita_gdp growth_of_gdp agg_region
## 1      Piemonte            130           2.7  northwest
## 2 Valle d'Aosta            150           2.5  northwest
## 3     Lombardia            140           2.7  northwest

The classical linear regression model

From the above model, growth_of_gdp is \(log\frac{(GDP~per~Cap)_{it}}{(GDP~per~Cap)_{i0}}\) and per_capita_gdp is \((GDP~per~Cap)_{i0}\).

barro_ols <- lm(growth_of_gdp ~ per_capita_gdp, data = barro)

summary(barro_ols)
## 
## Call:
## lm(formula = growth_of_gdp ~ per_capita_gdp, data = barro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2104 -0.3973  0.0085  0.5880  0.8749 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.161369   0.731837   8.419 1.18e-07 ***
## per_capita_gdp -0.029510   0.005752  -5.130 7.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6425 on 18 degrees of freedom
## Multiple R-squared:  0.5938, Adjusted R-squared:  0.5713 
## F-statistic: 26.32 on 1 and 18 DF,  p-value: 7.011e-05

The coefficients and the F-statistic are highly significant. According to the \(R^2\), the model captures almost 60% of the growth variability.

In the top-right corner plot, the residuals appear normally distributed. In the other plots, the residuals appear evenly spread. We can see some outliers from the Cook’s distance.
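The diagnostic plots discussed here are the standard `lm` diagnostics; a minimal sketch to reproduce them, assuming the barro_ols object fitted above:

```r
# Standard OLS diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(barro_ols)
par(mfrow = c(1, 1))
```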

We test for normality of the residuals.

library(moments)
jarque.test(barro_ols$residuals)

library(tseries)
jarque.bera.test(barro_ols$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  barro_ols$residuals
## JB = 1.0884, p-value = 0.5803
## alternative hypothesis: greater
## 
## 
##  Jarque Bera Test
## 
## data:  barro_ols$residuals
## X-squared = 1.0884, df = 2, p-value = 0.5803

Both tests accept the null hypothesis of normality.

We test for homoscedasticity.

library(lmtest)
bptest(barro_ols)
## 
##  studentized Breusch-Pagan test
## 
## data:  barro_ols
## BP = 0.0045472, df = 1, p-value = 0.9462

The test accepts the null hypothesis of homoscedasticity.

Since the model has a single explanatory variable, multicollinearity is not a concern.

However, these tests only lead to a correct decision if the residuals are also uncorrelated, so we must check for autocorrelation.

We test for autocorrelation.

library(lmtest)
dwtest(barro_ols)

library(car)
durbinWatsonTest(barro_ols)
## 
##  Durbin-Watson test
## 
## data:  barro_ols
## DW = 1.1891, p-value = 0.01484
## alternative hypothesis: true autocorrelation is greater than 0
## 
##  lag Autocorrelation D-W Statistic p-value
##    1       0.3781296      1.189083   0.032
##  Alternative hypothesis: rho != 0

We accept the alternative hypothesis of autocorrelation (\(\rho \neq 0\)), seemingly positive (\(\rho > 0\)).

Here is a scatter diagram of the data with the barro_ols model regression line.
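A minimal sketch of such a scatter plot; the colour and symbol split by macro-area is an assumption based on the agg_region variable:

```r
# Scatter of the data with the OLS regression line;
# northern regions as red squares, the rest as green dots (assumed scheme)
north <- barro$agg_region %in% c("northwest", "northeast")
plot(growth_of_gdp ~ per_capita_gdp, data = barro,
     pch = ifelse(north, 15, 19),
     col = ifelse(north, "red3", "forestgreen"))
abline(barro_ols, lty = 2, col = "grey40")
```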

First, for every one-point increase in per_capita_gdp, growth_of_gdp changes by -0.02951. The richer a region is at the start of the period, the slower GDP growth it should expect.

Second, the graph shows an evident North-South pattern (in other words, a spatial pattern).

It is obvious that the linear regression model, despite an \(R^2\) close to 60 %, is plagued by some sort of autocorrelation of the residuals (the Durbin-Watson tests). Since this is a cross-sectional dataset, not a time series, the residuals cannot be correlated with their own lagged values in time; the correlation must come from somewhere else.

Further analysis is required.

Mapping regional Italy

A map is a powerful visualization tool for exploratory and explanatory analyses.


We can download country map files from Global Administrative Areas (GADM). Many file formats are available. For each country, there are versions for the different subdivisions (provinces, states, regions, counties, etc).

We preprocess the data.

# Upload the spatial vector object
library(rgdal)

ITA_reg1 <- readOGR("Italy_shapefiles/ITA_adm1.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "Italy_shapefiles/ITA_adm1.shp", layer: "ITA_adm1"
## with 20 features
## It has 12 fields
## Integer64 fields read as strings:  ID_0 ID_1 CCN_1
ITA_adm1 <- readRDS("rds/ITA_adm1.rds") # this copy of the map, used below, includes an OBJECTID field
head(ITA_reg1@data, 3)
##   ID_0 ISO NAME_0 ID_1     NAME_1 HASC_1 CCN_1 CCA_1  TYPE_1 ENGTYPE_1
## 0  112 ITA  Italy    1    Abruzzo   <NA>     0    13 Regione    Region
## 1  112 ITA  Italy    2     Apulia   <NA>     0    16 Regione    Region
## 2  112 ITA  Italy    3 Basilicata   <NA>     0    17 Regione    Region
##   NL_NAME_1                              VARNAME_1
## 0      <NA>      Abruzos|Abruzzen|Abruzzes|Abruzzi
## 1      <NA> Apulien|Pouilles|Pouille|Puglia|Puglie
## 2      <NA>                     Basilicate|Lucania
# Prepare the colours
# All regions in forestgreen
myColours <- rep("forestgreen", 20)
# Northern regions in red3
myColours[c(6,7,9,10,13,17,19,20)] <- "red3"
myColours <- data.frame(region = 1:20, myColours = myColours)

# Find the centroids
library(sp)

centroids <- coordinates(ITA_reg1)

Alternative 1

plot(ITA_reg1, lty = 1, col = as.character(myColours$myColours), border = 'darkgrey')
text(centroids, label = ITA_reg1$NAME_1, cex = 0.6, col ='black')

Alternative 2

library(maps)
library(maptools) # provides SpatialPolygons2map()

map(SpatialPolygons2map(ITA_reg1, namefield = NULL), lty = 1, fill = TRUE, col = 'gray95')
# Adding colours to regions is complicated...
text(centroids, label = ITA_reg1$NAME_1, cex = 0.7, col ='brown')
# but we can add lat/long axes
map.axes(cex.axis = 0.8, las = 1)

Alternative 3

# Convert the spatial object to a data frame
library(broom)

ITA_adm1df <- tidy(ITA_adm1)

head(ITA_adm1df, 3)
##       long      lat order  hole piece group id
## 1 13.91542 42.89561     1 FALSE     1   1.1  1
## 2 13.91542 42.89542     2 FALSE     1   1.1  1
## 3 13.91569 42.89542     3 FALSE     1   1.1  1
# Merge with colours
ITA_adm1df <- merge(ITA_adm1df,
                    myColours,
                    by.x = 'id',
                    by.y = 'region')

# Merge again to recapture some variable (NAME_1)
ITA_adm1df <- merge(ITA_adm1df,
                    ITA_adm1@data,
                    by.x = 'id',
                    by.y = 'OBJECTID')

head(ITA_adm1df, 3)
##   id     long      lat order  hole piece group   myColours ID_0 ISO NAME_0
## 1  1 13.91542 42.89561     1 FALSE     1   1.1 forestgreen  112 ITA  Italy
## 2  1 13.91542 42.89542     2 FALSE     1   1.1 forestgreen  112 ITA  Italy
## 3  1 13.91569 42.89542     3 FALSE     1   1.1 forestgreen  112 ITA  Italy
##   ID_1  NAME_1 HASC_1 CCN_1 CCA_1  TYPE_1 ENGTYPE_1 NL_NAME_1
## 1    1 Abruzzo           NA    13 Regione    Region          
## 2    1 Abruzzo           NA    13 Regione    Region          
## 3    1 Abruzzo           NA    13 Regione    Region          
##                           VARNAME_1
## 1 Abruzos|Abruzzen|Abruzzes|Abruzzi
## 2 Abruzos|Abruzzen|Abruzzes|Abruzzi
## 3 Abruzos|Abruzzen|Abruzzes|Abruzzi
# Convert the centroids
centroidsdf <- data.frame(centroids)
names(centroidsdf) <- c('c_long', 'c_lat')
centroidsdf <- cbind(region = 1:20, name = ITA_adm1$NAME_1, centroidsdf)

head(centroidsdf, 3)
##   region       name   c_long    c_lat
## 0      1    Abruzzo 13.85514 42.22803
## 1      2     Apulia 16.62046 40.98463
## 2      3 Basilicata 16.08225 40.50067
# Map it
library(ggplot2)

ggplot(data = ITA_adm1df,
       aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = myColours)) +
  scale_fill_manual(name = 'Regions',
                    labels = c('Southern', 'Northern'),
                    values = c('forestgreen', 'red3')) +
  geom_path(color = 'gray') +
  geom_text(data = centroidsdf,
            aes(x = c_long, y = c_lat, label = name, group = 1), size = 4, colour = 'snow') +
  coord_map()

Alternative 3 produces the best map. The northern regions are in red. They consist of:

# North-West
as.character(subset(barro[, 1], barro$agg_region == 'northwest')) 
## [1] "Piemonte"      "Valle d'Aosta" "Lombardia"     "Liguria"
# North-East
as.character(subset(barro[, 1], barro$agg_region == 'northeast'))
## [1] "Trentino-Alto Adige"   "Veneto"                "Friuli-Venezia Giulia"
## [4] "Emilia-Romagna"

The 20 Italian regions.

as.data.frame(ITA_adm1@data$NAME_1)
##     ITA_adm1@data$NAME_1
## 1                Abruzzo
## 2                 Apulia
## 3             Basilicata
## 4               Calabria
## 5               Campania
## 6         Emilia-Romagna
## 7  Friuli-Venezia Giulia
## 8                  Lazio
## 9                Liguria
## 10             Lombardia
## 11                Marche
## 12                Molise
## 13              Piemonte
## 14              Sardegna
## 15                Sicily
## 16               Toscana
## 17   Trentino-Alto Adige
## 18                Umbria
## 19         Valle d'Aosta
## 20                Veneto

Consequence

The geographic visualization reiterates what the graph illustrated: the dots from the northern regions are all contiguous.

The plots and the maps help in visualizing a phenomenon called ‘spatial autocorrelation’.

The Durbin-Watson test gives a clue about autocorrelation of the residuals, but the test is designed to measure a lagged effect in time series.

The right procedure to test for ‘spatial autocorrelation’ is the ‘Moran’s I test’.

The Moran’s I test

Compute the Moran’s I test using the W matrix.

The .GAL file approach.

library(spdep)

# List the regions
ita_regions <- seq(1,20)

# Read the file mapping contiguity
nbItaly <- read.gal("GAL/Italy.GAL", region.id = ita_regions)
summary(nbItaly)
## Neighbour list object:
## Number of regions: 20 
## Number of nonzero links: 67 
## Percentage nonzero weights: 16.75 
## Average number of links: 3.35 
## Non-symmetric neighbours list
## Link number distribution:
## 
## 1 2 3 4 5 6 
## 3 3 3 8 1 2 
## 3 least connected regions:
## 2 6 20 with 1 link
## 2 most connected regions:
## 8 12 with 6 links
# Convert to a W matrix
wItaly <- nb2listw(nbItaly, style = "W")
summary(wItaly)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 20 
## Number of nonzero links: 67 
## Percentage nonzero weights: 16.75 
## Average number of links: 3.35 
## Non-symmetric neighbours list
## Link number distribution:
## 
## 1 2 3 4 5 6 
## 3 3 3 8 1 2 
## 3 least connected regions:
## 2 6 20 with 1 link
## 2 most connected regions:
## 8 12 with 6 links
## 
## Weights style: W 
## Weights constants summary:
##    n  nn S0     S1   S2
## W 20 400 20 14.375 83.9
# Run the test
lm.morantest(barro_ols, wItaly)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = growth_of_gdp ~ per_capita_gdp, data = barro)
## weights: wItaly
## 
## Moran I statistic standard deviate = 2.2323, p-value = 0.0128
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       0.27115168      -0.09657234       0.02713493

The test accepts the alternative hypothesis of (positive) spatial autocorrelation (alternative hypothesis: greater).

This is problematic.

Conclusion

Because of spatial autocorrelation, the OLS model is not the proper approach for the situation.

In the next sub-case, we go deeper and propose an alternative model to deal with spatial autocorrelation.

1b, Okun’s Law for the 20 Italian regions

Okun’s Law is the inverse relationship between the variation of real GDP and the variation of the unemployment rate:

\[\Delta Real~GDP \rightarrow \Delta Unemployment~Rate\]

We present the OLS results before visualizing the relationship. Let us start from the beginning.

The dataset

The variations of the two variables were observed over the period 1990-2010. Here are the data for the 20 regions.

Region Variation of Unemployment Rate Variation of Real GDP
Piemonte 4.20 1.0
Valle d’Aosta 3.20 1.9
Lombardia 3.40 1.7
Trentino-Alto Adige 2.75 1.7
Veneto 3.30 1.8
Friuli-Venezia Giulia 3.40 1.9
Liguria 4.80 2.3
Emilia-Romagna 2.90 2.0
Toscana 4.30 1.1
Umbria 4.60 2.3
Marche 4.20 1.8
Lazio 6.40 2.0
Abruzzo 6.20 0.5
Molise 8.10 0.9
Campania 11.20 0.4
Apulia 11.20 1.8
Basilicata 9.60 1.4
Calabria 11.30 0.2
Sicily 13.00 0.1
Sardegna 9.90 0.7

The classical linear regression model

The graph above shows that the variation in the unemployment rate is systematically higher in the South (and vice-versa in the North).

As with the first model (barro_ols of regional convergence), such a geographical pattern might be a symptom of spatial autocorrelation.

The results of the OLS model are shown here.
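The okun_ols object is not fitted explicitly in the text; it presumably comes from a call such as (the formula matches the Call shown in the summary output):

```r
# Presumed OLS fit of Okun's Law for the 20 Italian regions
okun_ols <- lm(variation_of_unemployment_rate ~ variation_of_real_gdp, data = okun)
```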

summary(okun_ols)
## 
## Call:
## lm(formula = variation_of_unemployment_rate ~ variation_of_real_gdp, 
##     data = okun)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4449 -1.7419 -0.3307  1.4994  6.2162 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             10.971      1.283   8.551 9.38e-08 ***
## variation_of_real_gdp   -3.326      0.835  -3.984 0.000871 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.562 on 18 degrees of freedom
## Multiple R-squared:  0.4686, Adjusted R-squared:  0.4391 
## F-statistic: 15.87 on 1 and 18 DF,  p-value: 0.0008705

The coefficients and F-statistic are highly significant.

library(moments)
jarque.test(okun_ols$residuals)

library(lmtest)
bptest(okun_ols)
## 
##  Jarque-Bera Normality Test
## 
## data:  okun_ols$residuals
## JB = 1.2331, p-value = 0.5398
## alternative hypothesis: greater
## 
## 
##  studentized Breusch-Pagan test
## 
## data:  okun_ols
## BP = 0.022502, df = 1, p-value = 0.8808

The Jarque-Bera and Breusch-Pagan tests lead to the acceptance of, respectively, normality and homoscedasticity.

Since the model has a single explanatory variable, multicollinearity is not a concern.

We test for autocorrelation.

library(lmtest)
dwtest(okun_ols)

library(car)
durbinWatsonTest(okun_ols)
## 
##  Durbin-Watson test
## 
## data:  okun_ols
## DW = 1.1246, p-value = 0.01202
## alternative hypothesis: true autocorrelation is greater than 0
## 
##  lag Autocorrelation D-W Statistic p-value
##    1         0.38079      1.124628   0.024
##  Alternative hypothesis: rho != 0

We accept the alternative hypothesis of autocorrelation (\(\rho \neq 0\)).

However, the detected autocorrelation is in fact spatial. The Durbin-Watson test captures some of the spatial autocorrelation because the test is derived from the Moran’s I test of spatial autocorrelation. Before going any further with this test and other spatial techniques, we must first compute the W matrix.

Computing the W matrix

The .GAL file approach.

library(spdep)

# List the regions
ita_regions <- seq(1,20)

# Read the file mapping contiguity
nbItaly1 <- read.gal("GAL/Italy.GAL", region.id = ita_regions)

# Convert to a W matrix
wItaly1 <- nb2listw(nbItaly1, style = "W")
wItaly1
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 20 
## Number of nonzero links: 67 
## Percentage nonzero weights: 16.75 
## Average number of links: 3.35 
## Non-symmetric neighbours list
## 
## Weights style: W 
## Weights constants summary:
##    n  nn S0     S1   S2
## W 20 400 20 14.375 83.9

The spatial vector file approach.

We can download country map files from Global Administrative Areas. We used SpatialPolygonsDataFrame files.

library(spdep)
library(rgdal)

# Read the shapefile (.shp); the companion .shx and .dbf files are read along with it
Italy_map <- readOGR("Italy_shapefiles/ITA_adm1.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "Italy_shapefiles/ITA_adm1.shp", layer: "ITA_adm1"
## with 20 features
## It has 12 fields
## Integer64 fields read as strings:  ID_0 ID_1 CCN_1
# List the variable in the dataset
names(Italy_map)
##  [1] "ID_0"      "ISO"       "NAME_0"    "ID_1"      "NAME_1"   
##  [6] "HASC_1"    "CCN_1"     "CCA_1"     "TYPE_1"    "ENGTYPE_1"
## [11] "NL_NAME_1" "VARNAME_1"
# Identify the centroids
coords <- coordinates(Italy_map)
head(coords, 3)
##       [,1]     [,2]
## 0 13.85514 42.22803
## 1 16.62046 40.98463
## 2 16.08225 40.50067
# Plot the map without and with the ID
par(mfrow = c(1,2))

plot(Italy_map)

plot(Italy_map)
text(coords, label = sapply(slot(Italy_map, "polygons"), function(i) slot(i, "ID")), cex = 0.9, col = 'red3')

par(mfrow = c(1,1))
# Compute the contiguity-based neighbours' list
# queen (all directions)
# rook (orthogonal)
nbItaly2 <- poly2nb(Italy_map, queen = TRUE) # FALSE means rook

# Force the command to tolerate the presence of isolated areas (islands)
# nbItaly2 <- poly2nb(Italy_map, queen = TRUE, zero.policy = TRUE)

# or...
# Compute the minimum threshold distance
nbItaly2 <- dnearneigh(coords, d1=0, d2=380000, longlat = FALSE)
# Create a W matrix
# with row-standardization of the W matrix
wItaly2 <- nb2listw(nbItaly2, glist = NULL, style = "W")

Let us recall the okun_ols model.

okun_ols$call
## lm(formula = variation_of_unemployment_rate ~ variation_of_real_gdp, 
##     data = okun)

The Moran’s I test

For this case (1a, 1b, and 1c), we do not use wItaly2 obtained from the spatial vector file approach, but wItaly1 from the .GAL file approach.

lm.morantest(okun_ols, wItaly1)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = variation_of_unemployment_rate ~
## variation_of_real_gdp, data = okun)
## weights: wItaly1
## 
## Moran I statistic standard deviate = 2.8777, p-value = 0.002003
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       0.40624326      -0.06861045       0.02722956

The test rejects the null hypothesis and accepts the alternative hypothesis of spatial autocorrelation.

Strictly speaking, the Moran’s I test does not have an explicit alternative hypothesis (nor does the Durbin-Watson test). The Moran’s I and D-W tests both suggest, respectively, positive spatial autocorrelation and positive autocorrelation (alternative hypothesis: greater).

Consequence

Autocorrelation inflates the t- and F-statistics, leading us to declare significant coefficients that might otherwise be rejected.

All the other tests (normality, homoscedasticity) are also biased and lead to misleading conclusions.

We must re-estimate the model since the OLS is not the proper approach.

The Spatial Lag Model (SLM)

The okun_ols model neglects the spatial effects, but we observed a significant spatial autocorrelation.

Let us re-estimate the model as a Spatial Lag Model (SLM) and compute the impact measures.

  • The SLM focuses on the spatial interactions. We ‘know’ the structure of the spatial relationships.
  • \(\rho < 1\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
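The SLM estimation itself is not shown in the text; it presumably comes from a call to lagsarlm() (historically in spdep, moved to the spatialreg package in recent versions), matching the Call printed in the summary below:

```r
library(spdep) # lagsarlm() now lives in spatialreg in recent versions

# Presumed maximum-likelihood estimation of the Spatial Lag Model
okun_lag_ml <- lagsarlm(variation_of_unemployment_rate ~ variation_of_real_gdp,
                        data = okun, listw = wItaly1, type = "lag")
```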

The OLS model and…

## lm(formula = variation_of_unemployment_rate ~ variation_of_real_gdp, 
##     data = okun)
okun_ols$coefficients
##           (Intercept) variation_of_real_gdp 
##             10.971282             -3.326387

…the SLM.

## lagsarlm(formula = variation_of_unemployment_rate ~ variation_of_real_gdp, 
##     data = okun, listw = wItaly1, type = "lag")
okun_lag_ml$coefficients
##           (Intercept) variation_of_real_gdp 
##              3.181171             -1.154342

In the OLS framework, a 1-unit increase in variation_of_real_gdp in one region changes variation_of_unemployment_rate by -3.326387.

Let us print all the results.

summary(okun_lag_ml)
## 
## Call:
## lagsarlm(formula = variation_of_unemployment_rate ~ variation_of_real_gdp, 
##     data = okun, listw = wItaly1, type = "lag")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18381 -0.79765 -0.26961  0.28063  2.90078 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)            3.18117    1.05224  3.0232 0.002501
## variation_of_real_gdp -1.15434    0.42195 -2.7357 0.006224
## 
## Rho: 0.74698, LR test value: 23.274, p-value: 1.4052e-06
## Asymptotic standard error: 0.10194
##     z-value: 7.3278, p-value: 2.3404e-13
## Wald statistic: 53.697, p-value: 2.3392e-13
## 
## Log likelihood: -34.50671 for lag model
## ML residual variance (sigma squared): 1.4351, (sigma: 1.198)
## Number of observations: 20 
## Number of parameters estimated: 4 
## AIC: 77.013, (AIC for lm: 98.287)
## LM test for residual autocorrelation
## test value: 2.8941, p-value: 0.088904

The \(\rho\) parameter is significant and equals 0.7469753. There is a spatial effect (an influence from the neighbours).

Let us also print additional results: the ‘impacts’.
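The okun_lag_ml_impact object presumably comes from the impacts() function; a sketch under that assumption:

```r
# Presumed computation of the direct, indirect, and total impact measures
okun_lag_ml_impact <- impacts(okun_lag_ml, listw = wItaly1)
```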

okun_lag_ml_impact
## Impact measures (lag, exact):
##                          Direct  Indirect     Total
## variation_of_real_gdp -1.563516 -2.998656 -4.562172

Within the SLM framework, we instead have a direct effect of -1.5635161, smaller in magnitude than the OLS coefficient.

On the other hand, given the presence of indirect effects produced by the spillovers from the neighbouring regions, the total effect is, on average, -4.5621717, larger in magnitude than the OLS coefficient.

The spatial model incorporates a mechanism of geographical transmission.

In the Spatial Lag Model, the spillover effect is incorporated in the slope: the same variation of the real GDP has a different effect under the SLM than under OLS.

We can notice the additional spillover effect: the cluster of red squares (the North) pulls the line. Therefore, \(x\) (variation_of_real_gdp) is not the only explanatory driver of \(y\) (variation_of_unemployment_rate); the model also accounts for the spatial effect.

1c, Phillips curve for the 20 Italian regions

The Phillips curve is an inverse relationship between the rate of unemployment and the rate of inflation. It states that lower unemployment is associated with a higher rate of inflation:

\[\%\Delta Price~index \rightarrow \%\Delta Unemployment~rate\]

We present the OLS results before visualizing the relationship. Let us start from the beginning.

The dataset

Region % variation unemployment rate % variation price index
Piemonte 4.20 2.1
Valle d’Aosta 3.20 1.4
Lombardia 3.40 1.7
Trentino-Alto Adige 2.75 1.8
Veneto 3.30 1.5
Friuli-Venezia Giulia 3.40 1.8
Liguria 4.80 1.7
Emilia-Romagna 2.90 1.9
Toscana 4.30 1.6
Umbria 4.60 1.7
Marche 4.20 1.6
Lazio 6.40 2.0
Abruzzo 6.20 1.6
Molise 8.10 1.9
Campania 11.20 1.8
Apulia 11.20 2.3
Basilicata 9.60 2.0
Calabria 11.30 2.4
Sicily 13.00 2.4
Sardegna 9.90 1.9

The classical linear regression model

The graph above shows that the variation in the unemployment rate is systematically higher in the South and lower in the North. In the South, residuals are more positive and counterbalance the negative residuals from the North, making the residuals look normally distributed.

This is a visual indication of residual spatial autocorrelation.

The results of the OLS model are shown here.
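The phillips_ols object is not fitted explicitly in the text; it presumably comes from a call such as (the formula matches the Call shown in the summary output):

```r
# Presumed OLS fit of the Phillips curve for the 20 Italian regions
phillips_ols <- lm(p_variation_unemployment_rate ~ p_variation_price_index, data = phillips)
```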

summary(phillips_ols)
## 
## Call:
## lm(formula = p_variation_unemployment_rate ~ p_variation_price_index, 
##     data = phillips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3404 -1.3598  0.0828  1.4406  5.2836 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -9.827      3.720  -2.642 0.016568 *  
## p_variation_price_index    8.746      1.984   4.409 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.437 on 18 degrees of freedom
## Multiple R-squared:  0.5193, Adjusted R-squared:  0.4926 
## F-statistic: 19.44 on 1 and 18 DF,  p-value: 0.0003383

The test results.

library(moments)
jarque.test(okun_ols$residuals) # NB: as printed, this tests the okun_ols residuals; phillips_ols$residuals was presumably intended

library(lmtest)
bptest(phillips_ols)
## 
##  Jarque-Bera Normality Test
## 
## data:  okun_ols$residuals
## JB = 1.2331, p-value = 0.5398
## alternative hypothesis: greater
## 
## 
##  studentized Breusch-Pagan test
## 
## data:  phillips_ols
## BP = 0.25565, df = 1, p-value = 0.6131

In summary: high significance, normality, and homoscedasticity.

library(lmtest)
dwtest(phillips_ols)

library(car)
durbinWatsonTest(phillips_ols)
## 
##  Durbin-Watson test
## 
## data:  phillips_ols
## DW = 1.337, p-value = 0.04667
## alternative hypothesis: true autocorrelation is greater than 0
## 
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1981906      1.336989   0.096
##  Alternative hypothesis: rho != 0

We accept the alternative hypothesis of autocorrelation (\(\rho \neq 0\)), although the evidence is weaker here: the dwtest p-value (0.047) is just below 5% while the two-sided durbinWatsonTest p-value (0.096) is only marginal.

The Moran’s I test

We compute the Moran’s I test using the same W matrix we calculated above.

The .GAL file approach.

lm.morantest(phillips_ols, wItaly1)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = p_variation_unemployment_rate ~
## p_variation_price_index, data = phillips)
## weights: wItaly1
## 
## Moran I statistic standard deviate = 2.3245, p-value = 0.01005
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       0.31150460      -0.07022407       0.02696905

The test accepts the alternative hypothesis of (positive) spatial autocorrelation.

We must re-estimate the model since the OLS is not the proper approach.

The ‘pure’ Spatial Autoregression (SAR)

As an alternative model, we wish to estimate a purely spatial autoregressive model with a constant term.

We want to test if the p_variation_price_index in one region affects the variations in the neighboring regions through a mechanism of inflation contagion.

  • The ‘pure’ SAR simply captures the neighbouring effects without any explanatory variable(s).
  • It corresponds to the general model with \(\beta = 0\) and either \(\lambda = 0\) or \(\rho = 0\).
  • \(\rho\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.

The results of the estimation of the spatial autoregressive model using maximum likelihood.

phillips_ar_ml <- spautolm(p_variation_price_index ~ 1, data = phillips, listw = wItaly1)

summary(phillips_ar_ml)
## 
## Call: spautolm(formula = p_variation_price_index ~ 1, data = phillips, 
##     listw = wItaly1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.513956 -0.166279 -0.037528  0.116756  0.472830 
## 
## Coefficients: 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.847127   0.079564  23.216 < 2.2e-16
## 
## Lambda: 0.26428 LR test value: 1.4604 p-value: 0.22686 
## Numerical Hessian standard error of lambda: 0.20979 
## 
## Log likelihood: -1.809013 
## ML residual variance (sigma squared): 0.068531, (sigma: 0.26178)
## Number of observations: 20 
## Number of parameters estimated: 3 
## AIC: 9.618

We get an insignificant \(\lambda\); only the constant term is significantly different from zero.

The logLik is close to 0 (a good sign) and the AIC is very low (another good sign).

Consequence

Only the constant term is significantly different from zero; since \(\lambda\) is insignificant, we conclude there is no significant geographical transmission of the price variations at the regional level. In other words, we could trust the OLS model.

The spatial lag and the spatial error tests

Let us test the null hypothesis of ‘no spatial autocorrelation’ using the spatial lag and spatial error as explicit alternatives.

\(H_0\): ‘no residuals spatial dependence’ (or ‘residuals spatial independence’).

Let us recall the linear OLS model.

phillips_ols
## 
## Call:
## lm(formula = p_variation_unemployment_rate ~ p_variation_price_index, 
##     data = phillips)
## 
## Coefficients:
##             (Intercept)  p_variation_price_index  
##                  -9.827                    8.746

We then use all four tests: LMerr, LMlag, RLMerr, RLMlag. The capital ‘R’ stands for ‘robust’, the robust/resistant version of the tests.
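These statistics presumably come from the lm.LMtests() function in spdep; a sketch under that assumption:

```r
library(spdep)

# Presumed call computing the four Lagrange multiplier tests
summary(lm.LMtests(phillips_ols, listw = wItaly1,
                   test = c("LMerr", "LMlag", "RLMerr", "RLMlag")))
```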

The results.

Test Statistic Par. p-value
LMerr 2.700108 1 0.1003415
LMlag 12.43271 1 0.0004219
RLMerr 1.700998 1 0.1921575
RLMlag 11.4336 1 0.0007213

The LM error models (LMerr, RLMerr) accept the null hypothesis of ‘no residuals spatial dependence’.

However, the LM lag models (LMlag, RLMlag) reject the null hypothesis of ‘no residuals spatial dependence’.

‘No residuals spatial dependence’ is synonymous with ‘no spatial autocorrelation’. Two tests confirm the Moran’s I test while two tests contradict it.

Consequence

  • The Moran’s I test detects spatial autocorrelation.
  • The SAR has no significant geographical transmission of the price variations at the regional level.
  • The tests above point to no spatial dependence in the SEM, where the error term accounts for unobservable features or omitted variables associated with spatial autocorrelation.
  • The same tests point to spatial dependence in the SLM, where the spatial coefficient incorporates the spatial effects as an additional predictor.

We can either trust the OLS model or try the SLM.

Case 2: UK macroeconomics

The dataset

The dataset displays each region’s Gross Value Added (GVA) as a percentage of the country’s GDP, labour productivity relative to the country average (country total = 100), and the business birth rate.

Region % GVA of GDP % Labour Productivity % Business Birth Rate
Wales 3.6 81.5 9.3
Scotland 8.3 96.9 10.9
Northern Ireland 2.3 82.9 6.5
North of England 3.2 86.2 11.2
North West England 9.5 88.6 11.1
Yorkshire & Humberside 6.9 84.7 10.5
East Midlands 6.2 89.2 10.3
West Midlands 7.3 89.1 10.5
East Anglia 8.7 96.8 10.5
Greater London 21.6 139.7 14.6
South East England 14.7 108.3 10.8
South West England 7.7 89.8 9.6

The classical linear regression model

We estimate the gva_p_of_uk as a function of both labor_productivity_p and business_birth_rate_p. We compute four variations.

uk_model1 <- lm(gva_p_of_uk ~ labor_productivity_p + business_birth_rate_p, data = uk)

uk_model2 <- lm(gva_p_of_uk ~ labor_productivity_p, data = uk)

uk_model3 <- lm(gva_p_of_uk ~ business_birth_rate_p, data = uk)

uk_model4 <- lm(labor_productivity_p ~ business_birth_rate_p, data = uk)

The resulting F-statistics and adjusted \(R^2\).

Model 1 Model 2 Model 3 Model 4
value 43.99223 88.97029 16.97958 15.17212
numdf 2.00000 1.00000 1.00000 1.00000
dendf 9.00000 10.00000 10.00000 10.00000
adj R sq. 0.88658 0.88886 0.59228 0.56301
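As a quick arithmetic cross-check, the F-statistic of a simple regression can be recovered from its adjusted \(R^2\) via \(\bar R^2 = 1 - (1 - R^2)\tfrac{n-1}{n-k-1}\). A minimal sketch in Python, using Model 2’s figures from the table above:

```python
# Recover Model 2's F-statistic from its adjusted R-squared.
n, k = 12, 1            # 12 UK regions, 1 regressor
adj_r2 = 0.88886        # Model 2, from the table above

# Invert adj R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2 = 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)

# F = (R2 / k) / ((1 - R2) / (n - k - 1))
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(r2, 5), round(f_stat, 2))  # 0.89896 88.97
```

The result matches the reported F-statistic of 88.97029 up to rounding of the adjusted \(R^2\).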

The best model is Model 2 which attains the highest significance level of the F-statistic and the highest adjusted \(R^2\).

##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)          -21.3886580 3.19236731 -6.699936 5.364996e-05
## labor_productivity_p   0.3146017 0.03335328  9.432406 2.708414e-06

The coefficients are all highly significant.

Model 1 is the second best.

##                          Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)           -22.3111831  3.3859356 -6.5893702 0.0001004979
## labor_productivity_p    0.2774967  0.0534564  5.1910845 0.0005708193
## business_birth_rate_p   0.4223862  0.4724337  0.8940646 0.3945672217

One of the coefficients, business_birth_rate_p, is not significant.

Residuals from Model 1.

Residuals
Wales -0.6329899
Scotland -0.8822571
Northern Ireland -1.1388039
North of England -3.1397582
North West England 2.5364883
Yorkshire & Humberside 1.2721572
East Midlands -0.5921008
West Midlands 0.4511717
East Anglia -0.2855529
Greater London -1.0219450
South East England 2.3965191
South West England 1.0370716

We test normality and homoscedasticity of Model 1’s residuals.

# normality
library(moments)
jarque.test(uk_model1$residuals)

library(tseries)
jarque.bera.test(uk_model1$residuals)

# homoscedasticity
library(lmtest)
bptest(uk_model1)
## 
##  Jarque-Bera Normality Test
## 
## data:  uk_model1$residuals
## JB = 0.091982, p-value = 0.9551
## alternative hypothesis: greater
## 
## 
##  Jarque Bera Test
## 
## data:  uk_model1$residuals
## X-squared = 0.091982, df = 2, p-value = 0.9551
## 
## 
##  studentized Breusch-Pagan test
## 
## data:  uk_model1
## BP = 1.5183, df = 2, p-value = 0.4681

We accept the null hypotheses of normality and homoscedasticity.
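For reference, the Jarque–Bera statistic combines the sample skewness \(S\) and kurtosis \(K\), and is asymptotically \(\chi^2\) distributed with 2 degrees of freedom (hence df = 2 in the tseries output above):

\[JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)\]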

Let us explore the residuals. We sort them first.

Residuals
North West England 2.5364883
South East England 2.3965191
Yorkshire & Humberside 1.2721572
South West England 1.0370716
West Midlands 0.4511717
East Anglia -0.2855529
East Midlands -0.5921008
Wales -0.6329899
Scotland -0.8822571
Greater London -1.0219450
Northern Ireland -1.1388039
North of England -3.1397582
dotchart(model1_residuals, las = 1, cex = 0.9)
abline(v = 0, lty = 3, col = 'darkgray')


Consequence

The uk_model1:

  • overestimates the GVA (negative residuals) in the eastern regions (East Anglia, East Midlands and Greater London), in the North (North of England, Scotland and Northern Ireland) and in Wales and,
  • underestimates it (positive residuals) in the centre and the South.

Thus there are distinct geographical patterns in the residuals.

Computing the W matrix

We create the .GAL file based on the following numbered list.

as.data.frame(uk[, 3])
##                   uk[, 3]
## 1                   Wales
## 2                Scotland
## 3        Northern Ireland
## 4        North of England
## 5      North West England
## 6  Yorkshire & Humberside
## 7           East Midlands
## 8           West Midlands
## 9             East Anglia
## 10         Greater London
## 11     South East England
## 12     South West England

We consider Northern Ireland to be adjacent to Scotland, and Wales to be adjacent to Northern Ireland. From these assumptions, we derive the W matrix.

library(spdep)

# List the regions
uk_regions <- seq(1,12)

# Read the file mapping contiguity
nbUK <- read.gal("GAL/UK.GAL", region.id = uk_regions)

# Convert to a W matrix
# with the row-standardized option
wUK <- nb2listw(nbUK, style = "W")

Spatially lagged values

The W matrix transforms the variables in the model. How? By creating a spatially lagged version of each variable.

Compute the ‘spatially lagged values’ of the variable gva_p_of_uk for the 12 regions of the UK with this formula.

\[L(y) = Wy\]

Where \(W\) is the W matrix and \(y\) is the vector of the variable.

An overview of the row-standardized W matrix in algebraic format.

dim(uk_matrix)
## [1] 12 12
uk_matrix[1:5, 1:5]
##      Scotland Northern.Ireland Wales North.of.England North.West.England
## [1,]        0              0.5   0.0              0.5          0.0000000
## [2,]        1              0.0   0.0              0.0          0.0000000
## [3,]        0              0.0   0.0              0.0          0.3333333
## [4,]        0              0.0   0.0              0.0          0.5000000
## [5,]        0              0.0   0.2              0.2          0.0000000

We compute the product (%*%).

Llabor_productivity_p <- uk_matrix %*% uk$labor_productivity_p

Lgva_p_of_uk <- uk_matrix %*% uk$gva_p_of_uk
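Each lagged value is simply the row-standardized weighted average of the neighbours’ values. A minimal pure-Python sketch on a toy 3-region example (illustrative values, not the UK data):

```python
# Toy contiguity: region 0 ~ 1; region 1 ~ 0, 2; region 2 ~ 1
neighbours = {0: [1], 1: [0, 2], 2: [1]}
y = [10.0, 20.0, 40.0]

# With a row-standardized W, L(y)_i = (W y)_i is the mean of region i's neighbours
lag = [sum(y[j] for j in neighbours[i]) / len(neighbours[i])
       for i in range(len(y))]

print(lag)  # [20.0, 25.0, 20.0]
```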

The results. L stands for lagged or \(L(y)\).

# y and L(y) of labor productivity
data.frame(Region = uk[, 3], Labor = uk$labor_productivity_p, L_Labour = Llabor_productivity_p)
##                    Region Labor  L_Labour
## 1                   Wales  81.5  91.55000
## 2                Scotland  96.9  81.50000
## 3        Northern Ireland  82.9 105.83333
## 4        North of England  86.2  86.65000
## 5      North West England  88.6  86.42000
## 6  Yorkshire & Humberside  84.7  87.96667
## 7           East Midlands  89.2 101.72000
## 8           West Midlands  89.1  93.52000
## 9             East Anglia  96.8  98.70000
## 10         Greater London 139.7  98.75000
## 11     South East England 108.3 100.92000
## 12     South West England  89.8 108.30000
# y and L(y) of labor productivity
data.frame(Region = uk[, 3], GVA = uk$gva_p_of_uk, L_GVA = Lgva_p_of_uk)
##                    Region  GVA     L_GVA
## 1                   Wales  3.6  5.750000
## 2                Scotland  8.3  3.600000
## 3        Northern Ireland  2.3 12.433333
## 4        North of England  3.2  8.200000
## 5      North West England  9.5  5.180000
## 6  Yorkshire & Humberside  6.9  6.666667
## 7           East Midlands  6.2 11.080000
## 8           West Midlands  7.3  9.200000
## 9             East Anglia  8.7 11.000000
## 10         Greater London 21.6 10.450000
## 11     South East England 14.7 10.300000
## 12     South West England  7.7 14.700000

The Moran scatterplot

Draw a scatter diagram with the variable gva_p_of_uk on the x-axis and its spatially lagged version Lgva_p_of_uk on the y-axis.

There is a positive relationship between gva_p_of_uk and its spatially lagged value: this suggests (positive) spatial autocorrelation.

The scatterplot is done with the plot function. However, the spdep package has its own moran.plot.

moran.plot(uk$gva_p_of_uk, wUK, labels = uk$row_names, col = 'blue3', xlab = 'GVA', ylab = 'L(GVA)', las = 1 )

Test spatial autocorrelation: the Moran’s I test

We recall Model 1.

## lm(formula = gva_p_of_uk ~ labor_productivity_p + business_birth_rate_p, 
##     data = uk)

We test spatial autocorrelation among the residuals.

lm.morantest(uk_model1, wUK)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = gva_p_of_uk ~ labor_productivity_p +
## business_birth_rate_p, data = uk)
## weights: wUK
## 
## Moran I statistic standard deviate = -0.054346, p-value = 0.5217
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       -0.1697141       -0.1611920        0.0245899

The test accepts the null hypothesis of no autocorrelation. This result does not support the pattern in the Moran scatterplot: there may be some spatial autocorrelation, but the test does not confirm the visualization. The W matrix may be the reason; we built it from a .GAL file, and deriving it from a map might yield other results.

Consequence

In light of these results, it may be worth finding an alternative to the OLS model and checking whether that model performs better.

Case 3: US and spatial analyses (used car price and taxes in 48 states)

The dataset

The dataset is part of the spdep package. It relates the price of used cars by state in 1960 to the tax and delivery charges for new cars by state over the period 1955–59.

head(used.cars, 3)
##    tax.charges price.1960
## AL         129       1461
## AZ         218       1601
## AR         176       1469

The OLS model

We estimate the model and run two tests. tax.charges is the explanatory variable.

used_ols <- lm(price.1960 ~ tax.charges, data = used.cars)

summary(used_ols)
## 
## Call:
## lm(formula = price.1960 ~ tax.charges, data = used.cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.701  -45.053   -1.461   43.400  107.807 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1435.7506    27.5796  52.058  < 2e-16 ***
## tax.charges    0.6872     0.1754   3.918 0.000294 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.01 on 46 degrees of freedom
## Multiple R-squared:  0.2503, Adjusted R-squared:  0.234 
## F-statistic: 15.35 on 1 and 46 DF,  p-value: 0.0002939

The F- and t-tests are all significant. The \(R^2\) values are low.

We run some tests.

# Normality
library(moments)
jarque.test(used_ols$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  used_ols$residuals
## JB = 1.8906, p-value = 0.3886
## alternative hypothesis: greater
# Homoscedasticity
library(lmtest)
bptest(used_ols)
## 
##  studentized Breusch-Pagan test
## 
## data:  used_ols
## BP = 0.0013219, df = 1, p-value = 0.971

The tests accept the null hypotheses of normality and homoscedasticity.

The Moran’s I test – Part 1

To perform the test, we need to step back. First, we need to compute the W matrix. There are many ways to do so.

We can compute a W matrix from a map.

A first map

From the US Census Bureau website, we can download spatial vector files (shapefiles).

We download a spatial vector file, unzip it, and read it.

library(rgdal)

# Read the files, without the extension shp, shx and dbf
US_State <- readOGR("US_State_shapefiles/cb_2016_us_state_500k.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "US_State_shapefiles/cb_2016_us_state_500k.shp", layer: "cb_2016_us_state_500k"
## with 56 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER

We draw a basic map of the continental states.

# basic plot
plot(US_State, lty = 1, col = 'lightgrey', border = 'snow', xlim = c(-125, -65), ylim = c(25, 50))

Computing the W matrix from the map

library(spdep)

# Compute minimum threshold distance
coords <- coordinates(US_State)
k1 <- knn2nb(knearneigh(coords))
all.linked <- max(unlist(nbdists(k1, coords)))
countnb <- dnearneigh(coords, d1=0, d2=all.linked, longlat = FALSE)

# Create a W matrix
# row-standardized
wUS_State <- nb2listw(countnb, glist = NULL, style = "W")
wUS_State
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 56 
## Number of nonzero links: 2284 
## Percentage nonzero weights: 72.83163 
## Average number of links: 40.78571 
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0       S1       S2
## W 56 3136 56 11.37061 226.2497
# Verify the 48 regions are accounted for
length(wUS_State$weights)
## [1] 56

The matrix accounts for 56 territories (US states, the District of Columbia, and remote dependencies).
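The threshold step above links every pair of areas closer than the largest first-nearest-neighbour distance, which guarantees that no area is left without a neighbour. A pure-Python sketch of that logic (toy coordinates, not the shapefile centroids):

```python
import math

# Toy centroids, standing in for coordinates(US_State)
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Distance to each point's single nearest neighbour (knearneigh, k = 1)
nn_dist = [min(dist(coords[i], coords[j]) for j in range(len(coords)) if j != i)
           for i in range(len(coords))]

# 'all.linked': the largest of those nearest-neighbour distances
threshold = max(nn_dist)

# dnearneigh(d1 = 0, d2 = threshold): neighbours are all pairs within the band
nb = [[j for j in range(len(coords))
       if j != i and dist(coords[i], coords[j]) <= threshold]
      for i in range(len(coords))]

print(threshold, nb)  # every area ends up with at least one neighbour
```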

Computing the W matrix from another map

We now download a spatial vector file (shapefiles) from the National Cancer Institute.

library(rgdal)

US_State <- readOGR('US_State_shapefiles/US_State.shp')
## OGR data source with driver: ESRI Shapefile 
## Source: "US_State_shapefiles/US_State.shp", layer: "US_State"
## with 51 features
## It has 11 fields

We draw another simple map of the continental states (this map has a different projection).

plot(US_State, lty = 1, col = 'lightgrey', border = 'snow')

We compute the W matrix from the map.

library(spdep)

# Compute minimum threshold distance
coords <- coordinates(US_State)
k1 <- knn2nb(knearneigh(coords))
all.linked <- max(unlist(nbdists(k1, coords)))
countnb <- dnearneigh(coords, d1=0, d2=all.linked, longlat = FALSE)

# Create a W matrix
# row-standardized
wUS_State <- nb2listw(countnb, glist = NULL, style = "W")
wUS_State
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 51 
## Number of nonzero links: 1134 
## Percentage nonzero weights: 43.59862 
## Average number of links: 22.23529 
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0       S1       S2
## W 51 2601 51 6.519635 208.3489
# Verify the 48 regions are accounted for
length(wUS_State$weights)
## [1] 51

The matrix accounts for 51 territories (US states and the District of Columbia).

We can use these matrices to compute the Moran’s I test. Ideally, we should reduce the number of territories to 48, removing Alaska, Hawaii, and the District of Columbia.

Centroids

We match the 48 states in usa48.nb (a neighbour list shipped with the used.cars dataset in the spdep package) to R’s built-in state.abb and state.center data.

library(spdep)

data(used.cars)
summary(usa48.nb)
## Neighbour list object:
## Number of regions: 48 
## Number of nonzero links: 214 
## Percentage nonzero weights: 9.288194 
## Average number of links: 4.458333 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8 
##  1  4  9 11 10  9  2  2 
## 1 least connected region:
## ME with 1 link
## 2 most connected regions:
## MO TN with 8 links
cont_st <- match(attr(usa48.nb, "region.id"), state.abb)
cents <- as.matrix(as.data.frame(state.center))[cont_st,]

We plot a network of the 48 states. The network is useful for visualizing how a W matrix is computed, whether we use contiguity or distance.

plot(usa48.nb, cents, col = 'darkgray')

We can also map out a regional network.

IDs <- as.character(state.division[cont_st])
agg_cents <- aggregate(cents, list(IDs), mean)

agg_nb <- aggregate(usa48.nb, IDs)

plot(agg_nb, agg_cents[, 2:3], xlim = c(-125, -65), ylim = c(25, 50), col = 'darkgray')
text(agg_cents[, 2:3], agg_cents[, 1], cex = 0.8, col ='red3')

The Moran’s I test – Part 2

The spdep package already comes with data: usa48.nb, a neighbour list included with the used.cars dataset.

library(spdep)

data(used.cars)
summary(usa48.nb)
## Neighbour list object:
## Number of regions: 48 
## Number of nonzero links: 214 
## Percentage nonzero weights: 9.288194 
## Average number of links: 4.458333 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8 
##  1  4  9 11 10  9  2  2 
## 1 least connected region:
## ME with 1 link
## 2 most connected regions:
## MO TN with 8 links

We compute the W matrix and run two Moran’s I tests on the used.cars dataset.

# Compute the W matrix
W <- nb2listw(usa48.nb)

# Compute the test on the linear model (lm) results
lm.morantest(used_ols, W)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = price.1960 ~ tax.charges, data = used.cars)
## weights: W
## 
## Moran I statistic standard deviate = 6.3869, p-value = 8.466e-11
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.574817771     -0.030300549      0.008976437
# Alternative, compute the test directly on the data
moran.test(used.cars$price.1960, W)
## 
##  Moran I test under randomisation
## 
## data:  used.cars$price.1960  
## weights: W  
## 
## Moran I statistic standard deviate = 8.1752, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.783561543      -0.021276596       0.009692214

Both tests conclude that there is positive spatial autocorrelation (alternative hypothesis: greater).
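With a row-standardized W (so the total weight \(S_0\) equals \(n\)), Moran’s I reduces to \(I = \frac{z'Wz}{z'z}\) where \(z = y - \bar{y}\). A pure-Python toy check, with four areas on a line and values rising along it, so I should come out positive:

```python
# Four areas on a line; neighbours are i-1 and i+1, row-standardized weights
y = [1.0, 2.0, 3.0, 4.0]
neighbours = [[1], [0, 2], [1, 3], [2]]

mean = sum(y) / len(y)
z = [v - mean for v in y]

# (W z)_i with W row-standardized: the mean of the neighbours' deviations
wz = [sum(z[j] for j in nb) / len(nb) for nb in neighbours]

# Moran's I = z'Wz / z'z (the n / S0 factor cancels)
moran_i = sum(a * b for a, b in zip(z, wz)) / sum(a * a for a in z)

print(moran_i)  # 0.4: positive spatial autocorrelation
```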

Residuals spatial dependence

This is an alternative test.

# With the W matrix
lm.LMtests(used_ols, W)
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = price.1960 ~ tax.charges, data = used.cars)
## weights: W
## 
## LMErr = 31.793, df = 1, p-value = 1.715e-08

The test rejects the null hypothesis in favour of the alternative of ‘residuals spatial dependence’, a synonym for spatial autocorrelation, and confirms the Moran’s I tests.

The Moran scatterplot

# With the W matrix
moran.plot(used.cars$price.1960, W, labels = rownames(used.cars), col = 'blue3', xlab = '1960 price', ylab = 'L(1960 price)', las = 1 )

There is a positive relationship between price.1960 and its spatially lagged value: this is the demonstration of positive spatial autocorrelation.

Because the model is plagued with spatial autocorrelation, the estimates for the coefficients, the F-statistic as well as the normality and homoscedastic tests are all unreliable.

We must find an alternative model.

The Spatial Error Model (SEM)

  • The model corrects for spatial autocorrelation due to the use of spatial data. We do not know the structure of the spatial relationships. The error term accounts for unobservable features or omitted variables.
  • \(\rho = 0\): the moving average parameter (the spatial effect, smoothing factor or spatial coefficient) is constrained to zero.
  • \(\lambda \neq 0\): the spatial autoregression parameter (the spatial dependence parameter or spatial error coefficient) is non-zero.
  • We estimate the model by maximum likelihood (ML).
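In matrix form, the SEM is conventionally written (consistent with the \(\lambda\) notation above):

\[y = X\beta + u~~~~~~~~~~u = \lambda W u + \varepsilon\]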
used_se_ml <- errorsarlm(price.1960 ~ tax.charges, data = used.cars, listw = W)

summary(used_se_ml)
## 
## Call:errorsarlm(formula = price.1960 ~ tax.charges, data = used.cars, 
##     listw = W)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.824 -17.459   2.406  21.278  64.597 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5283e+03 3.1963e+01 47.8167   <2e-16
## tax.charges 8.8309e-02 1.1923e-01  0.7406   0.4589
## 
## Lambda: 0.819, LR test value: 40.899, p-value: 1.603e-10
## Asymptotic standard error: 0.074051
##     z-value: 11.06, p-value: < 2.22e-16
## Wald statistic: 122.32, p-value: < 2.22e-16
## 
## Log likelihood: -240.7163 for error model
## ML residual variance (sigma squared): 1043.9, (sigma: 32.309)
## Number of observations: 48 
## Number of parameters estimated: 4 
## AIC: 489.43, (AIC for lm: 528.33)

The slope coefficient of the explanatory variable tax.charges is not significant, but the Wald statistic is highly significant. The variability of price.1960 cannot be explained by tax.charges.

On the other hand, the \(\lambda\) parameter is significant.

There exists some spatial dependence among the residuals that contributes to most of the model variability. This spatial dependence is captured by the spatial error coefficient.
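As a sanity check, the AIC reported in the ML output can be reproduced from the log-likelihood, AIC = 2k − 2 log L, with k = 4 estimated parameters. A one-line check in Python:

```python
# Figures from the errorsarlm() summary above
log_lik = -240.7163   # Log likelihood for the error model
k = 4                 # Number of parameters estimated

aic = 2 * k - 2 * log_lik
print(round(aic, 2))  # 489.43, matching the reported AIC
```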

The SEM with the Feasible Generalized Least Squares (FGLS)

The FGLS is an alternative to the ML.

used_gse_gls <- GMerrorsar(price.1960 ~ tax.charges, data = used.cars, listw = W)

summary(used_gse_gls)
## 
## Call:GMerrorsar(formula = price.1960 ~ tax.charges, data = used.cars, 
##     listw = W)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -120.6190  -57.9041   -1.5585   53.3029  116.1085 
## 
## Type: GM SAR estimator
## Coefficients: (GM standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1512.98359   28.69940 52.7183   <2e-16
## tax.charges    0.17802    0.15018  1.1854   0.2359
## 
## Lambda: 0.65398 (standard error): 0.21084 (z-value): 3.1017
## Residual variance (sigma squared): 1672.5, (sigma: 40.896)
## GM argmin sigma squared: 1692.6
## Number of observations: 48 
## Number of parameters estimated: 4

The FGLS estimates confirm previous results with the ML. tax.charges is not significant.

The SEM reduced to a SAR model

  • The ‘pure’ SAR simply captures the neighbouring effects without any explanatory variable(s).
  • \(\rho\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.

The SEM reduces to a ‘pure’ SAR when \(\beta = 0\): only the intercept and the spatial parameter remain.
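With only an intercept left, the specification fitted by spautolm keeps the standard error-SAR form (\(\iota\) a vector of ones):

\[y = \alpha\iota + u~~~~~~~~~~u = \lambda W u + \varepsilon\]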

used_se_ml1 <- spautolm(price.1960 ~ 1, data = used.cars, listw = W)

summary(used_se_ml1)
## 
## Call: spautolm(formula = price.1960 ~ 1, data = used.cars, listw = W)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.152 -18.221   5.249  21.631  63.498 
## 
## Coefficients: 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1542.607     27.321  56.462 < 2.2e-16
## 
## Lambda: 0.82919 LR test value: 54.202 p-value: 1.8086e-13 
## Numerical Hessian standard error of lambda: 0.065233 
## 
## Log likelihood: -240.977 
## ML residual variance (sigma squared): 1045.4, (sigma: 32.333)
## Number of observations: 48 
## Number of parameters estimated: 3 
## AIC: 487.95

All the coefficients and statistics become highly significant.

By reducing the model to a SAR, the price in one state is simply explained by the price in the neighbouring states.

Visualizing on a map

We can depict this conjecture on a continental map of the US. We can see ‘groupings’ and localized price patterns.

library(rgdal)

# Read the files, without the extension shp, shx and dbf
US_State <- readOGR("US_State_shapefiles/cb_2016_us_state_500k.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "US_State_shapefiles/cb_2016_us_state_500k.shp", layer: "cb_2016_us_state_500k"
## with 56 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
# The map data
US_State$STUSPS <- as.character(US_State$STUSPS)

# An alternative data frame for used.cars
used.cars2 <- read.csv('CSV/used.cars2.csv')
used.cars2$state <- as.character(used.cars2$state)
used.cars2$lstate <- as.character(used.cars2$lstate)

# Bind the two
US_State_used.cars <- cbind(data.frame(US_State), used.cars2)

# Set up five categories and assign colors
breaks <- round(quantile(US_State_used.cars$price.1960, seq(0, 1, 1/5), na.rm = TRUE), 1)

library(RColorBrewer)

cols <- brewer.pal(length(breaks), "Blues")

# Draw the map
plot(US_State, col = cols[findInterval(US_State_used.cars$price.1960, breaks, all.inside = TRUE)], xlim = c(-123, -68), ylim = c(26, 50))

# Use a command from the maptools package
legend('bottomright', inset = 0, legend = maptools::leglabs(breaks), fill = cols, bty = 'n', cex = 0.9)
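The classification step above (quintile breaks followed by findInterval with all.inside = TRUE) can be sketched in pure Python, using toy break points rather than the actual dataset:

```python
import bisect

# Toy break points, standing in for the quantile() output
breaks = [1400, 1500, 1600, 1700, 1850]

def find_interval(x, breaks):
    """Mimic R's findInterval(x, breaks, all.inside = TRUE):
    a 1-based bin index clamped to 1..length(breaks) - 1."""
    i = bisect.bisect_right(breaks, x)
    return min(max(i, 1), len(breaks) - 1)

bins = [find_interval(p, breaks) for p in [1390, 1450, 1600, 1850]]
print(bins)  # [1, 1, 3, 4]
```

Each value is mapped to one of the colour classes; out-of-range values are clamped into the first or last class, as all.inside = TRUE does.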

Case 4: House price determinants in Boston

The dataset

The dataset is part of the MASS package, but we use a ‘corrected’ version from the spdep package.

The data refer to the median house price observed in 506 Boston area census tracts together with a series of variables that can be thought of as being potential determinants of the house value.

##         TOWN TOWNNO TRACT     LON     LAT MEDV CMEDV    CRIM ZN INDUS CHAS
## 1     Nahant      0  2011 -70.955 42.2550 24.0  24.0 0.00632 18  2.31    0
## 2 Swampscott      1  2021 -70.950 42.2875 21.6  21.6 0.02731  0  7.07    0
## 3 Swampscott      1  2022 -70.936 42.2830 34.7  34.7 0.02729  0  7.07    0
##     NOX    RM  AGE    DIS RAD TAX PTRATIO      B LSTAT
## 1 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03

The dataset contains lon/lat coordinates for the 506 observations.

The map

We map the coordinates on the map of Massachusetts.

We create a simple map with maps.

library(maps)

map('state', region = c('mass'), names = TRUE, plot = TRUE, lty = 1)
map.scale(relwidth = 0.15)
points((boston.c$LON)-0.05, (boston.c$LAT)+0.15, col = 'blue3', cex = 0.8)

The points are not accurately positioned; there is a reason, as we will see.

We create another map with ggplot2.

library(maps)
library(ggplot2)

mass <- map_data('state', region = c('mass'))

mass_maps <- ggplot(data = mass,
                    aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = 'black') +
  geom_point(data = boston.c,
             aes(x = LON-0.05, y = LAT+0.15, fill = NULL, group = NULL),
             size = 1,
             colour = 'blue3',
             shape = 1,
             alpha = 0.5) +
  coord_map()

mass_maps

The points are located in Metropolitan Boston.

library(ggmap)

map <- get_map(location = 'Boston', 
               zoom = 9, 
               source = 'google',
               maptype = 'roadmap',
               language = 'en-EN')

ggmap(map)

Let us use the new map to plot the points.

library(ggplot2)

ggmap(map) +
  geom_point(data = boston.c,
             aes(x = LON-0.05, y = LAT+0.15, fill = NULL, group = NULL),
             size = 1,
             colour = 'red',
             shape = 1,
             alpha = 0.5)

Again, the points are not accurately positioned.

The longitude and latitude values in the boston.c dataset are not accurate, so we cannot precisely fit them on the maps.

The reason is that they are derived from Universal Transverse Mercator (UTM) coordinates. Converting UTM coordinates to lon/lat coordinates introduces errors due to the spatial projection.

Here are the UTM coordinates.

head(boston.utm, 3)
##        x       y
## 1 338.73 4679.73
## 2 339.23 4683.33
## 3 340.37 4682.80

We can plot these coordinates. In the plot, the UTM coordinates are in thousands of metres.

plot(boston.utm, col = 'blue3', las = 1, xlab = 'x (in thousands)', ylab = 'y (in thousands)')
grid()

Here is the UTM map for Boston.


We can get UTM maps from Free Online Topo Maps. Each ‘square’ is an actual topographic map we can purchase (paper maps or downloads). These maps are used by urban planners and the like, as well as wildlife lovers, hunters, and fishers planning a trip.

The simple OLS model

We estimate a linear model where MEDV is a function of several drivers.

  • MEDV, median value of owner-occupied housing in thousands of USD.
  • CRIM, per-capita crime rate.
  • RM, average number of rooms per dwelling.
  • INDUS, proportion of non-retail business acres per town.
  • NOX, nitric oxides concentration (parts per 10 million) per town.
  • AGE, proportion of owner-occupied units built before 1940.
  • DIS, weighted distances to five Boston employment centres.
  • RAD, index of accessibility to radial highways per town.
  • PTRATIO, pupil–teacher ratio per town.
  • B, a transformed proportion of Blacks.
  • LSTAT, percentage of the lower-status population.
  • TAX, full-value property-tax rate per USD 10,000 per town.

boston_ols <- lm(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c)

summary(boston_ols)
## 
## Call:
## lm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + 
##     PTRATIO + B + LSTAT + TAX, data = boston.c)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7429  -2.8887  -0.7514   1.8144  26.8277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.308337   5.199690   7.175 2.66e-12 ***
## CRIM         -0.103402   0.033339  -3.102 0.002035 ** 
## RM            4.074379   0.420639   9.686  < 2e-16 ***
## INDUS         0.018212   0.062015   0.294 0.769138    
## NOX         -17.829176   3.889690  -4.584 5.79e-06 ***
## AGE          -0.002647   0.013353  -0.198 0.842957    
## DIS          -1.210182   0.186123  -6.502 1.94e-10 ***
## RAD           0.304603   0.066878   4.555 6.62e-06 ***
## PTRATIO      -1.131146   0.126079  -8.972  < 2e-16 ***
## B             0.009853   0.002735   3.603 0.000346 ***
## LSTAT        -0.525072   0.051543 -10.187  < 2e-16 ***
## TAX          -0.010901   0.003710  -2.939 0.003452 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.838 on 494 degrees of freedom
## Multiple R-squared:  0.7293, Adjusted R-squared:  0.7233 
## F-statistic:   121 on 11 and 494 DF,  p-value: < 2.2e-16

Without even considering the results, we want to test the null hypotheses of normality and homoscedasticity of the residuals.

library(moments)
jarque.test(boston_ols$residuals)

library(lmtest)
bptest(boston_ols)
## 
##  Jarque-Bera Normality Test
## 
## data:  boston_ols$residuals
## JB = 936.74, p-value < 2.2e-16
## alternative hypothesis: greater
## 
## 
##  studentized Breusch-Pagan test
## 
## data:  boston_ols
## BP = 59.214, df = 11, p-value = 1.297e-08

Both tests reject the two null hypotheses of normality and homoscedasticity. Still, let us run two more tests.

Testing spatial autocorrelation and dependence

We then perform a Moran’s I test on the house values. There are many ways to build the W matrix; here we use two different neighbour lists, boston.soi and block.nb, each yielding a different W matrix.

library(spdep)

moran.test(boston.c$CMEDV, nb2listw(boston.soi))
## 
##  Moran I test under randomisation
## 
## data:  boston.c$CMEDV  
## weights: nb2listw(boston.soi)  
## 
## Moran I statistic standard deviate = 21.786, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.690285059      -0.001980198       0.001009685
moran.test(boston.c$CMEDV, nb2listw(block.nb))
## 
##  Moran I test under randomisation
## 
## data:  boston.c$CMEDV  
## weights: nb2listw(block.nb)  
## 
## Moran I statistic standard deviate = 22.455, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3122905961     -0.0019801980      0.0001958827

All in all, both Moran’s I tests conclude that there is positive spatial autocorrelation.

We have very small area units (towns, blocks, neighbourhoods in Metropolitan Boston).

Consequence

Given the presence of spatial autocorrelation, the OLS model is unreliable. The house value changes smoothly through space: expensive neighbourhoods tend to be concentrated in certain zones of Boston, displaying spatial autocorrelation.

The Spatial Lag Model (SLM)

  • The SLM incorporates spatial effects by including a spatially lagged dependent variable as an additional predictor: spatial interactions.
  • We ‘know’ the structure of the spatial relationships.
  • For example, the price of houses depends on several variables \(x\) and on the price of neighbouring houses (location-location-location!).
  • \(\rho \neq 0\): the moving average parameter (the spatial effect, smoothing factor or spatial coefficient) is non-zero.
  • \(\lambda = 0\): the spatial autoregression parameter (the spatial dependence parameter or spatial error coefficient) is constrained to zero.
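In matrix form, the SLM is conventionally written (consistent with the \(\rho\) notation above):

\[y = \rho W y + X\beta + \varepsilon\]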

We estimate the model by ML. We use a row-standardized W matrix.

library(spdep)

boston_lag_ml <- lagsarlm(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_lag_ml)
## 
## Call:lagsarlm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14.10013  -2.75585  -0.70554   1.61932  26.72484 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)  2.4040e+01  5.7476e+00   4.1826 2.883e-05
## CRIM        -9.4139e-02  3.2433e-02  -2.9026 0.0037010
## RM           4.0807e+00  4.0823e-01   9.9962 < 2.2e-16
## INDUS        1.2526e-02  6.0021e-02   0.2087 0.8346858
## NOX         -1.3294e+01  3.9258e+00  -3.3862 0.0007086
## AGE          2.8516e-04  1.3028e-02   0.0219 0.9825378
## DIS         -1.1308e+00  1.8291e-01  -6.1823 6.319e-10
## RAD          2.8163e-01  6.5228e-02   4.3176 1.578e-05
## PTRATIO     -9.1942e-01  1.2842e-01  -7.1594 8.102e-13
## B            1.1200e-02  2.6561e-03   4.2169 2.477e-05
## LSTAT       -5.0198e-01  4.9951e-02 -10.0495 < 2.2e-16
## TAX         -9.4799e-03  3.6172e-03  -2.6208 0.0087728
## 
## Rho: 0.23337, LR test value: 19.481, p-value: 1.0163e-05
## Asymptotic standard error: 0.051073
##     z-value: 4.5694, p-value: 4.891e-06
## Wald statistic: 20.88, p-value: 4.891e-06
## 
## Log likelihood: -1499.873 for lag model
## ML residual variance (sigma squared): 21.922, (sigma: 4.6821)
## Number of observations: 506 
## Number of parameters estimated: 14 
## AIC: 3027.7, (AIC for lm: 3045.2)
## LM test for residual autocorrelation
## test value: 82.421, p-value: < 2.22e-16

\(\rho\) is significant. Almost all variables are significant, except INDUS and AGE, as was the case with the OLS model. The Wald test and the Likelihood Ratio test are also significant.

The Spatial Lag specification emphasizes the important feature of spatial dependence of the price of houses but has not completely removed spatial autocorrelation from the results. The LM test for residuals shows that there is still some positive and significant residual autocorrelation.

library(moments)

jarque.test(boston_lag_ml$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  boston_lag_ml$residuals
## JB = 975.86, p-value < 2.2e-16
## alternative hypothesis: greater

We still find non-normal residuals.

We also estimate the SLM with the 2SLS. The 2SLS does not rely on the assumption of normality. We use a row-standardized W matrix.

library(spdep)

boston_lag_2sls <- stsls(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_lag_2sls)
## 
## Call:stsls(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -13.98427  -2.74691  -0.67302   1.58226  26.67034 
## 
## Coefficients: 
##                Estimate  Std. Error t value  Pr(>|t|)
## Rho           0.1576808   0.0587531  2.6838 0.0072793
## (Intercept)  28.3432118   6.1074905  4.6407 3.472e-06
## CRIM         -0.0971436   0.0328660 -2.9557 0.0031191
## RM            4.0786601   0.4136283  9.8607 < 2.2e-16
## INDUS         0.0143702   0.0609981  0.2356 0.8137550
## NOX         -14.7647665   3.9916302 -3.6989 0.0002165
## AGE          -0.0006658   0.0131510 -0.0506 0.9596228
## DIS          -1.1565530   0.1841070 -6.2820 3.343e-10
## RAD           0.2890793   0.0660170  4.3789 1.193e-05
## PTRATIO      -0.9880950   0.1349497 -7.3220 2.445e-13
## B             0.0107636   0.0027103  3.9714 7.146e-05
## LSTAT        -0.5094723   0.0510159 -9.9865 < 2.2e-16
## TAX          -0.0099410   0.0036654 -2.7121 0.0066862
## 
## Residual variance (sigma squared): 22.632, (sigma: 4.7573)

The goal of boston_lag_2sls is to confirm the results from boston_lag_ml. In light of these results, there are no major differences from the first SLM, or even from the OLS model.

Consequence

If boston_ols is unreliable because of spatial autocorrelation, boston_lag_ml is not satisfactory either: the spatial autocorrelation already present in boston_ols persists, and it is still present in boston_lag_2sls.

We then need another model that can include extra spatial components. The SEM could be a good alternative, but we must remember that we ‘know’ the structure of the spatial relationships: the price of houses depends on several variables \(x\) and on the price of neighbouring houses.

Let us opt for the SARAR instead.

The Spatial AutoRegressive with AutoRegressive error structure (SARAR) model

  • \(\rho \neq 0\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda \neq 0\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
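
In matrix notation, the SARAR combines the spatial lag and the spatial error (a standard formulation, with \(W\) the spatial weights matrix):

\[y = \rho W y + X\beta + u~~~~~~~~~~u = \lambda W u + \varepsilon\]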

We estimate the model with ML (maximum likelihood). We use a row-standardized W matrix.

library(spdep)

boston_sarar_ml <- sacsarlm(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_sarar_ml)
## 
## Call:sacsarlm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -17.63685  -2.40489  -0.58622   1.70856  26.18523 
## 
## Type: sac 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)  28.5319859   5.9817646  4.7698 1.844e-06
## CRIM         -0.1247963   0.0298378 -4.1825 2.883e-05
## RM            4.7358225   0.3950673 11.9874 < 2.2e-16
## INDUS        -0.1406408   0.0676956 -2.0775 0.0377512
## NOX         -19.5245970   4.5926918 -4.2512 2.126e-05
## AGE          -0.0275620   0.0126335 -2.1817 0.0291350
## DIS          -1.2255638   0.3557442 -3.4451 0.0005709
## RAD           0.2474460   0.0720040  3.4366 0.0005892
## PTRATIO      -0.5835356   0.1321807 -4.4147 1.012e-05
## B             0.0127394   0.0025617  4.9731 6.590e-07
## LSTAT        -0.4149719   0.0487951 -8.5044 < 2.2e-16
## TAX          -0.0104529   0.0035135 -2.9750 0.0029295
## 
## Rho: -0.11479
## Asymptotic standard error: 0.10497
##     z-value: -1.0936, p-value: 0.27413
## Lambda: 0.84767
## Asymptotic standard error: 0.043931
##     z-value: 19.295, p-value: < 2.22e-16
## 
## LR test value: 93.937, p-value: < 2.22e-16
## 
## Log likelihood: -1462.645 for sac model
## ML residual variance (sigma squared): 17.754, (sigma: 4.2135)
## Number of observations: 506 
## Number of parameters estimated: 15 
## AIC: 2955.3, (AIC for lm: 3045.2)

The results are similar to those of the three previous models.

\(\rho\) is insignificant; \(\lambda\) is positive and significant.

This time INDUS and AGE are weakly significant.

We also estimate the SARAR with the G2SLS. We use a row-standardized W matrix.

library(spdep)

boston_gs2sls <- gstsls(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_gs2sls)
## 
## Call:gstsls(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -16.75042  -2.44711  -0.59901   1.71731  26.66170 
## 
## Type: GM SARAR estimator
## Coefficients: (GM standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## Rho_Wy       -0.0053898   0.0846866 -0.0636 0.9492535
## (Intercept)  28.1609223   6.1149284  4.6053 4.119e-06
## CRIM         -0.1226993   0.0310028 -3.9577 7.568e-05
## RM            4.6897794   0.4070951 11.5201 < 2.2e-16
## INDUS        -0.0869281   0.0668845 -1.2997 0.1937126
## NOX         -18.5523757   4.5076561 -4.1157 3.859e-05
## AGE          -0.0229789   0.0130221 -1.7646 0.0776300
## DIS          -1.3971171   0.2830438 -4.9360 7.972e-07
## RAD           0.2821821   0.0728452  3.8737 0.0001072
## PTRATIO      -0.7191670   0.1339472 -5.3690 7.916e-08
## B             0.0122765   0.0026626  4.6106 4.014e-06
## LSTAT        -0.4423588   0.0499657 -8.8532 < 2.2e-16
## TAX          -0.0109574   0.0036522 -3.0002 0.0026981
## 
## Lambda: 0.64614
## Residual variance (sigma squared): 19.249, (sigma: 4.3874)
## GM argmin sigma squared: 19.536
## Number of observations: 506 
## Number of parameters estimated: 15

These results confirm the previous ones. We also recall that boston_gs2sls does not rely on the assumption of normality.

Consequence

We can conclude the SARAR is the best model for the situation.

  • MEDV, median value of owner-occupied housing in thousands USD.
  • INDUS, proportion of non-retail business acres per town.
  • AGE, proportion of owner-occupied units built before 1940.

First, INDUS and AGE have no significant effect on MEDV.

  • CRIM, per-capita crime rate per town.
  • RM, average number of rooms per dwelling.
  • NOX, value of the nitric oxides concentration (ppm) per town.
  • RAD, index of accessibility to radial highways per town.
  • PTRATIO, pupil-teacher ratio per town.
  • B, a transformed proportion of Blacks.
  • LSTAT, percentage of the lower-status population.
  • TAX, full-value property tax per USD 10000 per town.

Second, all the other coefficients are significant and have the expected signs. We can read off the impact of each variable on MEDV.

Third, \(\rho\) is insignificant; \(\lambda\) is positive and significant.
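
The same ranking can be read off the information criteria reported by the summaries (a base-R sketch; the AIC values are copied from the outputs above):

```r
# AIC reported by the model summaries above (lower is better)
aic <- c(ols = 3045.2, slm = 3027.7, sarar = 2955.3)
names(which.min(aic))   # "sarar"
```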

The Spatial Error Model (SEM)

  • The model corrects for spatial autocorrelation due to the use of spatial data. We do not know the structure of the spatial relationships. The error term accounts for unobservable features or omitted variables associated with a location in the real estate model, for instance.
  • \(\rho = 0\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda \neq 0\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
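
In matrix notation, the SEM keeps \(\rho = 0\) and moves the spatial dependence into the error term (a standard formulation, with \(W\) the spatial weights matrix):

\[y = X\beta + u~~~~~~~~~~u = \lambda W u + \varepsilon\]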

We estimate the SEM with ML. We use a row-standardized W matrix.

library(spdep)

boston_se_ml <- errorsarlm(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_se_ml)
## 
## Call:errorsarlm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -17.56220  -2.33823  -0.62335   1.74523  26.49245 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)  25.7794669   5.4781848  4.7058 2.528e-06
## CRIM         -0.1229337   0.0299560 -4.1038 4.064e-05
## RM            4.8045309   0.3937173 12.2030 < 2.2e-16
## INDUS        -0.1357363   0.0672701 -2.0178 0.0436140
## NOX         -19.4945052   4.5838628 -4.2529 2.111e-05
## AGE          -0.0268291   0.0126813 -2.1156 0.0343759
## DIS          -1.2983711   0.3406202 -3.8118 0.0001380
## RAD           0.2586205   0.0716616  3.6089 0.0003075
## PTRATIO      -0.6021464   0.1316527 -4.5737 4.791e-06
## B             0.0131762   0.0025505  5.1662 2.389e-07
## LSTAT        -0.4167747   0.0488789 -8.5267 < 2.2e-16
## TAX          -0.0107466   0.0035168 -3.0558 0.0022449
## 
## Lambda: 0.82263, LR test value: 92.442, p-value: < 2.22e-16
## Asymptotic standard error: 0.042025
##     z-value: 19.575, p-value: < 2.22e-16
## Wald statistic: 383.17, p-value: < 2.22e-16
## 
## Log likelihood: -1463.392 for error model
## ML residual variance (sigma squared): 17.93, (sigma: 4.2344)
## Number of observations: 506 
## Number of parameters estimated: 14 
## AIC: 2954.8, (AIC for lm: 3045.2)

The \(\lambda\) is significant.

However, the results from all the other models showed signs of non-normality of the residuals.

Before discussing the other results, we also estimate the SEM with the GM (generalized moments) estimator. This estimator does not rely on the assumption of normality. We use a row-standardized W matrix.

library(spdep)

boston_se_gls <- GMerrorsar(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_se_gls)
## 
## Call:GMerrorsar(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1650  -3.1668  -1.0855   1.7166  29.0929 
## 
## Type: GM SAR estimator
## Coefficients: (GM standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)  26.8053548   5.5357108  4.8423 1.284e-06
## CRIM         -0.1230767   0.0309175 -3.9808 6.868e-05
## RM            4.7588378   0.4050184 11.7497 < 2.2e-16
## INDUS        -0.1122504   0.0682580 -1.6445 0.1000724
## NOX         -19.0381338   4.6251629 -4.1162 3.852e-05
## AGE          -0.0251919   0.0130583 -1.9292 0.0537069
## DIS          -1.3723587   0.3146351 -4.3617 1.290e-05
## RAD           0.2705240   0.0736073  3.6752 0.0002376
## PTRATIO      -0.6545771   0.1349757 -4.8496 1.237e-06
## B             0.0127743   0.0026295  4.8581 1.185e-06
## LSTAT        -0.4284721   0.0502013 -8.5351 < 2.2e-16
## TAX          -0.0108476   0.0036439 -2.9770 0.0029113
## 
## Lambda: 0.742 (standard error): 0.40397 (z-value): 1.8368
## Residual variance (sigma squared): 19.153, (sigma: 4.3764)
## GM argmin sigma squared: 19.16
## Number of observations: 506 
## Number of parameters estimated: 14

This time, \(\lambda\) is not significant.

INDUS and AGE are not significant. As with the SLM, the SEM is not the proper model for the situation.

Consequence

We have shown that the SEM is not a good alternative. Even in theory, the SEM is not the proper model because we ‘know’ the structure of the spatial relationships: the price of houses depends on several variables \(x\) and on the price of neighbouring houses.

Finally, let us rewind a little bit and run Lagrange Multiplier diagnostics for spatial dependence in linear models. We use a row-standardized W matrix.

library(spdep)

lm.LMtests(boston_ols, listw = nb2listw(block.nb, style = "W"), test = c("LMerr", "LMlag", "RLMerr", "RLMlag", "SARMA"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS +
## RAD + PTRATIO + B + LSTAT + TAX, data = boston.c)
## weights: nb2listw(block.nb, style = "W")
## 
## LMerr = 132.98, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS +
## RAD + PTRATIO + B + LSTAT + TAX, data = boston.c)
## weights: nb2listw(block.nb, style = "W")
## 
## LMlag = 22.702, df = 1, p-value = 1.892e-06
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS +
## RAD + PTRATIO + B + LSTAT + TAX, data = boston.c)
## weights: nb2listw(block.nb, style = "W")
## 
## RLMerr = 110.29, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS +
## RAD + PTRATIO + B + LSTAT + TAX, data = boston.c)
## weights: nb2listw(block.nb, style = "W")
## 
## RLMlag = 0.014788, df = 1, p-value = 0.9032
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS +
## RAD + PTRATIO + B + LSTAT + TAX, data = boston.c)
## weights: nb2listw(block.nb, style = "W")
## 
## SARMA = 132.99, df = 2, p-value < 2.2e-16

  • The LMerr is significant: there is spatial dependence and the SEM can be an alternative.
  • The RLMerr, the robust version of the test, confirms the LMerr. However, we argued above that, in theory, the SEM is not the proper alternative when we find spatial dependence in linear models if we ‘know’ the structure of the spatial relationships: the price of houses depends on several variables \(x\) and on the price of neighbouring houses.
  • The LMlag is significant: there is spatial dependence and the SLM can be an alternative.
  • The RLMlag, the robust version of the test, contradicts the LMlag. As we demonstrated above, the SLM is not the proper alternative.
  • The SARMA is significant: there is spatial dependence and the SARAR can be an alternative.

The Lagrange Multiplier diagnostics confirm the preceding conclusions.

The simple OLS model – Part 2

We estimate the effect of RM, the average number of rooms per dwelling, on MEDV, the median value of owner-occupied housing in thousands USD.

boston_ols2 <- lm(MEDV ~ RM, data = boston.c)

summary(boston_ols2)
## 
## Call:
## lm(formula = MEDV ~ RM, data = boston.c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.346  -2.547   0.090   2.986  39.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.671      2.650  -13.08   <2e-16 ***
## RM             9.102      0.419   21.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4825 
## F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16

Without even considering the results, we want to test the null hypothesis of normality.

library(moments)

jarque.test(boston_ols2$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  boston_ols2$residuals
## JB = 612.45, p-value < 2.2e-16
## alternative hypothesis: greater

The test rejects the null hypothesis of normality.

We also test for ‘residual spatial dependence’.

## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ RM, data = boston.c)
## weights: nb2listw(block.nb)
## 
## LMerr = 388.98, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ RM, data = boston.c)
## weights: nb2listw(block.nb)
## 
## LMlag = 210.03, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ RM, data = boston.c)
## weights: nb2listw(block.nb)
## 
## RLMerr = 188.21, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ RM, data = boston.c)
## weights: nb2listw(block.nb)
## 
## RLMlag = 9.2573, df = 1, p-value = 0.002346
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = MEDV ~ RM, data = boston.c)
## weights: nb2listw(block.nb)
## 
## SARMA = 398.24, df = 2, p-value < 2.2e-16

Overall, the Lagrange multiplier diagnostics show signs of spatial dependence, and all the other models, the SEM, the SLM, and the SARAR, could be good alternatives. Theoretically speaking, the only valuable alternative is the SLM because we ‘know’ the structure of the spatial relationships.

The Spatial Lag Model (SLM) – Part 2

We estimate the SLM with the ML. We use a row-standardized W matrix.

library(spdep)

boston_lag_ml2 <- lagsarlm(MEDV ~ RM, data = boston.c, listw = nb2listw(block.nb, style = "W"))

summary(boston_lag_ml2)
## 
## Call:
## lagsarlm(formula = MEDV ~ RM, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -19.21256  -2.93495  -0.42299   2.37392  38.97158 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -38.10146    2.35821 -16.157 < 2.2e-16
## RM            7.69956    0.40769  18.886 < 2.2e-16
## 
## Rho: 0.54958, LR test value: 106.02, p-value: < 2.22e-16
## Asymptotic standard error: 0.047871
##     z-value: 11.48, p-value: < 2.22e-16
## Wald statistic: 131.8, p-value: < 2.22e-16
## 
## Log likelihood: -1620.065 for lag model
## ML residual variance (sigma squared): 34.66, (sigma: 5.8873)
## Number of observations: 506 
## Number of parameters estimated: 4 
## AIC: 3248.1, (AIC for lm: 3352.2)
## LM test for residual autocorrelation
## test value: 30.599, p-value: 3.1724e-08

All parameters are significant; the LM test for residual autocorrelation is significant too.

These two models (OLS and SLM) simply serve as a basis of comparison for the next concept.

The Geographically-weighted Regression (GWR) model

The GWR considers that the relationship between two variables is not constant over the whole geographical area; it is more sensible to conjecture that the relationship varies across space according to some regular pattern.

We estimate the GWR using a Gaussian kernel. The bandwidth is specified using the cross-validation method.
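
Before running it, it helps to see what the Gaussian kernel does: each observation \(j\) entering the local regression at a point \(i\) gets a weight that decays smoothly with the distance \(d_{ij}\), relative to the bandwidth \(h\) (a base-R sketch of the general form; spgwr’s gwr.Gauss works with squared distances but implements the same idea):

```r
# Gaussian kernel weight: full weight at d = 0, smoothly decaying with distance
gauss_w <- function(d, h) exp(-0.5 * (d / h)^2)

gauss_w(0, 1)                  # 1: the regression point itself gets full weight
gauss_w(c(0.5, 1, 2), h = 1)   # weights shrink as distance grows
```

A larger bandwidth \(h\) flattens the kernel, so distant observations keep more influence and the local estimates become smoother.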

library(sp)

# Create a system of coordinates of the centroids of each polygon
# matrix object
coords <- data.frame(boston.c$LON, boston.c$LAT)
coords <- as.matrix(coords)

library(spgwr)

# Calibrate the cross-validation bandwidth
# This is an iterative process
# gweight =  gwr.bisquare, gwr.tricube, gwr.Gauss
boston_bw <- gwr.sel(MEDV ~ RM, data = boston.c, coords, gweight = gwr.Gauss, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 19814.33 
## Adaptive q: 0.618034 CV score: 21053.22 
## Adaptive q: 0.236068 CV score: 18406.07 
## Adaptive q: 0.145898 CV score: 16785.69 
## Adaptive q: 0.09016994 CV score: 15393.9 
## Adaptive q: 0.05572809 CV score: 14371.56 
## Adaptive q: 0.03444185 CV score: 13166.36 
## Adaptive q: 0.02128624 CV score: 11471.73 
## Adaptive q: 0.01315562 CV score: 9858.911 
## Adaptive q: 0.008130619 CV score: 8763.524 
## Adaptive q: 0.005024999 CV score: 7480.574 
## Adaptive q: 0.00310562 CV score: 7126.649 
## Adaptive q: 0.002039628 CV score: NA 
## Adaptive q: 0.003838757 CV score: 7082.71 
## Adaptive q: 0.00361768 CV score: 7037.385 
## Adaptive q: 0.003530104 CV score: 7033.506 
## Adaptive q: 0.003570794 CV score: 7034.146 
## Adaptive q: 0.003489414 CV score: 7035.017 
## Adaptive q: 0.003530104 CV score: 7033.506
summary(boston_bw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00353 0.00353 0.00353 0.00353 0.00353 0.00353
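
With adapt = TRUE, gwr.sel returns the bandwidth as a fraction of the sample rather than a distance; the value above can be translated into an approximate number of nearest neighbours used in each local fit (a base-R back-of-the-envelope check):

```r
# adaptive q: share of the 506 observations entering each local regression
q <- 0.003530104
round(q * 506)   # about 2 nearest neighbours per local fit
```
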
# Estimate the GWR model
boston_gwr <- gwr(MEDV ~ RM, data = boston.c, coords, adapt = boston_bw, hatmatrix = TRUE)

The results are stored in a spatial data frame (SDF).

We can see the statistical summaries for the intercept and the RM variable. The quasi-global \(R^2\) is strong.

Here is the head of the results (in a data frame) and a focus on the RM variable.

# The SDF, part of the object
head(boston_gwr$SDF, 3)
##         coordinates     sum.w X.Intercept.        RM X.Intercept._se
## 1 (-70.955, 42.255) 13.234200    -35.76783  8.882036        6.925902
## 2 (-70.95, 42.2875)  5.506060    -41.20943  9.761382       11.302326
## 3 (-70.936, 42.283)  5.791417    -58.18195 12.778682       10.119707
##      RM_se     gwr.e     pred   pred.se   localR2 X.Intercept._se_EDF
## 1 1.151137 1.3684393 22.63156 0.8135323 0.9199276             7.69631
## 2 1.837793 0.1315916 21.46841 0.9955433 0.9346412            12.55955
## 3 1.612012 1.0671195 33.63288 1.6537259 0.9565849            11.24538
##   RM_se_EDF pred.se.1
## 1  1.279185 0.9040262
## 2  2.042221 1.1062833
## 3  1.791326 1.8376794
# One of the SDF variables
head(boston_gwr$SDF$RM, 3)
## [1]  8.882036  9.761382 12.778682

The most interesting statistics are the mean and the median of RM: they are very similar to the SLM’s RM coefficient.

summary(boston_gwr$SDF$RM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.206   3.809   7.229   6.965   9.761  21.512
boston_lag_ml2$coefficients
## (Intercept)          RM 
##  -38.101456    7.699561

The OLS and SLM fit the conditional mean. We could also run quantile regressions, which fit the conditional median.

The difference with the SLM is that, in the GWR, the coefficient varies across space.

The results on the Metropolitan Boston UTM map

We first recall the original data points.

plot(boston.utm, col = 'blue3', las = 0.7)
grid()

Now, the GWR results.

library(RColorBrewer)

#cols <- brewer.pal(length(breaks), "Blues")
cols <- c("#C6DBEF", "#9ECAE1", "#6BAED6", "#4292C6", "#2171B5", "#084594")

library(sp)

spplot(boston_gwr$SDF, "RM", cuts = quantile(boston_gwr$SDF$RM), 
col.regions = cols, cex = 0.7, bg = 'gray')

We recall that we want to assess the median value with the average number of rooms per dwelling.

The map shows the regression residuals grouped into 4 bands. The residuals are the difference between the observed median value and the assessed median value. A negative residual means that the model overestimated the observed value.

The model underestimates the median value in downtown Boston, a proof that valuation cannot be based solely on the average number of rooms. The model also overestimates the median value as we move away from the city center.

We can then see spatial patterns: a central cluster (the palest blue) extending further north along the northern shore, another pattern around the central cluster with alternating pockets (mid and dark blues). Overall, the residuals are not random.

The MESS

We want to compare the Matrix Exponential Spatial Specification (MESS) with the specification methods we have covered so far, such as the SLM estimated with ML. On large datasets, the MESS can reduce the burden of computation without sacrificing accuracy.

The results from the first SLM

summary(boston_lag_ml)
## 
## Call:lagsarlm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14.10013  -2.75585  -0.70554   1.61932  26.72484 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)  2.4040e+01  5.7476e+00   4.1826 2.883e-05
## CRIM        -9.4139e-02  3.2433e-02  -2.9026 0.0037010
## RM           4.0807e+00  4.0823e-01   9.9962 < 2.2e-16
## INDUS        1.2526e-02  6.0021e-02   0.2087 0.8346858
## NOX         -1.3294e+01  3.9258e+00  -3.3862 0.0007086
## AGE          2.8516e-04  1.3028e-02   0.0219 0.9825378
## DIS         -1.1308e+00  1.8291e-01  -6.1823 6.319e-10
## RAD          2.8163e-01  6.5228e-02   4.3176 1.578e-05
## PTRATIO     -9.1942e-01  1.2842e-01  -7.1594 8.102e-13
## B            1.1200e-02  2.6561e-03   4.2169 2.477e-05
## LSTAT       -5.0198e-01  4.9951e-02 -10.0495 < 2.2e-16
## TAX         -9.4799e-03  3.6172e-03  -2.6208 0.0087728
## 
## Rho: 0.23337, LR test value: 19.481, p-value: 1.0163e-05
## Asymptotic standard error: 0.051073
##     z-value: 4.5694, p-value: 4.891e-06
## Wald statistic: 20.88, p-value: 4.891e-06
## 
## Log likelihood: -1499.873 for lag model
## ML residual variance (sigma squared): 21.922, (sigma: 4.6821)
## Number of observations: 506 
## Number of parameters estimated: 14 
## AIC: 3027.7, (AIC for lm: 3045.2)
## LM test for residual autocorrelation
## test value: 82.421, p-value: < 2.22e-16

Reestimating the SLM and timer

We reestimate the model (ML, row-standardized W matrix) with a function to compute the computation time.

SLMfunction <- function() {
  boston_lag_ml <- lagsarlm(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))
  summary(boston_lag_ml)
}

SLMtime.taken <- rep(0, 10)

for (i in 1:10) {
  start.time <- Sys.time()
  SLMfunction()
  end.time <- Sys.time()
  # difftime with explicit units avoids mixing seconds and minutes
  SLMtime.taken[i] <- as.numeric(difftime(end.time, start.time, units = "secs"))
}

mean(SLMtime.taken)
## [1] 0.3101689

Estimating the MESS and timer

We estimate the MESS using the same variable and parameters as with the previous model (using a row-standardized W matrix).

MESSfunction <- function() {
  boston_mess <- lagmess(MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, style = "W"))
  summary(boston_mess)
}

MESStime.taken <- rep(0, 10)

for (i in 1:10) {
  start.time <- Sys.time()
  MESSfunction()
  end.time <- Sys.time()
  # difftime with explicit units avoids mixing seconds and minutes
  MESStime.taken[i] <- as.numeric(difftime(end.time, start.time, units = "secs"))
}

mean(MESStime.taken)
## [1] 0.2221496

Coefficients

SLM

summary(boston_lag_ml)
## 
## Call:lagsarlm(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14.10013  -2.75585  -0.70554   1.61932  26.72484 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)  2.4040e+01  5.7476e+00   4.1826 2.883e-05
## CRIM        -9.4139e-02  3.2433e-02  -2.9026 0.0037010
## RM           4.0807e+00  4.0823e-01   9.9962 < 2.2e-16
## INDUS        1.2526e-02  6.0021e-02   0.2087 0.8346858
## NOX         -1.3294e+01  3.9258e+00  -3.3862 0.0007086
## AGE          2.8516e-04  1.3028e-02   0.0219 0.9825378
## DIS         -1.1308e+00  1.8291e-01  -6.1823 6.319e-10
## RAD          2.8163e-01  6.5228e-02   4.3176 1.578e-05
## PTRATIO     -9.1942e-01  1.2842e-01  -7.1594 8.102e-13
## B            1.1200e-02  2.6561e-03   4.2169 2.477e-05
## LSTAT       -5.0198e-01  4.9951e-02 -10.0495 < 2.2e-16
## TAX         -9.4799e-03  3.6172e-03  -2.6208 0.0087728
## 
## Rho: 0.23337, LR test value: 19.481, p-value: 1.0163e-05
## Asymptotic standard error: 0.051073
##     z-value: 4.5694, p-value: 4.891e-06
## Wald statistic: 20.88, p-value: 4.891e-06
## 
## Log likelihood: -1499.873 for lag model
## ML residual variance (sigma squared): 21.922, (sigma: 4.6821)
## Number of observations: 506 
## Number of parameters estimated: 14 
## AIC: 3027.7, (AIC for lm: 3045.2)
## LM test for residual autocorrelation
## test value: 82.421, p-value: < 2.22e-16

MESS

summary(boston_mess)
## Matrix exponential spatial lag model:
## 
## Call:
## lagmess(formula = MEDV ~ CRIM + RM + INDUS + NOX + AGE + DIS + 
##     RAD + PTRATIO + B + LSTAT + TAX, data = boston.c, listw = nb2listw(block.nb, 
##     style = "W"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14.07129  -2.77314  -0.69572   1.60769  26.81956 
## 
## Coefficients:
##                Estimate  Std. Error t value  Pr(>|t|)
## (Intercept)  24.3258321   5.1004948  4.7693 2.437e-06
## CRIM         -0.0939751   0.0327030 -2.8736 0.0042332
## RM            4.0999221   0.4126144  9.9364 < 2.2e-16
## INDUS         0.0129253   0.0608323  0.2125 0.8318243
## NOX         -13.4476775   3.8154864 -3.5245 0.0004637
## AGE           0.0002278   0.0130982  0.0174 0.9861314
## DIS          -1.1339853   0.1825721 -6.2112 1.117e-09
## RAD           0.2825197   0.0656024  4.3065 2.001e-05
## PTRATIO      -0.9257634   0.1236741 -7.4855 3.296e-13
## B             0.0112944   0.0026824  4.2105 3.028e-05
## LSTAT        -0.5021326   0.0505598 -9.9315 < 2.2e-16
## TAX          -0.0095532   0.0036390 -2.6252 0.0089280
## 
## Residual standard error: 4.7456 on 494 degrees of freedom
## Multiple R-squared:  0.70368,    Adjusted R-squared:  0.69708 
## F-statistic: 106.64 on 11 and 494 DF,  p-value: < 2.22e-16
## 
## Alpha: -0.2537, standard error: 0.061409
##     z-value: -4.1313, p-value: 3.6068e-05
## LR test value: 19.493, p-value: 1.0099e-05
## Implied rho: 0.2240748

Comparing the results, we notice that the MESS specification substantially confirms the conclusions of the SLM estimation (using ML). In both cases, the \(\rho\) parameter is positive and significantly different from 0.
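
The ‘Implied rho’ line in the MESS summary follows from the mapping between the MESS parameter \(\alpha\) and the SLM’s \(\rho\), namely \(\rho = 1 - e^{\alpha}\) (a base-R check using the \(\alpha\) estimate above):

```r
# alpha estimated by lagmess above; the implied SLM-style rho
alpha <- -0.2537
1 - exp(alpha)   # ~0.224, matching the 'Implied rho' line
```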

Time difference

time.taken <- data.frame(SLMtime.taken, MESStime.taken)
time.taken["mean", ] <- colMeans(time.taken)

knitr::kable(time.taken)
       SLMtime.taken   MESStime.taken
1          0.3633149        0.4641178
2          0.3144600        0.2065680
3          0.2988303        0.1974759
4          0.3004863        0.1940000
5          0.3027320        0.1842351
6          0.3045528        0.1926391
7          0.3181975        0.1876292
8          0.2998209        0.1967158
9          0.2999413        0.1886079
10         0.2993534        0.2095072
mean       0.3101689        0.2221496

With such a small dataset, the mean difference in computing times is negligible. However, the example shows that on big datasets the MESS specification can substantially reduce the computation burden without loss of accuracy in the estimates; we can extrapolate the time gain to Big Data.

Case 5: The determinants of educational achievement in Georgia

The dataset

The dataset is a collection of data from the 159 counties of the state of Georgia. The dataset contains the following variables:

  • Bach, proportion of residents with a Bachelor’s degree or higher.
  • Rural, proportion of people living in rural neighbourhoods.
  • Eld, proportion of elderly residents.
  • FB, proportion of residents who are foreign-born.
  • Pov, proportion of residents who are living below the poverty line.
  • Black, proportion of residents who are ethnic Blacks.

The dataset also contains information about the total population of each county and a series of other information about the geography (area, latitude/longitude of the centroids of each county).

The data is part of the spgwr package. We first load the data from georgia.

library(spgwr)

data(georgia)

At the same time, we loaded a SpatialPolygonsDataFrame.

class(gSRDF)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

The dataset we are looking for is stored inside the SpatialPolygonsDataFrame; we access it through the @data slot.
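The @ operator is R's generic S4 slot accessor, which is how gSRDF@data works. A toy class illustrates the mechanism; Holder is a hypothetical class for this sketch, not part of sp.

```r
library(methods)

# A toy S4 class with a 'data' slot, mimicking the shape of a
# SpatialPolygonsDataFrame (which carries its attributes in @data)
setClass("Holder", slots = c(data = "data.frame"))

h <- new("Holder", data = head(mtcars, 3))
nrow(h@data)  # the @ operator extracts the slot
```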

head(gSRDF@data, 3)
##   SR_ID AreaKey Latitude  Longitud TotPop90 PctRural PctBach PctEld PctFB
## 0     1   13001 31.75339 -82.28558    15744     75.6     8.2  11.43  0.64
## 1     2   13003 31.29486 -82.87474     6213    100.0     6.4  11.77  1.58
## 2     3   13005 31.55678 -82.45115     9566     61.7     6.6  11.11  0.27
##   PctPov PctBlack  ID        X       Y
## 0   19.9    20.76 133 941396.6 3521764
## 1   26.0    26.86 158 895553.0 3471916
## 2   24.1    15.42 146 930946.4 3502787
summary(gSRDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x -85.60520 -80.84126
## y  30.35541  35.00068
## Is projected: FALSE 
## proj4string : [+proj=longlat +datum=NAD27]
## Data attributes:
##      SR_ID        AreaKey         Latitude        Longitud     
##  1      :  1   Min.   :13001   Min.   :30.72   Min.   :-85.50  
##  10     :  1   1st Qu.:13082   1st Qu.:31.79   1st Qu.:-84.44  
##  100    :  1   Median :13161   Median :32.75   Median :-83.69  
##  101    :  1   Mean   :13161   Mean   :32.81   Mean   :-83.58  
##  102    :  1   3rd Qu.:13242   3rd Qu.:33.79   3rd Qu.:-82.85  
##  103    :  1   Max.   :13321   Max.   :34.92   Max.   :-81.09  
##  (Other):153                                                   
##     TotPop90         PctRural         PctBach          PctEld     
##  Min.   :  1915   Min.   :  2.50   Min.   : 4.20   Min.   : 1.46  
##  1st Qu.:  9220   1st Qu.: 54.70   1st Qu.: 7.60   1st Qu.: 9.81  
##  Median : 16934   Median : 72.30   Median : 9.40   Median :12.07  
##  Mean   : 40744   Mean   : 70.18   Mean   :10.95   Mean   :11.74  
##  3rd Qu.: 36058   3rd Qu.:100.00   3rd Qu.:12.00   3rd Qu.:13.70  
##  Max.   :648951   Max.   :100.00   Max.   :37.50   Max.   :22.96  
##                                                                   
##      PctFB           PctPov         PctBlack           ID        
##  Min.   :0.040   Min.   : 2.60   Min.   : 0.00   Min.   :  1.00  
##  1st Qu.:0.415   1st Qu.:14.05   1st Qu.:11.75   1st Qu.: 41.50  
##  Median :0.720   Median :18.60   Median :27.64   Median : 85.00  
##  Mean   :1.131   Mean   :19.34   Mean   :27.39   Mean   : 86.04  
##  3rd Qu.:1.265   3rd Qu.:24.65   3rd Qu.:40.06   3rd Qu.:130.50  
##  Max.   :6.740   Max.   :35.90   Max.   :79.64   Max.   :173.00  
##                                                                  
##        X                 Y          
##  Min.   : 635964   Min.   :3401148  
##  1st Qu.: 741144   1st Qu.:3523380  
##  Median : 809737   Median :3636468  
##  Mean   : 820944   Mean   :3636238  
##  3rd Qu.: 894584   3rd Qu.:3741656  
##  Max.   :1059706   Max.   :3872640  
## 

The map

Just like with shapefiles or .rds files, we can plot the SpatialPolygonsDataFrame and customize the map.

The OLS model

We consider a model where PctBach, the percentage of residents with a Bachelor's degree or higher in a county, is a function of the other variables.

georgia_ols <- lm(PctBach ~ PctRural + PctEld + PctFB + PctPov + PctBlack, data = gSRDF)

summary(georgia_ols)
## 
## Call:
## lm(formula = PctBach ~ PctRural + PctEld + PctFB + PctPov + PctBlack, 
##     data = gSRDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4471 -2.0275 -0.3775  1.6223 18.7245 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.24373    1.75329   9.835  < 2e-16 ***
## PctRural    -0.07032    0.01358  -5.179 6.93e-07 ***
## PctEld       0.01145    0.12953   0.088 0.929693    
## PctFB        1.85247    0.30683   6.037 1.14e-08 ***
## PctPov      -0.25524    0.07248  -3.522 0.000566 ***
## PctBlack     0.04911    0.02648   1.854 0.065602 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.714 on 153 degrees of freedom
## Multiple R-squared:  0.5884, Adjusted R-squared:  0.575 
## F-statistic: 43.75 on 5 and 153 DF,  p-value: < 2.2e-16
  • PctBach, proportion of residents with a Bachelor’s degree or higher.
  • PctRural, proportion of people living in rural neighbourhoods.
  • PctEld, proportion of elderly residents.
  • PctFB, proportion of residents who are foreign-born.
  • PctPov, proportion of residents who are living below the poverty line.
  • PctBlack, proportion of residents who are ethnic Blacks.

Pct means percentage or proportion.

We do not analyze these results here; we keep them as a basis for comparison.

The GWR

The GWR considers that the relationship between two variables is not constant over the whole geographical area; it is more sensible to conjecture that the relationship varies across space according to some regular pattern.

We estimate the GWR using a Gaussian kernel. The bandwidth is specified using the cross-validation method.
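The idea of the Gaussian kernel can be sketched as follows. This is the generic formula (weights decay smoothly with distance from the regression point), not spgwr's exact internal implementation.

```r
# Gaussian kernel weight: observations at distance d from the
# regression point get weight exp(-0.5 * (d/h)^2) for bandwidth h.
gauss_w <- function(d, h) exp(-0.5 * (d / h)^2)

gauss_w(0, 10)    # weight 1 at the regression point itself
gauss_w(10, 10)   # smaller weight one bandwidth away
gauss_w(30, 10)   # near-zero weight far away
```

The cross-validation search shown below then picks the bandwidth (here, the adaptive quantile q) that minimizes the leave-one-out prediction error.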

# Calibrate the cross-validation bandwidth
# This is an iterative process
# gweight =  gwr.bisquare, gwr.tricube, gwr.Gauss
georgia_bw <- gwr.sel(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov + PctBlack, data = gSRDF, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 2048.743 
## Adaptive q: 0.618034 CV score: 2028.561 
## Adaptive q: 0.763932 CV score: 2041.554 
## Adaptive q: 0.593544 CV score: 2027.538 
## Adaptive q: 0.5306416 CV score: 2026.568 
## Adaptive q: 0.536485 CV score: 2026.541 
## Adaptive q: 0.5402641 CV score: 2026.493 
## Adaptive q: 0.5606152 CV score: 2026.641 
## Adaptive q: 0.5480375 CV score: 2026.425 
## Adaptive q: 0.5568526 CV score: 2026.552 
## Adaptive q: 0.5514046 CV score: 2026.493 
## Adaptive q: 0.5458334 CV score: 2026.425 
## Adaptive q: 0.5468793 CV score: 2026.417 
## Adaptive q: 0.5469222 CV score: 2026.417 
## Adaptive q: 0.5473482 CV score: 2026.417 
## Adaptive q: 0.547116 CV score: 2026.416 
## Adaptive q: 0.5471567 CV score: 2026.415 
## Adaptive q: 0.5472298 CV score: 2026.416 
## Adaptive q: 0.5471567 CV score: 2026.415
summary(georgia_bw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5472  0.5472  0.5472  0.5472  0.5472  0.5472
# Estimate the GWR model
georgia_gwr <- gwr(PctBach ~ PctRural + PctEld + PctFB + PctPov + PctBlack, data = gSRDF, adapt = georgia_bw)

# Results are stored in a spatial d.f (SDF)
georgia_gwr
## Call:
## gwr(formula = PctBach ~ PctRural + PctEld + PctFB + PctPov + 
##     PctBlack, data = gSRDF, adapt = georgia_bw)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.5471567 (about 86 of 159 data points)
## Summary of GWR coefficient estimates at data points:
##                   Min.   1st Qu.    Median   3rd Qu.      Max.  Global
## X.Intercept. 15.534870 15.930226 17.116353 17.999262 18.439278 17.2437
## PctRural     -0.077269 -0.073377 -0.066971 -0.060634 -0.057472 -0.0703
## PctEld       -0.084916 -0.047010 -0.031993 -0.019940  0.010712  0.0114
## PctFB         1.299964  1.540455  1.970330  2.424340  2.622509  1.8525
## PctPov       -0.288659 -0.260353 -0.217175 -0.177695 -0.140704 -0.2552
## PctBlack      0.011314  0.023969  0.038085  0.055239  0.067478  0.0491

Here is the head of the results, stored in the SDF (a spatial data frame), with a focus on the PctBlack variable.

# One of the SDF variables
head(georgia_gwr$SDF$PctBlack, 3)
## [1] 0.04836296 0.06127023 0.05297678

Visualize the results with a splom

The correlation between…

Visualize the results with maps

We want to focus on PctBlack.

We break down the results into 5 bands (brks) using the findInterval function.
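The banding mechanism can be sketched with base R. The vectors below are illustrative; the document's actual brks vector is not shown in this extract.

```r
# Sketch: classifying a numeric vector into bands with findInterval.
# 'all.inside = TRUE' clamps the indices to 1..(length(brks) - 1),
# so every value falls into a valid band.
x    <- c(0.01, 0.02, 0.035, 0.05, 0.065)   # e.g. GWR coefficients
brks <- c(0.01, 0.025, 0.04, 0.055, 0.07)   # illustrative break points
bands <- findInterval(x, brks, all.inside = TRUE)
bands  # band index of each value
```

These band indices are then used to pick a color per polygon when mapping the coefficients.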

With spplot, we break down the results using the quantile function.

library(RColorBrewer)

#cols <- brewer.pal(length(breaks), "Blues")
cols <- c("#C6DBEF", "#9ECAE1", "#6BAED6", "#4292C6", "#2171B5", "#084594")

library(sp)

spplot(georgia_gwr$SDF, "PctBlack", cuts = quantile(georgia_gwr$SDF$PctBlack), 
col.regions = cols, cex = 0.7)

With spplot, we break down the results into 5 bands (brks) using the findInterval function.

library(RColorBrewer)

#cols <- brewer.pal(length(breaks), "Blues")
cols <- c("#C6DBEF", "#9ECAE1", "#6BAED6", "#4292C6", "#2171B5", "#084594")

library(sp)

spplot(georgia_gwr$SDF, "PctBlack", cuts = findInterval(georgia_gwr$SDF$PctBlack, brks, all.inside = TRUE), 
col.regions = cols, cex = 0.7)

So far, the last map is the best one. The map displays the geographical variability of the regression coefficients related to the variable PctBlack (darker shades for higher coefficients). The lack of homogeneity of the coefficient is evident, with a clear pattern decreasing from South-West to North-East. Thus the variable PctBlack has a stronger positive impact on PctBach in the South and in the Western counties, and a much weaker impact in the North-Eastern counties.

We plot the histogram.

hist(georgia_gwr$SDF$PctBlack, main = '', xlab = '', ylim = c(5, 20), col = 'red3', density = 10, las = 1)
grid()

Let us turn to PctEld (using findInterval and brks).

spplot(georgia_gwr$SDF, "PctEld", cuts = findInterval(georgia_gwr$SDF$PctEld, brks, all.inside = TRUE), 
col.regions = cols, cex = 0.7)

hist(georgia_gwr$SDF$PctEld, main = '', xlab = '', ylim = c(0, 35), col = 'red3', density = 10, las = 1)
grid()

Let us turn to PctPov (using findInterval and brks).

spplot(georgia_gwr$SDF, "PctPov", cuts = findInterval(georgia_gwr$SDF$PctPov, brks, all.inside = TRUE), 
col.regions = cols, cex = 0.7)

hist(georgia_gwr$SDF$PctPov, main = '', xlab = '', ylim = c(0, 35), col = 'red3', density = 10, las = 1)
grid()

And we could carry on.

Case 6: EU, the impact of education & Hi-tec exports

We have a model about the impact of the percentage of GDP devoted to education in 2009 (p_education_expense_2009) on the growth of GDP per capita during the period of 2010-11 (growth_2010_2011). We want to test spatial correlation among the 27 EU members.

europe[c(1:3, 12:15), c(1,2,3,4)]
##    country_code        country p_education_expense_2009 growth_2010_2011
## 1            BE        Belgium                     42.0         1.046931
## 2            BG       Bulgaria                     27.9         1.058252
## 3            CZ Czech Republic                     17.5         1.041237
## 12           LT      Lithuania                     40.6         1.191176
## 13           HU        Hungary                     23.9         1.045752
## 14           NL    Netherlands                     40.5         1.083871
## 15           AT        Austria                     23.5         1.057823

The map

27 countries were members of the EU in July 2012.

Computing the W matrix

We load the .GAL file (contiguity-based neighbours).

library(spdep)

eu_countries <- seq(1, 27)
nbEurope <- read.gal("GAL/Europe.GAL", region.id = eu_countries)
wEurope <- nb2listw(nbEurope, style = "W")
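Row standardization (style = "W") can be illustrated on a toy binary contiguity matrix: each row of the 0/1 matrix is divided by its row sum, so the weights in every row sum to 1. This is a base-R sketch of what nb2listw does, not its actual code.

```r
# Toy binary contiguity matrix: region 1 borders regions 2 and 3;
# regions 2 and 3 each border only region 1.
B <- rbind(c(0, 1, 1),
           c(1, 0, 0),
           c(1, 0, 0))

# Row-standardize: divide each row by its number of neighbours
W <- B / rowSums(B)
rowSums(W)  # every row now sums to 1
```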

The SARAR

  • \(\rho\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
  • We test the impact of the percentage of GDP devoted to education on the growth of GDP per capita.
europe_sarar_ml <- sacsarlm(growth_2010_2011 ~ p_education_expense_2009, data = europe, listw = wEurope)

summary(europe_sarar_ml)
## 
## Call:sacsarlm(formula = growth_2010_2011 ~ p_education_expense_2009, 
##     data = europe, listw = wEurope)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0762434 -0.0142667 -0.0045352  0.0076759  0.0888500 
## 
## Type: sac 
## Coefficients: (asymptotic standard errors) 
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              0.31179343 0.19882464  1.5682  0.11684
## p_education_expense_2009 0.00093129 0.00055315  1.6836  0.09226
## 
## Rho: 0.68248
## Asymptotic standard error: 0.19268
##     z-value: 3.5421, p-value: 0.00039691
## Lambda: -0.47927
## Asymptotic standard error: 0.28655
##     z-value: -1.6726, p-value: 0.094412
## 
## LR test value: 4.049, p-value: 0.13206
## 
## Log likelihood: 50.72507 for sac model
## ML residual variance (sigma squared): 0.00099499, (sigma: 0.031543)
## Number of observations: 27 
## Number of parameters estimated: 5 
## AIC: -91.45, (AIC for lm: -91.401)

The coefficients are not significant.

\(\rho\) is significant, but not \(\lambda\).

The SEM

  • The model corrects for spatial autocorrelation due to the use of spatial data. We do not know the structure of the spatial relationships. The error term accounts for unobservable features or omitted variables associated with a location in the real estate model, for instance.
  • \(\rho = 0\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
  • We test the impact of the percentage of GDP devoted to education on the growth of GDP per capita.
europe_sem_ml <- errorsarlm(growth_2010_2011 ~ p_education_expense_2009, data = europe, listw = wEurope)

summary(europe_sem_ml)
## 
## Call:errorsarlm(formula = growth_2010_2011 ~ p_education_expense_2009, 
##     data = europe, listw = wEurope)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0907286 -0.0167880 -0.0040583  0.0115190  0.1049786 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              1.02893540 0.02875936  35.777  < 2e-16
## p_education_expense_2009 0.00143114 0.00080309   1.782  0.07474
## 
## Lambda: 0.31614, LR test value: 2.4487, p-value: 0.11762
## Asymptotic standard error: 0.18784
##     z-value: 1.683, p-value: 0.092374
## Wald statistic: 2.8325, p-value: 0.092374
## 
## Log likelihood: 49.92492 for error model
## ML residual variance (sigma squared): 0.0013927, (sigma: 0.037319)
## Number of observations: 27 
## Number of parameters estimated: 4 
## AIC: -91.85, (AIC for lm: -91.401)

\(\beta_1\) and \(\lambda\) are not significant.

The SLM

  • It focuses on the spatial interactions. We ‘know’ the structure of the spatial relationships.
  • \(\rho\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
  • We test the impact of the percentage of GDP devoted to education on the growth of GDP per capita.
europe_slm_ml <- lagsarlm(growth_2010_2011 ~ p_education_expense_2009, data = europe, listw = wEurope)

summary(europe_slm_ml)
## 
## Call:lagsarlm(formula = growth_2010_2011 ~ p_education_expense_2009, 
##     data = europe, listw = wEurope)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0882519 -0.0175305 -0.0027772  0.0102252  0.1061910 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              0.67489998 0.19028461  3.5468  0.00039
## p_education_expense_2009 0.00133422 0.00071315  1.8709  0.06136
## 
## Rho: 0.33223, LR test value: 2.9646, p-value: 0.085103
## Asymptotic standard error: 0.18056
##     z-value: 1.84, p-value: 0.065761
## Wald statistic: 3.3858, p-value: 0.065761
## 
## Log likelihood: 50.18291 for lag model
## ML residual variance (sigma squared): 0.0013603, (sigma: 0.036883)
## Number of observations: 27 
## Number of parameters estimated: 4 
## AIC: -92.366, (AIC for lm: -91.401)
## LM test for residual autocorrelation
## test value: 0.8004, p-value: 0.37097

\(\beta_1\) and \(\rho\) are not significant. However, it is the best model, with the lowest AIC.


The heteroscedastic SARAR – Parametric

The SARAR model above was based on the homoscedasticity of the residuals.

We are going to relax this assumption. Let us estimate the same SARAR model, but considering a space-varying error variance using the Generalized 2-Stage Least Squares (G2SLS).

We test the impact of the percentage of GDP devoted to education on the growth of GDP per capita.

library(sphet)

europe_sarar_gs2sls2 <- gstslshet(growth_2010_2011 ~ p_education_expense_2009, data = europe, listw = wEurope)

summary(europe_sarar_gs2sls2)
## 
##  Generalized stsls
## 
## Call:
## gstslshet(formula = growth_2010_2011 ~ p_education_expense_2009, 
##     data = europe, listw = wEurope)
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.090765 -0.016785  0.000698  0.000415  0.014464  0.108209 
## 
## Coefficients:
##                            Estimate Std. Error t-value Pr(>|t|)
## (Intercept)               0.5084726  0.7745943  0.6564   0.5115
## p_education_expense_2009  0.0011627  0.0009185  1.2658   0.2056
## lambda                    0.4922055  0.7467230  0.6592   0.5098
## rho                      -0.4830959  0.6418456 -0.7527   0.4517
## 
## Wald test that rho and lambda are both zero:
##  Statistics: 0.0001582 p-val: 0.98996

Compared to the other SARAR, the coefficients do not change much and keep the same signs, although their significance levels deteriorate.

Note that the sphet package inverts the names of \(\rho\) and \(\lambda\) (its rho corresponds to the other package's lambda and vice versa).

The Wald test clearly fails to reject the hypothesis that both parameters are equal to 0: \(\rho = 0\) (the spatial contagion effect) and \(\lambda = 0\) (the spatial correlation, or error correlation, parameter).

summary(europe_sarar_ml)
## 
## Call:sacsarlm(formula = growth_2010_2011 ~ p_education_expense_2009, 
##     data = europe, listw = wEurope)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0762434 -0.0142667 -0.0045352  0.0076759  0.0888500 
## 
## Type: sac 
## Coefficients: (asymptotic standard errors) 
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              0.31179343 0.19882464  1.5682  0.11684
## p_education_expense_2009 0.00093129 0.00055315  1.6836  0.09226
## 
## Rho: 0.68248
## Asymptotic standard error: 0.19268
##     z-value: 3.5421, p-value: 0.00039691
## Lambda: -0.47927
## Asymptotic standard error: 0.28655
##     z-value: -1.6726, p-value: 0.094412
## 
## LR test value: 4.049, p-value: 0.13206
## 
## Log likelihood: 50.72507 for sac model
## ML residual variance (sigma squared): 0.00099499, (sigma: 0.031543)
## Number of observations: 27 
## Number of parameters estimated: 5 
## AIC: -91.45, (AIC for lm: -91.401)

Consequence

Between the two models, the homoscedastic ML regression yields the more significant coefficients.

# Homoscedastic
data.frame(europe_sarar_ml$coefficients)
##                          europe_sarar_ml.coefficients
## (Intercept)                              0.3117934317
## p_education_expense_2009                 0.0009312885
# Heteroscedastic, parametric
data.frame(europe_sarar_gs2sls2$coefficients)
##                          europe_sarar_gs2sls2.coefficients
## (Intercept)                                    0.508472639
## p_education_expense_2009                       0.001162651
## lambda                                         0.492205516
## rho                                           -0.483095933

The Probit models

We consider additional data related to the intensity of Hi-tec exports. In particular, we classify each Member State as “high-intensity” if its percentage of Hi-tec exports is greater than 18% of GDP.
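The classification rule can be sketched on the seven countries shown in the extract; the export percentages below are taken from the table.

```r
# Derive the 0/1 hitec_intensity dummy from p_hitec_exports,
# using the 18% threshold described in the text.
# Values: BE, BG, CZ, LT, HU, NL, AT (from the extract)
p_hitec_exports <- c(8.8, 4.6, 15.2, 5.8, 22.2, 18.4, 11.7)
hitec_intensity <- as.integer(p_hitec_exports > 18)
hitec_intensity
```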

europe[c(1:3, 12:15), c(1,2,5,6)]
##    country_code        country p_hitec_exports hitec_intensity
## 1            BE        Belgium             8.8               0
## 2            BG       Bulgaria             4.6               0
## 3            CZ Czech Republic            15.2               0
## 12           LT      Lithuania             5.8               0
## 13           HU        Hungary            22.2               1
## 14           NL    Netherlands            18.4               1
## 15           AT        Austria            11.7               0

The Probit using GLM

europe_probit <- glm(hitec_intensity ~ p_hitec_exports, data = europe, family = binomial(link="probit"))

summary(europe_probit)
## 
## Call:
## glm(formula = hitec_intensity ~ p_hitec_exports, family = binomial(link = "probit"), 
##     data = europe)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0042  -0.5150  -0.3132  -0.2660   1.7146  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.18433    0.65988  -3.310 0.000932 ***
## p_hitec_exports  0.07855    0.03490   2.251 0.024400 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.875  on 26  degrees of freedom
## Residual deviance: 19.044  on 25  degrees of freedom
## AIC: 23.044
## 
## Number of Fisher Scoring iterations: 6

The percentage of Hi-tec exports is significant at the 5% level.

We interpret the coefficients as odds ratios: OR = exp(coefficient).

exp(0.07855)
## [1] 1.081717
  • For a 1-unit increase in p_hitec_exports, since OR > 1, the odds that Hi-tec exports exceed 18% of GDP increase.
  • Conversely, for a 1-unit increase, an OR < 1 would mean that those odds decrease.
  • We can also combine all the coefficients to derive a formula computing the predicted probability that Hi-tec exports exceed 18% of GDP.
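The odds-ratio conversion is a one-liner. Note a caveat: OR = exp(coefficient) is exact for logit coefficients; for a probit model, as here, it is only the rough reading the text uses.

```r
# Converting the p_hitec_exports coefficient to an odds ratio.
# (Exact for logit; an informal approximation for probit.)
coef_hitec <- 0.07855
OR <- exp(coef_hitec)
round(OR, 6)  # odds multiply by about 1.08 per 1-unit increase
```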

Compute the W Matrix for Spatial Probits

We can use a row-standardized W matrix based on contiguity.

eu_countries <- seq(1, 27)
nbEurope <- read.gal("GAL/Europe.GAL", region.id = eu_countries)

library(McSpatial)

# instead of nb2listw...
wEurope <- nb2mat(nbEurope, style="W")

The Spatial Probit using ML

library(McSpatial)

europe_ml_sprobit <- spprobitml(hitec_intensity ~ p_hitec_exports, data = europe, wmat = wEurope, stdprobit = FALSE)
## Conditional on rho 
## rho =  0.9589694 
##                   Estimate Std. Error   z-value    Pr(>|z|)
## hitec_intensity -3.2150252 1.23219721 -2.609181 0.009075929
## p_hitec_exports  0.2228363 0.08991608  2.478270 0.013202128
## Unconditional Standard Errors 
##                   Estimate Std. Error    z-value     Pr(>|z|)
## hitec_intensity -3.2150252  8.8912740 -0.3615933 7.176560e-01
## p_hitec_exports  0.2228363  0.6918943  0.3220670 7.474019e-01
## rho              0.9589694  0.1818015  5.2748151 1.328900e-07
## Number of observations =  27
summary(europe_ml_sprobit)
##       Length Class  Mode   
## coef  3      -none- numeric
## logl  1      -none- numeric
## vmat1 4      -none- numeric
## vmat2 9      -none- numeric

We retain the second group of results with \(\rho\). All the coefficients are non-significant, except \(\rho\).

The Spatial Probit using GMM

library(McSpatial)

# Initialize the parameter: rho = rho0
# rho0 can be any value with |rho0| < 1
# Let's pick the value from the model above;
# values above 0.7 do not work, so we lower it to 0.65
rho <- 0.65

europe_gmm_sprobit <- gmmprobit(hitec_intensity ~ p_hitec_exports, data = europe, wmat = wEurope, startrho = rho)
## STANDARD PROBIT ESTIMATES 
## 
## Call:
## glm(formula = form, family = binomial(link = "probit"), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0042  -0.5150  -0.3132  -0.2660   1.7146  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.18433    0.65988  -3.310 0.000932 ***
## p_hitec_exports  0.07855    0.03490   2.251 0.024400 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.875  on 26  degrees of freedom
## Residual deviance: 19.044  on 25  degrees of freedom
## AIC: 23.044
## 
## Number of Fisher Scoring iterations: 6
## 
## [1] 1.00000 1.40538
## [1] 2.0000000 0.8884966
## [1] 3.0000000 0.3118476
## [1] 4.00000000 0.09423677
## [1] 5.0000000000 0.0009816892
## [1] 6.000000e+00 7.282569e-06
## SPATIAL GMM PROBIT ESTIMATES 
##                    Estimate Std. Error    z-value    Pr(>|z|)
## (Intercept)     -1.87396411 0.71164146 -2.6332981 0.008456009
## p_hitec_exports  0.08265415 0.04315649  1.9152194 0.055464532
## WXB              0.38078433 0.59551725  0.6394178 0.522551170
summary(europe_gmm_sprobit)
##      Length Class  Mode   
## coef 3      -none- numeric
## se   3      -none- numeric

We recompute the non-spatial probit model, then the spatial probit model. The \(\rho\) coefficient is represented by WXB. The percentage of Hi-tec exports is barely significant; \(\rho\) is not significant at all.

The Spatial Probit LGMM

library(McSpatial)

europe_lgmm_sprobit <- spprobit(hitec_intensity ~ p_hitec_exports, data = europe, wmat = wEurope)
## 
## Call:
## glm(formula = form, family = binomial(link = "probit"), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0042  -0.5150  -0.3132  -0.2660   1.7146  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.18433    0.65988  -3.310 0.000932 ***
## p_hitec_exports  0.07855    0.03490   2.251 0.024400 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.875  on 26  degrees of freedom
## Residual deviance: 19.044  on 25  degrees of freedom
## AIC: 23.044
## 
## Number of Fisher Scoring iterations: 6
## 
## STANDARD PROBIT ESTIMATES 
## LINEARIZED GMM PROBIT ESTIMATES 
##                 Estimate Std. Error  z-value Pr(>|z|)
## (Intercept)     -1.66315    1.25053 -1.32996  0.18353
## p_hitec_exports  0.07811    0.04216  1.85250  0.06395
## WXB              0.43705    0.86427  0.50569  0.61308
## Number of observations =  27
summary(europe_lgmm_sprobit)
##      Length Class  Mode   
## coef  3     -none- numeric
## se    3     -none- numeric
## u    27     -none- numeric
## gmat 81     -none- numeric

Again, we recompute the non-spatial probit model, then the spatial probit model. The \(\rho\) coefficient is represented by WXB. Neither the percentage of Hi-tec exports nor \(\rho\) is significant.

The results

# GLM
data.frame(GLM = europe_probit$coefficients)
##                         GLM
## (Intercept)     -2.18432894
## p_hitec_exports  0.07854539
# ML
data.frame(ML = europe_ml_sprobit$coef)
##                         ML
## hitec_intensity -3.2150252
## p_hitec_exports  0.2228363
## rho              0.9589694
# GMM
data.frame(GMM = europe_gmm_sprobit$coef)
##           GMM
## 1 -1.87396411
## 2  0.08265415
## 3  0.38078433
# LGMM
data.frame(LGMM = europe_lgmm_sprobit$coef)
##                        LGMM
## (Intercept)     -1.66314941
## p_hitec_exports  0.07810561
## WXB              0.43705193

Of the 3 spatial Probits, only the GMM and LGMM estimates are close to the GLM estimates. Since \(\rho\) is not significant at all, we can rule out any spatial contagion and stick with the simplest of all the models for the coefficients: the GLM. We already interpreted the OR for that model.

Case 7: The determinants of crime in Columbus, Ohio

Spatial models assume that spatial autocorrelation is homoscedastic; the SARAR model, in particular, assumes homoscedastic residuals.

What if the autocorrelation changes from neighbour to neighbour, region to region, country to country?

We can also use a SARAR model with heteroscedastic disturbances. These variations are taken into account by a variance-covariance matrix. The SARAR model:

  • \(\rho\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.

\[y = Z\beta + \rho Wy +u~~~~~~~~|\rho|<1\]

\[u = \lambda Wu + \varepsilon~~~~~~~~|\lambda|<1\]

Where \(E(\varepsilon_i^2)=\sigma_i^2\).

The choice of one model over another should be driven solely by the objective of improving the significance of the coefficients.

The dataset

We estimate a model of the impact of household income and home values on crime rates in the 49 city neighbourhoods of Columbus (Ohio).

The dataset is part of the spdep package (data(columbus)).

The columbus dataset.

head(columbus, 3)
##          AREA PERIMETER COLUMBUS. COLUMBUS.I POLYID NEIG  HOVAL    INC
## 1005 0.309441  2.440629         2          5      1    5 80.467 19.531
## 1001 0.259329  2.236939         3          1      2    1 44.567 21.232
## 1006 0.192468  2.187547         4          6      3    6 26.350 15.956
##         CRIME     OPEN    PLUMB DISCBD     X     Y   AREA NSA NSB EW CP
## 1005 15.72598 2.850747 0.217155   5.03 38.80 44.07 10.391   1   1  1  0
## 1001 18.80175 5.296720 0.320581   4.27 35.62 42.38  8.621   1   1  0  0
## 1006 30.62678 4.534649 0.374404   3.89 39.82 41.18  6.981   1   1  1  0
##      THOUS NEIGNO    PERIM
## 1005  1000   1005 2.440629
## 1001  1000   1001 2.236939
## 1006  1000   1006 2.187547

We find measures of CRIME, INC (income), and HOVAL (household value) per district, along with the district dimensions, long/lat coordinates, and other geographic tags.

The map of Columbus

The 49 neighbourhoods come from the census tracts.

We can plot a network of the 49 neighbourhoods (useful for computing the contiguity among the neighbourhoods).

plot(col.gal.nb, coords, col = 'darkgray')

We can plot the 49 neighbourhoods for a clearer representation.

plot(columbus_map, col = 'lightgray', border = 'darkgray')
text(col_coords, label = sapply(slot(columbus_map, "polygons"), function(i) slot(i, "ID")), cex = 0.9, col = 'blue3')

Finally, we can plot the centroids of the 49 neighbourhoods on a XY grid (useful for computing the distance among the centroids).

plot(columbus$Y ~ columbus$X, ylab = 'Y', xlab = 'X', col = 'blue3')
grid()

The W matrix based on contiguity

The columbus dataset already contains an object for computing a row-standardization weights W matrix. col.gal.nb is derived from a .GAL file.

listw <- nb2listw(col.gal.nb, style = "W")
summary(listw)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 230 
## Percentage nonzero weights: 9.579342 
## Average number of links: 4.693878 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 
##  7  7 13  4  9  6  1  1  1 
## 7 least connected regions:
## 1005 1008 1045 1047 1049 1048 1015 with 2 links
## 1 most connected region:
## 1017 with 10 links
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0       S1       S2
## W 49 2401 49 23.48489 204.6687

The Moran’s I test and the Moran scatterplot

moran.test(columbus$CRIME, listw)
## 
##  Moran I test under randomisation
## 
## data:  columbus$CRIME  
## weights: listw  
## 
## Moran I statistic standard deviate = 5.3427, p-value = 4.578e-08
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.485770914      -0.020833333       0.008991121
moran.plot(columbus$CRIME, listw, col = 'blue3')

We have signs of spatial positive autocorrelation.
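Moran's I can also be computed from first principles. Here is a base-R sketch on a toy vector with a row-standardized W (not moran.test's actual code, and the toy numbers are illustrative):

```r
# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2,
# where z are deviations from the mean and S0 = sum of all weights.
moran_I <- function(x, W) {
  n  <- length(x)
  z  <- x - mean(x)
  S0 <- sum(W)
  (n / S0) * sum(W * outer(z, z)) / sum(z^2)
}

x <- c(1, 2, 4, 8)                       # toy attribute values
W <- matrix(c(0, 1, 0, 0,                # toy contiguity: a chain
              1, 0, 1, 0,                # 1-2-3-4
              0, 1, 0, 1,
              0, 0, 1, 0), 4, 4, byrow = TRUE)
W <- W / rowSums(W)                      # row-standardize
moran_I(x, W)                            # positive: similar values cluster
```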

The SARAR models

The following table shows the head of the metrics for the model: a measure of CRIME per district (the dependent variable) and 2 explanatory variables: HOVAL, the household value per district, and INC, the income per district.

columbus_df <- as.data.frame(columbus[, c(9, 7, 8)])
columbus_df <- columbus_df[1:5,]
knitr::kable(columbus_df, col.names = c('Crime', 'HH Value', 'Income'), align = 'c')
       Crime      HH Value   Income
1005   15.72598   80.467     19.531
1001   18.80175   44.567     21.232
1006   30.62678   26.350     15.956
1002   32.38776   33.200      4.477
1007   50.73151   23.225     11.252

The homoscedastic SARAR

  • \(\rho\) is the moving average parameter, the spatial effect, the smoothing factor or the spatial coefficient.
  • \(\lambda\) is the spatial autoregression parameter, the spatial dependence parameter or the spatial error coefficient.
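
With that notation, the SARAR(1,1) model estimated below combines a spatially lagged dependent variable and a spatially autoregressive error:

\[y = \rho W y + X\beta + u~~~~~~~~~~u = \lambda W u + \varepsilon\]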

We first estimate the SARAR with a constant error variance, using the generalized spatial two-stage least squares (GS2SLS) estimator.

library(spdep)

columbus_sarar_gs2sls <- gstsls(CRIME ~ INC + HOVAL, data = columbus, listw = listw)

summary(columbus_sarar_gs2sls)
## 
## Call:gstsls(formula = CRIME ~ INC + HOVAL, data = columbus, listw = listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -37.80802  -5.63758  -0.23303   6.49483  24.31213 
## 
## Type: GM SARAR estimator
## Coefficients: (GM standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## Rho_Wy       0.455519   0.190156  2.3955  0.016598
## (Intercept) 44.116333  11.237096  3.9260 8.639e-05
## INC         -1.020821   0.393592 -2.5936  0.009498
## HOVAL       -0.265474   0.092974 -2.8554  0.004299
## 
## Lambda: -0.039195
## Residual variance (sigma squared): 107.06, (sigma: 10.347)
## GM argmin sigma squared: 97.038
## Number of observations: 49 
## Number of parameters estimated: 6

All coefficients are significantly different from zero, including \(\rho\), showing that the number of crimes in one census tract is significantly related to that of the neighbouring census tracts.

However, the various neighbourhoods are very different in terms of their size. The hypothesis of constant error variance is rather implausible.

The heteroscedastic SARAR – Parametric

Let us estimate the same SARAR model, but considering a space-varying error variance using the G2SLS.

library(sphet)

columbus_sarar_gs2sls2 <- gstslshet(CRIME ~ INC + HOVAL, data = columbus, listw = listw)

summary(columbus_sarar_gs2sls2)
## 
##  Generalized stsls
## 
## Call:
## gstslshet(formula = CRIME ~ INC + HOVAL, data = columbus, listw = listw)
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -37.468  -5.520  -0.398  -0.005   6.260  24.142 
## 
## Coefficients:
##              Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept) 44.124087   7.506698  5.8780 4.153e-09 ***
## INC         -0.987477   0.460184 -2.1458  0.031886 *  
## HOVAL       -0.275573   0.176910 -1.5577  0.119305    
## lambda       0.452910   0.143270  3.1612  0.001571 ** 
## rho          0.064822   0.305091  0.2125  0.831742    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Wald test that rho and lambda are both zero:
##  Statistics: 3.5907 p-val: 0.058105

The results reveal that there is still a significant relationship of variable CRIME with the variable INC, but no longer with HOVAL.

Note that the sphet package swaps the labels of \(\rho\) and \(\lambda\): in this output, lambda is the spatial lag coefficient and rho the spatial error coefficient.

The contagion effect (\(\rho \neq\) 0 in our earlier notation, labelled lambda in this output) is confirmed with this new setting, while no significant spatial correlation is found in the residuals: the spatial error parameter (labelled rho here) is not significantly different from 0.

The Wald test comes close to rejecting the hypothesis that both parameters are equal to 0 (p ≈ 0.058).

The \(\sigma^2\), the single variance parameter, is no longer estimated.

The heteroscedastic SARAR – Non-parametric

Let us estimate the model using a non-parametric estimation of the variance-covariance matrix: the spatial heteroskedasticity and autocorrelation consistent (HAC) version of the SARAR.

We need 2 extra objects: the matrix of pairwise distances between all the spatial units and the type of kernel function (we opt for the Triangular kernel, although there are other options).

Pairwise distances come from the sphet package (data(coldis)).

There are other non-parametric distributions:

  • Uniform;
  • Epanechnikov (quadratic; bell-shaped, without crossing the y-axis on both sides, with zero tails);
  • Triangular or Bartlett;
  • Bisquare (quadratic; bisquare weight or Tukey’s biweight);
  • Parzen (bell-shape with long tails);
  • Triweight (quadratic);
  • Gaussian or normal;
  • TH or Tukey-Hanning;
  • Cosine;
  • Quadratic spectral.
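
The Triangular (Bartlett) kernel chosen here is easy to sketch in base R: weights decline linearly with distance and vanish beyond the bandwidth (the distances and the bandwidth of 10 below are made up):

```r
# Triangular (Bartlett) kernel: w = 1 - d/bw within the bandwidth, 0 beyond it
triangular_kernel <- function(d, bw) ifelse(d < bw, 1 - d / bw, 0)

# Hypothetical pairwise distances with a bandwidth of 10
d <- c(0, 2.5, 5, 7.5, 12)
triangular_kernel(d, bw = 10)
# [1] 1.00 0.75 0.50 0.25 0.00
```

Pairs beyond the bandwidth receive zero weight, which keeps the HAC variance-covariance estimate sparse.
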
columbus_sarar_hac <- stslshac(CRIME ~ INC + HOVAL, data = columbus, listw = listw, distance = coldis, type = "Triangular")

summary(columbus_sarar_hac)
## 
##  Stsls with Spatial HAC standard errors
## 
## Call:
## stslshac(formula = CRIME ~ INC + HOVAL, data = columbus, listw = listw, 
##     distance = coldis, type = "Triangular")
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -37.83067  -5.43232  -0.44058   6.26723  24.21600 
## 
## Coefficients:
##             Estimate SHAC St.Er. t-value  Pr(>|t|)    
## Wy           0.45464     0.16508  2.7540  0.005886 ** 
## (Intercept) 44.11639     8.07571  5.4628 4.686e-08 ***
## INC         -1.00772     0.51189 -1.9686  0.048994 *  
## HOVAL       -0.26950     0.17867 -1.5084  0.131458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results are substantially similar to those obtained with the parametric estimator although more significant (lower standard errors).

This new framework does not estimate \(\lambda\), only \(\rho\) (reported as Wy, since the spatial lag term is \(\rho Wy\)). The entire error correlation matrix, and not the single parameter \(\lambda\), is now estimated non-parametrically.

Consequence

Among the three models, the non-parametric heteroscedastic regression has the most significant coefficients.

# Homoscedastic
data.frame(columbus_sarar_gs2sls$coefficients)
##             columbus_sarar_gs2sls.coefficients
## Rho_Wy                               0.4555186
## (Intercept)                         44.1163333
## INC                                 -1.0208207
## HOVAL                               -0.2654743
# Heteroscedastic, parametric
data.frame(columbus_sarar_gs2sls2$coefficients)
##             columbus_sarar_gs2sls2.coefficients
## (Intercept)                         44.12408700
## INC                                 -0.98747702
## HOVAL                               -0.27557250
## lambda                               0.45291032
## rho                                  0.06482198
# Heteroscedastic, non-parametric
data.frame(columbus_sarar_hac$coefficients)
##             columbus_sarar_hac.coefficients
## Wy                                0.4546376
## (Intercept)                      44.1163859
## INC                              -1.0077219
## HOVAL                            -0.2695028

We can rule out error correlation, keep the contagion coefficient \(\rho\), and conclude that income per neighbourhood has a slight influence on crime (the higher the income, the lower the crime rate). The contagion effect is important: if the crime rate increases in one neighbourhood, the rate of propagation to the surrounding neighbourhoods is almost 50%. For an increase of 2 crimes in A, B may undergo an increase of 1 crime.
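
That back-of-the-envelope contagion arithmetic can be checked with the spatial multiplier \((I - \rho W)^{-1}\), which accounts for feedback between regions. A base-R sketch with the estimated \(\rho\) = 0.455 on a made-up two-region W (each region the other's only neighbour):

```r
# Hypothetical 2-region row-standardized W: each is the other's only neighbour
W   <- matrix(c(0, 1,
                1, 0), nrow = 2)
rho <- 0.455  # estimated contagion coefficient

# Spatial multiplier: total effect of a shock, feedback loops included
M <- solve(diag(2) - rho * W)
round(M, 3)
#       [,1]  [,2]
# [1,] 1.261 0.574
# [2,] 0.574 1.261

# +2 crimes in region A spill about 1.15 crimes into region B
round(2 * M[2, 1], 2)  # 1.15
```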

Case 8: Luxury houses in Baltimore, Maryland

In many instances, the dependent variable may assume only a discrete number of possible outcomes. We might be interested in the presence or absence of a certain technology in one region as a function of a number of variables. This is an example of a spatial model for binary response variables. Similar situations emerge in explaining consumer choice, electoral behaviour, patient choices, criminal behaviour or modelling count data.

The dataset

In order to illustrate the spatial binary models for discrete choices, let us examine a set of data about house prices and a series of variables related to their value observed in 211 locations in Baltimore.

head(baltimore, 3)
##   STATION PRICE NROOM DWELL NBATH PATIO FIREPL AC BMENT NSTOR GAR AGE
## 1       1    47     4     0   1.0     0      0  0     2     3   0 148
## 2       2   113     7     1   2.5     1      1  1     2     2   2   9
## 3       3   165     7     1   2.5     1      1  0     3     2   2  23
##   CITCOU  LOTSZ  SQFT   X   Y
## 1      0   5.70 11.25 907 534
## 2      1 279.51 28.92 922 574
## 3      1  70.64 30.62 920 581
  • STATION, the location.
  • PRICE, price of the house.
  • NROOM, number of rooms.
  • NBATH, number of bathrooms.
  • AGE, age of construction.
  • SQFT, square feet.
  • and other qualitative variables related to the presence or absence of characteristics (patio, fireplace, air conditioning, basement, garage, etc.).

We transform the variable PRICE into a binary variable, PRICEB, classifying a house as “expensive” (1 if the price is greater than $40K, 0 otherwise).

price_exp <- ifelse(baltimore$PRICE > 40, 1, 0)

baltimore$PRICEB <- price_exp

The map

The map shows the coordinates of the 211 locations. The XY coordinates are useful for computing the distances between neighbours.

The Logit model

We start with a Logit model.

baltimore_logit <- glm(PRICEB ~ NROOM + NBATH + AGE + SQFT, data = baltimore, family = binomial(link = "logit"))

The Probit model

We estimate the Probit, then we compare all the results.

baltimore_probit <- glm(PRICEB ~ NROOM + NBATH + AGE + SQFT, data = baltimore, family = binomial(link = "probit"))
logprob <- data.frame(Logit = baltimore_logit$coefficients,
                      Probit = baltimore_probit$coefficients)

knitr::kable(logprob, align = 'c')
Logit Probit
(Intercept) -3.7878197 -2.1621945
NROOM 0.5262431 0.2528669
NBATH 0.5673503 0.3656256
AGE -0.0518257 -0.0211286
SQFT 0.1072458 0.0579614

The models lead to different estimates of the coefficients, but the signs are the same.

Let us study the details.

summary(baltimore_logit)
## 
## Call:
## glm(formula = PRICEB ~ NROOM + NBATH + AGE + SQFT, family = binomial(link = "logit"), 
##     data = baltimore)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0447  -0.8630  -0.1835   0.8259   3.8935  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.78782    1.01886  -3.718 0.000201 ***
## NROOM        0.52624    0.23213   2.267 0.023392 *  
## NBATH        0.56735    0.35130   1.615 0.106308    
## AGE         -0.05183    0.01214  -4.268 1.97e-05 ***
## SQFT         0.10725    0.03553   3.019 0.002539 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.47  on 210  degrees of freedom
## Residual deviance: 216.63  on 206  degrees of freedom
## AIC: 226.63
## 
## Number of Fisher Scoring iterations: 5
summary(baltimore_probit)
## 
## Call:
## glm(formula = PRICEB ~ NROOM + NBATH + AGE + SQFT, family = binomial(link = "probit"), 
##     data = baltimore)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2042  -0.8954  -0.2502   0.9187   3.8712  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.162195   0.566727  -3.815 0.000136 ***
## NROOM        0.252867   0.128005   1.975 0.048217 *  
## NBATH        0.365626   0.203727   1.795 0.072704 .  
## AGE         -0.021129   0.006286  -3.361 0.000775 ***
## SQFT         0.057961   0.019820   2.924 0.003452 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.47  on 210  degrees of freedom
## Residual deviance: 222.45  on 206  degrees of freedom
## AIC: 232.45
## 
## Number of Fisher Scoring iterations: 8

AGE and SQFT (square feet) come out as the most significant factors in explaining the classification of an expensive house. NROOM (number of rooms) and NBATH (number of bathrooms) appear to be less relevant.

AIC is lower for the Logit; it is the better model of the two.

The Spatial Logit/Probit

We can adapt the Logit/Probit to account for spatial dependence.

In the field, the spatial Probit specification is the most popular; we are going to focus on Probit from now on. There are three estimation strategies:

  • Maximum likelihood (ML).
  • Generalized Method of Moments (GMM).
  • A linearized version of the GMM (LGMM).

If we have a sound model, the three estimators should converge to the same conclusion.

Compute the W matrix based on distance

We can use a row-standardized W matrix based on a minimum threshold distance, defining as neighbours the points falling within a distance of 22 km.

The spatial weight matrix is based on distance (with lower d1 and upper d2 bounds for distance).

Set longlat = TRUE if the point coordinates are longitude-latitude decimal degrees, in which case distances are measured in kilometres. Here, however, we have UTM coordinates in km, so longlat = FALSE.
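
The weights object wBaltimore used below is not constructed in the text; a plausible sketch with spdep follows (the dnearneigh call, the 0–22 km band, and the conversion to a dense matrix for McSpatial are assumptions, not the author's code):

```r
library(spdep)

# Neighbours = points within 22 km; UTM coordinates in km, so longlat = FALSE
coords_balt <- cbind(baltimore$X, baltimore$Y)
balt_nb     <- dnearneigh(coords_balt, d1 = 0, d2 = 22, longlat = FALSE)

# Row-standardized weights, converted to a dense matrix as McSpatial expects
wBaltimore  <- listw2mat(nb2listw(balt_nb, style = "W"))
```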

The Spatial Probit Model using ML

library(McSpatial)

baltimore_ml_sprobit <- spprobitml(PRICEB ~ NROOM + NBATH + AGE + SQFT, data = baltimore, wmat = wBaltimore, stdprobit = FALSE)
## Conditional on rho 
## rho =  0.8164026 
##           Estimate  Std. Error   z-value     Pr(>|z|)
## PRICEB -1.02618685 0.466577317 -2.199393 2.785000e-02
## NROOM   0.21958292 0.143851943  1.526451 1.268976e-01
## NBATH  -0.06945085 0.203631142 -0.341062 7.330569e-01
## AGE    -0.02364809 0.004474888 -5.284622 1.259645e-07
## SQFT    0.04463416 0.020582788  2.168519 3.011924e-02
## Unconditional Standard Errors 
##           Estimate Std. Error    z-value     Pr(>|z|)
## PRICEB -1.02618685 0.61103381 -1.6794273 9.306880e-02
## NROOM   0.21958292 0.17852239  1.2300022 2.186963e-01
## NBATH  -0.06945085 0.24476974 -0.2837395 7.766100e-01
## AGE    -0.02364809 0.00239488 -9.8744375 0.000000e+00
## SQFT    0.04463416 0.02587294  1.7251293 8.450417e-02
## rho     0.81640265 0.11128967  7.3358352 2.202682e-13
## Number of observations =  211
summary(baltimore_ml_sprobit)
##       Length Class  Mode   
## coef   6     -none- numeric
## logl   1     -none- numeric
## vmat1 25     -none- numeric
## vmat2 36     -none- numeric

We retain the second group of results, with the unconditional standard errors and \(\rho\). \(\rho\) and AGE are highly significant, SQFT is significant at the 10% level, while NROOM and NBATH are not (note that the row labelled PRICEB is the intercept).

The Spatial Probit Model using GMM

library(McSpatial)

# init the parameter: rho = rho0
# rho0 means any value |rho0|<1
# Let's pick the value from the model above
rho <- 0.8164026

baltimore_gmm_sprobit <- gmmprobit(PRICEB ~ NROOM + NBATH + AGE + SQFT, data = baltimore, wmat = wBaltimore, startrho = rho)
## STANDARD PROBIT ESTIMATES 
## 
## Call:
## glm(formula = form, family = binomial(link = "probit"), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2042  -0.8954  -0.2502   0.9187   3.8712  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.162195   0.566727  -3.815 0.000136 ***
## NROOM        0.252867   0.128005   1.975 0.048217 *  
## NBATH        0.365626   0.203727   1.795 0.072704 .  
## AGE         -0.021129   0.006286  -3.361 0.000775 ***
## SQFT         0.057961   0.019820   2.924 0.003452 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.47  on 210  degrees of freedom
## Residual deviance: 222.45  on 206  degrees of freedom
## AIC: 232.45
## 
## Number of Fisher Scoring iterations: 8
## 
## [1] 1.000000 1.167471
## [1] 2.000000 0.132197
## [1] 3.00000000 0.06438008
## [1] 4.00000000 0.02795205
## [1] 5.0000000 0.0133001
## [1] 6.00000000 0.01105761
## [1] 7.000000000 0.004463825
## [1] 8.00000000 0.00382946
## [1] 9.000000000 0.001603939
## [1] 10.000000000  0.001328716
## [1] 1.10000e+01 5.76622e-04
## [1] 1.200000e+01 4.637865e-04
## [1] 1.300000e+01 2.067739e-04
## [1] 1.400000e+01 1.623632e-04
## [1] 1.500000e+01 7.402782e-05
## SPATIAL GMM PROBIT ESTIMATES 
##                Estimate  Std. Error     z-value     Pr(>|z|)
## (Intercept) -0.99759481 0.495711838 -2.01244904 4.417262e-02
## NROOM        0.18258203 0.133457133  1.36809494 1.712824e-01
## NBATH        0.01590723 0.199251830  0.07983482 9.363686e-01
## AGE         -0.02040279 0.008355108 -2.44195407 1.460801e-02
## SQFT         0.04038379 0.018317902  2.20460764 2.748164e-02
## WXB          0.78398770 0.162680894  4.81917500 1.441531e-06
summary(baltimore_gmm_sprobit)
##      Length Class  Mode   
## coef 6      -none- numeric
## se   6      -none- numeric

We recompute the non-spatial Probit model, then estimate the spatial Probit model. The \(\rho\) coefficient is represented by WXB. WXB is highly significant; the intercept, AGE and SQFT are significant at the 5% level, while NROOM and NBATH are not.

The Spatial Probit Model using LGMM

library(McSpatial)

baltimore_lgmm_sprobit <- spprobit(PRICEB ~ NROOM + NBATH + AGE + SQFT, data = baltimore, wmat = wBaltimore)
## 
## Call:
## glm(formula = form, family = binomial(link = "probit"), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2042  -0.8954  -0.2502   0.9187   3.8712  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.162195   0.566727  -3.815 0.000136 ***
## NROOM        0.252867   0.128005   1.975 0.048217 *  
## NBATH        0.365626   0.203727   1.795 0.072704 .  
## AGE         -0.021129   0.006286  -3.361 0.000775 ***
## SQFT         0.057961   0.019820   2.924 0.003452 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.47  on 210  degrees of freedom
## Residual deviance: 222.45  on 206  degrees of freedom
## AIC: 232.45
## 
## Number of Fisher Scoring iterations: 8
## 
## STANDARD PROBIT ESTIMATES 
## LINEARIZED GMM PROBIT ESTIMATES 
##             Estimate Std. Error  z-value Pr(>|z|)
## (Intercept) -1.73088    0.54944 -3.15025  0.00163
## NROOM        0.22432    0.11409  1.96621  0.04927
## NBATH        0.21695    0.20039  1.08261  0.27898
## AGE         -0.01437    0.00670 -2.14367  0.03206
## SQFT         0.04102    0.02093  1.96010  0.04998
## WXB          1.04295    0.24204  4.30899  0.00002
## Number of observations =  211
summary(baltimore_lgmm_sprobit)
##      Length Class  Mode   
## coef    6   -none- numeric
## se      6   -none- numeric
## u     211   -none- numeric
## gmat 1266   -none- numeric

Again, we recompute the non-spatial Probit model, then the spatial Probit model, where the \(\rho\) coefficient is represented by WXB. WXB is highly significant; NROOM, AGE and SQFT are significant at the 5% level, while NBATH is not.

The results

# GLM
data.frame(GLM = baltimore_probit$coefficients)
##                     GLM
## (Intercept) -2.16219453
## NROOM        0.25286688
## NBATH        0.36562556
## AGE         -0.02112855
## SQFT         0.05796142
# ML
data.frame(ML = baltimore_ml_sprobit$coef)
##                 ML
## PRICEB -1.02618685
## NROOM   0.21958292
## NBATH  -0.06945085
## AGE    -0.02364809
## SQFT    0.04463416
## rho     0.81640265
# GMM
data.frame(GMM = baltimore_gmm_sprobit$coef)
##           GMM
## 1 -0.99759481
## 2  0.18258203
## 3  0.01590723
## 4 -0.02040279
## 5  0.04038379
## 6  0.78398770
# LGMM
data.frame(LGMM = baltimore_lgmm_sprobit$coef)
##                    LGMM
## (Intercept) -1.73088308
## NROOM        0.22432392
## NBATH        0.21694910
## AGE         -0.01436618
## SQFT         0.04102019
## WXB          1.04295266

In all 3 spatial Probits, \(\rho\), the contagion effect, is highly significant.

In all 4 Probits, AGE comes out as the most important explanatory factor. NROOM (number of rooms) and NBATH (number of bathrooms) are less relevant than in the GLM Probit because part of the spatial variation is captured by \(\rho\), the spatial effect.

The odds-ratio interpretation, OR = exp(coefficient), applies to the Logit coefficients.

# AGE (Logit coefficient)
round(exp(-0.0518257), 4)
## [1] 0.9495
# NBATH (Logit coefficient)
round(exp(0.5673503), 4)
## [1] 1.7636
  • For a 1-unit increase in AGE, the odds (OR < 1) the price is greater than $40K decrease.
  • For a 1-unit increase in NBATH, the odds (OR > 1) the price is greater than $40K increase.
  • We can also combine all coefficients to derive a formula computing the predicted probability the price is greater than $40K.
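
A base-R sketch of that last point, plugging the Logit coefficients estimated above into the inverse-logit function plogis; the example house characteristics are made up:

```r
# Logit coefficients from the summary above: intercept, NROOM, NBATH, AGE, SQFT
b <- c(-3.78782, 0.52624, 0.56735, -0.05183, 0.10725)

# Hypothetical house: 5 rooms, 2 bathrooms, 20 years old, SQFT = 25
x  <- c(1, 5, 2, 20, 25)
xb <- sum(b * x)

# Predicted probability that PRICE > $40K: inverse logit of the linear index
plogis(xb)  # ≈ 0.835
```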

Case 9: Health in Mexico

We want to compare the Matrix Exponential Spatial Specification (MESS) with the specification methods we have covered so far, such as the SLM estimated using ML. On large datasets, the MESS can reduce the burden of computation without sacrificing accuracy.
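
In the MESS, the spatial lag transform \((I - \rho W)y\) of the SLM is replaced by a matrix exponential of the weights matrix:

\[e^{\alpha W} y = X\beta + \varepsilon~~~~~~~~~~e^{\alpha W} = \sum_{k=0}^{\infty} \frac{\alpha^k W^k}{k!}\]

Since W has a zero diagonal, \(\det(e^{\alpha W}) = e^{\alpha\,\mathrm{tr}(W)} = 1\), so the log-determinant term that makes ML estimation of the SLM costly on large samples disappears.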

The dataset

We estimate the relationship which explains the number of doctors per 1000 inhabitants in a given region as a function of the number of health staff per 1000 inhabitants, the percentage of population over 65, and the per capita health expenditure in 2000 and in 2010.

head(mexico, 3)
##   code           states number_of_doctors_per_1000_inhabitants
## 1    0 Distrito Federal                                    3.1
## 2    1         Guerrero                                    1.4
## 3    2           Mexico                                    1.0
##   per_capita_health_expenditure_2010 per_capita_health_expenditure_2000
## 1                           9.437522                           0.752100
## 2                           2.572524                           0.427321
## 3                           2.780234                           0.435988
##   health_staff_per_1000_inhabitants p_of_population_over_65
## 1                              15.3                7.771425
## 2                               5.5                6.917765
## 3                               4.7                4.911075

The map

This is a view of the 32 Mexican states.

# id centroids
coords <- coordinates(Mexico)

# names
state_no <- data.frame(No = seq(1,32), State = sapply(slot(Mexico, "polygons"), function(i) slot(i, "ID")))

state_1_16 <- state_no[1:16,]
state_17_32 <- state_no[17:32,]

# plot the map with the ID
plot(Mexico, col = 'lightgray', border = 'darkgray')
text(coords,
     label = state_no$No,
     cex = 0.8,
     col = 'blue3')

knitr::kable(cbind(state_1_16, state_17_32))
No State No State
1 Aguascalientes 17 Morelos
2 Baja California 18 Nayarit
3 Baja California Sur 19 Nuevo León
4 Campeche 20 Oaxaca
5 Chiapas 21 Puebla
6 Chihuahua 22 Querétaro
7 Coahuila 23 Quintana Roo
8 Colima 24 San Luis Potosí
9 Distrito Federal 25 Sinaloa
10 Durango 26 Sonora
11 Guanajuato 27 Tabasco
12 Guerrero 28 Tamaulipas
13 Hidalgo 29 Tlaxcala
14 Jalisco 30 Veracruz
15 México 31 Yucatán
16 Michoacán 32 Zacatecas

The W matrix

There are different methods for deriving the W matrix. One method might be better than another. The point here is not to find the perfect method, but to demonstrate that the MESS can reduce the burden of computation without sacrificing accuracy when estimating coefficients.

From the spatial object Mexico, we can compute a contiguity-based neighbours list, a minimum threshold distance neighbours list or a k-nearest neighbours list. We choose the latter as it is a quick method.

We set the parameter to k=3 or three neighbours and we consider a row-standardized (style = "W") weight matrix.

coords <- coordinates(Mexico)
IDs <- row.names(as(Mexico, "data.frame"))
Mexico_kn1 <- knn2nb(knearneigh(coords, k=3), row.names=IDs)

wmexico <- nb2listw(Mexico_kn1, style = "W")

The Spatial Lag Model (SLM)

We estimate a SLM using both the ML and the MESS approximation (below).

We wrap a function around the model.

SLMfunction <- function(){
mexico_lag_ml <- lagsarlm(number_of_doctors_per_1000_inhabitants ~ health_staff_per_1000_inhabitants + per_capita_health_expenditure_2000 + per_capita_health_expenditure_2010 + p_of_population_over_65, data = mexico, listw = wmexico)

return(summary(mexico_lag_ml))
}

We run the function 10 times and compute the mean execution time.

SLMtime.taken <- rep(0, 10)

for (i in 1:10) {
start.time <- Sys.time()
SLMfunction()
end.time <- Sys.time()
SLMtime.taken[i] <- end.time - start.time
}

mean(SLMtime.taken)
## [1] 0.2026995

The Matrix Exponential Spatial Specification (MESS)

We wrap another function around the second model.

MESSfunction <- function(){
mexico_mess <- lagmess(number_of_doctors_per_1000_inhabitants ~ health_staff_per_1000_inhabitants + per_capita_health_expenditure_2000 + per_capita_health_expenditure_2010 + p_of_population_over_65, data = mexico, listw = wmexico)

return(summary(mexico_mess))
}

We run the function and compute the execution time.

MESStime.taken <- rep(0, 10)

for (i in 1:10) {
start.time <- Sys.time()
MESSfunction()
end.time <- Sys.time()
MESStime.taken[i] <- end.time - start.time
}

mean(MESStime.taken)
## [1] 0.106505
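
The source of the speed-up can be seen by building \(e^{\alpha W}\) with a truncated power series in base R: only matrix products are needed, and with a zero-diagonal W the determinant is exactly 1, so no log-determinant has to be evaluated (the 3×3 W and the \(\alpha\) below are made up):

```r
# Truncated power series for the matrix exponential: sum_k (alpha^k W^k) / k!
mat_exp <- function(W, alpha, order = 20) {
  S    <- diag(nrow(W))  # k = 0 term
  term <- diag(nrow(W))
  for (k in 1:order) {
    term <- (alpha / k) * (term %*% W)  # next term of the series
    S    <- S + term
  }
  S
}

# Hypothetical row-standardized W for 3 regions
W <- matrix(c(0, 0.5, 0.5,
              1, 0,   0,
              1, 0,   0), nrow = 3, byrow = TRUE)

S <- mat_exp(W, alpha = 0.0175)
det(S)  # ≈ 1: no log-determinant term to compute
```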

Coefficients

SLM

summary(mexico_lag_ml)
## 
## Call:
## lagsarlm(formula = number_of_doctors_per_1000_inhabitants ~ health_staff_per_1000_inhabitants + 
##     per_capita_health_expenditure_2000 + per_capita_health_expenditure_2010 + 
##     p_of_population_over_65, data = mexico, listw = wmexico)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.1767777 -0.0896541  0.0023072  0.0667111  0.3543029 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                                     Estimate Std. Error z value  Pr(>|z|)
## (Intercept)                         0.023714   0.213666  0.1110 0.9116256
## health_staff_per_1000_inhabitants   0.296077   0.028484 10.3947 < 2.2e-16
## per_capita_health_expenditure_2000 -0.481055   0.233260 -2.0623 0.0391776
## per_capita_health_expenditure_2010 -0.158543   0.043447 -3.6491 0.0002631
## p_of_population_over_65             0.067847   0.021292  3.1865 0.0014403
## 
## Rho: -0.017449, LR test value: 0.037148, p-value: 0.84716
## Asymptotic standard error: 0.08785
##     z-value: -0.19862, p-value: 0.84256
## Wald statistic: 0.03945, p-value: 0.84256
## 
## Log likelihood: 23.62746 for lag model
## ML residual variance (sigma squared): 0.013371, (sigma: 0.11563)
## Number of observations: 32 
## Number of parameters estimated: 7 
## AIC: -33.255, (AIC for lm: -35.218)
## LM test for residual autocorrelation
## test value: 0.43245, p-value: 0.51079

MESS

summary(mexico_mess)
## Matrix exponential spatial lag model:
## 
## Call:
## lagmess(formula = number_of_doctors_per_1000_inhabitants ~ health_staff_per_1000_inhabitants + 
##     per_capita_health_expenditure_2000 + per_capita_health_expenditure_2010 + 
##     p_of_population_over_65, data = mexico, listw = wmexico)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.176785 -0.089686  0.002314  0.066710  0.354291 
## 
## Coefficients:
##                                     Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                         0.023962   0.148017  0.1619  0.872599
## health_staff_per_1000_inhabitants   0.296086   0.030331  9.7620 2.375e-10
## per_capita_health_expenditure_2000 -0.481065   0.252626 -1.9043  0.067583
## per_capita_health_expenditure_2010 -0.158548   0.047231 -3.3568  0.002356
## p_of_population_over_65             0.067848   0.023180  2.9270  0.006866
## 
## Residual standard error: 0.12589 on 27 degrees of freedom
## Multiple R-squared:  0.9317, Adjusted R-squared:  0.92159 
## F-statistic: 92.084 on 4 and 27 DF,  p-value: 2.4958e-15
## 
## Alpha: 0.01747, standard error: 0.090339
##     z-value: 0.19338, p-value: 0.84666
## LR test value: 0.037256, p-value: 0.84694
## Implied rho: -0.01762326

Comparing the results, we notice that the MESS specification substantially confirms the conclusions of the SLM estimation (using ML) in terms of both the sign and the significance of the parameters.
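
The "implied rho" printed by lagmess comes from the correspondence \(\rho = 1 - e^{\alpha}\) between the two specifications; a quick base-R check with the estimated \(\alpha\):

```r
alpha <- 0.01747  # MESS estimate from the summary above

# Implied SLM-style rho
1 - exp(alpha)
# ≈ -0.017623, matching the reported -0.01762326 up to the rounding of alpha
```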

health_staff_per_1000_inhabitants and p_of_population_over_65 are the most significant factors explaining the geographical distribution of doctors in the 32 states. We can rule out per_capita_health_expenditure_2010 and per_capita_health_expenditure_2000 because of the wrong signs or the lack of strong statistical significance.

In both cases, the \(\rho\) parameter is negative and not significantly different from 0. The standard errors of the two estimations are also remarkably similar.

There is no spatial autocorrelation influencing number_of_doctors_per_1000_inhabitants. The latter is mostly explained by health_staff_per_1000_inhabitants and p_of_population_over_65, not by health expenditures.

Time difference

time.taken <- data.frame(SLMtime.taken, MESStime.taken)
knitr::kable(rbind(time.taken, mean = colMeans(time.taken)))
SLMtime.taken MESStime.taken
1 0.2688236 0.0880284
2 0.2000403 0.1149104
3 0.1955986 0.1142478
4 0.1889908 0.1037061
5 0.1926386 0.1056063
6 0.1932580 0.1135113
7 0.1962404 0.1017318
8 0.1958611 0.1061468
9 0.1965327 0.1054802
10 0.1990106 0.1116807
mean 0.2026995 0.1065050

With such a small sample, the difference in mean computing times is obviously negligible in absolute terms. However, the example shows that on big datasets the MESS specification can substantially reduce the computational burden without loss of accuracy of the estimates; the time gain extrapolates to Big Data.



  1. Other formats: ESRI file geodatabases, shapefiles, GeoPackage, SpatiaLite and Google Earth kmz. See the website.

  2. Robust regression methods are designed not to be overly affected by violations of assumptions. For example, a few outliers may affect the normality of the residuals. Robust methods minimize the effect of a few outliers, much like a median, which, unlike the mean, is not influenced by a few large values in the observations.