Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets and results.
  • Source: ‘DrivenData Water Pumps Challenge’ from DrivenData.


1, Intro to DrivenData Water Pumps Challenge

Data Mining the Water Table

Using data from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don’t work at all?

This is an intermediate-level practice competition by DrivenData1.

Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which water points will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

Let’s start with loading in the training and testing set into our R environment. We will use the training set to build our model, and the test set to validate it. Here is a quick explanation of the 3 data frames that we will import:

  • train_values corresponds to the independent variables for the training set.
  • train_labels contains the dependent variable (status_group) for each of the rows in train_values.
  • test_values is the independent variables that need predictions.

Understanding our data

# Check it out
str(train_values)
## 'data.frame':    59400 obs. of  40 variables:
##  $ id                   : int  69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
##  $ amount_tsh           : num  6000 0 25 0 0 20 0 0 0 0 ...
##  $ date_recorded        : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 48 310 301 273 105 47 187 195 220 126 ...
##  $ funder               : Factor w/ 1898 levels "","0","Aar","Abasia",..: 1371 471 827 1741 21 987 347 1418 1826 594 ...
##  $ gps_height           : int  1390 1399 686 263 0 0 0 0 0 0 ...
##  $ installer            : Factor w/ 2146 levels "","-","0","AAR",..: 1557 593 2116 1924 119 433 446 433 2039 119 ...
##  $ longitude            : num  34.9 34.7 37.5 38.5 31.1 ...
##  $ latitude             : num  -9.86 -2.15 -3.82 -11.16 -1.83 ...
##  $ wpt_name             : Factor w/ 37400 levels "24","Abass","Abbas",..: 32435 37197 15288 37293 34797 36139 21942 36485 23155 22872 ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : Factor w/ 9 levels "Internal","Lake Nyasa",..: 2 5 6 8 5 6 1 4 4 5 ...
##  $ subvillage           : Factor w/ 19288 levels "","##","1","14Kambalage",..: 11808 15840 9069 8977 7700 11812 3798 16345 3436 11439 ...
##  $ region               : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 4 10 9 13 5 21 18 18 20 5 ...
##  $ region_code          : int  11 20 21 90 18 4 17 17 14 18 ...
##  $ district_code        : int  5 2 4 63 1 8 3 3 6 1 ...
##  $ lga                  : Factor w/ 125 levels "Arusha Rural",..: 52 104 109 88 27 69 105 26 116 27 ...
##  $ ward                 : Factor w/ 2092 levels "Aghondi","Akheri",..: 1427 1577 1625 1572 1688 1323 1853 151 435 498 ...
##  $ population           : int  109 280 250 58 0 1 0 0 0 0 ...
##  $ public_meeting       : Factor w/ 3 levels "","False","True": 3 1 3 3 3 3 3 3 3 3 ...
##  $ recorded_by          : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scheme_management    : Factor w/ 13 levels "","Company","None",..: 9 4 9 9 1 9 9 1 9 1 ...
##  $ scheme_name          : Factor w/ 2697 levels "","14 Kambarage",..: 2290 1 2168 1 1 2692 1 1 1 1 ...
##  $ permit               : Factor w/ 3 levels "","False","True": 2 3 3 3 3 3 3 3 3 3 ...
##  $ construction_year    : int  1999 2010 2009 1986 0 2009 0 0 0 0 ...
##  $ extraction_type      : Factor w/ 18 levels "afridev","cemo",..: 4 4 4 15 4 15 16 9 5 9 ...
##  $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 2 2 2 11 2 11 12 6 3 6 ...
##  $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 1 1 1 6 1 6 2 2 2 2 ...
##  $ management           : Factor w/ 12 levels "company","other",..: 8 12 8 8 2 8 8 12 8 8 ...
##  $ management_group     : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 2 5 5 5 5 5 ...
##  $ payment              : Factor w/ 7 levels "never pay","other",..: 3 1 5 1 1 5 1 7 1 1 ...
##  $ payment_type         : Factor w/ 7 levels "annually","monthly",..: 1 3 6 3 3 6 3 7 3 3 ...
##  $ water_quality        : Factor w/ 8 levels "coloured","fluoride",..: 7 7 7 7 7 5 7 4 5 7 ...
##  $ quality_group        : Factor w/ 6 levels "colored","fluoride",..: 3 3 3 3 3 5 3 4 5 3 ...
##  $ quantity             : Factor w/ 5 levels "dry","enough",..: 2 3 2 1 4 2 2 2 4 2 ...
##  $ quantity_group       : Factor w/ 5 levels "dry","enough",..: 2 3 2 1 4 2 2 2 4 2 ...
##  $ source               : Factor w/ 10 levels "dam","hand dtw",..: 9 6 1 4 6 5 4 8 4 8 ...
##  $ source_type          : Factor w/ 7 levels "borehole","dam",..: 7 4 2 1 4 3 1 6 1 6 ...
##  $ source_class         : Factor w/ 3 levels "groundwater",..: 1 2 2 1 2 3 1 1 1 1 ...
##  $ waterpoint_type      : Factor w/ 7 levels "cattle trough",..: 2 2 3 3 2 3 5 5 5 5 ...
##  $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 2 2 2 2 2 2 4 4 4 4 ...
str(train_labels)
## 'data.frame':    59400 obs. of  2 variables:
##  $ id          : int  69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
##  $ status_group: Factor w/ 3 levels "functional","functional needs repair",..: 1 1 1 3 1 1 3 3 3 1 ...
str(test_values)
## 'data.frame':    14850 obs. of  40 variables:
##  $ id                   : int  50785 51630 17168 45559 49871 52449 24806 28965 36301 54122 ...
##  $ amount_tsh           : num  0 0 0 0 500 0 0 0 30 0 ...
##  $ date_recorded        : Factor w/ 331 levels "2001-03-26","2004-01-04",..: 256 256 253 243 307 284 35 246 244 298 ...
##  $ funder               : Factor w/ 981 levels "","0","Aar","Abasia",..: 177 253 1 223 75 253 253 224 893 417 ...
##  $ gps_height           : int  1996 1569 1567 267 1260 1685 550 234 584 1083 ...
##  $ installer            : Factor w/ 1092 levels "","0","AAR","ABASIA",..: 232 247 1 274 90 247 305 275 491 487 ...
##  $ longitude            : num  35.3 36.7 34.8 38.1 35 ...
##  $ latitude             : num  -4.06 -3.31 -5 -9.42 -10.95 ...
##  $ wpt_name             : Factor w/ 10840 levels "21","24","Abada",..: 656 1778 9675 5943 6054 7901 9302 4990 5803 3099 ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : Factor w/ 9 levels "Internal","Lake Nyasa",..: 1 6 1 8 8 6 7 8 8 6 ...
##  $ subvillage           : Factor w/ 8444 levels "","1","10C","14Kambarage",..: 3915 2719 5399 2797 3546 4318 2429 1373 203 816 ...
##  $ region               : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 9 1 19 8 17 1 4 13 13 7 ...
##  $ region_code          : int  21 2 13 80 10 2 11 9 90 3 ...
##  $ district_code        : int  3 2 2 43 3 2 7 4 33 7 ...
##  $ lga                  : Factor w/ 125 levels "Arusha Rural",..: 64 1 110 50 62 1 35 117 89 107 ...
##  $ ward                 : Factor w/ 1959 levels "Aghondi","Akheri",..: 18 651 1678 1193 1076 347 932 1049 1218 1778 ...
##  $ population           : int  321 300 500 250 60 200 600 1 40 1 ...
##  $ public_meeting       : Factor w/ 3 levels "","False","True": 3 3 3 1 1 3 3 3 3 3 ...
##  $ recorded_by          : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scheme_management    : Factor w/ 12 levels "","Company","Other",..: 4 8 8 8 10 8 8 10 8 10 ...
##  $ scheme_name          : Factor w/ 1790 levels "","14 Kambarage",..: 1 1670 1480 1 99 1647 1 95 1271 36 ...
##  $ permit               : Factor w/ 3 levels "","False","True": 3 3 1 3 3 3 3 3 2 3 ...
##  $ construction_year    : int  2012 2000 2010 1987 2000 1990 2007 1982 1997 2003 ...
##  $ extraction_type      : Factor w/ 17 levels "afridev","cemo",..: 10 4 10 10 4 4 5 14 4 4 ...
##  $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 7 2 7 7 2 2 3 11 2 2 ...
##  $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 4 1 4 4 1 1 2 6 1 1 ...
##  $ management           : Factor w/ 12 levels "company","other",..: 4 8 8 8 10 8 8 8 8 10 ...
##  $ management_group     : Factor w/ 5 levels "commercial","other",..: 3 5 5 5 5 5 5 5 5 5 ...
##  $ payment              : Factor w/ 7 levels "never pay","other",..: 1 1 1 7 4 1 1 1 5 4 ...
##  $ payment_type         : Factor w/ 7 levels "annually","monthly",..: 3 3 3 7 2 3 3 3 6 2 ...
##  $ water_quality        : Factor w/ 8 levels "coloured","fluoride",..: 7 7 7 7 7 7 5 7 7 7 ...
##  $ quality_group        : Factor w/ 6 levels "colored","fluoride",..: 3 3 3 3 3 3 5 3 3 3 ...
##  $ quantity             : Factor w/ 5 levels "dry","enough",..: 4 3 3 1 2 2 2 1 3 2 ...
##  $ quantity_group       : Factor w/ 5 levels "dry","enough",..: 4 3 3 1 2 2 2 1 3 2 ...
##  $ source               : Factor w/ 10 levels "dam","hand dtw",..: 6 9 6 8 9 9 4 4 9 9 ...
##  $ source_type          : Factor w/ 7 levels "borehole","dam",..: 4 7 4 6 7 7 1 1 7 7 ...
##  $ source_class         : Factor w/ 3 levels "groundwater",..: 2 1 2 1 1 1 1 1 1 1 ...
##  $ waterpoint_type      : Factor w/ 7 levels "cattle trough",..: 7 2 7 7 2 2 5 3 2 2 ...
##  $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 6 2 6 6 2 2 4 2 2 2 ...

Water table

To simplify things, it is common to merge the independent values and the dependent labels into one data frame.

The objective of this challenge is to predict whether or not a water pump is functional for the test set. We are given the status of the pumps for the newly merged train data frame as the column status_group.

How many water pumps in the train data frame are functional? In what proportion.

# Merge data frames to create the data frame train
train <- merge(train_labels, train_values)

# Look at the number of pumps in each functional status group
table(train$status_group)
## 
##              functional functional needs repair          non functional 
##                   32259                    4317                   22824
# As proportions
prop.table(table(train$status_group))
## 
##              functional functional needs repair          non functional 
##              0.54308081              0.07267677              0.38424242
# Table of the quantity variable vs the status of the pumps
table(train$quantity, train$status_group)
##               
##                functional functional needs repair non functional
##   dry                 157                      37           6052
##   enough            21648                    2400           9138
##   insufficient       7916                    1450           5763
##   seasonal           2325                     416           1309
##   unknown             213                      14            562
prop.table(table(train$status_group, train$quantity))
##                          
##                                    dry       enough insufficient
##   functional              0.0026430976 0.3644444444 0.1332659933
##   functional needs repair 0.0006228956 0.0404040404 0.0244107744
##   non functional          0.1018855219 0.1538383838 0.0970202020
##                          
##                               seasonal      unknown
##   functional              0.0391414141 0.0035858586
##   functional needs repair 0.0070033670 0.0002356902
##   non functional          0.0220370370 0.0094612795
# As row-wise proportions, quantity vs status_group
# 1 row-wise
# 2 col-wise
prop.table(table(train$quantity, train$status_group), margin = 1)
##               
##                 functional functional needs repair non functional
##   dry          0.025136087             0.005923791    0.968940122
##   enough       0.652323269             0.072319653    0.275357078
##   insufficient 0.523233525             0.095842422    0.380924053
##   seasonal     0.574074074             0.102716049    0.323209877
##   unknown      0.269961977             0.017743980    0.712294043
par(mar = c(5.1, 4.2, 4.1, 2.1))
mosaicplot(prop.table(table(train$quantity, train$status_group), margin = 1), main = '')

# Plot it
axis(2, at = c(0.75,0.5,0.25), labels = c('non funct.', 'needs repair', 'functional'), pos = 0, las = 1, tick = 0, cex.axis = 0.7)

par(mar = c(5.1, 4.1, 4.1, 2.1))

It looks like if the quantity variable is ‘dry’, it is likely that the pump is not functional.

Explore and Visualize

Another great way to explore our data is to create a few visualizations. This can help us better understand the structure and potential limitations of particular variables.

# Load the ggplot package and examine train
library(ggplot2)

str(train)
## 'data.frame':    59400 obs. of  41 variables:
##  $ id                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ status_group         : Factor w/ 3 levels "functional","functional needs repair",..: 3 1 1 1 3 1 3 1 3 3 ...
##  $ amount_tsh           : num  0 0 0 10 0 50 0 0 0 0 ...
##  $ date_recorded        : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 230 39 61 341 56 32 206 57 72 104 ...
##  $ funder               : Factor w/ 1898 levels "","0","Aar","Abasia",..: 1635 1499 840 438 213 1272 456 1825 1825 281 ...
##  $ gps_height           : int  0 1978 0 1639 0 28 0 0 0 0 ...
##  $ installer            : Factor w/ 2146 levels "","-","0","AAR",..: 1806 1690 1001 230 278 1461 573 566 289 375 ...
##  $ longitude            : num  33.1 34.8 36.1 37.1 36.2 ...
##  $ latitude             : num  -5.12 -9.4 -6.28 -3.19 -6.1 ...
##  $ wpt_name             : Factor w/ 37400 levels "24","Abass","Abbas",..: 29563 32435 1063 445 2598 21496 28856 4091 9217 35406 ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : Factor w/ 9 levels "Internal","Lake Nyasa",..: 4 7 9 6 9 9 1 7 9 5 ...
##  $ subvillage           : Factor w/ 19288 levels "","##","1","14Kambalage",..: 9069 8905 17876 18716 8618 13587 14740 13625 10144 10815 ...
##  $ region               : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 20 4 3 7 3 15 18 3 3 5 ...
##  $ region_code          : int  14 11 1 3 1 60 17 1 1 18 ...
##  $ district_code        : int  3 4 4 5 4 43 3 1 5 8 ...
##  $ lga                  : Factor w/ 125 levels "Arusha Rural",..: 125 92 12 17 12 70 105 77 15 13 ...
##  $ ward                 : Factor w/ 2092 levels "Aghondi","Akheri",..: 316 2049 1357 1091 1003 2061 2026 2079 392 1711 ...
##  $ population           : int  0 20 0 25 0 6922 0 0 0 0 ...
##  $ public_meeting       : Factor w/ 3 levels "","False","True": 1 3 3 3 3 3 3 3 3 3 ...
##  $ recorded_by          : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scheme_management    : Factor w/ 13 levels "","Company","None",..: 9 1 9 11 9 6 9 9 9 9 ...
##  $ scheme_name          : Factor w/ 2697 levels "","14 Kambarage",..: 1 1 1461 1091 1 1 1 1141 526 1326 ...
##  $ permit               : Factor w/ 3 levels "","False","True": 3 2 3 3 3 2 3 3 2 3 ...
##  $ construction_year    : int  0 2008 0 1999 0 0 0 0 0 0 ...
##  $ extraction_type      : Factor w/ 18 levels "afridev","cemo",..: 1 13 8 4 9 15 10 10 8 8 ...
##  $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 1 10 5 2 6 11 7 7 5 5 ...
##  $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 2 5 3 1 2 6 4 4 3 3 ...
##  $ management           : Factor w/ 12 levels "company","other",..: 8 8 8 10 8 5 8 8 5 8 ...
##  $ management_group     : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 5 1 5 5 1 5 ...
##  $ payment              : Factor w/ 7 levels "never pay","other",..: 7 1 5 5 7 5 1 1 5 1 ...
##  $ payment_type         : Factor w/ 7 levels "annually","monthly",..: 7 3 6 6 7 6 3 3 6 3 ...
##  $ water_quality        : Factor w/ 8 levels "coloured","fluoride",..: 4 7 7 7 7 7 7 4 7 7 ...
##  $ quality_group        : Factor w/ 6 levels "colored","fluoride",..: 4 3 3 3 3 3 3 4 3 3 ...
##  $ quantity             : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
##  $ quantity_group       : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
##  $ source               : Factor w/ 10 levels "dam","hand dtw",..: 8 8 4 9 8 4 8 9 4 4 ...
##  $ source_type          : Factor w/ 7 levels "borehole","dam",..: 6 6 1 7 6 1 6 7 1 1 ...
##  $ source_class         : Factor w/ 3 levels "groundwater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ waterpoint_type      : Factor w/ 7 levels "cattle trough",..: 5 5 3 2 5 3 7 6 3 3 ...
##  $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 4 4 2 2 4 2 6 5 2 2 ...
# Create bar plot for quantity
qplot(quantity, data = train, geom = "bar", fill = status_group) + 
  theme(legend.position = "top") + 
  theme(axis.text.x = element_text(angle = -20, hjust = 0))

prop.table(table(train$status_group, train$quantity))
##                          
##                                    dry       enough insufficient
##   functional              0.0026430976 0.3644444444 0.1332659933
##   functional needs repair 0.0006228956 0.0404040404 0.0244107744
##   non functional          0.1018855219 0.1538383838 0.0970202020
##                          
##                               seasonal      unknown
##   functional              0.0391414141 0.0035858586
##   functional needs repair 0.0070033670 0.0002356902
##   non functional          0.0220370370 0.0094612795

From the graph and the table, we can conclude that the ‘dry’ wells appears to be mostly non-functional (not functional, not functional with repair needs at the lowest).

# Create bar plot for quality_group, the quality of the water
qplot(quality_group, data = train, geom = "bar", fill = status_group) + 
  theme(legend.position = "top") + 
  theme(axis.text.x = element_text(angle = -20, hjust = 0))

# Create bar plot for waterpoint_type, the kind of water point
qplot(waterpoint_type, data = train, geom = "bar", fill = status_group) + 
  theme(legend.position = "top") + 
  theme(axis.text.x = element_text(angle = -20, hjust = 0))

# extraction_type_class - the kind of extraction the water point uses
qplot(extraction_type_class, data = train, geom = "bar", fill = status_group) + 
  theme(legend.position = "top") + 
  theme(axis.text.x = element_text(angle = -20, hjust = 0))

# payment - What the water costs
qplot(payment, data = train, geom = "bar", fill = status_group) + 
  theme(legend.position = "top") + 
  theme(axis.text.x = element_text(angle = -20, hjust = 0))

Description of the variables.

Which Well Quantities are Functional?

# Create a histogram for `construction_year` grouped by `status_group`
ggplot(train, aes(x = construction_year)) + 
  geom_histogram(bins = 20) + 
  facet_grid( ~ status_group)

# Now subsetting when construction_year is larger than 0
ggplot(subset(train, construction_year > 0), aes(x = construction_year)) +
  geom_histogram(bins = 20) + 
  facet_grid( ~ status_group)

Continuous Variable Viz

We made some great plots that compared some categorical variables based on the well status. Now we can look at some ordinal or continuous variables.

# Create a histogram for `construction_year` grouped by `status_group`
ggplot(train, aes(x = construction_year)) + 
  geom_histogram(bins = 20) + 
  facet_grid( ~ status_group)

As we can see, the first plot showed us that there were a lot of missing values coded as 0’s.

ggplot(subset(train, construction_year > 0), aes(x = construction_year)) +
  geom_histogram(bins = 20) + 
  facet_grid( ~ status_group)

After subsetting out the missing value, we could see that there may some differences between the distribution of functional wells and non-functional wells.

Mapping Well Locations

Two other variables that would be worth checking out would be longitude and latitude.

It would make sense that the location of the wells could be connected to the probability that they are functioning. We could look at a histogram of the two variables, but we could be missing some major features of the data.

# Create scatter plot: latitude vs longitude with color as status_group
ggplot(subset(train2, latitude < 0 & longitude > 0), aes(x = latitude, y = longitude, color = status_group)) +
  geom_point(shape = 1) +
  theme(legend.position = "top")

Interactive map with a modified dataset.

library(googleVis)

# Create a column 'latlong' to input into gvisGeoChart
train2$latlong <- paste(round(train2$latitude, 2), round(train2$longitude, 2), sep = ":")

# Use gvisGeoChart to create an interactive map with well locations
wells_map <- gvisGeoChart(train2[1:1000,], locationvar = "latlong", colorvar = "status_group", sizevar = "Size", options = list(region = "TZ"))

# Plot wells_map
wells_map

Here is the .gif file:



2, Predict and Measure

First prediction

Let’s start making a few predictions using a common machine learning technique called a Random Forest.

Set the number of trees to grow to 5, make sure we can inspect variable importance (set to TRUE), and set nodesize to 2.

# Load the randomForest library
library(randomForest)

# Set seed and create a random forest classifier
set.seed(42)

model_forest <- randomForest(as.factor(status_group) ~ longitude + latitude + extraction_type_group + quality_group + quantity + waterpoint_type + construction_year, data = train, importance = TRUE, ntree = 5, nodesize = 2)

# Use random forest to predict the values in train
pred_forest_train <- predict(model_forest, train)

# Observe the first few rows of our predictions
head(pred_forest_train)
##              1              2              3              4              5 
## non functional     functional     functional     functional non functional 
##              6 
##     functional 
## Levels: functional functional needs repair non functional

Evaluation the Random Forest

library(caret)

confusionMatrix(pred_forest_train, train$status_group)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair
##   functional                   30717                    2686
##   functional needs repair        251                    1145
##   non functional                1291                     486
##                          Reference
## Prediction                non functional
##   functional                        4943
##   functional needs repair            123
##   non functional                   17758
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8354          
##                  95% CI : (0.8323, 0.8383)
##     No Information Rate : 0.5431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6841          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.9522                        0.26523
## Specificity                     0.7189                        0.99321
## Pos Pred Value                  0.8010                        0.75379
## Neg Pred Value                  0.9268                        0.94520
## Prevalence                      0.5431                        0.07268
## Detection Rate                  0.5171                        0.01928
## Detection Prevalence            0.6456                        0.02557
## Balanced Accuracy               0.8356                        0.62922
##                      Class: non functional
## Sensitivity                         0.7780
## Specificity                         0.9514
## Pos Pred Value                      0.9090
## Neg Pred Value                      0.8729
## Prevalence                          0.3842
## Detection Rate                      0.2990
## Detection Prevalence                0.3289
## Balanced Accuracy                   0.8647

The positive predictive value is 0.91 for non-functional wells. That is a pretty high predictive value. The overall accuracy of our model on the training set was 0.84.

Variable importance

Look at how important the inputs were to the predictive model.

importance(model_forest)
##                       functional functional needs repair non functional
## longitude               16.71721                7.490207       8.546189
## latitude                17.16512                9.495809      21.302410
## extraction_type_group   19.61447                5.066070      12.426716
## quality_group           10.47112                3.370981       7.865812
## quantity                26.53906                9.100936      43.594567
## waterpoint_type         18.39191               12.293394      24.068165
## construction_year       21.38358               10.318916      30.289797
##                       MeanDecreaseAccuracy MeanDecreaseGini
## longitude                        71.157035        4315.9566
## latitude                         23.103322        4296.2056
## extraction_type_group            27.563737        1736.4858
## quality_group                     9.928202         546.9821
## quantity                         42.336038        4412.6914
## waterpoint_type                  40.533549        2755.5968
## construction_year                34.637218        2107.0972
varImpPlot(model_forest)

It looks like the quality_group variable had the least predictive power in this model. The quantity variable seemed to be the most useful variable in this model.

Adding features

There are a lot of features in this data set, and many of them will not be useful inputs to commonly for machine learning techniques without some adjustments. That is why this data set is all about feature selection and feature engineering.

We can examine the variable installer by using summary(train$installer). We can see that there are a lot of terms that are likely the same installer, but have differenct names. For example, there are many instances that refer to ‘Government’: ‘Gover’, ‘GOVER’, ‘Government’, ‘Govt’ etc. All of these will be considered different factors unless we find a way to aggregate them.

One quick (and simplistic) way to do this would be to take the first 3 letters of each factor and make them lower case. Then, we can aggregate the terms that are most frequent and only use those as predictors, putting all other variables into an ‘other’ category.

# Observe the installer variable
summary(train$installer)
##                           DWE                               
##                         17402                          3655 
##                    Government                           RWE 
##                          1825                          1206 
##                         Commu                        DANIDA 
##                          1060                          1050 
##                          KKKT                        Hesawa 
##                           898                           840 
##                             0                          TCRS 
##                           777                           707 
##            Central government                           CES 
##                           622                           610 
##                     Community                         DANID 
##                           553                           552 
##              District Council                        HESAWA 
##                           551                           539 
##                           LGA                  World vision 
##                           408                           408 
##                        WEDECO                         TASAF 
##                           397                           396 
##              District council                         Gover 
##                           392                           383 
##                         AMREF                         TWESA 
##                           329                           316 
##                            WU                          Dmdd 
##                           301                           287 
##                          ACRA                  World Vision 
##                           278                           270 
##                          SEMA                            DW 
##                           249                           246 
##                         OXFAM                            Da 
##                           234                           224 
##                          Gove                 Idara ya maji 
##                           222                           222 
##                        UNICEF    Sengerema Water Department 
##                           222                           214 
##                     Kiliwater                          FinW 
##                           210                           208 
##                         NORAD                            DH 
##                           208                           202 
##                     Villagers                          DWSP 
##                           199                           192 
##                        Distri          Lawatefuka water sup 
##                           181                           180 
##          Magadini-Makiwaru wa                            RC 
##                           175                           174 
##                            FW          KKKT _ Konde and DWE 
##                           173                           166 
##                         Centr                           WVT 
##                           162                           158 
##                           MWE           Handeni Trunk Main( 
##                           157                           156 
##                            Is                         Norad 
##                           154                           152 
##                    Fini Water                         RWSSP 
##                           149                           149 
##                         SHIPO                       Private 
##                           147                           143 
##                        Kuwait                         JAICA 
##                           142                           141 
##                  Central govt                       Artisan 
##                           138                           135 
##                           ISF                    Fini water 
##                           135                           133 
##                         GOVER                          DDCA 
##                           128                           126 
##                            HE                     RC CHURCH 
##                           125                           125 
##                         Tardo                       Mission 
##                           123                           121 
##                         World                            Ir 
##                           121                           119 
##                      wananchi           Consulting Engineer 
##                           119                           116 
##                         Amref                          JICA 
##                           114                           113 
##                           TWE            Central Government 
##                           111                           110 
##                         MUWSA                           DED 
##                           107                           105 
##                    FINI WATER                     WATER AID 
##                           103                           103 
## Halmashauri ya wilaya sikonge                           HSW 
##                           102                           100 
##                Wizara ya maji                         Jaica 
##                           100                            98 
##                        Unisef                            Go 
##                            98                            97 
##                            Ki                        OXFARM 
##                            97                            96 
##                    World Bank                         Roman 
##                            95                            94 
##                         Angli                           VWC 
##                            92                            91 
##     District water department                          DMDD 
##                            89                            89 
##                         Missi                             H 
##                            87                            86 
##                         Shipo                       (Other) 
##                            86                         12250
# Make installer lowercase, take first 3 letters as a sub string
train$install_3 <- substr(tolower(train$installer), 1, 3)
train$install_3[train$install_3 %in% c(" ", "", "0", "_", "-")] <- "other"

# Take the top 15 substrings from above by occurance frequency
install_top_15 <- names(summary(as.factor(train$install_3)))[1:15]
train$install_3[!(train$install_3 %in% install_top_15)] <- "other"
train$install_3 <- as.factor(train$install_3)

# Table of the install_3 variable vs the status of the pumps
table(train$install_3, train$status_group)
##        
##         functional functional needs repair non functional
##   cen          263                      26            783
##   ces          538                       1             71
##   com         1140                      92            449
##   dan         1042                      95            545
##   dis          529                     101            744
##   dwe         9454                    1622           6356
##   fin          127                      34            641
##   gov          945                     296           1427
##   hes          791                      54            557
##   kkk          568                      78            517
##   other      15358                    1583           9033
##   rwe          334                     140            818
##   tas          307                      37            167
##   tcr          305                      43            386
##   wor          558                     115            330
# As row-wise proportions, install_3 vs status_group
prop.table(table(train$install_3, train$status_group), margin = 1)
##        
##          functional functional needs repair non functional
##   cen   0.245335821             0.024253731    0.730410448
##   ces   0.881967213             0.001639344    0.116393443
##   com   0.678167757             0.054729328    0.267102915
##   dan   0.619500595             0.056480380    0.324019025
##   dis   0.385007278             0.073508006    0.541484716
##   dwe   0.542335934             0.093047269    0.364616797
##   fin   0.158354115             0.042394015    0.799251870
##   gov   0.354197901             0.110944528    0.534857571
##   hes   0.564194009             0.038516405    0.397289586
##   kkk   0.488392089             0.067067928    0.444539983
##   other 0.591283591             0.060945561    0.347770848
##   rwe   0.258513932             0.108359133    0.633126935
##   tas   0.600782779             0.072407045    0.326810176
##   tcr   0.415531335             0.058583106    0.525885559
##   wor   0.556331007             0.114656032    0.329012961
test <- test_values

# Create install_3 for the test set using same top 15 from above
test$install_3 <- substr(tolower(test$installer),1,3)
test$install_3[test$install_3 %in% c(" ", "", "0", "_", "-")] <- "other"
test$install_3[!(test$install_3 %in% install_top_15)] <- "other"
test$install_3 <- as.factor(test$install_3)

It looks like there are a few installer groups that show a high proportion of non functional wells.

Predict, Submit and Next Steps

Now that we have created the new variable install_3, it is time to make a new prediction. Then we can observe the importance and statistics to see how it performs.

Then, we can use the model to create predictions based on the test set. These results can be stored in a data frame called submission with a column for the well id and the predicted status_group.

This data frame can then be written to a .csv with write.csv and submitted to DrivenData here.

Ex: write.csv(submission, file = filepath.

This has only been an introduction to the water pumps challenge; there are still many ways to improve and refine our predictions. Here are a few suggestions for some next steps:

  • More feature selection: Look through the features in train and continue to add features that may have predictive power.
  • More feature engineering: Look to features that can be engineered to create more useful predictors. There are many location and character variables that could be refined to improve our predictions.
  • Improve the random forest: One way to improve the accuracy is to increase the number of trees produced by the random forest and experiment with increasing and decreasing the minimum node size.
  • Look to new machine learning techniques: There are many other machine learning techniques that could be utilzed to improve these models. KNN, SVM and logistic regression models and ensembles could give us the edge to climb the leaderboard.
# randomForest and caret packages are pre-loaded
set.seed(42)

model_forest <- randomForest(as.factor(status_group) ~ longitude + latitude + extraction_type_group + quantity + waterpoint_type + construction_year + install_3, data = train, importance = TRUE, ntree = 5, nodesize = 2)

# Predict using the training values
pred_forest_train <- predict(model_forest, train)
importance(model_forest)
##                       functional functional needs repair non functional
## longitude              22.607973               11.441151      15.278711
## latitude               22.275082               11.739175      15.450517
## extraction_type_group   9.641744               10.703152      26.780392
## quantity               23.019245               15.160051      40.438273
## waterpoint_type        34.160155                8.735793      17.743501
## construction_year      23.752795                8.585766       8.095198
## install_3              14.538354                7.826015      10.086711
##                       MeanDecreaseAccuracy MeanDecreaseGini
## longitude                         39.82497         4528.996
## latitude                          30.42600         4397.020
## extraction_type_group             14.29616         1841.559
## quantity                          33.36426         4312.917
## waterpoint_type                   40.66599         2790.131
## construction_year                 18.32689         2352.908
## install_3                         13.68290         1566.425
confusionMatrix(pred_forest_train, train$status_group)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair
##   functional                   30562                    2114
##   functional needs repair        347                    1637
##   non functional                1350                     566
##                          Reference
## Prediction                non functional
##   functional                        4338
##   functional needs repair            202
##   non functional                   18284
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8499         
##                  95% CI : (0.847, 0.8527)
##     No Information Rate : 0.5431         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7158         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.9474                        0.37920
## Specificity                     0.7623                        0.99003
## Pos Pred Value                  0.8257                        0.74886
## Neg Pred Value                  0.9242                        0.95316
## Prevalence                      0.5431                        0.07268
## Detection Rate                  0.5145                        0.02756
## Detection Prevalence            0.6231                        0.03680
## Balanced Accuracy               0.8548                        0.68462
##                      Class: non functional
## Sensitivity                         0.8011
## Specificity                         0.9476
## Pos Pred Value                      0.9051
## Neg Pred Value                      0.8842
## Prevalence                          0.3842
## Detection Rate                      0.3078
## Detection Prevalence                0.3401
## Balanced Accuracy                   0.8744
# Predict using the test values
pred_forest_test <- predict(model_forest, test)

# Create submission data frame
submission <- data.frame(test$id)
submission$status_group <- pred_forest_test
names(submission)[1] <- "id"

Now we can submit the results of the random forest by downloading this .csv and visiting the submission page for DriveData. We will see the model has an accuracy of 0.80 on the test set. Keep working and climb that leaderboard!



Notes


  1. DataDriven organizes data science competitions to save the world.