Foreword
Data Mining the Water Table
Using data from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don’t work at all?
This is an intermediate-level practice competition by DrivenData1.
Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which water points will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
Let’s start with loading in the training and testing set into our R environment. We will use the training set to build our model, and the test set to validate it. Here is a quick explanation of the 3 data frames that we will import:
train_values corresponds to the independent variables for the training set.train_labels contains the dependent variable (status_group) for each of the rows in train_values.test_values is the independent variables that need predictions.Understanding our data
# Check it out
str(train_values)## 'data.frame': 59400 obs. of 40 variables:
## $ id : int 69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
## $ amount_tsh : num 6000 0 25 0 0 20 0 0 0 0 ...
## $ date_recorded : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 48 310 301 273 105 47 187 195 220 126 ...
## $ funder : Factor w/ 1898 levels "","0","Aar","Abasia",..: 1371 471 827 1741 21 987 347 1418 1826 594 ...
## $ gps_height : int 1390 1399 686 263 0 0 0 0 0 0 ...
## $ installer : Factor w/ 2146 levels "","-","0","AAR",..: 1557 593 2116 1924 119 433 446 433 2039 119 ...
## $ longitude : num 34.9 34.7 37.5 38.5 31.1 ...
## $ latitude : num -9.86 -2.15 -3.82 -11.16 -1.83 ...
## $ wpt_name : Factor w/ 37400 levels "24","Abass","Abbas",..: 32435 37197 15288 37293 34797 36139 21942 36485 23155 22872 ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : Factor w/ 9 levels "Internal","Lake Nyasa",..: 2 5 6 8 5 6 1 4 4 5 ...
## $ subvillage : Factor w/ 19288 levels "","##","1","14Kambalage",..: 11808 15840 9069 8977 7700 11812 3798 16345 3436 11439 ...
## $ region : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 4 10 9 13 5 21 18 18 20 5 ...
## $ region_code : int 11 20 21 90 18 4 17 17 14 18 ...
## $ district_code : int 5 2 4 63 1 8 3 3 6 1 ...
## $ lga : Factor w/ 125 levels "Arusha Rural",..: 52 104 109 88 27 69 105 26 116 27 ...
## $ ward : Factor w/ 2092 levels "Aghondi","Akheri",..: 1427 1577 1625 1572 1688 1323 1853 151 435 498 ...
## $ population : int 109 280 250 58 0 1 0 0 0 0 ...
## $ public_meeting : Factor w/ 3 levels "","False","True": 3 1 3 3 3 3 3 3 3 3 ...
## $ recorded_by : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
## $ scheme_management : Factor w/ 13 levels "","Company","None",..: 9 4 9 9 1 9 9 1 9 1 ...
## $ scheme_name : Factor w/ 2697 levels "","14 Kambarage",..: 2290 1 2168 1 1 2692 1 1 1 1 ...
## $ permit : Factor w/ 3 levels "","False","True": 2 3 3 3 3 3 3 3 3 3 ...
## $ construction_year : int 1999 2010 2009 1986 0 2009 0 0 0 0 ...
## $ extraction_type : Factor w/ 18 levels "afridev","cemo",..: 4 4 4 15 4 15 16 9 5 9 ...
## $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 2 2 2 11 2 11 12 6 3 6 ...
## $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 1 1 1 6 1 6 2 2 2 2 ...
## $ management : Factor w/ 12 levels "company","other",..: 8 12 8 8 2 8 8 12 8 8 ...
## $ management_group : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 2 5 5 5 5 5 ...
## $ payment : Factor w/ 7 levels "never pay","other",..: 3 1 5 1 1 5 1 7 1 1 ...
## $ payment_type : Factor w/ 7 levels "annually","monthly",..: 1 3 6 3 3 6 3 7 3 3 ...
## $ water_quality : Factor w/ 8 levels "coloured","fluoride",..: 7 7 7 7 7 5 7 4 5 7 ...
## $ quality_group : Factor w/ 6 levels "colored","fluoride",..: 3 3 3 3 3 5 3 4 5 3 ...
## $ quantity : Factor w/ 5 levels "dry","enough",..: 2 3 2 1 4 2 2 2 4 2 ...
## $ quantity_group : Factor w/ 5 levels "dry","enough",..: 2 3 2 1 4 2 2 2 4 2 ...
## $ source : Factor w/ 10 levels "dam","hand dtw",..: 9 6 1 4 6 5 4 8 4 8 ...
## $ source_type : Factor w/ 7 levels "borehole","dam",..: 7 4 2 1 4 3 1 6 1 6 ...
## $ source_class : Factor w/ 3 levels "groundwater",..: 1 2 2 1 2 3 1 1 1 1 ...
## $ waterpoint_type : Factor w/ 7 levels "cattle trough",..: 2 2 3 3 2 3 5 5 5 5 ...
## $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 2 2 2 2 2 2 4 4 4 4 ...
str(train_labels)## 'data.frame': 59400 obs. of 2 variables:
## $ id : int 69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
## $ status_group: Factor w/ 3 levels "functional","functional needs repair",..: 1 1 1 3 1 1 3 3 3 1 ...
str(test_values)## 'data.frame': 14850 obs. of 40 variables:
## $ id : int 50785 51630 17168 45559 49871 52449 24806 28965 36301 54122 ...
## $ amount_tsh : num 0 0 0 0 500 0 0 0 30 0 ...
## $ date_recorded : Factor w/ 331 levels "2001-03-26","2004-01-04",..: 256 256 253 243 307 284 35 246 244 298 ...
## $ funder : Factor w/ 981 levels "","0","Aar","Abasia",..: 177 253 1 223 75 253 253 224 893 417 ...
## $ gps_height : int 1996 1569 1567 267 1260 1685 550 234 584 1083 ...
## $ installer : Factor w/ 1092 levels "","0","AAR","ABASIA",..: 232 247 1 274 90 247 305 275 491 487 ...
## $ longitude : num 35.3 36.7 34.8 38.1 35 ...
## $ latitude : num -4.06 -3.31 -5 -9.42 -10.95 ...
## $ wpt_name : Factor w/ 10840 levels "21","24","Abada",..: 656 1778 9675 5943 6054 7901 9302 4990 5803 3099 ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : Factor w/ 9 levels "Internal","Lake Nyasa",..: 1 6 1 8 8 6 7 8 8 6 ...
## $ subvillage : Factor w/ 8444 levels "","1","10C","14Kambarage",..: 3915 2719 5399 2797 3546 4318 2429 1373 203 816 ...
## $ region : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 9 1 19 8 17 1 4 13 13 7 ...
## $ region_code : int 21 2 13 80 10 2 11 9 90 3 ...
## $ district_code : int 3 2 2 43 3 2 7 4 33 7 ...
## $ lga : Factor w/ 125 levels "Arusha Rural",..: 64 1 110 50 62 1 35 117 89 107 ...
## $ ward : Factor w/ 1959 levels "Aghondi","Akheri",..: 18 651 1678 1193 1076 347 932 1049 1218 1778 ...
## $ population : int 321 300 500 250 60 200 600 1 40 1 ...
## $ public_meeting : Factor w/ 3 levels "","False","True": 3 3 3 1 1 3 3 3 3 3 ...
## $ recorded_by : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
## $ scheme_management : Factor w/ 12 levels "","Company","Other",..: 4 8 8 8 10 8 8 10 8 10 ...
## $ scheme_name : Factor w/ 1790 levels "","14 Kambarage",..: 1 1670 1480 1 99 1647 1 95 1271 36 ...
## $ permit : Factor w/ 3 levels "","False","True": 3 3 1 3 3 3 3 3 2 3 ...
## $ construction_year : int 2012 2000 2010 1987 2000 1990 2007 1982 1997 2003 ...
## $ extraction_type : Factor w/ 17 levels "afridev","cemo",..: 10 4 10 10 4 4 5 14 4 4 ...
## $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 7 2 7 7 2 2 3 11 2 2 ...
## $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 4 1 4 4 1 1 2 6 1 1 ...
## $ management : Factor w/ 12 levels "company","other",..: 4 8 8 8 10 8 8 8 8 10 ...
## $ management_group : Factor w/ 5 levels "commercial","other",..: 3 5 5 5 5 5 5 5 5 5 ...
## $ payment : Factor w/ 7 levels "never pay","other",..: 1 1 1 7 4 1 1 1 5 4 ...
## $ payment_type : Factor w/ 7 levels "annually","monthly",..: 3 3 3 7 2 3 3 3 6 2 ...
## $ water_quality : Factor w/ 8 levels "coloured","fluoride",..: 7 7 7 7 7 7 5 7 7 7 ...
## $ quality_group : Factor w/ 6 levels "colored","fluoride",..: 3 3 3 3 3 3 5 3 3 3 ...
## $ quantity : Factor w/ 5 levels "dry","enough",..: 4 3 3 1 2 2 2 1 3 2 ...
## $ quantity_group : Factor w/ 5 levels "dry","enough",..: 4 3 3 1 2 2 2 1 3 2 ...
## $ source : Factor w/ 10 levels "dam","hand dtw",..: 6 9 6 8 9 9 4 4 9 9 ...
## $ source_type : Factor w/ 7 levels "borehole","dam",..: 4 7 4 6 7 7 1 1 7 7 ...
## $ source_class : Factor w/ 3 levels "groundwater",..: 2 1 2 1 1 1 1 1 1 1 ...
## $ waterpoint_type : Factor w/ 7 levels "cattle trough",..: 7 2 7 7 2 2 5 3 2 2 ...
## $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 6 2 6 6 2 2 4 2 2 2 ...
Water table
To simplify things, it is common to merge the independent values and the dependent labels into one data frame.
The objective of this challenge is to predict whether or not a water pump is functional for the test set. We are given the status of the pumps for the newly merged train data frame as the column status_group.
How many water pumps in the train data frame are functional? In what proportion.
# Merge data frames to create the data frame train
train <- merge(train_labels, train_values)
# Look at the number of pumps in each functional status group
table(train$status_group)##
## functional functional needs repair non functional
## 32259 4317 22824
# As proportions
prop.table(table(train$status_group))##
## functional functional needs repair non functional
## 0.54308081 0.07267677 0.38424242
# Table of the quantity variable vs the status of the pumps
table(train$quantity, train$status_group)##
## functional functional needs repair non functional
## dry 157 37 6052
## enough 21648 2400 9138
## insufficient 7916 1450 5763
## seasonal 2325 416 1309
## unknown 213 14 562
prop.table(table(train$status_group, train$quantity))##
## dry enough insufficient
## functional 0.0026430976 0.3644444444 0.1332659933
## functional needs repair 0.0006228956 0.0404040404 0.0244107744
## non functional 0.1018855219 0.1538383838 0.0970202020
##
## seasonal unknown
## functional 0.0391414141 0.0035858586
## functional needs repair 0.0070033670 0.0002356902
## non functional 0.0220370370 0.0094612795
# As row-wise proportions, quantity vs status_group
# 1 row-wise
# 2 col-wise
prop.table(table(train$quantity, train$status_group), margin = 1)##
## functional functional needs repair non functional
## dry 0.025136087 0.005923791 0.968940122
## enough 0.652323269 0.072319653 0.275357078
## insufficient 0.523233525 0.095842422 0.380924053
## seasonal 0.574074074 0.102716049 0.323209877
## unknown 0.269961977 0.017743980 0.712294043
par(mar = c(5.1, 4.2, 4.1, 2.1))
mosaicplot(prop.table(table(train$quantity, train$status_group), margin = 1), main = '')
# Plot it
axis(2, at = c(0.75,0.5,0.25), labels = c('non funct.', 'needs repair', 'functional'), pos = 0, las = 1, tick = 0, cex.axis = 0.7)par(mar = c(5.1, 4.1, 4.1, 2.1))It looks like if the quantity variable is ‘dry’, it is likely that the pump is not functional.
Explore and Visualize
Another great way to explore our data is to create a few visualizations. This can help us better understand the structure and potential limitations of particular variables.
# Load the ggplot package and examine train
library(ggplot2)
str(train)## 'data.frame': 59400 obs. of 41 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ status_group : Factor w/ 3 levels "functional","functional needs repair",..: 3 1 1 1 3 1 3 1 3 3 ...
## $ amount_tsh : num 0 0 0 10 0 50 0 0 0 0 ...
## $ date_recorded : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 230 39 61 341 56 32 206 57 72 104 ...
## $ funder : Factor w/ 1898 levels "","0","Aar","Abasia",..: 1635 1499 840 438 213 1272 456 1825 1825 281 ...
## $ gps_height : int 0 1978 0 1639 0 28 0 0 0 0 ...
## $ installer : Factor w/ 2146 levels "","-","0","AAR",..: 1806 1690 1001 230 278 1461 573 566 289 375 ...
## $ longitude : num 33.1 34.8 36.1 37.1 36.2 ...
## $ latitude : num -5.12 -9.4 -6.28 -3.19 -6.1 ...
## $ wpt_name : Factor w/ 37400 levels "24","Abass","Abbas",..: 29563 32435 1063 445 2598 21496 28856 4091 9217 35406 ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : Factor w/ 9 levels "Internal","Lake Nyasa",..: 4 7 9 6 9 9 1 7 9 5 ...
## $ subvillage : Factor w/ 19288 levels "","##","1","14Kambalage",..: 9069 8905 17876 18716 8618 13587 14740 13625 10144 10815 ...
## $ region : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 20 4 3 7 3 15 18 3 3 5 ...
## $ region_code : int 14 11 1 3 1 60 17 1 1 18 ...
## $ district_code : int 3 4 4 5 4 43 3 1 5 8 ...
## $ lga : Factor w/ 125 levels "Arusha Rural",..: 125 92 12 17 12 70 105 77 15 13 ...
## $ ward : Factor w/ 2092 levels "Aghondi","Akheri",..: 316 2049 1357 1091 1003 2061 2026 2079 392 1711 ...
## $ population : int 0 20 0 25 0 6922 0 0 0 0 ...
## $ public_meeting : Factor w/ 3 levels "","False","True": 1 3 3 3 3 3 3 3 3 3 ...
## $ recorded_by : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
## $ scheme_management : Factor w/ 13 levels "","Company","None",..: 9 1 9 11 9 6 9 9 9 9 ...
## $ scheme_name : Factor w/ 2697 levels "","14 Kambarage",..: 1 1 1461 1091 1 1 1 1141 526 1326 ...
## $ permit : Factor w/ 3 levels "","False","True": 3 2 3 3 3 2 3 3 2 3 ...
## $ construction_year : int 0 2008 0 1999 0 0 0 0 0 0 ...
## $ extraction_type : Factor w/ 18 levels "afridev","cemo",..: 1 13 8 4 9 15 10 10 8 8 ...
## $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 1 10 5 2 6 11 7 7 5 5 ...
## $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 2 5 3 1 2 6 4 4 3 3 ...
## $ management : Factor w/ 12 levels "company","other",..: 8 8 8 10 8 5 8 8 5 8 ...
## $ management_group : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 5 1 5 5 1 5 ...
## $ payment : Factor w/ 7 levels "never pay","other",..: 7 1 5 5 7 5 1 1 5 1 ...
## $ payment_type : Factor w/ 7 levels "annually","monthly",..: 7 3 6 6 7 6 3 3 6 3 ...
## $ water_quality : Factor w/ 8 levels "coloured","fluoride",..: 4 7 7 7 7 7 7 4 7 7 ...
## $ quality_group : Factor w/ 6 levels "colored","fluoride",..: 4 3 3 3 3 3 3 4 3 3 ...
## $ quantity : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
## $ quantity_group : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
## $ source : Factor w/ 10 levels "dam","hand dtw",..: 8 8 4 9 8 4 8 9 4 4 ...
## $ source_type : Factor w/ 7 levels "borehole","dam",..: 6 6 1 7 6 1 6 7 1 1 ...
## $ source_class : Factor w/ 3 levels "groundwater",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ waterpoint_type : Factor w/ 7 levels "cattle trough",..: 5 5 3 2 5 3 7 6 3 3 ...
## $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 4 4 2 2 4 2 6 5 2 2 ...
# Create bar plot for quantity
qplot(quantity, data = train, geom = "bar", fill = status_group) +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle = -20, hjust = 0))prop.table(table(train$status_group, train$quantity))##
## dry enough insufficient
## functional 0.0026430976 0.3644444444 0.1332659933
## functional needs repair 0.0006228956 0.0404040404 0.0244107744
## non functional 0.1018855219 0.1538383838 0.0970202020
##
## seasonal unknown
## functional 0.0391414141 0.0035858586
## functional needs repair 0.0070033670 0.0002356902
## non functional 0.0220370370 0.0094612795
From the graph and the table, we can conclude that the ‘dry’ wells appears to be mostly non-functional (not functional, not functional with repair needs at the lowest).
# Create bar plot for quality_group, the quality of the water
qplot(quality_group, data = train, geom = "bar", fill = status_group) +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle = -20, hjust = 0))# Create bar plot for waterpoint_type, the kind of water point
qplot(waterpoint_type, data = train, geom = "bar", fill = status_group) +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle = -20, hjust = 0))# extraction_type_class - the kind of extraction the water point uses
qplot(extraction_type_class, data = train, geom = "bar", fill = status_group) +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle = -20, hjust = 0))# payment - What the water costs
qplot(payment, data = train, geom = "bar", fill = status_group) +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle = -20, hjust = 0))Description of the variables.
Which Well Quantities are Functional?
# Create a histogram for `construction_year` grouped by `status_group`
ggplot(train, aes(x = construction_year)) +
geom_histogram(bins = 20) +
facet_grid( ~ status_group)# Now subsetting when construction_year is larger than 0
ggplot(subset(train, construction_year > 0), aes(x = construction_year)) +
geom_histogram(bins = 20) +
facet_grid( ~ status_group)Continuous Variable Viz
We made some great plots that compared some categorical variables based on the well status. Now we can look at some ordinal or continuous variables.
# Create a histogram for `construction_year` grouped by `status_group`
ggplot(train, aes(x = construction_year)) +
geom_histogram(bins = 20) +
facet_grid( ~ status_group)As we can see, the first plot showed us that there were a lot of missing values coded as 0’s.
ggplot(subset(train, construction_year > 0), aes(x = construction_year)) +
geom_histogram(bins = 20) +
facet_grid( ~ status_group)After subsetting out the missing value, we could see that there may some differences between the distribution of functional wells and non-functional wells.
Mapping Well Locations
Two other variables that would be worth checking out would be longitude and latitude.
It would make sense that the location of the wells could be connected to the probability that they are functioning. We could look at a histogram of the two variables, but we could be missing some major features of the data.
# Create scatter plot: latitude vs longitude with color as status_group
ggplot(subset(train2, latitude < 0 & longitude > 0), aes(x = latitude, y = longitude, color = status_group)) +
geom_point(shape = 1) +
theme(legend.position = "top")Interactive map with a modified dataset.
library(googleVis)
# Create a column 'latlong' to input into gvisGeoChart
train2$latlong <- paste(round(train2$latitude, 2), round(train2$longitude, 2), sep = ":")
# Use gvisGeoChart to create an interactive map with well locations
wells_map <- gvisGeoChart(train2[1:1000,], locationvar = "latlong", colorvar = "status_group", sizevar = "Size", options = list(region = "TZ"))
# Plot wells_map
wells_mapHere is the .gif file:
First prediction
Let’s start making a few predictions using a common machine learning technique called a Random Forest.
Set the number of trees to grow to 5, make sure we can inspect variable importance (set to TRUE), and set nodesize to 2.
# Load the randomForest library
library(randomForest)
# Set seed and create a random forest classifier
set.seed(42)
model_forest <- randomForest(as.factor(status_group) ~ longitude + latitude + extraction_type_group + quality_group + quantity + waterpoint_type + construction_year, data = train, importance = TRUE, ntree = 5, nodesize = 2)
# Use random forest to predict the values in train
pred_forest_train <- predict(model_forest, train)
# Observe the first few rows of our predictions
head(pred_forest_train)## 1 2 3 4 5
## non functional functional functional functional non functional
## 6
## functional
## Levels: functional functional needs repair non functional
Evaluation the Random Forest
library(caret)
confusionMatrix(pred_forest_train, train$status_group)## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair
## functional 30717 2686
## functional needs repair 251 1145
## non functional 1291 486
## Reference
## Prediction non functional
## functional 4943
## functional needs repair 123
## non functional 17758
##
## Overall Statistics
##
## Accuracy : 0.8354
## 95% CI : (0.8323, 0.8383)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6841
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.9522 0.26523
## Specificity 0.7189 0.99321
## Pos Pred Value 0.8010 0.75379
## Neg Pred Value 0.9268 0.94520
## Prevalence 0.5431 0.07268
## Detection Rate 0.5171 0.01928
## Detection Prevalence 0.6456 0.02557
## Balanced Accuracy 0.8356 0.62922
## Class: non functional
## Sensitivity 0.7780
## Specificity 0.9514
## Pos Pred Value 0.9090
## Neg Pred Value 0.8729
## Prevalence 0.3842
## Detection Rate 0.2990
## Detection Prevalence 0.3289
## Balanced Accuracy 0.8647
The positive predictive value is 0.91 for non-functional wells. That is a pretty high predictive value. The overall accuracy of our model on the training set was 0.84.
Variable importance
Look at how important the inputs were to the predictive model.
importance(model_forest)## functional functional needs repair non functional
## longitude 16.71721 7.490207 8.546189
## latitude 17.16512 9.495809 21.302410
## extraction_type_group 19.61447 5.066070 12.426716
## quality_group 10.47112 3.370981 7.865812
## quantity 26.53906 9.100936 43.594567
## waterpoint_type 18.39191 12.293394 24.068165
## construction_year 21.38358 10.318916 30.289797
## MeanDecreaseAccuracy MeanDecreaseGini
## longitude 71.157035 4315.9566
## latitude 23.103322 4296.2056
## extraction_type_group 27.563737 1736.4858
## quality_group 9.928202 546.9821
## quantity 42.336038 4412.6914
## waterpoint_type 40.533549 2755.5968
## construction_year 34.637218 2107.0972
varImpPlot(model_forest)It looks like the quality_group variable had the least predictive power in this model. The quantity variable seemed to be the most useful variable in this model.
Adding features
There are a lot of features in this data set, and many of them will not be useful inputs to commonly for machine learning techniques without some adjustments. That is why this data set is all about feature selection and feature engineering.
We can examine the variable installer by using summary(train$installer). We can see that there are a lot of terms that are likely the same installer, but have differenct names. For example, there are many instances that refer to ‘Government’: ‘Gover’, ‘GOVER’, ‘Government’, ‘Govt’ etc. All of these will be considered different factors unless we find a way to aggregate them.
One quick (and simplistic) way to do this would be to take the first 3 letters of each factor and make them lower case. Then, we can aggregate the terms that are most frequent and only use those as predictors, putting all other variables into an ‘other’ category.
# Observe the installer variable
summary(train$installer)## DWE
## 17402 3655
## Government RWE
## 1825 1206
## Commu DANIDA
## 1060 1050
## KKKT Hesawa
## 898 840
## 0 TCRS
## 777 707
## Central government CES
## 622 610
## Community DANID
## 553 552
## District Council HESAWA
## 551 539
## LGA World vision
## 408 408
## WEDECO TASAF
## 397 396
## District council Gover
## 392 383
## AMREF TWESA
## 329 316
## WU Dmdd
## 301 287
## ACRA World Vision
## 278 270
## SEMA DW
## 249 246
## OXFAM Da
## 234 224
## Gove Idara ya maji
## 222 222
## UNICEF Sengerema Water Department
## 222 214
## Kiliwater FinW
## 210 208
## NORAD DH
## 208 202
## Villagers DWSP
## 199 192
## Distri Lawatefuka water sup
## 181 180
## Magadini-Makiwaru wa RC
## 175 174
## FW KKKT _ Konde and DWE
## 173 166
## Centr WVT
## 162 158
## MWE Handeni Trunk Main(
## 157 156
## Is Norad
## 154 152
## Fini Water RWSSP
## 149 149
## SHIPO Private
## 147 143
## Kuwait JAICA
## 142 141
## Central govt Artisan
## 138 135
## ISF Fini water
## 135 133
## GOVER DDCA
## 128 126
## HE RC CHURCH
## 125 125
## Tardo Mission
## 123 121
## World Ir
## 121 119
## wananchi Consulting Engineer
## 119 116
## Amref JICA
## 114 113
## TWE Central Government
## 111 110
## MUWSA DED
## 107 105
## FINI WATER WATER AID
## 103 103
## Halmashauri ya wilaya sikonge HSW
## 102 100
## Wizara ya maji Jaica
## 100 98
## Unisef Go
## 98 97
## Ki OXFARM
## 97 96
## World Bank Roman
## 95 94
## Angli VWC
## 92 91
## District water department DMDD
## 89 89
## Missi H
## 87 86
## Shipo (Other)
## 86 12250
# Make installer lowercase, take first 3 letters as a sub string
train$install_3 <- substr(tolower(train$installer), 1, 3)
train$install_3[train$install_3 %in% c(" ", "", "0", "_", "-")] <- "other"
# Take the top 15 substrings from above by occurance frequency
install_top_15 <- names(summary(as.factor(train$install_3)))[1:15]
train$install_3[!(train$install_3 %in% install_top_15)] <- "other"
train$install_3 <- as.factor(train$install_3)
# Table of the install_3 variable vs the status of the pumps
table(train$install_3, train$status_group)##
## functional functional needs repair non functional
## cen 263 26 783
## ces 538 1 71
## com 1140 92 449
## dan 1042 95 545
## dis 529 101 744
## dwe 9454 1622 6356
## fin 127 34 641
## gov 945 296 1427
## hes 791 54 557
## kkk 568 78 517
## other 15358 1583 9033
## rwe 334 140 818
## tas 307 37 167
## tcr 305 43 386
## wor 558 115 330
# As row-wise proportions, install_3 vs status_group
prop.table(table(train$install_3, train$status_group), margin = 1)##
## functional functional needs repair non functional
## cen 0.245335821 0.024253731 0.730410448
## ces 0.881967213 0.001639344 0.116393443
## com 0.678167757 0.054729328 0.267102915
## dan 0.619500595 0.056480380 0.324019025
## dis 0.385007278 0.073508006 0.541484716
## dwe 0.542335934 0.093047269 0.364616797
## fin 0.158354115 0.042394015 0.799251870
## gov 0.354197901 0.110944528 0.534857571
## hes 0.564194009 0.038516405 0.397289586
## kkk 0.488392089 0.067067928 0.444539983
## other 0.591283591 0.060945561 0.347770848
## rwe 0.258513932 0.108359133 0.633126935
## tas 0.600782779 0.072407045 0.326810176
## tcr 0.415531335 0.058583106 0.525885559
## wor 0.556331007 0.114656032 0.329012961
test <- test_values
# Create install_3 for the test set using same top 15 from above
test$install_3 <- substr(tolower(test$installer),1,3)
test$install_3[test$install_3 %in% c(" ", "", "0", "_", "-")] <- "other"
test$install_3[!(test$install_3 %in% install_top_15)] <- "other"
test$install_3 <- as.factor(test$install_3)It looks like there are a few installer groups that show a high proportion of non functional wells.
Predict, Submit and Next Steps
Now that we have created the new variable install_3, it is time to make a new prediction. Then we can observe the importance and statistics to see how it performs.
Then, we can use the model to create predictions based on the test set. These results can be stored in a data frame called submission with a column for the well id and the predicted status_group.
This data frame can then be written to a .csv with write.csv and submitted to DrivenData here.
Ex: write.csv(submission, file = filepath.
This has only been an introduction to the water pumps challenge; there are still many ways to improve and refine our predictions. Here are a few suggestions for some next steps:
train and continue to add features that may have predictive power.# randomForest and caret packages are pre-loaded
set.seed(42)
model_forest <- randomForest(as.factor(status_group) ~ longitude + latitude + extraction_type_group + quantity + waterpoint_type + construction_year + install_3, data = train, importance = TRUE, ntree = 5, nodesize = 2)
# Predict using the training values
pred_forest_train <- predict(model_forest, train)
importance(model_forest)## functional functional needs repair non functional
## longitude 22.607973 11.441151 15.278711
## latitude 22.275082 11.739175 15.450517
## extraction_type_group 9.641744 10.703152 26.780392
## quantity 23.019245 15.160051 40.438273
## waterpoint_type 34.160155 8.735793 17.743501
## construction_year 23.752795 8.585766 8.095198
## install_3 14.538354 7.826015 10.086711
## MeanDecreaseAccuracy MeanDecreaseGini
## longitude 39.82497 4528.996
## latitude 30.42600 4397.020
## extraction_type_group 14.29616 1841.559
## quantity 33.36426 4312.917
## waterpoint_type 40.66599 2790.131
## construction_year 18.32689 2352.908
## install_3 13.68290 1566.425
confusionMatrix(pred_forest_train, train$status_group)## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair
## functional 30562 2114
## functional needs repair 347 1637
## non functional 1350 566
## Reference
## Prediction non functional
## functional 4338
## functional needs repair 202
## non functional 18284
##
## Overall Statistics
##
## Accuracy : 0.8499
## 95% CI : (0.847, 0.8527)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7158
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.9474 0.37920
## Specificity 0.7623 0.99003
## Pos Pred Value 0.8257 0.74886
## Neg Pred Value 0.9242 0.95316
## Prevalence 0.5431 0.07268
## Detection Rate 0.5145 0.02756
## Detection Prevalence 0.6231 0.03680
## Balanced Accuracy 0.8548 0.68462
## Class: non functional
## Sensitivity 0.8011
## Specificity 0.9476
## Pos Pred Value 0.9051
## Neg Pred Value 0.8842
## Prevalence 0.3842
## Detection Rate 0.3078
## Detection Prevalence 0.3401
## Balanced Accuracy 0.8744
# Predict using the test values
pred_forest_test <- predict(model_forest, test)
# Create submission data frame
submission <- data.frame(test$id)
submission$status_group <- pred_forest_test
names(submission)[1] <- "id"Now we can submit the results of the random forest by downloading this .csv and visiting the submission page for DriveData. We will see the model has an accuracy of 0.80 on the test set. Keep working and climb that leaderboard!
Notes
DataDriven organizes data science competitions to save the world.↩