Foreword
The story of the Titanic is well known. On April 15, 1912, in the North Atlantic Ocean, the RMS Titanic sank on its maiden voyage.
Aboard, there were over 1300 passengers and over 900 crew members, for a total of over 2200 persons. Only one third survived the disaster.
First, we will explore the ship manifest.
Second, we will try to understand why some passengers survived. In other words, given a passenger’s status, could we have predicted who stood a better chance of survival? For that, we will use machine learning techniques.
In this section, we explore the ship manifest. We rely on this source: https://www.encyclopedia-titanica.org/.
Please note that depending on the table, the tally may vary a bit (2201 vs. 2208). Other sources report 2224 passengers. In addition, passengers include crew members and officers. At some point in the analysis, we focus on traveling passengers only, not onboard personnel. Therefore, the word ‘passengers’ can include or exclude the crew.
Above all, we want to demonstrate that humans are poor at reading raw data. No matter how literate we have become, we still rely on our good old prehistoric reflex: we prefer visual representations.
Let’s explore the ship manifest with data tables.
Table 1, Titanic Survival Data Summarised by Class, Age Group and Sex
| Class | Age Group | Sex | Number Survived | Number Aboard | Percent Survived |
|---|---|---|---|---|---|
| 1st | child | male | 5 | 5 | 100 |
| 2nd | child | male | 11 | 11 | 100 |
| 3rd | child | male | 13 | 48 | 27.1 |
| 1st | adult | male | 57 | 175 | 32.6 |
| 2nd | adult | male | 14 | 168 | 8.3 |
| 3rd | adult | male | 75 | 462 | 16.2 |
| Crew | adult | male | 192 | 862 | 22.3 |
| 1st | child | female | 1 | 1 | 100 |
| 2nd | child | female | 13 | 13 | 100 |
| 3rd | child | female | 14 | 31 | 45.2 |
| 1st | adult | female | 140 | 144 | 97.2 |
| 2nd | adult | female | 80 | 93 | 86 |
| 3rd | adult | female | 76 | 165 | 46.1 |
| Crew | adult | female | 20 | 23 | 87 |
| TOTAL | | | 711 | 2201 | 32.3 |
From all these numbers, we can see that one third of the 2200 people aboard survived. Among the survivors, a good deal were 1st-class passengers, females and children.
The easiest figures to understand are the percentages. We do better when we can transform data into comparisons such as percentages and ratios: then we can make sense of all the information.
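Each Percent Survived cell in Table 1 is simply the number of survivors divided by the number aboard. For instance, for 1st-class adult males:

```r
# Percent Survived for 1st-class adult males in Table 1
round(57 / 175 * 100, 1)  # 32.6
```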
We can also see discrepancies among class numbers. The ship carried two very different classes of passengers comprising the aristocratic elite from England and Europe, and the poor immigrants from Great Britain, Ireland, and Scandinavia.
Let’s have two alternative views of the raw data.
Table 2, Titanic Survival Data Summarised by Title
| Title | Number Survived | Number Aboard | Percent Survived |
|---|---|---|---|
| Mr | 315 | 1588 | 19.8 |
| Mrs | 165 | 210 | 78.6 |
| Ms | 2 | 2 | 100 |
| Master | 31 | 61 | 50.8 |
| Miss | 185 | 270 | 68.5 |
| Mme. | 1 | 1 | 100 |
| Mlle | 2 | 2 | 100 |
| Sig. | 0 | 43 | 0 |
| Sir | 1 | 1 | 100 |
| Lady | 1 | 1 | 100 |
| Countess | 1 | 1 | 100 |
| Don | 0 | 1 | 0 |
| Dona | 1 | 1 | 100 |
| Fr | 0 | 7 | 0 |
| Rev | 0 | 1 | 0 |
| Captain | 0 | 2 | 0 |
| Colonel | 2 | 4 | 50 |
| Major | 1 | 2 | 50 |
| Dr | 4 | 10 | 40 |
| TOTAL | 712 | 2208 | 32.2 |
Table 3, Titanic Embarking Location Data
| Embarked at: | crew | first | second | third | TOTAL | Number Survived | Number Aboard | Percent Survived |
|---|---|---|---|---|---|---|---|---|
| Unknown | 0 | 3 | 0 | 0 | 3 | 2 | 3 | 66.7 |
| Belfast | 188 | 4 | 6 | 0 | 198 | 43 | 198 | 21.7 |
| Southampton | 703 | 171 | 246 | 493 | 1613 | 469 | 1613 | 29.1 |
| Cherbourg | 0 | 146 | 26 | 102 | 274 | 156 | 274 | 56.9 |
| Queenstown | 0 | 0 | 7 | 113 | 120 | 42 | 120 | 35 |
| TOTAL | 891 | 324 | 285 | 708 | 2208 | 712 | 2208 | 32.2 |
Preliminary conclusion: exploring raw data provides both the details and a big picture (with totals). However, humans are not very good at making sense of data tables. Even with categorical variables (male/female, A/B/C/D/E, 10 classes), we have difficulty sizing up the importance of the numbers; it forces us to overthink. It would be even more opaque with continuous variables (such as height or weight).
Visualizing the data
Let’s create a data table of ‘adult male passengers who survived’.
# The data
TitanicDF = as.data.frame(Titanic)
# columns 1 to 4: replicate each value Freq times with rep()
# for example, the row (3rd, Male, Child, No) has Freq = 35,
# so '3rd', 'Male', 'Child' and 'No' each show up 35 times
TitanicList <- lapply(TitanicDF[1:4], rep, TitanicDF$Freq)
# Create a new d.f: adult male passenger who survived
TitanicSets <- data.frame(passenger = TitanicList$Class != 'Crew',
adult = TitanicList$Age == 'Adult',
male = TitanicList$Sex == 'Male',
survivor = TitanicList$Survived == 'Yes')
# Check it out
head(TitanicSets, 3)
## passenger adult male survivor
## 1 TRUE FALSE TRUE FALSE
## 2 TRUE FALSE TRUE FALSE
## 3 TRUE FALSE TRUE FALSE
A good tool for parsing raw data is the Venn diagram. Let’s draw three diagrams:
# Load the library for venn diagrams
library(gplots)
# Plot the venn diagrams
# adult passenger
venn(TitanicSets[1:2])
# adult male passenger
venn(TitanicSets[1:3])
# adult male passenger who survived
venn(TitanicSets)
Parsing data this way offers a different view. The first diagram tells us there were 1207 ‘adult passengers’. Second, among these 1207, 805 were ‘males’. Third, among the 805, 146 ‘survived’.
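These subtotals can be cross-checked directly against the built-in Titanic table:

```r
# rebuild the individual-level lists from the aggregated table
TitanicDF <- as.data.frame(Titanic)
TitanicList <- lapply(TitanicDF[1:4], rep, TitanicDF$Freq)
# adult passengers (non-crew)
sum(TitanicList$Class != 'Crew' & TitanicList$Age == 'Adult')  # 1207
# adult male passengers
sum(TitanicList$Class != 'Crew' & TitanicList$Age == 'Adult' &
    TitanicList$Sex == 'Male')                                 # 805
# adult male passengers who survived
sum(TitanicList$Class != 'Crew' & TitanicList$Age == 'Adult' &
    TitanicList$Sex == 'Male' & TitanicList$Survived == 'Yes') # 146
```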
The problem is, beyond three groups, we fall back into the mess of the data tables: a data overflow. In addition, the gplots package does not size the circles according to each subtotal; one group should appear larger than another.
Let’s try another package and reproduce the three previous diagrams.
library(venneuler)
# adult passenger
plot(venneuler(TitanicSets[1:2]), col = hcl(0, 0, c(60, 80), 0.5), alpha = NA, border = 'black')
Much better! And…
# adult male passenger
plot(venneuler(TitanicSets[1:3]), col = hcl(0, 0, c(40, 80, 20), 0.5), alpha = NA, border = 'black')
# adult male passenger who survived
plot(venneuler(TitanicSets[1:4]), col = hcl(0, 0, c(60, 80, 20), 0.5), alpha = NA, border = 'black')
We read this last diagram as follows.
Preliminary conclusion: we avoid numbers and we simply focus on size. The answer is always the central ‘crescent’ formed by the circles overlapping each other (where we read the beginning of ‘passenger’). This crescent represents 146 persons, 7% of the ship manifest. Any good Venn diagram should provide a good visual overview of the data.
Can we do better? Yes, we can. Venn diagrams, beyond two circles, are always confusing, no matter how sharp they are. Let’s draw some graphics.
Visual exploration
From now on, we mainly focus on passengers only and try to exclude crew members.
Visual representation is a powerful technique for exploring a case before stepping into deeper analyses.
The next graph plots ‘Table 1, Titanic Survival Data Summarised by Class, Age Group and Sex’.
# Modify the data
Titanic1 <- data.frame(Titanic)
Titanic2 <- Titanic1
Titanic2$Survived <- as.factor(ifelse(Titanic2$Survived == 'Yes', 1, 0))
str(Titanic1)
## 'data.frame': 32 obs. of 5 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
## $ Age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
## $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq : num 0 0 35 0 0 0 17 0 118 154 ...
str(Titanic2)
## 'data.frame': 32 obs. of 5 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
## $ Age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
## $ Survived: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq : num 0 0 35 0 0 0 17 0 118 154 ...
# Load the package
library(ggplot2)
# Extend ggplot2
library(grid)
library(gridExtra)
p <- ggplot(Titanic1, aes(weight = Freq)) + ylab('') + ylim(0, 2250)
cs <- p + aes(Class) + geom_bar(fill = 'blue') + ggtitle('Passenger by Class,') + xlab('')
sx <- p + aes(Sex) + geom_bar(fill = 'green') + ggtitle('by Gender,') + xlab('')
ag <- p + aes(Age) + geom_bar(fill = 'tan2') + ggtitle('by Age Group,') + xlab('')
su <- p + aes(Survived) + geom_bar(fill = 'red') + ggtitle('and Survivors') + xlab('')
grid.arrange(cs, sx, ag, su, nrow = 1, widths = c(3, 2, 2, 2))
Visually, the ship manifest has more explanatory power than the data table.
Proportions (or percentages) are built into the graphs. We don’t need to read them or interpret them; we see them. There is only one problem: we lose the details. For example, the bar charts show data by class, by gender or by survivor, never combined.
What about a bivariate bar chart? It is called a mosaic chart. First, we plot two univariate mosaic charts (by class and by gender), then a bivariate mosaic chart combining classes and genders.
par(mfrow = c(1, 3))
mosaicplot(~ Class, data = Titanic, main = 'Univariate')
mosaicplot(~ Sex, data = Titanic, main = 'Univariate')
mosaicplot(~ Sex + Class, data = Titanic, main = 'Bivariate')
par(mfrow = c(1, 1))
However, mosaicplot has limited options. It would be interesting if we could ‘color’ each block by ‘survived/deceased’ instead of splitting the mosaic to a point beyond intelligibility.
mosaicplot(~ Sex + Class + Survived, data = Titanic, main = 'Multivariate')
Who survived? Where? What?
There is a better package for this kind of task: vcd. We can redo what we did above, and improve on it.
library(vcd)
A <- mosaic(~ Class, data = Titanic1, return_grob = TRUE)
B <- mosaic(Survived ~ Class, data = Titanic1, return_grob = TRUE)
C <- mosaic(~ Class + Sex, data = Titanic1, return_grob = TRUE)
D <- mosaic(Survived ~ Class + Sex, data = Titanic1, return_grob = TRUE)
E <- mosaic(~ Class + Sex + Age, data = Titanic1, return_grob = TRUE)
F <- mosaic(Survived ~ Class + Sex + Age, data = Titanic1, return_grob = TRUE)
# Multiple grid plots
mplot(A, B)
mplot(C, D)
mplot(E, F)
Now, the data is talking, because proportions are built into the graphs.
And there is an alternative to mosaic charts: doubledeckers. Doubledeckers are combinations of bar charts and mosaic charts.
G <- doubledecker(Survived ~ Sex, data = Titanic, gp = gpar(fill = c('grey90', 'blue')), return_grob = TRUE)
H <- doubledecker(Survived ~ Sex, data = Titanic, gp = gpar(fill = c('grey90', 'green')), return_grob = TRUE)
# Multiple grid plots
mplot(G, H)
doubledecker(Survived ~ Class + Sex, data = Titanic, gp = gpar(fill = c('grey90', 'red')))
We have to pay attention to the width: crew members constitute the largest group in terms of numbers. The graph captures both sizes (totals) and proportions. It is a data table, with details and totals, but much easier to read and understand. We can already answer some questions while we would still be calculating with the data table.
Another key characteristic is the ability to display graphics side by side (in a grid or in a graphic with several facets). Comparison between graphics is as important as comparison within a graphic.
Some packages have this feature built in.
library(extracat)
rmb(formula = ~ Class + Sex + Survived, data = Titanic1)
The lattice package offers interesting possibilities.
library(lattice)
# Normalized x-axis for comparison
barchart(Class ~ Freq | Sex + Age, data = as.data.frame(Titanic), groups = Survived, stack = TRUE, layout = c(4, 1), auto.key = list(title = 'Survived', columns = 2))
OK, lattice pastel tones are… well, we can turn the colour off.
# Save the theme
mytheme <- trellis.par.get()
# Turn to B&W
trellis.par.set(canonical.theme(color = FALSE))
Modify the axes (from normalized scales to free scales).
# Free x-axis
bc.titanic <- barchart(Class ~ Freq | Sex + Age, data = as.data.frame(Titanic), groups = Survived, stack = TRUE, layout = c(4, 1), auto.key = list(title = 'Survived', columns = 2), scales = list(x = 'free'))
bc.titanic
Bring back the colors.
trellis.par.set(mytheme)
bc.titanic <- barchart(Class ~ Freq | Sex + Age, data = as.data.frame(Titanic), groups = Survived, stack = TRUE, layout = c(4, 1), auto.key = list(title = 'Survived', columns = 2), scales = list(x = 'free'))
bc.titanic
Add background grids.
# Add bg grids
update(bc.titanic, panel = function(...) {
panel.grid(h = 0, v = -1)
panel.barchart(...)
})
Graphics are powerful for exploring data. Without sweating too much, we can already draw initial conclusions.
There were more crew members aboard than 1st-class passengers, yet the latter had a better survival rate than the former. Overall, 1st-class passengers had the highest survival rate.
More females survived even though there were more males aboard. In all classes (1st, 2nd, 3rd, Crew), there were more males, but more females survived. Actually, more females and children.
In the next sections, we will analyze these initial conclusions with statistical techniques.
We now dive deeper into the matter, and as with depth, this section and the following one are denser.
We can analyze all sorts of things, but we need metrics to compare results. We need ways to measure whether an analysis is robust enough and whether it is suited to grasp a pattern.
In section 3, we introduce statistical techniques: the decision tree and the k-nearest neighbour. For this section, we focus on performance measuring, preprocessing and cross-validating.
Load the data
# Check it out
str(titanic)
## 'data.frame': 714 obs. of 4 variables:
## $ survived: Factor w/ 2 levels "1","0": 2 1 1 1 2 2 2 1 1 1 ...
## $ pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ age : num 22 38 26 35 35 54 2 27 14 4 ...
The Confusion Matrix
With a decision tree, we can predict whether a person would have survived the disaster based on the variables age, sex and pclass (travel class). We then compare the actuals to the predictions made by the tree using a confusion matrix. In the next chunk, we run a decision tree and make predictions, but focus only on the confusion matrix.
# Load the decision tree package
library(rpart)
# Set random seed
set.seed(1)
# A rpart
tree <- rpart(survived ~ ., data = titanic, method = 'class')
#tree
# Use predict()
pred <- predict(tree, titanic, type = 'class')
#pred
# Use table() to make a confusion matrix
# rows = 1 or 0 (actuals); column = 1 or 0 (predictions)
conf <- table(titanic$survived, pred)
conf
## pred
## 1 0
## 1 212 78
## 0 53 371
So to summarize, 212 out of the 290 survivors were correctly predicted to have survived. On the other hand, 371 out of the 424 deceased were correctly predicted to have perished.
Deriving ratios from the Confusion Matrix
The confusion matrix provides the raw performance of the decision tree:
The matrix:
\[\begin{matrix} & \mathbf{P} & \mathbf{N}\\ \mathbf{p} & \mathrm{TP}& \mathrm{FN}\\ \mathbf{n} & \mathrm{FP}& \mathrm{TN}\\ \end{matrix}\]
The ratios:
\[Accuracy = \frac{TP+TN}{TP+FN+FP+TN}\]
\[Precision = \frac{TP}{TP+FP}\]
\[Recall = \frac{TP}{TP+FN}\]
# Given a Confusion Matrix
conf
## pred
## 1 0
## 1 212 78
## 0 53 371
# Assign TP, FN, FP and TN using conf
TP <- conf[1,1]
TN <- conf[2,2]
FP <- conf[2,1]
FN <- conf[1,2]
# Calculate and print the accuracy
acc <- (TP + TN)/(TP + FN + FP + TN)
# Calculate and print out the precision
prec <- TP/(TP + FP)
# Calculate and print out the recall
rec <- TP/(TP + FN)
r1 <- c('Accuracy', 'Precision', 'Recall')
r2 <- round(c(acc, prec, rec), 2)
names(r2) <- r1
r2
## Accuracy Precision Recall
## 0.82 0.80 0.73
dotchart(r2)
With an accuracy of around 82%, the model incorrectly predicted 18% of the passengers’ fates. Overall, we have an acceptable, but not truly satisfying, classifier.
Split the sets
A decision tree is a supervised algorithm. We need to train the algorithm first. Then, we compare some predictions with the actual results: the test.
We need to split the dataset into train and test sets. We use a 70/30 split between the two.
# Set random seed
set.seed(1)
# Shuffle the dataset
n <- nrow(titanic)
shuffled <- titanic[sample(n),]
# Split the data in train and test
titanic_train <- shuffled[1:round(0.7 * n),]
titanic_test <- shuffled[(round(0.7 * n) + 1):n,]
# Print the structure of train and test
str(titanic_train)
## 'data.frame': 500 obs. of 4 variables:
## $ survived: Factor w/ 2 levels "1","0": 2 1 1 2 2 2 2 2 1 1 ...
## $ pclass : Factor w/ 3 levels "1","2","3": 2 1 1 2 2 3 3 3 3 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 1 2 2 1 1 ...
## $ age : num 24 16 36 39 30 30 21 35 36 5 ...
str(titanic_test)
## 'data.frame': 214 obs. of 4 variables:
## $ survived: Factor w/ 2 levels "1","0": 2 1 2 2 2 1 1 1 1 2 ...
## $ pclass : Factor w/ 3 levels "1","2","3": 3 2 3 3 2 2 3 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 2 ...
## $ age : num 42 1 21 30.5 29 34 21 32 44 37 ...
This procedure can be repeated several times to build several train and test sets. This is helpful for training, testing and validating.
First, we train, then we test
# Set random seed
set.seed(1)
# The model
tree <- rpart(survived ~ ., titanic_train, method = 'class')
# Predict the outcome on the test set
pred <- predict(tree, titanic_test, type = 'class')
# Calculate the confusion matrix
conf2 <- table(titanic_test$survived, pred)
# Print this confusion matrix
conf2
## pred
## 1 0
## 1 47 38
## 0 5 124
# accuracy = (TP+TN) / (TP+FN+FP+TN)
# sum on diagonal / sum of all
# Print the accuracy
round(sum(diag(conf2))/sum(conf2), 2)
## [1] 0.8
The confusion matrix reveals an accuracy of 80%. This is less than the 82% calculated above.
Using Cross Validation
Let’s now build several train and test sets; 6 splits or folds.
# Set random seed
set.seed(1)
# Initialize the accs vector
accs <- c(1:6) * 0
# or...
accs <- rep(0,6)
for (i in 1:6) {
# these indices indicate the interval of the test set
indices <- (((i - 1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
# exclude them from the train set
titanic_train2 <- shuffled[-indices,]
# include them in the test set
titanic_test2 <- shuffled[indices,]
# a model is learned using each training set
tree2 <- rpart(survived ~ ., titanic_train2, method = 'class')
# make a prediction on the test set using tree
pred2 <- predict(tree2, titanic_test2, type = 'class')
# assign the confusion matrix to conf
conf2 <- table(titanic_test2$survived, pred2)
# assign the accuracy of this model to the ith index in accs
# accuracy = (TP+TN) / (TP+FN+FP+TN)
# sum on diagonal / sum of all
accs[i] <- sum(diag(conf2))/sum(conf2)
}
# Print out the mean accuracy
round(mean(accs), 2)## [1] 0.8
The mean accuracy is 80%. With 6 folds, the model confirms the previous results.
How many folds?
Let’s say we’re doing cross-validation on a dataset with n = 22680 observations, and we want each training set to contain tr = 21420 entries. How many folds can we use for our cross-validation? In other words, how many iterations with distinct test sets can we run on the dataset?
n - tr = 1260 entries per test set, and n / 1260 = 18 folds.
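The same arithmetic in R:

```r
n <- 22680   # total observations
tr <- 21420  # entries in each training set
test_size <- n - tr      # 1260 entries left for each test set
folds <- n / test_size   # 18 folds
folds
```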
Now we know how to preprocess the data and measure/compare the results. Let’s now turn our attention to machine learning algorithms (or statistical learning; the terms are synonyms).
There are two principal classification techniques in machine learning: decision trees and k-nearest neighbours. Both are supervised techniques: the analyst must set the parameters. Let’s start with decision trees.
A note on tree-based models
Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data while visualizing the decision rules for predicting a categorical outcome (classification tree) or a continuous outcome (regression tree).
Classification and regression trees can be generated through the rpart package.
Main methods:
- 'class' for a classification tree on a categorical dependent variable.
- 'anova' for a regression tree on a continuous dependent variable.
- 'poisson' and 'exp' for count and survival data.

The control argument takes optional parameters for controlling and pruning tree growth.
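As a quick illustration of the 'anova' method (the rest of this section uses 'class'), here is a minimal regression tree on R’s built-in cars data, with a couple of control parameters set explicitly:

```r
library(rpart)
# regression tree: predict stopping distance (dist) from speed
fit_anova <- rpart(dist ~ speed, data = cars, method = 'anova',
                   control = rpart.control(minsplit = 10, cp = 0.01))
# inspect the complexity-parameter table
printcp(fit_anova)
```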
The party package provides nonparametric regression trees for nominal, ordinal, censored and multivariate responses. Pruning is unnecessary.
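A minimal party sketch on the same data (assuming the package is installed; ctree selects splits via permutation tests, so no pruning step is needed):

```r
library(party)
# expand the aggregated Titanic table to one row per person
df <- as.data.frame(Titanic)
df <- df[rep(seq_len(nrow(df)), df$Freq), 1:4]
# conditional inference tree: no pruning needed
fit_ctree <- ctree(Survived ~ Class + Sex + Age, data = df)
plot(fit_ctree)
```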
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler’s random forest approach is implemented via the randomForest package.
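A hedged sketch with the randomForest package on the aggregated Titanic table, expanded to one row per person:

```r
library(randomForest)
set.seed(1)
# one row per person instead of aggregated counts
df <- as.data.frame(Titanic)
df <- df[rep(seq_len(nrow(df)), df$Freq), 1:4]
# 500 bootstrapped trees; classification by majority vote
rf <- randomForest(Survived ~ Class + Sex + Age, data = df, ntree = 500)
rf$confusion
```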
Learn a decision tree
First, grow the tree. Second, examine the results. Third, prune the tree.
Let’s grow a decision tree that uses a person’s age, gender, and travel class to predict whether or not they survived the Titanic.
# Set random seed
set.seed(1)
titanic_train3 <- titanic_train2
titanic_test3 <- titanic_test2
# Build a tree model with the default settings
# (options can be passed via rpart.control, e.g. minsplit = 30 for the minimum
# number of observations in a node before splitting, or cp = 0.001 for the
# cost-complexity factor)
tree3 <- rpart(survived ~ ., data = titanic_train3, method = 'class')
# Print the results
print(tree3)
## n= 595
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 595 242 0 (0.40672269 0.59327731)
## 2) sex=female 220 56 1 (0.74545455 0.25454545)
## 4) pclass=1,2 133 7 1 (0.94736842 0.05263158) *
## 5) pclass=3 87 38 0 (0.43678161 0.56321839)
## 10) age< 5.5 10 2 1 (0.80000000 0.20000000) *
## 11) age>=5.5 77 30 0 (0.38961039 0.61038961) *
## 3) sex=male 375 78 0 (0.20800000 0.79200000)
## 6) age< 6.5 18 5 1 (0.72222222 0.27777778) *
## 7) age>=6.5 357 65 0 (0.18207283 0.81792717) *
# Print detailed results including surrogate splits
summary(tree3)
## Call:
## rpart(formula = survived ~ ., data = titanic_train3, method = "class")
## n= 595
##
## CP nsplit rel error xerror xstd
## 1 0.44628099 0 1.0000000 1.0000000 0.04951322
## 2 0.04545455 1 0.5537190 0.5537190 0.04210455
## 3 0.03305785 2 0.5082645 0.5661157 0.04243451
## 4 0.02479339 3 0.4752066 0.5247934 0.04130008
## 5 0.01000000 4 0.4504132 0.4793388 0.03993217
##
## Variable importance
## sex pclass age
## 63 22 15
##
## Node number 1: 595 observations, complexity param=0.446281
## predicted class=0 expected loss=0.4067227 P(node) =1
## class counts: 242 353
## probabilities: 0.407 0.593
## left son=2 (220 obs) right son=3 (375 obs)
## Primary splits:
## sex splits as LR, improve=80.103310, (0 missing)
## pclass splits as LLR, improve=34.140610, (0 missing)
## age < 6.5 to the left, improve= 9.892647, (0 missing)
## Surrogate splits:
## age < 0.79 to the left, agree=0.634, adj=0.009, (0 split)
##
## Node number 2: 220 observations, complexity param=0.04545455
## predicted class=1 expected loss=0.2545455 P(node) =0.3697479
## class counts: 164 56
## probabilities: 0.745 0.255
## left son=4 (133 obs) right son=5 (87 obs)
## Primary splits:
## pclass splits as LLR, improve=27.423150, (0 missing)
## age < 21.5 to the right, improve= 1.601385, (0 missing)
## Surrogate splits:
## age < 21.5 to the right, agree=0.686, adj=0.207, (0 split)
##
## Node number 3: 375 observations, complexity param=0.03305785
## predicted class=0 expected loss=0.208 P(node) =0.6302521
## class counts: 78 297
## probabilities: 0.208 0.792
## left son=6 (18 obs) right son=7 (357 obs)
## Primary splits:
## age < 6.5 to the left, improve=9.999246, (0 missing)
## pclass splits as LRR, improve=6.239221, (0 missing)
##
## Node number 4: 133 observations
## predicted class=1 expected loss=0.05263158 P(node) =0.2235294
## class counts: 126 7
## probabilities: 0.947 0.053
##
## Node number 5: 87 observations, complexity param=0.02479339
## predicted class=0 expected loss=0.4367816 P(node) =0.1462185
## class counts: 38 49
## probabilities: 0.437 0.563
## left son=10 (10 obs) right son=11 (77 obs)
## Primary splits:
## age < 5.5 to the left, improve=2.981221, (0 missing)
##
## Node number 6: 18 observations
## predicted class=1 expected loss=0.2777778 P(node) =0.0302521
## class counts: 13 5
## probabilities: 0.722 0.278
##
## Node number 7: 357 observations
## predicted class=0 expected loss=0.1820728 P(node) =0.6
## class counts: 65 292
## probabilities: 0.182 0.818
##
## Node number 10: 10 observations
## predicted class=1 expected loss=0.2 P(node) =0.01680672
## class counts: 8 2
## probabilities: 0.800 0.200
##
## Node number 11: 77 observations
## predicted class=0 expected loss=0.3896104 P(node) =0.1294118
## class counts: 30 47
## probabilities: 0.390 0.610
# Examine the cp table
printcp(tree3)
##
## Classification tree:
## rpart(formula = survived ~ ., data = titanic_train3, method = "class")
##
## Variables actually used in tree construction:
## [1] age pclass sex
##
## Root node error: 242/595 = 0.40672
##
## n= 595
##
## CP nsplit rel error xerror xstd
## 1 0.446281 0 1.00000 1.00000 0.049513
## 2 0.045455 1 0.55372 0.55372 0.042105
## 3 0.033058 2 0.50826 0.56612 0.042435
## 4 0.024793 3 0.47521 0.52479 0.041300
## 5 0.010000 4 0.45041 0.47934 0.039932
# Plot the cross-validation results
plotcp(tree3)
# Plot the decision tree
# options: uniform = TRUE, main = 'Title'
plot(tree3, uniform = TRUE, main = 'Tree 3')
# Add text
# options: use.n = TRUE, cex = 0.8
text(tree3, use.n = TRUE, cex = 0.8)
Can we improve the visual? Of course we can.
# Load additional packages
library(rattle)
library(rpart.plot)
library(RColorBrewer)
# Draw a prettier decision tree
fancyRpartPlot(tree3)
Let’s examine the results. Node 1 confirms the exploratory data analysis (remember that 1st-class females and children showed a better survival rate). The most important factor deciding the fate of a passenger is gender. Males account for the majority of passengers, which weighs on the results: the female/male ratio is 41/59. According to the graphics, females had a higher survival percentage in every category; in short, as a female, you stood a better chance of survival. Node 1 shows the majority outcome is ‘deceased’ at 59%, and it accounts for 100% of the observations.
On Node 2, the tree predicts that female passengers would survive, at 75%, since 75% of the training instances in this node are positive and 25% are negative. The majority is positive, which is signaled by the number 1 on top, or the green color.
The 37% bit tells us which percentage of the entire training set passes through node 2.
Finally, pclass = 1,2 specifies the feature test on which this node is split. If the test comes out positive, the left branch is taken; if it’s negative, the right branch is taken.
In other words, the tree predicts that female passengers in class 3 are not likely to survive.
Nodes 5 and 3 confirm that children had better survival rates.
Preliminary conclusion: the tree-based model confirms the preliminary (exploratory) findings from section 1.
Classify with the decision tree
# Set random seed
set.seed(1)
# Predict the values of the test set
pred3 <- predict(tree3, titanic_test3, type = 'class')
# Construct the confusion matrix
# actuals vs predictions
conf3 <- table(titanic_test3$survived, pred3)
# Print out the accuracy
round(sum(diag(conf3))/sum(conf3), 2)
## [1] 0.79
Around 79% of all test instances have been classified correctly.
Pruning the tree
Let’s first lower the complexity parameter (cp) to see what an unrestricted tree looks like; pruning will then increase the bias by restricting the amount of detail the tree can model.
# Set random seed
set.seed(1)
titanic_train4 <- titanic_train3
titanic_test4 <- titanic_test3
tree4 <- rpart(survived ~ ., titanic_train4, method = 'class', control = rpart.control(cp = 0.00001))
# Draw the complex tree
fancyRpartPlot(tree4)
Now we have a tree overfitting the data. It is so precise that it does not mean anything. Let’s step back and prune the tree (change the cp).
# Prune the tree: pruned
pruned <- prune(tree4, cp = 0.01)
# Draw the pruned tree
fancyRpartPlot(pruned)
Another way to check whether we overfit the model is to compare the accuracy on the training set with the accuracy on the test set.
We can also challenge the tree-based models with another algorithm: the k-nearest neighbour (knn).
Preprocess the data
Beforehand, we need to address an important issue with the data: scaling. The scale of the input variables can greatly influence the outcome of the knn algorithm: one variable can span from 0 to 50 while another spans from 0 to 1.
First, rescale the variables; or normalize each vector. Here is the formula:
\[\frac{\mathbf{x} - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})}\]
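In R, the formula translates into a one-liner (a sketch; the helper name normalize is ours):

```r
# min-max normalization: rescale a numeric vector to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(0, 25, 50))  # 0.0 0.5 1.0
```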
The knn function
We are now ready to classify instances with the knn. k is set to 5 neighbours.
# Set random seed
set.seed(1)
# Load the class package
library(class)
# Make predictions using knn: pred
pred_knn <- knn(knn_train, knn_test, train_labels, k = 5)
# Construct the confusion matrix: conf
conf_knn <- table(test_labels, pred_knn)
# Print out the confusion matrix
conf_knn
## pred_knn
## test_labels 1 0
## 1 36 12
## 0 10 61
# Print out the accuracy
round(sum(diag(conf_knn))/sum(conf_knn), 2)
## [1] 0.82
With an accuracy of 82%, we confirm all the previous results obtained from our tree-based models.
k’s choice
Let’s push the analysis to another level and iterate over different values of k, then, compare the accuracies.
# Set random seed
set.seed(1)
# Load the class package, define range and accs
library(class)
# The range will span from 1 to about 100
range <- 1:round(0.168 * nrow(knn_train))
accs <- rep(0, length(range))
for (k in range) {
# Make predictions using knn: pred
pred <- knn(knn_train, knn_test, train_labels, k = k)
# Construct the confusion matrix: conf
conf <- table(test_labels, pred)
# Calculate the accuracy and store it in accs[k]
accs[k] <- sum(diag(conf))/sum(conf)
}
# Plot the accuracies. Title of x-axis is 'k'
plot(range, accs, xlab = 'k')
# Calculate the worst and best k
which.min(accs)
## [1] 22
round(min(accs), 2)
## [1] 0.75
which.max(accs)
## [1] 1
round(max(accs), 2)
## [1] 0.83
The accuracies oscillate between the min of 75% (k = 22) and the max of 83% (k = 1).
Again, we confirm all the previous results. For an even more robust estimate of the best k, we could use cross validation.
A cool way to visualize how knn works with two-dimensional features is the Voronoi diagram. It is basically a plot of all the training instances, together with a set of tiles around the points. Each tile represents a region of influence for each point. When we want to classify a new observation, we assign it to the tile in which its coordinates fall.
In the plot, we can see training instances that belong to either the blue or the red class. Each instance has two features: x and y (coordinates). The top left instance, for example, has an x value of around 0.05 and a y value of 0.9.
Suppose we are given an unseen observation with features (x, y) = (0.5, 0.5). Looking at the Voronoi diagram, which class would we give this observation? Blue.
Simulating a Voronoi diagram
#Let's generate some fake data
set.seed(105)
long <- rnorm(20,-98,15)
lat <- rnorm(20,39,10)
df <- data.frame(lat,long)
library(deldir)
library(ggplot2)
#This creates the voronoi line segments
voronoi <- deldir(df$long, df$lat)
#Now we can make a plot
ggplot(data = df, aes(x = long,y = lat)) +
#Plot the voronoi lines
geom_segment(
aes(x = x1, y = y1, xend = x2, yend = y2),
size = 2,
data = voronoi$dirsgs,
linetype = 1,
color = "#FFB958") +
#Plot the points
geom_point(
fill = rgb(70, 130, 180, 255, maxColorValue = 255),
pch = 21,
size = 4,
color = "#333333")
What is a ‘random forest’?
The Random Forest technique handles the overfitting problem we faced with decision trees. It grows multiple (very deep) classification trees using the training set. At prediction time, each tree comes up with a prediction and every outcome is counted as a vote.
For example, if we have trained 3 trees, with 2 saying a passenger in the test set will survive and 1 saying he will not, the passenger will be classified as a survivor.
This approach of overtraining trees, but having the majority’s vote count as the actual classification decision, avoids overfitting.
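The majority vote itself takes only a few lines. The vote vector below encodes the hypothetical three-tree example above, not output from a fitted forest:

```r
# Hypothetical predictions from three trees for one test passenger:
# 1 = survives, 0 = does not survive
votes <- c(1, 1, 0)
# Each outcome is counted as a vote; the majority wins
tally <- table(votes)
prediction <- names(which.max(tally))
prediction  # "1": the passenger is classified as a survivor
```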
Before starting with the actual analysis, we first need to meet one big condition of Random Forests: no missing values in our data frame.
# Check it out
summary(all_data)
## passengerid pclass survived name
## Min. : 1 Min. :1.000 Min. :0.000 Length:1309
## 1st Qu.: 328 1st Qu.:2.000 1st Qu.:0.000 Class :character
## Median : 655 Median :3.000 Median :0.000 Mode :character
## Mean : 655 Mean :2.295 Mean :0.382
## 3rd Qu.: 982 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :1309 Max. :3.000 Max. :1.000
##
## sex age sibsp parch
## female:466 Min. : 1.00 Min. :0.0000 Min. :0.000
## male :843 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :28.00 Median :0.0000 Median :0.000
## Mean :30.15 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :308
## ticket fare cabin embarked
## CA. 2343: 11 Min. : 0.000 Length:1309 C :270
## 1601 : 8 1st Qu.: 7.896 Class :character Q :123
## CA 2144 : 8 Median : 14.454 Mode :character S :914
## 3101295 : 7 Mean : 33.295 NA's: 2
## 347077 : 7 3rd Qu.: 31.275
## 347082 : 7 Max. :512.329
## (Other) :1261 NA's :1
## boat body home.dest
## Length:1309 Length:1309 Length:1309
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
We have missing values (NA).
# Some passengers do not have a value for embarkment.
# Since most passengers embarked at Southampton, we give them the value 'S'.
all_data$embarked[is.na(all_data$embarked)] <- 'S'
# Some passengers have an NA fare value.
# Replace it with the median fare value.
all_data$fare[is.na(all_data$fare)] <- median(all_data$fare, na.rm = TRUE)
# How do we fill in missing age values?
# Predict a passenger's age from the other variables with a decision tree model.
# method = 'anova' since we are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(age ~ pclass + sex + sibsp + parch + fare + embarked, data = all_data[!is.na(all_data$age), ], method = 'anova')
all_data$age[is.na(all_data$age)] <- predict(predicted_age, all_data[is.na(all_data$age),])
# Split the data back into a train set and a test set
split <- round(nrow(all_data) * 0.75)
train_rf <- all_data[1:split,]
test_rf <- all_data[(split + 1):nrow(all_data), ]
The data is cleaned up.
One more important element of Random Forests is randomization, which prevents the same tree from being grown over and over from the training set.
Randomization happens in two ways: each tree is grown on a random sample of the rows of the training set, and at every node of a tree only a limited, changing subset of the available variables is considered for the split.
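Those two sources of randomness can be sketched as follows. This is illustrative only: `randomForest` handles both internally, and `mtry` below merely mimics its default of `floor(sqrt(p))` variables per split for classification:

```r
set.seed(42)
n_rows <- 982                      # size of our training set
vars <- c("pclass", "sex", "age", "sibsp", "parch", "fare", "embarked")
# 1) Each tree sees a bootstrap sample: rows drawn with replacement
boot_rows <- sample(n_rows, replace = TRUE)
# 2) Each node considers only a random subset of the variables
mtry <- floor(sqrt(length(vars)))  # 2 variables out of 7
candidate_vars <- sample(vars, mtry)
```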
A ‘random forest’ analysis in R
# load the package
library(randomForest)
# train set and test set
str(train_rf)
## 'data.frame': 982 obs. of 15 variables:
## $ passengerid: num 1 2 3 4 5 6 7 8 9 10 ...
## $ pclass : num 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : num 1 1 0 0 0 1 1 0 1 0 ...
## $ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 28.7 2 30 25 ...
## $ sibsp : num 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : num 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 825 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
## $ embarked : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
## $ boat : chr "2" "11" NA NA ...
## $ body : chr NA NA NA "135" ...
## $ home.dest : chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
str(test_rf)
## 'data.frame': 327 obs. of 15 variables:
## $ passengerid: num 983 984 985 986 987 988 989 990 991 992 ...
## $ pclass : num 3 3 3 3 3 3 3 3 3 3 ...
## $ survived : num 0 0 1 1 0 0 0 0 0 1 ...
## $ name : chr "Lyntakoff, Mr. Stanko" "MacKay, Mr. George William" "Madigan, Miss. Margaret \"Maggie\"" "Madsen, Mr. Fridtjof Arne" ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 27 27 27 24 22 ...
## $ sibsp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ parch : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 519 775 637 743 903 381 740 741 897 273 ...
## $ fare : num 7.9 7.55 7.75 7.14 7.12 ...
## $ cabin : chr NA NA NA NA ...
## $ embarked : Factor w/ 3 levels "C","Q","S": 3 3 2 3 3 2 2 3 3 1 ...
## $ boat : chr NA NA "15" "13" ...
## $ body : chr NA NA NA NA ...
## $ home.dest : chr NA NA NA NA ...
# set seed for reproducibility
set.seed(111)
# apply the random forest algorithm
forest_rf <- randomForest(as.factor(survived) ~ pclass + sex + age + sibsp + parch + fare + embarked, data = train_rf, importance = TRUE, ntree = 1000)
# make our prediction using the test set
pred_rf <- predict(forest_rf, test_rf)
# Construct the confusion matrix
conf_rf <- table(test_rf$survived, pred_rf)
# Print out the confusion matrix
conf_rf
## pred_rf
## 0 1
## 0 222 22
## 1 53 30
# Print out the accuracy
round(sum(diag(conf_rf)) / sum(conf_rf), 2)
## [1] 0.77
We just created our first random forest, with an accuracy of 77%. This is lower than most of the previous ratios; remember, though, that the knn accuracy dropped as low as 75% (for k = 22).
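As a sanity check, the accuracy can be recomputed by hand from the printed confusion matrix:

```r
# The confusion matrix printed above: rows = actual, columns = predicted
conf <- matrix(c(222, 53, 22, 30), nrow = 2,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
# Correct classifications sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)  # 0.77
```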
Important variables
Now we can see which variables are important using varImpPlot().
varImpPlot(forest_rf)
Two graphs appear. The first, the mean decrease in accuracy, shows how much worse the model would perform without each variable: a large decrease (a high value on the x-axis) indicates a highly predictive variable. The second is the mean decrease in the Gini coefficient: the higher a variable scores here, the more important it is for the model.
Preliminary conclusion: sex appears to have the highest impact on the model. All tree-based models, including the random forest, confirm the preliminary (exploratory) findings from section 1.
Final conclusion
The ship carried different classes of passengers. Their destination was the same; however, their fateful ends differed. The rich traveled in the 1st class with all its luxuries and entitlements, while the poor were condemned to the dark, windowless quarters of the 3rd class.
When it became a question of survival, the rich prevailed over the rest. The Titanic carried far fewer rescue boats than needed: the lifeboats could hold only 1178 people, whereas the ship carried 2224. A ‘women and children first’ policy was adopted. However, rich women and children survived in greater proportions than poor women and children did.