Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets and results.
  • Source: ‘Machine Learning’ from DataCamp; Wikipedia and other websites.


The story of the Titanic is well known. On April 15, 1912, in the North Atlantic Ocean, the RMS Titanic sank on its maiden voyage.

Aboard, there were over 1300 passengers and over 900 crew members, for a total of over 2200 persons. Only one third survived the disaster.

First, we will explore the ship manifest.

Second, we will try to understand why some passengers survived. In other words, depending on the passenger status, could we have predicted who stood a better chance of survival? For that, we will use machine learning techniques.

Data Exploration

In this section, we explore the ship manifest. We rely on this source: https://www.encyclopedia-titanica.org/.

Please note that depending on the table, the tally may vary a bit (2201 vs. 2208). Other sources report 2224 passengers. In addition, passengers include crew members and officers. At some point in the analysis, we focus on traveling passengers only, not onboard personnel. Therefore, the word ‘passengers’ can include or exclude the crew.

Above all, we want to demonstrate that humans are poor at reading raw data. No matter how literate we have become, we still rely on our good old prehistoric reflex: we prefer visual representations.

Let’s explore the ship manifest with data tables.

Table 1, Titanic Survival Data Summarised by Class, Age Group and Sex

Class   Age Group   Sex       Number Survived   Number Aboard   Percent Survived
1st     child       male                    5               5                100
2nd     child       male                   11              11                100
3rd     child       male                   13              48               27.1
1st     adult       male                   57             175               32.6
2nd     adult       male                   14             168                8.3
3rd     adult       male                   75             462               16.2
Crew    adult       male                  192             862               22.3
1st     child       female                  1               1                100
2nd     child       female                 13              13                100
3rd     child       female                 14              31               45.2
1st     adult       female                140             144               97.2
2nd     adult       female                 80              93                 86
3rd     adult       female                 76             165               46.1
Crew    adult       female                 20              23                 87
TOTAL                                     711            2201               32.3

From all these numbers, we can see that one-third of the 2200 people aboard survived. Among the survivors, a good share were 1st-class passengers, females, and children.

The easiest figures to understand are the percentages. We do better when we can transform data into comparisons such as percentages and ratios; then we can make sense of all the information.
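
As a quick sanity check, we can recompute a few of the ‘Percent Survived’ figures in Table 1 (sketched here in Python for brevity, although the rest of the document uses R):

```python
# Percent Survived = Number Survived / Number Aboard * 100, one decimal
rows = [
    ('1st adult male',   57, 175),   # rows taken from Table 1
    ('3rd child male',   13,  48),
    ('2nd adult female', 80,  93),
]
for label, survived, aboard in rows:
    print(label, round(survived / aboard * 100, 1))
# 1st adult male 32.6
# 3rd child male 27.1
# 2nd adult female 86.0
```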

We can also see discrepancies among class numbers. The ship carried two very different classes of passengers comprising the aristocratic elite from England and Europe, and the poor immigrants from Great Britain, Ireland, and Scandinavia.

Let’s have two alternative views of the raw data.

Table 2, Titanic Survival Data Summarised by Title

Title      Number Survived   Number Aboard   Percent Survived
Mr                     315            1588               19.8
Mrs                    165             210               78.6
Ms                       2               2                100
Master                  31              61               50.8
Miss                   185             270               68.5
Mme.                     1               1                100
Mlle                     2               2                100
Sig.                     0              43                  0
Sir                      1               1                100
Lady                     1               1                100
Countess                 1               1                100
Don                      0               1                  0
Dona                     1               1                100
Fr                       0               7                  0
Rev                      0               1                  0
Captain                  0               2                  0
Colonel                  2               4                 50
Major                    1               2                 50
Dr                       4              10                 40
TOTAL                  712            2208               32.2

Table 3, Titanic Embarking Location Data

Embarked at:   crew   first   second   third   TOTAL   Number Survived   Number Aboard   Percent Survived
Unknown           0       3        0       0       3                 2               3               66.7
Belfast         188       4        6       0     198                43             198               21.7
Southampton     703     171      246     493    1613               469            1613               29.1
Cherbourg         0     146       26     102     274               156             274               56.9
Queenstown        0       0        7     113     120                42             120                 35
TOTAL           891     324      285     708    2208               712            2208               32.2

Preliminary conclusion: exploring raw data provides both the details and a big picture (with totals). However, humans are not very good at making sense of data tables. Even with categorical variables (male/female, A/B/C/D/E, 10 classes), we have difficulty sizing up the importance of the numbers. It requires us to overthink. It would be even more opaque with continuous variables (such as height or weight).

Visualizing the data

Let’s create a data table of ‘adult male passenger who survived’.

# The data
TitanicDF = as.data.frame(Titanic)

# Expand columns 1 to 4 with rep (replicate), using Freq as the counts.
# For example, the row (3rd, Male, Child, No) has Freq = 35,
# so 3rd, Male, Child and No are each repeated 35 times.
# Processing all rows this way rebuilds one record per person.
TitanicList <-  lapply(TitanicDF[1:4], rep, TitanicDF$Freq)

# Create a new d.f: adult male passenger who survived
TitanicSets <- data.frame(passenger = TitanicList$Class != 'Crew', 
                          adult = TitanicList$Age == 'Adult', 
                          male = TitanicList$Sex == 'Male', 
                          survivor = TitanicList$Survived == 'Yes')

# Check it out
head(TitanicSets, 3)
##   passenger adult male survivor
## 1      TRUE FALSE TRUE    FALSE
## 2      TRUE FALSE TRUE    FALSE
## 3      TRUE FALSE TRUE    FALSE

A good tool for parsing raw data is the Venn diagram. Let’s draw three diagrams:

  • ‘adult passenger’.
  • ‘adult male passenger’.
  • ‘adult male passenger who survived’.

# Load the library for venn diagrams
library(gplots)

# Plot the venn diagrams
# adult passenger
venn(TitanicSets[1:2])

# adult male passenger
venn(TitanicSets[1:3])

# adult male passenger who survived
venn(TitanicSets)

Parsing data this way offers a different view. The first diagram tells us there were 1207 ‘adult passengers’. Second, among these 1207, 805 were ‘males’. Third, among the 805, 146 ‘survived’.
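
These subtotals can be recomputed directly from Table 1 (a quick Python check; ‘passengers’ here means non-crew):

```python
# 'Number Aboard' for adults by class (1st, 2nd, 3rd), from Table 1
adult_males_aboard   = [175, 168, 462]
adult_females_aboard = [144,  93, 165]
# 'Number Survived' for adult male passengers by class
adult_males_survived = [57, 14, 75]

adult_passengers      = sum(adult_males_aboard) + sum(adult_females_aboard)
adult_male_passengers = sum(adult_males_aboard)
adult_male_survivors  = sum(adult_males_survived)

print(adult_passengers, adult_male_passengers, adult_male_survivors)
# 1207 805 146
```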

The problem is, beyond three groups, we fall back into the mess of the data tables: a data overflow. In addition, the gplots package does not size the circles according to each subtotal; one group should appear larger than another.

Let’s try another package and reproduce the three previous diagrams.

library(venneuler)

# adult passenger
plot(venneuler(TitanicSets[1:2]), col = hcl(0, 0, c(60, 80), 0.5), alpha = NA, border = 'black')

Much better! And…

# adult male passenger
plot(venneuler(TitanicSets[1:3]), col = hcl(0, 0, c(40, 80, 20), 0.5), alpha = NA, border = 'black')

# adult male passenger who survived
plot(venneuler(TitanicSets[1:4]), col = hcl(0, 0, c(60, 80, 20), 0.5), alpha = NA, border = 'black')

We read this last diagram:

  • There are adults and children.
  • There are males and females.
  • There are passengers and crew members.
  • There are survivors and deceased.
  • Among all the adult male passengers, only a few survived.

Preliminary conclusion: we avoid numbers and we simply focus on size. The answer is always the central ‘crescent’ formed by the circles overlapping each other (where we read the beginning of ‘passenger’). This crescent represents 146 persons, 7% of the ship manifest. Any good Venn diagram should provide a good visual overview of the data.

Can we do better? Yes, we can. Venn diagrams, beyond two circles, are always confusing, no matter how sharp they are. Let’s draw some graphics.

Visual exploration

From now on, we mainly focus on passengers only and try to exclude crew members.

Visual representation is a powerful technique for exploring a case before stepping into deeper analyses.

The next graph plots ‘Table 1, Titanic Survival Data Summarised by Class, Age Group and Sex’.

# Modify the data
Titanic1 <- data.frame(Titanic)
Titanic2 <- Titanic1
Titanic2$Survived <- as.factor(ifelse(Titanic2$Survived == 'Yes', 1, 0))
str(Titanic1)
## 'data.frame':    32 obs. of  5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...
str(Titanic2)
## 'data.frame':    32 obs. of  5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...
# Load the package
library(ggplot2)

# Extend ggplot2
library(grid)
library(gridExtra)

p <-  ggplot(Titanic1, aes(weight = Freq)) + ylab('') + ylim(0, 2250)

cs <- p + aes(Class) + geom_bar(fill = 'blue') + ggtitle('Passenger by Class,') + xlab('') 
sx <- p + aes(Sex) + geom_bar(fill = 'green') + ggtitle('by Gender,') + xlab('') 
ag <- p + aes(Age) + geom_bar(fill = 'tan2') + ggtitle('by Age Group,') + xlab('')
su <- p + aes(Survived) + geom_bar(fill = 'red') + ggtitle('and Survivors') + xlab('') 

grid.arrange(cs, sx, ag, su, nrow = 1, widths = c(3, 2, 2, 2))

Visually, the ship manifest has more explanatory power than the data table.

Proportions (or percentages) are built into the graphs. We don’t need to read them or interpret them; we see them. There is only one problem: we lose the details. For example, the bar charts show data by class, by gender, or by survival, one variable at a time.

What about a bivariate bar chart? It is called a mosaic chart. First, we plot two univariate mosaic charts (by class and by gender), then a bivariate mosaic chart combining classes and genders.

par(mfrow = c(1, 3))

mosaicplot(~ Class, data = Titanic, main = 'Univariate')
mosaicplot(~ Sex, data = Titanic, main = 'Univariate')
mosaicplot(~ Sex + Class, data = Titanic, main = 'Bivariate')

par(mfrow = c(1, 1))

However, mosaicplot has limited options. It would be interesting if we could ‘color’ each block by ‘survived/deceased’ instead of splitting the mosaic to a point beyond intelligibility.

mosaicplot(~ Sex + Class + Survived, data = Titanic, main = 'Multivariate')

Who survived? Where? What?

There is a better package for this kind of task: vcd. We can redo what we did above, and improve on it.

library(vcd)

# Build the mosaic plots as grid objects
A <- mosaic(~ Class, data = Titanic1, return_grob = TRUE)
B <- mosaic(Survived ~ Class, data = Titanic1, return_grob = TRUE)
C <- mosaic(~ Class + Sex, data = Titanic1, return_grob = TRUE)
D <- mosaic(Survived ~ Class + Sex, data = Titanic1, return_grob = TRUE)
E <- mosaic(~ Class + Sex + Age, data = Titanic1, return_grob = TRUE)
F <- mosaic(Survived ~ Class + Sex + Age, data = Titanic1, return_grob = TRUE)

# Multiple grid plots
mplot(A, B)

mplot(C, D)

mplot(E, F)

Now, the data is talking, because proportions are built into the graphs.

And there is an alternative to mosaic charts: doubledeckers. Doubledeckers are combinations of bar charts and mosaic charts.

G <- doubledecker(Survived ~ Sex, data = Titanic, gp = gpar(fill = c('grey90', 'blue')), return_grob = TRUE)
H <- doubledecker(Survived ~ Sex, data = Titanic, gp = gpar(fill = c('grey90', 'green')), return_grob = TRUE)

# Multiple grid plots
mplot(G, H)

doubledecker(Survived ~ Class + Sex, data = Titanic, gp = gpar(fill = c('grey90', 'red')))

We have to pay attention to the width. Crew members constitute the largest group in terms of numbers. The graph captures both sizes (totals) and proportions. It is a data table, with details and totals, but much easier to read and understand. We can already answer some questions while we would still be calculating with the data table.

Another key capability is displaying graphics side by side (in a grid or in a graphic with several facets). Comparison between graphics is as important as comparison within a graphic.

Some packages have this feature built in.

library(extracat)

rmb(formula = ~Class + Sex + Survived, data = Titanic1)

The lattice package offers interesting possibilities.

library(lattice)

# Normalized x-axis for comparison
barchart(Class ~ Freq | Sex + Age, data = as.data.frame(Titanic), groups = Survived, stack = TRUE, layout = c(4, 1), auto.key = list(title = 'Survived', columns = 2))

OK, lattice pastel tones are… well, we can turn the colour off.

# Save the theme
mytheme <- trellis.par.get()

# Switch to black & white
trellis.par.set(canonical.theme(color = FALSE))

Modify the axes (from normalized scales to free scales).

# Free x-axis
bc.titanic <- barchart(Class ~ Freq | Sex + Age, data = as.data.frame(Titanic), groups = Survived, stack = TRUE, layout = c(4, 1), auto.key = list(title = 'Survived', columns = 2), scales = list(x = 'free'))

bc.titanic

Bring back the colors.

trellis.par.set(mytheme)

bc.titanic <- barchart(Class ~ Freq | Sex + Age, data = as.data.frame(Titanic), groups = Survived, stack = TRUE, layout = c(4, 1), auto.key = list(title = 'Survived', columns = 2), scales = list(x = 'free'))

bc.titanic

Add background grids.

# Add bg grids
update(bc.titanic, panel = function(...) {
  panel.grid(h = 0, v = -1)
  panel.barchart(...)
})

Graphics are powerful for exploring data. Without sweating too much, we can already draw initial conclusions.

There were more crew members onboard than 1st-class passengers, yet the latter had a better survival rate than the former. Overall, 1st-class passengers had the highest survival rate.

More females survived even though there were more males onboard. In all classes (1st, 2nd, 3rd, Crew), there were more males, but more females survived. Actually, more females and children.

In the next sections, we will analyze these initial conclusions with statistical techniques.

Performance Measures

We dive deeper into the matter. As with diving depths, density increases: this section and the following one are denser than the previous ones.

We can analyze all sorts of things, but we need metrics to compare results. We need ways to measure whether an analysis is robust enough and whether it is suited to grasp a pattern.

In section 3, we introduce statistical techniques: the decision tree and the k-nearest neighbour. For this section, we focus on performance measuring, preprocessing and cross-validating.

Load the data

# Check it out: the 'titanic' data frame holds 714 complete passenger
# records (survived, pclass, sex, age) from the DataCamp course data
str(titanic)
## 'data.frame':    714 obs. of  4 variables:
##  $ survived: Factor w/ 2 levels "1","0": 2 1 1 1 2 2 2 1 1 1 ...
##  $ pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
##  $ age     : num  22 38 26 35 35 54 2 27 14 4 ...

The Confusion Matrix

With a decision tree, we can predict whether a person would have survived the disaster based on the variables: age, sex and pclass (travel class). We then compare the actuals to the prediction made by the tree with a confusion matrix. In the next chunk, we run a decision tree, make predictions, but only focus on the confusion matrix.

# Load the decision tree package
library(rpart)

# Set random seed
set.seed(1)

# A rpart
tree <- rpart(survived ~ ., data = titanic, method = 'class')
#tree

# Use predict()
pred <- predict(tree, titanic, type = 'class')
#pred

# Use table() to make a confusion matrix
# rows = 1 or 0 (actuals); column = 1 or 0 (predictions)
conf <- table(titanic$survived, pred)
conf
##    pred
##       1   0
##   1 212  78
##   0  53 371

So to summarize: 212 of the 290 actual survivors were correctly predicted to have survived. On the other hand, 371 of the 424 deceased were correctly predicted to have perished.

Deriving ratios from the Confusion Matrix

The confusion matrix provides the raw performance of the decision tree:

  • The survivors correctly predicted to have survived: true positives (TP); 212 from above.
  • The deceased who were wrongly predicted to have survived: false positives (FP); 53 above.
  • The survivors who were wrongly predicted to have perished: false negatives (FN); 78 from above.
  • The deceased who were correctly predicted to have perished: true negatives (TN); 371 from above.

The matrix:

\[\begin{matrix} & \mathbf{P} & \mathbf{N}\\ \mathbf{p} & \mathrm{TP}& \mathrm{FN}\\ \mathbf{n} & \mathrm{FP}& \mathrm{TN}\\ \end{matrix}\]

The ratios:

\[Accuracy = \frac{TP+TN}{TP+FN+FP+TN}\]

\[Precision = \frac{TP}{TP+FP}\]

\[Recall = \frac{TP}{TP+FN}\]

# Given a Confusion Matrix
conf
##    pred
##       1   0
##   1 212  78
##   0  53 371
# Assign TP, FN, FP and TN using conf
TP <- conf[1,1]
TN <- conf[2,2]
FP <- conf[2,1]
FN <- conf[1,2]

# Calculate and print the accuracy
acc <- (TP + TN)/(TP + FN + FP + TN)

# Calculate and print out the precision
prec <- TP/(TP + FP)

# Calculate and print out the recall
rec <- TP/(TP + FN)

r1 <- c('Accuracy', 'Precision', 'Recall')
r2 <- round(c(acc, prec, rec), 2)
names(r2) <- r1

r2
##  Accuracy Precision    Recall 
##      0.82      0.80      0.73
dotchart(r2)

With an accuracy of around 82%, the model incorrectly predicted 18% of the passengers’ fates. Overall we have an acceptable, but not truly satisfying classifier.

Split the sets

A decision tree is a supervised algorithm. We need to train the algorithm first. Then, we compare some predictions with the actual results: the test.

We need to split the dataset into train and test sets. We use a 70/30 split between the two.

# Set random seed
set.seed(1)

# Shuffle the dataset
n <- nrow(titanic)
shuffled <- titanic[sample(n),]

# Split the data in train and test
titanic_train <- shuffled[1:round(0.7 * n),]
titanic_test <- shuffled[(round(0.7 * n) + 1):n,]

# Print the structure of train and test
str(titanic_train)
## 'data.frame':    500 obs. of  4 variables:
##  $ survived: Factor w/ 2 levels "1","0": 2 1 1 2 2 2 2 2 1 1 ...
##  $ pclass  : Factor w/ 3 levels "1","2","3": 2 1 1 2 2 3 3 3 3 2 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 2 2 2 1 2 2 1 1 ...
##  $ age     : num  24 16 36 39 30 30 21 35 36 5 ...
str(titanic_test)
## 'data.frame':    214 obs. of  4 variables:
##  $ survived: Factor w/ 2 levels "1","0": 2 1 2 2 2 1 1 1 1 2 ...
##  $ pclass  : Factor w/ 3 levels "1","2","3": 3 2 3 3 2 2 3 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 2 ...
##  $ age     : num  42 1 21 30.5 29 34 21 32 44 37 ...

This procedure can be repeated several times to build several train and test sets. This is helpful for training, testing and validating.

First, we train, then we test

# Set random seed
set.seed(1)

# The model
tree <- rpart(survived ~ ., titanic_train, method = 'class')

# Predict the outcome on the test set
pred <- predict(tree, titanic_test, type = 'class')

# Calculate the confusion matrix
conf2 <- table(titanic_test$survived, pred)

# Print this confusion matrix
conf2
##    pred
##       1   0
##   1  47  38
##   0   5 124
# accuracy = (TP+TN) / (TP+FN+FP+TN)
#    sum on diagonal / sum of all
# Print the accuracy
round(sum(diag(conf2))/sum(conf2), 2)
## [1] 0.8

The confusion matrix reveals an accuracy of 80%. This is slightly less than the 82% calculated above, which was measured on the same data the tree was trained on.

Using Cross Validation

Let’s now build several train and test sets; 6 splits or folds.

# Set random seed
set.seed(1)

# Initialize the accs vector
accs <- rep(0, 6)

for (i in 1:6) {
  # these indices indicate the interval of the test set
  indices <- (((i - 1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  # exclude them from the train set
  titanic_train2 <- shuffled[-indices,]
  # include them in the test set
  titanic_test2 <- shuffled[indices,]
  # a model is learned using each training set
  tree2 <- rpart(survived ~ ., titanic_train2, method = 'class')
  # make a prediction on the test set using tree
  pred2 <- predict(tree2, titanic_test2, type = 'class')
  # assign the confusion matrix to conf
  conf2 <- table(titanic_test2$survived, pred2)
  # assign the accuracy of this model to the ith index in accs
  # accuracy = (TP+TN) / (TP+FN+FP+TN)
  #    sum on diagonal / sum of all    
  accs[i] <- sum(diag(conf2))/sum(conf2)
}

# Print out the mean accuracy
round(mean(accs), 2)
## [1] 0.8

The mean accuracy is 80%. With 6 folds, the model confirms the previous results.

How many folds?

Let’s say we’re doing cross-validation on a dataset with 22680 observations: n. We want each training set to contain 21420 entries: tr. How many folds can we use for our cross-validation? In other words, how many iterations with distinct test sets can we run on the dataset?

n - tr = 1260 test entries per fold, and n / 1260 = 18 folds.
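
The arithmetic as a quick check (in Python here, for brevity):

```python
n, tr = 22680, 21420      # dataset size and training-set size per fold
test_size = n - tr        # observations held out in each test fold
folds = n // test_size    # number of non-overlapping test folds
print(test_size, folds)   # 1260 18
```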

Now we know how to preprocess the data and how to measure and compare results. Let’s turn our attention to machine learning algorithms (or statistical learning; the two names are synonymous).

Classification

There are two principal classification techniques in machine learning: decision trees and k-nearest neighbours. Both are supervised techniques: they learn from labelled examples, and the analyst must set their parameters. Let’s start with decision trees.

A note on tree-based models

Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data while visualizing the decision rules for predicting a categorical outcome (classification tree) or a continuous outcome (regression tree).

  1. Classification And Regression Tree (CART)

Classification and regression trees can be generated through the rpart package.

Main methods:

  • 'class' for a classification tree on a categorical dependent variable.
  • 'anova' for a regression tree on a continuous dependent variable.

There are also the 'poisson' and 'exp' methods. The optional control argument holds parameters for controlling and pruning tree growth.

  2. Conditional Inference Tree

The party package provides nonparametric regression trees in cases of nonparametric regressions (nominal, ordinal, censored and multivariate responses). Pruning is unnecessary.

  3. Random Forests

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler’s random forest approach is implemented via the randomForest package.

Learn a decision tree

First, grow the tree. Second, examine the results. Third, prune the tree.

Let’s grow a decision tree that uses a person’s age, gender, and travel class to predict whether or not they survived the Titanic.

# Set random seed
set.seed(1)

titanic_train3 <- titanic_train2
titanic_test3 <- titanic_test2

# Build a tree model
# Optional controls (not used here): rpart.control(minsplit = 30, cp = 0.001)
# minsplit = minimum number of observations in a node before splitting
# cp = cost-complexity factor
tree3 <- rpart(survived ~ ., data = titanic_train3, method = 'class')

# Print the results
print(tree3)
## n= 595 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 595 242 0 (0.40672269 0.59327731)  
##    2) sex=female 220  56 1 (0.74545455 0.25454545)  
##      4) pclass=1,2 133   7 1 (0.94736842 0.05263158) *
##      5) pclass=3 87  38 0 (0.43678161 0.56321839)  
##       10) age< 5.5 10   2 1 (0.80000000 0.20000000) *
##       11) age>=5.5 77  30 0 (0.38961039 0.61038961) *
##    3) sex=male 375  78 0 (0.20800000 0.79200000)  
##      6) age< 6.5 18   5 1 (0.72222222 0.27777778) *
##      7) age>=6.5 357  65 0 (0.18207283 0.81792717) *
# Print detailed results including surrogate splits 
summary(tree3)
## Call:
## rpart(formula = survived ~ ., data = titanic_train3, method = "class")
##   n= 595 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.44628099      0 1.0000000 1.0000000 0.04951322
## 2 0.04545455      1 0.5537190 0.5537190 0.04210455
## 3 0.03305785      2 0.5082645 0.5661157 0.04243451
## 4 0.02479339      3 0.4752066 0.5247934 0.04130008
## 5 0.01000000      4 0.4504132 0.4793388 0.03993217
## 
## Variable importance
##    sex pclass    age 
##     63     22     15 
## 
## Node number 1: 595 observations,    complexity param=0.446281
##   predicted class=0  expected loss=0.4067227  P(node) =1
##     class counts:   242   353
##    probabilities: 0.407 0.593 
##   left son=2 (220 obs) right son=3 (375 obs)
##   Primary splits:
##       sex    splits as  LR,       improve=80.103310, (0 missing)
##       pclass splits as  LLR,      improve=34.140610, (0 missing)
##       age    < 6.5  to the left,  improve= 9.892647, (0 missing)
##   Surrogate splits:
##       age < 0.79 to the left,  agree=0.634, adj=0.009, (0 split)
## 
## Node number 2: 220 observations,    complexity param=0.04545455
##   predicted class=1  expected loss=0.2545455  P(node) =0.3697479
##     class counts:   164    56
##    probabilities: 0.745 0.255 
##   left son=4 (133 obs) right son=5 (87 obs)
##   Primary splits:
##       pclass splits as  LLR,      improve=27.423150, (0 missing)
##       age    < 21.5 to the right, improve= 1.601385, (0 missing)
##   Surrogate splits:
##       age < 21.5 to the right, agree=0.686, adj=0.207, (0 split)
## 
## Node number 3: 375 observations,    complexity param=0.03305785
##   predicted class=0  expected loss=0.208  P(node) =0.6302521
##     class counts:    78   297
##    probabilities: 0.208 0.792 
##   left son=6 (18 obs) right son=7 (357 obs)
##   Primary splits:
##       age    < 6.5  to the left,  improve=9.999246, (0 missing)
##       pclass splits as  LRR,      improve=6.239221, (0 missing)
## 
## Node number 4: 133 observations
##   predicted class=1  expected loss=0.05263158  P(node) =0.2235294
##     class counts:   126     7
##    probabilities: 0.947 0.053 
## 
## Node number 5: 87 observations,    complexity param=0.02479339
##   predicted class=0  expected loss=0.4367816  P(node) =0.1462185
##     class counts:    38    49
##    probabilities: 0.437 0.563 
##   left son=10 (10 obs) right son=11 (77 obs)
##   Primary splits:
##       age < 5.5  to the left,  improve=2.981221, (0 missing)
## 
## Node number 6: 18 observations
##   predicted class=1  expected loss=0.2777778  P(node) =0.0302521
##     class counts:    13     5
##    probabilities: 0.722 0.278 
## 
## Node number 7: 357 observations
##   predicted class=0  expected loss=0.1820728  P(node) =0.6
##     class counts:    65   292
##    probabilities: 0.182 0.818 
## 
## Node number 10: 10 observations
##   predicted class=1  expected loss=0.2  P(node) =0.01680672
##     class counts:     8     2
##    probabilities: 0.800 0.200 
## 
## Node number 11: 77 observations
##   predicted class=0  expected loss=0.3896104  P(node) =0.1294118
##     class counts:    30    47
##    probabilities: 0.390 0.610
# Examine the cp table
printcp(tree3)
## 
## Classification tree:
## rpart(formula = survived ~ ., data = titanic_train3, method = "class")
## 
## Variables actually used in tree construction:
## [1] age    pclass sex   
## 
## Root node error: 242/595 = 0.40672
## 
## n= 595 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.446281      0   1.00000 1.00000 0.049513
## 2 0.045455      1   0.55372 0.55372 0.042105
## 3 0.033058      2   0.50826 0.56612 0.042435
## 4 0.024793      3   0.47521 0.52479 0.041300
## 5 0.010000      4   0.45041 0.47934 0.039932
# Plot the cross-validation results
plotcp(tree3)

# Plot the decision tree
# options: uniform = TRUE, main = 'Title'
plot(tree3, uniform = TRUE, main = 'Tree 3')

# Add text
# options: use.n = TRUE, cex = 0.8
text(tree3, use.n = TRUE, cex = 0.8)

Can we improve the visual? Of course we can.

# Load additional packages
library(rattle)
library(rpart.plot)
library(RColorBrewer)

# Draw a prettier decision tree
fancyRpartPlot(tree3)

Let’s examine the results. On Node 1, the tree confirms the exploratory data analysis (recall that 1st-class females and children showed a better survival rate). The most important factor deciding the fate of a passenger is gender. Males account for the majority of passengers, which weighs on the results: the female/male ratio is 41/59. According to the graphics, females had a higher survival percentage in every category; as a female, you stood a better chance of surviving. Node 1 shows that the majority outcome is ‘deceased’ at 59%, and the node accounts for 100% of the observations.

On Node 2, the tree predicts that female passengers would survive, at 75%, since 75% of the training instances reaching that node are positive and 25% are negative. The majority is positive, which is signalled by the number 1 on top and the green colour.

The 37% figure tells us what percentage of the entire training set passes through Node 2.

Finally, pclass = 1,2 specifies the feature test on which this node is split. If the test comes out positive, the left branch is taken; if negative, the right branch is taken.

In other words, the tree predicts that female passengers in class 3 are not likely to survive.

Nodes 5 and 3 confirm that children had better survival rates.

Preliminary conclusion: the tree-based model confirms the preliminary (exploratory) findings from section 1.

Classify with the decision tree

# Set random seed
set.seed(1)

# Predict the values of the test set
pred3 <- predict(tree3, titanic_test3, type = 'class')

# Construct the confusion matrix
# actuals vs predictions
conf3 <- table(titanic_test3$survived, pred3)

# Print out the accuracy
round(sum(diag(conf3))/sum(conf3), 2)
## [1] 0.79

Around 79% of all test instances have been classified correctly.

Pruning the tree

Let’s lower the complexity parameter (cp) so the tree is barely restricted in the amount of detail it can model; this reduces the bias but inflates the variance.

# Set random seed
set.seed(1)

titanic_train4 <- titanic_train3
titanic_test4 <- titanic_test3

tree4 <- rpart(survived ~ ., titanic_train4, method = 'class', control = rpart.control(cp = 0.00001))

# Draw the complex tree
fancyRpartPlot(tree4)

Now we have a tree overfitting the data. It is so precise that it doesn’t mean anything anymore. Let’s step back and prune the tree (raise the cp).

# Prune the tree: pruned
pruned <- prune(tree4, cp = 0.01)

# Draw the pruned tree
fancyRpartPlot(pruned)

Another way to check if we overfit our model is by comparing the accuracy on the training set with the accuracy on the test set.

We can also challenge the tree-based models with another algorithm: the k-nearest neighbours (knn).

Preprocess the data

Beforehand, we need to address an important issue with the data: scaling. The scale of the input variables may have a great influence on the outcome of the knn algorithm. One variable can span from 0 to 50 while another can span from 0 to 1.

First, rescale the variables; or normalize each vector. Here is the formula:

\[\frac{\mathbf{x} - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})}\]
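
A minimal sketch of this min-max normalization (in Python here; in R, the same can be done with `(x - min(x)) / (max(x) - min(x))`):

```python
def rescale(x):
    """Map a numeric vector onto [0, 1] with min-max normalization."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

ages = [2, 22, 26, 35, 54]   # sample 'age' values
print(rescale(ages))         # smallest maps to 0.0, largest to 1.0
```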

The knn function

We are now ready to classify instances with the knn. k is set to 5 neighbours.

# Set random seed
set.seed(1)

# Load the class package
library(class)

# Make predictions using knn: pred
pred_knn <- knn(knn_train, knn_test, train_labels, k = 5)

# Construct the confusion matrix: conf
conf_knn <- table(test_labels, pred_knn)

# Print out the confusion matrix
conf_knn
##            pred_knn
## test_labels  1  0
##           1 36 12
##           0 10 61
# Print out the accuracy
round(sum(diag(conf_knn))/sum(conf_knn), 2)
## [1] 0.82

With an accuracy of 82%, we confirm all the previous results obtained from our tree-based models.

k’s choice

Let’s push the analysis to another level and iterate over different values of k, then, compare the accuracies.

# Set random seed
set.seed(1)

# Load the class package, define range and accs
library(class)

# The range spans from 1 to about 100 (16.8% of the training-set size)
range <- 1:round(0.168 * nrow(knn_train))

accs <- rep(0, length(range))

for (k in range) {
  # Make predictions using knn: pred
    pred <- knn(knn_train, knn_test, train_labels, k = k)
  # Construct the confusion matrix: conf
    conf <- table(test_labels, pred)
  # Calculate the accuracy and store it in accs[k]
    accs[k] <- sum(diag(conf))/sum(conf)
}

# Plot the accuracies. Title of x-axis is 'k'
plot(range, accs, xlab = 'k')

# Calculate the worst and best k
which.min(accs)
## [1] 22
round(min(accs), 2)
## [1] 0.75
which.max(accs)
## [1] 1
round(max(accs), 2)
## [1] 0.83

The accuracies oscillate between the min of 75% (k = 22) and the max of 83% (k = 1).

Again, we confirm all the previous results. For an even more robust estimate of the best k, we could use cross validation.

A cool way to visualize how knn works with two-dimensional features is the Voronoi diagram. It is basically a plot of all the training instances, together with a set of tiles around the points. Each tile represents the region of influence of a point. When we want to classify a new observation, we assign it the class of the tile in which it falls.

In the plot, we can see training instances that belong to either the blue or the red class. Each instance has two features: x and y (coordinates). The top left instance, for example, has an x value of around 0.05 and a y value of 0.9.

Suppose we are given an unseen observation with features (x, y) = (0.5, 0.5). Looking at the Voronoi diagram, which class would we give this observation? Blue.

Simulating a Voronoi diagram

# Let's generate some fake data
set.seed(105)
long <- rnorm(20, -98, 15)
lat <- rnorm(20, 39, 10)
df <- data.frame(lat, long)

library(deldir)
library(ggplot2)

# This creates the Voronoi line segments
voronoi <- deldir(df$long, df$lat)

# Now we can make a plot
ggplot(data = df, aes(x = long, y = lat)) +
  # Plot the Voronoi lines
  geom_segment(
    aes(x = x1, y = y1, xend = x2, yend = y2),
    size = 2,
    data = voronoi$dirsgs,
    linetype = 1,
    color = "#FFB958") +
  # Plot the points
  geom_point(
    fill = rgb(70, 130, 180, 255, maxColorValue = 255),
    pch = 21,
    size = 4,
    color = "#333333")

Improving Predictions through Random Forests (and Final Conclusion)

What is a ‘random forest’

The Random Forest technique handles the overfitting problem we faced with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote.

For example, if we have trained 3 trees, with 2 saying a passenger in the test set will survive and 1 saying he will not, the passenger will be classified as a survivor.

This approach, overgrowing the individual trees but letting the majority vote decide the final classification, avoids overfitting.
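The voting step can be sketched in a few lines. This toy illustration is not the randomForest internals, just the majority-vote logic applied to three hypothetical tree predictions for one passenger.

```r
# Three hypothetical tree predictions for one passenger:
# two trees say 'survived' (1), one says 'died' (0)
votes <- c(1, 1, 0)

# The passenger gets the class chosen by more than half the trees
majority <- as.integer(sum(votes) > length(votes) / 2)
majority
```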

Before starting with the actual analysis, we first need to meet one big condition of Random Forests: no missing values in our data frame.

# Check it out
summary(all_data)
##   passengerid       pclass         survived         name          
##  Min.   :   1   Min.   :1.000   Min.   :0.000   Length:1309       
##  1st Qu.: 328   1st Qu.:2.000   1st Qu.:0.000   Class :character  
##  Median : 655   Median :3.000   Median :0.000   Mode  :character  
##  Mean   : 655   Mean   :2.295   Mean   :0.382                     
##  3rd Qu.: 982   3rd Qu.:3.000   3rd Qu.:1.000                     
##  Max.   :1309   Max.   :3.000   Max.   :1.000                     
##                                                                   
##      sex           age            sibsp            parch      
##  female:466   Min.   : 1.00   Min.   :0.0000   Min.   :0.000  
##  male  :843   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##               Median :28.00   Median :0.0000   Median :0.000  
##               Mean   :30.15   Mean   :0.4989   Mean   :0.385  
##               3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##               Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##               NA's   :308                                     
##       ticket          fare            cabin           embarked  
##  CA. 2343:  11   Min.   :  0.000   Length:1309        C   :270  
##  1601    :   8   1st Qu.:  7.896   Class :character   Q   :123  
##  CA 2144 :   8   Median : 14.454   Mode  :character   S   :914  
##  3101295 :   7   Mean   : 33.295                      NA's:  2  
##  347077  :   7   3rd Qu.: 31.275                                
##  347082  :   7   Max.   :512.329                                
##  (Other) :1261   NA's   :1                                      
##      boat               body            home.dest        
##  Length:1309        Length:1309        Length:1309       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

We have missing values (NA).

# Some passengers do not have a value for embarkment.
# Since many passengers embarked at Southampton, we give them the value S.
all_data$embarked[is.na(all_data$embarked)] <- 'S'

# One passenger has an NA fare value.
# Replace it with the median fare value.
all_data$fare[is.na(all_data$fare)] <- median(all_data$fare, na.rm = TRUE)

# How to fill in missing age values?
# Predict a passenger's age from the other variables with a decision tree model.
# method = 'anova' since we are predicting a continuous variable.
predicted_age <- rpart(age ~ pclass + sex + sibsp + parch + fare + embarked, data = all_data[!is.na(all_data$age),], method = 'anova')

all_data$age[is.na(all_data$age)] <- predict(predicted_age, all_data[is.na(all_data$age),])

# Split the data back into a train set and a test set
split <- round(nrow(all_data) * 0.75)

train_rf <- all_data[1:split,]
test_rf <- all_data[(split + 1):nrow(all_data),]

The data is cleaned up.

One more important element in Random Forests is randomization, which avoids growing the same tree over and over from the training set.

We randomize in two ways: by taking a bootstrap sample of the rows of the training set for each tree, and by working with only a limited, changing subset of the available variables at every node of the tree.
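The two randomization steps can be sketched as follows. This is an illustration only, not the randomForest internals; the row and variable counts are made up.

```r
# A minimal sketch of the two randomization steps (hypothetical sizes)
set.seed(42)
n_rows <- 100  # rows in the training set
n_vars <- 7    # available predictor variables

# 1. Bootstrap sample of the rows (sampling with replacement),
#    drawn once per tree
row_sample <- sample(n_rows, n_rows, replace = TRUE)

# 2. At each node, consider only a random subset of the variables;
#    for classification, randomForest defaults to mtry = floor(sqrt(p))
mtry <- floor(sqrt(n_vars))
vars_at_node <- sample(n_vars, mtry)

length(row_sample)  # 100 rows, but with duplicates
mtry                # 2
```

Because each tree sees different rows and different candidate variables, the trees disagree with each other, and their averaged vote generalizes better than any single tree.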

A ‘random forest’ analysis in R

# load the package
library(randomForest)

# train set and test set
str(train_rf)
## 'data.frame':    982 obs. of  15 variables:
##  $ passengerid: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ pclass     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived   : num  1 1 0 0 0 1 1 0 1 0 ...
##  $ name       : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
##  $ sex        : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age        : num  29 28.7 2 30 25 ...
##  $ sibsp      : num  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch      : num  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket     : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 825 ...
##  $ fare       : num  211 152 152 152 152 ...
##  $ cabin      : chr  "B5" "C22 C26" "C22 C26" "C22 C26" ...
##  $ embarked   : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
##  $ boat       : chr  "2" "11" NA NA ...
##  $ body       : chr  NA NA NA "135" ...
##  $ home.dest  : chr  "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
str(test_rf)
## 'data.frame':    327 obs. of  15 variables:
##  $ passengerid: num  983 984 985 986 987 988 989 990 991 992 ...
##  $ pclass     : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ survived   : num  0 0 1 1 0 0 0 0 0 1 ...
##  $ name       : chr  "Lyntakoff, Mr. Stanko" "MacKay, Mr. George William" "Madigan, Miss. Margaret \"Maggie\"" "Madsen, Mr. Fridtjof Arne" ...
##  $ sex        : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 2 2 2 2 ...
##  $ age        : num  27 27 27 24 22 ...
##  $ sibsp      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ parch      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ticket     : Factor w/ 929 levels "110152","110413",..: 519 775 637 743 903 381 740 741 897 273 ...
##  $ fare       : num  7.9 7.55 7.75 7.14 7.12 ...
##  $ cabin      : chr  NA NA NA NA ...
##  $ embarked   : Factor w/ 3 levels "C","Q","S": 3 3 2 3 3 2 2 3 3 1 ...
##  $ boat       : chr  NA NA "15" "13" ...
##  $ body       : chr  NA NA NA NA ...
##  $ home.dest  : chr  NA NA NA NA ...
# set seed for reproducibility
set.seed(111)

# apply the random forest algorithm
forest_rf <- randomForest(as.factor(survived) ~ pclass + sex + age + sibsp + parch + fare + embarked, data = train_rf, importance = TRUE, ntree = 1000)

# make our prediction using the test set
pred_rf <- predict(forest_rf, test_rf)

# Construct the confusion matrix
conf_rf <- table(test_rf$survived, pred_rf)

# Print out the confusion matrix
conf_rf
##    pred_rf
##       0   1
##   0 222  22
##   1  53  30
# Print out the accuracy
round(sum(diag(conf_rf))/sum(conf_rf), 2)
## [1] 0.77

We just created our first random forest, with an accuracy of 77%. This is the lowest accuracy compared to all the previous models. However, remember that the minimum accuracy for k-NN was 75% (for k = 22).

Important variables

Now we can see which variables are important using varImpPlot.

varImpPlot(forest_rf)

Two graphs appear. The accuracy plot shows how much worse the model would perform without each variable, so a large decrease (a high value on the x-axis) indicates a highly predictive variable. The second plot shows the mean decrease in the Gini index: the higher a variable scores here, the more important it is for the model.

Preliminary conclusion: sex appears to have the highest impact on the model. All tree-based models, including the random forest, confirm the preliminary (exploratory) findings from section 1.

Final conclusion

The ship carried different classes of passengers. Their destination was the same; however, their fateful ends differed. The rich traveled in 1st class with all the luxuries and entitlements, while the poor were condemned to the dark, windowless quarters of 3rd class.

When it became a question of survival, the rich prevailed over the rest. The Titanic carried far fewer lifeboats than needed: they could hold only 1178 people, whereas the ship carried 2224. A ‘women and children first’ policy was adopted. However, rich women and children survived in greater proportions than did poor women and children.