Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets and results.
  • Source: ‘Introduction to Machine Learning’ from DataCamp.


What is Machine Learning

Machine learning algorithms are flexible methods that can learn from examples; here we focus on supervised methods, which are good at classification. For example, from a set of symptoms, an algorithm can predict a sickness (a classification between ‘sick’ and ‘not sick’).

In general, we feed patterns or predictors to the algorithm, just as we feed data to a regression. We then use the fitted algorithm, like a regression, to forecast, extrapolate or interpolate.

Sophisticated methods can parse through anything: structured data such as numbers and tables, or unstructured data such as text and natural language, sounds, images, etc.

Here, we focus on something simple: spam. Spam (emails) is text. However, we preprocessed the emails and extracted the patterns: whether the email contains capital characters, the average length of sequences of capital characters, the presence of given words or n-grams, word or n-gram frequencies, etc.

We then train the algorithm by feeding it the preprocessed database. Each line of the database holds one answer (‘Spam’ or ‘Not’, 1 or 0) alongside the patterns. The algorithm learns and makes connections. Finally, we feed the algorithm another database; this time, the algorithm must predict the answers, and we measure the success rate against the true answers. When the algorithm is robust enough (high success rate), we can apply it to other databases to extract spam from emails (keeping in mind that an error margin remains).

Note: an n-gram is a sequence of n consecutive words; a pair is a bigram, a triple is a trigram, etc.
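As an illustration, bigrams can be built from a tokenized sentence with base R alone (the tokens below are invented):

```r
# Tokenized sentence (invented example)
tokens <- c('free', 'money', 'now')

# Pair each word with its successor to form the bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
## [1] "free money" "money now"
```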

Classification: Filtering spam

Filtering spam from relevant emails is a typical machine learning task. Information such as word frequency, character frequency and the number of capital letters can indicate whether an email is spam or not.

We have a small dataset: emails (Source: UCI Machine Learning Repository).

# Check it out
head(emails)
##   avg_capital_seq spam
## 1           1.000    0
## 2           2.112    0
## 3           4.123    1
## 4           1.863    0
## 5           2.973    1
## 6           1.687    0
# Show the dimensions of emails
dim(emails)
## [1] 13  2

Several emails have been labeled by humans as spam (1) or not spam (0) and the results are found in the column spam.

We have a simple classifier (a custom function).

spam_classifier <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(prediction)
}

Apply the classifier to the avg_capital_seq column.

pred_small <- spam_classifier(emails$avg_capital_seq)
pred_small
##  [1] 0 0 1 0 1 0 1 0 0 1 0 0 1

Compare pred_small to emails$spam.

pred_small == emails$spam
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The classifier labeled the emails correctly 13 out of 13 times! Sadly, its rules were hand-crafted to perfectly fit these 13 examples.
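The success rate can also be computed directly with mean(); a minimal, self-contained sketch using the six rows printed by head(emails) above:

```r
# Mini data frame mirroring head(emails) (6 of the 13 rows)
emails_demo <- data.frame(
  avg_capital_seq = c(1.000, 2.112, 4.123, 1.863, 2.973, 1.687),
  spam            = c(0, 0, 1, 0, 1, 0)
)

# The classifier defined above
spam_classifier <- function(x){
  prediction <- rep(NA, length(x))
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  prediction
}

# Share of correct predictions (TRUE counts as 1, FALSE as 0)
mean(spam_classifier(emails_demo$avg_capital_seq) == emails_demo$spam)
## [1] 1
```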



Performance measures

Overfitting the spam!

Load a larger dataset, emails_full (Source: UCI Machine Learning Repository). Print it.

# Check it out
head(emails_full)
##   avg_capital_seq spam
## 1           1.500    0
## 2           4.941    1
## 3           3.429    1
## 4           3.493    1
## 5           3.380    0
## 6           3.689    1
# Show the dimensions of emails_full
dim(emails_full)
## [1] 4601    2

Use the larger dataset emails_full with the same classification rules as above (the classifier now returns a factor, which is convenient for building a confusion matrix with table()).

# The spam filter
spam_classifier <- function(x) {
  prediction <- rep(NA, length(x))
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(factor(prediction, levels = c('0', '1')))
}

Build a confusion matrix and assess accuracy, one of the three classic performance ratios (alongside precision and recall).
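For reference, the three ratios can be read off a 2x2 confusion matrix (rows = truth, columns = prediction); the counts below are invented for illustration:

```r
# Invented confusion matrix: rows = truth, columns = prediction
conf <- matrix(c(50, 10,
                  5, 35),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c('0', '1'), pred = c('0', '1')))

TP <- conf['1', '1']; TN <- conf['0', '0']
FP <- conf['0', '1']; FN <- conf['1', '0']

(TP + TN) / sum(conf)  # accuracy: share of all emails labeled correctly
## [1] 0.85
TP / (TP + FP)         # precision: share of flagged emails that are spam
TP / (TP + FN)         # recall: share of spam that gets flagged
```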

# Apply spam_classifier to emails_full
pred_full <- spam_classifier(emails_full$avg_capital_seq)

# Build confusion matrix for emails_full: conf_full
conf_full <- table(emails_full$spam, pred_full)

# Calculate the accuracy with conf_full: acc_full
acc_full <- sum(diag(conf_full))/sum(conf_full)

# Print acc_full
acc_full
## [1] 0.6561617

This hard-coded classifier gives an accuracy of around 66% on the large dataset, far worse than the 100% on the small emails dataset. The classifier does not generalize well at all!

Increasing the bias

The spam_classifier above is bogus. It simply overfits on the emails set and, as a result, doesn’t generalize to larger datasets such as emails_full.

Simplify the rules.

spam_classifier_s <- function(x) {
  prediction <- rep(NA,length(x))
  prediction[x > 4] <- 1
  prediction[x <= 4] <- 0
  return(factor(prediction, levels = c('0', '1')))
}

Run the classifier on emails_full. Calculate the accuracy.

# Apply spam_classifier to emails_full
pred_full <- spam_classifier_s(emails_full$avg_capital_seq)

# Build confusion matrix for emails_full: conf_full
conf_full <- table(emails_full$spam, pred_full)

# Calculate the accuracy with conf_full: acc_full
acc_full <- sum(diag(conf_full))/sum(conf_full)

# Print acc_full
acc_full
## [1] 0.7259291

Repeat for emails.

pred_small <- spam_classifier_s(emails$avg_capital_seq)
conf_small <- table(emails$spam, pred_small)
acc_small <- sum(diag(conf_small))/sum(conf_small)
acc_small
## [1] 0.7692308

Compare the results.

before <- c(full = 0.6561617, small = 1)
after  <- c(full = acc_full, small = acc_small)
results <- data.frame(before, after)
results
##          before     after
## full  0.6561617 0.7259291
## small 1.0000000 0.7692308

The model no longer fits the small dataset perfectly, but it fits the large dataset better.

Increasing the bias caused the model to generalize better over the large dataset. Still, while the simplified classifier no longer overfits, an accuracy of about 73% on the full dataset is far from satisfying for a spam filter.



Classification

Splitting criterion

Let’s try another method for classifying spam: decision trees.

Load another dataset, emails_all, and check it out.

str(emails_all)
## 'data.frame':    4600 obs. of  58 variables:
##  $ word_freq_make            : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ word_freq_address         : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ word_freq_all             : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ word_freq_over            : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ word_freq_remove          : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ word_freq_internet        : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ word_freq_order           : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ word_freq_mail            : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ word_freq_receive         : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ word_freq_will            : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ word_freq_people          : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ word_freq_report          : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ word_freq_free            : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ word_freq_business        : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_email           : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ word_freq_you             : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ word_freq_credit          : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ word_freq_your            : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ word_freq_font            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_000             : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ word_freq_money           : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ word_freq_hp              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_hpl             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_george          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_650             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_lab             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_857             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ word_freq_415             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_technology      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_1999            : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_original        : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ word_freq_re              : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_edu             : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ char_freq_..1             : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ char_freq_..2             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ char_freq_..4             : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ char_freq_..5             : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capital_run_length_average: num  3.76 5.11 9.82 3.54 3.54 ...
##  $ capital_run_length_longest: num  61 101 485 40 40 15 4 11 445 43 ...
##  $ capital_run_length_total  : num  278 1028 2259 191 191 ...
##  $ spam                      : num  1 1 1 1 1 1 1 1 1 1 ...
dim(emails_all)
## [1] 4600   58
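The train and test sets used next were presumably obtained by randomly splitting emails_all; a minimal sketch on stand-in data (a roughly 70/30 split, matching the 3221/1380 row counts below):

```r
set.seed(1)

# Stand-in data frame (invented; the real emails_all has 58 columns)
df <- data.frame(x = rnorm(100), spam = rbinom(100, 1, 0.4))

# Draw 70% of the row indices at random for the train set
idx <- sample(nrow(df), round(0.7 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

c(nrow(train), nrow(test))
## [1] 70 30
```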

Read in a train set and check it out.

str(train_all)
## 'data.frame':    3221 obs. of  58 variables:
##  $ word_freq_make            : num  0 0 0 0 0 0 0 0 0 0.4 ...
##  $ word_freq_address         : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_all             : num  0 0 0 1.57 0 0 0 0 0.86 0.6 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0.64 0.44 0.66 0.22 0.43 0 0 0 0 0.2 ...
##  $ word_freq_over            : num  0 0 0.22 0.22 0.43 0 0 0 0 0.6 ...
##  $ word_freq_remove          : num  0 0 0 0 0.43 0 0 0 0 0.2 ...
##  $ word_freq_internet        : num  0 0 0.44 0 0.43 0 0 0 0 0.6 ...
##  $ word_freq_order           : num  0 0 0.44 0 0 0 0 0 0 0.2 ...
##  $ word_freq_mail            : num  0 0.88 0.89 0 0 0 0 0 0 0.2 ...
##  $ word_freq_receive         : num  0.64 0 0 0 0 0 0 0 0 0.2 ...
##  $ word_freq_will            : num  0.64 0 0 0 0.43 0 0.99 0 0.86 1.2 ...
##  $ word_freq_people          : num  0 0 0.22 0.22 0 0 0 0 0 0 ...
##  $ word_freq_report          : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_free            : num  0 0.44 1.33 0 0 0 0 0 0 0.4 ...
##  $ word_freq_business        : num  1.29 0 0 0 0 0 0 0 0 1.61 ...
##  $ word_freq_email           : num  0 0 0 0 0 0 0 0 0.86 0.4 ...
##  $ word_freq_you             : num  1.29 1.32 0.89 2.02 0.87 5.81 5.94 0.77 0.86 2.21 ...
##  $ word_freq_credit          : num  5.19 0 0 0 0 0 0 0 0 1.81 ...
##  $ word_freq_your            : num  1.29 0 0.44 0.22 0 1.16 0 2.32 2.58 2.62 ...
##  $ word_freq_font            : num  0 0 0 0 9.17 0 0 0 0 0 ...
##  $ word_freq_000             : num  0 0 0 0 0 0 0 0 0 0.2 ...
##  $ word_freq_money           : num  0.64 0 0.22 0 0 1.16 0 0 0 0.6 ...
##  $ word_freq_hp              : num  0 0 3.34 0 0 0 0 0 1.72 0 ...
##  $ word_freq_hpl             : num  0 0 3.56 0 0 0 0 0.77 0.86 0 ...
##  $ word_freq_george          : num  0 0 0.66 0 0 0 0 0 0 0 ...
##  $ word_freq_650             : num  0 0.44 0.22 0 0 0 0 0 0.86 0 ...
##  $ word_freq_lab             : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0.22 0 0 0 0 0 0.86 0 ...
##  $ word_freq_telnet          : num  0 0 0.22 0 0 0 0 0 0.86 0 ...
##  $ word_freq_857             : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_415             : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0.22 0 0 0 0 0 0.86 0 ...
##  $ word_freq_technology      : num  0 0 0.22 0 0 0 0 0 0.86 0 ...
##  $ word_freq_1999            : num  0 0 1.11 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0.22 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 0 0 0 0.77 0 0 ...
##  $ word_freq_original        : num  0 0 0.22 0 0 0 0 0 0 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_re              : num  0 0 0.22 0.89 0 0 0.99 0 0 0 ...
##  $ word_freq_edu             : num  0 0.44 0 0 0 2.32 0 0 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.148 0 0 0 0.191 0 0 0 ...
##  $ char_freq_..1             : num  0.468 0 0.372 0.091 0 0.163 0 0 0.11 0.096 ...
##  $ char_freq_..2             : num  0 0 0.111 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.093 0 0.372 0.045 0.395 ...
##  $ char_freq_..4             : num  0 0 0.223 0 0 0 0 0 0 0.129 ...
##  $ char_freq_..5             : num  0 0 0 0 1.12 ...
##  $ capital_run_length_average: num  2.75 1.84 3.42 1.28 7.98 ...
##  $ capital_run_length_longest: num  66 10 42 16 72 7 1 3 10 64 ...
##  $ capital_run_length_total  : num  135 186 411 97 495 34 18 37 58 513 ...
##  $ spam                      : num  1 1 0 0 1 0 0 0 0 1 ...
dim(train_all)
## [1] 3221   58

Read in a test set and check it out.

str(test_all)
## 'data.frame':    1380 obs. of  58 variables:
##  $ word_freq_make            : num  0 0 0.07 0.23 0 0.5 0 0 0.99 0 ...
##  $ word_freq_address         : num  0 0 0 0 0 0.46 0 0 0.49 0 ...
##  $ word_freq_all             : num  0.68 0 0.29 0 0 0.34 0 0 0 0 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0 0 0.07 0 1.38 0.15 1.21 0 0 0 ...
##  $ word_freq_over            : num  0 0 0.07 0.23 0 0.03 0 0 0 0 ...
##  $ word_freq_remove          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_internet        : num  1.36 0 0 0 0 0.19 0.6 0 0 0 ...
##  $ word_freq_order           : num  0 0 0.74 0 0 0.57 0 0 0 0 ...
##  $ word_freq_mail            : num  0 0 0 0 0 0.65 0.6 1.38 0.49 0 ...
##  $ word_freq_receive         : num  0.68 0 0 0 0 0.3 1.21 0 0 0 ...
##  $ word_freq_will            : num  0.68 0 0.22 0.92 4.16 0.73 0 0 0.49 0 ...
##  $ word_freq_people          : num  0 0 0 0.46 0 0.65 0 0 0 0 ...
##  $ word_freq_report          : num  0 0 0.07 0 0 1.27 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0 0 0 0 0.03 0 0 0 0 ...
##  $ word_freq_free            : num  0 0 0 0 0 0.23 1.82 0 0 0 ...
##  $ word_freq_business        : num  0 0 0 0 0 0.42 0 0 0 0 ...
##  $ word_freq_email           : num  0 0 0.07 0 0 0 0 0 0 0 ...
##  $ word_freq_you             : num  3.4 0 0.29 2.76 0 3.08 4.26 4.16 2.48 0.67 ...
##  $ word_freq_credit          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_your            : num  1.36 0 0.22 2.76 0 1.34 0 0 1.99 1.35 ...
##  $ word_freq_font            : num  0 0 0 0 0 0 0 0 2.98 0 ...
##  $ word_freq_000             : num  0.68 0 0 0 0 0.5 0 0 0 0 ...
##  $ word_freq_money           : num  0.68 0 0 0.69 0 0.5 0 0 0 0 ...
##  $ word_freq_hp              : num  0 0 0.67 0 0 0 0 0 0 0.67 ...
##  $ word_freq_hpl             : num  0 0 0.74 0 0 0 0 0 0 0 ...
##  $ word_freq_george          : num  0 33.33 0.07 0 0 ...
##  $ word_freq_650             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_lab             : num  0 0 0 0 6.94 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_857             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 1.63 0.46 0 0 0 0 0 0 ...
##  $ word_freq_415             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_technology      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_1999            : num  0 0 0.59 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 6.94 0 0 0 0 0 ...
##  $ word_freq_original        : num  0 0 0.07 0 0 0 0 0 0 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_re              : num  0 0 0 0 0 0.03 0 0 0 0 ...
##  $ word_freq_edu             : num  0 0 0 0 0 0 0 1.38 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.163 0 0 0.011 0 0 0 0 ...
##  $ char_freq_..1             : num  0 0 0.228 0.445 0.238 0.077 0.29 0 0 0.087 ...
##  $ char_freq_..2             : num  0 0 0.032 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.238 0 0 0.202 0 0.335 0.193 0 0.356 0 ...
##  $ char_freq_..4             : num  0.238 0 0.021 0.121 0 ...
##  $ char_freq_..5             : num  0 0 0 0 0 0.125 0 0 0.446 0.087 ...
##  $ capital_run_length_average: num  2.23 1 3.03 1.95 1.58 ...
##  $ capital_run_length_longest: num  19 1 45 7 4 595 26 4 64 24 ...
##  $ capital_run_length_total  : num  96 3 706 142 30 ...
##  $ spam                      : num  1 0 0 0 0 1 1 0 1 0 ...
dim(test_all)
## [1] 1380   58

The goal of trees is to understand how the dataset is structured. Is there any order in the chaos of information?

The tree builds categories: at the first node, say, black vs. white; then, from the dark branch, pale dark vs. shades of gray; and so on.

From the first node, we split the data. Each split explains a bit more. However, we could split the dataset indefinitely, which would overfit the data and make the whole analysis pointless. We must find the optimal number of splits. How?

  • Information gain: the higher the gain when you split, the better. However, the standard splitting criterion is the Gini impurity.
  • Gini impurity: a measure of how often a randomly chosen element from the subset would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
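That definition reduces to 1 - sum(p_k^2) over the class shares p_k; a minimal sketch:

```r
# Gini impurity of a vector of labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class shares
  1 - sum(p^2)
}

gini(c(0, 0, 1, 1))  # maximally mixed two-class node: 0.5
gini(c(0, 0, 0, 0))  # pure node: 0
```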

First, we train the tree. Second, we make predictions with the newly created tree (we test it).

# Set random seed
set.seed(1)

# Load the principal package
library(rpart)

# Load additional packages to improve the visual
library(rattle)
library(rpart.plot)
library(RColorBrewer)

# FIRST LEARNING ALGORITHM
# Train with the 'gini criterion' (for splitting)
tree_g <- rpart(spam ~ ., train_all, method = 'class')

# Make predictions with the test set, compute the confusion matrix and accuracy
pred_g <- predict(tree_g, test_all, type = 'class')
conf_g <- table(test_all$spam, pred_g)
acc_g <- sum(diag(conf_g)) / sum(conf_g)


# SECOND LEARNING ALGORITHM
# Train with 'information gain' as the splitting criterion
tree_i <- rpart(spam ~ ., train_all, method = 'class', parms = list(split = 'information'))

# Make predictions with the test set, compute the confusion matrix and accuracy
pred_i <- predict(tree_i, test_all, type = 'class')
conf_i <- table(test_all$spam, pred_i)
acc_i <- sum(diag(conf_i)) / sum(conf_i)


# Draw fancy plots
fancyRpartPlot(tree_g)

fancyRpartPlot(tree_i)

# Print acc_g and acc_i
acc_g
## [1] 0.8905797
acc_i
## [1] 0.8963768

Using a different splitting criterion can influence the learning algorithm and its results. Here, however, the resulting trees are quite similar: the same variables appear in both, and the accuracies on the test set are comparable, 89% and 90%.
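Tree size itself can be tuned through rpart.control(): the complexity parameter cp prunes splits that do not improve the fit enough, and minsplit sets the minimum node size. A sketch on synthetic stand-in data (all values invented):

```r
library(rpart)
set.seed(1)

# Synthetic stand-in for the spam data: one noisy predictor
df <- data.frame(x = runif(200))
df$spam <- factor(ifelse(df$x + rnorm(200, sd = 0.3) > 0.5, 1, 0))

# A tiny cp lets the tree split aggressively; a larger cp prunes it back
tree_deep  <- rpart(spam ~ x, df, method = 'class',
                    control = rpart.control(cp = 0.001, minsplit = 2))
tree_small <- rpart(spam ~ x, df, method = 'class',
                    control = rpart.control(cp = 0.05))

# One row of $frame per node: the pruned tree is no larger than the deep one
nrow(tree_small$frame) <= nrow(tree_deep$frame)
## [1] TRUE
```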

Comparing the methods

Use ROC curves to compare the two predictions: (1) the tree with the Gini splitting criterion (_g), and (2) the tree with information gain (_i) as the splitting criterion.

library(ROCR)

# set random seed
set.seed(1)

# FIRST LEARNING ALGORITHM
# Train and test tree with the gini criterion
tree_g <- rpart(spam ~ ., train_all, method = 'class')

# Make the rpart probs
probs_g <- predict(tree_g, test_all, type = 'prob')[,2]

# Make the ROCR prediction
pred_g <- prediction(probs_g, test_all$spam)


# SECOND LEARNING ALGORITHM
# Train and test tree with information gain criterion
tree_i <- rpart(spam ~ ., train_all, method = 'class', parms = list(split = 'information'))

# Make the rpart probs
probs_i <- predict(tree_i, test_all, type = 'prob')[,2]

# Make the ROCR prediction
pred_i <- prediction(probs_i, test_all$spam)

# BOTH LEARNING ALGORITHMS
# Make the performance objects for both models
perf_g <- performance(pred_g, 'tpr', 'fpr')
perf_i <- performance(pred_i, 'tpr', 'fpr')

# Draw the ROC lines
plot(perf_g, type = 'l', lwd = 2, col = 'darkgreen', main = 'ROC Curves', ylab = 'True positive rate', xlab = 'False positive rate', ylim = c(0,1), xlim = c(0,1))
par(new=TRUE)
plot(perf_i, type = 'l', lwd = 2, col = 'red', axes = FALSE, xlab = '', ylab = '')

text(0.25, 0.75, 'i in red', col = 'red')
text(0.25, 0.65, 'g in green', col = 'darkgreen')

The larger the area under the curve, the better the model. Here, the red curve dominates: the best model is the tree with information gain (_i) as the splitting criterion.

acc_g
## [1] 0.8905797
acc_i
## [1] 0.8963768
acc_g < acc_i
## [1] TRUE

It’s also the model with the highest accuracy ratio.
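The area under the ROC curve can also be computed explicitly with ROCR’s performance(pred, 'auc') rather than judged by eye; a toy sketch with invented labels and scores:

```r
library(ROCR)

# Invented labels and predicted probabilities
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

pred <- prediction(scores, labels)
performance(pred, 'auc')@y.values[[1]]
## [1] 0.75
```

The same call applied to the pred_g and pred_i objects above would give the two AUCs being compared visually.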