Foreword
Machine learning algorithms are flexible methods that learn from examples; when the examples are labeled, the methods are called supervised. They are well suited to classification. For example, from a patient’s symptoms, an algorithm can predict an illness (a classification between ‘sick’ and ‘not sick’).
In general, we feed patterns or predictors to the algorithm, just as we feed data to a regression. Then we use the trained algorithm, like a fitted regression, to forecast, extrapolate or interpolate.
Sophisticated methods can parse almost anything: structured data such as numbers and tables, and unstructured data such as text and natural language, sounds, images, etc.
Here, we focus on something simple: spam. Spam (email) is text. However, the emails have been preprocessed and the patterns extracted: whether the email contains capital characters, the average length of capital-letter runs, the presence of given words or n-grams, word or n-gram frequencies, etc.
We then train the algorithm by feeding it the preprocessed database. Each row pairs one answer (‘Spam’ or ‘Not’, 1 or 0) with the patterns. The algorithm learns and makes connections. Finally, we feed the algorithm another database; this time, the algorithm must predict the answers, and we measure the success rate against the true answers. When the algorithm is robust enough (high success rate), we can apply it to other databases to filter spam from emails (knowing we still have an error margin).
Note: an n-gram is a sequence of n consecutive words; a pair is a bigram, a triple a trigram, etc.
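To make the note concrete, here is a minimal sketch of extracting n-grams in base R; the helper name and example sentence are illustrative, not part of the original preprocessing.

```r
# Minimal sketch: extract n-grams (consecutive word sequences) in base R.
# The function name and example sentence are illustrative only.
ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
ngrams("win a free prize now", 2)
## [1] "win a"      "a free"     "free prize" "prize now"
```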
Classification: Filtering spam
Filtering spam from relevant emails is a typical machine learning task. Information such as word frequency, character frequency and the number of capital letters can indicate whether an email is spam or not.
We have a small dataset: emails (Source: UCI Machine Learning Repository).
# Check it out
head(emails)
## avg_capital_seq spam
## 1 1.000 0
## 2 2.112 0
## 3 4.123 1
## 4 1.863 0
## 5 2.973 1
## 6 1.687 0
# Show the dimensions of emails
dim(emails)
## [1] 13 2
Several emails have been labeled by humans as spam (1) or not spam (0) and the results are found in the column spam.
We have a simple classifier (a custom function).
spam_classifier <- function(x){
prediction <- rep(NA,length(x))
prediction[x > 4] <- 1
prediction[x >= 3 & x <= 4] <- 0
prediction[x >= 2.2 & x < 3] <- 1
prediction[x >= 1.4 & x < 2.2] <- 0
prediction[x > 1.25 & x < 1.4] <- 1
prediction[x <= 1.25] <- 0
return(prediction)
}
Apply the classifier to the avg_capital_seq column.
pred_small <- spam_classifier(emails$avg_capital_seq)
pred_small
## [1] 0 0 1 0 1 0 1 0 0 1 0 0 1
Compare pred_small to emails$spam.
pred_small == emails$spam
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The classifier identified the spam correctly 13 out of 13 times! Sadly, it was hand-crafted to perfectly fit exactly these 13 examples.
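As an aside, the same step-function rules can be written more compactly with cut(); this is a sketch mirroring the breakpoints above (behavior exactly at the boundary values may differ slightly, since cut() uses right-closed intervals).

```r
# Sketch: the hand-coded rules rewritten with cut().
# Duplicated labels map several intervals to the same class.
spam_classifier_cut <- function(x) {
  breaks <- c(-Inf, 1.25, 1.4, 2.2, 3, 4, Inf)
  labels <- c(0, 1, 0, 1, 0, 1)
  as.numeric(as.character(cut(x, breaks, labels = labels)))
}
spam_classifier_cut(c(1.000, 2.112, 4.123))
## [1] 0 0 1
```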
Overfitting the spam!
Load a larger dataset (Source: UCI Machine Learning Repository). Print it.
# Check it out
head(emails_full)
## avg_capital_seq spam
## 1 1.500 0
## 2 4.941 1
## 3 3.429 1
## 4 3.493 1
## 5 3.380 0
## 6 3.689 1
# Show the dimensions of emails_full
dim(emails_full)
## [1] 4601 2
Use the larger dataset emails_full with the same classifier as above.
# The spam filter
spam_classifier <- function(x) {
prediction <- rep(NA, length(x))
prediction[x > 4] <- 1
prediction[x >= 3 & x <= 4] <- 0
prediction[x >= 2.2 & x < 3] <- 1
prediction[x >= 1.4 & x < 2.2] <- 0
prediction[x > 1.25 & x < 1.4] <- 1
prediction[x <= 1.25] <- 0
return(factor(prediction, levels = c('0', '1')))
}
Build a confusion matrix and assess the accuracy (one of several ratios a confusion matrix yields).
# Apply spam_classifier to emails_full
pred_full <- spam_classifier(emails_full$avg_capital_seq)
# Build confusion matrix for emails_full: conf_full
conf_full <- table(emails_full$spam, pred_full)
# Calculate the accuracy with conf_full: acc_full
acc_full <- sum(diag(conf_full))/sum(conf_full)
# Print acc_full
acc_full
## [1] 0.6561617
This hard-coded classifier gave you an accuracy of around 65% on the large dataset, which is way worse than the 100% on the small dataset (the emails dataset). Hence, the classifier does not generalize well at all!
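Accuracy is only one of the ratios a confusion matrix yields; precision and recall are the other common ones. A sketch on a toy matrix (the counts are made up, not taken from emails_full), with true labels in rows and predictions in columns as in conf_full:

```r
# Sketch: accuracy, precision and recall from a toy confusion matrix.
# Counts are illustrative; rows are true labels, columns are predictions.
conf <- matrix(c(40, 10,
                  5, 45), nrow = 2, byrow = TRUE,
               dimnames = list(truth = c('0', '1'), pred = c('0', '1')))
TP <- conf['1', '1']  # spam correctly flagged
FP <- conf['0', '1']  # ham wrongly flagged as spam
FN <- conf['1', '0']  # spam that slipped through
accuracy  <- sum(diag(conf)) / sum(conf)  # 85/100
precision <- TP / (TP + FP)               # 45/55
recall    <- TP / (TP + FN)               # 45/50
```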
Increasing the bias
The spam_classifier above is bogus. It simply overfits on the emails set and, as a result, doesn’t generalize to larger datasets such as emails_full.
Simplify the rules.
spam_classifier_s <- function(x) {
prediction <- rep(NA,length(x))
prediction[x > 4] <- 1
prediction[x <= 4] <- 0
return(factor(prediction, levels = c('0', '1')))
}
Run the classifier on emails_full. Calculate the accuracy.
# Apply spam_classifier to emails_full
pred_full <- spam_classifier_s(emails_full$avg_capital_seq)
# Build confusion matrix for emails_full: conf_full
conf_full <- table(emails_full$spam, pred_full)
# Calculate the accuracy with conf_full: acc_full
acc_full <- sum(diag(conf_full))/sum(conf_full)
# Print acc_full
acc_full
## [1] 0.7259291
Repeat for emails.
pred_small <- spam_classifier_s(emails$avg_capital_seq)
conf_small <- table(emails$spam, pred_small)
acc_small <- sum(diag(conf_small))/sum(conf_small)
acc_small
## [1] 0.7692308
Compare the results.
before <- c(0.6561617, 1)
after <- c(full = acc_full, small = acc_small)
results <- data.frame(before, after)
results
##          before     after
## full 0.6561617 0.7259291
## small 1.0000000 0.7692308
The model no longer fits the small dataset perfectly, but it fits the large dataset better.
Increasing the bias of the model made it generalize better to the large dataset. Still, even though we no longer overfit as the first classifier did, accuracies around 73–77% are far from satisfying for a spam filter.
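Rather than hand-tuning the single cutoff, one could scan candidate thresholds and keep the one with the best training accuracy. A sketch on made-up values (not the UCI data):

```r
# Sketch: choose the best single threshold by scanning candidates.
# x and y are toy values, not the emails data.
x <- c(1.0, 2.1, 4.1, 1.9, 3.0, 1.7, 5.2, 2.5)
y <- c(0,   0,   1,   0,   1,   0,   1,   0)
thresholds <- sort(unique(x))
accs <- sapply(thresholds, function(t) mean((x > t) == y))
best <- thresholds[which.max(accs)]
best
## [1] 2.5
```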
Splitting criterion
Let’s try another method for classifying spam: decision trees.
Load another dataset: emails_all. Check it out.
str(emails_all)
## 'data.frame': 4600 obs. of 58 variables:
## $ word_freq_make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ word_freq_address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ word_freq_all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ word_freq_3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ word_freq_over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ word_freq_remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ word_freq_internet : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ word_freq_order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ word_freq_mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
## $ word_freq_receive : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
## $ word_freq_will : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
## $ word_freq_people : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
## $ word_freq_report : num 0 0.21 0 0 0 0 0 0 0 0 ...
## $ word_freq_addresses : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
## $ word_freq_free : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
## $ word_freq_business : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_email : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
## $ word_freq_you : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
## $ word_freq_credit : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
## $ word_freq_your : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
## $ word_freq_font : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ word_freq_money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ word_freq_hp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_hpl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_george : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_650 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_lab : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_labs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_telnet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_857 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_data : num 0 0 0 0 0 0 0 0 0.15 0 ...
## $ word_freq_415 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_85 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_technology : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_1999 : num 0 0.07 0 0 0 0 0 0 0 0 ...
## $ word_freq_parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_pm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_direct : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_meeting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_original : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
## $ word_freq_project : num 0 0 0 0 0 0 0 0 0 0.06 ...
## $ word_freq_re : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_edu : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_. : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
## $ char_freq_..1 : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
## $ char_freq_..2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_..3 : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ char_freq_..4 : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ char_freq_..5 : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
## $ capital_run_length_average: num 3.76 5.11 9.82 3.54 3.54 ...
## $ capital_run_length_longest: num 61 101 485 40 40 15 4 11 445 43 ...
## $ capital_run_length_total : num 278 1028 2259 191 191 ...
## $ spam : num 1 1 1 1 1 1 1 1 1 1 ...
dim(emails_all)
## [1] 4600 58
Read in a train set and check it out.
str(train_all)
## 'data.frame': 3221 obs. of 58 variables:
## $ word_freq_make : num 0 0 0 0 0 0 0 0 0 0.4 ...
## $ word_freq_address : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_all : num 0 0 0 1.57 0 0 0 0 0.86 0.6 ...
## $ word_freq_3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_our : num 0.64 0.44 0.66 0.22 0.43 0 0 0 0 0.2 ...
## $ word_freq_over : num 0 0 0.22 0.22 0.43 0 0 0 0 0.6 ...
## $ word_freq_remove : num 0 0 0 0 0.43 0 0 0 0 0.2 ...
## $ word_freq_internet : num 0 0 0.44 0 0.43 0 0 0 0 0.6 ...
## $ word_freq_order : num 0 0 0.44 0 0 0 0 0 0 0.2 ...
## $ word_freq_mail : num 0 0.88 0.89 0 0 0 0 0 0 0.2 ...
## $ word_freq_receive : num 0.64 0 0 0 0 0 0 0 0 0.2 ...
## $ word_freq_will : num 0.64 0 0 0 0.43 0 0.99 0 0.86 1.2 ...
## $ word_freq_people : num 0 0 0.22 0.22 0 0 0 0 0 0 ...
## $ word_freq_report : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_addresses : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_free : num 0 0.44 1.33 0 0 0 0 0 0 0.4 ...
## $ word_freq_business : num 1.29 0 0 0 0 0 0 0 0 1.61 ...
## $ word_freq_email : num 0 0 0 0 0 0 0 0 0.86 0.4 ...
## $ word_freq_you : num 1.29 1.32 0.89 2.02 0.87 5.81 5.94 0.77 0.86 2.21 ...
## $ word_freq_credit : num 5.19 0 0 0 0 0 0 0 0 1.81 ...
## $ word_freq_your : num 1.29 0 0.44 0.22 0 1.16 0 2.32 2.58 2.62 ...
## $ word_freq_font : num 0 0 0 0 9.17 0 0 0 0 0 ...
## $ word_freq_000 : num 0 0 0 0 0 0 0 0 0 0.2 ...
## $ word_freq_money : num 0.64 0 0.22 0 0 1.16 0 0 0 0.6 ...
## $ word_freq_hp : num 0 0 3.34 0 0 0 0 0 1.72 0 ...
## $ word_freq_hpl : num 0 0 3.56 0 0 0 0 0.77 0.86 0 ...
## $ word_freq_george : num 0 0 0.66 0 0 0 0 0 0 0 ...
## $ word_freq_650 : num 0 0.44 0.22 0 0 0 0 0 0.86 0 ...
## $ word_freq_lab : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_labs : num 0 0 0.22 0 0 0 0 0 0.86 0 ...
## $ word_freq_telnet : num 0 0 0.22 0 0 0 0 0 0.86 0 ...
## $ word_freq_857 : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_data : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_415 : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_85 : num 0 0 0.22 0 0 0 0 0 0.86 0 ...
## $ word_freq_technology : num 0 0 0.22 0 0 0 0 0 0.86 0 ...
## $ word_freq_1999 : num 0 0 1.11 0 0 0 0 0 0 0 ...
## $ word_freq_parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_pm : num 0 0 0 0.22 0 0 0 0 0 0 ...
## $ word_freq_direct : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_meeting : num 0 0 0 0 0 0 0 0.77 0 0 ...
## $ word_freq_original : num 0 0 0.22 0 0 0 0 0 0 0 ...
## $ word_freq_project : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_re : num 0 0 0.22 0.89 0 0 0.99 0 0 0 ...
## $ word_freq_edu : num 0 0.44 0 0 0 2.32 0 0 0 0 ...
## $ word_freq_table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_. : num 0 0 0.148 0 0 0 0.191 0 0 0 ...
## $ char_freq_..1 : num 0.468 0 0.372 0.091 0 0.163 0 0 0.11 0.096 ...
## $ char_freq_..2 : num 0 0 0.111 0 0 0 0 0 0 0 ...
## $ char_freq_..3 : num 0.093 0 0.372 0.045 0.395 ...
## $ char_freq_..4 : num 0 0 0.223 0 0 0 0 0 0 0.129 ...
## $ char_freq_..5 : num 0 0 0 0 1.12 ...
## $ capital_run_length_average: num 2.75 1.84 3.42 1.28 7.98 ...
## $ capital_run_length_longest: num 66 10 42 16 72 7 1 3 10 64 ...
## $ capital_run_length_total : num 135 186 411 97 495 34 18 37 58 513 ...
## $ spam : num 1 1 0 0 1 0 0 0 0 1 ...
dim(train_all)
## [1] 3221 58
Read in a test set and check it out.
str(test_all)
## 'data.frame': 1380 obs. of 58 variables:
## $ word_freq_make : num 0 0 0.07 0.23 0 0.5 0 0 0.99 0 ...
## $ word_freq_address : num 0 0 0 0 0 0.46 0 0 0.49 0 ...
## $ word_freq_all : num 0.68 0 0.29 0 0 0.34 0 0 0 0 ...
## $ word_freq_3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_our : num 0 0 0.07 0 1.38 0.15 1.21 0 0 0 ...
## $ word_freq_over : num 0 0 0.07 0.23 0 0.03 0 0 0 0 ...
## $ word_freq_remove : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_internet : num 1.36 0 0 0 0 0.19 0.6 0 0 0 ...
## $ word_freq_order : num 0 0 0.74 0 0 0.57 0 0 0 0 ...
## $ word_freq_mail : num 0 0 0 0 0 0.65 0.6 1.38 0.49 0 ...
## $ word_freq_receive : num 0.68 0 0 0 0 0.3 1.21 0 0 0 ...
## $ word_freq_will : num 0.68 0 0.22 0.92 4.16 0.73 0 0 0.49 0 ...
## $ word_freq_people : num 0 0 0 0.46 0 0.65 0 0 0 0 ...
## $ word_freq_report : num 0 0 0.07 0 0 1.27 0 0 0 0 ...
## $ word_freq_addresses : num 0 0 0 0 0 0.03 0 0 0 0 ...
## $ word_freq_free : num 0 0 0 0 0 0.23 1.82 0 0 0 ...
## $ word_freq_business : num 0 0 0 0 0 0.42 0 0 0 0 ...
## $ word_freq_email : num 0 0 0.07 0 0 0 0 0 0 0 ...
## $ word_freq_you : num 3.4 0 0.29 2.76 0 3.08 4.26 4.16 2.48 0.67 ...
## $ word_freq_credit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_your : num 1.36 0 0.22 2.76 0 1.34 0 0 1.99 1.35 ...
## $ word_freq_font : num 0 0 0 0 0 0 0 0 2.98 0 ...
## $ word_freq_000 : num 0.68 0 0 0 0 0.5 0 0 0 0 ...
## $ word_freq_money : num 0.68 0 0 0.69 0 0.5 0 0 0 0 ...
## $ word_freq_hp : num 0 0 0.67 0 0 0 0 0 0 0.67 ...
## $ word_freq_hpl : num 0 0 0.74 0 0 0 0 0 0 0 ...
## $ word_freq_george : num 0 33.33 0.07 0 0 ...
## $ word_freq_650 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_lab : num 0 0 0 0 6.94 0 0 0 0 0 ...
## $ word_freq_labs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_telnet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_857 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_data : num 0 0 1.63 0.46 0 0 0 0 0 0 ...
## $ word_freq_415 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_85 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_technology : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_1999 : num 0 0 0.59 0 0 0 0 0 0 0 ...
## $ word_freq_parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_pm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_direct : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_meeting : num 0 0 0 0 6.94 0 0 0 0 0 ...
## $ word_freq_original : num 0 0 0.07 0 0 0 0 0 0 0 ...
## $ word_freq_project : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_re : num 0 0 0 0 0 0.03 0 0 0 0 ...
## $ word_freq_edu : num 0 0 0 0 0 0 0 1.38 0 0 ...
## $ word_freq_table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_. : num 0 0 0.163 0 0 0.011 0 0 0 0 ...
## $ char_freq_..1 : num 0 0 0.228 0.445 0.238 0.077 0.29 0 0 0.087 ...
## $ char_freq_..2 : num 0 0 0.032 0 0 0 0 0 0 0 ...
## $ char_freq_..3 : num 0.238 0 0 0.202 0 0.335 0.193 0 0.356 0 ...
## $ char_freq_..4 : num 0.238 0 0.021 0.121 0 ...
## $ char_freq_..5 : num 0 0 0 0 0 0.125 0 0 0.446 0.087 ...
## $ capital_run_length_average: num 2.23 1 3.03 1.95 1.58 ...
## $ capital_run_length_longest: num 19 1 45 7 4 595 26 4 64 24 ...
## $ capital_run_length_total : num 96 3 706 142 30 ...
## $ spam : num 1 0 0 0 0 1 1 0 1 0 ...
dim(test_all)
## [1] 1380 58
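The train and test sets here were provided pre-split, but such a split can be created with sample(). A sketch on a toy data frame (df is illustrative; on the real data one would split emails_all the same way, roughly 70/30 as above):

```r
# Sketch: random 70/30 train/test split of a toy data frame.
# df stands in for emails_all.
set.seed(1)
df <- data.frame(x = rnorm(100), spam = rbinom(100, 1, 0.4))
idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
train <- df[idx, ]   # 70 rows for training
test  <- df[-idx, ]  # remaining 30 rows for testing
```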
The goal of trees is to understand how the dataset is structured. Is there any order in the chaos of information?
The tree builds categories: from the first node, say dark vs. light; then, from the dark node, pale dark vs. shades of gray, and so on.
From the first node, we split the data. Each split explains more of the structure. However, we could split the dataset indefinitely, which would overfit the data and make the whole analysis pointless. We must find the optimal number of splits. How?
First, we train the tree. Second, we make predictions with the newly created tree (we test it).
# Set random seed
set.seed(1)
# Load the principal package
library(rpart)
# Load additional packages to improve the visual
library(rattle)
library(rpart.plot)
library(RColorBrewer)
# FIRST LEARNING ALGORITHM
# Train with the 'gini criterion' (for splitting)
tree_g <- rpart(spam ~ ., train_all, method = 'class')
# Make predictions with the test set, compute the confusion matrix and accuracy
pred_g <- predict(tree_g, test_all, type = 'class')
conf_g <- table(test_all$spam, pred_g)
acc_g <- sum(diag(conf_g)) / sum(conf_g)
# SECOND LEARNING ALGORITHM
# Train with the 'information gain' as a splitting criterion
tree_i <- rpart(spam ~ ., train_all, method = 'class', parms = list(split = 'information'))
# Make predictions with the test set, compute the confusion matrix and accuracy
pred_i <- predict(tree_i, test_all, type = 'class')
conf_i <- table(test_all$spam, pred_i)
acc_i <- sum(diag(conf_i)) / sum(conf_i)
# Draw fancy plots
fancyRpartPlot(tree_g)
fancyRpartPlot(tree_i)
# Print acc_g and acc_i
acc_g
## [1] 0.8905797
acc_i
## [1] 0.8963768
Using a different splitting criterion can influence the learning algorithm and its results. Here, however, the resulting trees are quite similar: the same variables appear in both trees, and the accuracies on the test set are comparable, around 89% and 90%.
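rpart also guards against too many splits through its complexity parameter cp: the cptable reports cross-validated error for each subtree size, and prune() cuts the tree back. A sketch on the built-in iris data (a stand-in for the spam data):

```r
library(rpart)
# Sketch: grow a classification tree, then prune it back using the
# cp value with the lowest cross-validated error (xerror).
fit <- rpart(Species ~ ., data = iris, method = 'class')
best_cp <- fit$cptable[which.min(fit$cptable[, 'xerror']), 'CP']
pruned <- prune(fit, cp = best_cp)
```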
Comparing the methods
Use ROC curves to compare the two predictions: (1) a tree with the Gini (_g) splitting criterion, and (2) a tree with information gain (_i) as the splitting criterion.
library(ROCR)
# set random seed
set.seed(1)
# FIRST LEARNING ALGORITHM
# Train and test tree with the gini criterion
tree_g <- rpart(spam ~ ., train_all, method = 'class')
# Make the rpart probs
probs_g <- predict(tree_g, test_all, type = 'prob')[,2]
# Make the ROCR prediction
pred_g <- prediction(probs_g, test_all$spam)
# SECOND LEARNING ALGORITHM
# Train and test tree with information gain criterion
tree_i <- rpart(spam ~ ., train_all, method = 'class', parms = list(split = 'information'))
# Make the rpart probs
probs_i <- predict(tree_i, test_all, type = 'prob')[,2]
# Make the ROCR prediction
pred_i <- prediction(probs_i, test_all$spam)
# BOTH LEARNING ALGORITHMS
# Make the performance objects for both models
perf_g <- performance(pred_g, 'tpr', 'fpr')
perf_i <- performance(pred_i, 'tpr', 'fpr')
# Draw the ROC lines
plot(perf_g, type = 'l', lwd = 2, col = 'darkgreen', main = 'ROC Curves', ylab = 'True positive rate', xlab = 'False positive rate', ylim = c(0,1), xlim = c(0,1))
par(new=TRUE)
plot(perf_i, type = 'l', lwd = 2, col = 'red', axes = FALSE, xlab = '', ylab = '')
text(0.25, 0.75, 'i in red', col = 'red')
text(0.25, 0.65, 'g in green', col = 'darkgreen')
The larger the area under the curve, the better the model. Here the red curve dominates, so the best model is the tree with information gain (_i) as the splitting criterion.
acc_g
## [1] 0.8905797
acc_i
## [1] 0.8963768
acc_g < acc_i
## [1] TRUE
It is also the model with the higher accuracy.
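The area under the curve can also be computed directly with ROCR, using the 'auc' measure. A sketch on toy scores and labels (not the spam models):

```r
library(ROCR)
# Sketch: AUC on toy predicted probabilities and true labels.
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,   0)
pred <- prediction(scores, labels)
auc <- performance(pred, 'auc')@y.values[[1]]
auc
## [1] 0.8888889
```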