Foreword
As with tree-based methods, SVMs (support vector machines) are good supervised classifiers. We can use them to parse through emails and split a set into ‘Spam’ or ‘Not’.
Find out more about SVM:
kernlab package
Load the package.
library(kernlab)
The package contains a set of kernel-based machine learning methods for classification, regression, clustering, novelty detection, quantile regression, and dimensionality reduction.
As with regression functions, we feed data to the model, we train it, and the function makes connections among the data. The function grasps patterns. Each observation is made of a dependent variable (‘Spam’ or ‘Not’) and one or more independent variables. With one independent variable, we can visualize the relationship with a scatter diagram (2D).
The algorithm decides whether a dot is ‘Spam’ or ‘Not’. Spam should be bunched up in one area of the scatter diagram. Through dot products, the kernel computes similarity values between dots. From these values, the algorithm associates some dots into a group and dissociates that group from the other groups of dots. An SVM maps out a geometric boundary, a hyperplane, between ‘Spam’ and ‘Not’.
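As a minimal base-R illustration (not part of the kernlab API; the feature values and the sigma value are arbitrary), the linear kernel is just the dot product of two feature vectors, and the Gaussian (RBF) kernel, which kernlab's rbfdot also uses in the form exp(-sigma * ||x - y||^2), turns squared distance into a similarity between 0 and 1:

```r
# Two toy 'emails' described by two word-frequency features
x <- c(0.32, 1.93)
y <- c(0.14, 3.47)

# Linear kernel: the dot product of the two feature vectors
linear_kernel <- function(a, b) sum(a * b)

# Gaussian (RBF) kernel: exp(-sigma * squared distance);
# identical points score 1, distant points approach 0
rbf_kernel <- function(a, b, sigma = 0.03) exp(-sigma * sum((a - b)^2))

linear_kernel(x, y)   # larger when the vectors point the same way
rbf_kernel(x, x)      # 1: a point is maximally similar to itself
rbf_kernel(x, y)      # between 0 and 1
```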
Of course, there can be more than one boundary, boundaries are not necessarily linear (depending on the kernel), there can be several groups, and problems can go beyond 3D when we deal with more than two independent variables (we cannot really visualize the results).
Learn more about the package with more advanced applications.
SVMs are good at discriminating spam from nonspam.
The data
The dataset is part of the package.
data(spam)
Print its dimensions and structure (Source: UCI Machine Learning Repository).
dim(spam)## [1] 4601 58
str(spam)## 'data.frame': 4601 obs. of 58 variables:
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ internet : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
## $ receive : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
## $ will : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
## $ people : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
## $ report : num 0 0.21 0 0 0 0 0 0 0 0 ...
## $ addresses : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
## $ free : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
## $ business : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
## $ email : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
## $ you : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
## $ credit : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
## $ your : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
## $ font : num 0 0 0 0 0 0 0 0 0 0 ...
## $ num000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ hp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hpl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ george : num 0 0 0 0 0 0 0 0 0 0 ...
## $ num650 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ lab : num 0 0 0 0 0 0 0 0 0 0 ...
## $ labs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ telnet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ num857 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data : num 0 0 0 0 0 0 0 0 0.15 0 ...
## $ num415 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ num85 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ technology : num 0 0 0 0 0 0 0 0 0 0 ...
## $ num1999 : num 0 0.07 0 0 0 0 0 0 0 0 ...
## $ parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ direct : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ meeting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ original : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
## $ project : num 0 0 0 0 0 0 0 0 0 0.06 ...
## $ re : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ edu : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ charSemicolon : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
## $ charRoundbracket : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
## $ charSquarebracket: num 0 0 0 0 0 0 0 0 0 0 ...
## $ charExclamation : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ charDollar : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ charHash : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
## $ capitalAve : num 3.76 5.11 9.82 3.54 3.54 ...
## $ capitalLong : num 61 101 485 40 40 15 4 11 445 43 ...
## $ capitalTotal : num 278 1028 2259 191 191 ...
## $ type : Factor w/ 2 levels "nonspam","spam": 2 2 2 2 2 2 2 2 2 2 ...
Compare spam to nonspam (‘Not’).
table(spam$type)##
## nonspam spam
## 2788 1813
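A quick base-R check of the class balance, using the counts printed above:

```r
# Counts taken from table(spam$type) above
counts <- c(nonspam = 2788, spam = 1813)

# Class shares; equivalent to prop.table(counts)
shares <- counts / sum(counts)
round(shares, 3)   # nonspam about 0.606, spam about 0.394
```

The classes are imbalanced but not severely, so no rebalancing is needed before training.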
Preprocess the data
Divide the dataset into a train set (2/3) and a test set (1/3). First, randomize the row indices.
randIndex <- sample(1:dim(spam)[1])
summary(randIndex)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1151 2301 2301 3451 4601
length(randIndex)## [1] 4601
head(randIndex)## [1] 132 1424 3266 3300 4443 3247
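Note that sample() draws a new random permutation on every call, so the indices above will differ from run to run. A minimal sketch of making the shuffle repeatable with set.seed():

```r
set.seed(123)            # fix the random generator state
idx1 <- sample(1:10)
set.seed(123)            # reset to the same state
idx2 <- sample(1:10)
identical(idx1, idx2)    # TRUE: same permutation both times
```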
Compute the row cut point (2/3 of the total set).
cut_Point2_3 <- floor(2 * dim(spam)[1]/3)
cut_Point2_3## [1] 3067
Generate the train set (first rows to the cut point).
trainData <- spam[randIndex[1:cut_Point2_3],]
Generate the test set (cut point + 1 to the last row).
testData <- spam[randIndex[(cut_Point2_3+1):dim(spam)[1]],]
Running the algorithm
By setting the parameter kpar (based on a heuristic process; it can be done automatically), we change the ‘rules of association and dissociation’.
Another parameter is the cost of constraints, C (high means picky). As when fitting a regression to a dataset, we want the kernel function to fit the dataset. A picky parameter might yield a precise fit on the train set, but generalize poorly to other sets. We want to avoid overfitting.
Another parameter, cross, sets the number of cross-validation folds. Cross-validation is important to avoid overfitting: it verifies that the trained algorithm can classify novel data accurately. For example, a 3-fold cross-validation splits the training data into three folds; the model is trained on two folds and evaluated on the held-out third, rotating three times, before the final model and its probability estimates are produced.
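To make the mechanics concrete, here is a hand-rolled 3-fold cross-validation in base R with a deliberately naive classifier (always predict the majority class of the training folds). This is an illustration of the procedure only, not what ksvm does internally:

```r
set.seed(1)
# Toy labels standing in for the 'type' column
y <- factor(sample(c("spam", "nonspam"), 90, replace = TRUE))
fold <- rep(1:3, length.out = 90)   # assign each observation to a fold

errors <- sapply(1:3, function(k) {
  train <- y[fold != k]             # two folds for training
  test  <- y[fold == k]             # one held-out fold
  majority <- names(which.max(table(train)))   # 'train' the naive model
  mean(test != majority)            # error rate on the held-out fold
})

errors        # one error rate per fold
mean(errors)  # the cross-validation error: the average over the folds
```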
For this case, we use a radial basis function. There are different types of kernels just like there are different regression models (or link functions).
Run the rbfdot kernel (like we run a multivariate regression). type, the dependent variable, equals 1 or 0 (‘Spam’ or ‘Not’). Print the results.
svmOutput <- ksvm(type ~ ., data = trainData, kernel = "rbfdot", kpar = "automatic", C = 5, cross = 3, prob.model = TRUE)
svmOutput## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 5
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0294547653214877
##
## Number of Support Vectors : 941
##
## Objective Function Value : -1643.426
## Training error : 0.027388
## Cross validation error : 0.073688
## Probability model included.
The training error is low (about 3%). The cross-validation error is higher (rule of thumb: close to 7% is good). The cross-validation error is the average misclassification rate over the held-out folds: in each round, the model is trained on all folds but one and scored on the fold it has not seen.
Exploring results
Compute the alpha assessor.
hist(alpha(svmOutput)[[1]])
The alpha values are the coefficients of the support vectors. Because this is a two-class problem, alpha() returns a single set of coefficients ([[1]]).
C is equal to 5, so the alpha values range from 0 to 5. The support vectors at the maximum value (alpha = C) represent the most difficult cases to classify.
On a scatter diagram, these cases would be close to, on, or beyond the boundary splitting ‘Spam’ from ‘Not’. Cases with alpha close to 0 are far from the boundary.
Rerun the kernel with C = 50 (pickier).
svmOutput2 <- ksvm(type ~ ., data = trainData, kernel = "rbfdot", kpar = "automatic", C = 50, cross = 3, prob.model = TRUE)
svmOutput2## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 50
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0289453457015552
##
## Number of Support Vectors : 807
##
## Objective Function Value : -6901.203
## Training error : 0.01076
## Cross validation error : 0.08249
## Probability model included.
summary(svmOutput2)## Length Class Mode
## 1 ksvm S4
The training error decreases; the cross-validation error increases.
Compute the alpha assessor.
hist(alpha(svmOutput2)[[1]])
We reduced the number of hard cases.
We have over 800 support vectors in this output. Check out the vectors close to 0 (a subset).
Note: every time we run ksvm, we get different results, because the train/test split is random (use set.seed() before sample() to make a run reproducible). Overall, the results are similar, but little details change. For the next lines, we present snapshots; rerunning the algorithm will not generate the exact same outputs.
alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] < 0.05]## [1] 114 247 343 381 690 772 875 1115 1806 1846 1907 1930
## [13] 2052 2161 2180 2390 2616 2718 2814 3047
Twenty observations are returned (in this snapshot). Take a look at one row.
trainData[114,]## make address all num3d our over remove internet order mail receive
## 4530 0 0 0 0 0 0 0 0 0 0 0
## will people report addresses free business email you credit your
## 4530 0 0 0 0 0 0 0 2.33 0 1.86
## font num000 money hp hpl george num650 lab labs telnet num857 data
## 4530 0 0 0 0 0 0 0 0 0 0 0 0
## num415 num85 technology num1999 parts pm direct cs meeting original
## 4530 0 0 0 0 0 0.46 0 0 0 0
## project re edu table conference charSemicolon charRoundbracket
## 4530 0.46 0 0.46 0 0 0 0.082
## charSquarebracket charExclamation charDollar charHash capitalAve
## 4530 0 0 0 0 1.117
## capitalLong capitalTotal type
## 4530 3 38 nonspam
The row is classified as nonspam by humans (the ground truth). We can see the markers used to identify spam: most are 0, and the others are low. In other words, this message carries few traces of ‘suspicious words’.
Contrast the above results with another row: spam.
trainData[3,]## make address all num3d our over remove internet order mail receive
## 69 0.3 0 0.61 0 0 0 0 0 0 0.92 0.3
## will people report addresses free business email you credit your font
## 69 0.92 0.3 0.3 0 2.15 0.61 0 5.53 0 1.23 0
## num000 money hp hpl george num650 lab labs telnet num857 data num415
## 69 0 0.3 0 0 0 0 0 0 0 0 0 0.3
## num85 technology num1999 parts pm direct cs meeting original project
## 69 0 0 0 0 0 0 0 0 0 0
## re edu table conference charSemicolon charRoundbracket
## 69 0.3 0 0 0 0 0.1
## charSquarebracket charExclamation charDollar charHash capitalAve
## 69 0 1.053 0.351 0.25 3.884
## capitalLong capitalTotal type
## 69 66 303 spam
This time, there are more lights on!
Read the results.
cut <- alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] < 0.05]
trainData[cut, "type"]## [1] nonspam nonspam nonspam nonspam nonspam nonspam spam
## [8] nonspam spam spam nonspam spam spam nonspam
## [15] nonspam nonspam nonspam nonspam nonspam nonspam
## Levels: nonspam spam
Most are nonspam (15 of the 20 in this snapshot).
Check the other end of the vectors.
alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] == 50]## [1] 3 28 99 135 164 176 219 224 266 273 308 339 348 358
## [15] 367 370 426 571 619 641 717 727 747 751 826 831 875 884
## [29] 885 958 1008 1028 1052 1061 1077 1117 1136 1145 1147 1173 1178 1208
## [43] 1228 1241 1293 1309 1323 1351 1355 1359 1375 1401 1418 1445 1471 1558
## [57] 1614 1627 1632 1678 1681 1720 1734 1750 1762 1777 1800 1864 1937 1942
## [71] 1952 1990 2017 2022 2023 2048 2054 2059 2238 2257 2341 2368 2376 2401
## [85] 2437 2473 2519 2537 2563 2613 2614 2675 2717 2741 2782 2795 2798 2822
## [99] 2848 2866 2888 2942 2973
Pick a row.
trainData[2973,]## make address all num3d our over remove internet order mail receive
## 860 0.09 0 0.09 0 0.39 0.09 0.09 0 0.19 0.29 0.39
## will people report addresses free business email you credit your font
## 860 0.48 0 0.58 0 0.87 0.19 0 1.66 4.1 1.66 0
## num000 money hp hpl george num650 lab labs telnet num857 data num415
## 860 0.39 0.19 0 0 0 0 0 0 0 0 0 0
## num85 technology num1999 parts pm direct cs meeting original project
## 860 0 0 0 0 0 0 0 0 0 0
## re edu table conference charSemicolon charRoundbracket
## 860 0 0 0 0 0 0.14
## charSquarebracket charExclamation charDollar charHash capitalAve
## 860 0 0.326 0.155 0 6.813
## capitalLong capitalTotal type
## 860 494 1458 spam
These patterns seem to be confusing the algorithm.
Read the results and compute the share of spam and nonspam.
cut <- alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] == 50]
trainData[cut,"type"]## [1] nonspam spam spam spam nonspam nonspam nonspam spam
## [9] spam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [17] spam spam nonspam spam nonspam spam spam spam
## [25] nonspam spam nonspam nonspam spam spam nonspam nonspam
## [33] nonspam spam spam nonspam nonspam spam spam nonspam
## [41] spam nonspam nonspam nonspam nonspam spam spam spam
## [49] nonspam spam spam spam nonspam nonspam nonspam nonspam
## [57] nonspam nonspam nonspam spam nonspam nonspam spam spam
## [65] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
## [73] nonspam spam spam spam spam spam spam nonspam
## [81] spam nonspam nonspam spam spam spam spam spam
## [89] nonspam nonspam spam spam nonspam spam nonspam spam
## [97] spam spam spam spam nonspam nonspam spam spam
## [105] spam spam spam nonspam spam spam
## Levels: nonspam spam
# total
length(cut)## [1] 110
# spam only
sum(trainData[cut,"type"] == 'spam')## [1] 57
# % spam only
sum(trainData[cut,"type"] == 'spam') / length(cut)## [1] 0.5181818
No matter how many times we run the algorithm, the split is always close to 50-50! Sometimes 60-40, sometimes 45-55… Over many runs, the average hovers around 50-50. No wonder these are the hard cases.
Predictions and performance measures
Use the support vectors we generated through this training process with another dataset to predict outcomes.
Run the trained algorithm on the test set.
svmPred2 <- predict(svmOutput2, testData, type = "votes")
str(svmPred2)## num [1:2, 1:1534] 1 0 0 1 1 0 1 0 1 0 ...
The prediction process works like a vote. The algorithm ‘votes’ on whether or not each observation is ‘Spam’ or ‘Not’.
Note: in the first row, a 1 is a vote for nonspam and a 0 a vote for spam. Because this is a two-class problem, the second row is just the opposite. We can use either one: they are mirror images of each other.
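A small base-R illustration of this mirror structure (toy numbers, not actual predict() output): with two classes, row 2 is always 1 minus row 1, so either row alone recovers the predicted labels.

```r
# Toy votes matrix: row 1 = nonspam votes, row 2 = spam votes
votes <- rbind(nonspam = c(1, 0, 1, 1, 0),
               spam    = c(0, 1, 0, 0, 1))

# TRUE: the two rows are mirror images
all(votes["spam", ] == 1 - votes["nonspam", ])

# Recover the predicted labels from the first row alone
pred <- ifelse(votes["nonspam", ] == 1, "nonspam", "spam")
pred
```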
Generate a data frame:
compTable2 <- data.frame(testData[,58], svmPred2[1,])
head(compTable2)## testData...58. svmPred2.1...
## 1 nonspam 1
## 2 spam 0
## 3 nonspam 1
## 4 nonspam 1
## 5 nonspam 1
## 6 nonspam 1
The first column holds the truth (spam and nonspam). Compute the confusion matrix and the proportional confusion matrix.
conf_full2 <- table(compTable2)
conf_full2## svmPred2.1...
## testData...58. 0 1
## nonspam 53 873
## spam 540 68
conf_full2 / sum(conf_full2)## svmPred2.1...
## testData...58. 0 1
## nonspam 0.03455020 0.56910039
## spam 0.35202086 0.04432855
On both matrices, the wrongful predictions are Type I errors (false positives) and Type II errors (false negatives).
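Using the counts printed above (and recalling that a column value of 1 is a nonspam vote), the two error types and the overall accuracy can be read off directly. A base-R sketch with those numbers hard-coded:

```r
# Rows: truth; columns: vote (0 = spam, 1 = nonspam), as printed above
conf <- matrix(c(53, 873,
                 540, 68),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("nonspam", "spam"),
                               vote  = c("0", "1")))

false_pos <- conf["nonspam", "0"]   # nonspam flagged as spam (Type I): 53
false_neg <- conf["spam", "1"]      # spam let through (Type II): 68
correct   <- conf["nonspam", "1"] + conf["spam", "0"]

correct / sum(conf)                 # accuracy, about 0.921
```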
Compute the accuracy ratio.
inv_acc_full2 <- sum(diag(conf_full2))/sum(conf_full2)
1 - inv_acc_full2## [1] 0.9211213
The accuracy ratio is about 92%, approaching the 95% (19/20) benchmark we aim for.
Remember that the cross-validation error is about 8.2% with C = 50. Parameters have an impact on the confusion matrix and the accuracy ratio.
For the record, compute results for the first model where C = 5.
svmPred <- predict(svmOutput, testData, type = "votes")
compTable <- data.frame(testData[,58], svmPred[1,])
conf_full <- table(compTable)
conf_full## svmPred.1...
## testData...58. 0 1
## nonspam 41 885
## spam 547 61
conf_full / sum(conf_full)## svmPred.1...
## testData...58. 0 1
## nonspam 0.02672751 0.57692308
## spam 0.35658409 0.03976532
inv_acc_full <- sum(diag(conf_full))/sum(conf_full)
1 - inv_acc_full## [1] 0.9335072
Interestingly, the simpler C = 5 model is slightly more accurate on the test set (93.4% versus 92.1%), consistent with its lower cross-validation error.
Visualization
If we could illustrate the results, we would get two zones, ‘Spam’ and ‘Not’, divided by a boundary: a hyperplane. The axes would be two independent variables, ‘X1’ and ‘X2’, and each point would carry the label of the dependent variable. Most ‘O’, whether it means spam or nonspam, fall into the blue zone, but we can find some in the pink zone (errors); vice-versa for ‘X’. A good model is one where 95% of the ‘O’ and ‘X’ fall in the right zone.
The diagram above is generated with the e1071 package. Although it is not related to this case, it is a good illustration. It is always easier to illustrate models with fewer than three independent variables. We have such a case below.
e1071 package
We replicate the above analysis with a different package.
The package is supported by a website: A Library for Support Vector Machines.
Load the package.
library(e1071)
Running the algorithm
Remember the code from the kernlab package.
svmOutput <- ksvm(type ~ ., data = trainData, kernel = "rbfdot", kpar = "automatic", C = 5, cross = 3, prob.model = TRUE)
Run a similar model with the e1071 package. We must specify more parameters.
# type classification
svmOutput3 <- svm(type ~ ., data = trainData, type = "C", kernel = "radial", gamma = 0.00001, cost = 50, cross = 3, probability = TRUE)
# Other parameters
# gamma = c(0.00001, 0.0001, 0.002, 0.01, 0.04)
# cost = c(10, 80, 10, 200, 500, 1000)
# scale = TRUE
# degree = 3
# coef0 = 0
# nu = 0.5
# class.weights = NULL
# cachesize = 40
# tolerance = 0.001
# epsilon = 0.1
# shrinking = TRUE
# fitted = TRUE
As we can see with all these additional parameters, the svm function offers a lot more than the ksvm function.
Print the results.
summary(svmOutput3)##
## Call:
## svm(formula = type ~ ., data = trainData, type = "C", kernel = "radial",
## gamma = 1e-05, cost = 50, cross = 3, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 50
## gamma: 1e-05
##
## Number of Support Vectors: 1502
##
## ( 748 754 )
##
##
## Number of Classes: 2
##
## Levels:
## nonspam spam
##
## 3-fold cross-validation on training data:
##
## Total Accuracy: 87.18618
## Single Accuracies:
## 86.88845 86.79061 87.87879
Exploring possibilities
The package offers interesting possibilities such as plotting the relationships between each independent variable and the dependent variable type.
par(mfrow = c(3, 3))
plot(type ~ ., data = trainData)
par(mfrow = c(1, 1))
We can also perform sensitivity analyses on some parameters. Instead of single values, we enter vectors.
Run a sensitivity analysis on gamma and cost. Warning: this procedure is time-consuming!
svmOutput3b <- tune.svm(type ~ ., data = trainData, gamma = 10^(-6:-1), cost = 10^(1:2))
summary(svmOutput3b)
The results (a snapshot, since each simulation differs in the details):
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
gamma cost
0.01 10
- best performance: 0.06716165
- Detailed performance results:
gamma cost error dispersion
1 1e-06 10 0.40431224 0.034799192
2 1e-05 10 0.20575887 0.020010995
3 1e-04 10 0.10531711 0.005809638
4 1e-03 10 0.07629175 0.009072571
5 1e-02 10 0.06716165 0.011284842
6 1e-01 10 0.09488408 0.022941626
7 1e-06 100 0.20575994 0.020547132
8 1e-05 100 0.10727470 0.006302980
9 1e-04 100 0.08085308 0.009123160
10 1e-03 100 0.07074578 0.011033750
11 1e-02 100 0.06846778 0.011481863
12 1e-01 100 0.09911328 0.027344955
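The ‘best parameters’ line simply reports the grid row with the lowest cross-validated error. A base-R sketch of that selection, with the detailed results above hard-coded:

```r
# The tuning grid and errors, copied from the tune.svm output above
grid <- data.frame(
  gamma = rep(c(1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1), times = 2),
  cost  = rep(c(10, 100), each = 6),
  error = c(0.40431224, 0.20575887, 0.10531711, 0.07629175, 0.06716165,
            0.09488408, 0.20575994, 0.10727470, 0.08085308, 0.07074578,
            0.06846778, 0.09911328)
)

# Pick the row with the smallest cross-validated error
best <- grid[which.min(grid$error), ]
best   # gamma = 0.01, cost = 10, error about 0.0672
```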
We go back to our original model and plug in the ‘best parameters’ computed above: gamma = 0.01 and cost = 10.
Run the algorithm again and print the results.
svmOutput3 <- svm(type ~ ., data = trainData, type = "C", kernel = "radial", gamma = 0.01, cost = 10, cross = 3, probability = TRUE)
summary(svmOutput3)##
## Call:
## svm(formula = type ~ ., data = trainData, type = "C", kernel = "radial",
## gamma = 0.01, cost = 10, cross = 3, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.01
##
## Number of Support Vectors: 686
##
## ( 337 349 )
##
##
## Number of Classes: 2
##
## Levels:
## nonspam spam
##
## 3-fold cross-validation on training data:
##
## Total Accuracy: 93.15292
## Single Accuracies:
## 92.7593 92.95499 93.74389
The results above provide several accuracy ratios. With the ‘best parameters’, we maximize the total accuracy ratio.
Compare the results to those of the kernlab model (they are similar).
Another case
The e1071 documentation offers examples with the iris dataset.
We run a different analysis with data from the MASS package.
We all (almost) like cats. They proliferate on the Internet! The cats dataset shows various anatomical features of house cats. Bwt is the body weight in kilograms, Hwt is the heart weight in grams, and Sex should be obvious. We want to predict the Sex (‘M’ or ‘F’) with the anatomical features.
Perform the analysis and plot the results.
library(MASS)
data(cats)
str(cats)## 'data.frame': 144 obs. of 3 variables:
## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
model <- svm(Sex ~ ., data = cats)
summary(model)##
## Call:
## svm(formula = Sex ~ ., data = cats)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 84
##
## ( 39 45 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
plot(model, data = cats)
Compute predictions and measure accuracy (along with confusion matrices and other ratios).
index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index)/3))
testset <- cats[testindex,]
trainset <- cats[-testindex,]
model <- svm(Sex~., data = trainset)
prediction <- predict(model, testset[,-1])
tab <- table(pred = prediction, true = testset[,1])
tab## true
## pred F M
## F 10 5
## M 4 29
tab / sum(tab)## true
## pred F M
## F 0.20833333 0.10416667
## M 0.08333333 0.60416667
accuracy <- sum(diag(tab))/sum(tab)
accuracy## [1] 0.8125
We double-check the calculation with one of the package’s functions.
classAgreement(tab)## $diag
## [1] 0.8125
##
## $kappa
## [1] 0.5555556
##
## $rand
## [1] 0.6888298
##
## $crand
## [1] 0.3655488
The first statistic is the accuracy ratio (it matches).
kernlab, Kernel-Based Machine Learning Lab.
e1071, Misc Functions of the Department of Statistics, Probability Theory Group.
klaR, Classification and Visualization.
svmpath, The SVM Path Algorithm.
shogun, The Shogun Machine Learning Toolbox (http://www.shogun-toolbox.org/).
This one is hard to classify… A male or a female?