Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets and results; SVM packages.
  • Source: Machine Learning Repository and ‘Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning’ (Springer, 2008 & 2013).


Introducing Support Vector Machine (SVM) Algorithms

As with tree-based methods, SVMs are good supervised classifiers. We can use them to parse through emails and split a set into ‘Spam’ or ‘Not’.

Find out more about SVM:

The kernlab package

Load the package.

library(kernlab)

The package contains a set of kernel-based machine learning methods for classification, regression, clustering, novelty detection, quantile regression, and dimensionality reduction.

As with regression functions, we feed data to the model, we train it, and the function makes connections among the data. The function grasps patterns. Each observation is made of a dependent variable (‘Spam’ or ‘Not’) and one or more independent variables. With one independent variable, we can visualize the relationships with a scatter diagram (2D).

The algorithm decides whether a dot is ‘Spam’ or ‘Not’. Spam should be bunched up in one area of the scatter diagram. Using dot products, the kernel computes similarity values between dots. From these values, the algorithm associates some dots into a group and dissociates that group from the other groups of dots. An SVM maps out a geometric boundary, a hyperplane, between ‘Spam’ and ‘Not’.

Of course, there can be more than one boundary, boundaries are not necessarily linear (depending on the kernel), there can be several groups, and problems can go beyond 3D when we deal with more than two independent variables (we cannot really visualize the results).
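As a small illustration of how a kernel scores similarity (assuming kernlab is loaded; the sigma value here is arbitrary), a radial basis kernel returns 1 for identical points and decays toward 0 as points move apart:

```r
library(kernlab)

# Build a Gaussian RBF kernel function with an arbitrary sigma
rbf <- rbfdot(sigma = 0.5)

# Similarity between two observations (numeric vectors)
rbf(c(1, 2), c(1, 2))   # identical points: kernel value is 1
rbf(c(1, 2), c(4, 6))   # distant points: value close to 0
```

The SVM never needs the coordinates themselves, only these pairwise similarity values; that is what lets it draw non-linear boundaries.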

Learn more about the package with more advanced applications.

Classifying Spam Data

SVMs are good at discriminating spam from nonspam.

The data

The dataset is part of the package.

data(spam)

Inspect it (source: Machine Learning Repository).

dim(spam)
## [1] 4601   58
str(spam)
## 'data.frame':    4601 obs. of  58 variables:
##  $ make             : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ address          : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ all              : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ num3d            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ our              : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ over             : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ remove           : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ internet         : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ order            : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ mail             : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ receive          : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ will             : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ people           : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ report           : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ addresses        : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ free             : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ business         : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ email            : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ you              : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ credit           : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ your             : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ font             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ num000           : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ money            : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ hp               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hpl              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ george           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ num650           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lab              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ labs             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ telnet           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ num857           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data             : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ num415           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ num85            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ technology       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ num1999          : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ parts            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pm               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ direct           : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ cs               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ meeting          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ original         : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ project          : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ re               : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ edu              : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ table            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ conference       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ charSemicolon    : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ charRoundbracket : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ charSquarebracket: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ charExclamation  : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ charDollar       : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ charHash         : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capitalAve       : num  3.76 5.11 9.82 3.54 3.54 ...
##  $ capitalLong      : num  61 101 485 40 40 15 4 11 445 43 ...
##  $ capitalTotal     : num  278 1028 2259 191 191 ...
##  $ type             : Factor w/ 2 levels "nonspam","spam": 2 2 2 2 2 2 2 2 2 2 ...

Compare spam to nonspam (‘Not’).

table(spam$type)
## 
## nonspam    spam 
##    2788    1813

Preprocess the data

Divide the dataset into a train set (2/3) and a test set (1/3). First, shuffle the row indices.

randIndex <- sample(1:dim(spam)[1])
summary(randIndex)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    1151    2301    2301    3451    4601
length(randIndex)
## [1] 4601
head(randIndex)
## [1]  132 1424 3266 3300 4443 3247

Compute the row cut point (2/3 of the total set).

cut_Point2_3 <- floor(2 * dim(spam)[1]/3)
cut_Point2_3
## [1] 3067

Generate the train set (first rows to the cut point).

trainData <- spam[randIndex[1:cut_Point2_3],]

Generate the test set (cut point + 1 to the last row).

testData <- spam[randIndex[(cut_Point2_3+1):dim(spam)[1]],]
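A quick sanity check on the split (a sketch using the objects created above): the two subsets should partition the data with no overlap. For reproducible splits, call set.seed before sample.

```r
# Train and test rows together cover the whole dataset exactly once
nrow(trainData) + nrow(testData) == nrow(spam)                    # TRUE

# No row appears in both subsets
length(intersect(rownames(trainData), rownames(testData))) == 0   # TRUE

# The train share is about 2/3
nrow(trainData) / nrow(spam)
```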

Processing the Data

Running the algorithm

By setting the parameter kpar (it can also be set automatically through a heuristic process), we change the ‘rules of association and dissociation’.

Another parameter is the cost of constraints, C (high means picky). Like fitting a regression to a dataset, we want the kernel function to fit the dataset. A picky parameter might yield a precise fit on the train set; however, it would not generalize to other sets. We want to avoid overfitting.

Another parameter, cross, controls cross-validation. Cross-validation is important to avoid overfitting. The cross-validation process verifies that the trained algorithm can carry out classification accurately on novel data. For example, a 3-fold cross-validation splits the training data into three folds; each fold is held out in turn while the model trains on the other two, and the three error estimates are combined.

For this case, we use a radial basis function. There are different types of kernels just like there are different regression models (or link functions).

Run the ksvm function with the rbfdot kernel (much as we would run a multivariate regression). The dependent variable, type, takes two levels (‘spam’ or ‘nonspam’). Print the results.

svmOutput <- ksvm(type ~ ., data = trainData, kernel = "rbfdot", kpar = "automatic", C = 5, cross = 3,   prob.model = TRUE)

svmOutput
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 5 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0294547653214877 
## 
## Number of Support Vectors : 941 
## 
## Objective Function Value : -1643.426 
## Training error : 0.027388 
## Cross validation error : 0.073688 
## Probability model included.

The training error is low (about 3%). The cross-validation error is higher (about 7%), and it is the more realistic estimate of how the model will perform on new data: each fold is held out in turn, the model trains on the remaining folds, and the errors on the held-out folds are averaged.

Exploring results

Compute the alpha assessor.

hist(alpha(svmOutput)[[1]])

The alpha values are the coefficients of the support vectors. Because this is a two-class problem, there is only one set of coefficients ([[1]]).

Since C = 5, the alpha values range from 0 to 5. Support vectors sitting at the maximum value (alpha = C) represent the most difficult cases to classify.

On a scatter diagram, these cases would be close to, on, or beyond the boundary splitting ‘Spam’ from ‘Not’. Cases with alpha close to 0 are far from the boundary.

Rerun the kernel with C = 50 (pickier).

svmOutput2 <- ksvm(type ~ ., data = trainData, kernel = "rbfdot", kpar = "automatic", C = 50, cross = 3,   prob.model = TRUE)

svmOutput2
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 50 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0289453457015552 
## 
## Number of Support Vectors : 807 
## 
## Objective Function Value : -6901.203 
## Training error : 0.01076 
## Cross validation error : 0.08249 
## Probability model included.
summary(svmOutput2)
## Length  Class   Mode 
##      1   ksvm     S4

The training error decreases; the cross-validation error increases.

Compute the alpha assessor.

hist(alpha(svmOutput2)[[1]])

We reduced the number of hard cases.

We have over 800 support vectors in this output. Check out the vectors close to 0 (a subset).

Note: every time we run this pipeline, we get different results because the train/test split is random (use set.seed to make it reproducible). Overall, the results are similar, but small details change. The next lines present snapshots; rerunning the code will not generate the exact same outputs.

alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] < 0.05]
##  [1]  114  247  343  381  690  772  875 1115 1806 1846 1907 1930
##  [13] 2052 2161 2180 2390 2616 2718 2814 3047

A handful of observations are returned. Take a look at one row.

trainData[114,]
##      make address all num3d our over remove internet order mail receive
## 4530    0       0   0     0   0    0      0        0     0    0       0
##      will people report addresses free business email  you credit your
## 4530    0      0      0         0    0        0     0 2.33      0 1.86
##      font num000 money hp hpl george num650 lab labs telnet num857 data
## 4530    0      0     0  0   0      0      0   0    0      0      0    0
##      num415 num85 technology num1999 parts   pm direct cs meeting original
## 4530      0     0          0       0     0 0.46      0  0       0        0
##      project re  edu table conference charSemicolon charRoundbracket
## 4530    0.46  0 0.46     0          0             0            0.082
##      charSquarebracket charExclamation charDollar charHash capitalAve
## 4530                 0               0          0        0      1.117
##      capitalLong capitalTotal    type
## 4530           3           38 nonspam

Humans classified this row as nonspam (the truth). Looking at the markers used to identify spam, most are 0 and the rest are low. In other words, the message contains few traces of ‘suspicious words’.

Contrast the above results with another row: spam.

trainData[3,]
##    make address  all num3d our over remove internet order mail receive
## 69  0.3       0 0.61     0   0    0      0        0     0 0.92     0.3
##    will people report addresses free business email  you credit your font
## 69 0.92    0.3    0.3         0 2.15     0.61     0 5.53      0 1.23    0
##    num000 money hp hpl george num650 lab labs telnet num857 data num415
## 69      0   0.3  0   0      0      0   0    0      0      0    0    0.3
##    num85 technology num1999 parts pm direct cs meeting original project
## 69     0          0       0     0  0      0  0       0        0       0
##     re edu table conference charSemicolon charRoundbracket
## 69 0.3   0     0          0             0              0.1
##    charSquarebracket charExclamation charDollar charHash capitalAve
## 69                 0           1.053      0.351     0.25      3.884
##    capitalLong capitalTotal type
## 69          66          303 spam

This time, there are more lights on!

Read the results.

cut <- alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] < 0.05]

trainData[cut, "type"]
##  [1] nonspam nonspam nonspam nonspam nonspam nonspam spam 
##  [8] nonspam spam    spam    nonspam spam    spam    nonspam
##  [15] nonspam nonspam nonspam nonspam nonspam nonspam
## Levels: nonspam spam

Most of them are nonspam.

Check the other end of the vectors.

alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] == 50]
##   [1]    3   28   99  135  164  176  219  224  266  273  308  339  348  358
##  [15]  367  370  426  571  619  641  717  727  747  751  826  831  875  884
##  [29]  885  958 1008 1028 1052 1061 1077 1117 1136 1145 1147 1173 1178 1208
##  [43] 1228 1241 1293 1309 1323 1351 1355 1359 1375 1401 1418 1445 1471 1558
##  [57] 1614 1627 1632 1678 1681 1720 1734 1750 1762 1777 1800 1864 1937 1942
##  [71] 1952 1990 2017 2022 2023 2048 2054 2059 2238 2257 2341 2368 2376 2401
##  [85] 2437 2473 2519 2537 2563 2613 2614 2675 2717 2741 2782 2795 2798 2822
##  [99] 2848 2866 2888 2942 2973

Pick a row.

trainData[2973,]
##     make address  all num3d  our over remove internet order mail receive
## 860 0.09       0 0.09     0 0.39 0.09   0.09        0  0.19 0.29    0.39
##     will people report addresses free business email  you credit your font
## 860 0.48      0   0.58         0 0.87     0.19     0 1.66    4.1 1.66    0
##     num000 money hp hpl george num650 lab labs telnet num857 data num415
## 860   0.39  0.19  0   0      0      0   0    0      0      0    0      0
##     num85 technology num1999 parts pm direct cs meeting original project
## 860     0          0       0     0  0      0  0       0        0       0
##     re edu table conference charSemicolon charRoundbracket
## 860  0   0     0          0             0             0.14
##     charSquarebracket charExclamation charDollar charHash capitalAve
## 860                 0           0.326      0.155        0      6.813
##     capitalLong capitalTotal type
## 860         494         1458 spam

These patterns seem to be confusing the algorithm.

Read the results and compute the share of spam and nonspam.

cut <- alphaindex(svmOutput2)[[1]][alpha(svmOutput2)[[1]] == 50]

trainData[cut,"type"]
##   [1] nonspam spam    spam    spam    nonspam nonspam nonspam spam   
##   [9] spam    nonspam nonspam nonspam spam    nonspam nonspam nonspam
##  [17] spam    spam    nonspam spam    nonspam spam    spam    spam   
##  [25] nonspam spam    nonspam nonspam spam    spam    nonspam nonspam
##  [33] nonspam spam    spam    nonspam nonspam spam    spam    nonspam
##  [41] spam    nonspam nonspam nonspam nonspam spam    spam    spam   
##  [49] nonspam spam    spam    spam    nonspam nonspam nonspam nonspam
##  [57] nonspam nonspam nonspam spam    nonspam nonspam spam    spam   
##  [65] nonspam nonspam spam    nonspam nonspam nonspam nonspam nonspam
##  [73] nonspam spam    spam    spam    spam    spam    spam    nonspam
##  [81] spam    nonspam nonspam spam    spam    spam    spam    spam   
##  [89] nonspam nonspam spam    spam    nonspam spam    nonspam spam   
##  [97] spam    spam    spam    spam    nonspam nonspam spam    spam   
## [105] spam    spam    spam    nonspam spam    spam   
## Levels: nonspam spam
# total
length(cut)
## [1] 110
# spam only
sum(trainData[cut,"type"] == 'spam')
## [1] 57
# % spam only
sum(trainData[cut,"type"] == 'spam') / length(cut)
## [1] 0.5181818

No matter how many times we run the algorithm, the split stays close to 50-50: sometimes 60-40, sometimes 45-55, but on average around 50-50. No wonder these are the hard cases.

Predictions and performance measures

Use the support vectors we generated through this training process with another dataset to predict outcomes.

Run the trained algorithm on the test set.

svmPred2 <- predict(svmOutput2, testData, type = "votes")

str(svmPred2)
##  num [1:2, 1:1534] 1 0 0 1 1 0 1 0 1 0 ...

The prediction process works like a vote. The algorithm ‘votes’ on whether each observation is ‘Spam’ or ‘Not’.

Note: the first row of the votes matrix holds a 1 for a nonspam vote and a 0 for a spam vote. Because this is a two-class problem, the second row is the exact opposite. We can use either one, as they are mirror images of each other.
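Alternatively, predict can return factor labels directly (type = "response", the default), which skips the votes-to-labels conversion. A sketch using the model trained above:

```r
# Predict class labels instead of votes
predLabels <- predict(svmOutput2, testData)

# Cross-tabulate truth against prediction in one step
table(truth = testData$type, predicted = predLabels)
```

The votes route below is kept because it makes the 0/1 coding of the decision explicit.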

Generate a data frame:

  • The 58th column of the test set (true values).
  • The 1st row of the votes matrix (predictions).

compTable2 <- data.frame(testData[,58], svmPred2[1,])
head(compTable2)
##   testData...58. svmPred2.1...
## 1        nonspam             1
## 2           spam             0
## 3        nonspam             1
## 4        nonspam             1
## 5        nonspam             1
## 6        nonspam             1

  • Left: true values are spam and nonspam.
  • Right: predictions are 1 (nonspam) and 0 (spam).

Compute the confusion matrix and proportional confusion matrix.

conf_full2 <- table(compTable2)
conf_full2
##               svmPred2.1...
## testData...58.   0   1
##        nonspam  53 873
##        spam    540  68
conf_full2 / sum(conf_full2)
##               svmPred2.1...
## testData...58.          0          1
##        nonspam 0.03455020 0.56910039
##        spam    0.35202086 0.04432855

On both matrices:

  • Top-left: predicted as spam (0) while the truth is nonspam: error.
  • Top-right: truthfully predicted as nonspam.
  • Bottom-left: truthfully predicted as spam.
  • Bottom-right: predicted as nonspam (1) while the truth is spam: error.

Wrongful predictions are Type I errors (false positives: nonspam flagged as spam) and Type II errors (false negatives: spam let through).
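These two error rates can be read off the confusion matrix directly; a sketch using the row and column names of conf_full2 above (column "0" is a spam vote, column "1" a nonspam vote):

```r
# Type I error rate: share of nonspam messages predicted as spam
type1 <- conf_full2["nonspam", "0"] / sum(conf_full2["nonspam", ])

# Type II error rate: share of spam messages predicted as nonspam
type2 <- conf_full2["spam", "1"] / sum(conf_full2["spam", ])

type1
type2
```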

Compute the accuracy ratio. Note that with this table layout, the diagonal holds the misclassifications, so accuracy is one minus that share.

inv_acc_full2 <- sum(diag(conf_full2))/sum(conf_full2)
1 - inv_acc_full2
## [1] 0.9211213

The accuracy ratio is about 92%: the vast majority of messages are classified correctly.

Remember that the cross-validation error is about 8.2% with C = 50. Parameters have an impact on the confusion matrix and the accuracy ratio.

For the record, compute results for the first model where C = 5.

svmPred <- predict(svmOutput, testData, type = "votes")

compTable <- data.frame(testData[,58], svmPred[1,])

conf_full <- table(compTable)
conf_full
##               svmPred.1...
## testData...58.   0   1
##        nonspam  41 885
##        spam    547  61
conf_full / sum(conf_full)
##               svmPred.1...
## testData...58.          0          1
##        nonspam 0.02672751 0.57692308
##        spam    0.35658409 0.03976532
inv_acc_full <- sum(diag(conf_full))/sum(conf_full)
1 - inv_acc_full
## [1] 0.9335072

Visualization

If we could illustrate the results, we would get two zones, ‘Spam’ and ‘Not’, divided by a boundary: a hyperplane. The two axes would be two independent variables, ‘X1’ and ‘X2’. Most ‘O’ points, whether they mean spam or nonspam, fall into the blue zone, but we can find some in the pink zone (errors); vice versa for ‘X’. A good model puts about 95% of the ‘O’ and ‘X’ points in the right zone.


The diagram above is generated with the e1071 package. Although it is not related to this case, it is a good illustration. It is always easier to illustrate models with fewer than three independent variables. We have such a case below.


The e1071 package

We replicate the above analysis with a different package.

The package is supported by a website: A Library for Support Vector Machines.

Load the package

library(e1071)

Running the algorithm

Remember the code from the kernlab package.

svmOutput <- ksvm(type ~ ., data = trainData, kernel = "rbfdot", kpar = "automatic", C = 5, cross = 3,   prob.model = TRUE)

Run a similar model with the e1071 package. We must specify more parameters.

# type classification
svmOutput3 <- svm(type ~ ., data = trainData, type = "C", kernel = "radial", gamma = 0.00001, cost = 50, cross = 3, probability = TRUE)

# Other parameters
# gamma = c(0.00001, 0.0001, 0.002, 0.01, 0.04)
# cost = c(10, 80, 10, 200, 500, 1000)
# scale = TRUE
# degree = 3
# coef0 = 0
# nu = 0.5
# class.weights = NULL
# cachesize = 40
# tolerance = 0.001
# epsilon = 0.1
# shrinking = TRUE
# fitted = TRUE

As all these additional parameters show, the svm function exposes more options than the ksvm function.

Print the results.

summary(svmOutput3)
## 
## Call:
## svm(formula = type ~ ., data = trainData, type = "C", kernel = "radial", 
##     gamma = 1e-05, cost = 50, cross = 3, probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  50 
##       gamma:  1e-05 
## 
## Number of Support Vectors:  1502
## 
##  ( 748 754 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  nonspam spam
## 
## 3-fold cross-validation on training data:
## 
## Total Accuracy: 87.18618 
## Single Accuracies:
##  86.88845 86.79061 87.87879

Exploring possibilities

The package offers interesting possibilities such as plotting the relationships between each independent variable and the dependent variable type.

par(mfrow = c(3, 3))
plot(type ~ ., data = trainData)

par(mfrow = c(1, 1))

We can also perform sensitivity analyses on some parameters: instead of single values, we supply vectors of candidate values.

Run a sensitivity analysis on gamma and cost. Warning: this procedure is time-consuming!

svmOutput3b <- tune.svm(type ~ ., data = trainData, gamma = 10^(-6:-1), cost = 10^(1:2))
summary(svmOutput3b)

The results (a snapshot, since each run differs slightly in the details):

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 gamma cost
  0.01   10

- best performance: 0.06716165 

- Detailed performance results:
   gamma cost      error  dispersion
1  1e-06   10 0.40431224 0.034799192
2  1e-05   10 0.20575887 0.020010995
3  1e-04   10 0.10531711 0.005809638
4  1e-03   10 0.07629175 0.009072571
5  1e-02   10 0.06716165 0.011284842
6  1e-01   10 0.09488408 0.022941626
7  1e-06  100 0.20575994 0.020547132
8  1e-05  100 0.10727470 0.006302980
9  1e-04  100 0.08085308 0.009123160
10 1e-03  100 0.07074578 0.011033750
11 1e-02  100 0.06846778 0.011481863
12 1e-01  100 0.09911328 0.027344955

We go back to our original model and plug in the ‘best parameters’ computed above: gamma = 0.01 and cost = 10.

Run the algorithm again and print the results.

svmOutput3 <- svm(type ~ ., data = trainData, type = "C", kernel = "radial", gamma = 0.01, cost = 10, cross = 3, probability = TRUE)

summary(svmOutput3)
## 
## Call:
## svm(formula = type ~ ., data = trainData, type = "C", kernel = "radial", 
##     gamma = 0.01, cost = 10, cross = 3, probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.01 
## 
## Number of Support Vectors:  686
## 
##  ( 337 349 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  nonspam spam
## 
## 3-fold cross-validation on training data:
## 
## Total Accuracy: 93.15292 
## Single Accuracies:
##  92.7593 92.95499 93.74389

The results above provide several accuracy ratios. With the ‘best parameters’, we maximize the total accuracy ratio.

Compare the results to those of the kernlab model (they are similar).

Another case

The e1071 documentation offers examples with the iris dataset.

We run a different analysis with data from the MASS package.

We (almost) all like cats. They proliferate on the Internet! The cats dataset records anatomical features of house cats: Bwt is the body weight in kilograms, Hwt is the heart weight in grams, and Sex should be obvious. We want to predict Sex (‘M’ or ‘F’) from the anatomical features.

Perform the analysis and plot the results.

library(MASS)

data(cats)
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
model <- svm(Sex ~ ., data = cats)

summary(model)
## 
## Call:
## svm(formula = Sex ~ ., data = cats)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  84
## 
##  ( 39 45 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M
plot(model, data = cats)

Compute predictions and measure accuracy (along with confusion matrices and other ratios).

index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index)/3))
testset <- cats[testindex,]
trainset <- cats[-testindex,]

model <- svm(Sex~., data = trainset)
prediction <- predict(model, testset[,-1])

tab <- table(pred = prediction, true = testset[,1])
tab
##     true
## pred  F  M
##    F 10  5
##    M  4 29
tab / sum(tab)
##     true
## pred          F          M
##    F 0.20833333 0.10416667
##    M 0.08333333 0.60416667
accuracy <- sum(diag(tab))/sum(tab)
accuracy
## [1] 0.8125

We double-check the calculation with one of the package’s functions.

classAgreement(tab)
## $diag
## [1] 0.8125
## 
## $kappa
## [1] 0.5555556
## 
## $rand
## [1] 0.6888298
## 
## $crand
## [1] 0.3655488

The first statistic is the accuracy ratio (it matches).

All SVM Packages

  • kernlab, Kernel-Based Machine Learning Lab.
  • e1071, Misc Functions of the Department of Statistics, Probability Theory Group.
  • klaR, Classification and Visualization.
  • svmpath, The SVM Path Algorithm.
  • shogun, The Shogun Machine Learning Toolbox (http://www.shogun-toolbox.org/).

This one is hard to classify… A male or a female?