Foreword
“Modélisation prédictive et apprentissage statistique avec R” means “Predictive modeling and statistical learning with R”. Statistical learning is synonymous with machine learning.
The author implements 30 machine learning techniques on the same dataset and with the same objective: predicting a binary dependent variable. In other words, the same case study is repeated in 30 variations, from classical, proven techniques to more recent, sophisticated algorithms.
The case is about credit scoring. The author uses the ‘German Credit Data’ dataset, which is publicly available in many repositories (for example, the UCI Machine Learning Repository) and is also included in the caret package (after loading the package with library(caret), load the dataset with data(GermanCredit)).
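Following the pointer above, the dataset can be loaded and split into training and test sets. The 70/30 ratio and the seed below are illustrative assumptions, not necessarily the split used in the book:

```r
# Load the German Credit Data shipped with the caret package and
# create an illustrative train/test split (70/30, arbitrary seed).
library(caret)
data(GermanCredit)

set.seed(1)
n <- nrow(GermanCredit)                       # 1000 credit applicants
train_idx <- sample(n, size = round(0.7 * n))
train <- GermanCredit[train_idx, ]
test  <- GermanCredit[-train_idx, ]

table(train$Class)   # binary target: Good / Bad credit risk
```

The `Class` column is the binary dependent variable that all 30 techniques try to predict.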
The techniques can be grouped into four categories.
For each category, the author provides the AUC results on the test set and identifies the best performing techniques. The author discusses each of the 30 techniques: their specificities, procedures, computations, interpretations, tips, advantages and drawbacks.
Not only can we learn how to implement these techniques, but we can also compare the procedures and the results. The author also releases the R code.
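Since every technique in the book is compared by its test-set AUC, a minimal base-R sketch of how the AUC can be computed (via the rank/Wilcoxon statistic) may help; the scores and labels below are made-up illustrative values, not results from the book:

```r
# Test-set AUC via the rank (Wilcoxon) statistic, base R only.
auc <- function(scores, labels) {
  # labels: 1 = positive class, 0 = negative class
  r <- rank(scores)                 # ranks of all predicted scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # sum of positive-class ranks, shifted and normalized to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
auc(scores, labels)   # 1 of the 9 positive/negative pairs is mis-ordered -> 8/9
```

This is the probability that a randomly chosen positive case is scored above a randomly chosen negative case; packages such as pROC or ROCR (used in the book) compute the same quantity.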
See the table endnotes.
| No | Predictive method | AUC Test | Chap | Page | Off-the-shelf | Readability | Computation speed | Overall rating | Best |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Logistic regression (logit), automatic (unsupervised) | 0.713 | 5.1 | 61 | 1 | 2 | 2 | 5 | |
| 2 | Logistic regression (logit), automatic (unsupervised) | 0.730 | 5.1 | 62 | 1 | 2 | 2 | 5 | |
| 3 | Logit, supervised | 0.762 | 5.1 | 63 | 0 | 2 | 2 | 4 | |
| 4 | Probit | 0.758 | 5.1 | 110 | 0 | 2 | 2 | 4 | |
| 5 | Log-Log | 0.763 | 5.11 | 113 | 0 | 2 | 2 | 4 | |
| 6 | Logit, supervised | 0.765 | 5.5-5.9 | 91, 104 | 0 | 2 | 2 | 4 | |
| 7 | Logit, global selection | 0.765 | 5.2-5.3 | 80 | 0 | 2 | 1 | 3 | |
| 8 | Logit, with all possible combinations | 0.787 | 5.4 | 86 | 1 | 2 | 1 | 4 | |
| 9 | Logit, principal components of Multiple Correspondence Analysis (MCA) | 0.793 | | | 1 | 2 | 2 | 5 | max AUC |
| 10 | Logistic regression with the ridge shrinkage method | 0.784 | 6 | 134 | 1 | 2 | 2 | 5 | |
| 11 | Logistic regression with the lasso shrinkage method | 0.774 | 7 | 158 | 1 | 2 | 2 | 5 | |
| 12 | Partial Least Squares logistic regression | 0.780 | 8 | 161 | 1 | 2 | 2 | 5 | |
| 13 | CART | 0.740 | 9 | 192 | 2 | 2 | 2 | 6 | |
| 14 | PRIM, principal components of Multiple Correspondence Analysis (MCA) | 0.775 | 10 | 213 | 2 | 1 | 1 | 4 | |
| 15 | Bagging | 0.759 | 12 | 249 | 2 | 0 | 1 | 3 | |
| 16 | Random Forest CART | 0.783 | 11.6 | 243 | 2 | 0 | 1 | 3 | |
| 17 | Extra-Trees | 0.786 | 11.7 | 246 | 2 | 0 | 2 | 4 | |
| 18 | Random Forest Logit | 0.791 | 13 | 257-258 | 2 | 2 | 0 | 4 | max AUC |
| 19 | Boosting CART | 0.784 | 14.3 | 282, 289 | 1 | 0 | 0 | 1 | |
| 20 | Boosting Logit | 0.782 | 14.5 | 289 | 1 | 2 | 0 | 3 | |
| 21 | SVM with linear kernel | 0.743 | 15.2, 15.3 | 302, 317 | 0 | 2 | 1 | 3 | |
| 22 | SVM with linear kernel, principal components of Multiple Correspondence Analysis (MCA) | 0.800 | 15.8, 15.9 | 329 | 0 | 2 | 1 | 3 | max AUC |
| 23 | SVM with polynomial kernel of second degree | 0.778 | 15.7 | 313, 317 | 0 | 0 | 1 | 1 | |
| 24 | SVM with polynomial kernel of third degree | 0.776 | 15.7 | 316, 317 | 0 | 0 | 1 | 1 | |
| 25 | SVM with Gaussian radial basis function kernel | 0.772 | 15.4 | 317 | 0 | 0 | 1 | 1 | |
| 26 | SVM with Laplacian basis function kernel | 0.773 | 15.5 | 311 | 0 | 0 | 1 | 1 | |
| 27 | SVM with sigmoid kernel | 0.744 | 15.6 | 313, 317 | 0 | 0 | 1 | 1 | |
| 28 | Neural Networks | 0.785 | 16.2 | 338 | 0 | 0 | 1 | 1 | |
| 29 | Boosting Neural Networks | 0.789 | 16.4 | 353 | 0 | 0 | 0 | 0 | |
| 30 | Random Forest Neural Networks | 0.795 | 16.5 | 360-2 | 0 | 0 | 0 | 0 | max AUC |
Notes
| No | Predictive method | AUC Test | R libraries | R basic functions | Description |
|---|---|---|---|---|---|
| 1 | Logistic regression (logit), automatic (unsupervised) | 0.713 | | glm(), step() | Ascending (forward) stepwise selection based on BIC, k=log(number of observations). |
| 2 | Logistic regression (logit), automatic (unsupervised) | 0.730 | | glm(), step() | Ascending (forward) stepwise selection based on AIC, k=2. |
| 3 | Logit, supervised | 0.762 | | glm(), step() | Ascending (forward) stepwise selection based on Mallows' Cp and BIC, k=2 (set by the analyst). |
| 4 | Probit | 0.758 | | glm() | Based on the previously selected predictors. |
| 5 | Log-Log | 0.763 | | glm() | Based on the previously selected predictors. |
| 6 | Logit, supervised | 0.765 | | glm(), step() | Stepwise selection with discrimination optimized via the AUC; continuous and discrete variables are converted into factors (with labels and levels) – see section 3.3 and chapter 4 for the automatic supervised discretization of continuous variables – e.g., ages are grouped into age categories. |
| 7 | Logit, global selection | 0.765 | leaps | | An alternative to stepwise selection: the leaps-and-bounds algorithm, where Mallows' Cp (p. 71) and BIC (p. 78) are optimized over the number of predictors to find the optimal subset. |
| 8 | Logit, with all possible combinations | 0.787 | combinat | | Sweeping over all possible combinations of predictors. |
| 9 | Logit, principal components of Multiple Correspondence Analysis (MCA) | 0.793 | FactoMineR | | 6 components. |
| 10 | Logistic regression with the ridge shrinkage method | 0.784 | glmnet | | L2 penalty, alpha=0; variance regularization against heteroscedasticity (the increasing variance of the error term). |
| 11 | Logistic regression with the lasso shrinkage method | 0.774 | glmnet, grplasso | | L1 penalty, alpha=1; variance regularization – several variants exist: relaxed lasso, SCAD, adaptive lasso, group lasso, elastic net (a ridge and lasso combination) – alternative packages: penalized, lasso2, lars, logistf. |
| 12 | Partial Least Squares logistic regression | 0.780 | plsRglm | | An alternative to variance regularization, 1 component – for cases where k > n or with missing values. |
| 13 | CART | 0.740 | tree, rpart | | 9 tree nodes – a higher AUC is attainable without pruning the tree – other tree-based methods: CHAID was followed by CART, then C4.5, then C5.0. |
| 14 | PRIM, principal components of Multiple Correspondence Analysis (MCA) | 0.775 | prim, FactoMineR | | The Patient Rule Induction Method is an alternative to variance regularization; 6 components, alpha=0.5, beta=0.06 – a spanning tree in a 2-dimensional space – alpha shrinks and beta expands the box, alpha>beta. |
| 15 | Bagging | 0.759 | ipred | | Ensemble method with randomization. |
| 16 | Random Forest CART | 0.783 | randomForest | | Ensemble method with randomization – 500 iterations, out-of-bag (OOB) sampling with replacement, node size=5. |
| 17 | Extra-Trees | 0.786 | extraTrees | | Ensemble method with extremely randomized trees – 500 iterations, out-of-bag (OOB) sampling with replacement, node size=5. |
| 18 | Random Forest Logit | 0.791 | randomForest | | Ensemble method with randomization – 300 iterations, random selection of 2, 3 or 4 predictors out of 8, based on AIC. |
| 19 | Boosting CART | 0.784 | ada | | Real AdaBoost, exponential loss function, 1800 iterations, penalty=0.01 – ensemble method, deterministic and adaptive with successive improvements – several variants: Discrete AdaBoost, Real AdaBoost, and the arcing variant. |
| 20 | Boosting Logit | 0.782 | ada | | Real AdaBoost, exponential loss function, 3000 iterations, penalty=0.001. |
| 21 | SVM with linear kernel | 0.743 | e1071 | | Penalty parameter of the error term C=0.1 – generalized discriminant analysis, separator – alternative libraries: klaR, svmpath, LiblineaR. |
| 22 | SVM with linear kernel, principal components of Multiple Correspondence Analysis (MCA) | 0.800 | e1071, kernlab, FactoMineR | | 6 components, penalty parameter of the error term C=0.1. |
| 23 | SVM with polynomial kernel of second degree | 0.778 | e1071 | | gamma=0.02, C=2.9, coef0=-0.005. |
| 24 | SVM with polynomial kernel of third degree | 0.776 | e1071 | | gamma=0.049, C=0.99, coef0=-0.076. |
| 25 | SVM with Gaussian radial basis function kernel | 0.772 | e1071 | | gamma=0.0643, C=1.2. |
| 26 | SVM with Laplacian basis function kernel | 0.773 | kernlab | | sigma=0.312, C=1.79. |
| 27 | SVM with sigmoid kernel | 0.744 | e1071 | | gamma=1/48, C=1. |
| 28 | Neural Networks | 0.785 | nnet | | 100 iterations, 1 hidden layer of size 2, weight decay=1. |
| 29 | Boosting Neural Networks | 0.789 | nnet, ada | | Real AdaBoost, 1 hidden-layer perceptron of size 10, weight decay=0.1 – several variants: Discrete AdaBoost, Real AdaBoost, and the arcing variant. |
| 30 | Random Forest Neural Networks | 0.795 | nnet, randomForest | | 500 iterations, 1 hidden-layer perceptron of size 5 or 10, weight decay=0.1 or 0.01, randomly sampling 3 variables at each iteration. |
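Techniques 1–3 in the table rely on glm() and step(); in step(), the penalty argument k selects the criterion (k = 2 for AIC, k = log(n) for BIC). A minimal sketch on synthetic data — the book applies ascending selection to the German Credit predictors; for brevity this starts from a full model and searches in both directions:

```r
# Stepwise logit selection: k = 2 gives AIC, k = log(n) gives BIC.
# Synthetic data for illustration only; x3 is pure noise.
set.seed(42)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 - 0.6 * x2))

full <- glm(y ~ x1 + x2 + x3, family = binomial)

fit_aic <- step(full, direction = "both", k = 2,      trace = 0)  # AIC
fit_bic <- step(full, direction = "both", k = log(n), trace = 0)  # BIC

names(coef(fit_bic))  # BIC tends to keep only the informative predictors
```

The heavier log(n) penalty explains why the BIC model (technique 1) is more parsimonious than the AIC model (technique 2).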
| Bagging method | Random Forests | Boosting methods |
|---|---|---|
| Random mechanism – probabilistic. | Random mechanism – probabilistic. | Adaptive mechanism – generally deterministic. |
| For each iteration, the ‘machine’ learns with a different bootstrap sample. | For each iteration, the ‘machine’ learns with a different bootstrap sample. | For each iteration, the ‘machine’ learns with the full initial sample, except for the arcing method (which resamples, like bagging). |
| For each iteration, the ‘machine’ learns with all predictors. | For each iteration, the ‘machine’ learns with a random subset of all predictors. | For each iteration, the ‘machine’ learns with all predictors. |
| For each iteration, the model must perform well with all observations. | For each iteration, the model must perform well with all observations – each individual model underperforms its bagging counterpart, since only a subset of predictors is used. | For each iteration, the model must perform well with all observations – some models perform well on outliers, but less well on the other observations. |
| In the final aggregation, all generated models are equally weighted. | In the final aggregation, all generated models are equally weighted. | In the final aggregation, each generated model is weighted according to its error rate. |
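A minimal base-R sketch of the bagging column above — equal-weight aggregation of models fit on bootstrap samples. Logistic regression stands in as the base learner for brevity (the book bags trees), and the data are synthetic:

```r
# Bagging in miniature: B bootstrap fits, equally weighted aggregation.
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
d <- data.frame(x, y)

B <- 50
preds <- replicate(B, {
  boot <- d[sample(n, replace = TRUE), ]           # a different bootstrap sample
  m <- glm(y ~ x, family = binomial, data = boot)  # learner sees all predictors
  predict(m, newdata = d, type = "response")
})
bagged <- rowMeans(preds)   # all B models equally weighted
range(bagged)               # averaged probabilities stay in [0, 1]
```

Averaging the B predictions reduces variance at a fixed bias, which is exactly the trade-off the tables describe.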
| Bagging method | Random Forests | Boosting methods | Notes |
|---|---|---|---|
| For a given bias, it reduces the variance by averaging the models (high variance can cause overfitting). | For a given bias, it greatly reduces the variance by averaging the models (high variance can cause overfitting). | Can reduce both the variance and the bias of the classifier (high bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)). However, with a stable classifier, for a given bias, the variance can increase. | The bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set. |
| Less readable when the classifier is a classification tree. | Less readable when the classifier is a classification tree. | Less readable when the classifier is a classification tree. | |
| Does not handle stumps efficiently. | Handles stumps efficiently. | Handles stumps very efficiently. | A decision stump is a weak classification model (among all the other generated models) with the simplest tree structure: a single split, i.e. a one-level decision tree. Due to its simplicity, a stump alone usually shows low predictive performance. |
| Iterations converge rapidly. | Iterations converge rapidly. | Iterations converge slowly (can take 10 times more iterations). | |
| The algorithm can compute in parallel. | The algorithm can compute in parallel. | No parallel computing possible since the algorithm is sequential (step by step). | Parallel computing can be done with packages biglm, ff, ffbase, snow, etc. |
| No overfitting – beats boosting when there is a large amount of noise. | No overfitting. | The overfitting risk increases with the number of iterations. | |
| Simple to set up, with fewer parameters, but the classifier underperforms the other methods. | Random forests are always better classifiers than bagging, and sometimes beat boosting when discrete (categorical or factor) predictors are abundant. | Boosting is generally a better classifier than bagging when the amount of noise is limited. | |
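To make the boosting column concrete, here is a minimal Discrete AdaBoost sketch in base R with decision stumps as the weak learners: each stump's vote is weighted from its error rate, and the observation weights are re-weighted at every iteration. The one-dimensional synthetic data are illustrative only; the book's boosting uses the ada package:

```r
# Discrete AdaBoost with one-split "stumps" as weak learners.
set.seed(3)
n <- 200
x <- runif(n)
y <- ifelse(x > 0.5, 1, -1)
flip <- sample(n, 20)
y[flip] <- -y[flip]                              # 10% label noise

stump <- function(x, t, dir) dir * sign(x - t)   # one-level decision tree

w <- rep(1 / n, n)                               # observation weights
M <- 20
alphas <- numeric(M); ts <- numeric(M); dirs <- numeric(M)
for (m in 1:M) {
  # pick the stump minimizing the weighted error
  best <- list(err = Inf)
  for (t in seq(0.05, 0.95, by = 0.05)) for (dir in c(-1, 1)) {
    err <- sum(w * (stump(x, t, dir) != y))
    if (err < best$err) best <- list(err = err, t = t, dir = dir)
  }
  alpha <- 0.5 * log((1 - best$err) / best$err)  # model weight from its error rate
  w <- w * exp(-alpha * y * stump(x, best$t, best$dir))
  w <- w / sum(w)                                # re-weight the observations
  alphas[m] <- alpha; ts[m] <- best$t; dirs[m] <- best$dir
}

# final classifier: error-weighted vote of the M stumps
score <- rowSums(sapply(1:M, function(m) alphas[m] * stump(x, ts[m], dirs[m])))
mean(sign(score) == y)   # training accuracy
```

The sequential dependence of each round on the previous weights is why, as the table notes, boosting cannot be parallelized the way bagging and random forests can.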
In addition to the packages referenced in the tables, the book uses: MASS, boot, gmodels, car, corrplot, ggplot2, lattice, rgl, biglm, ff, ffbase, foreach, snow, doSNOW, caret, arules, arulesViz, ade4, FactoMineR, foreign, pROC, ROCR.

Some package descriptions:
- ade4: multivariate data analysis, graphical display.
- arules: association rules, apriori algorithm, market basket analysis, data mining.
- biglm: bounded-memory linear regression for data too large to fit in memory.
- boot: bootstrapping, random resampling.
- caret: preprocessing, classification & regression models, feature selection, resampling.
- FactoMineR: dimension reduction, multivariate data analysis (PCA, MCA, factor analysis, etc.), graphical display.
- ff, ffbase: data structures stored on disk, but behaving as if they were in RAM.
- foreach: loops.
- foreign: read & write foreign files: SAS, SPSS, Stata, dBase, etc.
- gmodels: model fitting.
- missForest: nonparametric imputation of missing values using random forests.
- rgl: 3D interactive graphics.

Table of contents (translated from the French): Presentation of the dataset. Data preparation. Data exploration. Automatic supervised discretization of continuous variables. Logistic regression. Ridge penalized logistic regression. Lasso penalized logistic regression. PLS logistic regression. The CART decision tree. The PRIM algorithm. Random forests. Bagging. Random forests of logistic models. Boosting. Support Vector Machines. Neural networks. Summary of the predictive methods. Appendices. Bibliography. Index of the R packages used.