
The Book


“Modélisation prédictive et apprentissage statistique avec R” means “Predictive modeling and statistical learning with R”. Statistical learning is synonymous with machine learning.

The author implements 30 machine learning techniques with the same dataset and with the same objective: predict a binary dependent variable. In other words, we repeat the same case study, in 30 variations, from the classical and proven techniques to the more recent and sophisticated algorithms.

The case is about credit scoring. The author uses the ‘German Credit Data’, made publicly available in many online repositories and included in the caret package (after loading the package with library(caret), load the dataset with data(GermanCredit)).
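The caret route can be sketched as follows (assuming caret is installed; GermanCredit ships with it, with the binary outcome in the Class column):

```r
library(caret)           # the GermanCredit data frame ships with caret

data(GermanCredit)       # 1,000 loan applicants described by 62 variables
str(GermanCredit$Class)  # binary target: factor with levels "Bad" and "Good"
```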

The techniques can be grouped into 4 categories:

  • Regressions.
  • Tree-based and ensemble methods.
  • Support Vector Machines (SVM) algorithms.
  • Neural networks.

For each category, the author provides the AUC results from the test set and identifies the best-performing techniques. The author discusses each of the 30 techniques: their specificities, procedures, computations, interpretations, tips, advantages, and drawbacks.

Not only can we learn how to implement these techniques, we can compare the procedures and the results as well. The author also releases the R code.
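Since every technique below is ranked by its test-set AUC, it helps to recall what that number measures; a minimal base-R sketch (a hypothetical helper, not the book's code) using the rank-based Mann–Whitney formulation:

```r
# AUC = probability that a randomly chosen positive case is scored
# higher than a randomly chosen negative case (Mann-Whitney formulation).
auc <- function(score, label) {
  r <- rank(score)                 # ranks of the predicted scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))  # 3 of 4 positive/negative pairs ranked correctly: 0.75
```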

Summary of the Results

Table 1 – Performance measure, chapter, page, ratings

See the table endnotes.

No Predictive method AUC Test Chap Page Off-the-shelf Readability Computation speed Overall rating Best
1 Logistic regression (logit), automatic (unsupervised) 0.713 5.1 61 1 2 2 5
2 Logistic regression (logit), automatic (unsupervised) 0.730 5.1 62 1 2 2 5
3 Logit, supervised 0.762 5.1 63 0 2 2 4
4 Probit 0.758 5.1 110 0 2 2 4
5 Log-Log 0.763 5.11 113 0 2 2 4
6 Logit, supervised 0.765 5.5-5.9 91, 104 0 2 2 4
7 Logit, global selection 0.765 5.2-5.3 80 0 2 1 3
8 Logit, with all possible combinations 0.787 5.4 86 1 2 1 4
9 Logit, principal components of Multiple Correspondence Analysis (MCA) 0.793 1 2 2 5 max AUC
10 Logistic regression with the ridge shrinkage method 0.784 6 134 1 2 2 5
11 Logistic regression with the lasso shrinkage method 0.774 7 158 1 2 2 5
12 Partial Least Squares logistic regression 0.780 8 161 1 2 2 5
13 CART 0.740 9 192 2 2 2 6
14 PRIM, principal components of Multiple Correspondence Analysis (MCA) 0.775 10 213 2 1 1 4
15 Bagging 0.759 12 249 2 0 1 3
16 Random Forest CART 0.783 11.6 243 2 0 1 3
17 Extra-Trees 0.786 11.7 246 2 0 2 4
18 Random Forest Logit 0.791 13 257-258 2 2 0 4 max AUC
19 Boosting CART 0.784 14.3 282, 289 1 0 0 1
20 Boosting Logit 0.782 14.5 289 1 2 0 3
21 SVM with linear kernel 0.743 15.2, 15.3 302, 317 0 2 1 3
22 SVM with linear kernel, principal components of Multiple Correspondence Analysis (MCA) 0.800 15.8, 15.9 329 0 2 1 3 max AUC
23 SVM with polynomial kernel of second degree 0.778 15.7 313, 317 0 0 1 1
24 SVM with polynomial kernel of third degree 0.776 15.7 316, 317 0 0 1 1
25 SVM with Gaussian radial basis function kernel 0.772 15.4 317 0 0 1 1
26 SVM with Laplacian kernel 0.773 15.5 311 0 0 1 1
27 SVM with sigmoid kernel 0.744 15.6 313, 317 0 0 1 1
28 Neural Networks 0.785 16.2 338 0 0 1 1
29 Boosting Neural Networks 0.789 16.4 353 0 0 0 0
30 Random Forest Neural Networks 0.795 16.5 360-2 0 0 0 0 max AUC

Notes

  • AUC Test: the area under the ROC curve, computed on the test set; higher is better.
  • Off-the-shelf: “readiness”;
    • “2”: limited setup, few parameters; e.g., CART, bagging, random forests,
    • “1”: not much setup, preprocessing, or variable selection,
    • “0”: more parameters to adjust, more research and optimization to perform, lots of preprocessing and variable selection; e.g., boosting, neural networks.
  • Readability: ensemble tree-based methods are difficult to interpret (“0”), whereas logit results are easy to interpret and use thanks to their coefficients (“2”).
  • Computation speed: ensemble methods, such as random forests, are computer-intensive; on the same computer, some algorithms may take up to 30 minutes to compute (“0”) compared to near-instantaneous algorithms (“2”).
  • Predictive methods involving principal components of Multiple Correspondence Analysis (MCA) are principal component regressions (PCR). The technique is based on a standard regression model, but instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables (from PCA, or MCA for categorical variables) are used as regressors, making PCR a regularized procedure. It works with OLS, logit, or tree-based methods. Often, the principal components with the highest variances (those based on the eigenvectors corresponding to the largest eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. Model selection techniques, such as cross-validation, are advisable for choosing the number of components.
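The PCR idea in the last note can be sketched in base R; a hypothetical illustration that substitutes PCA on numeric predictors and the built-in mtcars data for the book's MCA-on-categorical-predictors setup (FactoMineR):

```r
# Principal-component regression sketch: regress a binary target on the
# first k principal components of the predictors, not on the raw predictors.
X   <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])  # standardized predictors
pcs <- prcomp(X)                                      # principal components
k   <- 2                                              # kept components (to be chosen by model selection)
fit <- glm(mtcars$am ~ pcs$x[, 1:k], family = binomial)
```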

Table 2 – Performance measure, R packages and functions, description

No Predictive method AUC Test R packages/functions Description
1 Logistic regression (logit), automatic (unsupervised) 0.713 glm(), step() Ascending (or forward) stepwise selection based on BIC, k=log(number of observations).
2 Logistic regression (logit), automatic (unsupervised) 0.730 glm(), step() Ascending (or forward) stepwise selection based on AIC, k=2 (the step() default).
3 Logit, supervised 0.762 glm(), step() Ascending (or forward) stepwise selection based on Mallows Cp and BIC, k=2 (set by the supervisor).
4 Probit 0.758 glm() Based on the previously selected predictors.
5 Log-Log 0.763 glm() Based on the previously selected predictors.
6 Logit, supervised 0.765 glm(), step() Stepwise selection, with discretization optimized on the AUC; continuous and discrete variables are converted into factors (with labels and levels) – see section 3.3 and chapter 4 for automatic supervised discretization of continuous variables – e.g., ages are grouped into age categories.
7 Logit, global selection 0.765 leaps As an alternative to stepwise selection: leaps and bounds algorithm where Mallows Cp (p.71) and BIC (p.78) are optimized by changing the number of predictors to find the optimal numbers.
8 Logit, with all possible combinations 0.787 combinat Sweeping over all possible combinations of predictors.
9 Logit, principal components of Multiple Correspondence Analysis (MCA) 0.793 FactoMineR 6 components.
10 Logistic regression with the ridge shrinkage method 0.784 glmnet L^2 penalty, alpha=0; shrinks the coefficients to reduce the variance of the estimates.
11 Logistic regression with the lasso shrinkage method 0.774 glmnet, grplasso L^1 penalty, alpha=1 – there are several variants: relaxed lasso, SCAD, adaptive lasso, group lasso, elastic net (a ridge and lasso combo) – alternative packages: penalized, lasso2, lars, logistf.
12 Partial Least Squares logistic regression 0.780 plsRglm alternative to variance regularization, 1 component – in cases where k > n or missing values.
13 CART 0.740 tree, rpart 9 tree nodes – a higher AUC is attainable without pruning the tree – other tree-based methods: CHAID was followed by CART, then by C4.5, then by C5.0.
14 PRIM, principal components of Multiple Correspondence Analysis (MCA) 0.775 prim, FactoMineR Patient Rule Induction Method is an alternative to variance regularization; 6 components, alpha=0.5, beta=0.06 – alpha drives the peeling (reduction) and beta the pasting (augmentation), with alpha>beta.
15 Bagging 0.759 ipred Ensemble method with randomization.
16 Random Forest CART 0.783 randomForest Ensemble method with randomization – 500 iterations, out-of-bag (OOB) with replacement, node size=5.
17 Extra-Trees 0.786 extraTrees Ensemble method with extremely randomized trees – 500 iterations, out-of-bag (OOB) with replacement, node size=5.
18 Random Forest Logit 0.791 randomForest Ensemble method with randomization – 300 iterations, random selection of 2, 3 or 4 predictors out of 8, based on AIC.
19 Boosting CART 0.784 ada Real AdaBoost, exponential loss function, 1800 iterations, penalty=0.01 – ensemble method, deterministic and adaptive with successive improvements – several methods: discrete AdaBoost, Real AdaBoost, and its variant Arcing.
20 Boosting Logit 0.782 ada Real AdaBoost, exponential loss function, 3000 iterations, penalty=0.001.
21 SVM with linear kernel 0.743 e1071 Penalty parameter of the error term C=0.1 – generalized discriminant analysis, separator – alternative libraries: klaR, svmpath, LiblineaR.
22 SVM with linear kernel, principal components of Multiple Correspondence Analysis (MCA) 0.800 e1071, kernlab, FactoMineR 6 components, penalty parameter of the error term C=0.1.
23 SVM with polynomial kernel of second degree 0.778 e1071 gamma=0.02, C=2.9, coef0=-0.005.
24 SVM with polynomial kernel of third degree 0.776 e1071 gamma=0.049, C=0.99, coef0=-0.076.
25 SVM with Gaussian radial basis function kernel 0.772 e1071 gamma=0.0643, C=1.2.
26 SVM with Laplacian kernel 0.773 kernlab sigma=0.312, C=1.79.
27 SVM with sigmoid kernel 0.744 e1071 gamma=1/48, C=1.
28 Neural Networks 0.785 nnet 100 iterations, 1 hidden layer of size 2, weight decay=1.
29 Boosting Neural Networks 0.789 nnet, ada Real AdaBoost, 1 hidden layer perceptron of size 10, weight decay=0.1 – several methods: discrete AdaBoost, Real AdaBoost, and its variant Arcing.
30 Random Forest Neural Networks 0.795 nnet, randomForest 500 iterations, 1 hidden-layer perceptron of size 5 or 10, weight decay=0.1 or 0.01, randomly sampling 3 variables at each iteration.
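To make row 13 concrete, a minimal rpart sketch (rpart ships with R as a recommended package; the book's GermanCredit target is replaced here by a stand-in binary factor built from mtcars):

```r
library(rpart)  # CART implementation referenced in row 13

d <- transform(mtcars, am = factor(am))               # stand-in binary target
fit  <- rpart(am ~ mpg + wt + hp, data = d, method = "class",
              control = rpart.control(minsplit = 5))  # small data, allow splits
pred <- predict(fit, d, type = "prob")[, "1"]         # P(am = 1) per observation
```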

Table 3a – Bagging, Random Forest, and Boosting Specs

  • Mechanism – bagging: random, probabilistic; random forests: random, probabilistic; boosting: adaptive, generally deterministic.
  • Training sample – bagging and random forests: for each iteration, the ‘machine’ learns with a different bootstrap sample; boosting: for each iteration, the ‘machine’ learns with the full initial sample, except for the arcing variant (which resamples like bagging).
  • Predictors – bagging and boosting: for each iteration, the ‘machine’ learns with all predictors; random forests: with a random subset of the predictors.
  • Model fit – in all three methods, each iteration's model must perform well with all observations; a random forest iteration underperforms its bagging counterpart since only a subset of predictors is used; in boosting, some models perform well with the outliers but less well with the other observations.
  • Final aggregation – bagging and random forests: all generated models are equally weighted; boosting: each generated model is weighted according to its error rate.

Table 3b – Bagging, Random Forest, and Boosting Pros & Cons

  • Bias and variance – bagging: for a given bias, reduces the variance by averaging the models (high variance can cause overfitting); random forests: for a given bias, greatly reduces the variance by averaging the models; boosting: can reduce both the variance and the bias of the classifier (high bias can cause an algorithm to miss the relevant relations between features and target outputs, i.e. underfitting), although with a stable classifier the variance can increase for a given bias. Note: the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
  • Readability – all three methods are less readable when the classifier is a classification tree.
  • Stumps – bagging: does not handle stumps efficiently; random forests: handle stumps efficiently; boosting: handles stumps very efficiently. Note: a decision stump is a weak classification model with the simplest tree structure, a single split (a one-level decision tree); due to its simplicity, a stump alone often shows low predictive performance.
  • Convergence – bagging and random forests: iterations converge rapidly; boosting: iterations converge slowly (it can take 10 times more iterations).
  • Parallelism – bagging and random forests: the algorithm can compute in parallel; boosting: no parallel computing is possible since the algorithm is sequential (step by step). Note: parallel computing can be done with packages such as biglm, ff, ffbase, snow, etc.
  • Overfitting – bagging and random forests: no overfitting; they beat boosting when there is a large amount of noise; boosting: the overfitting risk increases with the number of iterations.
  • Overall – bagging: simple to set up, with fewer parameters, but the classifier underperforms the other methods; random forests: always better classifiers than bagging, and sometimes beat boosting when discrete (categorical) predictors are abundant; boosting: generally a better classifier than bagging when the amount of noise is limited.
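The bagging column above can be sketched in a few lines (assuming rpart, which ships with R; the hypothetical target is again a binary factor from mtcars rather than the book's credit data): fit B trees on bootstrap samples and average their class probabilities with equal weights.

```r
library(rpart)

set.seed(1)                                  # reproducible bootstrap draws
d <- transform(mtcars, am = factor(am))      # stand-in binary target
B <- 25                                      # number of bootstrap iterations
prob <- rowMeans(sapply(seq_len(B), function(b) {
  boot <- d[sample(nrow(d), replace = TRUE), ]               # bootstrap sample
  fit  <- rpart(am ~ mpg + wt + hp, data = boot, method = "class",
                control = rpart.control(minsplit = 5))
  predict(fit, d, type = "prob")[, "1"]                      # P(am = 1)
}))  # equal-weight aggregation of the B models
```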

Other R Packages

In addition to the packages referenced in the tables, the book uses:

  • More stats functions: MASS, boot, gmodels, car.
  • Visualization: corrplot, ggplot2, lattice, rgl.
  • Big Data & parallel computing: biglm, ff, ffbase, foreach, snow, doSNOW.
  • Classification: caret.
  • Association rules: arules, arulesViz.
  • Dimension reduction, MCA, PCA, etc.: ade4, MASS, FactoMineR.
  • Importing/writing SAS files: foreign.
  • AUC & ROC: pROC, ROCR.
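For the AUC & ROC entry, ROCR's prediction/performance pair is enough to reproduce the Table 1 metric (assuming ROCR is installed; toy scores and labels):

```r
library(ROCR)

# AUC from predicted scores and 0/1 labels with ROCR
pred <- prediction(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
auc  <- performance(pred, measure = "auc")@y.values[[1]]  # 0.75 on this toy input
```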

Some package descriptions

  • ade4: multivariate data analysis, graphical display.
  • arules: association rules, apriori algorithm, market basket analysis, data mining.
  • biglm: bounded-memory linear regression for data too large to fit in memory.
  • boot: bootstrapping, random resampling.
  • caret: preprocessing, classification & regression models, feature selection, resampling.
  • FactoMineR: dimension reduction, multivariate data analysis (PCA, MCA, factor analysis, etc.), graphical display.
  • ff, ffbase: data structures are stored on disk but behave as if they were in RAM.
  • foreach: looping construct that supports parallel execution.
  • foreign: read & write foreign files: SAS, SPSS, Stata, dBase, etc.
  • gmodels: various tools for model fitting.
  • missForest: nonparametric missing-value imputation using random forests.
  • rgl: 3D interactive graphics.

Book Content and Translation

Présentation du jeu de données. Préparation des données. Exploration des données. Discrétisation automatique supervisée des variables continues. La régression logistique. La régression logistique pénalisée ridge. La régression logistique pénalisée lasso. La régression logistique PLS. L’arbre de décision CART. L’algorithme PRIM. Les forêts aléatoires. Le bagging. Les forêts aléatoires de modèles logistiques. Le boosting. Les Support Vector Machines. Les réseaux de neurones. Synthèse des méthodes prédictives. Annexes. Bibliographie. Index des packages R utilisés.

  • C1, describe the dataset.
  • C2, import and prepare the data.
  • C3, explore the data.
  • C4, automatic supervised discretization of continuous variables.
  • C5, logit, probit, log-log models.
  • C6, penalization methods; ridge regression.
  • C7, penalization methods; lasso regressions.
  • C8, PLS logistic regression.
  • C9, tree-based methods.
  • C10, PRIM.
  • C11, ensemble methods; random forests.
  • C12, ensemble methods; bagging.
  • C13, ensemble methods; random forest logits.
  • C14, ensemble methods; boosting.
  • C15, SVM.
  • C16, neural networks.
  • Appendices: MCA, association rules and apriori algorithm for web mining and market basket analysis, credit scoring.

Books from the same Author