Foreword
“Modélisation prédictive et apprentissage statistique avec R” means “Predictive modeling and statistical learning with R”. Statistical learning is synonymous with machine learning.
The author implements 30 machine learning techniques on the same dataset and with the same objective: predicting a binary dependent variable. In other words, the same case study is repeated in 30 variations, from classical, proven techniques to more recent, sophisticated algorithms.
The case is about credit scoring. The author uses the ‘German Credit Data’ dataset, which is publicly available in many repositories (for example, the UCI Machine Learning Repository) and is also included in the caret package (after loading the package with library(caret), load the dataset with data(GermanCredit)).
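Following the pointer above, the dataset can be loaded and split into training and test sets. The 70/30 ratio and the seed below are illustrative assumptions, not necessarily the split used in the book:

```r
# Load the German Credit Data shipped with the caret package and
# create an illustrative train/test split (70/30, arbitrary seed).
library(caret)
data(GermanCredit)

set.seed(1)
n <- nrow(GermanCredit)                       # 1000 credit applicants
train_idx <- sample(n, size = round(0.7 * n))
train <- GermanCredit[train_idx, ]
test  <- GermanCredit[-train_idx, ]

table(train$Class)   # binary target: Good / Bad credit risk
```

The `Class` column is the binary dependent variable that all 30 techniques try to predict.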
The techniques can be grouped into four categories.
For each category, the author provides the AUC results on the test set and identifies the best performing techniques. The author discusses each of the 30 techniques: their specificities, procedures, computations, interpretations, tips, advantages and drawbacks.
Not only can we learn how to implement these techniques, but we can also compare the procedures and the results. The author also releases the R code.
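Since every technique in the book is compared by its test-set AUC, a minimal base-R sketch of how the AUC can be computed (via the rank/Wilcoxon statistic) may help; the scores and labels below are made-up illustrative values, not results from the book:

```r
# Test-set AUC via the rank (Wilcoxon) statistic, base R only.
auc <- function(scores, labels) {
  # labels: 1 = positive class, 0 = negative class
  r <- rank(scores)                 # ranks of all predicted scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # sum of positive-class ranks, shifted and normalized to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
auc(scores, labels)   # 1 of the 9 positive/negative pairs is mis-ordered -> 8/9
```

This is the probability that a randomly chosen positive case is scored above a randomly chosen negative case; packages such as pROC or ROCR (used in the book) compute the same quantity.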
See the table endnotes.
| No | Predictive method | AUC Test | Chap | Page | Off-the-shelf | Readability | Computation speed | Overall rating | Best |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Logistic regression (logit), automatic (unsupervised) | 0.713 | 5.1 | 61 | 1 | 2 | 2 | 5 | |
| 2 | Logistic regression (logit), automatic (unsupervised) | 0.730 | 5.1 | 62 | 1 | 2 | 2 | 5 | |
| 3 | Logit, supervised | 0.762 | 5.1 | 63 | 0 | 2 | 2 | 4 | |
| 4 | Probit | 0.758 | 5.1 | 110 | 0 | 2 | 2 | 4 | |
| 5 | Log-Log | 0.763 | 5.11 | 113 | 0 | 2 | 2 | 4 | |
| 6 | Logit, supervised | 0.765 | 5.5-5.9 | 91, 104 | 0 | 2 | 2 | 4 | |
| 7 | Logit, global selection | 0.765 | 5.2-5.3 | 80 | 0 | 2 | 1 | 3 | |
| 8 | Logit, with all possible combinations | 0.787 | 5.4 | 86 | 1 | 2 | 1 | 4 | |
| 9 | Logit, principal components of Multiple Correspondence Analysis (MCA) | 0.793 | | | 1 | 2 | 2 | 5 | max AUC |
| 10 | Logistic regression with the ridge shrinkage method | 0.784 | 6 | 134 | 1 | 2 | 2 | 5 | |
| 11 | Logistic regression with the lasso shrinkage method | 0.774 | 7 | 158 | 1 | 2 | 2 | 5 | |
| 12 | Partial Least Squares logistic regression | 0.780 | 8 | 161 | 1 | 2 | 2 | 5 | |
| 13 | CART | 0.740 | 9 | 192 | 2 | 2 | 2 | 6 | |
| 14 | PRIM, principal components of Multiple Correspondence Analysis (MCA) | 0.775 | 10 | 213 | 2 | 1 | 1 | 4 | |
| 15 | Bagging | 0.759 | 12 | 249 | 2 | 0 | 1 | 3 | |
| 16 | Random Forest CART | 0.783 | 11.6 | 243 | 2 | 0 | 1 | 3 | |
| 17 | Extra-Trees | 0.786 | 11.7 | 246 | 2 | 0 | 2 | 4 | |
| 18 | Random Forest Logit | 0.791 | 13 | 257-258 | 2 | 2 | 0 | 4 | max AUC |
| 19 | Boosting CART | 0.784 | 14.3 | 282, 289 | 1 | 0 | 0 | 1 | |
| 20 | Boosting Logit | 0.782 | 14.5 | 289 | 1 | 2 | 0 | 3 | |
| 21 | SVM with linear kernel | 0.743 | 15.2, 15.3 | 302, 317 | 0 | 2 | 1 | 3 | |
| 22 | SVM with linear kernel, principal components of Multiple Correspondence Analysis (MCA) | 0.800 | 15.8, 15.9 | 329 | 0 | 2 | 1 | 3 | max AUC |
| 23 | SVM with polynomial kernel of second degree | 0.778 | 15.7 | 313, 317 | 0 | 0 | 1 | 1 | |
| 24 | SVM with polynomial kernel of third degree | 0.776 | 15.7 | 316, 317 | 0 | 0 | 1 | 1 | |
| 25 | SVM with Gaussian radial basis function kernel | 0.772 | 15.4 | 317 | 0 | 0 | 1 | 1 | |
| 26 | SVM with Laplacian basis function kernel | 0.773 | 15.5 | 311 | 0 | 0 | 1 | 1 | |
| 27 | SVM with sigmoid kernel | 0.744 | 15.6 | 313, 317 | 0 | 0 | 1 | 1 | |
| 28 | Neural Networks | 0.785 | 16.2 | 338 | 0 | 0 | 1 | 1 | |
| 29 | Boosting Neural Networks | 0.789 | 16.4 | 353 | 0 | 0 | 0 | 0 | |
| 30 | Random Forest Neural Networks | 0.795 | 16.5 | 360-2 | 0 | 0 | 0 | 0 | max AUC |
Notes
| No | Predictive method | AUC Test | R libraries | R basic functions | Description |
|---|---|---|---|---|---|
| 1 | Logistic regression (logit), automatic (unsupervised) | 0.713 | | glm(), step() | Ascending (forward) stepwise selection based on BIC, k=log(number of observations). |
| 2 | Logistic regression (logit), automatic (unsupervised) | 0.730 | | glm(), step() | Ascending (forward) stepwise selection based on AIC, k=2. |
| 3 | Logit, supervised | 0.762 | | glm(), step() | Ascending (forward) stepwise selection based on Mallows' Cp and BIC, k=2 (set by the analyst). |
| 4 | Probit | 0.758 | | glm() | Based on the previously selected predictors. |
| 5 | Log-Log | 0.763 | | glm() | Based on the previously selected predictors. |
| 6 | Logit, supervised | 0.765 | | glm(), step() | Stepwise selection with discrimination optimized via the AUC; continuous and discrete variables are converted into factors (with labels and levels) – see section 3.3 and chapter 4 for the automatic supervised discretization of continuous variables – e.g., ages are grouped into age categories. |
| 7 | Logit, global selection | 0.765 | leaps | | An alternative to stepwise selection: the leaps-and-bounds algorithm, where Mallows' Cp (p. 71) and BIC (p. 78) are optimized over the number of predictors to find the optimal subset. |
| 8 | Logit, with all possible combinations | 0.787 | combinat | | Sweeping over all possible combinations of predictors. |
| 9 | Logit, principal components of Multiple Correspondence Analysis (MCA) | 0.793 | FactoMineR | | 6 components. |
| 10 | Logistic regression with the ridge shrinkage method | 0.784 | glmnet | | L2 penalty, alpha=0; variance regularization against heteroscedasticity (the increasing variance of the error term). |
| 11 | Logistic regression with the lasso shrinkage method | 0.774 | glmnet, grplasso | | L1 penalty, alpha=1; variance regularization – several variants exist: relaxed lasso, SCAD, adaptive lasso, group lasso, elastic net (a ridge and lasso combination) – alternative packages: penalized, lasso2, lars, logistf. |
| 12 | Partial Least Squares logistic regression | 0.780 | plsRglm | | An alternative to variance regularization, 1 component – for cases where k > n or with missing values. |
| 13 | CART | 0.740 | tree, rpart | | 9 tree nodes – a higher AUC is attainable without pruning the tree – other tree-based methods: CHAID was followed by CART, then C4.5, then C5.0. |
| 14 | PRIM, principal components of Multiple Correspondence Analysis (MCA) | 0.775 | prim, FactoMineR | | The Patient Rule Induction Method is an alternative to variance regularization; 6 components, alpha=0.5, beta=0.06 – a spanning tree in a 2-dimensional space – alpha shrinks and beta expands the box, alpha>beta. |
| 15 | Bagging | 0.759 | ipred | | Ensemble method with randomization. |
| 16 | Random Forest CART | 0.783 | randomForest | | Ensemble method with randomization – 500 iterations, out-of-bag (OOB) sampling with replacement, node size=5. |
| 17 | Extra-Trees | 0.786 | extraTrees | | Ensemble method with extremely randomized trees – 500 iterations, out-of-bag (OOB) sampling with replacement, node size=5. |
| 18 | Random Forest Logit | 0.791 | randomForest | | Ensemble method with randomization – 300 iterations, random selection of 2, 3 or 4 predictors out of 8, based on AIC. |
| 19 | Boosting CART | 0.784 | ada | | Real AdaBoost, exponential loss function, 1800 iterations, penalty=0.01 – ensemble method, deterministic and adaptive with successive improvements – several variants: Discrete AdaBoost, Real AdaBoost, and the arcing variant. |
| 20 | Boosting Logit | 0.782 | ada | | Real AdaBoost, exponential loss function, 3000 iterations, penalty=0.001. |
| 21 | SVM with linear kernel | 0.743 | e1071 | | Penalty parameter of the error term C=0.1 – generalized discriminant analysis, separator – alternative libraries: klaR, svmpath, LiblineaR. |
| 22 | SVM with linear kernel, principal components of Multiple Correspondence Analysis (MCA) | 0.800 | e1071, kernlab, FactoMineR | | 6 components, penalty parameter of the error term C=0.1. |
| 23 | SVM with polynomial kernel of second degree | 0.778 | e1071 | | gamma=0.02, C=2.9, coef0=-0.005. |
| 24 | SVM with polynomial kernel of third degree | 0.776 | e1071 | | gamma=0.049, C=0.99, coef0=-0.076. |
| 25 | SVM with Gaussian radial basis function kernel | 0.772 | e1071 | | gamma=0.0643, C=1.2. |
| 26 | SVM with Laplacian basis function kernel | 0.773 | kernlab | | sigma=0.312, C=1.79. |
| 27 | SVM with sigmoid kernel | 0.744 | e1071 | | gamma=1/48, C=1. |
| 28 | Neural Networks | 0.785 | nnet | | 100 iterations, 1 hidden layer of size 2, weight decay=1. |
| 29 | Boosting Neural Networks | 0.789 | nnet, ada | | Real AdaBoost, 1 hidden-layer perceptron of size 10, weight decay=0.1 – several variants: Discrete AdaBoost, Real AdaBoost, and the arcing variant. |
| 30 | Random Forest Neural Networks | 0.795 | nnet, randomForest | | 500 iterations, 1 hidden-layer perceptron of size 5 or 10, weight decay=0.1 or 0.01, randomly sampling 3 variables at each iteration. |
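Techniques 1–3 in the table rely on glm() and step(); in step(), the penalty argument k selects the criterion (k = 2 for AIC, k = log(n) for BIC). A minimal sketch on synthetic data — the book applies ascending selection to the German Credit predictors; for brevity this starts from a full model and searches in both directions:

```r
# Stepwise logit selection: k = 2 gives AIC, k = log(n) gives BIC.
# Synthetic data for illustration only; x3 is pure noise.
set.seed(42)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 - 0.6 * x2))

full <- glm(y ~ x1 + x2 + x3, family = binomial)

fit_aic <- step(full, direction = "both", k = 2,      trace = 0)  # AIC
fit_bic <- step(full, direction = "both", k = log(n), trace = 0)  # BIC

names(coef(fit_bic))  # BIC tends to keep only the informative predictors
```

The heavier log(n) penalty explains why the BIC model (technique 1) is more parsimonious than the AIC model (technique 2).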
| Bagging method | Random Forests | Boosting methods |
|---|---|---|
| Random mechanism – probabilistic. | Random mechanism – probabilistic. | Adaptive mechanism – generally deterministic. |
| For each iteration, the ‘machine’ learns with a different bootstrap sample. | For each iteration, the ‘machine’ learns with a different bootstrap sample. | For each iteration, the ‘machine’ learns with the full initial sample, except for the arcing method (which resamples, like bagging). |
| For each iteration, the ‘machine’ learns with all predictors. | For each iteration, the ‘machine’ learns with a random subset of all predictors. | For each iteration, the ‘machine’ learns with all predictors. |
| For each iteration, the model must perform well with all observations. | For each iteration, the model must perform well with all observations – each individual model underperforms its bagging counterpart, since only a subset of predictors is used. | For each iteration, the model must perform well with all observations – some models perform well on outliers, but less well on the other observations. |
| In the final aggregation, all generated models are equally weighted. | In the final aggregation, all generated models are equally weighted. | In the final aggregation, each generated model is weighted according to its error rate. |
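A minimal base-R sketch of the bagging column above — equal-weight aggregation of models fit on bootstrap samples. Logistic regression stands in as the base learner for brevity (the book bags trees), and the data are synthetic:

```r
# Bagging in miniature: B bootstrap fits, equally weighted aggregation.
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
d <- data.frame(x, y)

B <- 50
preds <- replicate(B, {
  boot <- d[sample(n, replace = TRUE), ]           # a different bootstrap sample
  m <- glm(y ~ x, family = binomial, data = boot)  # learner sees all predictors
  predict(m, newdata = d, type = "response")
})
bagged <- rowMeans(preds)   # all B models equally weighted
range(bagged)               # averaged probabilities stay in [0, 1]
```

Averaging the B predictions reduces variance at a fixed bias, which is exactly the trade-off the tables describe.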
| Bagging method | Random Forests | Boosting methods | Notes |
|---|---|---|---|
| For a given bias, it reduces the variance by averaging the models (high variance can cause overfitting). | For a given bias, it greatly reduces the variance by averaging the models (high variance can cause overfitting). | Can reduce both the variance and the bias of the classifier (high bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)). However, with a stable classifier, for a given bias, the variance can increase. | The bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set. |
| Less readable when the classifier is a classification tree. | Less readable when the classifier is a classification tree. | Less readable when the classifier is a classification tree. | |
| Does not handle stumps efficiently. | Handles stumps efficiently. | Handles stumps very efficiently. | A decision stump is a weak classification model (among all the other generated models) with the simplest tree structure: a single split, i.e. a one-level decision tree. Due to its simplicity, a stump alone usually shows low predictive performance. |
| Iterations converge rapidly. | Iterations converge rapidly. | Iterations converge slowly (can take 10 times more iterations). | |
| The algorithm can compute in parallel. | The algorithm can compute in parallel. | No parallel computing possible since the algorithm is sequential (step by step). | Parallel computing can be done with packages biglm, ff, ffbase, snow, etc. |
| No overfitting – beats boosting when there is a large amount of noise. | No overfitting. | The overfitting risk increases with the number of iterations. | |
| Simple to set up, with fewer parameters, but the classifier underperforms the other methods. | Random forests are always better classifiers than bagging, and sometimes beat boosting when discrete (categorical or factor) predictors are abundant. | Boosting is generally a better classifier than bagging when the amount of noise is limited. | |
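To make the boosting column concrete, here is a minimal Discrete AdaBoost sketch in base R with decision stumps as the weak learners: each stump's vote is weighted from its error rate, and the observation weights are re-weighted at every iteration. The one-dimensional synthetic data are illustrative only; the book's boosting uses the ada package:

```r
# Discrete AdaBoost with one-split "stumps" as weak learners.
set.seed(3)
n <- 200
x <- runif(n)
y <- ifelse(x > 0.5, 1, -1)
flip <- sample(n, 20)
y[flip] <- -y[flip]                              # 10% label noise

stump <- function(x, t, dir) dir * sign(x - t)   # one-level decision tree

w <- rep(1 / n, n)                               # observation weights
M <- 20
alphas <- numeric(M); ts <- numeric(M); dirs <- numeric(M)
for (m in 1:M) {
  # pick the stump minimizing the weighted error
  best <- list(err = Inf)
  for (t in seq(0.05, 0.95, by = 0.05)) for (dir in c(-1, 1)) {
    err <- sum(w * (stump(x, t, dir) != y))
    if (err < best$err) best <- list(err = err, t = t, dir = dir)
  }
  alpha <- 0.5 * log((1 - best$err) / best$err)  # model weight from its error rate
  w <- w * exp(-alpha * y * stump(x, best$t, best$dir))
  w <- w / sum(w)                                # re-weight the observations
  alphas[m] <- alpha; ts[m] <- best$t; dirs[m] <- best$dir
}

# final classifier: error-weighted vote of the M stumps
score <- rowSums(sapply(1:M, function(m) alphas[m] * stump(x, ts[m], dirs[m])))
mean(sign(score) == y)   # training accuracy
```

The sequential dependence of each round on the previous weights is why, as the table notes, boosting cannot be parallelized the way bagging and random forests can.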
In addition to the packages referenced in the tables, the book uses: MASS, boot, gmodels, car, corrplot, ggplot2, lattice, rgl, biglm, ff, ffbase, foreach, snow, doSNOW, caret, arules, arulesViz, ade4, FactoMineR, foreign, pROC, ROCR.

Some package descriptions:
- ade4: multivariate data analysis, graphical display.
- arules: association rules, apriori algorithm, market basket analysis, data mining.
- biglm: bounded-memory linear regression for data too large to fit in memory.
- boot: bootstrapping, random resampling.
- caret: preprocessing, classification & regression models, feature selection, resampling.
- FactoMineR: dimension reduction, multivariate data analysis (PCA, MCA, factor analysis, etc.), graphical display.
- ff, ffbase: data structures stored on disk, but behaving as if they were in RAM.
- foreach: loops.
- foreign: read & write foreign files: SAS, SPSS, Stata, dBase, etc.
- gmodels: model fitting.
- missForest: nonparametric imputation of missing values using random forests.
- rgl: 3D interactive graphics.

Table of contents (translated from the French): Presentation of the dataset. Data preparation. Data exploration. Automatic supervised discretization of continuous variables. Logistic regression. Ridge penalized logistic regression. Lasso penalized logistic regression. PLS logistic regression. The CART decision tree. The PRIM algorithm. Random forests. Bagging. Random forests of logistic models. Boosting. Support Vector Machines. Neural networks. Summary of the predictive methods. Appendices. Bibliography. Index of the R packages used.