Foreword



‘Data Storytelling’ covers the most common techniques. As with any movie, there are cut scenes.

Here is a review of the special cases. We sketch a ‘big picture’, since it is impossible to cover it all.

We provide examples of why and where these techniques are mainly applied. Without pretending to unveil all the details, we provide some leads to functions from the base stats package and to extra packages.

Although the focus is on R, the techniques can be applied in Python. The important thing is to know what to look for. R is a constellation of competing and complementary packages, while Python is more streamlined. The Python scikit-learn library, for example, encompasses the functionality of many R packages.

We can find more functions and packages by topic and field: consult the CRAN Task Views.

Applications

The literature mostly comes from the academic sector, and most of the examples are scientific cases. The business sector, from large corporations to SMEs, should consider these special applications as data becomes more available. Computing power and IT costs are no longer an obstacle for SMEs; most tools are open source.

Marketing teams, for instance, can benefit from econometric methods:

  • Slicing and dicing marketing data with specialized packages (reshape2, tidyr, dplyr, data.table) the way you would with SQL databases, OLAP cubes, and PivotTables.
  • Using probabilistic techniques. Business cases are usually deterministic: drivers are single numbers or percentages. What if drivers were instead distributions fitted to observed behaviour (normal, uniform, or triangular)? Combining these random drivers and measuring the results amounts to a ‘bootstrap’ or a ‘Monte Carlo’ simulation.
  • Doing clustering, classification and segmentation of customers, products, features with clusters, collaborative filters, multidimensional scaling, classification trees and tree-based methods, other classification algorithms (naive Bayes, discriminant analysis, k-means, and much more).
  • Measuring price elasticity prior to launching a service. Data and observations can take any form, from normal to nonparametric distributions. Estimating market demand.
  • Estimating pricing strategies (bundling, nonlinear, skimming, penetration).
  • Analyzing consumers, whether B2C or B2B using conjoint analysis and dimension-reduction techniques, factor analysis, and principal component analysis.
  • Revenue management models, costing models, business models, corporate models.
  • Running comparative tests in a lab or in social science is like running A/B tests in web development. Tests can be parametric or nonparametric. Measuring marketing campaigns, advertising effectiveness, website metrics and other statistics.
  • Applying association analysis, market basket analysis, ROC and lift charts to retailing, even for smaller retailers.
  • Profiling customer behaviour with logistic regressions, tree-based methods, and classification techniques.
  • Analyzing consumer choice and behaviour with logit, probit, log-log models.
  • Assessing market penetration with adoption predictions, the BASS model or the logistic model.
  • Managing subscribers with survival analysis (churn); computing business cases with the average client lifetime (duration); calculating profitability (CLV or LTV); devising win-back campaigns (2LTV).
  • Estimating market potential, price changes, and increase in production with simultaneous equations: the industry supply and the consumer demand.
  • Filling the gaps, replacing missing values, dealing with outliers, estimating extreme cases, calculating risks, etc.
  • Forecasting. Forecasting using time series, regressions, neural networks, k-NN. Forecasting inventory, subscribers, production, units, etc.
  • Gauging advertising response: customer acquisition rate, response rate, repurchase and customer retention.
  • Understanding market preferences, customer choices, product positioning, store placement with dimension reduction methods and principal component regressions.
  • Exploring social networks.
  • Geomarketing, geoprocessing, geostatistics, map mashups, etc.
  • Allocating and optimizing retail space, delivery routes, media selection, and sales resources.
  • Analyzing sentiment with text mining and topic modeling.
  • And more…
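
The probabilistic-drivers idea above can be made concrete with a short Monte Carlo sketch. Python is used here (the techniques carry over from R, as noted in the foreword); the distributions and all figures are invented for illustration.

```python
import random
import statistics

def simulate_profit(n_trials=10_000, seed=42):
    """Monte Carlo business case: drivers are distributions, not point
    estimates. All figures are invented for illustration."""
    rng = random.Random(seed)
    profits = []
    for _ in range(n_trials):
        units = rng.triangular(800, 1200, 1000)   # demand: triangular
        price = rng.normalvariate(10.0, 0.5)      # price: normal
        unit_cost = rng.uniform(6.0, 7.0)         # cost: uniform
        profits.append(units * (price - unit_cost))
    # mean profit is near 1000 * (10 - 6.5) = 3500; the quantiles
    # describe the spread a deterministic business case would hide
    return statistics.mean(profits), statistics.quantiles(profits, n=20)

mean_profit, q = simulate_profit()
```

Replacing the parametric draws with resampling (with replacement) from observed data would turn the same loop into a bootstrap.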

Source

  • Marketing Analytics, A Practical Guide to Real Marketing Science, Kogan Page, 2015.
  • Statistical Methods in Customer Relationship Management, Wiley, 2012.
  • Marketing Data Science, Modeling Techniques in Predictive Analytics with R and Python, Pearson FT Press, 2015.
  • Marketing Research, Wiley, 2011.
  • Predictive Marketing: Easy Ways Every Marketer Can Use Customer Analytics and Big Data, Wiley, 2015.
  • Marketing Metrics: The Manager’s Guide to Measuring Marketing Performance, 3rd Edition, Pearson FT Press, 2015.
  • Data Mining and Business Analytics with R, Wiley, 2013.
  • Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer, 2nd edition, 2015.

Mastering metrics

The field of applied econometrics (marketing stats/maths) encompasses the statistical and mathematical methods economists (quantitative marketers) use to untangle cause and effect in human (consumer) affairs. The book connects the dots between mathematical formulas, statistical methods, and real-world policy analysis.

Source: Mastering Metrics, Princeton University Press, 2014.

Nonparametric methods

When a histogram or a kernel density estimation depicts an unknown data pattern or distribution, we have a nonparametric distribution.

A normal distribution has one central hump and two tails, with roughly 68% of the observations within one standard deviation of the mean. A nonnormal distribution can have one or many of these features:

  • More than one hump,
  • Abnormally fat tails, where more than the expected 32% (1 - 68%) of the observations lie beyond one standard deviation,
  • Series of highs and lows, mimicking wavelets,
  • Chaotic patterns, financial returns, risks on derivatives where extreme cases are more common than in a normal distribution,
  • Different ‘regimes’ with a stagnant growth followed by an accelerating growth (a ‘hockey-stick’ pattern),
  • And much more.

Example: we might think nonparametric patterns are mostly a feature of time series, but we also find them in cross-sectional, longitudinal, and panel data (everywhere!). We find these ‘exotic’ patterns not only in finance, with derivative instruments, but also in natural science (environmental data, for example) and social science.

We need predictors that do not take a predetermined form (based on a Normal distribution) but are constructed according to information derived from the data.
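
As a concrete illustration of the kernel density estimation mentioned above, here is a minimal Gaussian-kernel sketch in Python; the bandwidth h and the bimodal sample are invented.

```python
import math
import random

def gaussian_kde(sample, h):
    """Return a function estimating the density at x by summing a
    Gaussian kernel of bandwidth h centred on every observation."""
    n = len(sample)
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    def f(x):
        return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return f

rng = random.Random(1)
# a bimodal, 'two-hump' sample: clearly not normal
data = ([rng.gauss(-2, 0.5) for _ in range(300)]
        + [rng.gauss(2, 0.5) for _ in range(300)])
f = gaussian_kde(data, h=0.3)
# the estimated density is high at the two humps, low in the valley between
```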

Here is a tentative list of nonparametric methods:

  • Kernel regressions: the continuous dependent variable is estimated from a limited set of data points by weighting (‘blurring’) the influence of nearby observations, so that their values can be used to predict values at nearby locations. It resembles k-Nearest Neighbours. It’s advisable to use statistical learning and model selection techniques (such as monitoring diagrams, cross-validation). Lead: the ksmooth function and the ibr package.
  • Nonparametric multiplicative regressions: based on Kernel regressions. Lead: assist package.
  • k-Nearest Neighbours: an alternative to classical regressions. k-NNs can ‘replace’ missing/unobserved instances based on a number k of points in the training set which are nearest to unseen instances. Lead: the knn function.
  • Support vector machines (SVMs): both classifying and regression methods. Leads: the e1071, kernlab packages.
  • Tree-based models:
    • Classification trees resemble logistic regressions (and other generalized linear models applying a link function such as probit or loglog). The goal is to classify binary (true or false, 1 or 0), categorical (a, b or c) or ordinal (1, 2, 3, 4) outcomes, as k-NNs and SVMs also do.
    • Regression trees can interpolate (‘fill in gaps’) based on the actual non-categorical observations, as k-NNs do. Tree-based regression modeling is a vast field: recursive partitioning (commonly called CART), Random Forests, bagging, boosting and other ensemble methods are among the many techniques.
    • Leads: the rpart, party, randomForest, rpartOrdinal, tree, maptree, evtree, varSelRF, CORElearn, longRPart, REEMtree, caret packages.
  • Local regressions (or local polynomials):
    • We can capture lots of phenomena by transforming the independent and/or the dependent variables: logarithm, exponential, power, derivative, etc. We then obtain alternative forms. Among the most common forms are the Cobb-Douglas, the quadratic, the growth, the power, the logistic and the BASS models. In other cases, these transformations are not enough.
    • LOESS (locally estimated scatterplot smoothing). Lead: the loess function; also integrated in the plotting functions.
    • LOWESS (locally weighted scatterplot smoothing). Lead: the lowess function; also integrated in the plotting functions.
    • These methods combine multiple regression models in a k-Nearest-Neighbor-based meta-model. LOESS is a generalization of LOWESS. Both methods can be built on linear and nonlinear least squares regression. LOESS combines much of the simplicity of linear least squares regression with the flexibility of nonlinear regression.
    • We can extend the field of local regressions to splines and wavelets. These methods consist in using mathematical functions for interpolation or smoothing. Leads: the smooth.spline, splinefun functions and the splines, wavelets packages.
  • Gaussian process regressions (or kriging): smoothing functions. Leads: the gptk, kriging, gstat packages.
  • Generalized Additive Models (GAM): extend linear models with smooth functions of the predictors; often applied to time series. Leads: the gam, mgcv packages.
  • Moving least squares: similar to local regressions. They consist in reconstructing continuous functions from a set of unorganized point samples via the calculation of a weighted least squares measure. Lead: the earth package.
  • Multivariate adaptive regression splines (MARS): a regression technique that can be seen as an extension of linear models and that automatically models nonlinearities and interactions between variables. MARS builds a model in two phases: the forward and the backward pass. This two-stage approach is the same as that used by recursive partitioning trees (CART). CART can be either a classifier (for categorical dependent variables) or a regressor (for non-categorical data). MARS can be seen as a generalization of CART that allows the model to better handle non-categorical data. Lead: the earth package.
  • Segmented regressions (or piecewise regression or ‘broken-stick regression’): the independent variable is partitioned into intervals and a separate line segment is fit to each interval. Lead: the segmented package.
  • Stepwise regressions: the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. The most common criteria are: a sequence of F-tests or t-tests, adjusted R2, Akaike information criterion, Bayesian information criterion and Mallows’s Cp. Leads: the DAAG, MASS, leaps, relaimpo packages.
  • Isotonic regressions (or monotonic regressions): fitting a free-form line to a sequence of observations under the constraint that the fitted line is non-decreasing (or non-increasing) everywhere. Isotonic regressions have applications in statistical inference, for instance. Lead: the iso package.
  • Semiparametric regressions: include regression models that combine parametric and nonparametric models. The most popular methods are the partially linear or partial least squares, the index, and the varying coefficient models. Leads: the pls, np, mgcv, gamair packages.
  • Survival analysis: the Kaplan-Meier estimator makes no assumption about the shape of the hazard function, as opposed to parametric methods relying on Weibull, lognormal, or log-logistic distributions, for instance. The Cox regression and piecewise constant exponential models are semiparametric, as they make no assumption about the shape of the baseline hazard function. Leads: the eha, survival, survsim packages.
  • Nonlinear regressions: a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations. Leads: the nlme package.
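
To make the nonparametric idea tangible, here is a k-Nearest-Neighbours regression in a few lines of Python: no functional form is assumed, only the data speaks. The ‘hockey-stick’ data are invented for illustration.

```python
def knn_regress(train, x_new, k=3):
    """Predict y at x_new as the average y of the k nearest training
    points; train is a list of (x, y) pairs."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

# a 'hockey-stick' pattern that no single straight line fits well
train = [(float(x), 1.0 if x < 5 else 1.0 + 2.0 * (x - 5)) for x in range(11)]
flat = knn_regress(train, 2.5)    # flat regime: about 1
steep = knn_regress(train, 9.0)   # accelerating regime: about 9
```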

Again, consult the CRAN Task Views for a list of topics and fields grouping packages and functions.

Source:

  • Wikipedia.
  • Régression avec R, Springer, 2011.
  • Data science, fondamentaux et études de cas, machine learning avec Python et R, Eyrolles, 2015.
  • Data Mining and Business Analytics with R, Wiley, 2013.
  • Basic Econometrics, 5th edition, McGraw-Hill, 2010.
  • Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer, 2nd edition, 2015.
  • Econometrics by Examples, Palgrave Macmillan, 2014.

Regularization

It is not unusual to see the number of input variables (k) greatly exceed the number of observations (n).

Example: we find this phenomenon in microarray data, analytical chemistry, chemometrics, and environmental pollution studies. In food processing, we may extract 700 explanatory variables obtained by spectral analysis of just 40 cookies! Food is more than just carbs, fat, sodium, and vitamins…

In any case, with many predictors (k), fitting the full model without penalization will result in large prediction intervals and/or multicollinearity. The corrective measure is to use regularization or shrinkage:

  • Ridge regressions (or Tikhonov regularization): shrink the coefficient estimates of highly correlated variables toward zero (without setting them exactly to zero). The shrinkage parameter is impossible to determine a priori. It’s advisable to use the results of ridge regression (the set of coefficient estimates) with statistical learning and model selection techniques (such as monitoring diagrams, cross-validation) to determine the most appropriate model for the given data. Lead: the lm.ridge function from the MASS package.
  • Least Absolute Shrinkage and Selection Operator (LASSO) and Least Angle Regression (LARS) models: alternative regularized versions of least squares. Ridge regression and its alternatives each have flaws: LASSO has advantages over ridge regression but also drawbacks of its own, and the same goes for LARS. The Elastic-net Regularized Generalized Linear Model combines the ridge and LASSO penalties. Leads: the glmnet, lars, elasticnet packages.
  • Principal component regressions (PCR): are regression analysis techniques that are based on principal component analysis (PCA). It is based on a standard linear regression model, but instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors, thus making PCR a regularized procedure. Often, the principal components with higher variances (the ones based on eigenvectors corresponding to the higher eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. It’s advisable to use statistical learning and model selection techniques. Lead: the pls package.
  • Partial Least Squares (PLS) regressions: the method bears some relation with PCR; it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. In both PCR and PLS, we must understand how PCA works. In PCA, the number of principal components is less than the number of original variables (the independent variables) or the number of observations. In other words, it is like compressing a 3D space into a 2D plane for instance. It’s advisable to use statistical learning and model selection techniques. Partial Least Squares Discriminant Analysis (PLS-DA) is a variant used when the dependent variable is categorical. Lead: the pls package.
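
The shrinkage effect can be seen in a closed-form ridge sketch: beta = (X'X + lam I)^(-1) X'y, written out for two centred predictors so the 2x2 inverse fits on one line. The nearly collinear data are invented.

```python
def ridge_2pred(X, y, lam):
    """Closed-form ridge for two centred predictors:
    beta = (X'X + lam*I)^(-1) X'y, with the 2x2 inverse written out."""
    s11 = sum(x1 * x1 for x1, _ in X)
    s22 = sum(x2 * x2 for _, x2 in X)
    s12 = sum(x1 * x2 for x1, x2 in X)
    g1 = sum(x1 * yi for (x1, _), yi in zip(X, y))
    g2 = sum(x2 * yi for (_, x2), yi in zip(X, y))
    a, b, d = s11 + lam, s12, s22 + lam
    det = a * d - b * b
    return ((d * g1 - b * g2) / det, (a * g2 - b * g1) / det)

# two nearly collinear predictors (the multicollinearity case above)
X = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
y = [-4.1, -2.0, 0.0, 2.1, 4.0]
lo_pen = ridge_2pred(X, y, 0.01)   # close to the unstable OLS fit
hi_pen = ridge_2pred(X, y, 10.0)   # shrunk, and shared across the pair
```

Raising lam shrinks the coefficients and spreads the weight almost evenly across the two correlated predictors, which is the stabilising behaviour described above.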

Source:

  • Régression avec R, Springer, 2011.
  • Data Mining and Business Analytics with R, Wiley, 2013.
  • Data science, fondamentaux et études de cas, machine learning avec Python et R, Eyrolles, 2015.
  • Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer, 2nd edition, 2015.
  • R for Marketing Research and Analytics, Springer, 2015.

Result diagnostics

We can inspect the results of a regression with summary and plot the fitted model with plot, which can produce 6 diagnostic plots. Visual inspection can reveal unusual observations.

We can also compute various leave-one-out (or deletion) diagnostics. Leads: the influence.measures function that calls dfbetas, dffits, covratio and cooks.distance.
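
For simple linear regression, a leave-one-out influence measure such as Cook’s distance is compact enough to compute by hand, as this Python sketch shows (the data are invented, with one planted outlier).

```python
def cooks_distance(x, y):
    """Leave-one-out influence for simple linear regression:
    D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2, with p = 2 parameters."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    intercept = sum(y) / n - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)
    hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [(e * e / (2 * s2)) * h / (1 - h) ** 2 for e, h in zip(resid, hat)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 7.0, 15.0]   # last point is an outlier
d = cooks_distance(x, y)
# the planted outlier dominates the influence measures
```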

The lmtest package enables all kinds of tests:

  • Ramsey’s RESET test detects violations of the assumptions and misspecification of the functional form. Lead: the resettest function.
  • The rainbow test takes a different approach to testing the functional form. Lead: the raintest function.
  • Another diagnostic test that relies on ordering the sample prior to testing is the Harvey-Collier test. Lead: the harvtest function.

Several procedures exist for heteroscedasticity-consistent (HC) and, more generally, heteroscedasticity-and-autocorrelation-consistent (HAC) covariance matrix estimation. Leads: the AER and sandwich packages and the vcovHC and vcovHAC functions.

Furthermore, the estimates produced by these functions can easily be plugged into the coeftest and waldtest functions from lmtest, which generalize the summary and anova methods.

Source:

  • Applied Econometrics with R, Springer, 2008.
  • R for Marketing Research and Analytics, Springer, 2015.

Resistant/robust regressions

Visualization and leave-one-out diagnostics are a popular means for detecting unusual observations. With low-dimensional data, we can always resort to plotting to detect such problems, but the situation is much worse with high-dimensional data. A solution is to use robust regression techniques.

OLS is not robust to violations of its assumptions. Robust regression methods are designed not to be overly affected by violations of the assumptions of the underlying data-generating process.

Resistant/robust regression methods can withstand alterations of a small percentage of the data set (in more practical terms, the estimates are unaffected by a certain percentage of outlying observations).

Example: cases involving incomes and wages are biased by a few larger observations (there are a few, the ‘1 %’, that earn disproportionately more than the others). This is not normally a problem if the outlier is simply an extreme observation drawn from the tail of a normal distribution, but if the outlier results from non-normal measurement error or some other violation of standard ordinary least squares assumptions, then it compromises the validity of the regression results.

Leads: the MASS, robust, robustbase packages and the lqs function (for Least trimmed squares regressions).

Resistant/robust GLM regressions. Leads: the robustbase package.
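
As a flavour of resistance, here is the Theil-Sen estimator, a classic resistant line fit, sketched in Python. It is not the least-trimmed-squares method named above, but it illustrates the same idea: the fit ignores a small fraction of wild observations (data invented).

```python
import statistics

def theil_sen(x, y):
    """Resistant line fit: slope = median of all pairwise slopes,
    intercept = median of (y_i - slope * x_i)."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))
              if x[j] != x[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return intercept, slope

x = [float(i) for i in range(10)]
y = [2.0 + 1.0 * xi for xi in x]
y[9] = 90.0                      # one wild observation ('the 1 %')
a, b = theil_sen(x, y)
# OLS would be dragged toward the outlier; Theil-Sen stays on the line
```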

Source: Applied Econometrics with R, Springer, 2008.

Quantile regressions

OLS regressions model the conditional mean of a response. Sometimes, other characteristics are more interesting; for instance, the median or, more generally, the quantiles. In some situations, the mean is biased by larger observations.

Example: as with robust regressions above, incomes and wages are skewed by a few very large observations (the ‘1 %’ earn disproportionately more than the others). The median, unlike the mean, is not biased by these outliers. Leads: the quantreg, extremevalues, mvoutlier, outliers, Rlof packages.
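
A tiny illustration of why the median (the 0.5 quantile) resists the ‘1 %’ while the mean does not (incomes invented):

```python
import statistics

# 100 invented incomes: 98 typical earners and 2 very large ones
incomes = [30_000] * 98 + [400_000, 5_000_000]
avg = statistics.mean(incomes)      # dragged far above the typical income
mid = statistics.median(incomes)    # 30000: unaffected by the outliers
```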

Source: Applied Econometrics with R, Springer, 2008.

Generalized linear models (GLM)

  • Logit, probit, and loglog models address binary (categorical) dependent variables (1 or 0 classifications). For example: will a bank loan be accepted (1) or rejected (0), will a student graduate or not, will a mother go back to work or not, according to categorical variables (including dummy variables) and non-categorical variables. Lead: the glm function.
  • Logit and probit can be multinomial. For example, a bank loan application can be accepted or rejected, but it can also be flagged as ‘to be negotiated’. A household can have 1, 2, 3, 4 or more cars/bicycles according to some independent variables. Logit and probit can also be ordinal or ordered. For example, we may want to classify subscribers from A-type to D-type when applying customer retention and win-back measures. Leads: the mlogit, nnet, MASS, VGAM packages.
  • Mixed logit can also utilize any distribution for the random coefficients, unlike probit which is limited to the normal distribution.
  • Poisson count data models. These regressions assume the dependent variable has a Poisson distribution, where the mean is equal to the variance. They are sometimes known as log-linear models. For example, a supermarket can assess sales of BBQ chickens per day knowing the demand follows a Poisson distribution. The negative binomial regression is a popular generalization of Poisson regression because it loosens the highly restrictive Poisson assumption that the variance is equal to the mean. Lead: the glm function.
  • Weighted least squares (WLS) is a special case of generalized least squares for when the observations have unequal variances (heteroscedasticity). For example, compensation is much more variable in large companies: the variance of the residuals increases with size, violating one OLS assumption. We can employ the standard deviation of wages as a weight. Since each weight is inversely proportional to the error variance, it reflects the information in that observation: an observation with small error variance gets a large weight since it contains relatively more information than an observation with large error variance (small weight). The method of iteratively reweighted least squares (IRLS) can be used.
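
As a sketch of how a logit model is estimated, here is a one-predictor logistic fit by plain gradient ascent on the log-likelihood. R’s glm uses IRLS instead, so this is a stand-in for the mechanics only; the loan/score data are invented.

```python
import math

def fit_logit(xs, ys, lr=0.2, steps=20_000):
    """Maximise the logit log-likelihood by gradient ascent
    (a stand-in for the IRLS algorithm glm uses under the hood)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# loan decision (1 = accepted, 0 = rejected) vs. an invented credit score
score = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
loan  = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logit(score, loan)
p_high = 1.0 / (1.0 + math.exp(-(b0 + b1 * 8.0)))   # high score: near 1
p_low  = 1.0 / (1.0 + math.exp(-(b0 + b1 * 1.0)))   # low score: near 0
```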

Source:

  • Applied Econometrics with R, Springer, 2008.
  • Data Mining and Business Analytics with R, Wiley, 2013.
  • Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer, 2nd edition, 2015.
  • R for Marketing Research and Analytics, Springer, 2015.
  • Econometrics by Examples, Palgrave Macmillan, 2014; Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer, 2nd edition, 2015.
  • Political Analysis Using R, Springer, 2015.

Time series

Financial, economic, production, operational, and meteorological data are all candidates for time series analysis. The examples are endless: modeling and forecasting airport traffic (trends, monthly patterns, lag plots, automated modeling, prediction, trajectory simulation, etc.), monthly temperature (using the SARIMA model, trigonometric regressions, forecasting, comparison, spectral analysis), electricity consumption (OLS, ARMAX, non-stationary error models, forecasting), milk production (exploratory analysis, regime change, SARIMA, ARMAX), stock returns, option returns, etc. Leads: the zoo, xts, tseries, timsac, ast packages and the ts class.

Key techniques:

  • Filtering and removing seasonality. Lead: the filter function.
  • Moving or rolling statistics (e.g. moving average, rolling standard deviation, running median). Leads: the rollapply function from the zoo package and the xts package.
  • Additive or multiplicative decomposition into seasonal, trend, and irregular components. Leads: the decompose and ares functions.
  • Exponential smoothing type, such as simple or double exponential smoothing, and the Holt-Winters method employing recursively reweighted lagged observations for predicting future data. Lead: the HoltWinters function.
  • Autoregressive integrated moving average (ARIMA) models. The Box-Jenkins approach includes the autocorrelation functions and the AR(p), MA(q), and ARMA(p,q) models. We can extend to the seasonal ARMA or SARMA, the ARMAX, the MINIC methods and the integrated series ARIMA and SARIMA. Leads: the ar, arima, arma, auto.arima, StructTS, ArDec functions and the dse, fracdiff packages.
  • Vector autoregressive (VAR) models. Lead: the vars package.
  • Unit roots (tests). Leads: the adf.test, pp.test from the tseries package and the fUnitRoots package.
  • Stationarity (tests) of the autocorrelation, white noise, remedial measures in case of nonstationarity. Lead: the kpss.test from the tseries package.
  • Unit root and cointegration (test). Leads: the po.test from the tseries package and ca.jo from the urca package.
  • Structural change. Leads: the dynlm, strucchange packages.
  • Forecasting. Lead: the forecast package.
  • ARCH and GARCH models. Tests, simulation, conditional heteroscedasticity, forecast, conditional heteroscedasticity error models, etc. Leads: the garch function from the tseries package.
  • Dynamic (lag) models, distributed lag models, and the Koyck approach. Leads: the dyn, dynlm packages.
  • Autocorrelated models (with serially correlated error terms). Lead: the nlme package.
  • Granger causality. Lead: the grangertest from the lmtest package.
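
Among the techniques above, double exponential smoothing (Holt) is compact enough to sketch directly: one recursion for the level, one for the trend (cf. the HoltWinters function; the series below is an invented clean trend).

```python
def holt(series, alpha=0.5, beta=0.5):
    """Double exponential smoothing (Holt): recursively reweight lagged
    observations into a level and a trend; returns an h-step forecaster."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
    return lambda h: level + h * trend

series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]   # a clean linear trend
forecast = holt(series)
# the forecaster continues the trend: 22 at h=1, 24 at h=2
```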

Source:

  • Applied Econometrics with R, Springer, 2008.
  • Econometrics by Examples, Palgrave Macmillan, 2014.

Simultaneous equations

Two-stage least squares (2SLS) addresses cases of supply & demand equations. We might think these models are the bread and butter of banks, but many industries will, for example, estimate supply and demand to simulate price wars and other market behaviours. Leads: the foreign, systemfit packages.

Source: R for Marketing Research and Analytics, Springer, 2015.

Instrumental variable

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. For example, suppose we wish to estimate the causal effect of smoking on general health. Correlation between health and smoking does not imply that smoking causes poor health, because other variables may affect both health and smoking, or because health may affect smoking. It is at best difficult and expensive to conduct controlled experiments on smoking status in the general population. We may attempt to estimate the causal effect of smoking on health from observational data by using the tax rate for tobacco products as an instrument for smoking. An instrumental variable is sometimes loosely referred to as a proxy variable. Lead: the ivreg function from the AER and ivpack packages.
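
The instrument logic can be checked by simulation: below, naive OLS is biased upward by an unobserved confounder, while two-stage least squares using the instrument recovers the true effect. All coefficients and the data-generating process are invented; in R this is what ivreg does.

```python
import random

def ols_slope(x, y):
    """Slope of an OLS fit of y on x (with intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = sum((a - xbar) ** 2 for a in x)
    return num / den

rng = random.Random(0)
n = 5000
u = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
z = [rng.gauss(0, 1) for _ in range(n)]   # instrument, e.g. a tax rate
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]              # 'smoking'
y = [2.0 * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]  # effect: 2

naive = ols_slope(x, y)       # biased: u drives both x and y
# two-stage least squares: regress x on z, then y on the fitted x
stage1 = ols_slope(z, x)
xhat = [stage1 * zi for zi in z]
iv = ols_slope(xhat, y)       # close to the true causal effect, 2.0
```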

Source: Econometrics by Examples, Palgrave Macmillan, 2014.

Multilevel models

Multilevel models (also known as hierarchical linear models, nested data models, mixed models, random coefficient, random-effects models, random parameter models, or split-plot designs) are statistical models of parameters that vary at more than one level. They assume that the data being analyzed are drawn from a hierarchy of different populations whose differences relate to that hierarchy.

These models are useful in a wide variety of disciplines: physical, biological and social sciences. Even in marketing research. They are designed for measuring the impact of a group or an environment or a policy on individuals. An example could be a model of student performance that contains measures for individual students as well as measures for classrooms within which the students are grouped.

A fixed effects model is a statistical model that represents the observed quantities in terms of explanatory variables that are treated as if the quantities were non-random. This is in contrast to random effects models and mixed models in which either all or some of the explanatory variables are treated as if they arise from random causes. In panel data analysis, the term fixed effects estimator (also known as the within estimator) is used to refer to an estimator for the coefficients in the regression model. If we assume fixed effects, we impose time independent effects for each entity that are possibly correlated with the regressors.

The random effects model is a special case of the fixed effects model. Random effects models are used in the analysis of hierarchical or panel data; “fixed” and “random” effects respectively refer to the population-average and subject-specific effects (and where the latter are generally assumed to be unknown latent variables).

Nonlinear random-effects models (or variance components models) use counts, binary dependent variables, etc. They are a kind of hierarchical linear model. These models are used in the analysis of hierarchical or panel data when one assumes no fixed effects.

A mixed model contains both fixed effects and random effects. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study), or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA.

Leads: the multilevel, nlme, lme4, and xxM packages.
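
The fixed-effects (‘within’) estimator mentioned above reduces to demeaning within each group before pooling, which fits in a few lines (invented two-entity panel):

```python
def within_slope(groups):
    """Fixed-effects ('within') estimator: demean x and y inside each
    group, then pool and run OLS through the origin."""
    dx, dy = [], []
    for xs, ys in groups:
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        dx += [x - mx for x in xs]
        dy += [y - my for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

# two entities with very different intercepts but the same slope (2)
g1 = ([1.0, 2.0, 3.0], [12.0, 14.0, 16.0])   # entity effect 10
g2 = ([1.0, 2.0, 3.0], [52.0, 54.0, 56.0])   # entity effect 50
slope = within_slope([g1, g2])   # 2.0: the entity effects are swept out
```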

Source:

  • Wikipedia; Multilevel Modeling Using R, CRC Press, 2014.
  • R for Marketing Research and Analytics, Springer, 2015.

Interaction effects

Multilevel models use dummy (dichotomous) variables. A dummy can split a dataset into two layers, one layer for each ‘regime’. The regression model can then discriminate between two clusters instead of averaging across the data cloud. We can plot two parallel regression lines, one when the dummy is on and another when it is off; the lines have different intercepts. In the case of interaction, two independent variables are included in the model along with a third one, their product. As a result, the two regressions will not have the same slope, although they may share the same intercept. Cases are numerous: two variables, categorical, dichotomous, noncategorical, one of each or both alike plus one interaction, more than two variables and several interactions, etc. At the simplest level, for example, we could model clothing expenditures against two dummies, being a female or not and being a college graduate or not, and their interaction.

Another use of the dummy variable is in piecewise linear regression, consisting of two segments and a threshold, or knot, where the slope changes. For example, we can model plant output where, past a threshold, a second production line opens or another shift is added, boosting the output potential.
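
Fitting a model with a full dummy interaction (y = b0 + b1 x + b2 d + b3 x d) is equivalent to fitting a separate line in each dummy group, which makes for a compact sketch (invented expenditure data):

```python
def fit_line(x, y):
    """Simple OLS: returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = sum((a - xbar) ** 2 for a in x)
    slope = num / den
    return ybar - slope * xbar, slope

# clothing expenditure vs. income (invented); dummy d = 1 for graduates
income = [1.0, 2.0, 3.0, 4.0]
spend_d0 = [1.5, 2.0, 2.5, 3.0]   # slope 0.5 when d = 0
spend_d1 = [2.0, 3.5, 5.0, 6.5]   # slope 1.5 when d = 1
i0, s0 = fit_line(income, spend_d0)
i1, s1 = fit_line(income, spend_d1)
# the dummy alone would only shift the intercept (i1 vs. i0);
# the interaction term x*d is what lets the slope change (s1 - s0)
```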

Naïve Bayes

Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. They are an alternative to logistic regressions. For example, using these algorithms, we can predict whether a banker can offer a loan to a customer (is the customer creditworthy or not). Leads: the NaiveBayes, bnlearn packages.
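
A one-feature Gaussian naive Bayes classifier fits in a dozen lines and shows the mechanics; with several features, the class-conditional densities would simply be multiplied under the independence assumption. The credit data are invented.

```python
import math

def nb_train(data):
    """data: list of (x, label). Store (prior, mean, variance) per class."""
    model = {}
    for label in set(l for _, l in data):
        xs = [x for x, l in data if l == label]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        model[label] = (len(xs) / len(data), mu, var)
    return model

def nb_predict(model, x):
    """Pick the class maximising log prior + log Gaussian likelihood."""
    def score(prior, mu, var):
        return (math.log(prior) - 0.5 * math.log(2 * math.pi * var)
                - (x - mu) ** 2 / (2 * var))
    return max(model, key=lambda label: score(*model[label]))

# creditworthiness vs. an invented income feature
train = [(2.0, "deny"), (2.5, "deny"), (3.0, "deny"),
         (7.0, "grant"), (7.5, "grant"), (8.5, "grant")]
model = nb_train(train)
low_income = nb_predict(model, 2.2)    # classified as "deny"
high_income = nb_predict(model, 8.0)   # classified as "grant"
```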

Source:

  • Data science, fondamentaux et études de cas, machine learnig avec Python et R, Eyrolles, 2015.
  • Data Mining and Business Analytics with R, Wiley, 2013.
  • Bayesian Networks in R with Applications in Systems Biology, Springer, 2013.
  • Bayesian Networks with Examples in R, CRC Press, 2014.