Foreword

  • Output options: ‘tango’ syntax, the ‘readable’ theme.
  • Results only.
  • Source: publicly available data.

1, Preview

Data science and storytelling

Pulp Fiction classical scene

The Big Mac Index

The new scenes

2, Results Exploration and Problems of Multicollinearity

Marcellus and the wine business - part 1

Introduction

R

R may appear opaque. Some people advocate the use of spreadsheets or visualization software such as Tableau or Qlik; these user-friendly tools do not have steep and long learning curves. Nonetheless, it is a matter of long-term investment. Implementing R scripts can save a lot of time if we have to release reports on a recurring basis, for example. Imagine the burden of completing the same report month after month with a spreadsheet when we can instead embed R scripts within a process, automate things, and free resources for other tasks.

Documentation on R can be found online.

Synonyms

In the following regressions, we use terms that are sometimes synonyms:

  • Variable Y: dependent, explained, predictand, regressand, response, endogenous.
  • Variable X: independent, explanatory, predictor, regressor, stimulus or control, exogenous.
  • Differences between actual and predicted values: residuals, errors, disturbances.

Datasets

Most of the cases rely on cross-sectional datasets. Even though we may come across time and dates in the observations, they are simply markers.

Some cases will address time series.

Another case will combine cross-sectional data with time series: panel data.

Models are simple and datasets are small. The methods remain the same with larger datasets or recursive analyses. We import data from spreadsheets. For large datasets, we can instead connect to databases or work with Big Data technologies (parallel computing and the like). Scripts are short. For repetitive tasks, we can implement loops (using for(var in vec) {expr} and while(cond) {expr}).
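A minimal sketch of these two loop constructs, with hypothetical report names standing in for the real recurring task:

```r
# for(var in vec) {expr}: run the same task for each reporting period.
months  <- c("2024-01", "2024-02", "2024-03")  # hypothetical report periods
reports <- character(0)
for (m in months) {
  # stand-in for the real report-building code
  reports <- c(reports, paste0("report_", m, ".html"))
}

# while(cond) {expr}: repeat until a condition fails.
attempts <- 0
while (attempts < 3) {
  attempts <- attempts + 1
}
```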

Ordinary Least Squares

The OLS model or method relies on several assumptions to work properly. The following lists the main assumptions (and their violations). We must test whether the assumptions are violated and apply remedial measures if necessary (sometimes, violations can be tolerated).

  • Normality, nonnormality.
  • Linearity, nonlinearity.
  • Correct specification: no omission of relevant variables, no inclusion of irrelevant variables (underfitted/overfitted model).
  • Reliability, consistency (garbage in, garbage out).
  • More observations than predictors (n > p).
  • Residuals have constant variance: homoscedasticity (heteroscedasticity).
  • Residuals are not correlated over time; no autocorrelation (serial correlation).
  • Regressors are not correlated (multicollinearity).

The R case

Goal

Identify the most efficient firms: the producers which can maximize their sales with their means of production. It is an exploratory analysis to narrow down the sample and proceed to deeper analyses (say we have a larger sample of 20 players and want to focus on the 3 best).

Explore the dataset

Import the data: sales_m (in millions of Euros) of Champagne producers and their respective workforce and hectares (cultivated land surface) as explanatory variables.

Pointer

The XLConnect package fetches data directly into a Workbook. When dealing with several light datasets, the best option is keeping them in a unique workbook with several worksheets.

However, XLConnect is memory-intensive because workbooks are binary files. For larger datasets, a better option is to maintain separate csv files and load them into R with the readr package.
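A sketch of the csv round trip; the file name and columns are illustrative, not the book's actual data file. The example uses base write.csv/read.csv so it runs without extra packages; readr::read_csv(path) would be the faster drop-in described above.

```r
# Keep each dataset in its own csv file and load it on demand.
path <- file.path(tempdir(), "champagne.csv")  # hypothetical file name
write.csv(data.frame(house   = c("Boizel", "Pommery"),
                     sales_m = c(97, 288)),
          path, row.names = FALSE)

champagne <- read.csv(path, stringsAsFactors = FALSE)  # or readr::read_csv(path)
```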

Print the head and tail of the dataset.

##               house sales_m workforce hectares
## 1            Boizel      97        10        1
## 2    Canard Duchêne     103        30       44
## 3 Charles Heidsieck     157       180       65
##              house sales_m workforce hectares
## 8  Laurent Perrier     208       150      150
## 9  Moët et Chandon     724      1100     1000
## 10         Pommery     288        80      230

Summarize the dataset.

##     house              sales_m        workforce         hectares      
##  Length:10          Min.   : 97.0   Min.   :  10.0   Min.   :   1.00  
##  Class :character   1st Qu.:121.8   1st Qu.:  24.0   1st Qu.:  49.25  
##  Mode  :character   Median :208.0   Median :  60.0   Median : 121.00  
##                     Mean   :239.6   Mean   : 170.2   Mean   : 199.90  
##                     3rd Qu.:269.2   3rd Qu.: 132.5   3rd Qu.: 210.00  
##                     Max.   :724.0   Max.   :1100.0   Max.   :1000.00

Summaries are good at detecting missing values, inconsistencies, and outliers. For example, in workforce, we notice a high maximum pushing the mean well above the median.
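Since the book shows results only, the calls below are a reconstruction: the data frame is transcribed from the printed head, tail, and frequency tables, not from the original workbook.

```r
# Dataset transcribed from the printed outputs (sales_m in millions of euros).
champagne <- data.frame(
  house = c("Boizel", "Canard Duchêne", "Charles Heidsieck", "Charles Laffite",
            "Château Malakoff", "Delamotte", "Deutz", "Laurent Perrier",
            "Moët et Chandon", "Pommery"),
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230),
  stringsAsFactors = FALSE)

head(champagne, 3)   # first rows, as printed above
tail(champagne, 3)   # last rows
summary(champagne)   # quartiles, means, min/max per column
```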

Use another view (psych package).

##           vars  n  mean     sd median trimmed    mad min  max range skew
## house*       1 10   NaN     NA     NA     NaN     NA Inf -Inf  -Inf   NA
## sales_m      2 10 239.6 183.98    208  196.88 118.61  97  724   627 1.70
## workforce    3 10 170.2 331.73     60   74.00  65.23  10 1100  1090 2.15
## hectares     4 10 199.9 291.65    121  124.75 119.35   1 1000   999 1.99
##           kurtosis     se
## house*          NA     NA
## sales_m       1.90  58.18
## workforce     3.19 104.90
## hectares      2.71  92.23

describe flips the view, with variables as rows. We obtain more descriptive statistics: central tendency measures, standard deviation, range, skewness, kurtosis, and standard error.

Bring out more analytic power!

## champagne 
## 
##  4  Variables      10  Observations
## ---------------------------------------------------------------------------
## house 
##        n  missing distinct 
##       10        0       10 
## 
## Boizel (1, 0.1), Canard Duchêne (1, 0.1), Charles Heidsieck (1, 0.1),
## Charles Laffite (1, 0.1), Château Malakoff (1, 0.1), Delamotte (1, 0.1),
## Deutz (1, 0.1), Laurent Perrier (1, 0.1), Moët et Chandon (1, 0.1),
## Pommery (1, 0.1)
## ---------------------------------------------------------------------------
## sales_m 
##        n  missing distinct     Info     Mean      Gmd 
##       10        0        8    0.988    239.6    177.7 
##                                           
## Value       97 103 110 157 208 213 288 724
## Frequency    1   1   1   1   2   1   2   1
## Proportion 0.1 0.1 0.1 0.1 0.2 0.1 0.2 0.1
## ---------------------------------------------------------------------------
## workforce 
##        n  missing distinct     Info     Mean      Gmd 
##       10        0        8    0.988    170.2      262 
##                                                   
## Value        10   22   30   60   80  150  180 1100
## Frequency     2    1    1    2    1    1    1    1
## Proportion  0.2  0.1  0.1  0.2  0.1  0.1  0.1  0.1
## ---------------------------------------------------------------------------
## hectares 
##        n  missing distinct     Info     Mean      Gmd 
##       10        0        8    0.988    199.9    256.2 
##                                                   
## Value         1   37   44   65  121  150  230 1000
## Frequency     1    1    1    1    2    1    2    1
## Proportion  0.1  0.1  0.1  0.1  0.2  0.1  0.2  0.1
## ---------------------------------------------------------------------------

The describe function reveals more metrics: number of observations, missing values, distinct values (important when dealing with categorical variables) and statistics on the distribution.

Visualize the dataset.

We can spot outliers.

Explore the results

Run the regression.

Print the coefficients and their confidence intervals.

## (Intercept)   workforce    hectares 
## 106.7472958  -0.1699307   0.8092792
##                  2.5 %       97.5 %
## (Intercept) 86.8478521 126.64673957
## workforce   -0.3267326  -0.01312889
## hectares     0.6309260   0.98763244
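The hidden calls are presumably of the following shape; the model name champagne_model matches the Call: lines printed later in this chapter, and the dataset is transcribed from the printed outputs.

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))

champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)
coef(champagne_model)      # point estimates
confint(champagne_model)   # 95% confidence intervals
fitted(champagne_model)    # predicted sales_m
```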

Print the predicted values: the ‘fitted’ sales_m, as opposed to the actual sales_m.

##        1        2        3        4        5        6        7        8 
## 105.8573 137.2577 128.7629 289.1430 194.4742 202.9708 126.4948 202.6496 
##        9       10 
## 729.1027 279.2871

Plot the predicted values against the actual values of the regressand.

The model does a good job at predicting the regressand. A perfect fit would lie right on the line.

Print the residuals, the differences between the actual and predicted values:

  • Residuals.
  • Studentized residuals.
  • Standardized residuals.
##          1          2          3          4          5          6 
##  -8.857268 -34.257659  28.237086  -1.143037  18.525764   5.029228 
##          7          8          9         10 
## -16.494783   5.350432  -5.102707   8.712945
##           1           2           3           4           5           6 
## -0.46034348 -2.37241633  2.27236684 -0.07161078  0.97759263  0.25578198 
##           7           8           9          10 
## -0.89577936  0.26388855 -2.20350550  0.47769086
##           1           2           3           4           5           6 
## -0.48867287 -1.84069022  1.79938600 -0.07731545  0.98070167  0.27478205 
##           7           8           9          10 
## -0.90869540  0.28339241 -1.76945496  0.50642492
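The three residual flavours map onto three base-R extractors; a sketch with the dataset transcribed from the printed outputs:

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

residuals(champagne_model)   # raw residuals: actual minus fitted
rstudent(champagne_model)    # studentized (leave-one-out) residuals
rstandard(champagne_model)   # standardized residuals
```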

Plot the residuals.

t-tests and F-tests

One function does it all. Print the tests.

## 
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.258  -7.919   1.943   7.872  28.237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.74730    8.41548  12.685 4.38e-06 ***
## workforce    -0.16993    0.06631  -2.563   0.0374 *  
## hectares      0.80928    0.07543  10.730 1.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9881 
## F-statistic: 374.4 on 2 and 7 DF,  p-value: 7.648e-08

We get a t-test (p-value) for each coefficient and the model F-statistic (p-value). They are all significant at or below the 5% level.

The t-value, or t-statistic, measures the significance of an individual coefficient.

The p-value is the probability associated with a statistical test. A coefficient is statistically significant when the p-value associated with a test is less than, say, 5%. 1% and lower is even better. Sometimes we can tolerate 10%.

With the F-statistic, when significant, we conclude that the coefficients, taken together, are jointly different from 0.

Alternatively, compute an ANOVA table.

## Analysis of Variance Table
## 
## Response: sales_m
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## workforce  1 255417  255417  633.60 3.986e-08 ***
## hectares   1  46408   46408  115.12 1.343e-05 ***
## Residuals  7   2822     403                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or…

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## workforce    1 255417  255417   633.6 3.99e-08 ***
## hectares     1  46408   46408   115.1 1.34e-05 ***
## Residuals    7   2822     403                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
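The two tables above presumably come from anova() on the fitted model and from summary(aov(...)); a sketch, with the dataset transcribed from the printed outputs:

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

anova(champagne_model)                                          # sequential ANOVA table
summary(aov(sales_m ~ workforce + hectares, data = champagne))  # the 'Or…' variant
```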

The F-test’s p-values are below the 1% significance level (three asterisks mean below 0.1%), which is highly significant. We reject the ‘all \(\beta = 0\)’ hypothesis. We have robust coefficients. However, we want to remain cautious…

Outliers, extreme values, and influential observations

It is essential to understand the impact of outliers on a model. First, we need to detect them. Second, we decide whether we keep them.

There are several methods for detecting outliers. As we did above, we can use descriptive functions such as summary. We also plotted the observations (boxplots, histograms, etc.) prior to running the regression. Finally, we plotted the regression residuals. There are other techniques.

Run the Bonferroni t-test to catch extreme values (from the car package). We test the assumption of ‘no extreme value’. The test applies to lm and glm fits, assuming normal distributions.

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 2 -2.372416           0.055338      0.55338

The second observation (obs #2) is the most extreme but does not make the cut as an outlier. We accept the ‘no extreme value’ hypothesis (the Bonferroni p-value is over 10%).
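The printed test is presumably car::outlierTest; the same Bonferroni computation can be sketched in base R (assuming the deleted residual loses one further degree of freedom, i.e. df = n − p − 1):

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

rs <- rstudent(champagne_model)
i  <- which.max(abs(rs))                           # most extreme observation
df <- df.residual(champagne_model) - 1             # one df lost to the deleted residual
p_unadjusted <- 2 * pt(abs(rs[i]), df, lower.tail = FALSE)
p_bonferroni <- min(1, length(rs) * p_unadjusted)  # Bonferroni: multiply by n
```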

A value may be extreme, but does it have a significant impact on the regression?

To answer the question, plot the Cook’s distance (set the threshold to \(4 \times mean(cooksd)\)).

Alternatively.

Cook’s distance measures the influence exerted by each observation on the regression. Observation 9 has more influence than any other observation.

Extract the observations above the threshold.

##        9 
## 49.54683
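The extraction above can be sketched with cooks.distance and the 4 × mean rule stated earlier:

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

cooksd    <- cooks.distance(champagne_model)
threshold <- 4 * mean(cooksd)        # rule-of-thumb cutoff used above
cooksd[cooksd > threshold]           # influential observations only
```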

Deal with an outlier

There are several solutions:

  • Tolerate the outlier.
  • Process the outlier like a missing value. Replace it (with a mean or median value).
  • Cap the outlier. Replace the outlier with either the 5th or 95th centiles.
  • Replace the outlier with a prediction or a neighbour (using knn, a decision tree or a multivariate imputation by chained equations).

Normality

We plotted the predicted values against the actual values with a regression line…

…and the residual values.

We can plot an alternative view of these residuals. Plot a Q-Q chart (using qqPlot from the car package).

Not only do we obtain the regression line, we also get the confidence interval. When residual values are located around the regression line and within the confidence interval, we have a normal and linear regression.

We can represent these results with a histogram and a distribution of studentized residuals (using the MASS package).

We are far from a perfect normal distribution. We have fat tails, even though most values are concentrated around 0 (the regression line).

Under the hypothesis of normality, data should be symmetrical: not platykurtic, not leptokurtic, and skewed neither to the left nor to the right.

Run the \(\chi^2\) goodness-of-fit test (actuals over the probability of predictions).

## 
##  Chi-squared test for given probabilities
## 
## data:  champagne$sales_m
## X-squared = 19.977, df = 9, p-value = 0.01805

We reject the ‘normality’ hypothesis at the 5% significance level.

Run the Jarque-Bera normality test (from the normtest package).

## 
##  Jarque-Bera test for normality
## 
## data:  fitted(champagne_model)
## JB = 12.247, p-value < 2.2e-16

We reject again the ‘normality’ hypothesis at the 5% significance level.

Run the Shapiro normality test.

## 
##  Shapiro-Wilk normality test
## 
## data:  fitted(champagne_model)
## W = 0.6761, p-value = 0.0004527

We reject the ‘normality’ hypothesis at the 5% significance level.

We can temper these three results knowing that we deal with a small sample. We always have to consider test results from a global perspective, including a visual exploration of the dataset, the results, and a battery of tests.
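A sketch of the three tests above: the chi-squared call reproduces the printed statistic when the expected proportions come from the fitted values (an assumption inferred from the outputs), Shapiro-Wilk ships with base R, and Jarque-Bera would come from the normtest package (left as a comment since that package may not be installed).

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

# Goodness of fit: actual sales against the proportions implied by the fit.
p_fit <- fitted(champagne_model) / sum(fitted(champagne_model))
chisq.test(champagne$sales_m, p = p_fit)

# Shapiro-Wilk on the fitted values, as in the printed output.
shapiro.test(fitted(champagne_model))

# Jarque-Bera: normtest::jb.norm.test(fitted(champagne_model))
```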

Global test and wrap-up

The normtest package allows for other tests on kurtosis, skewness, etc. Instead, we can use the gvlma package that runs a battery of tests.

Run a global validation of the model assumptions (from the gvlma package).

## 
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.258  -7.919   1.943   7.872  28.237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.74730    8.41548  12.685 4.38e-06 ***
## workforce    -0.16993    0.06631  -2.563   0.0374 *  
## hectares      0.80928    0.07543  10.730 1.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9881 
## F-statistic: 374.4 on 2 and 7 DF,  p-value: 7.648e-08
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = champagne_model) 
## 
##                      Value p-value                   Decision
## Global Stat        7.25407  0.1231    Assumptions acceptable.
## Skewness           0.18300  0.6688    Assumptions acceptable.
## Kurtosis           0.01897  0.8905    Assumptions acceptable.
## Link Function      4.57713  0.0324 Assumptions NOT satisfied!
## Heteroscedasticity 2.47496  0.1157    Assumptions acceptable.

We obtain a more detailed summary of statistics:

  • t-tests are significant.
  • F-statistic is significant.
  • Most OLS assumptions are accepted.

The only failing assumption is the link function. Residuals can be normally distributed, or they can follow an alternative distribution: for example, a logit link for a binomial distribution or an inverse link for a gamma distribution. In other words, the results would suggest running the regression with another link function to fit a nonnormal distribution!

It can also reveal another problem…

Extract statistics using the influence function…

## $hat
##         1         2         3         4         5         6         7 
## 0.1850521 0.1407477 0.3891173 0.4578062 0.1147924 0.1690161 0.1826214 
##         8         9        10 
## 0.1157645 0.9793705 0.2657118 
## 
## $coefficients
##     (Intercept)    workforce     hectares
## 1  -1.955594615 -0.006557502  0.009929118
## 2  -6.189522488 -0.012490588  0.021653354
## 3   9.899199277  0.078639350 -0.093353050
## 4  -0.001332613  0.004157981 -0.004588158
## 5   2.262288485 -0.005640924  0.003955039
## 6   0.501077166 -0.004932110  0.004720268
## 7  -3.553614199 -0.014595512  0.020108856
## 8   0.787650638  0.002235383 -0.002816517
## 9  19.073660021 -0.169639458 -0.074717411
## 10  0.355799817 -0.015896119  0.017690363
## 
## $sigma
##        1        2        3        4        5        6        7        8 
## 21.31341 15.57781 15.89873 21.67727 20.14168 21.56925 20.36732 21.56176 
##        9       10 
## 16.12286 21.28555 
## 
## $wt.res
##          1          2          3          4          5          6 
##  -8.857268 -34.257659  28.237086  -1.143037  18.525764   5.029228 
##          7          8          9         10 
## -16.494783   5.350432  -5.102707   8.712945
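The listing above matches what base R's influence() returns for the fitted model; a sketch, with the dataset transcribed from the printed outputs:

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

infl <- influence(champagne_model)
infl$hat     # leverage of each observation (observation 9 dominates)
infl$sigma   # residual standard error when each observation is dropped
```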

…and summary.

## 
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.258  -7.919   1.943   7.872  28.237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.74730    8.41548  12.685 4.38e-06 ***
## workforce    -0.16993    0.06631  -2.563   0.0374 *  
## hectares      0.80928    0.07543  10.730 1.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9881 
## F-statistic: 374.4 on 2 and 7 DF,  p-value: 7.648e-08

Signs on coefficients are confusing. More means of production should translate into more sales, but the workforce coefficient is negative, although significant at the 5% level. \(R^2\) and adjusted \(R^2\) are very strong. \(R^2\) is the regression’s explanatory power, the overall fit of the model (useful when comparing models). 98% of the sales can be explained by the independent variables; only 2% is random or unexplained.

The F-statistic and all p-values are below the 5% level, and most are below the 1% level. However, in light of the above tests and a questionable coefficient, we must continue searching for problems.

Plot the residuals.

We take up all the previous findings:

  • Top left: residuals around the regression line (the dotted line).
  • Bottom left: transformed residuals with ‘absolute’ distances from y = 0; the x-axis is the regression line (fitted values).
  • Top right: Q-Q plot where the residuals are classified in negative and positive quantiles; we can spot outliers at the extremes and see how divergent they are from the regression line (the dotted line).
  • Bottom right: Cook’s distance shows the influence of each observation (x-axis) against its transformed distance from the regression line.

Although observations 2, 3, 5 are further away from the regression line, observation 9 is the extreme value with the most impact on the regression.

More plots are available (type in methods(plot)).

The last two plots are alternative Cook’s distance plots.

Something is still wrong with the results and we have not found it.

Multicollinearity

Test the ‘no multicollinearity’ hypothesis between regressors with the variance inflation factors (vif).

## workforce  hectares 
##  10.80331  10.80331
## workforce  hectares 
##  3.286839  3.286839
## workforce  hectares 
##      TRUE      TRUE

The test is positive (TRUE). We reject the ‘no multicollinearity’ hypothesis. We found our culprit!

workforce and hectares are not independent. Intuitively, it makes sense that a large farming land implies a large workforce. Multicollinearity between these two variables explains the coefficients (high \(R^2\), one low t-value).

Multicollinearity causes larger variances, large covariances, and high standard errors for the estimated coefficients. The results have wider confidence intervals and distributions with fatter tails. These large swings around the mean explain the rejection of normality by the tests.

With multicollinearity, we have weaker or insignificant t-values, but high \(R^2\).
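With only two regressors, the printed VIFs can be recomputed by hand from their correlation; car::vif on the fitted model returns the same values for both regressors.

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))

r   <- cor(champagne$workforce, champagne$hectares)
vif <- 1 / (1 - r^2)   # with two regressors, both share the same VIF
sqrt(vif)              # compare against the threshold of 2
sqrt(vif) > 2          # TRUE: reject 'no multicollinearity'
```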

Pointer

The sqrt(vif) > 2 rule of thumb works like a t-test with a threshold of 2 (at the 5% significance level). According to the Student distribution, at the 5% significance level, critical t-values tend towards 1.96 as we add more observations. We round this up to 2, a more conservative benchmark.

With 10 observations, hence 9 degrees of freedom, the critical t-value is 1.833, and the criterion becomes sqrt(vif) > 1.833. Even with this threshold, the test rejects the ‘no multicollinearity’ hypothesis.

Remedies to multicollinearity

As suggested by the failing assumption of the link function, we face a specification error. We need not change the link function, though.

## 
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.258  -7.919   1.943   7.872  28.237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.74730    8.41548  12.685 4.38e-06 ***
## workforce    -0.16993    0.06631  -2.563   0.0374 *  
## hectares      0.80928    0.07543  10.730 1.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9881 
## F-statistic: 374.4 on 2 and 7 DF,  p-value: 7.648e-08

All p-values but one are below the 1% level. workforce has the weakest t-value (the closest to 0). Simply remove workforce from the model and run another regression with hectares alone. Print the results.

## 
## Call:
## lm(formula = sales_m ~ hectares, data = champagne)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.138 -17.639   0.664  21.475  29.583 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 114.63142   10.20049   11.24 3.53e-06 ***
## hectares      0.62516    0.02988   20.92 2.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.15 on 8 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.9798 
## F-statistic: 437.6 on 1 and 8 DF,  p-value: 2.861e-08

The results are still robust without workforce. Multicollinearity simply disappeared with only one regressor.
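The reduced model above can be sketched as follows; the name champagne_model2 is taken from the gvlma Call printed in the next output.

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m  = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  hectares = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))

champagne_model2 <- lm(sales_m ~ hectares, data = champagne)
summary(champagne_model2)   # hectares alone keeps a strong fit
```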

Rerun the normality and global tests to be sure.

## 
##  Chi-squared test for given probabilities
## 
## data:  champagne$sales_m
## X-squared = 19.977, df = 9, p-value = 0.01805
## 
##  Jarque-Bera test for normality
## 
## data:  fitted(champagne_model)
## JB = 12.247, p-value = 5e-04
## 
## Call:
## lm(formula = sales_m ~ hectares, data = champagne)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.138 -17.639   0.664  21.475  29.583 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 114.63142   10.20049   11.24 3.53e-06 ***
## hectares      0.62516    0.02988   20.92 2.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.15 on 8 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.9798 
## F-statistic: 437.6 on 1 and 8 DF,  p-value: 2.861e-08
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = champagne_model2) 
## 
##                      Value  p-value                   Decision
## Global Stat        8.15102 0.086200    Assumptions acceptable.
## Skewness           0.05058 0.822056    Assumptions acceptable.
## Kurtosis           0.73125 0.392477    Assumptions acceptable.
## Link Function      7.26055 0.007049 Assumptions NOT satisfied!
## Heteroscedasticity 0.10864 0.741695    Assumptions acceptable.

We still have a problem of nonnormality. The plausible cause is the small sample size (n < 30). A larger sample would enable us to conclude more decisively whether the distribution is normal or not.

Conclusion

Add the predicted sales to the dataset and plot the results.

Any firm with ‘actual sales > predicted sales’ is likely to be an outperformer because the regression averages the results. Pommery, for example, sells more than what is expected by the model.

Extension

Other remedies to multicollinearity

Removing one (or more) explanatory variable(s) is only one way of removing multicollinearity:

  • Increasing the size of the sample may attenuate the collinearity problem.
  • Taking the first difference of each regressor: run the regression with the new explanatory variables that are the results of \(X_{t} - X_{(t-1)}\).
  • Creating a centered interaction term between collinear variables.
  • Pooling the data: combining cross-sectional with time-series data such as with panel data.
  • Employing advanced techniques such as factor analysis, principal components analysis (PCA) or the implementation of orthogonal (uncorrelated) variables, and ridge regressions (for that matter, the effects function might be of use).

Multicollinearity, covariance, and correlation

Print the covariance matrix of the model coefficients.

##             (Intercept)    workforce     hectares
## (Intercept)  70.8202727  0.204013691 -0.326320674
## workforce     0.2040137  0.004397218 -0.004764483
## hectares     -0.3263207 -0.004764483  0.005689023

The covariance matrix shows the joint variability between coefficient estimates but, being scale-dependent, does not indicate whether coefficients are highly correlated.

Print the Pearson correlation matrix of the dataset.

##           sales_m workforce hectares
## sales_m      1.00      0.92     0.99
## workforce    0.92      1.00     0.95
## hectares     0.99      0.95     1.00
## 
## n= 10 
## 
## 
## P
##           sales_m workforce hectares
## sales_m           2e-04     0e+00   
## workforce 2e-04             0e+00   
## hectares  0e+00   0e+00

The correlation matrix shows the high collinearity between workforce and hectares; a sign of potential multicollinearity.
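A sketch of both extractions: vcov() yields the coefficient covariance matrix, and base cor() the correlations of the dataset. The printed matrix with its n and P blocks looks like Hmisc::rcorr output, left as a comment since that package may not be installed.

```r
# Dataset transcribed from the printed outputs.
champagne <- data.frame(
  sales_m   = c(97, 103, 157, 288, 213, 208, 110, 208, 724, 288),
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

vcov(champagne_model)      # covariance matrix of the coefficient estimates
round(cor(champagne), 2)   # Pearson correlations of the dataset
# Hmisc::rcorr(as.matrix(champagne)) adds the p-value matrix shown above
```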

More statistics

We can extract a lot more from the regression results.

##    (Intercept) workforce hectares
## 1            1        10        1
## 2            1        30       44
## 3            1       180       65
## 4            1        22      230
## 5            1        60      121
## 6            1        10      121
## 7            1        60       37
## 8            1       150      150
## 9            1      1100     1000
## 10           1        80      230
## attr(,"assign")
## [1] 0 1 2
## $std.dev
## [1] 20.07782
## 
## $hat
##  [1] 0.1850521 0.1407477 0.3891173 0.4578062 0.1147924 0.1690161 0.1826214
##  [8] 0.1157645 0.9793705 0.2657118
## 
## $std.res
##  [1] -0.48867287 -1.84069022  1.79938600 -0.07731545  0.98070167
##  [6]  0.27478205 -0.90869540  0.28339241 -1.76945496  0.50642492
## 
## $stud.res
##  [1] -0.46034348 -2.37241633  2.27236684 -0.07161078  0.97759263
##  [6]  0.25578198 -0.89577936  0.26388855 -2.20350550  0.47769086
## 
## $cooks
##  [1]  0.018075047  0.184995371  0.687464383  0.001682436  0.041573872
##  [6]  0.005119069  0.061495548  0.003504794 49.546825102  0.030935223
## 
## $dfits
##  [1]  -0.21936337  -0.96017691   1.81359302  -0.06580244   0.35203989
##  [6]   0.11535533  -0.42341397   0.09548260 -15.18249574   0.28735523
## 
## $correlation
##             (Intercept)  workforce   hectares
## (Intercept)   1.0000000  0.3655876 -0.5140992
## workforce     0.3655876  1.0000000 -0.9525942
## hectares     -0.5140992 -0.9525942  1.0000000
## 
## $std.err
##                   [,1]
## (Intercept) 8.41547816
## workforce   0.06631152
## hectares    0.07542561
## 
## $cov.scaled
##             (Intercept)    workforce     hectares
## (Intercept)  70.8202727  0.204013691 -0.326320674
## workforce     0.2040137  0.004397218 -0.004764483
## hectares     -0.3263207 -0.004764483  0.005689023
## 
## $cov.unscaled
##               (Intercept)     workforce      hectares
## (Intercept)  0.1756808097  5.060880e-04 -8.094897e-04
## workforce    0.0005060880  1.090799e-05 -1.181905e-05
## hectares    -0.0008094897 -1.181905e-05  1.411252e-05

Pointer

Other functions: deviance(champagne_model), logLik(champagne_model), champagne_model$df

3, Structural Break and Changes over Time

Marcellus and the wine business - part 2

The R case

Goal

Import the data: year, average.price.per.bottle (average price per bottle, in euros), and grape.cost.per.kg (cost per kg of grapes, in euros).

Print the head of the dataset.

##   year average.price.per.bottle grape.cost.per.kg
## 1 1987                    11.06              3.32
## 2 1988                    11.15              3.66
## 3 1989                    11.82              4.12
## 4 1990                    12.48              4.88
## 5 1991                    12.94              4.57
## 6 1992                    11.50              3.65

The dataset is named champagne2 since we follow up on the previous case. This dataset is more inclusive and focuses on all wine appellations.

Plot the data.

Find out whether the dataset hides two subsets: a phenomenon known as a ‘structural break’.

Structural change

The cost of grapes represents 70% of the cost price per bottle. In 1990, the market was liberalized and prices became more volatile. Price hikes coupled with an economic downturn have shrunk demand since 1991.

This liberalization (structural break) had an impact on the relationship between average.price.per.bottle and grape.cost.per.kg. Our hypothesis: there are two regimes:

  • 1987-1991, market planning.
  • 1992-2002, market liberalization.

Pointer

`year` is optional. We deal with a cross-sectional dataset indexed by a date marker that helps subset it. The ideal dataset here would be a time series.

Run the regressions.

1987-2002.

## 
## Call:
## lm(formula = average.price.per.bottle ~ grape.cost.per.kg, data = champagne2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94270 -0.35698  0.03706  0.53772  0.64311 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.340      1.142   4.674 0.000358 ***
## grape.cost.per.kg    1.542      0.297   5.192 0.000137 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5417 on 14 degrees of freedom
## Multiple R-squared:  0.6582, Adjusted R-squared:  0.6337 
## F-statistic: 26.95 on 1 and 14 DF,  p-value: 0.0001366

1987-1991, market planning.

## 
## Call:
## lm(formula = average.price.per.bottle ~ grape.cost.per.kg, data = champagne2_pre)
## 
## Residuals:
##       1       2       3       4       5 
##  0.1024 -0.2089 -0.0818 -0.3188  0.5071 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         7.0390     1.2150   5.793   0.0102 *
## grape.cost.per.kg   1.1803     0.2928   4.031   0.0274 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.374 on 3 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.7922 
## F-statistic: 16.25 on 1 and 3 DF,  p-value: 0.02745

1992-2002, market liberalization

## 
## Call:
## lm(formula = average.price.per.bottle ~ grape.cost.per.kg, data = champagne2_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8420 -0.3452 -0.0034  0.5469  0.7001 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         4.7690     2.0949   2.277   0.0488 *
## grape.cost.per.kg   1.6702     0.5659   2.951   0.0162 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5893 on 9 degrees of freedom
## Multiple R-squared:  0.4918, Adjusted R-squared:  0.4353 
## F-statistic: 8.709 on 1 and 9 DF,  p-value: 0.01619

We can cut through all these results with a Chow test (an F-test) of the ‘no structural break’ hypothesis, under which the first segment’s coefficients are equal to the second segment’s coefficients.

## [1] 0.004132164

The p-value is below 1%. The Chow test rejects the ‘no structural break’ hypothesis.
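Since the code is hidden, one way to reproduce such a test is strucchange's sctest; the breakpoint index (point = 5, the last observation of 1987-1991) is our assumption:

```r
library(strucchange)

# Chow test of 'no structural break' at the 1991/1992 boundary
# (point = 5: the break follows the 5th observation; an assumption)
sctest(average.price.per.bottle ~ grape.cost.per.kg,
       data = champagne2, type = "Chow", point = 5)
```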

Conclusion

We have two subsets rather than one dataset.

Visualize the two regimes: the restricted model vs. the two unrestricted models.

We can see that the ‘pre’ intercept is higher. Under market planning, prices are higher and the market is more stable. For any cost price, we get a higher price, which translates into more predictable revenues for producers.

Under market liberalization, prices are lower, but the variance is higher, as the observations are more widely dispersed, accounting for more volatility.

Pointer

There are several types of structural breaks: dissimilar regressions (the case above), with different intercepts and slopes; concurrent regressions, with a common intercept but different slopes; and parallel regressions, with a common slope but different intercepts.

Volatility

Volatility is measured with the coefficients’ standard error and the regression’s \(\sigma\). Standard error and \(\sigma\) are related to variance and standard deviation, two measures of volatility.

Pre.

## $coefficients
##                   Estimate Std. Error  t value   Pr(>|t|)
## (Intercept)       7.038981  1.2150274 5.793269 0.01023226
## grape.cost.per.kg 1.180297  0.2928131 4.030888 0.02744966
## $sigma
## [1] 0.3739763

Post.

## $coefficients
##                   Estimate Std. Error  t value   Pr(>|t|)
## (Intercept)       4.769050  2.0948718 2.276535 0.04883840
## grape.cost.per.kg 1.670211  0.5659492 2.951167 0.01619019
## $sigma
## [1] 0.5892843

champagne2_model_pre has a higher intercept (higher average price), a flatter slope (less sensitive to the cost of grapes), and lower \(\sigma\) (less volatility).

We can also compare models with the Root Mean Squared Errors or RMSE. The RMSE is the standard deviation of the residuals (the spread of the points about the fitted regression line), a good measure of volatility.

\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}}\]
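The formula translates directly into a small helper (a sketch; which fitted model and subset the hidden code applies it to is not shown):

```r
# RMSE of a fitted lm: the standard deviation of the residuals,
# where residuals(fit) returns the y_i - yhat_i
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
```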

Pre.

## [1] 1.098646

Post.

## [1] 1.361421

champagne2_model_pre has a lower RMSE (less volatility, better fit).

Extension

Finding the break

Most of the time, a visual exploration reveals the structural break. However, we might want analytical techniques to find it. The strucchange package offers several functions to deal with structural breaks (fit, plot, test), including generalized fluctuation tests to detect breaks. For example, we can compute the CUSUM and MOSUM processes with the efp function (empirical fluctuation processes). Other functions emphasize deviations from the norm (boundaries). We can implement Chow (F-) tests as well.
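A sketch of those strucchange tools on the champagne data (assuming the package is installed):

```r
library(strucchange)

# Empirical fluctuation process: recursive CUSUM of the residuals
cusum <- efp(average.price.per.bottle ~ grape.cost.per.kg,
             data = champagne2, type = "Rec-CUSUM")
plot(cusum)    # a path crossing the boundaries signals instability
sctest(cusum)  # significance test on the fluctuation process

# Estimate the break date(s) analytically
breakpoints(average.price.per.bottle ~ grape.cost.per.kg, data = champagne2)
```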

Double linear regressions and more

If the deviation falls back in line following a second structural break, we might have to decide whether to deal with three subsets (three regressions) or to keep a single dataset (one regression).

Modeling the break

Just like an interaction effect, each subset might have its own intercept, its own slope, or both. We can capture this change with a dummy variable. We will cover dummy variables in another case.

Piecewise linear regressions

Alternatively, we can consider the piecewise time series (a general class of spline functions). The model is:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 (X_1 - X^*) D + \epsilon\]

Where \(X^*\) is the knot (the hinge or the joint).

There are different forms of piecewise linear regressions.

Other than the model above, there are multiple ways to handle piecewise regressions. It all depends on what we want to achieve. We can then consider polynomial forms, use spline regressions, work with time-series functions (moving average for example), add dummy variables to capture the interaction effects, slope changes and intercept jumps, etc. We can also use the segmented package for segmented regressions.
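A sketch with the segmented package; the starting knot psi = 4 (a grape cost between the two regimes) is an assumption:

```r
library(segmented)

# Piecewise linear fit: one slope before the knot X*, another after
base_fit <- lm(average.price.per.bottle ~ grape.cost.per.kg, data = champagne2)
seg_fit  <- segmented(base_fit, seg.Z = ~grape.cost.per.kg, psi = 4)
summary(seg_fit)  # reports the estimated knot and the slope change
```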

4, Correlation, Model, and Variable Selection

Marcellus and real estate

The R case

Goal

Import the data: a sample of 11 buildings.

Print the head of the dataset.

##   real_estate_value  area offices entrances age
## 1            213000 25156       2       2.0  20
## 2            210000 25406       2       2.0  12
## 3            226500 25657       3       1.5  33

real_estate_value (in $US) is determined by the area in square feet, the number of offices and entrances, and the building’s age in years.

Discover the undervalued buildings. This is an exploratory analysis to reduce the number of targets before proceeding to deeper analyses (say we have a larger set of 100 buildings and we want to focus on the 10 best).

Correlation

Check the explanatory variables’ independence with a Pearson correlation matrix,…

##                   real_estate_value      area    offices  entrances
## real_estate_value         1.0000000 0.3593228  0.8832723  0.5162754
## area                      0.3593228 1.0000000  0.2234715  0.6204393
## offices                   0.8832723 0.2234715  1.0000000  0.3107144
## entrances                 0.5162754 0.6204393  0.3107144  1.0000000
## age                      -0.4089111 0.2214926 -0.0522657 -0.0505677
##                          age
## real_estate_value -0.4089111
## area               0.2214926
## offices           -0.0522657
## entrances         -0.0505677
## age                1.0000000

…with a Spearman correlation matrix,…

##                   real_estate_value      area   offices   entrances
## real_estate_value        1.00000000 0.3272727 0.9045340 0.294510775
## area                     0.32727273 1.0000000 0.1909572 0.569709369
## offices                  0.90453403 0.1909572 1.0000000 0.178809772
## entrances                0.29451078 0.5697094 0.1788098 1.000000000
## age                     -0.08200477 0.3325749 0.1989476 0.004839056
##                            age
## real_estate_value -0.082004769
## area               0.332574895
## offices            0.198947637
## entrances          0.004839056
## age                1.000000000

…and with a Kendall correlation matrix.

##                   real_estate_value      area   offices  entrances
## real_estate_value        1.00000000 0.2000000 0.8090398 0.23164312
## area                     0.20000000 1.0000000 0.1797866 0.44222778
## offices                  0.80903983 0.1797866 1.0000000 0.15617376
## entrances                0.23164312 0.4422278 0.1561738 1.00000000
## age                     -0.03669879 0.2201928 0.1587632 0.04250511
##                           age
## real_estate_value -0.03669879
## area               0.22019275
## offices            0.15876322
## entrances          0.04250511
## age                1.00000000
  • With the Pearson matrix, the independent variables are not correlated except for area and entrance: over 0.60.
  • With the Spearman matrix, area and entrance: rounded to 0.57.
  • With the Kendall matrix, area and entrance: rounded to 0.44.
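The three matrices come from the same call with different method arguments; a sketch, assuming the data frame is named building (as in the later cor.test output):

```r
# Pairwise correlations under the three methods
cor(building, method = "pearson")
cor(building, method = "spearman")
cor(building, method = "kendall")
```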

Test the ‘no correlation’ hypothesis (t-test) among the variables (based on the Pearson correlation).

##                   real_estate_value area offices entrances   age
## real_estate_value              1.00 0.36    0.88      0.52 -0.41
## area                           0.36 1.00    0.22      0.62  0.22
## offices                        0.88 0.22    1.00      0.31 -0.05
## entrances                      0.52 0.62    0.31      1.00 -0.05
## age                           -0.41 0.22   -0.05     -0.05  1.00
## 
## n= 11 
## 
## 
## P
##                   real_estate_value area   offices entrances age   
## real_estate_value                   0.2778 0.0003  0.1040    0.2118
## area              0.2778                   0.5089  0.0417    0.5128
## offices           0.0003            0.5089         0.3524    0.8787
## entrances         0.1040            0.0417 0.3524            0.8826
## age               0.2118            0.5128 0.8787  0.8826

The top matrix confirms the correlation between area and entrances (0.62). The bottom matrix displays the p-values; the p-value for the area/entrances pair is below the 5% significance level. We reject the ‘no correlation’ hypothesis.

More specifically, test the ‘no correlation’ hypothesis between area and entrance.

## 
##  Pearson's product-moment correlation
## 
## data:  building$area and building$entrance
## t = 2.3734, df = 9, p-value = 0.04168
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03275536 0.88932133
## sample estimates:
##       cor 
## 0.6204393

We reject the hypothesis and we confirm the previous results.
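The matrix of p-values above is consistent with Hmisc's rcorr output, and the pairwise test with cor.test; a sketch, assuming building holds the data:

```r
library(Hmisc)

rcorr(as.matrix(building))  # correlation matrix, n, and p-values

# t-test of the 'no correlation' hypothesis for the flagged pair
cor.test(building$area, building$entrances, method = "pearson")
```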

Visualize the pairwise correlations with a scatter plot matrix (or splom).

Try another splom.

Due to a limited sample (11 observations), these sploms do not display any distinctive patterns.

Plot a ‘correlogram’, which is a more intuitive representation of pairwise correlations.

The darker blue shade for a positive correlation and the two-thirds pie at the crossing of area and entrances reveal the positive correlation close to 0.60.

Multicollinearity

The presence of correlation is a sign of potential multicollinearity.

Compute three models:

  1. All explanatory variables.
  2. All but area.
  3. All but entrance.

Compare the results.
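A sketch of the three fits; building_model1 and building_model2 are the names used later in the text, building_model3 is our naming for the third:

```r
building_model1 <- lm(real_estate_value ~ area + offices + entrances + age,
                      data = building)               # 1. all regressors
building_model2 <- lm(real_estate_value ~ offices + entrances + age,
                      data = building)               # 2. all but area
building_model3 <- lm(real_estate_value ~ area + offices + age,
                      data = building)               # 3. all but entrances
```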

\(R^2\).

## $r.squared
## [1] 0.9903542
## $r.squared
## [1] 0.9704106
## $r.squared
## [1] 0.9801031

Adjusted-\(R^2\).

## $adj.r.squared
## [1] 0.9839236
## $adj.r.squared
## [1] 0.9577295
## $adj.r.squared
## [1] 0.9715758

F-statistic.

## $fstatistic
##    value    numdf    dendf 
## 154.0076   4.0000   6.0000
## $fstatistic
##    value    numdf    dendf 
## 76.52386  3.00000  7.00000
## $fstatistic
##    value    numdf    dendf 
## 114.9377   3.0000   7.0000

Coefficients, t-values and p-values.

## $coefficients
##                 Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 55266.300303 32140.372917  1.719529 1.363131e-01
## area            4.612014     1.309435  3.522140 1.248610e-02
## offices     19318.313733  1050.854689 18.383430 1.669850e-06
## entrances    3519.961492  1393.949896  2.525171 4.497194e-02
## age          -331.122208    34.852129 -9.500774 7.750398e-05
## $coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 167779.422 5744.52017 29.206864 1.420161e-08
## offices      19528.100 1701.25100 11.478671 8.561646e-06
## entrances     6573.309 1770.05110  3.713627 7.518847e-03
## age           -291.234   53.44688 -5.449036 9.569803e-04
## $coefficients
##                 Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)  8602.215670 34966.208964  0.2460151 8.127284e-01
## area            6.668365     1.363477  4.8907053 1.771817e-03
## offices     19865.617817  1367.265388 14.5294527 1.745583e-06
## age          -351.264281    45.112410 -7.7864225 1.082962e-04
  1. Model 1: we have strong \(R^2\) and adjusted-\(R^2\), a strong F-statistic, and all t-values are significant although entrances is the weakest coefficient.
  2. Model 2: without area, we can see an improvement in the significance of entrances while all the other parameters remain robust.
  3. Model 3: without entrances, model 3 has better statistics than model 2.

Test collinearity for the three models.

##      area   offices entrances       age 
##     FALSE     FALSE     FALSE     FALSE
##   offices entrances       age 
##     FALSE     FALSE     FALSE
##    area offices     age 
##   FALSE   FALSE   FALSE

All tests are negative (FALSE). We cannot reject the ‘no multicollinearity’ hypothesis.
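The TRUE/FALSE output is consistent with thresholding variance inflation factors; a sketch with car's vif, where the cutoff of 5 is an assumption:

```r
library(car)

# A VIF above the cutoff would flag a collinear regressor (TRUE)
vif(building_model1) > 5
```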

Again, a priori correlation or collinearity is simply an indicator of potential collinearity a posteriori.

Multicollinearity can be a problem. Can? We will address the issue in another case where we encounter multicollinearity again…

Select a model - Methods

Given the results, Model 1 appears to be the best.

However, before jumping to a conclusion, let’s cover the methods for selecting the best model.

When building a model, we want to measure if adding or removing variables improves the model.

First, we need benchmarks. Here is a checklist:

  • Seek consistency and theory-backed models: make sure coefficients have the right signs.
  • Do not omit relevant variables: underfitting the model. We would wind up with biased coefficients, invalid residuals, and misleading tests.
  • Do not include irrelevant variables: overfitting the model. Do not make spurious connections between explanatory variables and the dependent variable. We would get a false reading: while everything looks good, inflated variances make the coefficient estimates imprecise.
  • Use the right functional form: some relationships are not linear.
  • Use quality data and right measurements: garbage in is garbage out.
  • Maximize the predictive power: measure it with metrics such as \(R^2\), AIC, BIC, and Mallow’s \(C_p\).

Above all, we want to pick the simplest and most explanatory model of all, following the concept of ‘parsimony’.

Pointer

Occam's razor, the 'law of parsimony', is a problem-solving principle attributed to William of Ockham (c. 1287-1347). The principle can be interpreted as stating that, among competing hypotheses, the one with the fewest assumptions should be selected.

In science, Occam's razor is used as a heuristic technique (discovery tool) to guide scientists in the development of theoretical models.

Second, we need a procedure to run and compare models. Depending on the approach, it can be more or less automated. The following procedures focus on a few items from the checklist. Functional forms will be covered in other cases, but we will not address data quality.

Preprocessing data is a vast subject covering the detection, removal, and replacement of measurement errors, missing values, outliers, extreme values, etc.

Select a model - ANOVA

We compare two models in terms of coefficients: we test whether Model A’s coefficients are statistically different from Model B’s. If the models are not different, they belong to the same class, which narrows down the number of candidates. We then pick the best model in each class and compare the class winners before declaring the overall winner.

The models:

  1. All explanatory variables.
  2. All but area.
  3. All but entrance.

The comparisons:
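Each comparison below is a nested-model F-test via anova (a sketch; building_model1 and building_model2 as named in the text, building_model3 being our name for the third fit):

```r
anova(building_model1, building_model2)  # 1 vs. 2 (drop area)
anova(building_model1, building_model3)  # 1 vs. 3 (drop entrances)
```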

1 vs. 2.

## Analysis of Variance Table
## 
## Model 1: real_estate_value ~ area + offices + entrances + age
## Model 2: real_estate_value ~ offices + entrances + age
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1      6  38997951                              
## 2      7 119629281 -1 -80631329 12.405 0.01249 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1 vs. 3.

## Analysis of Variance Table
## 
## Model 1: real_estate_value ~ area + offices + entrances + age
## Model 2: real_estate_value ~ area + offices + age
##   Res.Df      RSS Df Sum of Sq      F  Pr(>F)  
## 1      6 38997951                              
## 2      7 80442941 -1 -41444990 6.3765 0.04497 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic is higher and more significant (lower p-value) in ‘1 vs. 2’ than in ‘1 vs. 3’. The F-test measures the statistical difference: the higher the F-statistic, the lower the p-value and the more significant the difference.

Model 1 differs more from Model 2 (lower p-value) than it does from Model 3.

Models 1 and 3 are ‘closer’ to each other. The class winner would still be Model 1 (better statistics). In the final round, Model 1, which happens to have the higher predictive power, wins over Model 3.

The ANOVA method can be tedious.

Select a model - Heuristic technique

Let’s automate things a bit.

We can employ the ‘backward’ method: we start from the most comprehensive model and remove variables one by one (first the least significant one, then a second, etc.) until we find an optimal model, beyond which removals only marginally improve the overall significance.

Another approach consists in introducing new variables in a similar manner and searching for the optimal model: the ‘forward’ method.

Nonetheless, the best procedure is the ‘stepwise’ method: we pick the first regressor (the one most correlated with the regressand), calculate the residuals, then pick a second regressor (the one most correlated with those residuals), and so on, until additions only marginally improve the overall significance.

Load the MASS package to implement the heuristic technique. The improvement criterion is to minimize the AIC coefficient (we will develop on AIC below).

The stepAIC function in package MASS offers a wider range of object classes as opposed to the minimal step, add1 and drop1 functions from the stats package.
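A sketch of the stepwise search, starting from the full model (building_model1 is our assumed name for it):

```r
library(MASS)

# Both directions: try removing and re-adding regressors, minimizing the AIC
building_step <- stepAIC(building_model1, direction = "both", trace = FALSE)
building_step$anova  # the 'Stepwise Model Path' table
```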

Print the results.

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
real_estate_value ~ area + offices + entrances + age

Final Model:
real_estate_value ~ area + offices + entrances + age


  Step Df Deviance Resid. Df Resid. Dev      AIC
1                          6   38997951 175.8924

For one thing, this approach is faster. For another, the results show that removing any variable would increase the AIC (a deterioration). Therefore, the best decision, to minimize the AIC, is not to remove any variable.

We would be better off if we could work with a larger sample (not 11 observations, but at least 30).

Nevertheless, ‘stepwise optimization’ is controversial: it is a heuristic approach, i.e. an ‘optimization’ normally applied to deterministic values, whereas statistics deals with estimated values and probabilities.

Select a model - Algorithmic technique 1

We should favour the ANOVA approach or the like. With the leaps package, we can add a ‘loop’ to speed up the procedure. Run a bunch of regressions using iterations and keep the 10 best.
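The call is visible in the output below; a sketch of how it could be run and ranked:

```r
library(leaps)

building_subsets <- regsubsets(real_estate_value ~ area + offices + entrances + age,
                               data = building, nbest = 10)
summary(building_subsets)
plot(building_subsets, scale = "adjr2")  # rank subsets by adjusted R-squared
```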

## Subset selection object
## Call: regsubsets.formula(real_estate_value ~ area + offices + entrances + 
##     age, data = building, nbest = 10)
## 4 Variables  (and intercept)
##           Forced in Forced out
## area          FALSE      FALSE
## offices       FALSE      FALSE
## entrances     FALSE      FALSE
## age           FALSE      FALSE
## 10 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          area offices entrances age
## 1  ( 1 ) " "  "*"     " "       " "
## 1  ( 2 ) " "  " "     "*"       " "
## 1  ( 3 ) " "  " "     " "       "*"
## 1  ( 4 ) "*"  " "     " "       " "
## 2  ( 1 ) " "  "*"     " "       "*"
## 2  ( 2 ) " "  "*"     "*"       " "
## 2  ( 3 ) "*"  "*"     " "       " "
## 2  ( 4 ) " "  " "     "*"       "*"
## 2  ( 5 ) "*"  " "     " "       "*"
## 2  ( 6 ) "*"  " "     "*"       " "
## 3  ( 1 ) "*"  "*"     " "       "*"
## 3  ( 2 ) " "  "*"     "*"       "*"
## 3  ( 3 ) "*"  "*"     "*"       " "
## 3  ( 4 ) "*"  " "     "*"       "*"
## 4  ( 1 ) "*"  "*"     "*"       "*"

The algorithm quickly produces the best regressions (up to 10 of each size); each regression can have up to 4 regressors. Each model is ranked according to one or more statistics. Rank the regressions according to \(R^2\), adjusted-\(R^2\), BIC, and AIC.

Let’s focus on two metrics: \(R^2\) and adjusted-\(R^2\). The most robust model (with the highest \(R^2\)) has four regressors, confirming the above findings. The second-best model removes entrances; this is Model 3.

The other metrics are:

  • BIC: Bayesian (Schwarz’s) Information Criterion.
  • AIC (from the stepAIC function): Akaike’s Information Criterion.
  • Mallow’s \(C_p\).

AIC and BIC are close relatives; they are suitable when comparing models fitted by maximum likelihood.

AIC and BIC are also good benchmarks for OLS. Since the Mallow’s \(C_p\) is estimated using OLS, it is equivalent to the AIC (and the BIC) in the special case of Gaussian (normal) linear regression or a MLE using a normal distribution.

As opposed to \(R^2\), the smaller these metrics, the better.

Pointer

The car package has the subsets function. It has similarities with the regsubsets function from the leaps package.

Select a model - Algorithmic technique 2

When we plot the Cook’s distance, we show the explanatory power of each observation in one model: the relative importance or leverage.

Focusing on Model 1, compute the relative importance, but for each regressor, with the relaimpo package. Choose four criteria to rank the regressors ('lmg', 'last', 'first', and 'pratt').
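A sketch of the relaimpo call (rela = TRUE matches the ‘normalized to sum to 100%’ note in the output; building_model1 is our assumed name for the full fit):

```r
library(relaimpo)

calc.relimp(building_model1,
            type = c("lmg", "last", "first", "pratt"),
            rela = TRUE)
```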

## Response variable: real_estate_value 
## Total response variance: 404298409 
## Analysis based on 11 observations 
## 
## 4 Regressors: 
## area offices entrances age 
## Proportion of variance explained by model: 99.04%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                  lmg       last      first      pratt
## area      0.06664414 0.02775291 0.09613542 0.06914137
## offices   0.64838551 0.75604618 0.58090227 0.69325371
## entrances 0.12150659 0.01426516 0.19846167 0.07095651
## age       0.16346376 0.20193575 0.12450063 0.16664841
## 
## Average coefficients for different model sizes: 
## 
##                     1X          2Xs          3Xs          4Xs
## area          8.696244     5.703305     4.404754     4.612014
## offices   21952.083333 20790.024850 19752.970125 19318.313733
## entrances 13351.127820 10701.658241  7309.851421  3519.961492
## age        -335.470519  -344.902991  -336.238494  -331.122208

No matter the criterion ‘type’, the relative importance ranking is: offices, age, entrances, and area. Again, area is the weakest link.

We could repeat this calculation on Model 2 and Model 3.

Types:

  • 'lmg'; the \(R^2\) contribution averaged over orderings among regressors.
  • 'last'; each variable's contribution when included last in the regression; sometimes called ‘usefulness’.
  • 'first'; each variable's contribution when included first, which is just the squared covariance between the regressand and the regressor.
  • 'pratt'; the product of the standardized coefficient and the correlation.
Pointer

There are other types = c('pmvd', 'betasq', 'genizi', 'car').

We can also use a ‘bootstrap measures of relative importance’.

Warning! This method is picky and will only work if it detects a ‘positive definite covariance matrix’ or a data matrix/data frame with linearly independent columns.

Now, we have detected collinearity a priori between area and entrance. building_model2 does not include area. Therefore, for the example, we will run the bootstrap with Model 2 to satisfy the condition.
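A sketch of the bootstrap (100 replicates and percentile intervals, matching the output; building_model2 as defined earlier):

```r
library(relaimpo)

building_boot <- boot.relimp(building_model2,
                             type = c("lmg", "last", "first", "pratt"),
                             rela = TRUE, b = 100)
booteval.relimp(building_boot)        # percentile confidence intervals
plot(booteval.relimp(building_boot))  # bar charts with CI whiskers
```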

Print the results.

Response variable: real_estate_value 
Total response variance: 404298409 
Analysis based on 11 observations 

3 Regressors: 
offices entrances age 
Proportion of variance explained by model: 97.04%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

                lmg       last     first     pratt
offices   0.6865683 0.75187003 0.6426873 0.7151843
entrances 0.1649901 0.07869662 0.2195701 0.1352301
age       0.1484416 0.16943335 0.1377426 0.1495856

Average coefficients for different model sizes: 

                  1X        2Xs       3Xs
offices   21952.0833 20682.3290 19528.100
entrances 13351.1278  9885.6865  6573.309
age        -335.4705  -306.6348  -291.234

 
 Confidence interval information ( 100 bootstrap replicates, bty= perc ): 
Relative Contributions with confidence intervals: 
 
                               Lower  Upper
                percentage 0.95 0.95    0.95   
offices.lmg      0.6866    AB_   0.3939  0.9022
entrances.lmg    0.1650    ABC   0.0235  0.4352
age.lmg          0.1484    _BC   0.0313  0.4081
                                               
offices.last     0.7519    A__   0.4459  0.9478
entrances.last   0.0787    _BC   0.0084  0.2798
age.last         0.1694    _BC   0.0258  0.4587
                                               
offices.first    0.6427    AB_   0.3328  0.9388
entrances.first  0.2196    ABC   0.0012  0.4876
age.first        0.1377    ABC   0.0013  0.4730
                                               
offices.pratt    0.7152    A__   0.4288  1.0301
entrances.pratt  0.1352    _BC  -0.0930  0.3918
age.pratt        0.1496    _BC  -0.1976  0.3729

Letters indicate the ranks covered by bootstrap CIs. 
(Rank bootstrap confidence intervals always obtained by percentile method) 
CAUTION: Bootstrap confidence intervals can be somewhat liberal. 

 
 Differences between Relative Contributions: 
 
                                        Lower   Upper
                        difference 0.95 0.95    0.95   
offices-entrances.lmg    0.5216         -0.0132  0.8523
offices-age.lmg          0.5381     *    0.0476  0.8706
entrances-age.lmg        0.0165         -0.3079  0.3197
                                                       
offices-entrances.last   0.6732     *    0.2918  0.9312
offices-age.last         0.5824     *    0.0141  0.9169
entrances-age.last      -0.0907         -0.4158  0.2019
                                                       
offices-entrances.first  0.4231         -0.1047  0.9012
offices-age.first        0.5049         -0.0946  0.9152
entrances-age.first      0.0818         -0.3997  0.4313
                                                       
offices-entrances.pratt  0.5800     *    0.1542  1.0946
offices-age.pratt        0.5656     *    0.0712  1.1860
entrances-age.pratt     -0.0144         -0.3206  0.5696

* indicates that CI for difference does not include 0. 
CAUTION: Bootstrap confidence intervals can be somewhat liberal.

All these numbers! Let’s make it simple and plot the results.

Bootstrapping turns each metric from the calc.relimp function into a distribution. For each method, the bar charts show the leverage per regressor (the % contribution to \(R^2\)). In addition, the fine lines show 95% confidence intervals (upper and lower limits).

Conclusion

Use ‘Model 1’ for valuation. Plot the results.

The model averages results. Accordingly, when the actual value is above the predicted value (undervalued by the model), the observation is likely to be a good candidate for further analyses. Buildings 1, 3, 4, 8, and 9 deserve closer attention. Why do they appear undervalued? Are they really undervalued?

We can also perform sensitivity analyses with the coefficients.

##  (Intercept)         area      offices    entrances          age 
## 55266.300303     4.612014 19318.313733  3519.961492  -331.122208

Run the analyses and explore the scenarios.

  • Base case (average real estate valuation).
## [1] 223759.1
  • What is the impact on the base case of increasing the area by 100 sq ft?
## [1] 461.2014
  • Compare the value between two buildings: age = 2 vs. age = 12.
## [1] 235649.4
## [1] 232338.2
  • Compare two buildings: age = 2 vs. age = 12, and increase the area by 100 sq ft.
## [1] 235649.4
## [1] 232799.4
  • Compare: age = 10 vs. age = 5, and decrease the area by 50 sq ft.
## [1] 233000.4
## [1] 234425.4
  • Compare the real estate value now and after 1 year.
## [1] 223759.1
## [1] 223428
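The scenarios above boil down to arithmetic on the fitted coefficients; a sketch, assuming the base case evaluates the model at the sample means:

```r
b <- coef(building_model1)

# Base case: prediction at the average regressor values (an assumption)
base_case <- sum(b * c(1, colMeans(building[c("area", "offices", "entrances", "age")])))

b["area"] * 100       # marginal impact of 100 extra sq ft
base_case + b["age"]  # value after one more year (the age coefficient is negative)
```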

We should not rely on results beyond 1 year. As a building ages, its value decreases (asset depreciation), but real estate valuation relies on many other factors (some with positive impacts) such as a good location, inflation, and the neighbourhood’s economic development. Keep in mind it is a snapshot of the situation.

Extension

Global analyses vs. partial analyses

One of the traps of multivariate regression is picking the wrong model even when the main indicators suggest goodness of fit and no violations of the assumptions. We must stress the importance of graphical analysis, as pointed out by Anscombe (Anscombe’s Quartet).

With multivariate regressions, we would also want to visualize partial relationships between the regressors and the regressand, and explore partial residuals:

  • Component-residuals plots or partial residual plots.
  • Added-variable plots.
  • Partial residuals plots.
  • Leverage plots.

Marcellus and the wine business - part 1

Let’s step back and review our first case: champagne_model. It is a simpler multivariate regression to begin with.
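The four plot families listed above map to functions in the car package; a sketch on champagne_model:

```r
library(car)

crPlots(champagne_model)        # component + residual (partial residual) plots
avPlots(champagne_model)        # added-variable plots
residualPlots(champagne_model)  # partial residuals with Tukey curvature tests
leveragePlots(champagne_model)  # leverage plots
```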

  • Component-residuals plots:

A partial residual plot attempts to model the residuals of one regressor against the regressand. A component residual plot adds a line indicating where the line of best fit lies. A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.

While the hectares lines are similar, the workforce lines are a bit off.

  • Added-variable plots:

What happens to a regression coefficient when the other variables are held constant (ceteris paribus)? For each variable, we compute an OLS fit; with intercept zero and slope \(\beta_i\). The influences of individual data values on the estimation of a coefficient are easy to see.

hectares has a positive influence, while workforce has a negative one. As a reminder, collinearity is higher between sales_m and hectares than between sales_m and workforce.

  • Partial residuals plots:

##            Test stat Pr(>|t|)
## workforce     -2.124    0.078
## hectares      -2.306    0.061
## Tukey test    -2.250    0.024

This is the contribution of each regressor to the model variability (fitted values vs. residuals). hectares introduces less variability in the model than workforce.

In addition, we get a Tukey t-test. The test compares all possible pairs of means under a ‘no difference’ hypothesis. We can reject the equality hypothesis at the 5% level; the difference between the means is statistically significant.

  • Leverage plots:

This is the explanatory power of each regressor to the model. High-leverage points have the potential to cause larger changes in the regressand estimation when they are deleted (measured on the y-axis). hectares vs. sales_m demonstrates much more sensitivity than workforce vs. sales_m.

Marcellus and real estate

In the light of the previous explanations, run the same plots with building_model1.

##            Test stat Pr(>|t|)
## area          -0.302    0.775
## offices       -0.907    0.406
## entrances     -0.736    0.495
## age           -0.646    0.547
## Tukey test    -0.244    0.807

Keep in mind that we have a rather small sample (n < 30) to draw conclusions.

5, Multicollinearity (again), Hypothesis Testing

Marcellus and the holding - part 1

The R case

Goal

Import the data: capitalization (in millions of Euros) of French corporations and their respective Free Cash Flow (FCF in millions of Euros) and Net Debt/Equity ratio (D.E or financial leverage).

Print the head of the dataset.

##    companies capitalization  FCF  D.E
## 1 Air France           1580   34 0.67
## 2    Alcatel           3024   30 1.03
## 3     Alstom            946 -986 0.69

Find the impact of heavy indebtedness on a corporation’s capacity (reflected by the market capitalization).

Model

Run the regression; Free Cash Flow and the D/E ratio are the regressors.

Routine tests

Test for multicollinearity.

##   FCF   D.E 
## FALSE FALSE

We quickly rule out one problem.
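The TRUE/FALSE output above presumably comes from thresholding variance inflation factors; VIFs can be computed by hand in base R. A sketch on simulated (independent) stand-ins for FCF and D.E:

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# regressor j on the other regressors.
set.seed(7)
FCF <- rnorm(100, mean = 100, sd = 300)  # simulated, not the case data
D.E <- rnorm(100, mean = 1, sd = 0.3)

vif_fcf <- 1 / (1 - summary(lm(FCF ~ D.E))$r.squared)
vif_de  <- 1 / (1 - summary(lm(D.E ~ FCF))$r.squared)

# A common rule of thumb flags VIF > 5 (or > 10) as problematic.
c(FCF = vif_fcf > 5, D.E = vif_de > 5)
```

With only two regressors, both VIFs equal \(1/(1 - r^2)\), where \(r\) is their correlation; the car package's vif function generalizes this to any number of regressors.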

Test the coefficients. With summary.

## 
## Call:
## lm(formula = capitalization ~ FCF + D.E, data = companies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2473.2 -1556.9   223.2  1069.0  3128.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  7225.432   2612.950   2.765  0.02788 * 
## FCF             5.810      1.112   5.226  0.00122 **
## D.E         -5332.840   2675.653  -1.993  0.08650 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2201 on 7 degrees of freedom
## Multiple R-squared:  0.8043, Adjusted R-squared:  0.7483 
## F-statistic: 14.38 on 2 and 7 DF,  p-value: 0.003318

With anova.

## Analysis of Variance Table
## 
## Response: capitalization
##           Df    Sum Sq   Mean Sq F value   Pr(>F)   
## FCF        1 120070030 120070030 24.7893 0.001603 **
## D.E        1  19240970  19240970  3.9724 0.086495 . 
## Residuals  7  33905304   4843615                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All p-values are below the 5% level except for the D.E coefficient, which is significant only at the 10% level. \(R^2\) and adjusted-\(R^2\) are strong.

Normality, linearity, etc.

Plot the residuals.

  • Left: residuals are distributed on both sides of y = 0 and standardized residuals are grouped together (‘3’, ‘9’ and ‘10’ appear to be outliers). With such a small sample, it is difficult to conclude that the residuals are normally distributed. We might have signs of heteroscedasticity and autocorrelation (abnormal variance of the residuals).
  • Top right: residuals are fairly well distributed on both sides of the Q-Q line even though some values stray away (‘3’, ‘9’ and ‘10’).
  • Bottom right: ‘3’ appears to be the most influential outlier.

Plot the Cook’s distance.

Q-Q plots are good all-round visualizations. Plot more Q-Q charts…

…about the model studentized residuals…

…about the residuals…

…about the predicted (‘fitted’) values.

Plot a distribution of the residuals (using a histogram and a density plot).
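All of these diagnostics can be produced with base R; a sketch on simulated data (x and y are made up):

```r
# Residual diagnostics for a fitted lm.
set.seed(3)
x <- rnorm(30)
y <- 1 + x + rnorm(30)
fit <- lm(y ~ x)

r_student <- rstudent(fit)        # studentized residuals
cook      <- cooks.distance(fit)  # influence of each observation

# Q-Q plot of the studentized residuals...
qqnorm(r_student); qqline(r_student)
# ...and a histogram + density plot of the raw residuals.
hist(residuals(fit), freq = FALSE, main = "Residuals")
lines(density(residuals(fit)))
```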

Extend the summary statistics.

## 
## Call:
## lm(formula = capitalization ~ FCF + D.E, data = companies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2473.2 -1556.9   223.2  1069.0  3128.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  7225.432   2612.950   2.765  0.02788 * 
## FCF             5.810      1.112   5.226  0.00122 **
## D.E         -5332.840   2675.653  -1.993  0.08650 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2201 on 7 degrees of freedom
## Multiple R-squared:  0.8043, Adjusted R-squared:  0.7483 
## F-statistic: 14.38 on 2 and 7 DF,  p-value: 0.003318
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = companies_model) 
## 
##                      Value  p-value                   Decision
## Global Stat        8.84147 0.065187    Assumptions acceptable.
## Skewness           0.04412 0.833632    Assumptions acceptable.
## Kurtosis           0.67686 0.410669    Assumptions acceptable.
## Link Function      8.08864 0.004454 Assumptions NOT satisfied!
## Heteroscedasticity 0.03184 0.858371    Assumptions acceptable.

We cannot conclude nonnormality or nonlinearity because of the mixed results (both visual and statistical).

A larger sample size would certainly improve the results.

We can rule out heteroscedasticity (changing variance) though.

Conclusion

Plot the results.

Market Capitalization is a proxy variable for measuring value creation. According to the model, some corporations, taking their indebtedness into account, create more value with their Free Cash Flow (investments).

We would need additional analyses on potential good investments; where actuals (blue) are above the predicted value (red).

Extension

A word on multicollinearity

Multicollinearity is not a major flaw. Of course, it inflates the variance of the coefficient estimates, making them untrustworthy.

We must keep in mind that sometimes a cure can be worse than the disease. Reducing collinearity may not justify the time and resources required to do so.

Generally, for the purposes of forecasting, we can disregard collinearity. If, however, the validity of coefficients is crucial (as with price elasticity for example), we must try to reduce collinearity.

Pointer

Forecasting means predicting or extrapolating. We can extrapolate forward or backward. 

We can also rebuild a truncated dataset by interpolating the results. It is often the case when we re-engineer missing values, replace outliers or recompute measurement errors.

Forecasting

What would be the market capitalization of a corporation with 183 millions in FCF and a D/E ratio of 1? Add it to the plot.
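A sketch of the forecast with predict; since only the head of the case dataset is shown, the model is refitted here on simulated data with similar magnitudes:

```r
# Point forecast (with a 95% prediction interval) for a new observation.
set.seed(5)
companies_sim <- data.frame(FCF = rnorm(10, 0, 400),
                            D.E = runif(10, 0.3, 1.5))
companies_sim$capitalization <- 7000 + 6 * companies_sim$FCF -
  5000 * companies_sim$D.E + rnorm(10, sd = 1500)

fit <- lm(capitalization ~ FCF + D.E, data = companies_sim)

newdata <- data.frame(FCF = 183, D.E = 1)
predict(fit, newdata, interval = "prediction", level = 0.95)
```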

Global analyses vs. partial analyses (again)

Plot partial relationships between regressors and the regressand and partial residuals:

  • Component-residuals plots or partial residual plots.
  • Added-variable plots.
  • Partial residuals plots.
  • Leverage plots.

##            Test stat Pr(>|t|)
## FCF            2.270    0.064
## D.E            0.902    0.402
## Tukey test     5.039    0.000

In light of these plots, we confirm that D.E brings less variability to the model, but also less value (less impact).

We introduce a new plot for partial residuals: the CERES plots for nonlinearity. Run the commands.

CERES plots are a generalization of component+residual (partial residual) plots. They are less prone to leakage of nonlinearity among the predictors.

As a reminder, a c+r plot adds a line indicating where the line of best fit lies. A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.

C+r plots can fail if there are nonlinear relationships among the regressors. The CERES plot shows a curve in the relationship of the regressand to a regressor, in spite of any nonlinear relationships among the regressors. If these relationships are linear, a c+r plot is obtained. However, if the relationships are nonlinear, the more general CERES plot is obtained.

6, Qualitative Variables as Explanatory Variables

Marcellus and the holding - part 2

The R case

What is a qualitative regressor

We can use continuous and categorical/discrete regressors and regressands. In computer science, a continuous number would be a ‘float’ and a discrete number would be an ‘integer’. Sometimes, discrete numbers can have a limited scope. We call these variables ‘qualitative variables’:

  • Multinomial (A to E),
  • Ordinal (1st to 3rd),
  • Binomial (1 or 0).

So far, most of the models involved continuous regressands (sales, price, market capitalization) and a mix of continuous (costs, area, FCF) and discrete (workforce, number of offices and entrances, age) regressors.

We now involve a qualitative regressor: a binomial regressor, more commonly known as a dummy variable.
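A minimal sketch of how R encodes such a dummy: a two-level factor contributes a single 0/1 column to the design matrix (the data here are made up):

```r
# A two-level factor becomes one 0/1 column in the design matrix.
d <- data.frame(y = c(10, 12, 20, 23),
                state = factor(c(0, 0, 1, 1)))
mm <- model.matrix(y ~ state, data = d)
mm
# The 'state1' column is 0 for the reference level (state == 0) and 1
# otherwise, so its coefficient shifts the intercept for state == 1.
```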

Goal

We have a dataset of large French corporations: their market capitalization (in millions of Euros), their operational_profits (in millions of Euros) and whether the government, the state, is the majority stockholder (the dummy variable equals 1).

Print the head of the dataset.

##     companies capitalization operational_profit state
## 1 Air Liquide      163122205            1661800     0
## 2      Danone      201318713            1874000     0
## 3   Carrefour      263273826            3274000     0

Find out if state-owned corporations are better capitalized.

Regression

Run a simple linear regression and print the results.

## 
## Call:
## lm(formula = capitalization ~ operational_profit, data = companies2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -75256232 -27827402  16846591  20573067  60005399 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.291e+08  2.254e+07  10.163 5.28e-05 ***
## operational_profit 5.598e+00  2.321e+00   2.413   0.0524 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46340000 on 6 degrees of freedom
## Multiple R-squared:  0.4924, Adjusted R-squared:  0.4078 
## F-statistic:  5.82 on 1 and 6 DF,  p-value: 0.0524

Add a dummy (state) to the model, run another regression, and print the results.

## 
## Call:
## lm(formula = capitalization ~ operational_profit + state, data = companies2)
## 
## Residuals:
##         1         2         3         4         5         6         7 
## -54379604 -17372929  36732201  -9602252  38693097  25280889  -3672765 
##         8 
## -15678637 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.082e+08  2.104e+07   9.893  0.00018 ***
## operational_profit 5.607e+00  1.887e+00   2.972  0.03109 *  
## state1             5.555e+07  2.751e+07   2.019  0.09946 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37670000 on 5 degrees of freedom
## Multiple R-squared:  0.7204, Adjusted R-squared:  0.6086 
## F-statistic: 6.442 on 2 and 5 DF,  p-value: 0.04134

Adding state improves the model. The \(R^2\) and adjusted-\(R^2\) are better.

Coefficients are good: profits are positively related to capitalization. The state coefficient indicates an upswing in market capitalization when the State is the majority shareholder. We have significant p-values at the 5% level, except for the state variable, which is significant only at the 10% level (weak).

Plot the results.

Conclusion

Show the data distributions.

State-owned corporations are more profitable and better capitalized. These corporations are public - have floating stocks on a public exchange - but the majority shareholder is the government.

Nevertheless, buying stocks and building a portfolio depends on many other criteria: dividend payouts, levels of risk, sectors of activity, correlations with other industries, etc.

Extension

We have barely scratched the surface of dummy variables.

Here is a list of other possibilities with dummy variables (D), with examples:

  1. ANOVA models where a regressand is a function of D. For example, we want to test salary discrimination between department A and department B.
  2. ANCOVA models where we have a continuous/discrete regressor and a D. For example, we want to test salary discrimination between department A and department B while taking years of experience (X) into account. If we add another D, we could also account for gender.
  3. Structural models where a D tags a situation. Remember the wine case (champagne2). We could have tagged the market regimes (0 = pre, 1 = post).
  4. We can also tag a seasonal pattern (for example, ‘Back to school’, Christmas). D are abundantly used with times series and panel data.
  5. Comparison models. For example, we have two cross sections or two time periods. We want to fit the right model to the two datasets. There can only be one significant model. We have a base case.

And we have four alternatives:

  • ‘Model 5.1’ has only normal regressors. We would obtain two coincident regressions from each dataset (similar intercepts and slopes).
  • ‘Model 5.2’ adds one D. We would have two parallel regressions with distinct intercepts.
  • ‘Model 5.3’ has an interaction effect X*D. We would have concurrent regressions with two distinct slopes.
  • ‘Model 5.4’ has both X*D and D. We would have dissimilar intercepts, slopes, and regressions.

  6. Two D terms produce an additive effect.
  7. X*D is an interaction or multiplicative effect.
  8. D*D is also an interaction effect, where the second D amplifies the first D. There are tests for these interaction or multiplicative effects.
  9. In log-lin models, interpret the coefficient of a D with \(e^{\beta_D} - 1\).
  10. D can appear in other functional forms (polynomial for example).
  11. D can replace the intercept. For example, say we have a categorical variable with 3 levels (m = 3). With a common intercept, we need (m - 1) D and we compare the results with the reference category. With m D, we run the regression through the origin; \(R^2\) then becomes meaningless.
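The four comparison alternatives translate directly into lm formulas; a sketch on simulated data (y, x and the dummy d are made up):

```r
# Comparison models with a dummy d.
set.seed(9)
d <- factor(rep(0:1, each = 20))
x <- runif(40)
y <- 1 + 2 * x + 0.5 * (d == "1") + 1.5 * x * (d == "1") + rnorm(40, sd = 0.1)

m1 <- lm(y ~ x)      # 5.1: coincident lines (no dummy)
m2 <- lm(y ~ x + d)  # 5.2: parallel lines, distinct intercepts
m3 <- lm(y ~ x:d)    # 5.3: concurrent lines, distinct slopes
m4 <- lm(y ~ x * d)  # 5.4: distinct intercepts and slopes (x + d + x:d)
```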

Finally, with D, be sure to test for heteroscedasticity.

7, Measuring/Forecasting with Dynamic Models: Distributed-Lag Models

Marcellus and the holding - part 3

The R case

Dynamic models

The next cases address patterns within a time period. Nevertheless, time series and forecasting form a vast field with its own techniques, such as finding local trends, capturing seasonality, smoothing time periods with moving averages, etc. Please refer to packages such as xts and zoo, and to the functions for ts objects. We want to focus on two things: distributed-lag models and the autoregressive process.

Goal

First, test the persistence of publicity expenses (in thousands of Euros) over sales (in thousands of Euros). In other words, measure the payoff of publicity expenditures made by an SME. Second, measure the persistence effect on sales. If there is one, how long does it last?

Dynamic models and lags

Dynamic models work best with time-series objects along with the zoo and xts packages.

Since the time period we are dealing with is rather short, we choose to work with a short cross-sectional dataset and generate leading/lagging variables (expenses) with a custom function. year is simply a label for subsetting the data.
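The custom function is not shown; one minimal way to write such a lag for a plain data-frame column (note that in the case dataset the first lagged values come from pre-sample periods, so no NA appears there):

```r
# Shift a vector down by n periods, padding the start with NA.
lag_n <- function(x, n = 1) c(rep(NA, n), head(x, -n))

expenses <- c(5, 7, 7, 9, 10, 12, 13, 17)  # from the printouts further down
lag_n(expenses, 1)
# [1] NA  5  7  7  9 10 12 13
```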

Pointer

With ts objects, use the diff and lag functions.

Visual exploration

Plot the data.

Plot the sales. This time, use a ‘local polynomial regression fitting’ (loess) to draw the line.

Now, use a ‘locally weighted scatter plot smoothing’ (lowess) to draw the line.

Pointer

Do not confuse loess with lowess.

With ts objects (from the stats package), we can apply autoregressive and moving-average smoothing with the filter function, and add leads/lags with the lag and diff functions.
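The loess/lowess distinction in a sketch on made-up data (x and y are simulated):

```r
# loess() fits a local-regression model object; lowess() is the older
# scatterplot smoother and simply returns x/y coordinates.
set.seed(11)
x <- 1:20
y <- sin(x / 3) + rnorm(20, sd = 0.1)

fit_loess <- loess(y ~ x)   # model object; predict() works on it
smooth_xy <- lowess(x, y)   # list with $x and $y, ready for lines()

plot(x, y)
lines(smooth_xy, col = "red")
lines(x, predict(fit_loess), col = "blue")
```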

Plot the relationships between the sales and expenses (a proxy for the regression).

Regular regression

Run the regression and print the results (Model 1).

## 
## Call:
## lm(formula = sales ~ expenses, data = publicity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9882  -9.4222  -0.6722   6.9764  17.2193 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.356     12.126   0.854    0.426    
## expenses      19.802      1.139  17.379 2.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.73 on 6 degrees of freedom
## Multiple R-squared:  0.9805, Adjusted R-squared:  0.9773 
## F-statistic:   302 on 1 and 6 DF,  p-value: 2.327e-06

The \(R^2\) and p-values all look good. Collinearity is not a problem in the current situation.

Plot the results with a 95% confidence intervals.

Distributed-lag models

Run a 1-period lag (Model 2) and a 2-period lag (Model 3) regressions and print the results.

##   month sales expenses expenses_lag1 expenses_lag2
## 1     1   106        5             4             3
## 2     2   137        7             5             4
## 3     3   160        7             7             5
## 4     4   180        9             7             7
## 5     5   214       10             9             7
## 6     6   250       12            10             9
## 
## Call:
## lm(formula = sales ~ expenses + expenses_lag1, data = publicity)
## 
## Residuals:
##       1       2       3       4       5       6       7       8 
##  1.7279  0.3423 -1.3392 -1.3841 -2.0880  1.5264  1.8225 -0.6080 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.7970     1.9507   2.459   0.0573 .  
## expenses       10.0224     0.6627  15.124 2.29e-05 ***
## expenses_lag1  12.3407     0.8048  15.334 2.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.854 on 5 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9994 
## F-statistic:  6161 on 2 and 5 DF,  p-value: 3.313e-09
## 
## Call:
## lm(formula = sales ~ expenses + expenses_lag1 + expenses_lag2, 
##     data = publicity)
## 
## Residuals:
##       1       2       3       4       5       6       7       8 
##  1.7396  0.4185 -1.3168 -1.5374 -2.0269  1.4739  1.8063 -0.5573 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.9505     2.7054   1.830 0.141255    
## expenses        9.9322     1.1985   8.287 0.001157 ** 
## expenses_lag1  12.2786     1.1085  11.077 0.000378 ***
## expenses_lag2   0.1781     1.8606   0.096 0.928363    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.071 on 4 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9993 
## F-statistic:  3294 on 3 and 4 DF,  p-value: 3.07e-07

After analyzing the \(R^2\) and p-values, the best model appears to be Model 2.

Use the visreg package to plot partial and complete regression results with confidence intervals.

Change the parameters and choose what to plot.

Conclusion

First, there appears to be a persistence effect. The expenses_lag1 coefficient is positive and higher than the expenses coefficient. ‘Positive’ means we have a persistence effect and ‘higher’ means the effect is increasing over time. Publicity is about making noise. Expenditures from the previous period have more effect on the current period than expenditures from the current period.

Second, the persistence effect only lasts one period. The expenses_lag2 coefficient is insignificant (both statistically and numerically). There is a sudden fall in the effect of publicity beyond 1 period.

Print the results for Model 2.

## (Intercept) 
##    4.796979
## expenses 
## 10.02243
## expenses_lag1 
##      12.34074
##   month sales expenses expenses_lag1 expenses_lag2   beta_0   beta_1
## 1     1   106        5             4             3 4.796979 10.02243
## 2     2   137        7             5             4 4.796979 10.02243
## 3     3   160        7             7             5 4.796979 10.02243
## 4     4   180        9             7             7 4.796979 10.02243
## 5     5   214       10             9             7 4.796979 10.02243
## 6     6   250       12            10             9 4.796979 10.02243
## 7     7   285       13            12            10 4.796979 10.02243
## 8     8   335       17            13            12 4.796979 10.02243
##       lag_0   beta_2     lag_1      sum
## 1  50.11217 12.34074  49.36295 104.2721
## 2  70.15704 12.34074  61.70369 136.6577
## 3  70.15704 12.34074  86.38516 161.3392
## 4  90.20191 12.34074  86.38516 181.3841
## 5 100.22434 12.34074 111.06664 216.0880
## 6 120.26921 12.34074 123.40737 248.4736
## 7 130.29165 12.34074 148.08885 283.1775
## 8 170.38139 12.34074 160.42959 335.6080

Plot these results.

Decompose the predicted sales knowing \(t\) is the current period and \(t-1\) is the previous period:

  • actual: actual sales for period \(t\).
  • intercept: \(\beta_0\), the predicted sales baseline.
  • lag_0: \(\beta_1 \times expenses_t\), the current period effect on predicted sales.
  • lag_1: \(\beta_2 \times expenses_{t-1}\), the previous period effect on predicted sales.
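The decomposition can be rebuilt from the Model 2 coefficients; the data below are transcribed from the printouts:

```r
# Decompose the Model 2 fitted values into intercept, current-period
# effect (lag_0) and previous-period effect (lag_1).
publicity <- data.frame(
  sales         = c(106, 137, 160, 180, 214, 250, 285, 335),
  expenses      = c(5, 7, 7, 9, 10, 12, 13, 17),
  expenses_lag1 = c(4, 5, 7, 7, 9, 10, 12, 13))

fit <- lm(sales ~ expenses + expenses_lag1, data = publicity)
b <- coef(fit)

lag_0 <- unname(b["expenses"]) * publicity$expenses            # current effect
lag_1 <- unname(b["expenses_lag1"]) * publicity$expenses_lag1  # carry-over
decomp <- data.frame(actual    = publicity$sales,
                     predicted = unname(b[1]) + lag_0 + lag_1)
```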

There are many reasons behind persistence:

  • Psychological reasons (habit, inertia).
  • Technological reasons (transition, substitution, gestation period).
  • Institutional reasons (contractual obligation, locked-in effect or vested period).

More proofs

We can counter-validate the choice of Model 2 with the Akaike and Bayesian (Schwarz’s) criteria. We encountered the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) before, when selecting the best real estate valuation model.

\[AIC = 2k - 2ln(\hat{L})\]

\[BIC = ln(n)k - 2ln(\hat{L})\]

Where \(\hat{L}\) is the maximized value of the likelihood function, \(k\) is the number of estimated parameters, and \(n\) is the number of observations.

We also add Mallows's \(C_p\) criterion.

\[C_p = \frac{SSE_p}{S^2} - n + 2p\]

Where \(SSE_p\) is the sum of squared errors with \(p\) regressors, \(S^2\) is the residual mean square of the model with all regressors, and \(n\) is the number of observations.

The smaller the criterion, the better. Compute the AIC and BIC:

  • Model 1: no lag.
  • Model 2: one lag.
  • Model 3: two lags.
## [1] 65.79748
## [1] 66.0358
## [1] 36.82331
## [1] 37.14107
## [1] 38.80501
## [1] 39.20222
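In R, AIC and BIC come straight from the fitted model object, and the definitions above can be verified against the log-likelihood. A sketch on simulated data:

```r
# AIC/BIC and their log-likelihood identities.
set.seed(2)
x <- rnorm(20)
y <- 2 + x + rnorm(20)
fit <- lm(y ~ x)

k   <- attr(logLik(fit), "df")  # estimated parameters (coefficients + sigma)
ll  <- as.numeric(logLik(fit))
aic <- 2 * k - 2 * ll
bic <- log(nobs(fit)) * k - 2 * ll

c(AIC(fit) - aic, BIC(fit) - bic)  # both zero
```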

Compute \(C_p\).

##         lik        infl        vari           n          cp 
## -176.609914    6.431493    5.452037    8.000000  358.082813
##       lik      infl      vari         n        cp 
## -7.444625  7.922790  6.807849  8.000000 22.734830
##        lik       infl       vari          n         cp 
## -49.100703   7.999999   7.999999   8.000000 106.201405

Model 2 has the lowest AIC, BIC and \(C_p\), confirming our previous choice.

All in all, use all the statistical measures, i.e., \(R^2\), AIC, BIC and Mallows's \(C_p\), for model selection.

AIC is the best all-round statistic. Many R functions are based on AIC: when evaluating generalized linear models (there will be a logit/probit case below), Weighted Least Squares (there will be a WLS case below), and we saw the use of stepAIC. extractAIC offers more parameters for computing generalized AIC.

Extension

There are other approaches for estimating distributed-lag models:

  • Assuming that \(\beta\) are all of the same sign, the Koyck approach assumes that they decline geometrically.
  • The Almon approach or polynomial distributed-lag.
  • We can run nonlinear distributed-lag models with the dlnm package.

8, Measuring/Forecasting with Dynamic Models: Autoregressive Models

Marcellus and the holding - part 4

The R case

Goal 1

Compute the average return over time. A stock or index may experience short-term periods of positive returns and downturns. Over the long run, what is the real return?

Print the head of the dataset.

##   year   index
## 1 1987 1000.00
## 2 1988 1573.94
## 3 1989 2001.08
## 4 1990 1517.93
## 5 1991 1756.66
## 6 1992 1857.78

Compute returns

Time-series objects and specialized packages such as zoo and xts are ideal for modeling leading and lagging variables and for computing differences and percentage differences such as returns. Nonetheless, we can work with cross-sectional data and a custom function.

Compute the return for each period, print the head of the dataset, and plot the results (a line plot and a chart with high-density vertical lines).

##   year   index   index_ret
## 1 1987 1000.00          NA
## 2 1988 1573.94  0.57394000
## 3 1989 2001.08  0.27138264
## 4 1990 1517.93 -0.24144462
## 5 1991 1756.66  0.15727339
## 6 1992 1857.78  0.05756379
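The returns above can be reproduced with diff; the index levels are transcribed from the head of the dataset:

```r
# Period-over-period returns from the index levels.
index <- c(1000.00, 1573.94, 2001.08, 1517.93, 1756.66, 1857.78)
index_ret <- c(NA, diff(index) / head(index, -1))
round(index_ret, 4)
# NA  0.5739  0.2714 -0.2414  0.1573  0.0576
```

With zoo or xts objects, the lag and diff methods make this a one-liner.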

Run a regression on the returns (only the intercept) and plot the results.

## 
## Call:
## lm(formula = index_ret[2:n] ~ 1, data = index)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51681 -0.09537  0.06726  0.14704  0.48392 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.09002    0.05695   1.581     0.13
## 
## Residual standard error: 0.261 on 20 degrees of freedom

Conclusion

The red line (the intercept) represents the positive average return over the time period. We are in the money (although, with this small sample, the mean return is not statistically significant at the 5% level).

Goal 2

Find out if the returns have an ARCH(1) process.

Autoregressive Conditional Heteroscedasticity (ARCH)

Homoscedasticity means the variance of the residuals is constant. Heteroscedasticity is a common feature of cross-sectional data.

On the other hand, when the residuals are correlated over time, we have serial correlation or autocorrelation, a common feature of time series.

When we have some kind of autocorrelation in the variance of errors, we have an Autoregressive Conditional Heteroscedasticity process. The ARCH(1) process:

\[e_i^2 = \beta_0 + \beta_1 e_{i-1}^2\]
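The auxiliary regression behind this equation can be sketched as follows (the residuals here are simulated stand-ins, not the ones from the returns model):

```r
# ARCH(1) test: regress squared residuals on their first lag.
set.seed(8)
e <- rnorm(21, sd = 0.26)  # stand-in for the model residuals

e_sq      <- e^2
e_sq_lag1 <- c(NA, head(e_sq, -1))

arch_fit <- lm(e_sq ~ e_sq_lag1)  # the NA row is dropped automatically
summary(arch_fit)$coefficients
# A significant slope (and F-statistic) would point to ARCH effects.
```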

Extract the residuals, transform them, and test for an ARCH(1) process.

## 
## Call:
## lm(formula = index_model1_residuals_sq[2:l] ~ index_model1_residuals_sq_lag1[2:l])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06326 -0.04587 -0.03136  0.01286  0.20369 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          0.06425    0.02176   2.952  0.00853
## index_model1_residuals_sq_lag1[2:l] -0.14326    0.24925  -0.575  0.57257
##                                       
## (Intercept)                         **
## index_model1_residuals_sq_lag1[2:l]   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07582 on 18 degrees of freedom
## Multiple R-squared:  0.01802,    Adjusted R-squared:  -0.03653 
## F-statistic: 0.3303 on 1 and 18 DF,  p-value: 0.5726

Only the intercept is significant at the 5% level. Significant p-values for the slope t-statistic and for the F-statistic would have indicated autocorrelation in the squared residuals.

Conclusion

There is no significant evidence of an ARCH(1) process.

If it had been the case, we would have to replace the OLS model with a Generalized Least Squares (GLS) model.

Volatility

Patterns in volatility are often observed in times of crises. By examining daily, even hourly returns, we can find these patterns.

A variant of the ARCH(1) process is the GARCH(1,1) process:

\[\sigma(\epsilon)_t^2 = \beta_0 + \beta_1 \epsilon_{t-1}^2 + \beta_2 \sigma(\epsilon)_{t-1}^2\]

If the coefficients were significant, the sum of \(\epsilon_{t-1}^2\) and \(\sigma(\epsilon)_{t-1}^2\) would measure persistence in the variance, i.e., the durability of volatility periods.

Pointer

- ARCH(p): the variance of residuals depends on the squared residuals from previous periods (1 period or more).
- GARCH(p,q): the variance of residuals depends on both the squared residuals and the variances from (the) previous period(s).
- Autoregressive Integrated Moving Average models: AR(p), MA(q), ARMA(p,q), ARIMA(p,d,q).
- Vector Autoregression (VAR) models: geometric lag, polynomial lag, etc.
- Please refer to the ts package and objects.

Extension

There are other autoregressive models:

  • The Koyck approach.
  • The adaptive expectation.
  • The partial adjustment models.

Dynamic models can also be estimated with instrumental variables.

Nevertheless, dynamic and time-series econometrics is a vast field of its own we cannot cover here.

9, Specification: the Polynomial Model to Measure/Forecast Profitability

Marcellus and the holding - part 5

The R case

Goal

When financial statements are not public, we have to resort to creative ways to gather market intelligence about companies. We would like to rank the profitability of a dozen of companies in one industry. Some companies are private and do not release their financial statements. The remaining companies are public. We have access to their financial statements. We also have a report about market shares in the industry. We might be able to establish a connection between market shares and profitability of the public companies. We could then apply the model to estimate the profitability of the private companies. First, find out the link between market shares and profitability in this industry.

The dataset is about six public manufacturers: market_share vs. profitability (profits as a percentage of sales).

Plot the data.

The linear form

Run the regression, print and plot the results.

## 
## Call:
## lm(formula = profitability ~ market_share, data = profitability)
## 
## Residuals:
##       1       2       3       4       5       6 
##  11.101   3.940  -7.517 -13.864  -2.508   8.848 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    8.7377     7.3199   1.194   0.2985  
## market_share   0.4322     0.1765   2.449   0.0705 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.86 on 4 degrees of freedom
## Multiple R-squared:  0.5999, Adjusted R-squared:  0.4999 
## F-statistic: 5.998 on 1 and 4 DF,  p-value: 0.07051

We get poor \(R^2\) and p-values.

Functional forms

There are situations where the connection between the regressand and the regressors is nonlinear:

  • Parabolic/quadratic: \(f(x) = x + x^2\); square root: \(f(x) = x + x^{0.5}\).
  • Cubic and higher polynomial forms.
  • Exponential: \(f(x) = e^{x}\).
  • Logarithmic: \(f(x) = ln(x)\).
  • Power: \(f(x) = x^{0.8}\), \(f(x) = x^{1.2}\).
  • Reciprocal: \(f(x) = \frac{1}{x}\).
  • etc.

For example, we saw above that publicity expenses can have a time-based persistent effect. The effect can also shift with the amount spent. Increasing expenses increases sales to a point where the marginal gains from additional expenses start declining (decreasing returns).

\[sales = \beta_0 publicity^{\beta_1}\]

Where \(\beta_0 > 1\) and \(0 < \beta_1 < 1\).

The polynomial form

In some industries, smaller manufacturers targeting niche markets or larger manufacturers dominating the market are more profitable than mid-size manufacturers. We have a U-shaped phenomenon.

Polynomial forms can be of many orders. Let’s try the quadratic form (order 2 or second degree):

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2\]

Create the new variable and print the head of the dataset.

##   market_share profitability market_share_sq
## 1            5            22              25
## 2           10            17             100
## 3           18             9             324
## 4           35            10            1225
## 5           55            30            3025
## 6           75            50            5625
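Since all six rows are printed above, the quadratic fit is fully reproducible:

```r
# Quadratic (second-degree) fit of profitability on market share.
profitability_df <- data.frame(
  market_share  = c(5, 10, 18, 35, 55, 75),
  profitability = c(22, 17, 9, 10, 30, 50))
profitability_df$market_share_sq <- profitability_df$market_share^2

fit <- lm(profitability ~ market_share + market_share_sq,
          data = profitability_df)
coef(fit)
```

The same model can be written without the helper column as lm(profitability ~ market_share + I(market_share^2), ...).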

Run the regression and print the results.

## 
## Call:
## lm(formula = profitability ~ market_share + market_share_sq, 
##     data = profitability)
## 
## Residuals:
##       1       2       3       4       5       6 
##  2.1178  0.6424 -3.5816 -2.1720  5.0516 -2.0581 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     24.302680   4.235373   5.738   0.0105 *
## market_share    -0.973672   0.294095  -3.311   0.0454 *
## market_share_sq  0.017917   0.003647   4.912   0.0162 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.171 on 3 degrees of freedom
## Multiple R-squared:  0.9558, Adjusted R-squared:  0.9263 
## F-statistic: 32.41 on 2 and 3 DF,  p-value: 0.009303

\(R^2\) are improved and all p-values are significant at the 5% level.

Plot the results.

Conclusion

The quadratic form captures the connection between market share and profitability in the industry under study.

However, by introducing a new variable in the model, we also introduced a violation to the OLS: collinearity between market_share and market_share_sq.

##    market_share market_share_sq 
##            TRUE            TRUE

We can tolerate multicollinearity for our purpose: interpolating. Interpolating is a form of forecasting. We do not measure price elasticity, so we have leeway with the coefficients.

Using the model and the market shares of 12 companies (from the report), we can interpolate the private manufacturers’ profitability. It is not a perfect picture, but a fair estimate.

Extension

A polynomial function can have (local) minima and maxima, also called extrema.

Our model is a second-degree function, and we can compute the market_share at the extremum with \(-\frac{\beta_1}{2 \beta_2}\).

Compute the extremum (the market_share at the turning point, then the corresponding profitability).

## [1] 27.17234
## [1] 11.07421
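These two numbers follow directly from the printed coefficients; a quick check:

```r
# Coefficients from the printed summary above
b0 <- 24.302680; b1 <- -0.973672; b2 <- 0.017917
ms_star   <- -b1 / (2 * b2)                      # market_share at the extremum
prof_star <- b0 + b1 * ms_star + b2 * ms_star^2  # profitability at the extremum
c(ms_star, prof_star)  # ~27.17 and ~11.07
```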

Plot the results with the extremum.

10, Specification: the Polynomial Model to Measure/Forecast Marginal Costs

Marcellus and the holding - part 6

The R case

Goal

Many big players in the industry are profitable. In the short term, this might not last. These companies have reached full capacity and are on the brink of restructuring. Their stocks used to be good investments; not anymore.

Short-term marginal costs and average costs tend to be U-shaped. According to the law of diminishing marginal returns, as production increases, unit costs first diminish. However, beyond a threshold, unit costs start increasing.

We want to show how smarter it would be to invest in smaller companies gaining ground. They might not be as profitable as the big players, but as production increases, these growing companies will improve their marginal costs to become more profitable. Their stocks should increase.

The dataset is about nine companies: marginal_cost (in thousands of Euros) vs. produced_quantity (in millions).

Plot the data.

The linear form

Run the regression.

## 
## Call:
## lm(formula = marginal_cost ~ produced_quantity, data = production)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.439 -2.789 -1.872  3.428 11.911 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        17.5222     5.7259   3.060   0.0183 * 
## produced_quantity   0.7567     0.1753   4.316   0.0035 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.79 on 7 degrees of freedom
## Multiple R-squared:  0.7269, Adjusted R-squared:  0.6878 
## F-statistic: 18.63 on 1 and 7 DF,  p-value: 0.003498

\(R^2\) are good and p-values are significant at the 5% level.

Plot the residuals.

The top-left plot shows U-shaped residuals. The top-right chart does not suggest linearity.

A better-fitting regression curve can be approximated with the lowess function.

The polynomial form

From all the plots above, we need a quadratic form (power 2, or second order). Create the new variable and print the head of the dataset.
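A sketch of the variable creation, with the data frame rebuilt from the six rows printed below (the full dataset has nine observations):

```r
# Six of the nine observations, taken from the printed head of the dataset
production <- data.frame(
  marginal_cost     = c(37, 27, 31, 27, 36, 42),
  produced_quantity = c(10, 15, 20, 25, 30, 35)
)
production$produced_quantity_sq <- production$produced_quantity^2
head(production)
```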

##   marginal_cost produced_quantity produced_quantity_sq
## 1            37                10                  100
## 2            27                15                  225
## 3            31                20                  400
## 4            27                25                  625
## 5            36                30                  900
## 6            42                35                 1225

Run the regression and print the results.

## 
## Call:
## lm(formula = marginal_cost ~ produced_quantity + produced_quantity_sq, 
##     data = production)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2567 -1.8909  0.7831  1.8745  3.3758 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          44.347619   6.169139   7.189 0.000366 ***
## produced_quantity    -1.438139   0.458603  -3.136 0.020172 *  
## produced_quantity_sq  0.036580   0.007511   4.870 0.002793 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.295 on 6 degrees of freedom
## Multiple R-squared:  0.9449, Adjusted R-squared:  0.9265 
## F-statistic:  51.4 on 2 and 6 DF,  p-value: 0.0001677

\(R^2\) and p-values are all improved.

Plot the residuals.

The top plots suggest a quadratic form.

Plot the results.

Conclusion

The model confirms the theory. In practice, when a company reaches its optimal production point, overcapacity wears out the production means: machines break down, maintenance costs skyrocket, employees get tired, turnover increases, managers get frantic, costly mistakes happen, etc. The bigger producers are entering this phase.

In the long run, a company can improve its productivity with technology and best practices such as the Lean approach. However, restructuring a company is also costly. We should wait until the end of the next restructuring cycle before considering investing in the big players, and rather focus on a couple of underdogs with bright prospects.

Extension

We could go further with a cubic model (power 3) to model total costs: as production increases, total costs of production increase. Following the decrease in marginal costs (returns to scale), total costs increase at a declining rate. As marginal costs start increasing again, total costs follow the same trend. Finally, both marginal costs and total costs end up increasing at an increasing rate.

Plot the data and approximate the regression form.

We have a polynomial form of order 3.

Create the new variables and print the head of the dataset.

##   output total_cost output_sq output_cu
## 1      1        193         1         1
## 2      2        226         4         8
## 3      3        240         9        27
## 4      4        244        16        64
## 5      5        257        25       125
## 6      6        260        36       216

Run the regression and print the results.

## 
## Call:
## lm(formula = total_cost ~ output + output_sq + output_cu, data = production2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4263 -0.7416  0.3744  1.4635  4.4350 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 141.76667    6.37532   22.24 5.41e-07 ***
## output       63.47766    4.77861   13.28 1.13e-05 ***
## output_sq   -12.96154    0.98566  -13.15 1.19e-05 ***
## output_cu     0.93959    0.05911   15.90 3.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.285 on 6 degrees of freedom
## Multiple R-squared:  0.9983, Adjusted R-squared:  0.9975 
## F-statistic:  1202 on 3 and 6 DF,  p-value: 1.001e-08

As expected, all \(\beta > 0\), except for \(\beta_2\) (output_sq).

Verify that \(\beta_2^2 < 3 \times \beta_1\beta_3\); this is the condition for total cost to be strictly increasing (the derivative, marginal cost, never turns negative).

## [1] 168.0015
## [1] 178.9286
## [1] TRUE
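The three printed values come from simple arithmetic on the coefficients; a quick check:

```r
# Coefficients from the printed summary above
b1 <- 63.47766; b2 <- -12.96154; b3 <- 0.93959
b2^2                 # ~168.00
3 * b1 * b3          # ~178.93
b2^2 < 3 * b1 * b3   # TRUE: marginal cost stays positive, so total cost keeps rising
```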

Plot the results.

11, Specification: the Growth Model to Measure/Forecast Stock Returns

Marcellus and the investment fund - part 1

The R case

Goal

We have a start-up with a rising stock price (the company recently went public).

Show how growth is accelerating over time.

The growth form - a log-lin model

We need a form to capture the stock price take-off:

\[Y = \beta_0 e^{\beta_1 T}\]

Where \(e\) is the base of the natural logarithm and \(T\) is time. In this kind of model, \(\beta_1\) represents the growth rate (positive or negative).

We linearize the form to obtain a log-lin model:

\[ln Y = ln \beta_0 + \beta_1 T + \epsilon\]

This log-lin model captures the fast growth of start-ups.

Pointer

We choose the natural logarithm (ln), which is log in R; its inverse is exp. We could also pick another logarithm with log(x, base), where base = 2 stands for binary and base = 10 for decimal. It depends on the situation, the analysis, and the know-how.

Linearize the data (taking the log) and plot it.

Run the regression and print the results.
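A sketch of the call behind this output. The stock values used here are reconstructed from the printed coefficients and residuals (quarterly values of 17, 28, 39, 51, and 70), so the fit reproduces the summary below; the object name `fit_loglin` is illustrative:

```r
# Quarterly stock prices, reconstructed from the printed residuals and coefficients
startup <- data.frame(quarter = 1:5, stock = c(17, 28, 39, 51, 70))
fit_loglin <- lm(log(stock) ~ quarter, data = startup)
coef(fit_loglin)  # intercept ~2.573, slope ~0.343
```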

## 
## Call:
## lm(formula = log(stock) ~ quarter, data = startup)
## 
## Residuals:
##        1        2        3        4        5 
## -0.08261  0.07336  0.06170 -0.01305 -0.03940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.57280    0.08064   31.90 6.77e-05 ***
## quarter      0.34302    0.02431   14.11 0.000771 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07689 on 3 degrees of freedom
## Multiple R-squared:  0.9851, Adjusted R-squared:  0.9802 
## F-statistic:   199 on 1 and 3 DF,  p-value: 0.0007715

We obtain high \(R^2\) and significant p-values at the 1% level.

Plot the results; the linearized model.

Plot the original model.

Conclusion

The \(\beta_1\) coefficient is the (instantaneous) quarterly growth rate (34%). With a single regressor, collinearity is not an issue. The rate is likely to work for short-term forecasts. Following the IPO, we are in the early growth phase.
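\(\beta_1\) is a rate on the log scale; converting it into a compound quarterly rate is one line:

```r
b1 <- 0.34302   # slope from the printed summary
exp(b1) - 1     # compound quarterly growth rate, ~41%
```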

In the longer term, the model will likely fail to capture future growth. Growth will slow down and numerous events will impact the start-up: financial crises, legal changes, competition, technological change, etc. The start-up might also miss its sales targets, causing investors to dump the stock.

Extension

The MASS package has functions for related models: loglm and loglm1 fit log-linear models (for contingency tables), as does loglin in base stats. Note that these are not the same as the log-lin growth model above.

There are two additional nonlinear models:

The lin-log model:

\[Y = \beta_0 + \beta_1lnX + \epsilon\]

The reciprocal model:

\[Y = \beta_0 + \beta_1 \bigg( \frac{1}{X} \bigg) + \epsilon\]

12, Specification: the Logistic Model to Model Lifecycles

Marcellus and the investment fund - part 2

The R case

Lifecycle

Market penetration, adoption of new technologies, and subscriber acquisition are all life-cycle situations.

A tech start-up is pushing an innovative product to the market.

We have data about the household_adoption_rate over time. We choose to work with cross-sectional data although it is a case for time series. Plot the data.

Goal

First, build a life-cycle model and show how slow beginnings can turn into an explosive growth, then slow down to reach a plateau.

Second, reverse-engineer the adoption rate from the regression coefficients. We have a truncated dataset: we miss the early phase, as it was impossible to collect the numbers. Extrapolate the adoption rate (prior to time = 1, before adoption reached 20%).

Third, forecast the adoption rate (extrapolate beyond time = 11) under the assumption that we can reach 100% of households. The important question we seek to answer is whether sales have passed the inflection point. After all, a product whose sales have passed the inflection point, or are moving away from it, has more limited prospects.

The logistic form

The logistic form resembles an S-curve:

\[D = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}\]

\(D\) is the ‘aDoption rate’ (a percentage). This exponential model has two growth phases separated by an inflection point:

  • Early phase; accelerating growth, with \(\beta_1 > 0\) being the growth rate.
  • Inflection point; at \(- \beta_0 / \beta_1\).
  • Late phase; decelerating growth as we reach the ‘market potential’.

We guess 4 as the inflection point since it looks like the turning point. Plot the three stages.

We linearize the form:

\[ Y = ln \bigg( \frac{D}{1 - D} \bigg) = \beta_0 + \beta_1 X\]

Create the new variables and print the head of the dataset.
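A sketch of the transform, with the data frame rebuilt from the three rows printed below (the full dataset runs to time = 11):

```r
# First three observations, taken from the printed head of the dataset
equipment <- data.frame(time = 1:3, household_adoption_rate = c(0.20, 0.30, 0.46))
# Linearized logistic: ln(D / (1 - D))
equipment$lifecyle_rate <- with(equipment,
                                log(household_adoption_rate / (1 - household_adoption_rate)))
equipment
```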

##   time household_adoption_rate lifecyle_rate
## 1    1                    0.20    -1.3862944
## 2    2                    0.30    -0.8472979
## 3    3                    0.46    -0.1603427

Run the regression and print the results.

## 
## Call:
## lm(formula = lifecyle_rate ~ time, data = equipment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55409 -0.22524  0.07309  0.26993  0.35881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.09759    0.21355  -5.140 0.000612 ***
## time         0.26539    0.03149   8.429 1.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3302 on 9 degrees of freedom
## Multiple R-squared:  0.8876, Adjusted R-squared:  0.8751 
## F-statistic: 71.04 on 1 and 9 DF,  p-value: 1.455e-05

\(R^2\) are strong and we have significant p-values at the 1% level. The growth rate is \(\beta_1\) = 26.5%.

Conclusions

Compute the inflection point (\(- \beta_0 / \beta_1\)).

## [1] 4.135782
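The printed value is simple arithmetic on the coefficients; a quick check:

```r
# Coefficients from the printed summary above
b0 <- -1.09759; b1 <- 0.26539
-b0 / b1   # inflection point, ~4.14
```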

With the coefficient, reverse-engineer the adoption rate. Print the head of the dataset.

##   time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1    1                    0.20    -1.3862944         -0.832202
## 2    2                    0.30    -0.8472979         -0.566813
## 3    3                    0.46    -0.1603427         -0.301424

Extend the time period and create a longer data frame. Print the head of the dataset.

##   time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1   -4                    0.00     0.0000000          0.000000
## 2   -3                    0.00     0.0000000          0.000000
## 3   -2                    0.00     0.0000000          0.000000
## 4   -1                    0.00     0.0000000          0.000000
## 5    0                    0.00     0.0000000          0.000000
## 6    1                    0.20    -1.3862944         -0.832202
## 7    2                    0.30    -0.8472979         -0.566813
## 8    3                    0.46    -0.1603427         -0.301424

Fill the new data frame. Print the head and tail of the dataset.

##   time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1   -4                    0.00     0.0000000         -2.159147
## 2   -3                    0.00     0.0000000         -1.893758
## 3   -2                    0.00     0.0000000         -1.628369
## 4   -1                    0.00     0.0000000         -1.362980
## 5    0                    0.00     0.0000000         -1.097591
## 6    1                    0.20    -1.3862944         -0.832202
## 7    2                    0.30    -0.8472979         -0.566813
## 8    3                    0.46    -0.1603427         -0.301424
##    time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 19   14                       0             0          2.617855
## 20   15                       0             0          2.883244
## 21   16                       0             0          3.148633

Extrapolate and forecast the adoption rate. Print the head of the dataset.

##   time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1   -4                      NA             0         -2.159147
## 2   -3                      NA             0         -1.893758
## 3   -2                      NA             0         -1.628369
##   re_adoption_rate
## 1        0.1034796
## 2        0.1308166
## 3        0.1640539
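Each extrapolated rate inverts the linearized model; for example, at time = -4:

```r
# Coefficients from the printed summary above
b0 <- -1.09759; b1 <- 0.26539
L <- b0 + b1 * (-4)   # extrapolated logit at time = -4
1 / (1 + exp(-L))     # adoption rate, ~0.103
```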

Plot the results.

This is not a perfect fit.

How could we improve the logistic forecasting/extrapolating model?

For one thing, we would benefit from a longer sample or a denser sample (by month). We can also fit in a seasonality index with monthly data for a better estimation.

We can also rework the ultimate penetration assumption (we rarely reach 100% or 1). The new form would include the upper limit L:

\[D = \frac{L}{1 + e^{-(\beta_0 + \beta_1 X)}}\]

Since it is a truncated dataset, we could use a Tobit model to estimate the early phase before running the logistic model.

Extension

Other lifecycle models

Besides the logistic form, we could consider the Gompertz model and the Bass diffusion model.

The rule of thumb is: if future sales benefit from previous sales, use the logistic curve to generate forecasts; if future sales do not benefit from previous sales, use the Gompertz curve.

The main goal of life-cycle models is to measure whether we are before or after the inflection point. We do not manage a growing business as we manage a mature business. Bright prospects lie before the inflection point, where a start-up’s growth comes in the early phase. If the strategy is to buy a growth stock, once the start-up has crossed the inflection point, it is time to sell.

Unlike the above models, the Bass model and its variants predict a product’s future sales before the product comes to market. The Bass model also explains how knowledge of new products spreads throughout the marketplace. These analyses are crucial. Do we invest or not? Will we have a chance to get a ROI or not?

The reverse logistic model

Some cases are reverse S-curves (when \(\beta_1 < 0\)). We start with a rapid decline, then the fall slows down until it reaches a floor.

In very specific cases, this situation bears the name of ‘half-life model’. We cannot know the ending date; we can only estimate the time when the amount will be half of what it was at the beginning. The long descent is bound by an asymptote and will never reach zero. Half-life models explain radioactive decay or medication diffusion in the body, when months after an injection we can still find traces in a blood sample.

In business, it applies to subscriber churn. Without new acquisitions, the subscriber base declines according to a churn rate. The concept of churn is the opposite of compound interest. Larger subscriber bases undergo larger churn; as the subscriber base erodes, the churn effect diminishes. A half-life model can forecast the time when a 1,000-subscriber base reaches 500. Beyond 500, it looks like an infinite descent.

Pointer

A close relative of reverse lifecycle models is the reciprocal form (f(x) = 1/x). In theory, the form has an asymptotic value (a floor). The most important application belongs to macroeconomics: the Phillips curve. It is the relationship between the percent rate of change of money wages (Y), a proxy for inflation, and the unemployment rate in percent (X).

13, Specification: the Power Model to Model Price Elasticity

Marcellus and digressions on business management - part 1

The R case

Goal

We want to apply the concept of elasticity to model the profits-quantity elasticity of a fast-growing tech start-up. Production has exploded over the last few quarters.

Plot the data.

Build a model to measure the elasticity coefficient, the connection between quantity and profits. Apply the model to forecast short-term profits as a function of production.

The power form - a log-log model

\[{profit} = \beta_0 {qty}^{\beta_1}\]

Taking the first derivative gives the marginal profit function:

\[\frac{\delta {profit}}{\delta {qty}} = \beta_0 \times \beta_1 {qty}^{\beta_1 - 1}\]

The power form has similarities with the growth form. With the growth form, the linear model is a log-lin model. With the power form, the linear model is a log-log model:

\[ln (profit) = ln \beta_0 + \beta_1 ln (qty)\]

With slope \(\beta_1(\frac{profit}{qty})\) and elasticity \(\beta_1\).

Linearize the model. Print the head of the dataset, run the regression and print the results.
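A sketch of the call, with the data frame rebuilt from the three rows printed below (the full dataset has nine observations, so coefficients fitted on this subset only approximate the printed summary; the object name `fit_loglog` is illustrative):

```r
# First three of the nine observations, taken from the printed head of the dataset
sme <- data.frame(quarter = 1:3, quantity = c(2, 4, 6), profits = c(2, 9, 19))
fit_loglog <- lm(log(profits) ~ log(quantity), data = sme)
coef(fit_loglog)
```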

##   quarter quantity profits
## 1       1        2       2
## 2       2        4       9
## 3       3        6      19

## 
## Call:
## lm(formula = log(profits) ~ log(quantity), data = sme)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.058169 -0.009584 -0.003166  0.000217  0.073215 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.62138    0.03501  -17.75 4.45e-07 ***
## log(quantity)  1.98038    0.01455  136.08 3.05e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03658 on 7 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 1.852e+04 on 1 and 7 DF,  p-value: 3.053e-13

\(R^2\) are strong and we have significant p-values at the 1% level.

Plot the results; the linearized model.

The original model.

Conclusion

Extract the coefficients. Adjust the intercept with exp(intercept).

## [1] 0.5372043
## [1] 1.980378
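The back-transform is one call; a quick check against the printed values:

```r
# Coefficients from the printed summary above
b0 <- -0.62138; b1 <- 1.980378
exp(b0)   # back-transformed intercept, ~0.537
b1        # elasticity, unchanged by the back-transform
```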

The elasticity model with coefficients:

\[profit = 0.54~qty^{1.98}\]

Taking the first derivative gives:

\[\frac{\delta profit}{\delta qty} = 0.54 \times 1.98 \times qty^{0.98}\]

Or simply:

\[\frac{\delta profit}{\delta qty} = 1.0692 \times qty^{0.98}\]

What would a 1% increase in \(qty\) do?

The elasticity is \(\beta_1 \approx 1.98\): a 1% increase in \(qty\) raises \(profit\) by about 1.98% (exactly, \(1.01^{1.98} - 1 \approx 1.99\%\)), reflecting the start-up’s short-term increasing returns to scale.

A 10% increase in \(qty\) would turn into a \(1.10^{1.98} - 1 \approx 20.8\%\) increase in \(profit\), a bit more than the linear approximation of \(10 \times 1.98 = 19.8\%\).
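The fitted power model can be evaluated directly to check these percentage responses:

```r
# Fitted model: profit = 0.5372 * qty^1.9804 (coefficients from above)
b0 <- 0.5372043; b1 <- 1.980378
profit <- function(q) b0 * q^b1
profit(10.1) / profit(10) - 1   # ~0.0199: a 1% rise in qty lifts profit by ~1.99%
profit(11)   / profit(10) - 1   # ~0.208: a 10% rise lifts profit by ~20.8%
```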

As time goes by, we would need to measure the changing coefficients with new datasets. The new coefficients would likely illustrate different elasticities and, maybe, show decreasing returns.

Extension

Elasticity

The classical view on elasticity is that consumption (quantity demanded) is a function of price and (consumer) revenues:

\[C = \beta_0 P^{\beta_1} R^{\beta_2}\]

As price increases or revenues decrease, we would expect a decrease in the quantity consumed (assuming it is a normal good).

Now, the decrease depends on elasticity (\(\varepsilon\)). With high elasticity, price hikes will cause demand to plummet, while with inelasticity, price hikes will not change demand by much.

Elasticity can be constant or not.

Elasticity depends on many factors. One of them is the nature of the good. Contrary to a normal good (\(0 < \varepsilon < 1\)), Veblen goods are superior or luxury goods. Given constant revenues, as prices rise, some consumers seek these goods (rare paintings, luxury cars, jewels); they are highly elastic (\(\varepsilon > 1\)). On the other hand, Giffen goods are inferior goods (cheap food, used cars, second-hand furniture, with \(\varepsilon < 0\)). This time, as prices increase, consumers buy more!

We want to point out the range of \(\varepsilon\) (around 0, rarely beyond 2 or -2). For this reason, eliminating multicollinearity is crucial when estimating \(\varepsilon\). We want the best estimation possible.

The linear form is:

\[ln C = ln \beta_0 + \beta_1 ln P + \beta_2 ln R\]

Consumption means quantity demanded. First, \(\beta_1\) (written \(b_1\) below), the coefficient of \(ln P\), represents the consumption-price elasticity of a normal good:

\[b_1 = \frac{\Delta C / C}{\Delta P / P}\]

For a 1% increase in price, there is a \(\beta_1\)% decrease in consumption, assuming \(\beta_1 < 0\) (demand falls as price rises).

Second, \(\beta_2\) (written \(b_2\) below), the coefficient of \(ln R\), represents the consumption-revenue elasticity:

\[b_2 = \frac{\Delta C / C}{\Delta R / R}\]

For a 1% increase in revenue, there is a \(\beta_2\)% increase in consumption, assuming \(\beta_2 > 0\) (a normal good).

There is a lot more to consider (combine changes in \(P\) and \(R\) and the substitution/complement effects in face of choices).

Understanding elasticity is important when deciding on a pricing policy. We have to take into account the type of product, consumers, competition, timing, market segments, and many other factors.

Elasticity and pricing

In marketing, pricing strategy is one of the 4 Ps: product, place (location and distribution), promotion (and publicity), and price.

Elasticity analysis is one of the most important analytic tools in the marketing toolbox. Demand and elasticity models can become highly complex when they include factors such as cost structure, competition and substitution, legal constraints, and business objectives. Elasticity analysis is crucial for coining a pricing policy (penetration, skimming, competition, product line, bundle, premium, optional, cost-plus, cost-based, etc.).

14, Specification: the Cobb-Douglas Model and More on Elasticity

Marcellus and digressions on business management - part 2

The R case

The Cobb-Douglas function

This other elasticity function is mainly applied in microeconomics to explain the underpinnings of industrial change:

\[Y = \beta_0 K^{\beta_1} L^{\beta_2}\]

Where \(Y\) stands for ‘production’, \(K\) for ‘capital’, and \(L\) for ‘labour’. The two assets are a simplification of the production means. The Cobb-Douglas function is a ‘multiplicative’ form capturing returns to scale (whether increasing or decreasing).

In essence, the function measures the effect of adding/removing units of capital (investing/divesting in material resources) and units of labour (hiring/laying off human resources).

Both \(\beta\) are elasticity coefficients: they explain how hiring more workers increases production, with diminishing returns while capital remains constant, and how more capital investment improves production, with diminishing returns while the workforce remains constant.

Goal

We have a dataset showing the production of 10 companies in an industry with their respective workforce (\(L\)) and assets (\(K\)). Find the elasticity coefficients (\(\beta\)). Which one is elastic, which one is inelastic? What would be the impact on production of a change in \(L\)? In \(K\)?

Print the head of the dataset.

##   firms production workforce assets
## 1     1    2827381    224900 570900
## 2     2    2463879    287000 251270
## 3     3    2597816    436474 143600

The power form - a log-log model

This time, we use the decimal logarithm (\(10^{\beta},~log_{10}\)), not the natural logarithm (\(e^{\beta},~log~or~ln\)).

Linearize the model. Print the head of the dataset.
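A sketch of the transform, with the data frame rebuilt from the three rows printed below (the full dataset has ten firms):

```r
# First three of the ten firms, taken from the printed head of the dataset
sme2 <- data.frame(
  production = c(2827381, 2463879, 2597816),
  workforce  = c(224900, 287000, 436474),
  assets     = c(570900, 251270, 143600)
)
sme2$production_log <- log10(sme2$production)
sme2$workforce_log  <- log10(sme2$workforce)
sme2$assets_log     <- log10(sme2$assets)
head(sme2)
```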

##   firms production workforce assets production_log workforce_log
## 1     1    2827381    224900 570900       6.451384      5.351989
## 2     2    2463879    287000 251270       6.391619      5.457882
## 3     3    2597816    436474 143600       6.414608      5.639958
##   assets_log
## 1   5.756560
## 2   5.400141
## 3   5.157154

Run the regression and print the results.

## 
## Call:
## lm(formula = sme2$production_log ~ sme2$workforce_log + sme2$assets_log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.006687 -0.001019  0.001057  0.002407  0.002740 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.364964   0.027473   49.68 3.50e-10 ***
## sme2$workforce_log 0.590148   0.004635  127.31 4.86e-13 ***
## sme2$assets_log    0.334455   0.004364   76.63 1.70e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003513 on 7 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9997 
## F-statistic: 1.562e+04 on 2 and 7 DF,  p-value: 1.683e-13

\(R^2\) are strong and we have significant p-values at the 1% level.

Conclusion

Add the coefficients to the Cobb-Douglas function (workforce_log is the elasticity of \(L\), assets_log the elasticity of \(K\)):

\[Y = 10^{1.365} \times L^{0.590} \times K^{0.334}\]

Or simply:

\[Y = 23.17 \times L^{0.590} \times K^{0.334}\]

  • \(\beta\) are the elasticity ratios.
  • \(\beta_L\) is 0.590. If we hire and increase the workforce by 10%, production will increase by 5.9%.
  • \(\beta_K\) is 0.334. If we invest and increase assets by 10%, production will increase by 3.34%.
  • \(\beta_L+\beta_K\) = 0.925 < 1; decreasing returns to scale. Doubling the use of \(K\) and \(L\) will less than double output \(Y\).
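The returns-to-scale claim can be checked numerically from the two elasticities:

```r
# Elasticities from the printed summary above
bL <- 0.590148; bK <- 0.334455
bL + bK       # ~0.925 < 1: decreasing returns to scale
2^(bL + bK)   # doubling both inputs multiplies output by ~1.90, less than 2
```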

Extension

Choose a form, pick a model

A time series may appear to grow geometrically: the first segment looks like an exponential form. As time passes, we add data and the increase seems to stabilize. With additional data, the last segment looks more like a logarithmic form. The whole dataset starts with increasing growth, then plateaus, and finishes with decreasing growth.

What model do we pick? A polynomial model? It would clearly fit the data and minimize the residuals, but it might also overfit the situation. Why? In the longer term, the growth might pick up again. The declining growth could also be a temporary downturn.

Should we pick a linear model then, knowing we would obtain large residuals as the regression line cuts through all this volatility? Or could we incorporate dummy variables to take cycles into account?

Selecting the right model is sometimes tricky and time-consuming. Apart from the scientific method and the statistical toolbox, we ought to keep in mind reminders, pointers, and rules of thumb. Of course, they come with experience. The following is a minimalist checklist of things to remember in econometrics.

  • Explore the dataset. Preprocess the data. Watch for measurement errors, nonresponse, NA, reporting errors, computing errors, etc; because garbage in is garbage out.
  • A variable might not be observable, may be truncated or the data may simply be unreliable. Opt for alternatives such as proxy variables (expenditures over income for example).
  • Select the appropriate data, divide the dataset, and build separate models. Do not chase glory by trying to build a universal model. Keep things simple.
  • Move from general to specific.
  • When evaluating a model, theory is a guide in econometrics. With this approach, we attempt to measure causality.
  • On the other hand, with the rise of Big Data and computing power, some data-mining approaches mainly focus on the effect. The goal then is to test all the possible combinations, similar to pairwise correlations: in a huge matrix, we pair up each variable with every other. Using loops and iterations, we test ‘everything’ and study the results (with for(var in vec) {expr} and while(cond){expr}).
  • Employ recipes, cookbooks, existing models. Reproduce existing scripts.
  • Employ visualization techniques in all the analysis steps.
  • Always examine the whole (residuals) and the parts (partial residuals). Dig in.
  • Pick the right functional form; the Ramsey’s RESET test might help.
  • In addition, check if adding or omitting a variable improves the model. Remember Occam’s razor and the law of parsimony. Use statistical (ANOVA), heuristic (MASS package), and algorithmic techniques (leaps and relaimpo packages) to compare models.
  • Prefer simplicity over overfitting.
  • Use consistent benchmarks (\(R^2\), Mallows’ \(C_p\), AIC, BIC).
  • Check if coefficients have correct signs, if they are statistically significant (hypothesis testing).
  • Test, test, test (normality, linearity, etc.).
  • Automate routine tasks for exploratory plotting and running batteries of tests (using for(var in vec) {expr} and while(cond){expr}).
  • Data mining and other data science algorithms are great allies. We should not be shackled to the econometric toolbox.

The following is a step-by-step approach:

  1. Define population.
  2. Sampling and data collection.
  3. Descriptive statistics.
  4. Hypothesis testing.
  5. Correlation.
  6. Regression.

15, Specification: Logit and Probit Models to Model a Probability

Marcellus and the usury loans - part 1

The R case

A qualitative regressand

We worked with qualitative variables as regressors (dummy variables). We will now work with a qualitative variable as a regressand. The regressand can be multinomial (bond ratings AAA, AA, A), ordinal (customer 1st to 3rd choice) or binomial (1 or 0, whether or not an event takes place).

The goal of the regressors, whether continuous, discrete, and/or qualitative, is to predict the ‘states’ of the regressand: classify whether \(Y\) takes the value AAA, AA, or A; 1st, 2nd, or 3rd; or whether an event occurs.

Goal

We have applicants requesting a loan. Some are approved, some are rejected. We want to measure the probability of being approved, use the model to predict an applicant’s chance of approval, and see how changing conditions affect this probability.

The logit model

We need a probability model that remains in the 0-1 interval as regressors influence the probability of the event occurrence (the regressand). However, the relationship between the probability and the regressors is nonlinear, that is, one which approaches zero at slower and slower rates as regressors lose influence (get smaller) and approaches one at slower and slower rates as regressors gain influence (get larger). We have an S-shaped curve just like with the life-cycle model: the logistic form.

We start with the Odds Ratio (OR): the probability of occurrence of an event over non-occurrence:

\[OR = \frac{P}{1-P}\]

Where \(P\) is \(Pr(Y=1)\).

An OR = 2 means 2 to 1 chance of occurrence. An OR = 0.25 means 4 to 1 chance of non-occurrence.

Taking the logarithm of the OR, we get the logistic OR, or logit:

\[ln(OR) = ln \bigg( \frac{P}{1 - P} \bigg) = logit\]

The linear form:

\[logit = L = ln \bigg( \frac{P}{1 - P} \bigg) = \beta_0 + \beta_1 X_1\]

Then,

\[\bigg( \frac{P}{1 - P} \bigg) = e^{\beta_0 + \beta_1 X_1}\]

And,

\[P = \frac{e^{\beta_0 + \beta_1 X_1}}{1+e^{\beta_0 + \beta_1 X_1}}\]

Or simply:

\[P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1)}}\]

The general form is:

\[P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_n X_n)}}\]

To estimate the linear model, we need regressors (\(X_n\)) and the logit values \(L\). First, print the loan dataset.

##   amount  n n_i
## 1  80000 10   6
## 2  90000 20   9
## 3 100000 30  10
## 4 110000 60  21
## 5 120000 80  20

The dataset is a table of five tranches (by amount requested in $US), the number of requests by tranche (n) and number of approvals by tranche (n_i).

At the individual level, the logit would be \(L = ln(\frac{1}{0})\) if a loan request is approved and \(L = ln(\frac{0}{1})\) if the request is rejected; both are undefined, which is why we work with grouped relative frequencies.

We compute \(P = \frac{n_i}{n}\) for each tranche, the relative frequency estimating the true \(P\).

## amount
##  80000  90000  1e+05 110000 120000 
##   0.60   0.45   0.33   0.35   0.25
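These relative frequencies can be reproduced directly from the tranche table (a minimal sketch; the data frame name loan follows the text):

```r
# Tranche summary copied from the printout above
loan <- data.frame(amount = c(80000, 90000, 100000, 110000, 120000),
                   n   = c(10, 20, 30, 60, 80),
                   n_i = c(6, 9, 10, 21, 20))
# Relative frequency of approval per tranche, estimating the true P
round(loan$n_i / loan$n, 2)
```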

The loan dataset is a summary of 200 transactions (approvals and rejections) grouped together by tranches.

Print the number of rows, the head and the tail of the transactions.

## [1] 200
##   approved amount
## 1        1  80000
## 2        1  80000
## 3        1  80000
## 4        1  80000
## 5        1  80000
##     approved amount
## 196        0 120000
## 197        0 120000
## 198        0 120000
## 199        0 120000
## 200        0 120000

For the model (and for simplicity), amount is the only regressor. We will solely rely on one \(X\) to explain the probability of loan approval!

We have a fairly large sample. Assuming each observation is distributed independently as a binomial variable (1 or 0), the residuals should follow a normal distribution with mean \(\mu = 0\) and variance \(\sigma^2 = \frac{1}{n \times P(1-P)}\).

The short way

Using the Generalized Linear Model function, glm, with the logit link function, run a regression on the 200 transactions and print the results.

## 
## Call:
## glm(formula = approved ~ amount, family = binomial(link = "logit"), 
##     data = loan2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2483  -0.8765  -0.7692   1.3738   1.6506  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  2.628e+00  1.371e+00   1.916   0.0553 .
## amount      -3.079e-05  1.264e-05  -2.436   0.0148 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 253.67  on 199  degrees of freedom
## Residual deviance: 247.68  on 198  degrees of freedom
## AIC: 251.68
## 
## Number of Fisher Scoring iterations: 4
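The summary above comes from a glm call like the one echoed in the output. Since only results are printed here, the following sketch rebuilds the 200 transactions from the tranche summary and refits the model (loan2 and loan_logit_model follow the names in the printouts):

```r
amounts <- c(80000, 90000, 100000, 110000, 120000)
n   <- c(10, 20, 30, 60, 80)   # requests per tranche
n_i <- c(6, 9, 10, 21, 20)     # approvals per tranche
# One row per transaction: n_i ones followed by n - n_i zeros per tranche
loan2 <- data.frame(
  amount   = rep(amounts, n),
  approved = unlist(Map(function(k, m) rep(c(1, 0), c(k, m - k)), n_i, n)))
loan_logit_model <- glm(approved ~ amount,
                        family = binomial(link = "logit"), data = loan2)
coef(loan_logit_model)
# (Intercept) ≈ 2.628237, amount ≈ -3.078885e-05
```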

Levels of significance: 10% for the intercept, 5% for the regressor. We note z-values for the coefficients and not t-values. In a logit, we square the computed z-values (\((\beta/s.e.)^2\)) and compare them against the \(\chi^2\) distribution. The Wald test replaces the t-test in glm and maximum likelihood.

We also compare logit models with the AIC. However, we need more statistics; print additional metrics.

##           llh       llhNull            G2      McFadden          r2ML 
## -123.84016553 -126.83572714    5.99112323    0.02361765    0.02951139 
##          r2CU 
##    0.04106181
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke 
##     2.361765e-02    -3.499318e-05     2.951139e-02     4.106181e-02 
## McKelvey.Zavoina           Effron            Count        Adj.Count 
##     3.850954e-02     3.075997e-02     6.800000e-01     3.030303e-02 
##              AIC    Corrected.AIC 
##     2.516803e+02     2.517412e+02

The pseudo-\(R^2\) or McFadden’s \(R^2\) is very weak. McFadden’s \(R^2\) is \(1 - \frac{ln~L_{model}}{ln~L_{null}}\); the Count \(R^2\), on the other hand, reads as \(\frac{no.~of~true~predictions}{no.~of~predictions}\).

With a single regressor, the model does a poor job at explaining the probability of loan approval. Of course, we could have added extra variables such as household revenue, credit scoring, cash-down, etc. We will come back to the model explanatory power below. For now, we continue the analysis.

Compute the predicted probability (for a loan request of being approved) for each tranche.

##   amount  n n_i         P predicted_P
## 1  80000 10   6 0.6000000   0.5411886
## 2  90000 20   9 0.4500000   0.4643705
## 3 100000 30  10 0.3333333   0.3892066
## 4 110000 60  21 0.3500000   0.3189639
## 5 120000 80  20 0.2500000   0.2560834
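The predicted_P column is the logistic function evaluated at the estimated coefficients (taken from the glm summary above):

```r
b0 <- 2.628237          # intercept from the glm summary
b1 <- -3.078885e-05     # amount coefficient
amount <- c(80000, 90000, 100000, 110000, 120000)
# Logistic form: P = 1 / (1 + exp(-(b0 + b1 * X)))
round(1 / (1 + exp(-(b0 + b1 * amount))), 7)
# 0.5411886 0.4643705 0.3892066 0.3189639 0.2560834
```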

Conclusion to the short way

The log-odds (logit) of loan approval decrease by:

## [1] -3.078885e-05

for every dollar increase, or about 0.031 for every US$1000 increase. As the requested loan amount increases, the odds of being approved decline.

Print the complete results for P.

## amount
##  80000  90000  1e+05 110000 120000 
##   0.54   0.46   0.39   0.32   0.26

Now, prove it with:

\[P = \frac{1}{1 + e^{-(2.628 + (-3.079e-05) \times X)}}\]

Compute the predicted probability of being approved if we request a \(X\) = US$100,000 loan.

## [1] 0.3892066

Fit the result into:

\[OR = \frac{P}{1-P}\] \[OR = \frac{0.3892066}{1-0.3892066}\] The OR is equal to:

## [1] 0.6372148

Or odds of occurrence of about 2 to 3 (roughly 3 to 2 against approval).
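The OR computation in one line:

```r
P <- 0.3892066       # predicted P for a US$100,000 request
P / (1 - P)          # odds ratio, about 0.637
```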

Generate more predictions (say we want to know the probabilities for amounts going from US$0 to US$200,000).

## round(PredictedProb, 2)
##   0.03   0.04   0.05   0.07   0.09   0.12   0.16    0.2   0.26   0.32 
## 200000 190000 180000 170000 160000 150000 140000 130000 120000 110000 
##   0.39   0.46   0.54   0.62   0.69   0.75    0.8   0.85   0.88   0.91 
## 100000  90000  80000  70000  60000  50000  40000  30000  20000  10000 
##   0.93 
##      0

Close to a requested amount of 0, the model makes no sense. It is simply a matter of sensitivity analysis. Most loan requests are between US$80K and US$200K.

Plot the results with confidence intervals.

We have a reverse (decreasing) logistic curve since the amount coefficient is negative.

Here are additional measures of how well the model fits. They can be particularly useful when comparing competing models. The output produced by summary(loan_logit_model) includes indices of fit (shown below the coefficients): the null and residual deviances and the AIC.

Goodness of fit and robustness

The OLS \(R^2\) is defined as a goodness-of-fit measure.

The model estimates from a logistic regression are maximum likelihood estimates. Logit models are not calculated to minimize variance, so the OLS approach to goodness-of-fit does not apply.

We saw above that alternative measures, pseudo-\(R^2\), have been developed. These pseudo-\(R^2\) are on a similar scale, ranging from 0 to 1, with higher values indicating better model fit. However, a pseudo-\(R^2\) cannot be interpreted as one would interpret an OLS \(R^2\), and different pseudo-\(R^2\) can arrive at very different values within the same model, which makes broad comparisons between them invalid.

Pseudo-\(R^2\) only have meaning when compared to other pseudo-\(R^2\) of the same type, on the same dataset, predicting the same outcome. The only change would be to add or remove regressors (variations of the same model).

For logit, we need an alternative to the F-test. We ask whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The building block is the deviance (\(D\)), the gap between a given model and the saturated model.

  • The model with regressors is the estimated model.
  • The saturated model is a hypothetical model with one parameter per observation; it fits the data perfectly.

\[D = -2(ln V_{estimated~model}-lnV_{saturated~model})\]

Maximum likelihood estimation consists in maximizing a likelihood function. The saturated model fits the data perfectly, so its maximum likelihood is 1; and \(ln(1) = 0\).

The equation becomes:

\[D = -2(ln V_{estimated~model})\]

A high \(D\) indicates a poor model. However, we face the same problem as with pseudo-\(R^2\): \(D\) is only meaningful when comparing models of the same type: one model having removed one or more regressors from the original model, for example.

What if we measure the change in deviance between one model, the initial model, A, and the second model, B:

\[\Delta_{A-B} = D_{(A)} - D_{(B)} = decline~of~deviance\]

  • B is the estimated model with regressors.
  • A is empty, with only the intercept.

The test statistic becomes the difference between the residual deviance of the ‘estimated model with predictors’ and that of the ‘null model with just an intercept’. The test statistic is distributed \(\chi^2\) with degrees of freedom equal to the difference in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model).

Compute the decline of deviance, the degrees of freedom and the p-value.

## [1] 5.991123
## [1] 1
## [1] 0.01437804
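The three numbers above follow from the deviances printed in the glm summary (a sketch; 253.67 and 247.68 are the null and residual deviances):

```r
D_null  <- 253.67   # null deviance (intercept only)
D_resid <- 247.68   # residual deviance (with the amount regressor)
decline <- D_null - D_resid          # decline of deviance, ~ chi-square
df      <- 1                         # one regressor added
p_value <- pchisq(decline, df, lower.tail = FALSE)
c(decline, p_value)
```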

The test is significant at the 5% level. In other words, we reject the ‘no decline of deviance’ hypothesis; we have a meaningful model when we add the regressor.

If there were more than one regressor, it would be smart to test for multicollinearity as well.

The long way

This time, we open the engine to see what is going on. The fact there is only one regressor will keep things simple.

From our loan table:

##   amount  n n_i         P
## 1  80000 10   6 0.6000000
## 2  90000 20   9 0.4500000
## 3 100000 30  10 0.3333333
## 4 110000 60  21 0.3500000
## 5 120000 80  20 0.2500000

Compute additional variables to estimate the logit model.

The steps (print the head of the dataset as we follow along):

  1. For each tranche, compute the actual probability of being approved (\(P\)).
  2. For each tranche, obtain the logit as \(L = ln (\frac{P}{1-P})\).
##   amount  n n_i   P         L
## 1  80000 10   6 0.6 0.4054651
  3. For the binomial distribution, the average is \(n \times P\), the variance is \(n \times P(1-P)\) and the estimated variance of the residuals is \(\sigma^2_{\epsilon} = \frac{1}{n \times P(1-P)}\). Resolve the resulting heteroscedasticity of the residuals with a weighted transformation, applying the weight \(w = n \times P(1-P)\).
  4. Compute \(\sqrt{w}\).
  5. Compute \(weighted~L = L \times \sqrt{w}\).
  6. Compute \(weighted~amount = amount\times\sqrt{w}\).
##   amount  n n_i   P         L   w   sqrt_w       w_L w_amount
## 1  80000 10   6 0.6 0.4054651 2.4 1.549193 0.6281438 123935.5
  7. Estimate the Weighted Least Squares (OLS on transformed data):

\[weighted~L = \beta_0 \sqrt{w} + \beta_1 \times\ weighted~amount\]

Because \(\beta_0\) becomes a multiplicative factor of \(\sqrt{w}\), we remove the free intercept from the linear model (‘0’ in the formula means ‘through the origin’) and replace it with the sqrt_w regressor. Run the regression and print the results, then compare them with the previous model.

## $coefficients
##              Estimate   Std. Error   t value   Pr(>|t|)
## sqrt_w    2.631122741 7.290674e-01  3.608888 0.03653227
## w_amount -0.000030795 6.712809e-06 -4.587498 0.01945483
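The ‘long way’ table and the coefficients above can be reproduced step by step (names follow the printouts):

```r
loan <- data.frame(amount = c(80000, 90000, 100000, 110000, 120000),
                   n   = c(10, 20, 30, 60, 80),
                   n_i = c(6, 9, 10, 21, 20))
loan$P        <- loan$n_i / loan$n                  # actual probability
loan$L        <- log(loan$P / (1 - loan$P))         # logit
loan$w        <- loan$n * loan$P * (1 - loan$P)     # weight
loan$sqrt_w   <- sqrt(loan$w)
loan$w_L      <- loan$L * loan$sqrt_w               # weighted logit
loan$w_amount <- loan$amount * loan$sqrt_w          # weighted regressor
# OLS through the origin, sqrt_w standing in for the intercept
loan_logit_model2 <- lm(w_L ~ 0 + sqrt_w + w_amount, data = loan)
coef(loan_logit_model2)
# sqrt_w ≈ 2.631123, w_amount ≈ -3.0795e-05
```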

Compare the results to the ‘short-way’ results.

##                  Estimate   Std. Error   z value   Pr(>|z|)
## (Intercept)  2.628237e+00 1.371381e+00  1.916489 0.05530284
## amount      -3.078885e-05 1.263854e-05 -2.436108 0.01484624

They are in the same ballpark.

Extract the full summary.

## 
## Call:
## lm(formula = w_L ~ 0 + sqrt_w + w_amount, data = loan)
## 
## Residuals:
##       1       2       3       4       5 
##  0.3686 -0.1340 -0.6320  0.5072 -0.1330 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)  
## sqrt_w    2.631e+00  7.291e-01   3.609   0.0365 *
## w_amount -3.079e-05  6.713e-06  -4.587   0.0195 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5254 on 3 degrees of freedom
## Multiple R-squared:  0.9695, Adjusted R-squared:  0.9491 
## F-statistic: 47.64 on 2 and 3 DF,  p-value: 0.005333

Remember, this time we ran an OLS. The \(R^2\) are strong, but meaningless since we ran the regression through the origin! The F-statistic’s p-value is 0.005333; the ‘short-way’ \(\chi^2\) test had a p-value of 0.01437804. Both are significant at least at the 5% level, as are the coefficients’ p-values.

Conclusion to the long way

One of the objectives is to prove the ‘short way’ and confirm the first model.

Given:

  • \(\beta_0\) for sqrt_w = \(\sqrt{n\hat{P}(1-\hat{P})}\)
  • \(\beta_1\) for w_amount = \(amount \times \sqrt{n\hat{P}(1-\hat{P})}\)
  • Where \(\hat{P}\) stands for the predicted probabilities.

As opposed to the actual (calculated) probability P:

##   amount  n n_i         P
## 1  80000 10   6 0.6000000
## 2  90000 20   9 0.4500000
## 3 100000 30  10 0.3333333
## 4 110000 60  21 0.3500000
## 5 120000 80  20 0.2500000

We want to find the predicted probabilities.

The beginning equation:

\[ln \bigg( \frac{P}{1 - P} \bigg) = L = \beta_0 + \beta_1 X_1\]

Becomes:

\[L \times \sqrt{n\hat{P}(1 - \hat{P})} = predicted~L = \beta_0\sqrt{n\hat{P}(1 - \hat{P})} + \beta_1 \times amount \times \sqrt{n\hat{P} (1 - \hat{P})}\]

Insert the estimated coefficients

## [1] 2.631123
## [1] -3.0795e-05

The equation becomes:

\[predicted~L = 2.631123 \times \sqrt{n\hat{P}(1 - \hat{P})} - (3.0795e-05) \times amount \times \sqrt{n\hat{P} (1 - \hat{P})}\]

And,

\[\frac{predicted~L}{\sqrt{n\hat{P}(1 - \hat{P})}} = unweighted~predicted~L\]

After rework, the equation becomes:

\[\hat{P} = \frac{1}{1+e^{-(unweighted~predicted~L)}}\]

Where \(\hat{P} \approx\) PredictedProb from previous model (the ‘short way’).

Whether from loan_logit_model or loan_logit_model2, extract the coefficients and run the numbers.

If we request a $100,000 loan, the predicted probability (\(\hat{P}\)) of being approved is:

## [1] 0.3897467

0.3892066 for the ‘short way’.

From a US$100,000 request, every dollar increase changes the predicted probability by \((-3.0795e-05) \times \hat{P}(1 - \hat{P})\), vs. a coefficient of -3.078885e-05 for the ‘short way’: virtually the same marginal effect.

The ‘long way’ confirms the ‘short way’. Let’s carry on with some sensitivity analyses.

Complete the computation: \((-3.0795e-05) \times \hat{P}_{\$100,000}(1 - \hat{P}_{\$100,000}) \times \$1000\):

## [1] -0.007324412
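This marginal effect is the derivative of the logistic curve, \(\beta_1 \hat{P}(1 - \hat{P})\), scaled to a US$1000 step:

```r
b1    <- -3.0795e-05      # 'long way' slope coefficient
P_100 <- 0.3897467        # predicted probability at US$100,000
b1 * P_100 * (1 - P_100) * 1000   # change in P per US$1000 increase
```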

From a US$120,000 request, every US$1000 increase changes the probability by \((-3.0795e-05) \times \hat{P}(1 - \hat{P})\) or simply \((-3.0795e-05) \times \hat{P}_{\$120,000}(1 - \hat{P}_{\$120,000}) \times \$1000\):

## [1] -0.005872739

There is a slowdown in the decline as the amount requested increases: the marginal effect \((-3.0795e-05) \times \hat{P}(1-\hat{P})\) shrinks as \(\hat{P}\) gets smaller (it also makes economic sense, since default risk increases and lenders are more risk-averse with larger amounts).

From a US$200,000 request, every US$1000 increase will change the probability by:

## [1] -0.0008534973

Confirming the slowdown in the decline as we get closer to a ‘floor’.

The probit model

We could perform additional analyses with a probit model. Probit (or normit) models rely on a normal distribution.

Run the regression and compare the coefficient with the first logit model (the ‘short way’).

##                  Estimate   Std. Error   z value   Pr(>|z|)
## (Intercept)  1.611554e+00 8.442947e-01  1.908758 0.05629333
## amount      -1.890658e-05 7.747235e-06 -2.440429 0.01466982
##                  Estimate   Std. Error   z value   Pr(>|z|)
## (Intercept)  2.628237e+00 1.371381e+00  1.916489 0.05530284
## amount      -3.078885e-05 1.263854e-05 -2.436108 0.01484624

The coefficients are in the same ballpark (probit coefficients are on a different scale than logit coefficients, but the signs and significance agree).
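This comparison can be reproduced with the same glm interface, swapping the link function (rebuilding the 200 transactions as before; a sketch):

```r
amounts <- c(80000, 90000, 100000, 110000, 120000)
n   <- c(10, 20, 30, 60, 80)
n_i <- c(6, 9, 10, 21, 20)
loan2 <- data.frame(
  amount   = rep(amounts, n),
  approved = unlist(Map(function(k, m) rep(c(1, 0), c(k, m - k)), n_i, n)))
# Same model, probit (normal CDF) link instead of logit
loan_probit_model <- glm(approved ~ amount,
                         family = binomial(link = "probit"), data = loan2)
coef(loan_probit_model)
# (Intercept) ≈ 1.611554, amount ≈ -1.890658e-05
```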

Compute the probabilities.

## round(fit, 2)
##   0.02   0.04   0.05   0.08   0.11   0.15    0.2   0.26   0.32   0.39 
## 190000 180000 170000 160000 150000 140000 130000 120000 110000 100000 
##   0.46   0.54   0.61   0.68   0.75    0.8   0.85   0.89   0.92   0.95 
##  90000  80000  70000  60000  50000  40000  30000  20000  10000      0

Plot the results in comparison with the logit model.

Probit curves tend to be steeper, with thinner tails.

Pointer

Logit/probit models usually involve regressions with one qualitative variable and more than one regressor.

For estimating loan approval, credit default, subscriber churn, machine breakdown, and other kinds of events, we can also count on the complementary log-log (cloglog) model.

Logit/probit are close relatives of survival models, also known as duration models or time-to-event analysis. A logit/probit model explains the probability of occurrence: what factors have an impact on occurrence and by how much. Duration models assume the event occurred and aim at finding what factors influence duration and measuring their impact.

As for duration, we can consider a technique from astrophysics, the Copernican principle, to predict duration (estimate the length of time that a product or event remains in existence).

Extension

GLM

There are other possibilities than the logit and probit link function with GLM. Depending on the situation, we might need another link function, another distribution, even a nonparametric distribution.

glm(y ~ x, data = dataset, family = familytype(link = 'linkfunction'))

There are other models working with qualitative regressands:

  • Polytomic.
  • Multinomial.
  • Ordinal or ordered.
  • Categorical.
  • Count.
  • Censored, truncated, Tobit.
  • Mixed logistic.
  • Latent class.
  • etc.

Type ?glm for other modeling options and ?family for other family types and link functions. Check out glm.nb and polr from the MASS package.

R has a lot more in store in terms of packages and functions:

  • Box-Cox transformation for time series from the car and MASS packages.
  • Generalized Least Squares (GLS), Generalized Nonlinear Least Squares (GNLS), and linear/nonlinear fixed/random effects models (autocorrelation for example) from the nlme and MASS packages.
  • Nonlinear Least Square (NLS) from the stats package.
  • ‘loess’ functions and scatter.smooth from the stats package.
  • Spline regressions from the splines package.
  • Robust regressions from the MASS package.
  • Structural equation models (Two-Stage Least Squares ) from the sem package and Simultaneous Equation estimation from the systemfit package.
  • Partial Least Squares Regression (PLSR) and Principal Component Regression from the pls package.
  • Quantile regressions from the quantreg package.
  • Generalized Additive Model (GAM) from the gam package.
  • Survival Analysis from the survival package.
  • Classification and Regression Trees from the tree and rpart packages.
  • Beta regressions from the betareg package.

16, Heteroscedasticity

Marcellus and the usury loans - part 2

The R case

Goal

Nonconstant residual variances are a feature of cross-sectional datasets. The spread of the residuals can increase, decrease, increase and decrease, etc. The problem distorts the results: standard errors are biased, producing inaccurate t- and F-values that may wrongly appear insignificant.

We want to measure the connection between households’ avg_m_revenues (average monthly revenues), a categorical variable with three levels (2000, 4000, 6000) and monthly savings per household.

Print the head of the dataset.

##   avg_m_revenues savings
## 1           2000     160
## 2           2000     180
## 3           2000     190

Find out if the preliminary results are misleading. Detect heteroscedasticity and implement a solution.

The preliminary model

Run the regression, print and plot the results.

## 
## Call:
## lm(formula = savings ~ avg_m_revenues, data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.982 -53.817   9.018  36.183 154.688 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.023e+02  3.886e+01   2.633 0.017438 *  
## avg_m_revenues 4.716e-02  9.469e-03   4.981 0.000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.02 on 17 degrees of freedom
## Multiple R-squared:  0.5934, Adjusted R-squared:  0.5695 
## F-statistic: 24.81 on 1 and 17 DF,  p-value: 0.000114

Sometimes, when we plot large datasets, categorical data are difficult to read: the points come out bunched up. The jitter function adds some noise to the graphics without changing the actual data.

Use the jitter function.

Visual inspection

Plot a Q-Q chart.

Or…

These are the patterns of heteroscedasticity.

Print the residuals plots.

In the four diagnostic plots, the left-hand plots show a growing variance. The top-right Q-Q plot is not linear at the bottom. The bottom-right plot (Cook’s distance) shows how the high revenues have a large influence on the regression.

## 
## Suggested power transformation:  -0.9364038

The final plot displays how high revenues influence the coefficients: biased intercept and slope.

Plot means with error bars.

We can clearly see an increasing variance (nonconstant variance).

Tests

First, run the score test for non-constant variance (ncvTest from the car package, a Breusch-Pagan-type test) under the null hypothesis of ‘homoscedasticity’.

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5.968958    Df = 1     p = 0.01455988

We can reject the null hypothesis at the 5% level.

Second, run the Goldfeld-Quandt test and rerun another Breusch-Pagan test, this time, from the lmtest package.

## 
##  Goldfeld-Quandt test
## 
## data:  savings_model
## GQ = 8.7983, df1 = 8, df2 = 7, p-value = 0.004799
## alternative hypothesis: variance increases from segment 1 to 2
## 
##  studentized Breusch-Pagan test
## 
## data:  savings_model
## BP = 5.7448, df = 1, p-value = 0.01654

Both tests reject the null hypothesis.

The Breusch-Pagan test is criticized because it requires prior knowledge of what might be causing the heteroscedasticity (hence the preliminary graphical analysis).

The Goldfeld-Quandt test is more flexible: we can divide the sample observations into two groups and compare them. An evidence of heteroscedasticity would be based on a comparison of the residual sum of squares (RSS) using the F-statistic. The larger the F-statistic, the more likely you have different variance for the two groups.

We can determine the appropriate criteria to separate the sample into Group A and Group B. For example, the parameter point = 0.5 sets the breakpoint halfway. We can also omit central observations with the fraction parameter. Finally, we can test for increasing variances, decreasing variances, or both (the default is increasing variances).
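As an illustration of the mechanics on simulated data (not the savings dataset, which is only partially printed here), the studentized Breusch-Pagan statistic is \(n \times R^2\) from regressing the squared residuals on the regressor:

```r
set.seed(1)
x <- rep(c(2000, 4000, 6000), each = 30)             # three revenue levels
y <- 100 + 0.05 * x + rnorm(length(x), sd = x / 20)  # residual sd grows with x
m <- lm(y ~ x)
aux <- lm(residuals(m)^2 ~ x)                        # auxiliary regression
BP  <- length(x) * summary(aux)$r.squared            # n * R^2 ~ chi-square(1)
pchisq(BP, df = 1, lower.tail = FALSE)               # small p: heteroscedasticity
```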

Pointer

There are more procedures, all criticized: the Park test, the Glejser test, the Harvey-Godfrey test, the Spearman's rank correlation test, and the White test.

Remedies

Use the Weighted Least Squares (similar to what we used with the logit model).

Weighted Least Squared (WLS)

We divide the variables by the residual standard deviation; the gls function (for Generalized Least Squares, of which WLS is a special case) computes it. Different transformations exist though.

Run several regressions with different weights and compare them with the AIC.

## [1] 211.6107
## [1] 211.5195

The varPower() weight (for ‘Power of variance covariate’) generates the best AIC.
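The gls calls compared above follow this pattern, here sketched on simulated data with increasing variance (the savings dataset itself is not fully printed; nlme ships with R):

```r
library(nlme)
set.seed(1)
sim <- data.frame(avg_m_revenues = rep(c(2000, 4000, 6000), each = 30))
sim$savings <- 100 + 0.05 * sim$avg_m_revenues +
  rnorm(nrow(sim), sd = sim$avg_m_revenues / 20)     # growing residual sd
m_exp   <- gls(savings ~ avg_m_revenues, data = sim, weights = varExp())
m_power <- gls(savings ~ avg_m_revenues, data = sim, weights = varPower())
AIC(m_exp, m_power)   # pick the weight structure with the lower AIC
```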

Pointer

The parameters are: varExp(), varPower(), varConstPower(), varIdent(), varFixed(), varComb()

Print the results.

## Generalized least squares fit by REML
##   Model: savings ~ avg_m_revenues 
##   Data: savings 
##        AIC      BIC    logLik
##   211.5195 214.8524 -101.7598
## 
## Variance function:
##  Structure: Power of variance covariate
##  Formula: ~fitted(.) 
##  Parameter estimates:
##    power 
## 2.061803 
## 
## Coefficients:
##                   Value Std.Error  t-value p-value
## (Intercept)    118.2766 24.644036 4.799399   2e-04
## avg_m_revenues   0.0423  0.008913 4.746549   2e-04
## 
##  Correlation: 
##                (Intr)
## avg_m_revenues -0.921
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4854143 -0.7539166  0.2110880  0.7450446  1.6651669 
## 
## Residual standard error: 0.0005050905 
## Degrees of freedom: 19 total; 17 residual

The intercept is higher (118.3 over 102.3) since we flatten out larger observations, giving more leverage to the smaller observations. All p-values are improved, especially for the intercept.

Plot the results.

The plot shows constant variance as a result.

One drawback: we get no \(R^2\), only AIC, BIC and logLik. However, with WLS, the \(R^2\) would remain quite similar to what we get with OLS.

Conclusion

Taking into account that richer households show more diversity in terms of financial discipline, we take measures to standardize the data. We have to curb the increasing variability of monthly savings prior to estimating the connection between revenues and savings.

With adjusted coefficients, we can measure that households save at least US$118 per month (baseline). In addition, richer households have a propensity to save more.

With a US$2000 monthly revenues, an average household saves an additional US$84.6 (0.0423 \(\times\) US$2000); an additional US$169.2 for a US$4000 monthly revenues, and an additional US$253.8 for a US$6000 monthly revenues.

Extension

Apart from the WLS, there are other solutions to heteroscedasticity:

  • Principal component analysis (PCA) and regressions (PCR); also, a remedy to multicollinearity.
  • Ridge regressions from the MASS package and elastic net (combining ridge and the ‘least absolute shrinkage and selection operator’ or LASSO) from the glmnet package.

17, Panel Data: Mixing Cross-Sectional with Time-Series Data

Marcellus and the investment fund - part 3

The R case

Goal

We have panel data:

  • 4 companies (A, B, C, D).
  • 20 years.
  • 2 regressors (x1, x2).
  • 1 regressand (y).
  • Dummy variables.

Print the head of the dataset.

##   year company     y     x1    x2 d1 d2 d3
## 1    1       A 33.10 1170.6  97.8  0  0  0
## 2    2       A 45.00 2015.8 104.4  0  0  0
## 3    3       A 77.02 2803.3 118.0  0  0  0

The year variable is made of 20 periods repeated for each of the 4 companies. For each year and each company, we have a set of y, x1, and x2.

We want to screen companies. The dataset is trivial. y, x1 and x2 could be any financial metrics (like our case with market capitalization, indebtedness and D/E ratio where we wanted to spot top performers). The goal is to introduce panel data, common practices and potential problems.

- y can be the gross investment (addition to plants and equipments, plus maintenance and repairs).
- x1 can be the value of the firm (price of common and preferred shares times the number of shares outstanding, plus the total book value of debt).
- x2 can be the stock of plant and equipment (accumulated sum and net additions to assets minus depreciation allowance).

A word on panel data

Panel data or longitudinal data are multi-dimensional collections of observations (variables, readings, results). They combine cross-sectional data and time series. For that reason, it is well-advised to work with the zoo and xts packages and ts objects.

Panel data are flexible. No matter how the dataset is organized, we can always adjust it with the dplyr and tidyr packages. Panel data enable the analyst to focus on the differences between individuals or firms in a panel study and/or the changes in observed phenomena for one subject over the course of the study. Analysis of panel data can be complex as they are plagued with problems from both cross sections and time series: heteroscedasticity, autocorrelation and, of course, multicollinearity.

A first model

Run a regression without differentiating between companies and years. We pile up 80 observations.

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\] Print the results.

## 
## Call:
## lm(formula = y ~ x1 + x2, data = panel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -320.83 -100.24   -0.13   68.21  336.25 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -62.45486   29.57784  -2.112    0.038 *  
## x1            0.10910    0.01371   7.956 1.24e-11 ***
## x2            0.30575    0.04926   6.207 2.52e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.3 on 77 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7498 
## F-statistic: 119.4 on 2 and 77 DF,  p-value: < 2.2e-16

\(R^2\) are good and we have significant p-values at the 1% level, except for the intercept (5% level).

We do not have any heteroscedasticity or multicollinearity problems here. It is always well-advised to test though.

Time series and panel data are notorious for serial correlation or autocorrelation. Autocorrelation inflates variance, making \(R^2\), t- and F-values invalid.

Test for autocorrelation (where the \(H_0: \rho = 0\) or ‘no autocorrelation’) with the Durbin-Watson test. The test produces two critical values: \(d_{L, \alpha}\) and \(d_{U, \alpha}\).

##  lag Autocorrelation D-W Statistic p-value
##    1        0.816305     0.3622964       0
##  Alternative hypothesis: rho != 0

For n = 80 (observations), k = 2 (regressors), \(\alpha\) = 0.05, the Durbin-Watson table gives \(d_{L, 0.05} = 1.586\) and \(d_{U, 0.05} = 1.688\).

The computed D-W statistic, \(d\), is below \(d_{L}\). We reject the null hypothesis; we accept the alternative hypothesis (\(\rho \neq 0\) or rho != 0). According to the results, we have evidence of positive first-order (lag = 1) serial correlation.

The diagnostic table:

  • 0 < \(d\) < \(d_{L}\); evidence of positive first-order (lag = 1) serial correlation.
  • \(d_{L}\) <= \(d\) <=\(d_{U}\); indecision zone.
  • \(d_{U}\) < \(d\) < \(4 - d_{U}\); no autocorrelation.
  • \(4 - d_{U}\) <= \(d\) <= \(4 - d_{L}\); indecision zone.
  • \(4 - d_{U}\) < \(d\) < 4; evidence of negative first-order (lag = 1) serial correlation.

In addition, for the D-W test to detect first-order serial correlation, the model must include a constant and no lagged values of the regressand among the regressors.

There are other tests to detect autocorrelation: graphical methods such as time sequence plot, runs test or Geary test; analytical techniques such as the modified d test, the large-sample test, the Breusch-Godfrey test of higher-order autocorrelation (lag >= 1), Durbin's h test in the presence of lagged dependent variables. We can perform these tests with the ts and tseries packages.
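The D-W statistic itself is easy to compute by hand, \(d = \frac{\sum_t (e_t - e_{t-1})^2}{\sum_t e_t^2}\), with \(d \approx 2(1-\hat{\rho})\); a base-R sketch on simulated AR(1) errors:

```r
set.seed(1)
e <- as.numeric(arima.sim(list(ar = 0.8), n = 80))  # positively autocorrelated errors
d <- sum(diff(e)^2) / sum(e^2)
d   # well below 2: evidence of positive first-order serial correlation
```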

Run the Breusch-Godfrey test of order 1.

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  panel_model1
## LM test = 53.751, df = 1, p-value = 2.275e-13

We confirm the D-W test and reject the ‘no autocorrelation’ hypothesis.

There exist remedies to autocorrelation. We can change the model specification for example.

A second model

Differentiate between companies by adding three dummy variables (three columns in the dataset): all ‘0’ accounts for A; column d1, the first dummy, has ‘1’ for B and ‘0’ for the other companies; column d2 has ‘1’ for C and ‘0’ for the others; and d3 does the same for D.
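Such dummies are what model.matrix builds from a company factor (a sketch; the column names are R’s defaults, not the d1-d3 of the dataset):

```r
company <- factor(rep(c("A", "B", "C", "D"), each = 20))
dummies <- model.matrix(~ company)[, -1]  # drop the intercept column; A is the baseline
colnames(dummies)
head(dummies, 2)
```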

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D_1 + \beta_4 D_2 + \beta_5 D_3 + \epsilon\] Run the regression, print the results, and run autocorrelation tests.

## 
## Call:
## lm(formula = y ~ x1 + x2 + d1 + d2 + d3, data = panel)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -182.995  -48.589    9.516   37.200  195.581 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -246.81201   35.66791  -6.920 1.39e-09 ***
## x1             0.10814    0.01744   6.202 2.90e-08 ***
## x2             0.34814    0.02656  13.109  < 2e-16 ***
## d1           339.68506   23.89366  14.217  < 2e-16 ***
## d2           157.98102   46.26987   3.414  0.00104 ** 
## d3           187.53271   31.40690   5.971 7.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75 on 74 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9305 
## F-statistic: 212.4 on 5 and 74 DF,  p-value: < 2.2e-16
##  lag Autocorrelation D-W Statistic p-value
##    1       0.4948028     0.9626941       0
##  Alternative hypothesis: rho != 0
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  panel_model2
## LM test = 20.334, df = 1, p-value = 6.502e-06

\(R^2\) are stronger because we have more explanatory variables. We have significant p-values at the 0.1% level, except one p-value (d2) at the 1% level.

For n = 80 and k = 5, the Durbin-Watson table gives \(d_{L, 0.05}\) = 1.507 and \(d_{U, 0.05}\) = 1.772.

The calculated D-W statistic, \(d\), is below \(d_{L}\): we still have evidence of positive first-order (lag = 1) serial correlation. The Breusch-Godfrey test confirms the conclusion.

Compare both models.
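The comparison is a partial F-test on nested models, which `anova()` performs; a sketch, with the model object names assumed:

```r
# Partial F-test: does adding the company dummies significantly reduce
# the residual sum of squares?
# (Assumed objects: panel_model1 = y ~ x1 + x2,
#                   panel_model2 = y ~ x1 + x2 + d1 + d2 + d3.)
anova(panel_model1, panel_model2)
```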

## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 + x2 + d1 + d2 + d3
##   Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
## 1     77 1558444                                 
## 2     74  416230  3   1142213 67.69 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is significant at the 0.1% level. The two models are statistically different, and Model 2 is better, as indicated by its higher \(R^2\). However, autocorrelation is still present.

A third model

We could add dummy variables to differentiate between time periods: take into account different tax regimes or technological changes for example.
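A sketch, assuming the dataset has a time marker (a hypothetical `year` column); wrapping it in `factor()` lets `lm()` generate one dummy per period automatically:

```r
# Hypothetical third model: add time-period dummies on top of the
# company dummies ('year' is an assumed column name).
panel_model3 <- lm(y ~ x1 + x2 + d1 + d2 + d3 + factor(year), data = panel)
summary(panel_model3)
```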

Conclusion

Our goal is reached.

  • We introduced panel data.
  • We showed how to test for autocorrelation (a problem we had not encountered before).
  • We showed one remedy to serial correlation: changing the functional form.

If changing the functional form does not remedy autocorrelation, we have to change the model. Here are a few pointers.

  1. When the serial correlation structure (\(\rho\)) is known:
  • Use generalized difference equations (the Prais-Winsten transformation).
  2. When the serial correlation structure is unknown (most cases):
  • We need an estimate of the correlation parameter. We can base it on the D-W statistic, on Durbin’s two-step method, or on the Theil-Nagar modified d statistic.
  • Use first-order difference or moving-average regressions without intercepts. These assume the correlation parameter is +1, which we can check with the Berenblut-Webb test.
  • Use the Cochrane-Orcutt iterative procedure (one of the most popular).

These should cover the majority of situations.
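As a sketch of the last pointer, the `orcutt` package implements the Cochrane-Orcutt iterative procedure (the model object name is assumed from the earlier output; the `prais` package offers a Prais-Winsten estimator along the same lines):

```r
library(orcutt)  # install.packages("orcutt") if needed

# Cochrane-Orcutt iterative estimation: re-fit the second model while
# iteratively estimating the first-order correlation parameter rho.
co_fit <- cochrane.orcutt(panel_model2)
summary(co_fit)
```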

Extension

Panel data are well suited to comparisons over time: within a group (individuals or firms) and between groups. For example, we can test whether two samples of the same population have equal means or equal variances, or compare two or more groups for equality of means or variances. We can run post-hoc tests and pairwise comparisons (Tukey’s procedures), and test simple effects, interaction effects, and random effects. In all, we can perform all kinds of tests.
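Base R covers these comparisons directly; a sketch on simulated groups (the vectors are hypothetical stand-ins):

```r
# Illustrative group comparisons on simulated data.
set.seed(1)
g1 <- rnorm(30, mean = 10)
g2 <- rnorm(30, mean = 10.5)

t.test(g1, g2)    # equal means? (Welch two-sample t-test)
var.test(g1, g2)  # equal variances? (F test)

# Several groups: one-way ANOVA, then Tukey's pairwise comparisons.
df <- data.frame(y = c(g1, g2), grp = factor(rep(c("G1", "G2"), each = 30)))
fit <- aov(y ~ grp, data = df)
TukeyHSD(fit)
```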

Another avenue is multilevel analysis (with the multilevel and nlme packages). These models measure the impact of a group on individuals or firms; for example, individuals may be nested within workgroups.
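A sketch with `nlme::lme()`, using hypothetical column names (`y`, `x1`, and a grouping factor `workgroup`) on simulated nested data:

```r
library(nlme)  # ships with R as a recommended package

# Illustrative nested data: 60 individuals in 6 hypothetical workgroups.
set.seed(7)
d <- data.frame(workgroup = rep(paste0("g", 1:6), each = 10), x1 = rnorm(60))
d$y <- 2 + 0.5 * d$x1 + rep(rnorm(6), each = 10) + rnorm(60, sd = 0.3)

# Random intercept per workgroup: individuals (level 1) nested in
# workgroups (level 2).
ml_fit <- lme(fixed = y ~ x1, random = ~ 1 | workgroup, data = d)
summary(ml_fit)
```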