Foreword
R
R may appear opaque. Some people advocate the use of spreadsheets or visualization software such as Tableau or Qlik. These user-friendly tools do not have steep, long learning curves. Nonetheless, it is a matter of long-term investment. Implementing R scripts can save a lot of time if we have to release reports on a recurring basis, for example. Imagine the burden of completing the same report month after month with a spreadsheet when we can instead embed R scripts within a process, automate things, and free resources for other tasks.
Documentation on R can be found:
Synonyms
In the following regressions, we use terms that are sometimes synonyms:
Datasets
Most of the cases rely on cross-sectional datasets. Even though we may come across time and dates in the observations, they are simply markers.
Some cases will address time series.
Another case will combine cross-sectional data with time series: panel data.
Models are simple and datasets are short. The methods remain the same with larger datasets or recursive analyses. We import data from spreadsheets. For large datasets, we can instead connect to databases or work with Big Data technologies (parallel computing and the like). Scripts are short. For repetitive tasks, we can implement loops (using for(var in vec) {expr} and while(cond) {expr}).
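As a sketch, a hypothetical recurring task wrapped in both constructs (the month labels and messages are made up for illustration):

```r
# Minimal sketch of the two loop constructs (hypothetical reporting task)
months <- c("Jan", "Feb", "Mar")

# for(var in vec) {expr}: iterate over a vector
for (m in months) {
  message("Building report for ", m)  # placeholder for the real report code
}

# while(cond) {expr}: iterate until a condition fails
i <- 1
while (i <= length(months)) {
  message("Pass ", i, " of ", length(months))
  i <- i + 1
}
```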
Ordinary Least Squares
The OLS model or method relies on several assumptions to work properly. The following is a list of the main assumptions (and their violations). We must test whether the assumptions are violated and apply remedial measures if necessary (sometimes, violations can be tolerated).
Goal
Find the most efficient firms: the producers that maximize their sales with their means of production. It is an exploratory analysis to reduce the sample under study and proceed to deeper analyses (say we have a larger sample of 20 players and we want to focus on the 3 best).
Explore the dataset
Import the data: sales_m (in millions of Euros) of Champagne producers and their respective workforce and hectares (cultivated land surface) as explanatory variables.
Pointer
The XLConnect package fetches data directly into a Workbook. When dealing with several light datasets, the best option is keeping them in a unique workbook with several worksheets.
However, XLConnect is memory-intensive because workbooks are binary files. For larger datasets, a better option is to maintain separate csv files and load them into R with the readr package.
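A minimal sketch of the two approaches; the file and object names below are hypothetical, and the XLConnect calls are shown but not run:

```r
# Hypothetical example: write a small dataset to a csv file, then load it back
champagne <- data.frame(house = c("Boizel", "Pommery"),
                        sales_m = c(97, 288))
path <- file.path(tempdir(), "champagne.csv")
write.csv(champagne, path, row.names = FALSE)

# Base R reader; readr::read_csv(path) is the faster package alternative
champagne <- read.csv(path)

# With XLConnect (not run), a whole workbook of worksheets can be fetched:
# wb <- XLConnect::loadWorkbook("champagne.xlsx")
# champagne <- XLConnect::readWorksheet(wb, sheet = "champagne")
```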
Print the head and tail of the dataset.
## house sales_m workforce hectares
## 1 Boizel 97 10 1
## 2 Canard Duchêne 103 30 44
## 3 Charles Heidsieck 157 180 65
## house sales_m workforce hectares
## 8 Laurent Perrier 208 150 150
## 9 Moët et Chandon 724 1100 1000
## 10 Pommery 288 80 230
Summarize the dataset.
## house sales_m workforce hectares
## Length:10 Min. : 97.0 Min. : 10.0 Min. : 1.00
## Class :character 1st Qu.:121.8 1st Qu.: 24.0 1st Qu.: 49.25
## Mode :character Median :208.0 Median : 60.0 Median : 121.00
## Mean :239.6 Mean : 170.2 Mean : 199.90
## 3rd Qu.:269.2 3rd Qu.: 132.5 3rd Qu.: 210.00
## Max. :724.0 Max. :1100.0 Max. :1000.00
Summaries are good at detecting missing values, inconsistencies, and outliers. For example, in workforce, we notice a high maximum, pushing the mean well above the median.
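The printouts above are consistent with the standard head, tail, and summary calls; a runnable sketch on a hypothetical subset of the champagne rows shown above:

```r
# Hypothetical stand-in built from the rows printed above
champagne <- data.frame(
  house     = c("Boizel", "Canard Duchêne", "Charles Heidsieck",
                "Laurent Perrier", "Moët et Chandon", "Pommery"),
  sales_m   = c(97, 103, 157, 208, 724, 288),
  workforce = c(10, 30, 180, 150, 1100, 80),
  hectares  = c(1, 44, 65, 150, 1000, 230)
)

head(champagne, 3)   # first rows
tail(champagne, 3)   # last rows
summary(champagne)   # min, quartiles, median, mean, max per column
```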
Use another view (psych package).
## vars n mean sd median trimmed mad min max range skew
## house* 1 10 NaN NA NA NaN NA Inf -Inf -Inf NA
## sales_m 2 10 239.6 183.98 208 196.88 118.61 97 724 627 1.70
## workforce 3 10 170.2 331.73 60 74.00 65.23 10 1100 1090 2.15
## hectares 4 10 199.9 291.65 121 124.75 119.35 1 1000 999 1.99
## kurtosis se
## house* NA NA
## sales_m 1.90 58.18
## workforce 3.19 104.90
## hectares 2.71 92.23
The describe function (psych) flips the view. We obtain more descriptive statistics: central tendency measures, standard deviation, range, skewness, kurtosis, and standard error.
Bring out more analytic power!
## champagne
##
## 4 Variables 10 Observations
## ---------------------------------------------------------------------------
## house
## n missing distinct
## 10 0 10
##
## Boizel (1, 0.1), Canard Duchêne (1, 0.1), Charles Heidsieck (1, 0.1),
## Charles Laffite (1, 0.1), Château Malakoff (1, 0.1), Delamotte (1, 0.1),
## Deutz (1, 0.1), Laurent Perrier (1, 0.1), Moët et Chandon (1, 0.1),
## Pommery (1, 0.1)
## ---------------------------------------------------------------------------
## sales_m
## n missing distinct Info Mean Gmd
## 10 0 8 0.988 239.6 177.7
##
## Value 97 103 110 157 208 213 288 724
## Frequency 1 1 1 1 2 1 2 1
## Proportion 0.1 0.1 0.1 0.1 0.2 0.1 0.2 0.1
## ---------------------------------------------------------------------------
## workforce
## n missing distinct Info Mean Gmd
## 10 0 8 0.988 170.2 262
##
## Value 10 22 30 60 80 150 180 1100
## Frequency 2 1 1 2 1 1 1 1
## Proportion 0.2 0.1 0.1 0.2 0.1 0.1 0.1 0.1
## ---------------------------------------------------------------------------
## hectares
## n missing distinct Info Mean Gmd
## 10 0 8 0.988 199.9 256.2
##
## Value 1 37 44 65 121 150 230 1000
## Frequency 1 1 1 1 2 1 2 1
## Proportion 0.1 0.1 0.1 0.1 0.2 0.1 0.2 0.1
## ---------------------------------------------------------------------------
The describe function (from the Hmisc package) reveals more metrics: the number of observations, missing values, distinct values (important when dealing with categorical variables), and statistics on the distribution.
Visualize the dataset.
We can spot outliers.
Explore the results
Run the regression.
Print the coefficients and their confidence intervals.
## (Intercept) workforce hectares
## 106.7472958 -0.1699307 0.8092792
## 2.5 % 97.5 %
## (Intercept) 86.8478521 126.64673957
## workforce -0.3267326 -0.01312889
## hectares 0.6309260 0.98763244
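A sketch of the regression step; the sales_m values below are synthetic (only workforce and hectares come from the dataset), so the numbers will differ from the output above:

```r
# Synthetic stand-in; in the text the model is
# champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)
set.seed(1)
champagne <- data.frame(workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
                        hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230))
champagne$sales_m <- 100 + 0.8 * champagne$hectares + rnorm(10, sd = 20)  # synthetic

champagne_model <- lm(sales_m ~ workforce + hectares, data = champagne)

coef(champagne_model)     # point estimates
confint(champagne_model)  # 95% confidence intervals by default
```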
Print the predicted (‘fitted’) values of sales_m, as opposed to the actual values.
## 1 2 3 4 5 6 7 8
## 105.8573 137.2577 128.7629 289.1430 194.4742 202.9708 126.4948 202.6496
## 9 10
## 729.1027 279.2871
Plot the predicted values against the actual values of the regressand.
The model does a good job of predicting the regressand. A perfect fit would be right on the line.
Print the residuals: the differences between the actual and predicted values:
## 1 2 3 4 5 6
## -8.857268 -34.257659 28.237086 -1.143037 18.525764 5.029228
## 7 8 9 10
## -16.494783 5.350432 -5.102707 8.712945
## 1 2 3 4 5 6
## -0.46034348 -2.37241633 2.27236684 -0.07161078 0.97759263 0.25578198
## 7 8 9 10
## -0.89577936 0.26388855 -2.20350550 0.47769086
## 1 2 3 4 5 6
## -0.48867287 -1.84069022 1.79938600 -0.07731545 0.98070167 0.27478205
## 7 8 9 10
## -0.90869540 0.28339241 -1.76945496 0.50642492
Plot the residuals.
t-tests and F-tests
One function does it all. Print the tests.
##
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.258 -7.919 1.943 7.872 28.237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.74730 8.41548 12.685 4.38e-06 ***
## workforce -0.16993 0.06631 -2.563 0.0374 *
## hectares 0.80928 0.07543 10.730 1.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared: 0.9907, Adjusted R-squared: 0.9881
## F-statistic: 374.4 on 2 and 7 DF, p-value: 7.648e-08
We get a t-test (p-value) for each coefficient and the model F-statistic (p-value). They are all significant at or below the 5% level.
The t-value, or t-statistic, measures the significance of an individual coefficient.
The p-value is the probability associated with a statistical test. A coefficient is statistically significant when the p-value associated with a test is less than, say, 5%. 1% and lower is even better. Sometimes we can tolerate 10%.
With the F-statistic, when significant, we conclude that the regressors, taken together, are jointly different from 0.
Alternatively, compute an ANOVA table.
## Analysis of Variance Table
##
## Response: sales_m
## Df Sum Sq Mean Sq F value Pr(>F)
## workforce 1 255417 255417 633.60 3.986e-08 ***
## hectares 1 46408 46408 115.12 1.343e-05 ***
## Residuals 7 2822 403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or…
## Df Sum Sq Mean Sq F value Pr(>F)
## workforce 1 255417 255417 633.6 3.99e-08 ***
## hectares 1 46408 46408 115.1 1.34e-05 ***
## Residuals 7 2822 403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-test’s p-values are below the 1% significance level (three asterisks mean below 0.1%), which is highly significant. We reject the ‘all \(\beta = 0\)’ hypothesis. The coefficients look robust. However, we want to remain cautious…
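Both tables are consistent with anova on the model and summary of an aov fit; a toy sketch:

```r
# Toy model; in the text the object is champagne_model
d <- data.frame(x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
                x2 = c(2, 1, 4, 3, 6, 5, 8, 7))
d$y <- 3 + 2 * d$x1 + d$x2 + c(0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2, -0.2)
m <- lm(y ~ x1 + x2, data = d)

anova(m)                        # sequential (Type I) ANOVA table
summary(aov(y ~ x1 + x2, data = d))  # same decomposition via aov
```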
Outliers, extreme values, and influential observations
It is essential to understand the impact of outliers on a model. First, we need to detect them. Second, we decide whether we keep them.
There are several methods for detecting outliers. As we did above, we can use descriptive functions such as summary. We also plotted the observations (boxplots, histograms, etc.) prior to running the regression. Finally, we plotted the regression residuals. There are other techniques.
Run the Bonferroni t-test for catching extreme values (from the car package). We test the assumption of ‘no extreme value’. The test is good for lm and glm assuming normal distributions.
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 2 -2.372416 0.055338 0.55338
The second observation (obs. 2) is the most extreme but does not make the cut as an outlier. We fail to reject the ‘no extreme value’ hypothesis (the p-value is over 10%).
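The output above matches car’s outlierTest. As an illustration only (not the car implementation), the Bonferroni logic can be sketched in base R:

```r
# Toy model; the document's call is car::outlierTest(champagne_model)
d <- data.frame(x = 1:10, y = c(2, 4, 6, 8, 10, 12, 14, 16, 25, 20))
m <- lm(y ~ x, data = d)

rs    <- rstudent(m)                  # studentized residuals
dfree <- df.residual(m) - 1           # df for the outlier t-test
p     <- 2 * pt(abs(rs), dfree, lower.tail = FALSE)  # unadjusted p-values
p_bonf <- pmin(p * length(rs), 1)     # Bonferroni adjustment
p_bonf[which.max(abs(rs))]            # largest |rstudent| and its adjusted p
```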
A value may be extreme, but does it have a significant impact on the regression?
To answer the question, plot Cook’s distance (set the threshold to \(4 \times mean(cooksd)\)).
Alternatively.
Cook’s distance measures the influence exerted by each observation on the regression. Observation 9 has more influence than any other observation.
Extract the observations above the threshold.
## 9
## 49.54683
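A sketch of the Cook’s distance computation with the 4 × mean threshold (toy data; the y value at observation 9 is deliberately extreme):

```r
# Toy model; in the text the model is champagne_model
d <- data.frame(x = 1:10, y = c(2, 4, 6, 8, 10, 12, 14, 16, 40, 20))
m <- lm(y ~ x, data = d)

cooksd <- cooks.distance(m)
threshold <- 4 * mean(cooksd)   # cutoff used in the text
cooksd[cooksd > threshold]      # observations above the threshold

plot(cooksd, type = "h", main = "Cook's distance")
abline(h = threshold, lty = 2)
```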
Deal with an outlier
There are several solutions:
Normality
We plotted the predicted values against the actual values with a regression line…
…and the residual values.
We can plot an alternative view of these residuals. Plot a Q-Q chart (using qqPlot from the car package).
Not only do we obtain the regression line, we also get the confidence interval. When the residual values lie along the line and within the confidence interval, the residuals are approximately normal and the regression is linear.
We can represent these results with a histogram and a distribution of studentized residuals (using the MASS package).
We are far from a perfect normal distribution. We have fat tails, even though most values are concentrated around 0 (the regression line).
Under the hypothesis of normality, data should be symmetrical: neither platykurtic nor leptokurtic, and skewed neither to the left nor to the right.
Run the \(\chi^2\) goodness-of-fit test (actual values against the probabilities implied by the predictions).
##
## Chi-squared test for given probabilities
##
## data: champagne$sales_m
## X-squared = 19.977, df = 9, p-value = 0.01805
We reject the ‘normality’ hypothesis at the 5% significance level.
Run the Jarque-Bera normality test (from the normtest package).
##
## Jarque-Bera test for normality
##
## data: fitted(champagne_model)
## JB = 12.247, p-value < 2.2e-16
We reject again the ‘normality’ hypothesis at the 5% significance level.
Run the Shapiro normality test.
##
## Shapiro-Wilk normality test
##
## data: fitted(champagne_model)
## W = 0.6761, p-value = 0.0004527
We reject the ‘normality’ hypothesis at the 5% significance level.
We can temper these three results knowing that we are dealing with a small sample. We always have to consider test results from a global perspective, including a visual exploration of the dataset, the results, and a battery of tests.
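A sketch of these tests on a toy model; chisq.test and shapiro.test are base R, and the Jarque-Bera call (normtest package) is shown but not run:

```r
# Toy model standing in for champagne_model
set.seed(42)
d <- data.frame(x = 1:30)
d$y <- 5 + 2 * d$x + rnorm(30)
m <- lm(y ~ x, data = d)

# Chi-squared goodness of fit: actual values against predicted proportions
chisq.test(d$y, p = fitted(m) / sum(fitted(m)))

# Shapiro-Wilk on the fitted values, as in the text
shapiro.test(fitted(m))

# Jarque-Bera (not run; requires the normtest package):
# normtest::jb.norm.test(fitted(m))
```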
Global test and wrap-up
The normtest package allows for other tests on kurtosis, skewness, etc. Instead, we can use the gvlma package, which runs a battery of tests.
Run a global validation of the model assumptions (from the gvlma package).
##
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.258 -7.919 1.943 7.872 28.237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.74730 8.41548 12.685 4.38e-06 ***
## workforce -0.16993 0.06631 -2.563 0.0374 *
## hectares 0.80928 0.07543 10.730 1.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared: 0.9907, Adjusted R-squared: 0.9881
## F-statistic: 374.4 on 2 and 7 DF, p-value: 7.648e-08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = champagne_model)
##
## Value p-value Decision
## Global Stat 7.25407 0.1231 Assumptions acceptable.
## Skewness 0.18300 0.6688 Assumptions acceptable.
## Kurtosis 0.01897 0.8905 Assumptions acceptable.
## Link Function 4.57713 0.0324 Assumptions NOT satisfied!
## Heteroscedasticity 2.47496 0.1157 Assumptions acceptable.
We obtain a more detailed summary of statistics:
The only failing assumption is the link function. Residuals can be normally distributed, or they can follow an alternative distribution: for example, a logit link for a binomial distribution or an inverse link for a gamma distribution. In other words, the results would suggest running the regression with another link function to fit a nonnormal distribution!
It can also reveal another problem…
Extract statistics using the influence function…
## $hat
## 1 2 3 4 5 6 7
## 0.1850521 0.1407477 0.3891173 0.4578062 0.1147924 0.1690161 0.1826214
## 8 9 10
## 0.1157645 0.9793705 0.2657118
##
## $coefficients
## (Intercept) workforce hectares
## 1 -1.955594615 -0.006557502 0.009929118
## 2 -6.189522488 -0.012490588 0.021653354
## 3 9.899199277 0.078639350 -0.093353050
## 4 -0.001332613 0.004157981 -0.004588158
## 5 2.262288485 -0.005640924 0.003955039
## 6 0.501077166 -0.004932110 0.004720268
## 7 -3.553614199 -0.014595512 0.020108856
## 8 0.787650638 0.002235383 -0.002816517
## 9 19.073660021 -0.169639458 -0.074717411
## 10 0.355799817 -0.015896119 0.017690363
##
## $sigma
## 1 2 3 4 5 6 7 8
## 21.31341 15.57781 15.89873 21.67727 20.14168 21.56925 20.36732 21.56176
## 9 10
## 16.12286 21.28555
##
## $wt.res
## 1 2 3 4 5 6
## -8.857268 -34.257659 28.237086 -1.143037 18.525764 5.029228
## 7 8 9 10
## -16.494783 5.350432 -5.102707 8.712945
…and summary.
##
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.258 -7.919 1.943 7.872 28.237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.74730 8.41548 12.685 4.38e-06 ***
## workforce -0.16993 0.06631 -2.563 0.0374 *
## hectares 0.80928 0.07543 10.730 1.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared: 0.9907, Adjusted R-squared: 0.9881
## F-statistic: 374.4 on 2 and 7 DF, p-value: 7.648e-08
Signs on coefficients are confusing. More means of production should translate into more sales, but the workforce coefficient is negative, although significant at the 5% level. \(R^2\) and adjusted \(R^2\) are very strong. \(R^2\) is the regression’s explanatory power, the overall fit of the model (useful when comparing models). About 98% of the variation in sales can be explained by the independent variables. Only 2% is random or unexplained.
All p-values, including the F-statistic’s, are below the 5% and even the 1% level. However, in light of the above tests and a questionable coefficient, we must continue searching for problems.
Plot the residuals.
We take up all the previous findings:
Although observations 2, 3, 5 are further away from the regression line, observation 9 is the extreme value with the most impact on the regression.
More plots are available (type in methods(plot)).
The last two plots are alternative Cook’s distance plots.
Something is still wrong with the results and we have not found it yet.
Multicollinearity
Test the ‘no multicollinearity’ hypothesis between coefficients with the variance inflation factors (vif).
## workforce hectares
## 10.80331 10.80331
## workforce hectares
## 3.286839 3.286839
## workforce hectares
## TRUE TRUE
The test is positive (TRUE). We reject the ‘no multicollinearity’ hypothesis. We found our culprit!
workforce and hectares are not independent. Intuitively, it makes sense that a large cultivated area implies a large workforce. Multicollinearity between these two variables explains the coefficients (high \(R^2\), one low t-value).
Multicollinearity causes larger variances, large covariances, and high standard errors of the regressors. The results have wider confidence intervals and distributions with fatter tails. These large swings around the mean explain the rejection of normality by the tests.
With multicollinearity, we have weaker or insignificant t-values, but high \(R^2\).
Pointer
The rule of thumb \(\sqrt{vif} > 2\) is based on a t-test. According to the Student distribution, at the 5% significance level, t-values tend towards 1.96 as we add more observations. We round it up to 2, a more conservative benchmark.
With 10 observations, 9 degrees of freedom, the critical t-value is 1.833; here sqrt(vif) ≈ 3.29 > 1.833. Even with this threshold, the test rejects the 'no multicollinearity' hypothesis.
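With two regressors, the VIF is 1/(1 − R²) from the auxiliary regression of one regressor on the other; a base-R sketch using the regressor values printed in the model matrix (the document itself presumably uses car’s vif):

```r
# Regressor values taken from the model matrix printed later in the text
champagne <- data.frame(
  workforce = c(10, 30, 180, 22, 60, 10, 60, 150, 1100, 80),
  hectares  = c(1, 44, 65, 230, 121, 121, 37, 150, 1000, 230)
)

# VIF of workforce: 1 / (1 - R^2) of the auxiliary regression on hectares
r2  <- summary(lm(workforce ~ hectares, data = champagne))$r.squared
vif <- 1 / (1 - r2)
vif
sqrt(vif) > 2   # rule-of-thumb multicollinearity flag
```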
Remedies to multicollinearity
As suggested by the failing assumption of the link function, we face a specification error. We need not change the link function, though.
##
## Call:
## lm(formula = sales_m ~ workforce + hectares, data = champagne)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.258 -7.919 1.943 7.872 28.237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.74730 8.41548 12.685 4.38e-06 ***
## workforce -0.16993 0.06631 -2.563 0.0374 *
## hectares 0.80928 0.07543 10.730 1.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7 degrees of freedom
## Multiple R-squared: 0.9907, Adjusted R-squared: 0.9881
## F-statistic: 374.4 on 2 and 7 DF, p-value: 7.648e-08
All p-values but one are below the 1% level. workforce has the weakest t-value (the closest to 0). Simply remove workforce from the model and run another regression with hectares alone. Print the results.
##
## Call:
## lm(formula = sales_m ~ hectares, data = champagne)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.138 -17.639 0.664 21.475 29.583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.63142 10.20049 11.24 3.53e-06 ***
## hectares 0.62516 0.02988 20.92 2.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.15 on 8 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9798
## F-statistic: 437.6 on 1 and 8 DF, p-value: 2.861e-08
The results are still robust without workforce. Multicollinearity simply disappeared with only one regressor.
Rerun the normality and global tests to be sure.
##
## Chi-squared test for given probabilities
##
## data: champagne$sales_m
## X-squared = 19.977, df = 9, p-value = 0.01805
##
## Jarque-Bera test for normality
##
## data: fitted(champagne_model)
## JB = 12.247, p-value = 5e-04
##
## Call:
## lm(formula = sales_m ~ hectares, data = champagne)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.138 -17.639 0.664 21.475 29.583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.63142 10.20049 11.24 3.53e-06 ***
## hectares 0.62516 0.02988 20.92 2.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.15 on 8 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9798
## F-statistic: 437.6 on 1 and 8 DF, p-value: 2.861e-08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = champagne_model2)
##
## Value p-value Decision
## Global Stat 8.15102 0.086200 Assumptions acceptable.
## Skewness 0.05058 0.822056 Assumptions acceptable.
## Kurtosis 0.73125 0.392477 Assumptions acceptable.
## Link Function 7.26055 0.007049 Assumptions NOT satisfied!
## Heteroscedasticity 0.10864 0.741695 Assumptions acceptable.
We still have a nonnormality problem. The plausible cause is the small sample size (n < 30). A larger sample would enable us to conclude with more confidence whether the data are normal or not.
Conclusion
Add the predicted sales to the dataset and plot the results.
Any firm with ‘actual sales > predicted sales’ is likely to be an outperformer because the regression averages the results. Pommery, for example, sells more than what is expected by the model.
Other remedies to multicollinearity
Removing one (or more) explanatory variable(s) is only one way of remedying multicollinearity:
effects function might be of use).
Multicollinearity, covariance, and correlation
Print the covariance matrix of the model coefficients.
## (Intercept) workforce hectares
## (Intercept) 70.8202727 0.204013691 -0.326320674
## workforce 0.2040137 0.004397218 -0.004764483
## hectares -0.3263207 -0.004764483 0.005689023
The covariance matrix simply shows a measure of the joint variability between coefficients but does not indicate if coefficients are highly correlated.
Print the Pearson correlation matrix of the dataset.
## sales_m workforce hectares
## sales_m 1.00 0.92 0.99
## workforce 0.92 1.00 0.95
## hectares 0.99 0.95 1.00
##
## n= 10
##
##
## P
## sales_m workforce hectares
## sales_m 2e-04 0e+00
## workforce 2e-04 0e+00
## hectares 0e+00 0e+00
The correlation matrix shows the high collinearity between workforce and hectares; a sign of potential multicollinearity.
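These matrices are consistent with vcov on the model and cor on the data (Hmisc’s rcorr adds n and p-values); a toy sketch:

```r
# Toy model; in the text the objects are champagne_model and champagne
d <- data.frame(x1 = c(1, 3, 2, 5, 4, 6, 8, 7),
                x2 = c(2, 2, 3, 4, 5, 5, 7, 8))
d$y <- 1 + d$x1 + d$x2 + c(0.1, -0.2, 0.1, 0.2, -0.1, 0.1, -0.2, 0.1)
m <- lm(y ~ x1 + x2, data = d)

vcov(m)           # covariance matrix of the model coefficients
round(cor(d), 2)  # Pearson correlation matrix of the dataset

# Hmisc::rcorr(as.matrix(d)) additionally reports n and p-values (not run)
```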
More statistics
We can extract a lot more from the regression results.
## (Intercept) workforce hectares
## 1 1 10 1
## 2 1 30 44
## 3 1 180 65
## 4 1 22 230
## 5 1 60 121
## 6 1 10 121
## 7 1 60 37
## 8 1 150 150
## 9 1 1100 1000
## 10 1 80 230
## attr(,"assign")
## [1] 0 1 2
## $std.dev
## [1] 20.07782
##
## $hat
## [1] 0.1850521 0.1407477 0.3891173 0.4578062 0.1147924 0.1690161 0.1826214
## [8] 0.1157645 0.9793705 0.2657118
##
## $std.res
## [1] -0.48867287 -1.84069022 1.79938600 -0.07731545 0.98070167
## [6] 0.27478205 -0.90869540 0.28339241 -1.76945496 0.50642492
##
## $stud.res
## [1] -0.46034348 -2.37241633 2.27236684 -0.07161078 0.97759263
## [6] 0.25578198 -0.89577936 0.26388855 -2.20350550 0.47769086
##
## $cooks
## [1] 0.018075047 0.184995371 0.687464383 0.001682436 0.041573872
## [6] 0.005119069 0.061495548 0.003504794 49.546825102 0.030935223
##
## $dfits
## [1] -0.21936337 -0.96017691 1.81359302 -0.06580244 0.35203989
## [6] 0.11535533 -0.42341397 0.09548260 -15.18249574 0.28735523
##
## $correlation
## (Intercept) workforce hectares
## (Intercept) 1.0000000 0.3655876 -0.5140992
## workforce 0.3655876 1.0000000 -0.9525942
## hectares -0.5140992 -0.9525942 1.0000000
##
## $std.err
## [,1]
## (Intercept) 8.41547816
## workforce 0.06631152
## hectares 0.07542561
##
## $cov.scaled
## (Intercept) workforce hectares
## (Intercept) 70.8202727 0.204013691 -0.326320674
## workforce 0.2040137 0.004397218 -0.004764483
## hectares -0.3263207 -0.004764483 0.005689023
##
## $cov.unscaled
## (Intercept) workforce hectares
## (Intercept) 0.1756808097 5.060880e-04 -8.094897e-04
## workforce 0.0005060880 1.090799e-05 -1.181905e-05
## hectares -0.0008094897 -1.181905e-05 1.411252e-05
Pointer
Other functions: deviance(champagne_model), logLik(champagne_model), champagne_model$df
Goal
Import the data: year, average.price.per.bottle (average price per bottle (Euros)), grape.cost.per.kg (cost per kg of grapes (Euros)).
Print the head of the dataset.
## year average.price.per.bottle grape.cost.per.kg
## 1 1987 11.06 3.32
## 2 1988 11.15 3.66
## 3 1989 11.82 4.12
## 4 1990 12.48 4.88
## 5 1991 12.94 4.57
## 6 1992 11.50 3.65
The dataset is named champagne2 since we follow up on the previous case. This dataset is more inclusive and focuses on all wine appellations.
Plot the data.
Find out whether the dataset hides two subsets, a phenomenon known as a ‘structural break’.
Structural change
The cost of grapes represents 70% of the cost price per bottle. In 1990, the market was liberalized. Prices became more volatile. Hikes in the prices coupled with an economic downturn have shrunk the demand since 1991.
This liberalization (structural break) had an impact on the relationship between average.price.per.bottle and grape.cost.per.kg. Our hypothesis: there are two regimes:
Pointer
`year` is optional. We deal with a cross-sectional dataset indexed by a date marker, which helps subset it. The ideal dataset here would be a time series.
Run the regressions.
1987-2002.
##
## Call:
## lm(formula = average.price.per.bottle ~ grape.cost.per.kg, data = champagne2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94270 -0.35698 0.03706 0.53772 0.64311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.340 1.142 4.674 0.000358 ***
## grape.cost.per.kg 1.542 0.297 5.192 0.000137 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5417 on 14 degrees of freedom
## Multiple R-squared: 0.6582, Adjusted R-squared: 0.6337
## F-statistic: 26.95 on 1 and 14 DF, p-value: 0.0001366
1987-1991, market planning.
##
## Call:
## lm(formula = average.price.per.bottle ~ grape.cost.per.kg, data = champagne2_pre)
##
## Residuals:
## 1 2 3 4 5
## 0.1024 -0.2089 -0.0818 -0.3188 0.5071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0390 1.2150 5.793 0.0102 *
## grape.cost.per.kg 1.1803 0.2928 4.031 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.374 on 3 degrees of freedom
## Multiple R-squared: 0.8441, Adjusted R-squared: 0.7922
## F-statistic: 16.25 on 1 and 3 DF, p-value: 0.02745
1992-2002, market liberalization
##
## Call:
## lm(formula = average.price.per.bottle ~ grape.cost.per.kg, data = champagne2_post)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8420 -0.3452 -0.0034 0.5469 0.7001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7690 2.0949 2.277 0.0488 *
## grape.cost.per.kg 1.6702 0.5659 2.951 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5893 on 9 degrees of freedom
## Multiple R-squared: 0.4918, Adjusted R-squared: 0.4353
## F-statistic: 8.709 on 1 and 9 DF, p-value: 0.01619
We can cut through all these results with a Chow test (an F-test) to validate or reject the ‘no structural break’ hypothesis, under which the first segment’s coefficients equal the second segment’s coefficients.
## [1] 0.004132164
The p-value is below 1%. The Chow test rejects the ‘no structural break’ hypothesis.
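The Chow statistic can be computed by hand from the pooled and sub-period residual sums of squares; a sketch on synthetic two-regime data (the actual champagne2 values are not reproduced here):

```r
# Synthetic two-regime data standing in for champagne2 (pre/post split)
set.seed(7)
x <- runif(16, 3, 6)
y <- ifelse(seq_along(x) <= 5, 7 + 1.2 * x, 4.8 + 1.7 * x) + rnorm(16, sd = 0.4)
d <- data.frame(x, y)
pre  <- d[1:5, ]
post <- d[6:16, ]

rss <- function(m) sum(resid(m)^2)
rss_pooled <- rss(lm(y ~ x, data = d))
rss_pre    <- rss(lm(y ~ x, data = pre))
rss_post   <- rss(lm(y ~ x, data = post))

k <- 2   # parameters per regression (intercept + slope)
f <- ((rss_pooled - rss_pre - rss_post) / k) /
     ((rss_pre + rss_post) / (nrow(d) - 2 * k))
pf(f, k, nrow(d) - 2 * k, lower.tail = FALSE)   # Chow test p-value
```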
Conclusion
We have two subsets rather than one dataset.
Visualize the two regimes: the restricted model vs. the two unrestricted models.
We can see that the ‘pre’ intercept is higher. Under market planning, prices are higher and the market is more stable. For any cost price, we get a higher price, which translates into more predictable revenues for producers.
Under market liberalization, prices are lower, but the variance is higher, as the observations are more widely dispersed, accounting for more volatility.
Pointer
There are several types of structural breaks. Dissimilar regressions (the case above) have different intercepts and different slopes. Concurrent regressions have a common intercept but different slopes. Parallel regressions have a common slope but different intercepts.
Volatility
Volatility is measured with the coefficients’ standard errors and the regression’s \(\sigma\). Standard error and \(\sigma\) are related to variance and standard deviation, two measures of volatility.
Pre.
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.038981 1.2150274 5.793269 0.01023226
## grape.cost.per.kg 1.180297 0.2928131 4.030888 0.02744966
## $sigma
## [1] 0.3739763
Post.
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.769050 2.0948718 2.276535 0.04883840
## grape.cost.per.kg 1.670211 0.5659492 2.951167 0.01619019
## $sigma
## [1] 0.5892843
champagne2_model_pre has a higher intercept (higher average price), a flatter slope (less sensitive to the cost of grapes), and lower \(\sigma\) (less volatility).
We can also compare models with the Root Mean Squared Error (RMSE). The RMSE is the standard deviation of the residuals (the spread of the points about the fitted regression line), a good measure of volatility.
\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}}\]
Pre.
## [1] 1.098646
Post.
## [1] 1.361421
champagne2_model_pre has a lower RMSE (less volatility, better fit).
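The RMSE follows directly from the formula above; a sketch on a toy model:

```r
# RMSE of a fitted model: sqrt(mean(residuals^2))
rmse <- function(model) sqrt(mean(resid(model)^2))

# Toy illustration; in the text the models are champagne2_model_pre/_post
d <- data.frame(x = 1:10, y = c(2, 4, 7, 8, 11, 12, 15, 16, 19, 20))
m <- lm(y ~ x, data = d)
rmse(m)
```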
Finding the break
Most of the time, a visual exploration reveals the structural break. However, we might want to opt for analytical techniques to find the break. The strucchange package offers several functions to deal with structural breaks (fit, plot, test), including generalized fluctuation tests to detect breaks. For example, we can compute the CUSUM and MOSUM processes with the efp function (empirical fluctuation processes). Other functions emphasize deviations from the norm (boundaries). We can implement Chow (F-) tests as well.
Double linear regressions and more
If the deviation falls back in line following a second structural break, we might have to decide whether we want to deal with three subsets (three regressions) or keep a unique dataset (one regression).
Modeling the break
Just like an interaction effect, each subset might have its own intercept, its own slope, or both. We can capture this change with a dummy variable. We will cover dummy variables in another case.
Piecewise linear regressions
Alternatively, we can consider a piecewise linear regression (a member of the general class of spline functions). The model is:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 (X_1 - X^*) D + \epsilon\]
Where \(X^*\) is the knot (the hinge or the joint).
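The \((X_1 - X^*)D\) term can be built with pmax, since pmax(x - knot, 0) is zero before the knot and \(X_1 - X^*\) after it; a sketch with a hypothetical knot at x = 5:

```r
# Piecewise linear fit with one knot (hypothetical knot at x_star = 5)
set.seed(3)
x <- 1:12
y <- ifelse(x <= 5, 2 + 1.0 * x, 7 + 2.5 * (x - 5)) + rnorm(12, sd = 0.3)

x_star <- 5
hinge  <- pmax(x - x_star, 0)   # (X1 - X*) * D, zero before the knot

m <- lm(y ~ x + hinge)
coef(m)   # the hinge coefficient estimates the slope change after the knot
```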
There are different forms of piecewise linear regressions.
Other than the model above, there are multiple ways to handle piecewise regressions. It all depends on what we want to achieve. We can then consider polynomial forms, use spline regressions, work with time-series functions (moving average for example), add dummy variables to capture the interaction effects, slope changes and intercept jumps, etc. We can also use the segmented package for segmented regressions.
Goal
Import the data: a sample of 11 buildings.
Print the head of the dataset.
## real_estate_value area offices entrances age
## 1 213000 25156 2 2.0 20
## 2 210000 25406 2 2.0 12
## 3 226500 25657 3 1.5 33
real_estate_value (in $US) is determined by the area in square feet, the number of offices and entrances, and the building’s age in years.
Discover the undervalued buildings. It is an exploratory analysis to reduce the number of targets and proceed to deeper analyses (say we have a larger set of 100 buildings and we want to focus on the 10 best).
Correlation
Check the explanatory variables’ independence with a Pearson correlation matrix,…
## real_estate_value area offices entrances
## real_estate_value 1.0000000 0.3593228 0.8832723 0.5162754
## area 0.3593228 1.0000000 0.2234715 0.6204393
## offices 0.8832723 0.2234715 1.0000000 0.3107144
## entrances 0.5162754 0.6204393 0.3107144 1.0000000
## age -0.4089111 0.2214926 -0.0522657 -0.0505677
## age
## real_estate_value -0.4089111
## area 0.2214926
## offices -0.0522657
## entrances -0.0505677
## age 1.0000000
…with a Spearman correlation matrix,…
## real_estate_value area offices entrances
## real_estate_value 1.00000000 0.3272727 0.9045340 0.294510775
## area 0.32727273 1.0000000 0.1909572 0.569709369
## offices 0.90453403 0.1909572 1.0000000 0.178809772
## entrances 0.29451078 0.5697094 0.1788098 1.000000000
## age -0.08200477 0.3325749 0.1989476 0.004839056
## age
## real_estate_value -0.082004769
## area 0.332574895
## offices 0.198947637
## entrances 0.004839056
## age 1.000000000
…and with a Kendall correlation matrix.
## real_estate_value area offices entrances
## real_estate_value 1.00000000 0.2000000 0.8090398 0.23164312
## area 0.20000000 1.0000000 0.1797866 0.44222778
## offices 0.80903983 0.1797866 1.0000000 0.15617376
## entrances 0.23164312 0.4422278 0.1561738 1.00000000
## age -0.03669879 0.2201928 0.1587632 0.04250511
## age
## real_estate_value -0.03669879
## area 0.22019275
## offices 0.15876322
## entrances 0.04250511
## age 1.00000000
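The three matrices above come from the cor function and its method argument; a stand-in sketch (toy data frame in place of the building dataset):

```r
# Pearson, Spearman and Kendall correlation matrices via cor(method = ...)
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 y = c(2, 1, 4, 3, 6, 5),
                 z = c(6, 4, 5, 2, 3, 1))

cor(df, method = "pearson")    # linear correlation
cor(df, method = "spearman")   # rank-based (monotonic relationships)
cor(df, method = "kendall")    # rank-based (concordant/discordant pairs)
```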
The correlation between area and entrances stands out:

- Pearson: over 0.60.
- Spearman: rounded to 0.57.
- Kendall: rounded to 0.44.

Test the ‘no correlation’ hypothesis (t-test) among the variables (based on the Pearson correlation).
## real_estate_value area offices entrances age
## real_estate_value 1.00 0.36 0.88 0.52 -0.41
## area 0.36 1.00 0.22 0.62 0.22
## offices 0.88 0.22 1.00 0.31 -0.05
## entrances 0.52 0.62 0.31 1.00 -0.05
## age -0.41 0.22 -0.05 -0.05 1.00
##
## n= 11
##
##
## P
## real_estate_value area offices entrances age
## real_estate_value 0.2778 0.0003 0.1040 0.2118
## area 0.2778 0.5089 0.0417 0.5128
## offices 0.0003 0.5089 0.3524 0.8787
## entrances 0.1040 0.0417 0.3524 0.8826
## age 0.2118 0.5128 0.8787 0.8826
The top matrix confirms the correlation between area and entrances (0.62). The bottom matrix displays the p-values. The p-value for area and entrances is below the 5% significance level. We reject the ‘no correlation’ hypothesis.
More specifically, test the ‘no correlation’ hypothesis between area and entrances.
##
## Pearson's product-moment correlation
##
## data: building$area and building$entrance
## t = 2.3734, df = 9, p-value = 0.04168
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03275536 0.88932133
## sample estimates:
## cor
## 0.6204393
We reject the hypothesis and we confirm the previous results.
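The test above is the cor.test function; a stand-in sketch (toy vectors in place of building$area and building$entrances):

```r
# t-test of the 'no correlation' hypothesis on a Pearson correlation
a  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
b  <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12)
ct <- cor.test(a, b, method = "pearson")

ct$estimate    # the sample correlation
ct$p.value     # a small value rejects 'no correlation'
ct$conf.int    # 95 percent confidence interval
```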
Visualize the pairwise correlations with a scatter plot matrix (or splom).
Try another splom.
Due to the limited sample (11 observations), these sploms do not display any distinctive patterns.
Plot a ‘correlogram’, which is a more intuitive representation of pairwise correlations.
The darker blue shades for positive correlations and the 2/3 pie at the crossing of area and entrances reveal the positive correlation close to 0.60.
Multicollinearity
The presence of correlation is a sign of potential multicollinearity.
Compute three models:

- Model 1: all four regressors.
- Model 2: without area.
- Model 3: without entrances.

Compare the results.
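The three models are plain lm() calls. A self-contained sketch; note that only the first three rows of this stand-in data frame match the printed head, the rest are made up so the sketch runs on its own:

```r
# The three competing models on the (stand-in) building data frame
building <- data.frame(
  real_estate_value = c(213000, 210000, 226500, 219000, 240500, 215000,
                        231000, 263000, 249000, 229000, 245000),
  area      = c(25156, 25406, 25657, 25906, 26158, 25859,
                26160, 26660, 26661, 26360, 26860),
  offices   = c(2, 2, 3, 3, 4, 2, 3, 5, 4, 3, 4),
  entrances = c(2, 2, 1.5, 2, 3, 1, 1.5, 2, 3, 1.5, 2),
  age       = c(20, 12, 33, 43, 53, 23, 22, 12, 34, 23, 21))

building_model1 <- lm(real_estate_value ~ area + offices + entrances + age,
                      data = building)                         # Model 1
building_model2 <- update(building_model1, . ~ . - area)       # Model 2
building_model3 <- update(building_model1, . ~ . - entrances)  # Model 3
summary(building_model1)$adj.r.squared
```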
\(R^2\).
## $r.squared
## [1] 0.9903542
## $r.squared
## [1] 0.9704106
## $r.squared
## [1] 0.9801031
Adjusted-\(R^2\).
## $adj.r.squared
## [1] 0.9839236
## $adj.r.squared
## [1] 0.9577295
## $adj.r.squared
## [1] 0.9715758
F-statistic.
## $fstatistic
## value numdf dendf
## 154.0076 4.0000 6.0000
## $fstatistic
## value numdf dendf
## 76.52386 3.00000 7.00000
## $fstatistic
## value numdf dendf
## 114.9377 3.0000 7.0000
Coefficients, t-values and p-values.
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55266.300303 32140.372917 1.719529 1.363131e-01
## area 4.612014 1.309435 3.522140 1.248610e-02
## offices 19318.313733 1050.854689 18.383430 1.669850e-06
## entrances 3519.961492 1393.949896 2.525171 4.497194e-02
## age -331.122208 34.852129 -9.500774 7.750398e-05
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 167779.422 5744.52017 29.206864 1.420161e-08
## offices 19528.100 1701.25100 11.478671 8.561646e-06
## entrances 6573.309 1770.05110 3.713627 7.518847e-03
## age -291.234 53.44688 -5.449036 9.569803e-04
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8602.215670 34966.208964 0.2460151 8.127284e-01
## area 6.668365 1.363477 4.8907053 1.771817e-03
## offices 19865.617817 1367.265388 14.5294527 1.745583e-06
## age -351.264281 45.112410 -7.7864225 1.082962e-04
- In Model 1, entrances is the weakest coefficient.
- In Model 2, without area, we can see an improvement in the significance of entrances while all the other parameters remain robust.
- In Model 3, without entrances, the statistics are better than in Model 2.

Test collinearity for the three models.
## area offices entrances age
## FALSE FALSE FALSE FALSE
## offices entrances age
## FALSE FALSE FALSE
## area offices age
## FALSE FALSE FALSE
All tests are negative (FALSE). We cannot reject the ‘no multicollinearity’ hypothesis.
Again, correlation or collinearity a priori is simply an indicator of potential collinearity a posteriori.
Multicollinearity can be a problem. Can? We will address the issue in another case where we encounter multicollinearity again…
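The collinearity test printed above returns one logical per regressor. A base-R sketch of such a check: variance inflation factors computed from the inverse of the regressor correlation matrix, flagged against the common rule-of-thumb threshold of 10 (mtcars stands in for the building data):

```r
# Variance inflation factors from the regressor correlation matrix
X   <- mtcars[, c("disp", "hp", "wt")]
vif <- diag(solve(cor(X)))

round(vif, 2)
vif > 10    # TRUE would flag problematic multicollinearity
```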
Select a model - Methods
Given the results, Model 1 appears to be the best.
However, before jumping to a conclusion, let’s cover the methods for selecting the best model.
When building a model, we want to measure if adding or removing variables improves the model.
First, we need benchmarks. Here is a checklist:
Above all, we want to pick the simplest and most explanatory model of all, following the concept of ‘parsimony’.
Pointer
Occam's razor, the 'law of parsimony', is a problem-solving principle attributed to William of Ockham (c. 1287-1347). The principle can be interpreted as stating that, among competing hypotheses, the one with the fewest assumptions should be selected.
In science, Occam's razor is used as a heuristic technique (discovery tool) to guide scientists in the development of theoretical models.
Second, we need a procedure to run and compare models. Depending on the approach, it can be more or less automated. The following procedures focus on a few items from the checklist. Functional forms will be covered in other cases, but we will not address data quality.
Preprocessing data is a vast subject covering detection/ removal/replacement of measurement errors, missing values, outliers, extreme values, etc.
Select a model - ANOVA
We compare two models in terms of coefficients. We test if Model A’s coefficients are statistically different from Model B’s coefficients. If the models are not different, they belong to the same class. We thus boil down the number of candidates. We then pick the best model in each class and compare the winners of each class before declaring the overall winner.
The models:
- Model 1: all four regressors.
- Model 2: without area.
- Model 3: without entrances.

The comparisons:
1 vs. 2.
## Analysis of Variance Table
##
## Model 1: real_estate_value ~ area + offices + entrances + age
## Model 2: real_estate_value ~ offices + entrances + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6 38997951
## 2 7 119629281 -1 -80631329 12.405 0.01249 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 vs. 3.
## Analysis of Variance Table
##
## Model 1: real_estate_value ~ area + offices + entrances + age
## Model 2: real_estate_value ~ area + offices + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6 38997951
## 2 7 80442941 -1 -41444990 6.3765 0.04497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic is higher and more significant (lower p-value) in ‘1 vs. 2’ than in ‘1 vs. 3’. The F-test measures the statistical difference: the higher the F-statistic, the lower the p-value, and the more significant the difference.
Model 1 is more statistically different from Model 2 (lower p-value) than from Model 3.
Models 1 and 3 are ‘closer’ to each other. The class winner would be Model 1 though (better statistics). In the final round, Model 1, which happens to have a higher predictive power, wins over Model 3.
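The pairwise comparisons above are plain anova() calls on nested models; a stand-in sketch on mtcars:

```r
# F-test of nested models: does dropping a regressor worsen the fit?
full    <- lm(mpg ~ wt + hp + disp, data = mtcars)
reduced <- update(full, . ~ . - disp)

anova(full, reduced)    # a small Pr(>F) argues for keeping disp
```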
The ANOVA method can be tedious.
Select a model - Heuristic technique
Let’s automate things a bit.
We can employ the ‘backward’ method where we start from the most comprehensive model and remove variables one by one (starting with the least significant one, then a second variable, etc.) until we find an optimal model where we no longer marginally improve the overall significance.
Another approach consists in introducing new variables in a similar manner and searching for the optimal model: the ‘forward’ method.
Nonetheless, the best procedure is the ‘stepwise’ method where we pick the first regressor (the most correlated with the regressand), calculate the residuals, then pick a second regressor (the most correlated with the residuals), and so on, until we no longer marginally improve the overall significance.
Load the MASS package to implement the heuristic technique. The improvement criterion is to minimize the AIC coefficient (we will develop on AIC below).
The stepAIC function in package MASS offers a wider range of object classes as opposed to the minimal step, add1 and drop1 functions from the stats package.
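A minimal sketch of the call, on a stand-in model (MASS ships with R):

```r
library(MASS)

# Backward elimination driven by the AIC; trace = TRUE would print each step
m      <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
m_step <- stepAIC(m, direction = "backward", trace = FALSE)

formula(m_step)    # the regressors the procedure retained
```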
Print the results.
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
real_estate_value ~ area + offices + entrances + age
Final Model:
real_estate_value ~ area + offices + entrances + age
Step Df Deviance Resid. Df Resid. Dev AIC
1        6   38997951 175.8924

For one thing, this approach is faster. Second, the results show that when removing a variable, the AIC increases (a deterioration). Therefore, the best decision, to minimize the AIC, is not to remove any variable.
We would be better off if we could work with a larger sample (not 11 observations, but at least 30).
Nevertheless, ‘stepwise optimization’ is controversial… It applies a heuristic ‘optimization’, normally used on deterministic values, while statistics is based on estimated values and probabilities.
Select a model - Algorithmic technique 1
We should favour the ANOVA approach or the like. With the leaps package, we can add a ‘loop’ to speed up the procedure. Run a bunch of regressions using iterations and keep the 10 best.
## Subset selection object
## Call: regsubsets.formula(real_estate_value ~ area + offices + entrances +
## age, data = building, nbest = 10)
## 4 Variables (and intercept)
## Forced in Forced out
## area FALSE FALSE
## offices FALSE FALSE
## entrances FALSE FALSE
## age FALSE FALSE
## 10 subsets of each size up to 4
## Selection Algorithm: exhaustive
## area offices entrances age
## 1 ( 1 ) " " "*" " " " "
## 1 ( 2 ) " " " " "*" " "
## 1 ( 3 ) " " " " " " "*"
## 1 ( 4 ) "*" " " " " " "
## 2 ( 1 ) " " "*" " " "*"
## 2 ( 2 ) " " "*" "*" " "
## 2 ( 3 ) "*" "*" " " " "
## 2 ( 4 ) " " " " "*" "*"
## 2 ( 5 ) "*" " " " " "*"
## 2 ( 6 ) "*" " " "*" " "
## 3 ( 1 ) "*" "*" " " "*"
## 3 ( 2 ) " " "*" "*" "*"
## 3 ( 3 ) "*" "*" "*" " "
## 3 ( 4 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
The algorithm quickly produces up to 10 regressions per subset size; each regression can have up to 4 regressors. Each model is ranked according to one or many statistics. Rank the regressions according to: \(R^2\), adjusted-\(R^2\), BIC and AIC.
Let’s focus on two metrics: \(R^2\) and adjusted-\(R^2\). The most robust model (with the highest \(R^2\)) has four regressors, confirming the above findings. The second-best model removes entrances; this is Model 3.
The other metrics are:

- AIC (as in the stepAIC function): Akaike’s Information Criterion.
- BIC: the Bayesian Information Criterion.

AIC and BIC are close relatives; they are suitable when comparing models fitted by maximum likelihood.
AIC and BIC are also good benchmarks for OLS. Since the Mallow’s \(C_p\) is estimated using OLS, it is equivalent to the AIC (and the BIC) in the special case of Gaussian (normal) linear regression or a MLE using a normal distribution.
As opposed to \(R^2\), the smaller these metrics, the better.
Pointer
The car package has the subsets function. It has similarities with the regsubsets function from the leaps package.
Select a model - Algorithmic technique 2
When we plot the Cook’s distance, we show the explanatory power of each observation in one model: the relative importance or leverage.
Focusing on Model 1, compute the relative importance, but for each regressor, with the relaimpo package. Choose four criteria to rank the regressors ('lmg', 'last', 'first', and 'pratt').
## Response variable: real_estate_value
## Total response variance: 404298409
## Analysis based on 11 observations
##
## 4 Regressors:
## area offices entrances age
## Proportion of variance explained by model: 99.04%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg last first pratt
## area 0.06664414 0.02775291 0.09613542 0.06914137
## offices 0.64838551 0.75604618 0.58090227 0.69325371
## entrances 0.12150659 0.01426516 0.19846167 0.07095651
## age 0.16346376 0.20193575 0.12450063 0.16664841
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs
## area 8.696244 5.703305 4.404754 4.612014
## offices 21952.083333 20790.024850 19752.970125 19318.313733
## entrances 13351.127820 10701.658241 7309.851421 3519.961492
## age -335.470519 -344.902991 -336.238494 -331.122208
Whatever the criterion ‘type’, offices dominates the relative importance ranking, with age, entrances, and area far behind; area and entrances are the weakest links.
We could repeat this calculation on Model 2 and Model 3.
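A guarded sketch of such a repeat (relaimpo is a contributed package and may not be installed; mtcars stands in for the building data):

```r
# Relative importance on a reduced model, if relaimpo is available
if (requireNamespace("relaimpo", quietly = TRUE)) {
  m2 <- lm(mpg ~ hp + wt + qsec, data = mtcars)   # stand-in for Model 2
  ri <- relaimpo::calc.relimp(m2, type = c("lmg", "last", "first", "pratt"),
                              rela = TRUE)
  print(ri@lmg)    # normalized lmg shares, summing to 1
}
```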
Types:

- 'lmg': the \(R^2\) contribution averaged over orderings among regressors.
- 'last': each variable’s contribution when included last in the regression; sometimes called ‘usefulness’.
- 'first': each variable’s contribution when included first, which is just the squared covariance between the regressand and the regressor.
- 'pratt': the product of the standardized coefficient and the correlation.

Pointer
There are other types = c('pmvd', 'betasq', 'genizi', 'car').
We can also use ‘bootstrap measures of relative importance’.
Warning! This method is picky and will only work if it detects a ‘positive definite covariance matrix’ or a data matrix/data frame with linearly independent columns.
Now, we have detected collinearity a priori between area and entrances. building_model2 does not include area. Therefore, for the example, we will run the bootstrap with Model 2 to satisfy the condition.
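A guarded sketch of the bootstrap calls (relaimpo again, with mtcars standing in for Model 2; 100 replicates and the percentile method, as in the output below):

```r
# Bootstrap measures of relative importance, if relaimpo is available
if (requireNamespace("relaimpo", quietly = TRUE)) {
  m2   <- lm(mpg ~ hp + wt + qsec, data = mtcars)  # stand-in for Model 2
  boot <- relaimpo::boot.relimp(m2, b = 100, type = "lmg", rela = TRUE)
  res  <- relaimpo::booteval.relimp(boot, bty = "perc", level = 0.95)
  print(res)   # point estimates plus percentile confidence intervals
}
```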
Print the results.
Response variable: real_estate_value
Total response variance: 404298409
Analysis based on 11 observations
3 Regressors:
offices entrances age
Proportion of variance explained by model: 97.04%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg last first pratt
offices 0.6865683 0.75187003 0.6426873 0.7151843
entrances 0.1649901 0.07869662 0.2195701 0.1352301
age 0.1484416 0.16943335 0.1377426 0.1495856
Average coefficients for different model sizes:
1X 2Xs 3Xs
offices 21952.0833 20682.3290 19528.100
entrances 13351.1278 9885.6865 6573.309
age -335.4705 -306.6348 -291.234
Confidence interval information ( 100 bootstrap replicates, bty= perc ):
Relative Contributions with confidence intervals:
Lower Upper
percentage 0.95 0.95 0.95
offices.lmg 0.6866 AB_ 0.3939 0.9022
entrances.lmg 0.1650 ABC 0.0235 0.4352
age.lmg 0.1484 _BC 0.0313 0.4081
offices.last 0.7519 A__ 0.4459 0.9478
entrances.last 0.0787 _BC 0.0084 0.2798
age.last 0.1694 _BC 0.0258 0.4587
offices.first 0.6427 AB_ 0.3328 0.9388
entrances.first 0.2196 ABC 0.0012 0.4876
age.first 0.1377 ABC 0.0013 0.4730
offices.pratt 0.7152 A__ 0.4288 1.0301
entrances.pratt 0.1352 _BC -0.0930 0.3918
age.pratt 0.1496 _BC -0.1976 0.3729
Letters indicate the ranks covered by bootstrap CIs.
(Rank bootstrap confidence intervals always obtained by percentile method)
CAUTION: Bootstrap confidence intervals can be somewhat liberal.
Differences between Relative Contributions:
Lower Upper
difference 0.95 0.95 0.95
offices-entrances.lmg 0.5216 -0.0132 0.8523
offices-age.lmg 0.5381 * 0.0476 0.8706
entrances-age.lmg 0.0165 -0.3079 0.3197
offices-entrances.last 0.6732 * 0.2918 0.9312
offices-age.last 0.5824 * 0.0141 0.9169
entrances-age.last -0.0907 -0.4158 0.2019
offices-entrances.first 0.4231 -0.1047 0.9012
offices-age.first 0.5049 -0.0946 0.9152
entrances-age.first 0.0818 -0.3997 0.4313
offices-entrances.pratt 0.5800 * 0.1542 1.0946
offices-age.pratt 0.5656 * 0.0712 1.1860
entrances-age.pratt -0.0144 -0.3206 0.5696
* indicates that CI for difference does not include 0.
CAUTION: Bootstrap confidence intervals can be somewhat liberal.

All these numbers! Let’s make it simple and plot the results.
All the results from the calc.relimp function are there, now with bootstrap confidence intervals: we have a distribution. For each method, the bar charts show the leverage per regressor (the % contribution to \(R^2\)). In addition, a 95% confidence interval (upper and lower limits) is shown with the fine lines.
Conclusion
Use ‘Model 1’ for valuation. Plot the results.
The model averages results. Accordingly, when the actual value is above the predicted value (undervalued by the model), the observation is likely to be a good candidate for further analyses. Buildings 1, 3, 4, 8, and 9 deserve closer attention. Why do they appear undervalued? Are they really undervalued?
We can also perform sensitivity analyses with the coefficients.
## (Intercept) area offices entrances age
## 55266.300303 4.612014 19318.313733 3519.961492 -331.122208
Run the analyses and explore the scenarios.
## [1] 223759.1
## [1] 461.2014
## [1] 235649.4
## [1] 232338.2
## [1] 235649.4
## [1] 232799.4
## [1] 233000.4
## [1] 234425.4
## [1] 223759.1
## [1] 223428
We should not rely on results beyond 1 year. As a building ages, its value decreases (asset depreciation), but real estate valuation relies on many other factors (some with positive impacts) such as good location, inflation, and the neighbourhood economic development. Keep in mind it is a snapshot of the situation.
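The scenarios above multiply the coefficient vector by hand. A sketch with the Model 1 coefficients from the printout; the new building’s characteristics are hypothetical:

```r
# Price a hypothetical building straight from the coefficient vector
b <- c("(Intercept)" = 55266.300303, area = 4.612014,
       offices = 19318.313733, entrances = 3519.961492, age = -331.122208)

new_building <- c(1, area = 26000, offices = 3, entrances = 2, age = 20)
sum(b * new_building)    # predicted real_estate_value in $US
```

With the fitted model at hand, predict(building_model1, newdata = ...) does the same job and adds confidence intervals.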
Global analyses vs. partial analyses
One of the traps of multivariate regressions is to pick a wrong model even when the main indicators suggest goodness of fit and no violations of the assumptions. We must stress the importance of graphical analysis, as pointed out by Anscombe (Anscombe’s Quartet).
With multivariate regressions, we also want to visualize partial relationships between the regressors and the regressand, and explore partial residuals.
Marcellus and the wine business - part 1
Let’s step back and review our first case: champagne_model. It is a simpler multivariate regression to begin with.
A partial residual plot attempts to model the residuals of one regressor against the regressand. A component residual plot adds a line indicating where the line of best fit lies. A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.
While the hectares lines are similar, the workforce lines are a bit off.
What happens to a regression coefficient when the other variables are held constant (ceteris paribus)? For each variable, we compute an OLS fit; with intercept zero and slope \(\beta_i\). The influences of individual data values on the estimation of a coefficient are easy to see.
hectares has a positive influence, while workforce has a negative one. As a reminder, there is a higher correlation between sales_m and hectares than between sales_m and workforce.
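These ceteris paribus views are added-variable (partial regression) plots; a guarded sketch with the car package on a stand-in model:

```r
# Added-variable plots, if car is available
if (requireNamespace("car", quietly = TRUE)) {
  m <- lm(mpg ~ wt + hp, data = mtcars)
  car::avPlots(m)    # one panel per regressor; each slope equals its beta
}
```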
## Test stat Pr(>|t|)
## workforce -2.124 0.078
## hectares -2.306 0.061
## Tukey test -2.250 0.024
This is the contribution of each regressor to the model variability (fitted values vs. residuals). hectares introduces less variability in the model than workforce.
In addition, we get a Tukey t-test. The test compares all possible pairs of means and poses a ‘not different’ hypothesis. We can reject the equality hypothesis at the 5% level; the difference between the means is statistically significant.
This is the explanatory power of each regressor to the model. High-leverage points have the potential to cause larger changes in the regressand estimation when they are deleted (measured on the y-axis). hectares vs. sales_m demonstrates much more sensitivity than workforce vs. sales_m.
Marcellus and real estate
In the light of the previous explanations, run the same plots with building_model1.
## Test stat Pr(>|t|)
## area -0.302 0.775
## offices -0.907 0.406
## entrances -0.736 0.495
## age -0.646 0.547
## Tukey test -0.244 0.807
Keep in mind that we have a rather small sample (n < 30) to draw conclusions.
Goal
Import the data: capitalization (in millions of Euros) of French corporations and their respective Free Cash Flow (FCF in millions of Euros) and Net Debt/Equity ratio (D.E or financial leverage).
Print the head of the dataset.
## companies capitalization FCF D.E
## 1 Air France 1580 34 0.67
## 2 Alcatel 3024 30 1.03
## 3 Alstom 946 -986 0.69
Find the impact of heavy indebtedness on a corporation’s capacity (reflected by the market capitalization).
Model
Run the regression; Free Cash Flow and the D/E ratio are the regressors.
Routine tests
Test for multicollinearity.
## FCF D.E
## FALSE FALSE
We quickly rule out one problem.
Test the coefficients. With summary.
##
## Call:
## lm(formula = capitalization ~ FCF + D.E, data = companies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2473.2 -1556.9 223.2 1069.0 3128.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7225.432 2612.950 2.765 0.02788 *
## FCF 5.810 1.112 5.226 0.00122 **
## D.E -5332.840 2675.653 -1.993 0.08650 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2201 on 7 degrees of freedom
## Multiple R-squared: 0.8043, Adjusted R-squared: 0.7483
## F-statistic: 14.38 on 2 and 7 DF, p-value: 0.003318
With anova.
## Analysis of Variance Table
##
## Response: capitalization
## Df Sum Sq Mean Sq F value Pr(>F)
## FCF 1 120070030 120070030 24.7893 0.001603 **
## D.E 1 19240970 19240970 3.9724 0.086495 .
## Residuals 7 33905304 4843615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All p-values are below the 5% level except for the D.E coefficient. \(R^2\) and adjusted-\(R^2\) are strong.
Normality, linearity, etc.
Plot the residuals.
Plot the Cook’s distance.
Q-Q plots are good all-round visualizations. Plot more Q-Q charts…
…about the model studentized residuals…
…about the residuals…
…about the predicted (‘fitted’) values.
Plot a distribution of the residuals (using a histogram and a density plot).
Extend the summary statistics.
##
## Call:
## lm(formula = capitalization ~ FCF + D.E, data = companies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2473.2 -1556.9 223.2 1069.0 3128.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7225.432 2612.950 2.765 0.02788 *
## FCF 5.810 1.112 5.226 0.00122 **
## D.E -5332.840 2675.653 -1.993 0.08650 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2201 on 7 degrees of freedom
## Multiple R-squared: 0.8043, Adjusted R-squared: 0.7483
## F-statistic: 14.38 on 2 and 7 DF, p-value: 0.003318
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = companies_model)
##
## Value p-value Decision
## Global Stat 8.84147 0.065187 Assumptions acceptable.
## Skewness 0.04412 0.833632 Assumptions acceptable.
## Kurtosis 0.67686 0.410669 Assumptions acceptable.
## Link Function 8.08864 0.004454 Assumptions NOT satisfied!
## Heteroscedasticity 0.03184 0.858371 Assumptions acceptable.
We cannot conclude nonnormality or nonlinearity given the mixed results (both visual and statistical).
A larger sample size would certainly improve the results.
We can rule out heteroscedasticity (changing variance) though.
Conclusion
Plot the results.
Market Capitalization is a proxy variable for measuring value creation. According to the model, some corporations, taking into account their indebtedness, create more value with their Free Cash Flow (investments).
We would need additional analyses on potential good investments; where actuals (blue) are above the predicted value (red).
A word on multicollinearity
Multicollinearity is not a major flaw. Of course, it inflates variance, making the coefficient estimates untrustworthy.
We must keep in mind that sometimes a cure can be worse than the disease. Reducing collinearity may not justify the time and resources required to do so.
Generally, for the purposes of forecasting, we can disregard collinearity. If, however, the validity of coefficients is crucial (as with price elasticity for example), we must try to reduce collinearity.
Pointer
Forecasting means predicting or extrapolating. We can extrapolate forward or backward.
We can also rebuild a truncated dataset by interpolating the results. It is often the case when we re-engineer missing values, replace outliers or recompute measurement errors.
Forecasting
What would be the market capitalization of a corporation with 183 million in FCF and a D/E ratio of 1? Add it to the plot.
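With the fitted model we would call predict(companies_model, newdata = ...). A hand-rolled sketch from the printed coefficients:

```r
# Forecast from the printed coefficients: FCF = 183, D/E = 1
b       <- c("(Intercept)" = 7225.432, FCF = 5.810, D.E = -5332.840)
newdata <- c(1, FCF = 183, D.E = 1)

sum(b * newdata)    # predicted capitalization, in millions of Euros
```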
Global analyses vs. partial analyses (again)
Plot partial relationships between regressors and the regressand and partial residuals:
## Test stat Pr(>|t|)
## FCF 2.270 0.064
## D.E 0.902 0.402
## Tukey test 5.039 0.000
In light of these plots, we confirm that D.E brings less variability to the model, but also less value (less impact).
We introduce a new plot for partial residuals: the CERES plots for nonlinearity. Run the commands.
Ceres plots are a generalization of component+residual (partial residual) plots. They are less prone to leakage of nonlinearity among the predictors.
As a reminder, a c+r plot adds a line indicating where the line of best fit lies. A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.
C+r plots can fail if there are nonlinear relationships among the regressors. The CERES plot shows a curve in the relationship of the regressand to a regressor, in spite of any nonlinear relationships among the regressors. If these relationships are linear, a c+r plot is obtained; if they are nonlinear, the more general CERES plot is obtained.
What is a qualitative regressor
We can use continuous and categorical/discrete regressors and regressands. In computer science, a continuous number would be a ‘float’ and a discrete number would be an ‘integer’. Sometimes, discrete numbers can have a limited scope. We call these variables ‘qualitative variables’.
So far, most of the models involved continuous regressands (sales, price, market capitalization) and a mix of continuous (costs, area, FCF) and discrete (workforce, number of offices and entrances, age) regressors.
We now involve a qualitative regressor: a binomial regressor, more commonly known as a dummy variable.
Goal
We have a dataset of large French corporations: their market capitalization (in millions of Euros), their operational_profits (in millions of Euros) and whether the government, the state, is the majority stockholder (the dummy variable equals 1).
Print the head of the dataset.
## companies capitalization operational_profit state
## 1 Air Liquide 163122205 1661800 0
## 2 Danone 201318713 1874000 0
## 3 Carrefour 263273826 3274000 0
Find out if state-owned corporations are better capitalized.
Regression
Run a simple linear regression and print the results.
##
## Call:
## lm(formula = capitalization ~ operational_profit, data = companies2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75256232 -27827402 16846591 20573067 60005399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.291e+08 2.254e+07 10.163 5.28e-05 ***
## operational_profit 5.598e+00 2.321e+00 2.413 0.0524 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46340000 on 6 degrees of freedom
## Multiple R-squared: 0.4924, Adjusted R-squared: 0.4078
## F-statistic: 5.82 on 1 and 6 DF, p-value: 0.0524
Add a dummy (state) to the model, run another regression, and print the results.
##
## Call:
## lm(formula = capitalization ~ operational_profit + state, data = companies2)
##
## Residuals:
## 1 2 3 4 5 6 7
## -54379604 -17372929 36732201 -9602252 38693097 25280889 -3672765
## 8
## -15678637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.082e+08 2.104e+07 9.893 0.00018 ***
## operational_profit 5.607e+00 1.887e+00 2.972 0.03109 *
## state1 5.555e+07 2.751e+07 2.019 0.09946 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37670000 on 5 degrees of freedom
## Multiple R-squared: 0.7204, Adjusted R-squared: 0.6086
## F-statistic: 6.442 on 2 and 5 DF, p-value: 0.04134
Adding state improves the model. The \(R^2\) and adjusted-\(R^2\) are better.
Coefficients are good: profits are positively related to capitalization. The state coefficient indicates an upswing in market capitalization when the State is the majority shareholder. We have significant p-values at the 5% level, except for the state variable, which is only significant at the 10% level (weak).
Plot the results.
Conclusion
Show the data distributions.
State-owned corporations are more profitable and better capitalized. These corporations are public - have floating stocks on a public exchange - but the majority shareholder is the government.
Nevertheless, buying stocks and building a portfolio depends on many other criteria: dividend payouts, levels of risk, sectors of activity, correlations with other industries, etc.
We have barely scratched the surface of dummy variables.
Here is a list of other possibilities with dummy variables (D), with examples:
champagne2). We could have tagged the market regimes (0 = pre, 1 = post). And we have four alternatives:

- ‘Model 5.1’ has only normal regressors. We would obtain two coincident regressions from each dataset (similar intercepts and slopes).
- ‘Model 5.2’ adds one D. We would have two parallel regressions with distinct intercepts.
- ‘Model 5.3’ has an interaction effect X*D. We would have concurrent regressions with two distinct slopes.
- ‘Model 5.4’ has both X*D and D. We would have dissimilar intercepts, slopes, and regressions.

6. Two D is an additive effect.
7. X*D is an interaction or multiplicative effect.
8. D*D is also an interaction effect where the second D amplifies the first D. There are interaction or multiplicative effect tests.
9. In log-lin models, you interpret the coefficient with \(e^{\beta_D} - 1\).
11. D can appear in other functional forms (polynomial for example).
12. D can replace the intercept. For example, say we have a categorical variable with 3 levels (m = 3). We would need (m - 1) D. We compare the results with the reference category and keep a common intercept. With m D and no intercept, we run the regression through the origin; \(R^2\) becomes meaningless.
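The four alternatives translate directly into lm() formulas; a sketch on simulated data (all parameter values are made up):

```r
# One dummy D, four specifications
set.seed(7)
n <- 40
x <- runif(n, 0, 10)
D <- rep(c(0, 1), each = n / 2)
y <- 3 + 2 * x + 5 * D + 1.5 * x * D + rnorm(n)

m51 <- lm(y ~ x)          # 'Model 5.1': coincident (no dummy)
m52 <- lm(y ~ x + D)      # 'Model 5.2': parallel, distinct intercepts
m53 <- lm(y ~ x + x:D)    # 'Model 5.3': concurrent, distinct slopes
m54 <- lm(y ~ x * D)      # 'Model 5.4': expands to x + D + x:D

coef(m54)    # estimates close to 3, 2, 5 and 1.5
```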
Finally, with D, be sure to test for heteroscedasticity.
Dynamic models
The next cases address patterns within a time period. Nevertheless, time series and forecasting is a vast field with its own techniques such as finding local trends, capturing seasonality, smoothing time periods with moving averages, etc. Please refer to packages such as xts, zoo, and ts objects’ functions. We want to focus on two things: distributed-lag models and the autoregressive process.
Goal
First, test the persistence of publicity expenses (in thousands of Euros) over sales (in thousands of Euros). In other words, measure the payoff of publicity expenditures made by an SME. Second, measure the persistence effect on sales. If there is one, how long does it last?
Dynamic models and lags
Dynamic models work best with time-series objects along with the zoo and xts packages.
Since the time period we are dealing with is rather short, we choose to work with a short cross-sectional dataset and generate leading/lagging variables (expenses) with a custom function. year is simply a label for subsetting the data.
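A minimal sketch of such a custom function (hypothetical helper; it pads the head with NA, whereas the case’s data carries real values from earlier months):

```r
# Shift a vector by k periods, padding the head with NA
lag_vector <- function(x, k = 1) {
  c(rep(NA, k), head(x, -k))
}

expenses <- c(5, 7, 7, 9, 10, 12)   # six months of the publicity data
lag_vector(expenses, 1)             # NA  5  7  7  9 10
lag_vector(expenses, 2)             # NA NA  5  7  7  9
```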
Pointer
With ts objects (from the stats package), use the diff and lag functions.
Visual exploration
Plot the data.
Plot the sales. This time, use a ‘local polynomial regression fitting’ (loess) to draw the line.
Now, use a ‘locally weighted scatter plot smoothing’ (lowess) to draw the line.
Pointer
Do not confuse loess with lowess.
With ts objects, we can apply autoregressive and moving-average smoothing with the filter function, and add leads/lags with the lag and diff functions.
Plot the relationships between the sales and expenses (a proxy for the regression).
Regular regression
Run the regression and print the results (Model 1).
##
## Call:
## lm(formula = sales ~ expenses, data = publicity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9882 -9.4222 -0.6722 6.9764 17.2193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.356 12.126 0.854 0.426
## expenses 19.802 1.139 17.379 2.33e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.73 on 6 degrees of freedom
## Multiple R-squared: 0.9805, Adjusted R-squared: 0.9773
## F-statistic: 302 on 1 and 6 DF, p-value: 2.327e-06
The \(R^2\) and p-values all look good. Collinearity is not a problem in the current situation.
Plot the results with 95% confidence intervals.
Distributed-lag models
Run a 1-period lag (Model 2) and a 2-period lag (Model 3) regressions and print the results.
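A sketch of these regressions, rebuilding the publicity data frame from the printed tables (column names as in the output below; months 7 and 8 come from the later table):

```r
# Data transcribed from the printed dataset
publicity <- data.frame(
  month         = 1:8,
  sales         = c(106, 137, 160, 180, 214, 250, 285, 335),
  expenses      = c(5, 7, 7, 9, 10, 12, 13, 17),
  expenses_lag1 = c(4, 5, 7, 7, 9, 10, 12, 13),
  expenses_lag2 = c(3, 4, 5, 7, 7, 9, 10, 12)
)

model2 <- lm(sales ~ expenses + expenses_lag1, data = publicity)                  # 1-period lag
model3 <- lm(sales ~ expenses + expenses_lag1 + expenses_lag2, data = publicity)  # 2-period lag
summary(model2)
summary(model3)
```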
## month sales expenses expenses_lag1 expenses_lag2
## 1 1 106 5 4 3
## 2 2 137 7 5 4
## 3 3 160 7 7 5
## 4 4 180 9 7 7
## 5 5 214 10 9 7
## 6 6 250 12 10 9
##
## Call:
## lm(formula = sales ~ expenses + expenses_lag1, data = publicity)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 1.7279 0.3423 -1.3392 -1.3841 -2.0880 1.5264 1.8225 -0.6080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7970 1.9507 2.459 0.0573 .
## expenses 10.0224 0.6627 15.124 2.29e-05 ***
## expenses_lag1 12.3407 0.8048 15.334 2.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.854 on 5 degrees of freedom
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9994
## F-statistic: 6161 on 2 and 5 DF, p-value: 3.313e-09
##
## Call:
## lm(formula = sales ~ expenses + expenses_lag1 + expenses_lag2,
## data = publicity)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 1.7396 0.4185 -1.3168 -1.5374 -2.0269 1.4739 1.8063 -0.5573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9505 2.7054 1.830 0.141255
## expenses 9.9322 1.1985 8.287 0.001157 **
## expenses_lag1 12.2786 1.1085 11.077 0.000378 ***
## expenses_lag2 0.1781 1.8606 0.096 0.928363
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.071 on 4 degrees of freedom
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9993
## F-statistic: 3294 on 3 and 4 DF, p-value: 3.07e-07
After analyzing the \(R^2\) and p-values, the best model appears to be Model 2.
Use the visreg package to plot partial and complete regression results with confidence intervals.
Change the parameters and choose what to plot.
Conclusion
First, there appears to be a persistence effect. The expenses_lag1 coefficient is positive and higher than the expenses coefficient. ‘Positive’ means we have a persistence effect and ‘higher’ means the effect is increasing over time. Publicity is about making noise. Expenditures from the previous period have more effect on the current period than expenditures from the current period.
Second, the persistence effect only lasts one period. The expenses_lag2 coefficient is insignificant (both statistically and numerically). There is a sudden fall in the effect of publicity beyond 1 period.
Print the results for Model 2.
## (Intercept)
## 4.796979
## expenses
## 10.02243
## expenses_lag1
## 12.34074
## month sales expenses expenses_lag1 expenses_lag2 beta_0 beta_1
## 1 1 106 5 4 3 4.796979 10.02243
## 2 2 137 7 5 4 4.796979 10.02243
## 3 3 160 7 7 5 4.796979 10.02243
## 4 4 180 9 7 7 4.796979 10.02243
## 5 5 214 10 9 7 4.796979 10.02243
## 6 6 250 12 10 9 4.796979 10.02243
## 7 7 285 13 12 10 4.796979 10.02243
## 8 8 335 17 13 12 4.796979 10.02243
## lag_0 beta_2 lag_1 sum
## 1 50.11217 12.34074 49.36295 104.2721
## 2 70.15704 12.34074 61.70369 136.6577
## 3 70.15704 12.34074 86.38516 161.3392
## 4 90.20191 12.34074 86.38516 181.3841
## 5 100.22434 12.34074 111.06664 216.0880
## 6 120.26921 12.34074 123.40737 248.4736
## 7 130.29165 12.34074 148.08885 283.1775
## 8 170.38139 12.34074 160.42959 335.6080
Plot these results.
Decompose the predicted sales knowing \(t\) is the current period and \(t-1\) is the previous period:
sales for period \(t\). There are many reasons behind persistence.
More proofs
We can counter-validate the choice of Model 2 with the Akaike and Bayesian (Schwarz’s) criteria. We encountered the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) before when selecting the best real estate valuation model.
\[AIC = 2k - 2ln(\hat{L})\]
\[BIC = ln(n)k - 2ln(\hat{L})\]
Where \(\hat{L}\) is the maximized value of the likelihood function, \(k\) is the number of regressors, and \(n\) is the number of observations.
We also add Mallow’s \(C_p\) criterion.
\[C_p = \frac{SSE_p}{S^2} - n + 2p\]
Where \(SSE_p\) is the sum of squared errors with \(p\) regressors, \(S^2\) is the residual mean square, and \(n\) is the number of observations.
The smaller the criterion, the better. Compute the AIC and BIC:
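A sketch of the comparison, assuming model1, model2, and model3 are the three fitted lm objects (rebuilt here from the printed data):

```r
publicity <- data.frame(
  sales         = c(106, 137, 160, 180, 214, 250, 285, 335),
  expenses      = c(5, 7, 7, 9, 10, 12, 13, 17),
  expenses_lag1 = c(4, 5, 7, 7, 9, 10, 12, 13),
  expenses_lag2 = c(3, 4, 5, 7, 7, 9, 10, 12)
)
model1 <- lm(sales ~ expenses, data = publicity)
model2 <- lm(sales ~ expenses + expenses_lag1, data = publicity)
model3 <- lm(sales ~ expenses + expenses_lag1 + expenses_lag2, data = publicity)

# The smaller the criterion, the better
sapply(list(model1, model2, model3), AIC)
sapply(list(model1, model2, model3), BIC)
```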
## [1] 65.79748
## [1] 66.0358
## [1] 36.82331
## [1] 37.14107
## [1] 38.80501
## [1] 39.20222
Compute \(C_p\).
## lik infl vari n cp
## -176.609914 6.431493 5.452037 8.000000 358.082813
## lik infl vari n cp
## -7.444625 7.922790 6.807849 8.000000 22.734830
## lik infl vari n cp
## -49.100703 7.999999 7.999999 8.000000 106.201405
Model 2 has the lowest AIC, BIC and \(C_p\), confirming our previous choice.
In all, use all the statistical measures, i.e., \(R^2\), AIC, BIC and Mallow’s \(C_p\), for model selection.
AIC is the best all-round statistic. Many R functions rely on AIC: evaluating generalized linear models (there will be a logit/probit case below), Weighted Least Squares (there will be a WLS case below), and stepAIC, which we saw earlier. extractAIC offers more parameters for computing a generalized AIC.
There are other approaches for estimating distributed-lag models:
dlnm package.
Goal 1
Compute the average return over time. A stock or index may experience short-term periods of positive returns and downturns. Over the long run, what is the real return?
Print the head of the dataset.
## year index
## 1 1987 1000.00
## 2 1988 1573.94
## 3 1989 2001.08
## 4 1990 1517.93
## 5 1991 1756.66
## 6 1992 1857.78
Compute returns
Time-series objects and specialized packages such as zoo and xts are ideal for modeling leading/lagging variables and computing differences and percentage differences such as returns. Nonetheless, we can work with cross-sectional data and custom functions.
Compute the return for each period, print the head of the dataset, and plot the results (a line plot and a chart with high-density vertical lines).
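A sketch of the return computation with base R (index values transcribed from the printed head; the book's full series runs beyond these six observations):

```r
# Percentage return: (P_t - P_{t-1}) / P_{t-1}, with NA for the first period
index <- c(1000.00, 1573.94, 2001.08, 1517.93, 1756.66, 1857.78)
index_ret <- c(NA, diff(index) / head(index, -1))
round(index_ret, 8)
```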
## year index index_ret
## 1 1987 1000.00 NA
## 2 1988 1573.94 0.57394000
## 3 1989 2001.08 0.27138264
## 4 1990 1517.93 -0.24144462
## 5 1991 1756.66 0.15727339
## 6 1992 1857.78 0.05756379
Run a regression on the returns (only the intercept) and plot the results.
##
## Call:
## lm(formula = index_ret[2:n] ~ 1, data = index)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51681 -0.09537 0.06726 0.14704 0.48392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09002 0.05695 1.581 0.13
##
## Residual standard error: 0.261 on 20 degrees of freedom
Conclusion
The red line (the intercept) represents the positive average returns over the period of time. We are in the money.
Goal 2
Find out if the returns have an ARCH(1) process.
Autoregressive Conditional Heteroscedasticity (ARCH)
Homoscedasticity means the variance of the residuals is constant over time. Heteroscedasticity is a common feature of cross-sectional data.
On the other hand, when the residuals are correlated over time, we have serial correlation or autocorrelation, a common feature of time series.
When we have some kind of autocorrelation in the variance of errors, we have an Autoregressive Conditional Heteroscedasticity process. The ARCH(1) process:
\[e_t^2 = \beta_0 + \beta_1 e_{t-1}^2\]
Extract the residuals, transform them, and test for an ARCH(1) process.
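A sketch of the test; since the full index series is not reproduced here, simulated returns stand in for the data (mean and standard deviation taken from the summary above), and model1 is the intercept-only regression:

```r
set.seed(42)
index_ret <- rnorm(21, mean = 0.09, sd = 0.26)  # simulated stand-in returns

model1 <- lm(index_ret ~ 1)
res_sq <- residuals(model1)^2      # squared residuals
l <- length(res_sq)
res_sq_lag1 <- c(NA, res_sq[-l])   # 1-period lag

arch1 <- lm(res_sq[2:l] ~ res_sq_lag1[2:l])  # ARCH(1): e_t^2 on e_{t-1}^2
summary(arch1)
```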
##
## Call:
## lm(formula = index_model1_residuals_sq[2:l] ~ index_model1_residuals_sq_lag1[2:l])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06326 -0.04587 -0.03136 0.01286 0.20369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06425 0.02176 2.952 0.00853
## index_model1_residuals_sq_lag1[2:l] -0.14326 0.24925 -0.575 0.57257
##
## (Intercept) **
## index_model1_residuals_sq_lag1[2:l]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07582 on 18 degrees of freedom
## Multiple R-squared: 0.01802, Adjusted R-squared: -0.03653
## F-statistic: 0.3303 on 1 and 18 DF, p-value: 0.5726
Only the intercept is significant at the 5% level. A significant p-value for the t-values and the F-statistic would mean autocorrelation.
Conclusion
There is no significant evidence of an ARCH(1) process.
If it had been the case, we would have to replace the OLS model with a Generalized Least Squares (GLS) model.
Volatility
Patterns in volatility are often observed in times of crises. By examining daily, even hourly returns, we can find these patterns.
A variant of the ARCH(1) process is the GARCH(1,1) process:
\[\sigma(\epsilon)_t^2 = \beta_0 + \beta_1 \epsilon_{t-1}^2 + \beta_2 \sigma(\epsilon)_{t-1}^2\]
If the coefficients were significant, the sum \(\beta_1 + \beta_2\) would measure persistence in the variance, i.e., the durability of volatility periods.
Pointer
- ARCH(p): the variance of residuals depends on the squared residuals from previous periods (1 period or more).
- GARCH(p,q): the variance of residuals depends on both the squared residuals and the lagged variances from (the) previous period(s).
- Autoregressive Integrated Moving Average models: AR(p), MA(q), ARMA(p,q), ARIMA(p,d,q).
- Vector Autoregression (VAR) models: geometric lag, polynomial lag, etc.
- Please refer to ts objects and their functions (stats package).
There are other autoregressive models:
Dynamic models can also be estimated with instrumental variables.
Nevertheless, dynamic and time-series econometrics is a vast field of its own we cannot cover here.
Goal
When financial statements are not public, we have to resort to creative ways to gather market intelligence about companies. We would like to rank the profitability of a dozen companies in one industry. Some companies are private and do not release their financial statements. The remaining companies are public. We have access to their financial statements. We also have a report about market shares in the industry. We might be able to establish a connection between market shares and profitability of the public companies. We could then apply the model to estimate the profitability of the private companies. First, find out the link between market shares and profitability in this industry.
The dataset is about six public manufacturers: market_share vs. profitability (profits as a percentage of sales).
Plot the data.
The linear form
Run the regression, print and plot the results.
##
## Call:
## lm(formula = profitability ~ market_share, data = profitability)
##
## Residuals:
## 1 2 3 4 5 6
## 11.101 3.940 -7.517 -13.864 -2.508 8.848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.7377 7.3199 1.194 0.2985
## market_share 0.4322 0.1765 2.449 0.0705 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.86 on 4 degrees of freedom
## Multiple R-squared: 0.5999, Adjusted R-squared: 0.4999
## F-statistic: 5.998 on 1 and 4 DF, p-value: 0.07051
We get poor \(R^2\) and p-values.
Functional forms
There are situations where the connection between the regressand and the regressors is nonlinear:
For example, we saw above that publicity expenses can have a time-based persistent effect. The effect can also shift with the amount spent. Increasing expenses increases sales to a point where the marginal gains from additional expenses start declining (decreasing returns).
\[sales = \beta_0 publicity^{\beta_1}\]
Where \(\beta_0 > 1\) and \(0 < \beta_1 < 1\).
The polynomial form
In some industries, smaller manufacturers targeting niche markets or larger manufacturers dominating the market are more profitable than mid-size manufacturers. We have a U-shaped phenomenon.
Polynomial forms can be of many orders. Let’s try the quadratic form (order 2 or second degree):
\[Y = \beta_0 + \beta_1 X + \beta_2 X^2\]
Create the new variable and print the head of the dataset.
## market_share profitability market_share_sq
## 1 5 22 25
## 2 10 17 100
## 3 18 9 324
## 4 35 10 1225
## 5 55 30 3025
## 6 75 50 5625
Run the regression and print the results.
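A sketch of the quadratic fit, rebuilding the data frame from the head printed above:

```r
profitability <- data.frame(
  market_share  = c(5, 10, 18, 35, 55, 75),
  profitability = c(22, 17, 9, 10, 30, 50)
)
profitability$market_share_sq <- profitability$market_share^2  # squared term

model_quad <- lm(profitability ~ market_share + market_share_sq,
                 data = profitability)
summary(model_quad)
```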
##
## Call:
## lm(formula = profitability ~ market_share + market_share_sq,
## data = profitability)
##
## Residuals:
## 1 2 3 4 5 6
## 2.1178 0.6424 -3.5816 -2.1720 5.0516 -2.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.302680 4.235373 5.738 0.0105 *
## market_share -0.973672 0.294095 -3.311 0.0454 *
## market_share_sq 0.017917 0.003647 4.912 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.171 on 3 degrees of freedom
## Multiple R-squared: 0.9558, Adjusted R-squared: 0.9263
## F-statistic: 32.41 on 2 and 3 DF, p-value: 0.009303
\(R^2\) are improved and all p-values are significant at the 5% level.
Plot the results.
Conclusion
The quadratic form captures the connection between market share and profitability in the industry under study.
However, by introducing a new variable in the model, we also introduced a violation of the OLS assumptions: collinearity between market_share and market_share_sq.
## market_share market_share_sq
## TRUE TRUE
We can tolerate multicollinearity for our purpose: interpolation, i.e., predicting within the range of the data. We do not interpret the individual coefficients, as we would when measuring price elasticity, so we have leeway with them.
Using the model and the market shares of 12 companies (from the report), we can interpolate the private manufacturers’ profitability. It is not a perfect picture, but a fair estimate.
Any polynomial function has (local) minima and maxima, also called extrema.
Our model is a second-degree function and we can compute the extremum of profitability with \(-\beta_1 / (2 \beta_2)\).
Compute the extremum (for profitability and market_share).
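A sketch of the computation, refitting the quadratic model from the printed data and applying the extremum formula:

```r
profitability <- data.frame(
  market_share  = c(5, 10, 18, 35, 55, 75),
  profitability = c(22, 17, 9, 10, 30, 50)
)
profitability$market_share_sq <- profitability$market_share^2
model_quad <- lm(profitability ~ market_share + market_share_sq,
                 data = profitability)

b <- coef(model_quad)
x_star <- -b[["market_share"]] / (2 * b[["market_share_sq"]])  # market share at the extremum
y_star <- b[["(Intercept)"]] + b[["market_share"]] * x_star +
  b[["market_share_sq"]] * x_star^2                            # profitability at the extremum
c(x_star, y_star)
```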
## [1] 27.17234
## [1] 11.07421
Plot the results with the extremum.
Goal
Many big players in the industry are profitable. In the short term, it might not last: these companies have reached full capacity and are on the brink of a restructuring. Their stocks used to be good investments; not anymore.
Short-term marginal costs and average costs tend to be U-shaped. According to the law of diminishing marginal returns, as production increases, unit costs first diminish. However, beyond a threshold, unit costs start increasing.
We want to show how much smarter it would be to invest in smaller companies gaining ground. They might not be as profitable as the big players, but as production increases, these growing companies will improve their marginal costs and become more profitable. Their stocks should rise.
The dataset is about nine companies: marginal_cost (in thousands of Euros) vs. produced_quantity (in millions).
Plot the data.
The linear form
Run the regression.
##
## Call:
## lm(formula = marginal_cost ~ produced_quantity, data = production)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.439 -2.789 -1.872 3.428 11.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.5222 5.7259 3.060 0.0183 *
## produced_quantity 0.7567 0.1753 4.316 0.0035 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.79 on 7 degrees of freedom
## Multiple R-squared: 0.7269, Adjusted R-squared: 0.6878
## F-statistic: 18.63 on 1 and 7 DF, p-value: 0.003498
\(R^2\) are good and p-values are significant at the 5% level.
Plot the residuals.
The top-left plot shows U-shaped residuals. The top-right chart does not suggest linearity.
The right fitted regression line can be approximated with the lowess function.
The polynomial form
From all the plots above, we need a quadratic form (order 2 or second degree). Create the new variable and print the head of the dataset.
## marginal_cost produced_quantity produced_quantity_sq
## 1 37 10 100
## 2 27 15 225
## 3 31 20 400
## 4 27 25 625
## 5 36 30 900
## 6 42 35 1225
Run the regression and print the results.
##
## Call:
## lm(formula = marginal_cost ~ produced_quantity + produced_quantity_sq,
## data = production)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2567 -1.8909 0.7831 1.8745 3.3758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.347619 6.169139 7.189 0.000366 ***
## produced_quantity -1.438139 0.458603 -3.136 0.020172 *
## produced_quantity_sq 0.036580 0.007511 4.870 0.002793 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.295 on 6 degrees of freedom
## Multiple R-squared: 0.9449, Adjusted R-squared: 0.9265
## F-statistic: 51.4 on 2 and 6 DF, p-value: 0.0001677
\(R^2\) and p-values are all improved.
Plot the residuals.
The top plots suggest a quadratic form.
Plot the results.
Conclusion
The model confirms the theory. In practice, when a company reaches its optimal production point, overcapacity wears out the means of production: machines break down, maintenance costs skyrocket, employees get tired, turnover increases, managers get frantic, costly mistakes happen, etc. The bigger producers are entering this phase.
In the long run, a company can improve its productivity with technology and best practices such as the Lean approach. However, it is also costly to restructure a company. We should wait until the end of the next restructuring cycle to consider investing in the big players and rather focus on a couple of underdogs with bright prospects.
We could go further with a cubic model (power 3) to model total costs: as production increases, total costs of production increase. Following the decrease in marginal costs (returns to scale), total costs increase at a declining rate. As marginal costs start increasing again, total costs follow the same trend. Finally, both marginal costs and total costs end up increasing at an increasing rate.
Plot the data and approximate the regression form.
We have a polynomial form of order 3.
Create the new variables and print the head of the dataset.
## output total_cost output_sq output_cu
## 1 1 193 1 1
## 2 2 226 4 8
## 3 3 240 9 27
## 4 4 244 16 64
## 5 5 257 25 125
## 6 6 260 36 216
Run the regression and print the results.
##
## Call:
## lm(formula = total_cost ~ output + output_sq + output_cu, data = production2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4263 -0.7416 0.3744 1.4635 4.4350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.76667 6.37532 22.24 5.41e-07 ***
## output 63.47766 4.77861 13.28 1.13e-05 ***
## output_sq -12.96154 0.98566 -13.15 1.19e-05 ***
## output_cu 0.93959 0.05911 15.90 3.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.285 on 6 degrees of freedom
## Multiple R-squared: 0.9983, Adjusted R-squared: 0.9975
## F-statistic: 1202 on 3 and 6 DF, p-value: 1.001e-08
As expected, all \(\beta > 0\), except for \(\beta_2\) (output_sq).
Verify that \(\beta_2^2 < 3 \times \beta_1\beta_3\).
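A sketch of the check, plugging in the coefficients from the printed summary above:

```r
# Coefficients from the cubic regression output
b1 <- 63.47766   # output
b2 <- -12.96154  # output_sq
b3 <- 0.93959    # output_cu

b2^2                # left-hand side
3 * b1 * b3         # right-hand side
b2^2 < 3 * b1 * b3  # the monotonicity condition holds
```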
## [1] 168.0015
## [1] 178.9286
## [1] TRUE
Plot the results.
Goal
We have a start-up with a rising stock price (the company recently went public).
Show how growth is accelerating over time.
The growth form - a log-lin model
We need a form to capture the stock price take-off:
\[Y = \beta_0 e^{\beta_1 T}\]
Where \(e\) is the base of the natural logarithm and \(T\) is time. In this kind of model, \(\beta_1\) represents the growth rate (positive or negative).
We linearize the form to obtain a log-lin model:
\[ln Y = ln \beta_0 + \beta_1 T + \epsilon\]
This log-lin model captures the fast growth of start-ups.
Pointer
We choose the natural logarithm (ln), log in R. The inverse of log is exp. We could also pick another base with log(x, base), where base = 2 stands for binary and base = 10 for decimal. It depends on the situation, the analysis, the know-how.
Linearize the data (taking the log) and plot it.
Run the regression and print the results.
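A sketch of the log-lin fit; the stock prices below are made up for illustration, as the book's startup series is not reproduced here:

```r
startup <- data.frame(
  quarter = 1:5,
  stock   = c(18, 25, 37, 51, 72)  # hypothetical quarterly prices
)

model_growth <- lm(log(stock) ~ quarter, data = startup)
coef(model_growth)[["quarter"]]           # instantaneous growth rate
exp(coef(model_growth)[["quarter"]]) - 1  # compound quarterly growth rate
```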
##
## Call:
## lm(formula = log(stock) ~ quarter, data = startup)
##
## Residuals:
## 1 2 3 4 5
## -0.08261 0.07336 0.06170 -0.01305 -0.03940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.57280 0.08064 31.90 6.77e-05 ***
## quarter 0.34302 0.02431 14.11 0.000771 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07689 on 3 degrees of freedom
## Multiple R-squared: 0.9851, Adjusted R-squared: 0.9802
## F-statistic: 199 on 1 and 3 DF, p-value: 0.0007715
We obtain high \(R^2\) and significant p-values at the 1% level.
Plot the results; the linearized model.
Plot the original model.
Conclusion
The \(\beta_1\) coefficient is the instantaneous quarterly growth rate (34.3%); the compound rate is \(e^{0.343} - 1 \approx 40.9\%\). No collinearity is possible. The rate is likely to hold for short-term forecasts. Following the IPO, we are in the early growth phase.
In the longer term, the model will likely fail to capture future growth. Growth will slow down and numerous events will impact the start-up: financial crises, legal changes, competition, technological change, etc. The start-up might also miss its targeted sales, causing investors to dump the stock.
The MASS package has several functions for dealing with nonlinear models. Note that loglm and loglm1 (MASS) and loglin (stats) fit log-linear models to contingency tables, a different technique from the log-lin regression used here.
There are two additional nonlinear models:
The lin-log model:
\[Y = \beta_0 + \beta_1lnX + \epsilon\]
The reciprocal model:
\[Y = \beta_0 + \beta_1 \bigg( \frac{1}{X} \bigg) + \epsilon\]
Lifecycle
Market penetration, adoption of new technologies, and subscriber acquisition are all life-cycle situations.
A tech start-up is pushing an innovative product to the market.
We have data about the household_adoption_rate over time. We choose to work with cross-sectional data although it is a case for time series. Plot the data.
Goal
First, build a life-cycle model and show how slow beginnings can turn into an explosive growth, then slow down to reach a plateau.
Second, reverse-engineer the adoption rate with the regression coefficients and estimate the adoption rate. We have a truncated dataset and we miss the early phase as it was impossible to collect the numbers. Extrapolate the adoption rate (prior to time = 1, before the adoption reached 20%).
Third, forecast the adoption rate (extrapolate beyond time = 11) under the assumption that we can reach 100% of households. The important question we seek to answer is if sales have passed the inflection point. After all, a product whose sales have passed the inflection point or is moving away from this point has more limited prospects.
The logistic form
The logistic form resembles an S-curve:
\[D = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}\]
\(D\) is the ‘aDoption rate’ (a percentage). This exponential model has two growth phases: accelerating growth before the inflection point and decelerating growth after it.
We guess 4 as the inflection point since it looks like the turning point. Plot the three stages.
We linearize the form:
\[ Y = ln \bigg( \frac{D}{1 - D} \bigg) = \beta_0 + \beta_1 X\]
Create the new variables and print the head of the dataset.
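A sketch of the logit transform (rates from the head printed below; the variable name lifecyle_rate keeps the book's original spelling):

```r
household_adoption_rate <- c(0.20, 0.30, 0.46)
# Logit: Y = ln(D / (1 - D))
lifecyle_rate <- log(household_adoption_rate / (1 - household_adoption_rate))
round(lifecyle_rate, 7)
```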
## time household_adoption_rate lifecyle_rate
## 1 1 0.20 -1.3862944
## 2 2 0.30 -0.8472979
## 3 3 0.46 -0.1603427
Run the regression and print the results.
##
## Call:
## lm(formula = lifecyle_rate ~ time, data = equipment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55409 -0.22524 0.07309 0.26993 0.35881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.09759 0.21355 -5.140 0.000612 ***
## time 0.26539 0.03149 8.429 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3302 on 9 degrees of freedom
## Multiple R-squared: 0.8876, Adjusted R-squared: 0.8751
## F-statistic: 71.04 on 1 and 9 DF, p-value: 1.455e-05
\(R^2\) are strong and we have significant p-values at the 1% level. The growth rate is \(\beta_1\) = 26.5%.
Conclusions
Compute the inflection point (\(- \beta_0 / \beta_1\)).
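A sketch, plugging in the coefficients from the printed summary:

```r
b0 <- -1.09759  # intercept
b1 <- 0.26539   # time coefficient
-b0 / b1        # inflection point
```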
## [1] 4.135782
With the coefficient, reverse-engineer the adoption rate. Print the head of the dataset.
## time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1 1 0.20 -1.3862944 -0.832202
## 2 2 0.30 -0.8472979 -0.566813
## 3 3 0.46 -0.1603427 -0.301424
Extend the time period and create a longer data frame. Print the head of the dataset.
## time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1 -4 0.00 0.0000000 0.000000
## 2 -3 0.00 0.0000000 0.000000
## 3 -2 0.00 0.0000000 0.000000
## 4 -1 0.00 0.0000000 0.000000
## 5 0 0.00 0.0000000 0.000000
## 6 1 0.20 -1.3862944 -0.832202
## 7 2 0.30 -0.8472979 -0.566813
## 8 3 0.46 -0.1603427 -0.301424
Fill the new data frame. Print the head and tail of the dataset.
## time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1 -4 0.00 0.0000000 -2.159147
## 2 -3 0.00 0.0000000 -1.893758
## 3 -2 0.00 0.0000000 -1.628369
## 4 -1 0.00 0.0000000 -1.362980
## 5 0 0.00 0.0000000 -1.097591
## 6 1 0.20 -1.3862944 -0.832202
## 7 2 0.30 -0.8472979 -0.566813
## 8 3 0.46 -0.1603427 -0.301424
## time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 19 14 0 0 2.617855
## 20 15 0 0 2.883244
## 21 16 0 0 3.148633
Extrapolate and forecast the adoption rate. Print the head of the dataset.
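A sketch of the extrapolation: invert the logit with the coefficients from the regression above to recover an adoption rate over the extended time period.

```r
b0 <- -1.09759; b1 <- 0.26539  # coefficients from the regression above
time <- -4:16                  # extended time period

re_lifecycle_rate <- b0 + b1 * time                     # fitted logit values
re_adoption_rate  <- 1 / (1 + exp(-re_lifecycle_rate))  # back to a rate
round(head(re_adoption_rate, 3), 7)
```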
## time household_adoption_rate lifecyle_rate re_lifecycle_rate
## 1 -4 NA 0 -2.159147
## 2 -3 NA 0 -1.893758
## 3 -2 NA 0 -1.628369
## re_adoption_rate
## 1 0.1034796
## 2 0.1308166
## 3 0.1640539
Plot the results.
This is not a perfect fit.
How could we improve the logistic forecasting/extrapolating model?
For one thing, we would benefit from a longer sample or a denser sample (by month). We can also fit in a seasonality index with monthly data for a better estimation.
We can also rework the ultimate penetration assumption (we rarely reach 100% or 1). The new form would include the upper limit L:
\[D = \frac{L}{1 + e^{-(\beta_0 + \beta_1 X)}}\]
Since it is a truncated dataset, we could use a Tobit model to estimate the early phase before running the logistic model.
Other lifecycle models
Beyond the logistic form, we could consider the Gompertz model and the Bass diffusion model.
The rule of thumb: if future sales benefit from previous sales, use the logistic curve to generate forecasts; if future sales do not benefit from previous sales, use the Gompertz curve.
The main goal of life-cycle models is to measure whether we are before or after the inflection point. We do not manage a growing business as we manage a mature business. Bright prospects lie before the inflection point, where a start-up's growth comes early. If the strategy is to buy a growth stock, once the start-up has crossed the inflection point, it is time to sell.
Unlike the above models, the Bass model and its variants predict a product's future sales before the product comes to market. The Bass model also explains how knowledge of new products spreads throughout the marketplace. These analyses are crucial. Do we invest or not? Will we have a chance to get a return on investment (ROI)?
The reverse logistic model
Some cases are reverse S-curves (when \(\beta_1 < 0\)). We start with a rapid decline, then the fall slows down until it reaches a floor.
In very specific cases, this situation bears the name ‘half-life model’. We cannot know the ending date; we can only estimate the time when the amount will be half of what it was at the beginning. The long descent is bound by an asymptote and will never reach zero. Half-life models explain radioactive decay or medication diffusion in the body, when months after an injection, we can still find traces in a blood sample.
In business, it applies to subscriber churn. Without new acquisitions, the subscriber base declines according to a churn rate. The concept of churn is the opposite of compound interest: larger subscriber bases undergo larger churn, and as the subscriber base is eroded, the churn effect diminishes. A half-life model can forecast the time when a 1,000-subscriber base reaches 500. Beyond 500, the decline continues asymptotically.
Pointer
A close relative to reverse lifecycle models is the reciprocal form (f(x) = 1/x). In theory, the form has an asymptotic value (a floor). The most important application belongs to macroeconomics: the Phillips curve, the relationship between the percentage rate of change of money wages (Y), a proxy for inflation, and the unemployment rate in percent (X).
Goal
We want to apply the concept of elasticity to model the profits-quantity elasticity of a fast-growing tech start-up. Production has exploded over the last few quarters.
Plot the data.
Build a model to measure the elasticity coefficient, the connection between quantity and profits. Apply the model to forecast short-term profits as a function of production.
The power form - a log-log model
\[{profit} = \beta_0 {qty}^{\beta_1}\]
Taking the first derivative gives the elasticity function:
\[\frac{\delta {profit}}{\delta {qty}} = \beta_0 \times \beta_1 {qty}^{\beta_1 - 1}\]
The power form has similarities with the growth form. With the growth form, the linear model is a log-lin model. With the power form, the linear model is a log-log model:
\[ln (profit) = ln \beta_0 + \beta_1 ln (qty)\]
With slope \(\beta_1(\frac{profit}{qty})\) and elasticity \(\beta_1\).
Linearize the model. Print the head of the dataset, run the regression and print the results.
## quarter quantity profits
## 1 1 2 2
## 2 2 4 9
## 3 3 6 19
##
## Call:
## lm(formula = log(profits) ~ log(quantity), data = sme)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.058169 -0.009584 -0.003166 0.000217 0.073215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.62138 0.03501 -17.75 4.45e-07 ***
## log(quantity) 1.98038 0.01455 136.08 3.05e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03658 on 7 degrees of freedom
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
## F-statistic: 1.852e+04 on 1 and 7 DF, p-value: 3.053e-13
\(R^2\) are strong and we have significant p-values at the 1% level.
Plot the results; the linearized model.
The original model.
Conclusion
Extract the coefficients. Adjust the intercept with exp(intercept).
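A sketch, using the coefficients from the printed summary:

```r
intercept <- -0.62138  # from the log-log regression
slope     <- 1.98038

exp(intercept)  # beta_0 in the power form
slope           # elasticity beta_1
```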
## [1] 0.5372043
## [1] 1.980378
The elasticity model with coefficients:
\[profit = 0.54~qty^{1.98}\]
Taking the first derivative gives:
\[\frac{\delta profit}{\delta qty} = 0.54 \times 1.98 \times qty^{0.98}\]
Or simply:
\[\frac{\delta profit}{\delta qty} = 1.0692 \times qty^{0.98}\]
What would a 1% increase in \(qty\) do? Since the elasticity is \(\beta_1 = 1.98\):
\[\frac{\Delta profit / profit}{\Delta qty / qty} \approx 1.98\]
A 1% increase in \(qty\) would increase \(profit\) by about 1.98%, reflecting the short-term increasing returns to scale of the tech start-up.
For larger changes, use the power form directly: a 10% increase in \(qty\) turns into a \(1.10^{1.98} - 1 \approx 20.8\%\) increase in \(profit\), slightly more than the linear approximation of 19.8%.
As time goes by, we would need to re-estimate the coefficients with new datasets. The new coefficients would likely show different elasticities and, maybe, decreasing returns.
Elasticity
The classical view on elasticity is that consumption (quantity demanded) is a function of price and (consumer) revenues:
\[C = \beta_0 P^{\beta_1} R^{\beta_2}\]
As price increases or revenue decreases, we would expect a decrease in quantity consumed (assuming it is a normal good).
Now, the magnitude of the change depends on elasticity (\(\varepsilon\)). With high elasticity, price hikes will cause demand to plummet, while with inelastic demand, price hikes will not change demand by much.
Elasticity can be constant or not.
Elasticity depends on many factors. One of them is the nature of the good. Contrary to a normal good (\(0 < \varepsilon < 1\)), Veblen goods are superior or luxury goods. Given constant revenues, as prices rise, some consumers seek these goods (rare paintings, luxury cars, jewels). These goods are highly elastic (\(\varepsilon > 1\)). On the other hand, Giffen goods are inferior goods (cheap food, used cars, second-hand furniture have \(\varepsilon < 0\)). This time, as prices increase, consumers buy more!
Note the typical range of \(\varepsilon\): around 0, rarely beyond 2 or -2. Because the estimates sit in such a narrow band, eliminating multicollinearity is crucial when estimating \(\varepsilon\); we want the best estimation possible.
The linear form is:
\[ln C = ln \beta_0 + \beta_1 ln P + \beta_2 ln R\]
Consumption means quantity demanded. First, the coefficient on \(ln~P\), \(b_1 = \beta_1\), represents the consumption-price elasticity of a normal good:
\[b_1 = \frac{\Delta C / C}{\Delta P / P}\]
For a 1% increase in price, there is a \(\beta_1\)% change in consumption (a decrease, since \(\beta_1 < 0\) for a normal good).
Second, the coefficient on \(ln~R\), \(b_2 = \beta_2\), represents the consumption-revenue elasticity:
\[b_2 = \frac{\Delta C / C}{\Delta R / R}\]
For a 1% increase in revenue, there is a \(\beta_2\)% increase in consumption, assuming \(\beta_2 > 0\).
There is a lot more to consider (combine changes in \(P\) and \(R\) and the substitution/complement effects in face of choices).
Understanding elasticity is important when deciding on a pricing policy. We have to take into account the type of product, consumers, competition, timing, market segments, and many other factors.
Elasticity and pricing
In marketing, pricing strategy is one of the 4P (product, place (location and distribution), promotion (and publicity) and pricing).
Elasticity analysis is one of the most important tools in the marketing toolbox. Demand and elasticity models can become highly complex when they include factors such as cost structure, competition and substitution, legal constraints, and business objectives. Elasticity analysis is crucial for coining a pricing policy (penetration, skimming, competition, product line, bundle, premium, optional, cost-plus, cost-based, etc).
The Cobb-Douglas function
This other elasticity function is mainly applied in microeconomics to explain the underpinnings of industrial change:
\[Y = \beta_0 K^{\beta_1} L^{\beta_2}\]
Where \(Y\) stands for ‘production’, \(K\) for ‘capital’, and \(L\) for ‘labour’. The two inputs are a simplification of the means of production. The Cobb-Douglas function is a ‘multiplicative’ form capturing returns to scale (whether increasing or decreasing).
In essence, the function measures the effect of adding/removing units of capital (investing/divesting in material resources) and units of labour (hiring/laying off human resources).
Both \(\beta\) are elasticity coefficients: they explain how hiring more workers increases production (with decreasing returns while capital remains constant) and how additional capital investment improves production (with decreasing returns while the workforce remains constant).
Goal
We have a dataset showing the production of 10 companies in an industry with their respective workforce (\(L\)) and assets (\(K\)). Find out the elasticity coefficients (\(\beta\)). Which one is elastic, which one is inelastic? What would be the impact on production of a change in \(L\)? In \(K\)?
Print the head of the dataset.
## firms production workforce assets
## 1 1 2827381 224900 570900
## 2 2 2463879 287000 251270
## 3 3 2597816 436474 143600
The power form - a log-log model
This time, we use the decimal logarithm (\(10^{\beta},~log_{10}\)), not the natural logarithm (\(e^{\beta},~log~or~ln\)).
Linearize the model. Print the head of the dataset.
## firms production workforce assets production_log workforce_log
## 1 1 2827381 224900 570900 6.451384 5.351989
## 2 2 2463879 287000 251270 6.391619 5.457882
## 3 3 2597816 436474 143600 6.414608 5.639958
## assets_log
## 1 5.756560
## 2 5.400141
## 3 5.157154
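The linearization above can be sketched as follows, assuming the data frame is named `sme2` (as shown in the later `lm` call):

```r
# Linearize the Cobb-Douglas model with decimal (base-10) logarithms.
# Assumes a data frame 'sme2' with columns production, workforce, assets.
sme2$production_log <- log10(sme2$production)
sme2$workforce_log  <- log10(sme2$workforce)
sme2$assets_log     <- log10(sme2$assets)
head(sme2, 3)
```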
Run the regression and print the results.
##
## Call:
## lm(formula = sme2$production_log ~ sme2$workforce_log + sme2$assets_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.006687 -0.001019 0.001057 0.002407 0.002740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.364964 0.027473 49.68 3.50e-10 ***
## sme2$workforce_log 0.590148 0.004635 127.31 4.86e-13 ***
## sme2$assets_log 0.334455 0.004364 76.63 1.70e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003513 on 7 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.562e+04 on 2 and 7 DF, p-value: 1.683e-13
\(R^2\) are strong and we have significant p-values at the 1% level.
Conclusion
Add the coefficients to the Cobb-Douglas function (workforce is \(L\), assets are \(K\)):
\[Y = 10^{1.365} \times L^{0.590} \times K^{0.334}\]
Or simply:
\[Y = 23.17 \times L^{0.590} \times K^{0.334}\]
Choose a form, pick a model
A time series may appear to be growing geometrically. This first segment looks like an exponential form. As time passes, we add data and the increase seems to stabilize. With additional data, the last segment looks more like a logarithmic form. The whole dataset starts with an increasing growth, then plateaus, and finishes with a decreasing growth.
What model do we pick? A polynomial model? It would clearly fit the data and minimize the residuals, but it might also overfit the situation. Why? In the longer term, the growth might pick up again. The declining growth could also be a temporary downturn.
Should we not pick a linear model then? Knowing we would obtain large residuals as the regression line cuts through all this volatility. Or could we not incorporate dummy variables to take into account cycles?
Selecting the right model is sometimes tricky and time-consuming. Apart from the scientific method and the statistical toolbox, we ought to keep in mind reminders, pointers, rules of thumb. Of course, they come with experience. The following is a minimalist checklist of things to remember in econometrics.
Automate repetitive tasks with loops (for(var in vec) {expr} and while(cond){expr}).
Use stepwise selection (MASS package) and algorithmic techniques (leaps and relaimpo packages) to compare models.
The following is a step-by-step approach:
A qualitative regressand
We worked with qualitative variables as regressors (dummy variables). We will now work with a qualitative variable as a regressand. The regressand can be multinomial (bond ratings AAA, AA, A), ordinal (customer 1st to 3rd choice) or binomial (1 or 0, whether or not an event takes place).
The goal of the regressors, whether continuous, discrete, and/or qualitative, is to predict the ‘state’ of the regressand: to classify whether \(Y\) takes the value AAA, AA, or A; 1st, 2nd, or 3rd; or whether an event occurs.
Goal
We have applicants requesting a loan. Some are approved, some are rejected. We want to measure the probability of being approved. We want to be able to use the model to predict an applicant’s chance of being approved and how changing conditions can affect this probability.
The logit model
We need a probability model that remains in the 0-1 interval as regressors influence the probability of the event occurrence (the regressand). However, the relationship between the probability and the regressors is nonlinear, that is, one which approaches zero at slower and slower rates as regressors lose influence (get smaller) and approaches one at slower and slower rates as regressors gain influence (get larger). We have an S-shaped curve just like with the life-cycle model: the logistic form.
We start with the Odds Ratio (OR): the probability of occurrence of an event over non-occurrence:
\[OR = \frac{P}{1-P}\]
Where \(P\) is \(Pr(Y=1)\).
An OR = 2 means 2 to 1 chance of occurrence. An OR = 0.25 means 4 to 1 chance of non-occurrence.
We take the logarithm of OR, we get the logistic OR or logit:
\[ln(OR) = ln \bigg( \frac{P}{1 - P} \bigg) = logit\]
The linear form:
\[logit = L = ln \bigg( \frac{P}{1 - P} \bigg) = \beta_0 + \beta_1 X_1\]
Then,
\[\bigg( \frac{P}{1 - P} \bigg) = e^{\beta_0 + \beta_1 X_1}\]
And,
\[P = \frac{e^{\beta_0 + \beta_1 X_1}}{1+e^{\beta_0 + \beta_1 X_1}}\]
Or simply:
\[P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1)}}\]
The general form is:
\[P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_n X_n)}}\]
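The general form is just the logistic transform of a linear predictor; a minimal, self-contained sketch (`logistic` is a helper name chosen here for illustration):

```r
# The logistic transform keeps the predicted probability inside (0, 1).
logistic <- function(x, b0, b1) 1 / (1 + exp(-(b0 + b1 * x)))
# plogis() is the built-in equivalent: plogis(b0 + b1 * x)
logistic(0, 0, 1)   # 0.5 at the midpoint of the S-curve
```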
To estimate the linear model, we need regressors (\(X_n)\) and the logit values \(L\). First, print the loan dataset.
## amount n n_i
## 1 80000 10 6
## 2 90000 20 9
## 3 100000 30 10
## 4 110000 60 21
## 5 120000 80 20
The dataset is a table of five tranches (by amount requested in $US), the number of requests by tranche (n) and number of approvals by tranche (n_i).
At the individual level, the logit would be \(L = ln(\frac{1}{0})\) if a loan request is approved and \(L = ln(\frac{0}{1})\) if the request is rejected; both are undefined, hence the grouping by tranches.
We compute \(P = \frac{n_i}{n}\) for each tranche, the relative frequency estimating the true \(P\).
## amount
## 80000 90000 1e+05 110000 120000
## 0.60 0.45 0.33 0.35 0.25
The loan dataset is a summary of 200 transactions (approvals and rejections) grouped together by tranches.
Print the number of rows, the head and the tail of the transactions.
## [1] 200
## approved amount
## 1 1 80000
## 2 1 80000
## 3 1 80000
## 4 1 80000
## 5 1 80000
## approved amount
## 196 0 120000
## 197 0 120000
## 198 0 120000
## 199 0 120000
## 200 0 120000
For the model (and for simplicity), amount is the only regressor. We will solely rely on one \(X\) to explain the probability of loan approval!
We have a fairly large sample. Assuming each observation is distributed independently as a binomial variable (1 or 0), the residuals should follow a normal distribution with mean \(\mu = 0\) and variance \(\sigma^2 = \frac{1}{n \times P(1-P)}\)
The short way
Using the Generalized Linear Model function, glm, with the logit link function, run a regression on the 200 transactions and print the results.
##
## Call:
## glm(formula = approved ~ amount, family = binomial(link = "logit"),
## data = loan2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2483 -0.8765 -0.7692 1.3738 1.6506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.628e+00 1.371e+00 1.916 0.0553 .
## amount -3.079e-05 1.264e-05 -2.436 0.0148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 253.67 on 199 degrees of freedom
## Residual deviance: 247.68 on 198 degrees of freedom
## AIC: 251.68
##
## Number of Fisher Scoring iterations: 4
Levels of significance: 10% for the intercept, 5% for the regressor. We note z-values for the coefficients and not t-values. In a logit, we compare the squared z-values (\((\beta/s.e.)^2\)) against the \(\chi^2\) distribution. The Wald test replaces the t-test in glm and maximum likelihood estimation.
We also compare logit models with the AIC. However, we need more statistics; print additional metrics.
## llh llhNull G2 McFadden r2ML
## -123.84016553 -126.83572714 5.99112323 0.02361765 0.02951139
## r2CU
## 0.04106181
## McFadden Adj.McFadden Cox.Snell Nagelkerke
## 2.361765e-02 -3.499318e-05 2.951139e-02 4.106181e-02
## McKelvey.Zavoina Effron Count Adj.Count
## 3.850954e-02 3.075997e-02 6.800000e-01 3.030303e-02
## AIC Corrected.AIC
## 2.516803e+02 2.517412e+02
The pseudo-\(R^2\) (McFadden’s \(R^2\)) is very weak. The Count pseudo-\(R^2\), by contrast, reads as \(\frac{no.~of~true~predictions}{no.~of~predictions}\).
It appears a single regressor does a poor job at explaining the probability of loan approval. Of course, we could have added extra variables such as household revenue, credit scoring, cash-down, etc. We will come back to the model’s explanatory power below. For now, we continue the analysis.
Compute the predicted probability (for a loan request of being approved) for each tranche.
## amount n n_i P predicted_P
## 1 80000 10 6 0.6000000 0.5411886
## 2 90000 20 9 0.4500000 0.4643705
## 3 100000 30 10 0.3333333 0.3892066
## 4 110000 60 21 0.3500000 0.3189639
## 5 120000 80 20 0.2500000 0.2560834
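The whole ‘short way’ can be reproduced from the tranche summary alone; a self-contained sketch (the object names `loan`, `loan2`, and `loan_logit_model` follow the ones used in this chapter):

```r
# Rebuild the 200 individual transactions from the tranche summary,
# fit the logit, and compute the predicted probability per tranche.
loan <- data.frame(amount = c(80000, 90000, 100000, 110000, 120000),
                   n      = c(10, 20, 30, 60, 80),
                   n_i    = c(6, 9, 10, 21, 20))
loan2 <- data.frame(
  amount   = rep(loan$amount, loan$n),
  approved = unlist(mapply(function(k, n) c(rep(1, k), rep(0, n - k)),
                           loan$n_i, loan$n)))
loan_logit_model <- glm(approved ~ amount,
                        family = binomial(link = "logit"), data = loan2)
loan$predicted_P <- predict(loan_logit_model, newdata = loan,
                            type = "response")
loan
```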
Conclusion to the short way
The probability of loan approval decreases by:
## [1] -3.078885e-05
for every dollar increase in the requested amount (in log-odds), i.e., about -3.079% in the odds for every US$1000 increase. As the requested loan amount increases, the odds of being approved decline.
Print the complete results for P.
## amount
## 80000 90000 1e+05 110000 120000
## 0.54 0.46 0.39 0.32 0.26
Now, prove it with:
\[P = \frac{1}{1 + e^{-(2.628 + (-3.079e-05) \times X)}}\]
Compute the predicted probability of being approved if we request a \(X\) = US$100,000 loan.
## [1] 0.3892066
Fit the result into:
\[OR = \frac{P}{1-P}\] \[OR = \frac{0.3892066}{1-0.3892066}\] The OR is equal to:
## [1] 0.6372148
Or odds of about 2 to 3 in favour of occurrence (roughly 3 to 2 against).
Generate more predictions (say we want to know the probabilities for amounts going from US$1 to US$200,000).
## round(PredictedProb, 2)
## 200000 190000 180000 170000 160000 150000 140000 130000 120000 110000
##   0.03   0.04   0.05   0.07   0.09   0.12   0.16   0.20   0.26   0.32
## 100000  90000  80000  70000  60000  50000  40000  30000  20000  10000
##   0.39   0.46   0.54   0.62   0.69   0.75   0.80   0.85   0.88   0.91
##      0
##   0.93
Close to a requested amount of 0, the model makes no sense; it is simply a matter of sensitivity analysis. Most loan requests are between US$80K and US$200K.
Plot the results with confidence intervals.
We have a reverse logistic curve since the coefficient is -3.079% for every US$1000.
Here is an additional measure of how well the model fits. This can be particularly useful when comparing competing models. The output produced by summary(loan_logit_model) includes indices of fit (shown below the coefficients): the null and residual deviances and the AIC.
Goodness of fit and robustness
The OLS \(R^2\) is defined as a goodness-of-fit measure.
The model estimates from a logistic regression are maximum likelihood estimates. Logit models are not calculated to minimize variance, so the OLS approach to goodness-of-fit does not apply.
We saw above that alternative \(R^2\), pseudo-\(R^2\), have been developed. These pseudo-\(R^2\) are on a similar scale, ranging from 0 to 1, with higher values indicating better model fit. However, a pseudo-\(R^2\) cannot be interpreted as one would interpret an OLS \(R^2\), and different pseudo-\(R^2\) can arrive at very different values within the same model, making broad comparisons invalid.
Pseudo-\(R^2\) only have meaning when compared to other pseudo-\(R^2\) of the same type, on the same dataset, predicting the same outcome. The only change would be to add or remove regressors (variations of the same model).
For logit, we need an alternative F-test. We need to ask whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The gap between the two models is called the deviance (\(D\)).
\[D = -2(ln V_{estimated~model}-lnV_{saturated~model})\]
Estimating the maximum likelihood consists in maximizing a likelihood function. If the saturated model perfectly fits the data, the maximum is 1; and \(ln(1) = 0\).
The equation becomes:
\[D = -2(ln V_{estimated~model})\]
A high \(D\) indicates a poor model. However, we face the same problem as with pseudo-\(R^2\): \(D\) is only meaningful when comparing models of the same type, for example one model having removed one or more regressors from the original model.
What if we measure the change in deviance between the initial model, A, and a second model, B:
\[\Delta_{A-B} = D_{(A)} - D_{(B)} = decline~of~deviance\]
The test statistic becomes the difference between the residual deviance of the ‘estimated model with predictors’ and the deviance of the ‘null model with just an intercept’. The test statistic is distributed \(\chi^2\) with degrees of freedom equal to the difference in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model).
Compute the decline of deviance, the degrees of freedom and the p-value.
## [1] 5.991123
## [1] 1
## [1] 0.01437804
The test is significant at the 5% level. In other words, we reject the ‘no decline of deviance’ hypothesis; we have a meaningful model when we add the regressor.
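The computation can be sketched directly from the fitted glm object; assuming `loan_logit_model` is the ‘short way’ fit:

```r
# Decline of deviance: null deviance minus residual deviance,
# tested against chi-square with df = number of added regressors.
# Assumes 'loan_logit_model' is the glm fit from the 'short way'.
G2 <- loan_logit_model$null.deviance - loan_logit_model$deviance  # ~ 5.99
df <- loan_logit_model$df.null - loan_logit_model$df.residual     # 1
pchisq(G2, df = df, lower.tail = FALSE)                           # ~ 0.0144
```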
If there were more than one regressor, it would be smart to test for multicollinearity as well.
The long way
This time, we open the engine to see what is going on. The fact there is only one regressor will keep things simple.
From our loan table:
## amount n n_i P
## 1 80000 10 6 0.6000000
## 2 90000 20 9 0.4500000
## 3 100000 30 10 0.3333333
## 4 110000 60 21 0.3500000
## 5 120000 80 20 0.2500000
Compute additional variables to estimate the logit model.
The steps (print the head of the dataset as we follow along):
## amount n n_i P L
## 1 80000 10 6 0.6 0.4054651
## amount n n_i P L w sqrt_w w_L w_amount
## 1 80000 10 6 0.6 0.4054651 2.4 1.549193 0.6281438 123935.5
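The intermediate columns above can be built tranche by tranche; a self-contained sketch reproducing the head shown:

```r
# Build the weighted-logit variables for the 'long way'.
loan <- data.frame(amount = c(80000, 90000, 100000, 110000, 120000),
                   n      = c(10, 20, 30, 60, 80),
                   n_i    = c(6, 9, 10, 21, 20))
loan$P        <- loan$n_i / loan$n               # relative frequency
loan$L        <- log(loan$P / (1 - loan$P))      # empirical logit
loan$w        <- loan$n * loan$P * (1 - loan$P)  # weight n * P * (1 - P)
loan$sqrt_w   <- sqrt(loan$w)
loan$w_L      <- loan$L * loan$sqrt_w            # weighted logit
loan$w_amount <- loan$amount * loan$sqrt_w       # weighted regressor
head(loan, 1)
```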
\[weighted~L = \beta_0 \sqrt{w} + \beta_1 \times\ weighted~amount\]
Because \(\beta_0\) becomes a multiplicative factor, we remove the intercept from the linear model: the weighted regression runs through the origin (‘0’ in the formula means ‘through the origin’), with sqrt_w standing in for the intercept. Run the regression and compare the results with the previous model.
Run the regression and print the results.
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## sqrt_w 2.631122741 7.290674e-01 3.608888 0.03653227
## w_amount -0.000030795 6.712809e-06 -4.587498 0.01945483
Compare the results to the ‘short-way’ results.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.628237e+00 1.371381e+00 1.916489 0.05530284
## amount -3.078885e-05 1.263854e-05 -2.436108 0.01484624
They are in the same ballpark.
Extract the full summary.
##
## Call:
## lm(formula = w_L ~ 0 + sqrt_w + w_amount, data = loan)
##
## Residuals:
## 1 2 3 4 5
## 0.3686 -0.1340 -0.6320 0.5072 -0.1330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sqrt_w 2.631e+00 7.291e-01 3.609 0.0365 *
## w_amount -3.079e-05 6.713e-06 -4.587 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5254 on 3 degrees of freedom
## Multiple R-squared: 0.9695, Adjusted R-squared: 0.9491
## F-statistic: 47.64 on 2 and 3 DF, p-value: 0.005333
Remember, this time we ran an OLS. \(R^2\) are strong but meaningless, since we ran the regression through the origin! The F-statistic’s p-value is 0.005333; the ‘short-way’ \(\chi^2\) test had a p-value of 0.01437804. Both are significant at least at the 5% level, as are the coefficients’ p-values.
Conclusion to the long way
One of the objectives is to prove the ‘short way’ and confirm the first model.
Given:
sqrt_w = \(\sqrt{n\hat{P}(1-\hat{P})}\)
w_amount = \(amount \times \sqrt{n\hat{P}(1-\hat{P})}\)
As opposed to the actual (calculated) probability P:
## amount n n_i P
## 1 80000 10 6 0.6000000
## 2 90000 20 9 0.4500000
## 3 100000 30 10 0.3333333
## 4 110000 60 21 0.3500000
## 5 120000 80 20 0.2500000
We want to find the predicted probabilities.
The beginning equation:
\[ln \bigg( \frac{P}{1 - P} \bigg) = L = \beta_0 + \beta_1 X_1\]
Becomes:
\[L \times \sqrt{n\hat{P}(1 - \hat{P})} = predicted~L = \beta_0\sqrt{n\hat{P}(1 - \hat{P})} + \beta_1 \times amount \times \sqrt{n\hat{P} (1 - \hat{P})}\]
Insert the estimated coefficients
## [1] 2.631123
## [1] -3.0795e-05
The equation becomes:
\[predicted~L = 2.631123 \times \sqrt{n\hat{P}(1 - \hat{P})} - (3.0795e-05) \times amount \times \sqrt{n\hat{P} (1 - \hat{P})}\]
And,
\[\frac{predicted~L}{\sqrt{n\hat{P}(1 - \hat{P})}} = unweighted~predicted~L\]
After rework, the equation becomes:
\[\hat{P} = \frac{1}{1+e^{-(unweighted~predicted~L)}}\]
Where \(\hat{P} \approx\) PredictedProb from previous model (the ‘short way’).
No matter loan_logit_model or loan_logit_model2, extract the coefficient and run the numbers.
If we request a $100,000 loan, the predicted probability (\(\hat{P}\)) of being approved is:
## [1] 0.3897467
0.3892066 for the ‘short way’.
From a US$100,000 request, every US$1000 increase changes the predicted probability by \((-3.0795e-05) \times \hat{P}(1 - \hat{P})\) vs. -3.078885e-05 for the ‘short way’ or -3.079% for every US$1000 increase.
The ‘long way’ confirms the ‘short way’. Let’s carry on with some sensitivity analyses.
Complete the computation: \((-3.0795e-05) \times \hat{P}_{\$100,000}(1 - \hat{P}_{\$100,000}) \times \$1000\):
## [1] -0.007324412
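The marginal-effect arithmetic can be sketched in a self-contained way with the ‘long way’ coefficients:

```r
# Marginal effect of a US$1000 increase around a given request amount,
# using the 'long way' estimates: b0 = 2.631123, b1 = -3.0795e-05.
b0 <- 2.631123
b1 <- -3.0795e-05
P_hat <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))
P_hat(100000)                                      # ~ 0.3897
b1 * P_hat(100000) * (1 - P_hat(100000)) * 1000    # ~ -0.0073
```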
From a US$120,000 request, every US$1000 increase changes the probability by \((-3.0795e-05) \times \hat{P}(1 - \hat{P})\) or simply \((-3.0795e-05) \times \hat{P}_{\$120,000}(1 - \hat{P}_{\$120,000}) \times \$1000\):
## [1] -0.005872739
There is a slowdown in the decline as the requested amount increases (it makes sense, since the default risk increases and lenders are more risk-averse with larger amounts).
From a US$200,000 request, every US$1000 increase will change the probability by:
## [1] -0.0008534973
Confirming the slowdown in the decline as we get closer to a ‘floor’.
The probit model
We could perform additional analyses with a probit model. Probit (or normit) models rely on a normal distribution.
Run the regression and compare the coefficient with the first logit model (the ‘short way’).
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.611554e+00 8.442947e-01 1.908758 0.05629333
## amount -1.890658e-05 7.747235e-06 -2.440429 0.01466982
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.628237e+00 1.371381e+00 1.916489 0.05530284
## amount -3.078885e-05 1.263854e-05 -2.436108 0.01484624
The coefficients are in the same ballpark.
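The probit variant is the same glm call with a different link; a sketch, assuming `loan2` holds the 200 individual transactions and naming the fit `loan_probit_model` for illustration:

```r
# Probit: binomial family with the probit (normal CDF) link.
# Assumes 'loan2' with columns 'approved' (0/1) and 'amount'.
loan_probit_model <- glm(approved ~ amount,
                         family = binomial(link = "probit"), data = loan2)
coef(summary(loan_probit_model))
```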
Compute the probabilities.
## round(fit, 2)
## 190000 180000 170000 160000 150000 140000 130000 120000 110000 100000
##   0.02   0.04   0.05   0.08   0.11   0.15   0.20   0.26   0.32   0.39
##  90000  80000  70000  60000  50000  40000  30000  20000  10000      0
##   0.46   0.54   0.61   0.68   0.75   0.80   0.85   0.89   0.92   0.95
Plot the results in comparison with the logit model.
Probit models have a propensity for being steeper with narrower tails.
Pointer
Logit/probit models usually involve regressions with one qualitative variable and more than one regressor.
For estimating loan approval, credit default, subscriber churn, machine breakdown, and other kinds of events, we can also count on the complementary log-log model.
Logit/probit models are close relatives of survival models, also known as duration models or time-to-event analysis. A logit/probit explains the probability of occurrence: what factors have an impact on occurrence and by how much. Duration models assume the event occurred and aim at finding what factors influence duration and at measuring their impact.
As for duration, we should consider a technique from astrophysics, the Copernican principle, to predict duration (estimate the length of time that a product or event remains in existence).
GLM
There are other possibilities than the logit and probit link function with GLM. Depending on the situation, we might need another link function, another distribution, even a nonparametric distribution.
glm(y ~ x, data = dataset, family = familytype(link = 'linkfunction'))
There are other models working with qualitative regressands:
Type ?glm for other modeling options and ?family for other family types and link functions. Check out glm.nb and polr from the MASS package.
R has a lot more in bank in terms of packages and functions:
car and MASS packages (robust regression).
nlme and MASS packages (mixed-effects models).
stats package (nonlinear least squares).
scatter.smooth from the stats package (loess smoothing).
splines package (regression splines).
MASS package (ridge regression).
sem package (Structural Equation Modeling) and Simultaneous Equation estimation from the systemfit package.
pls package (Partial Least Squares).
quantreg package (quantile regression).
gam package (Generalized Additive Models).
survival package (survival/duration models).
tree and rpart packages (decision trees).
betareg package (beta regression).
Goal
Nonconstant residual variance is a feature of cross-sectional datasets. The residual spread can increase, decrease, increase then decrease, etc. The problem distorts results and produces inaccurate t- and F-values, which may wrongly appear insignificant.
We want to measure the connection between households’ avg_m_revenues (average monthly revenues, a categorical variable with three levels: 2000, 4000, 6000) and monthly savings per household.
Print the head of the dataset.
## avg_m_revenues savings
## 1 2000 160
## 2 2000 180
## 3 2000 190
Find out if the preliminary results are misleading. Detect heteroscedasticity and implement a solution.
The preliminary model
Run the regression, print and plot the results.
##
## Call:
## lm(formula = savings ~ avg_m_revenues, data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.982 -53.817 9.018 36.183 154.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.023e+02 3.886e+01 2.633 0.017438 *
## avg_m_revenues 4.716e-02 9.469e-03 4.981 0.000114 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.02 on 17 degrees of freedom
## Multiple R-squared: 0.5934, Adjusted R-squared: 0.5695
## F-statistic: 24.81 on 1 and 17 DF, p-value: 0.000114
Sometimes, when we plot large datasets, categorical data are difficult to read: data points come out bunched up. The jitter function adds some noise to graphics without changing the actual data.
Use the jitter function.
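A minimal sketch, assuming a data frame `savings` with numeric columns `avg_m_revenues` and `savings`:

```r
# Jitter the categorical revenue levels so overlapping points separate.
# Assumes a data frame 'savings' with numeric avg_m_revenues and savings.
plot(jitter(savings$avg_m_revenues, factor = 2), savings$savings,
     xlab = "avg_m_revenues (jittered)", ylab = "savings")
```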
Visual inspection
Plot a Q-Q chart.
Or…
These are the patterns of heteroscedasticity.
Print the residuals plots.
In the four-panel display, the left-hand plots show a growing variance. The top-right Q-Q plot is not linear at the bottom. The bottom-right plot (Cook’s distance) shows how high revenues have a large influence on the regression.
##
## Suggested power transformation: -0.9364038
The final plot displays how high revenues influence the coefficients: biased intercept and slope.
Plot means with error bars.
We can clearly see an increasing variance (nonconstant variance).
Tests
First, run the Breusch-Pagan (car package) test under the null hypothesis of ‘homoscedasticity’.
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.968958 Df = 1 p = 0.01455988
We can reject the null hypothesis at the 5% level.
Second, run the Goldfeld-Quandt test and rerun another Breusch-Pagan test, this time, from the lmtest package.
##
## Goldfeld-Quandt test
##
## data: savings_model
## GQ = 8.7983, df1 = 8, df2 = 7, p-value = 0.004799
## alternative hypothesis: variance increases from segment 1 to 2
##
## studentized Breusch-Pagan test
##
## data: savings_model
## BP = 5.7448, df = 1, p-value = 0.01654
Both tests reject the null hypothesis.
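The three tests above can be sketched on the fitted OLS model, assuming it is named `savings_model` as in the output:

```r
# Heteroscedasticity tests on the OLS fit 'savings_model' (assumed name).
library(car)     # ncvTest: non-constant variance score test
library(lmtest)  # gqtest, bptest
ncvTest(savings_model)   # H0: constant (homoscedastic) error variance
gqtest(savings_model)    # Goldfeld-Quandt: compares the two half-samples
bptest(savings_model)    # studentized Breusch-Pagan
```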
The Breusch-Pagan test is criticized since it requires prior knowledge of what might be causing heteroscedasticity (e.g., from graphical analysis).
The Goldfeld-Quandt test is more flexible: we can divide the sample observations into two groups and compare them. An evidence of heteroscedasticity would be based on a comparison of the residual sum of squares (RSS) using the F-statistic. The larger the F-statistic, the more likely you have different variance for the two groups.
We can determine the appropriate criteria to separate the sample into Group A and Group B. For example, the parameter point = 0.5 sets the breakpoint halfway. We can also omit central observations with fraction. Finally, we can test increasing variances, decreasing variances, or both (the default is increasing variances).
Pointer
There are more procedures, all criticized: the Park test, the Glejser test, the Harvey-Godfrey test, the Spearman's rank correlation test, and the White test.
Remedies
Use the Weighted Least Squares (similar to what we used with the logit model).
Weighted Least Squared (WLS)
We divide the regressors by the standard deviation; the gls function (for Generalized Least Squares, of which WLS is a specific case) computes it. There exist different transformations though.
Run several regressions with different weights and compare them with the AIC.
## [1] 211.6107
## [1] 211.5195
The varPower() weight (for ‘Power of variance covariate’) generates the best AIC.
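A sketch of the winning specification, assuming the `savings` data frame and naming the fit `savings_gls` for illustration:

```r
# WLS via gls(): model the error variance as a power of the fitted values.
# Assumes a data frame 'savings' with columns savings and avg_m_revenues.
library(nlme)
savings_gls <- gls(savings ~ avg_m_revenues, data = savings,
                   weights = varPower(form = ~ fitted(.)))
AIC(savings_gls)  # compare against alternatives such as varExp()
```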
Pointer
The parameters are: varExp(), varPower(), varConstPower(), varIdent(), varFixed(), varComb()
Print the results.
## Generalized least squares fit by REML
## Model: savings ~ avg_m_revenues
## Data: savings
## AIC BIC logLik
## 211.5195 214.8524 -101.7598
##
## Variance function:
## Structure: Power of variance covariate
## Formula: ~fitted(.)
## Parameter estimates:
## power
## 2.061803
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 118.2766 24.644036 4.799399 2e-04
## avg_m_revenues 0.0423 0.008913 4.746549 2e-04
##
## Correlation:
## (Intr)
## avg_m_revenues -0.921
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.4854143 -0.7539166 0.2110880 0.7450446 1.6651669
##
## Residual standard error: 0.0005050905
## Degrees of freedom: 19 total; 17 residual
The intercept is higher (118.3 over 102.3) since we flatten out larger observations, giving more leverage to the smaller observations. All p-values are improved, especially for the intercept.
Plot the results.
The plot shows constant variance as a result.
The problem: we get no \(R^2\), only AIC, BIC and logLik. However, with WLS, \(R^2\) would remain quite similar to what we get with OLS.
Conclusion
Taking into account that richer households show more diversity in terms of financial discipline, we take measures to standardize the data. We have to curb the increasing variability of monthly savings prior to estimating the connections between revenues and savings.
With adjusted coefficients, we can measure that households save at least US$118 per month (baseline). In addition, richer households have a propensity to save more.
With US$2000 in monthly revenues, an average household saves an additional US$84.6 (0.0423 \(\times\) US$2000); an additional US$169.2 with US$4000 in monthly revenues, and an additional US$253.8 with US$6000.
Apart from the WLS, there are other solutions to heteroscedasticity:
MASS package (ridge regression and ‘least absolute shrinkage and selection operator’ or LASSO).
Goal
We have panel data:
company (A, B, C, D).
year.
regressors (x1, x2).
a regressand (y).
Print the head of the dataset.
## year company y x1 x2 d1 d2 d3
## 1 1 A 33.10 1170.6 97.8 0 0 0
## 2 2 A 45.00 2015.8 104.4 0 0 0
## 3 3 A 77.02 2803.3 118.0 0 0 0
The year variable consists of 20 periods repeated for each of the 4 companies. For each year and each company, we have a set of y, x1, and x2.
We want to screen companies. The dataset is trivial. y, x1 and x2 could be any financial metrics (as in our earlier case with market capitalization, indebtedness, and D/E ratio, where we wanted to spot top performers). The goal is to introduce panel data, common practices, and potential problems.
- y can be the gross investment (addition to plants and equipments, plus maintenance and repairs).
- x1 can be the value of the firm (price of common and preferred shares times the number of shares outstanding plus the total book value of debt).
- x2 can be the stock of plant and equipment (accumulated sum and net additions to assets minus depreciation allowance).
A word of panel data
Panel data or longitudinal data are multi-dimensional collections of observations (variables, readings, results). They combine cross-sectional data and time series. For that reason, it is well-advised to work with the zoo and xts packages and ts objects.
Panel data are flexible. No matter how the dataset is organized, we can always adjust it with the dplyr and tidyr packages. Panel data enable the analyst to focus on the differences between individuals or firms in a panel study and/or the changes in observed phenomena for one subject over the course of the study. Analysis of panel data can be complex as they are plagued with problems from both cross sections and time series: heteroscedasticity, autocorrelation and, of course, multicollinearity.
A first model
Run a regression without differentiating between companies and years. We pile up 80 observations.
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\] Print the results.
##
## Call:
## lm(formula = y ~ x1 + x2, data = panel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -320.83 -100.24 -0.13 68.21 336.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.45486 29.57784 -2.112 0.038 *
## x1 0.10910 0.01371 7.956 1.24e-11 ***
## x2 0.30575 0.04926 6.207 2.52e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.3 on 77 degrees of freedom
## Multiple R-squared: 0.7561, Adjusted R-squared: 0.7498
## F-statistic: 119.4 on 2 and 77 DF, p-value: < 2.2e-16
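A call like the following reproduces the kind of summary printed above; the model object name `panel_model1` is taken from later output, and a toy data frame is built in-line so the sketch runs on its own (the real `panel` would come from the import step).

```r
# Pooled OLS over the 80 observations; toy data are illustrative only.
set.seed(42)
panel <- data.frame(x1 = runif(80, 1000, 3000), x2 = runif(80, 90, 200))
panel$y <- 0.11 * panel$x1 + 0.31 * panel$x2 - 60 + rnorm(80, sd = 100)

panel_model1 <- lm(y ~ x1 + x2, data = panel)  # no company or year effects
summary(panel_model1)                          # coefficients, R^2, F-statistic
```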
\(R^2\) are good and we have significant p-values at the 1% level, except for the intercept (5% level).
We do not have any heteroscedasticity or multicollinearity problems here. It is always well-advised to test though.
Time series and panel data are notorious for serial correlation or autocorrelation. Autocorrelation biases the estimated variances and standard errors, making \(R^2\), t- and F-values unreliable.
Test for autocorrelation (where the \(H_0: \rho = 0\) or ‘no autocorrelation’) with the Durbin-Watson test. The test produces two critical values: \(d_{L, \alpha}\) and \(d_{U, \alpha}\).
## lag Autocorrelation D-W Statistic p-value
## 1 0.816305 0.3622964 0
## Alternative hypothesis: rho != 0
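The output above matches the print format of `durbinWatsonTest()` from the car package. A self-contained sketch of the call (toy data stand in for the fitted model, so the numbers will differ):

```r
# Durbin-Watson test on a fitted lm object; requires the car package.
library(car)

set.seed(42)  # toy model, illustrative only
d <- data.frame(x1 = runif(80), x2 = runif(80))
d$y <- d$x1 + rnorm(80)
m <- lm(y ~ x1 + x2, data = d)

durbinWatsonTest(m)  # H0: rho = 0 (no first-order autocorrelation)
```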
For n = 80 (observations), k = 2 (regressors), \(\alpha\) = 0.05, the Durbin-Watson table gives \(d_{L, 0.05} = 1.586\) and \(d_{U, 0.05} = 1.688\).
The computed D-W statistic, \(d\), is below \(d_{L}\). We reject the null hypothesis in favor of the alternative (\(\rho \neq 0\) or rho != 0). According to the results, we have evidence of positive first-order (lag = 1) serial correlation.
The diagnostic table: if \(d < d_{L}\), reject \(H_0\) (evidence of positive autocorrelation); if \(d > d_{U}\), do not reject \(H_0\); if \(d_{L} \leq d \leq d_{U}\), the test is inconclusive. For negative autocorrelation, compare \(4 - d\) with the same bounds.
In addition, for the D-W test to detect first-order serial correlation, the model must include a constant and no lagged regressor dependent on the regressand.
There are other tests to detect autocorrelation: graphical methods such as time sequence plot, runs test or Geary test; analytical techniques such as the modified d test, the large-sample test, the Breusch-Godfrey test of higher-order autocorrelation (lag >= 1), Durbin's h test in the presence of lagged dependent variables. We can perform these tests with the ts and tseries packages.
Run the Breusch-Godfrey test of order 1.
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: panel_model1
## LM test = 53.751, df = 1, p-value = 2.275e-13
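The output above is what `bgtest()` from the lmtest package prints. A self-contained sketch (a toy model stands in for `panel_model1`, so the statistic will differ):

```r
# Breusch-Godfrey test of order 1; requires the lmtest package.
library(lmtest)

set.seed(42)  # toy model, illustrative only
d <- data.frame(x1 = runif(80), x2 = runif(80))
d$y <- d$x1 + rnorm(80)
m <- lm(y ~ x1 + x2, data = d)

bgtest(m, order = 1)  # H0: no serial correlation up to order 1
```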
We confirm the D-W test and reject the ‘no autocorrelation’ hypothesis.
There exist remedies for autocorrelation. We can change the model specification, for example.
A second model
Differentiate between companies by adding three dummy variables, i.e. three columns to the dataset: all ‘0’ accounts for A; column d1, the first dummy, has ‘1’ for B and ‘0’ for the other companies; column d2 has ‘1’ for C and ‘0’ for the other companies; and so on with d3 for D.
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D_1 + \beta_4 D_2 + \beta_5 D_3 + \epsilon\] Run the regression, print the results, and run autocorrelation tests.
##
## Call:
## lm(formula = y ~ x1 + x2 + d1 + d2 + d3, data = panel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -182.995 -48.589 9.516 37.200 195.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -246.81201 35.66791 -6.920 1.39e-09 ***
## x1 0.10814 0.01744 6.202 2.90e-08 ***
## x2 0.34814 0.02656 13.109 < 2e-16 ***
## d1 339.68506 23.89366 14.217 < 2e-16 ***
## d2 157.98102 46.26987 3.414 0.00104 **
## d3 187.53271 31.40690 5.971 7.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75 on 74 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9305
## F-statistic: 212.4 on 5 and 74 DF, p-value: < 2.2e-16
## lag Autocorrelation D-W Statistic p-value
## 1 0.4948028 0.9626941 0
## Alternative hypothesis: rho != 0
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: panel_model2
## LM test = 20.334, df = 1, p-value = 6.502e-06
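The sequence above can be sketched as follows; the model object name `panel_model2` is taken from the output, and a toy data frame with company dummies is built in-line so the sketch runs on its own.

```r
# Second model: pooled OLS plus company dummies (toy data, illustrative only).
set.seed(42)
panel <- data.frame(
  company = rep(c("A", "B", "C", "D"), each = 20),
  x1 = runif(80, 1000, 3000), x2 = runif(80, 90, 200)
)
panel$d1 <- as.integer(panel$company == "B")
panel$d2 <- as.integer(panel$company == "C")
panel$d3 <- as.integer(panel$company == "D")
panel$y  <- 0.11 * panel$x1 + 0.35 * panel$x2 +
            300 * panel$d1 + rnorm(80, sd = 75)

panel_model2 <- lm(y ~ x1 + x2 + d1 + d2 + d3, data = panel)
summary(panel_model2)  # company intercept shifts appear as d1, d2, d3
```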
The \(R^2\) values are stronger because we have more explanatory variables. All p-values are significant at the 0.1% level, except one at the 1% level.
For n = 80 and k = 5, the Durbin-Watson table gives \(d_{L, 0.05}\) = 1.507 and \(d_{U, 0.05}\) = 1.772.
The calculated D-W statistic, \(d\), is below \(d_{L}\): we still have evidence of positive first-order (lag = 1) serial correlation. The Breusch-Godfrey test confirms the conclusion.
Compare both models.
## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 + x2 + d1 + d2 + d3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 1558444
## 2 74 416230 3 1142213 67.69 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
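The comparison above is a nested-model F-test, produced by passing both fitted models to `anova()`. A self-contained sketch (toy data, so the statistics will differ):

```r
# Compare the pooled model and the dummy-variable model (toy data).
set.seed(42)
panel <- data.frame(
  company = rep(c("A", "B", "C", "D"), each = 20),
  x1 = runif(80, 1000, 3000), x2 = runif(80, 90, 200)
)
panel$d1 <- as.integer(panel$company == "B")
panel$d2 <- as.integer(panel$company == "C")
panel$d3 <- as.integer(panel$company == "D")
panel$y  <- 0.11 * panel$x1 + 300 * panel$d1 + rnorm(80, sd = 75)

panel_model1 <- lm(y ~ x1 + x2, data = panel)
panel_model2 <- lm(y ~ x1 + x2 + d1 + d2 + d3, data = panel)
anova(panel_model1, panel_model2)  # F-test on the three dummy coefficients
```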
The p-value is significant at the 0.1% level. Both models are statistically different and Model 2 is better, as indicated by its higher \(R^2\). However, autocorrelation is still present.
A third model
We could add dummy variables to differentiate between time periods, to take into account different tax regimes or technological changes for example.
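One way to sketch such a third model is `factor(year)`, which creates one dummy per period (the first year being the reference level); `factor(company)` does the same for the company dummies. The data below are a toy stand-in for the imported panel.

```r
# Third model sketch: company and time dummies via factor() (toy data).
set.seed(42)
panel <- data.frame(
  year    = rep(1:20, times = 4),
  company = rep(c("A", "B", "C", "D"), each = 20),
  x1 = runif(80, 1000, 3000), x2 = runif(80, 90, 200)
)
panel$y <- 0.11 * panel$x1 + 0.31 * panel$x2 + rnorm(80, sd = 100)

panel_model3 <- lm(y ~ x1 + x2 + factor(company) + factor(year), data = panel)
length(coef(panel_model3))  # intercept + 2 slopes + 3 company + 19 year dummies
```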
Conclusion
Our goal is reached.
If we cannot remedy autocorrelation by changing the functional form, we have to change the model. Here are a few pointers.
These should cover the majority of situations.
Panel data are well-suited to comparisons over time: within a group (individuals or firms) and between groups. For example, we can test whether two samples of the same population have an equal mean or an equal variance, or compare two or more groups in terms of mean or variance equality. We can run post-hoc tests and pairwise comparisons (Tukey’s procedures), test simple effects, interaction effects, and random effects. In all, we can perform all kinds of tests.
Another avenue is multilevel analyses (with the multilevel and nlme packages). These models measure the impact of a group on individuals or firms. For example, individuals may be nested within workgroups.
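A random-intercept model of that kind can be sketched with `lme()` from the nlme package, treating companies as the grouping level. The data below are a toy stand-in for the panel; the group-level noise term simulates company-specific intercepts.

```r
# Multilevel sketch: fixed slopes for x1 and x2, random intercept per company.
library(nlme)

set.seed(42)  # toy data, illustrative only
panel <- data.frame(
  company = rep(c("A", "B", "C", "D"), each = 20),
  x1 = runif(80, 1000, 3000), x2 = runif(80, 90, 200)
)
panel$y <- 0.11 * panel$x1 + 0.31 * panel$x2 +
  rep(rnorm(4, sd = 150), each = 20) +  # company-level intercept shifts
  rnorm(80, sd = 75)                    # observation-level noise

ml <- lme(y ~ x1 + x2, random = ~ 1 | company, data = panel)
summary(ml)  # fixed effects plus a random intercept per company
```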