Foreword
In many cases, the dependent variable is of the count type:
Sometimes, count data also include rare occurrences, such as:
These cases are discrete and non-negative integers (count values) measured over a certain finite time period.
The Poisson and negative binomial distributions take into account this unique feature of count data.
The first case addresses the most common count regression models. The case skims through the most common features of count models.
The second case dives a little bit deeper: using different count models, their variations, adding tests and elaborating of flaws and remedies.
The number of news stories in a month certainly is an event count. The monthly data are time dependent. We could address the case with a time series methods (ARIMA models) to account for time.
| Date | Energy | Unemploy | Approval | rmn1173 | rmn1173a |
|---|---|---|---|---|---|
| Jan-69 | 2 | 3.4 | 59 | 0 | 0 |
| Feb-69 | 6 | 3.4 | 60 | 0 | 0 |
| Mar-69 | 11 | 3.4 | 64 | 0 | 0 |
| Apr-69 | 5 | 3.4 | 60 | 0 | 0 |
| May-69 | 0 | 3.4 | 63 | 0 | 0 |
The dependent variable is Energy, is the number of television news stories related to energy policy in a given month (with the Date variable).
The independent variables, such the Unemployment rate and the president Approval rating, will be documented along the way.
TV coverage of energy policy is a function of 6 terms for presidential speeches, an indicator for the Arab oil embargo, an indicator for the Iran hostage crisis, the price of oil, presidential approval, and the unemployment rate.
##
## Call:
## glm(formula = Energy ~ rmn1173 + grf0175 + grf575 + jec477 +
## jec1177 + jec479 + embargo + hostages + oilc + Approval +
## Unemploy, family = poisson(link = log), data = pres.energy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.383 -2.994 -1.054 1.536 11.399
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.250093 0.329121 40.259 < 2e-16 ***
## rmn1173 0.694714 0.077009 9.021 < 2e-16 ***
## grf0175 0.468294 0.096169 4.870 1.12e-06 ***
## grf575 -0.130568 0.162191 -0.805 0.420806
## jec477 1.108520 0.122211 9.071 < 2e-16 ***
## jec1177 0.576779 0.155511 3.709 0.000208 ***
## jec479 1.076455 0.095066 11.323 < 2e-16 ***
## embargo 0.937796 0.051110 18.349 < 2e-16 ***
## hostages -0.094507 0.046166 -2.047 0.040647 *
## oilc -0.213498 0.008052 -26.515 < 2e-16 ***
## Approval -0.034096 0.001386 -24.599 < 2e-16 ***
## Unemploy -0.090204 0.009678 -9.321 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 6009.0 on 179 degrees of freedom
## Residual deviance: 2598.8 on 168 degrees of freedom
## AIC: 3488.3
##
## Number of Fisher Scoring iterations: 5
The coefficients are not very meaningful although most of them are significant.
We want to take the exponential of a coefficient (the link function is simply a logarithm) to compute the count ratio, which allows us to deduce the percentage change in the expected count for a change in the independent variable.
For instance, if we wanted to interpret the effect of presidential approval.
| Coeff. | exp(Coeff.) |
|---|---|
| -0.034096 | 0.9664787 |
As a quick way to get the count ratio and percentage change for every coefficient, we compute \((exp(Coef.) - 1)) * 100\); except for the intercept.
For a percentage point increase in the president’s approval rating, coverage of energy policy change by 0.9664787 or -3.3521266 % on average, controlling for all other variables.
The exponentiated coefficients.
| rmn1173 | grf0175 | grf575 | jec477 | jec1177 | jec479 |
|---|---|---|---|---|---|
| 2.003135 | 1.597266 | 0.877597 | 3.02987 | 1.780294 | 2.934259 |
| embargo | hostages | oilc | Approval | Unemploy |
|---|---|---|---|---|
| 2.554346 | 0.9098211 | 0.8077536 | 0.9664787 | 0.9137448 |
We can simply read off the percentage change for a one-unit increase in the input, controlling for the other variables.
An intriguing feature of the Poisson distribution is the variance is the same as the mean (‘equidispersion’). Hence, when we model the logarithm of the mean, our model is simultaneously modeling the variance.
Often, however, we find that the variance is wider than we would expect given the covariates — a phenomenon called ‘overdispersion’.
Statistics about Energy, the number of television news stories.
| mean | variance |
|---|---|
| 34.32222 | 1587.449 |
The means does not equal the variance.
NBRM offers a solution to this problem by estimating an extra dispersion parameter that allows the conditional variance to differ from the conditional mean.
Notice that the syntax is similar to the glm command, but there is no family option since the comment itself specifies that.
##
## Call:
## glm.nb(formula = Energy ~ rmn1173 + grf0175 + grf575 + jec477 +
## jec1177 + jec479 + embargo + hostages + oilc + Approval +
## Unemploy, data = pres.energy, init.theta = 2.149960724, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7702 -0.9635 -0.2624 0.3569 2.2034
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.299318 1.291013 11.851 < 2e-16 ***
## rmn1173 0.722292 0.752005 0.960 0.33681
## grf0175 0.288242 0.700429 0.412 0.68069
## grf575 -0.227584 0.707969 -0.321 0.74786
## jec477 0.965964 0.703611 1.373 0.16979
## jec1177 0.573210 0.702534 0.816 0.41455
## jec479 1.141528 0.694927 1.643 0.10045
## embargo 1.140854 0.350077 3.259 0.00112 **
## hostages 0.089438 0.197520 0.453 0.65069
## oilc -0.276592 0.030104 -9.188 < 2e-16 ***
## Approval -0.032082 0.005796 -5.536 3.1e-08 ***
## Unemploy -0.077013 0.037630 -2.047 0.04070 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.15) family taken to be 1)
##
## Null deviance: 393.02 on 179 degrees of freedom
## Residual deviance: 194.74 on 168 degrees of freedom
## AIC: 1526.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.150
## Std. Err.: 0.242
##
## 2 x log-likelihood: -1500.427
The coefficients can be interpreted in the same way than in the PRM because both model the logarithm of the mean. The key addition is the dispersion parameter Theta or \(\theta\) = 2.1499607 with a standard error of 0.2418571.
With equidispersion, \(\theta\) should remain close to 1. This indicates that overdispersion is present in this model.
Therefore, many of the inferences drawn vary between the PRM and NBRM.
The exponentiated coefficients are presented side-by-side.
PRM
| rmn1173 | grf0175 | grf575 | jec477 | jec1177 | jec479 |
|---|---|---|---|---|---|
| 2.003135 | 1.597266 | 0.877597 | 3.02987 | 1.780294 | 2.934259 |
BNRM
| rmn1173 | grf0175 | grf575 | jec477 | jec1177 | jec479 |
|---|---|---|---|---|---|
| 2.059147 | 1.33408 | 0.7964554 | 2.627319 | 1.773952 | 3.131549 |
PRM (cont’d)
| embargo | hostages | oilc | Approval | Unemploy |
|---|---|---|---|---|
| 2.554346 | 0.9098211 | 0.8077536 | 0.9664787 | 0.9137448 |
BNRM (cont’d)
| embargo | hostages | oilc | Approval | Unemploy |
|---|---|---|---|---|
| 3.12944 | 1.09356 | 0.7583639 | 0.9684274 | 0.9258779 |
As the results show, many of the discernible results from the PRM are not discernible in the NBRM.
The AIC.
| PRM | NBRM |
|---|---|
| 3488.283 | 1526.427 |
The AIC is substantially lower for the NBRM (a better fit).
Suppose we wanted to plot the effect of presidential Approval on the number of TV news stories on Energy, based on two models.
We set the value of all dummy variable predictors through their model value of 0, while the price of oilc and Unemployment are set to their mean.
If we fail to insert assignable values for the covariates, the predicted counts will not resemble the actual mean and the size of the effect will not be reasonable.
The way in which we use the predict command to forecast average counts with multiple predictors is exactly the same as for logit and probit models.
In our data the variable approval ranges from 24 % approval to 72.3 %.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 42.00 49.00 48.62 57.00 72.30
Thus, we construct a vector that includes the full range of approval as well as plausible values of all the other predictors:
approval <- seq(24, 72.3, by = 0.10)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 36.08 48.15 48.15 60.23 72.30
We create a matrix of hypothetical data values – setting the indicators variable to zero, and continuous variables to their means, and approval to its range of hypothetical values.
inputs.4 <-
cbind(1, 0, 0, 0, 0, 0, 0, 0, 0,
mean(pres.energy$oilc),
approval,
mean(pres.energy$Unemploy))The first line belongs to dummies: constant, rmn1173, grf0175, grf575, jec477, jec1177, jec479, embargo, hostages.
The second and fourth lines are the means of oilc and Unemployment.
The third line is our variable of interest; spanning from 24 % to 72.3 %.
We name the columns of the matrix after the variables in our model.
colnames(inputs.4) <- c("constant", "rmn1173", "grf0175", "grf575", "jec477", "jec1177", "jec479", "embargo", "hostages", "oilc", "Approval", "Unemploy")The matrix of predictor values is converted to a data frame.
inputs.4 <- as.data.frame(inputs.4)Once we have the data frame of predictors in place, we can use is the predict command to forecast the expected counts for the Poisson and negative binomial models:
forecast.poisson <-
predict(energy.poisson,
newdata = inputs.4,
type = "response")
forecast.nb <-
predict(energy.nb,
newdata = inputs.4,
type = "response")These two lines only differ in the model from which they drop coefficient estimates for the forecast.
The predictions from the two models are similar and show a similar negative effect on approval.
The NBRM has a slightly lower forecast at low values of approval. The NBRM has a slightly shallower effect on approval. The PRM and NVRM converge; the predicted counts overlap high values of approval.
We use a dataset from the AER package: RecreationDemand.
It is a cross-section dataset on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.
| trips | quality | ski | income | userfee | costC | costS | costH |
|---|---|---|---|---|---|---|---|
| 0 | 0 | yes | 4 | no | 67.59 | 68.620 | 76.800 |
| 0 | 0 | no | 9 | no | 68.86 | 70.936 | 84.780 |
| 0 | 0 | yes | 5 | no | 58.12 | 59.465 | 72.110 |
| 0 | 0 | no | 2 | no | 15.79 | 13.750 | 23.680 |
| 0 | 0 | yes | 3 | no | 24.02 | 34.033 | 34.547 |
trips.quality, a (subjective) quality ranking of the facility.ski, a factor indicating whether the individual engaged in water-skiing at the lake.income, household income.userfee, a factor indicating whether the individual paid a user’s fee at the lake,costC, costS, costHthree, cost variables representing opportunity costs.The regression model.
\[E(y|x) = \mu = e^{(x_i^T~\beta)} \]
Where \(\mu\) is the expected mean.
Count models are similar to logit when writing the formula: family = poisson(link = log) in the glm function:
##
## Call:
## glm(formula = trips ~ ., family = poisson(link = log), data = RecreationDemand)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -11.8465 -1.1411 -0.8896 -0.4780 18.6071
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.264993 0.093722 2.827 0.00469 **
## quality 0.471726 0.017091 27.602 < 2e-16 ***
## skiyes 0.418214 0.057190 7.313 2.62e-13 ***
## income -0.111323 0.019588 -5.683 1.32e-08 ***
## userfeeyes 0.898165 0.078985 11.371 < 2e-16 ***
## costC -0.003430 0.003118 -1.100 0.27131
## costS -0.042536 0.001670 -25.467 < 2e-16 ***
## costH 0.036134 0.002710 13.335 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4849.7 on 658 degrees of freedom
## Residual deviance: 2305.8 on 651 degrees of freedom
## AIC: 3074.9
##
## Number of Fisher Scoring iterations: 7
Almost all regressors are highly significant, but the coefficient estimates are not very meaningful.
We take the exponential of a coefficient (the link function is simply a logarithm) to compute the count ratio.
We then deduce the percentage change in the expected count for a change in the independent variable. For example, 1 % increase in quality changes trips by 1.602758 %, controlling for all other variables.
The count ratio for all coefficients.
| Variables | exp(Coeff.) |
|---|---|
| quality | 1.6027580 |
| ski | 1.5192453 |
| income | 0.8946496 |
| userfee | 2.4550945 |
| costC | 0.9965762 |
| costS | 0.9583556 |
| costH | 1.0367944 |
The Poisson distribution has the property that the variance equals the mean (‘equidispersion’).
However, the PRM is often plagued by overdispersion, meaning that the variance is larger than the linear predictor permits.
\[var(y|x) = \mu + \alpha \times h(\mu)\] Where \(h\) is a positive function of \(\mu\). Overdispersion (or underdispersion) corresponds to \(\alpha > 0\) (or \(\alpha < 0\)).
The reparameterization.
\[var(y|x) = (1 + \alpha) \times \mu = dispersion \times \mu\] The PRM becomes the quasi-Poisson model with a dispersion parameter.
We test for overdispersion.
\(H_0:\alpha\) = 0 or no overdispersion.
By default, the above parameterization is used. Alternatively, if the argument trafo is specified, the test is formulated in terms of the parameter \(\alpha\).
## $estimate
## alpha
## 1.316051
##
## $p.value
## [1] 0.001651024
\(H_0\) is rejected (the p-value is less than 5%).
The PRM for trips data is not well specified in that there appears to be a substantial amount of overdispersion.
One possible remedy is to consider a more flexible distribution that does not impose equality of mean and variance.
The NBRM is considered a mixture distribution arising from a Poisson distribution with random scale, the latter following a gamma distribution.
\[var(y;\mu;\theta) = \mu + \frac{1}{\theta}\mu^2;~\mu > 0,~\theta > 0\]
Where \(h(\mu) = \mu^2\) and \(\alpha = 1/\theta\) (from the previous form above). \(\mu\) is the mean.
The variance is always larger than the mean, in contrast to the Poisson distribution in which mean equals variance.
As \(\theta\) tends to infinity, \(\mu\) stays constant, the NBRM approaches the PRM.
The geometric distribution is the special case where \(\theta = 1\).
We could use a geometric regression via family = negative.binomial(theta = 1) in the glm function.
For unknown \(\theta\), the function glm.nb is available.
##
## Call:
## glm.nb(formula = trips ~ ., data = RecreationDemand, init.theta = 0.7292568331,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9727 -0.6256 -0.4619 -0.2897 5.0494
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.121936 0.214303 -5.235 1.65e-07 ***
## quality 0.721999 0.040117 17.998 < 2e-16 ***
## skiyes 0.612139 0.150303 4.073 4.65e-05 ***
## income -0.026059 0.042453 -0.614 0.539
## userfeeyes 0.669168 0.353021 1.896 0.058 .
## costC 0.048009 0.009185 5.227 1.72e-07 ***
## costS -0.092691 0.006653 -13.931 < 2e-16 ***
## costH 0.038836 0.007751 5.011 5.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.7293) family taken to be 1)
##
## Null deviance: 1244.61 on 658 degrees of freedom
## Residual deviance: 425.42 on 651 degrees of freedom
## AIC: 1669.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.7293
## Std. Err.: 0.0747
##
## 2 x log-likelihood: -1651.1150
With equidispersion, \(\theta\) should remain close to 1.
The shape parameter of the fitted negative binomial distribution is Theta or \(\theta\) = 0.7292568, suggesting a considerable amount of overdispersion and confirming the results from the overdispersion test.
We note that the negative binomial model represents a substantial improvement in terms of log-likelihood (logLik).
| PRM | NBRM |
|---|---|
| -1529.431 | -825.5576 |
The NBRM is closer to 0 (better model).
Instead of switching to a more general family of distributions, it is also possible to use the Poisson estimates along with a set of standard errors estimated under less restrictive assumptions: the sandwich variance estimators for GLMs.
The resulting estimates are often called Huber-White standard errors (from the PRM).
The sandwich estimator is based on first and second derivatives of the likelihood: the outer product of gradients (OPG) forms the “meat” and the second derivatives the “bread” of the sandwich. Both are asymptotically equivalent under correct specification.
Let us compare them: the Poisson (first row) results with the H-W results (second row).
The third row is Poisson/H-W.
| (Intercept) | quality | skiyes | income | userfeeyes | costC | costS | costH |
|---|---|---|---|---|---|---|---|
| 0.094 | 0.017 | 0.057 | 0.020 | 0.079 | 0.003 | 0.002 | 0.003 |
| 0.432 | 0.049 | 0.194 | 0.050 | 0.247 | 0.015 | 0.012 | 0.009 |
| 0.217 | 0.350 | 0.295 | 0.389 | 0.320 | 0.212 | 0.142 | 0.289 |
The third row is far from being equal to 1. Such ratio would indicate equality. However, the result indicates that the simple PRM is inadequate.
A typical problem with count data regressions arising in microeconometric and marketing applications is that the number of zeros is often much larger than a Poisson or negative binomial regression permit. We can visualize it with a histogram.
The RecreationDemand data contain a large number of zeros; in fact, no less than 63.2776935 % of all respondents report no trips to Lake Somerville, making 0 the ‘most frequent value’.
The actual count from RecreationDemand dataset (659 observations) vs. the expected count from the Poisson distribution.
The PRM misses the actuals by a great deal.
The Poisson regression fitted above only provides 42.0333839 % of zero observations, again suggesting that this model is not satisfactory`the PRM does fit the actual data distribution.
There a number of very high values (more than 10 trips).
This again underlines that there are problems with the previously considered specification. In fact, the variance-to-mean ratio for the trips variable equals 17.6425026, a value usually too large to be accommodated by covariates (a value confirming the dispersion \(\theta\) form above).
The actual count from RecreationDemand dataset (659 observations) vs. the expected count from the Poisson distribution (PRM) and the NBRM count (we estimated above).
We see an improvement with the NBRM. Can we do better?
Alternative models are needed.
One such model is the ZIPM, which suggests a mixture specification with a Poisson count component and an additional point mass at zero.
We consider a regression of trips on all further variables for the count part (using a negative binomial distribution) and model the inflation part as a function of quality and income.
\[f_{zeroinfl}(y) = p \times I_{\{0\}}(y) + (1-p) \times f_{count}(y; \mu)\] We fit zero-inflated regression models for count data via maximum likelihood; the zeroinfl function allows for Poisson, geometric, and negative binomial distributions for the count component \(f_{count}(y)\).
We employ all available regressors, while the inflation part, separated by |, only uses quality and income. The inflation part by default uses the logit model, but all other links available in glm may also be used upon setting the argument link appropriately.
##
## Call:
## zeroinfl(formula = trips ~ . | quality + income, data = RecreationDemand,
## dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.08885 -0.20037 -0.05696 -0.04509 40.01393
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.096634 0.256679 4.272 1.93e-05 ***
## quality 0.168911 0.053032 3.185 0.001447 **
## skiyes 0.500694 0.134488 3.723 0.000197 ***
## income -0.069268 0.043800 -1.581 0.113775
## userfeeyes 0.542786 0.282801 1.919 0.054944 .
## costC 0.040445 0.014520 2.785 0.005345 **
## costS -0.066206 0.007745 -8.548 < 2e-16 ***
## costH 0.020596 0.010233 2.013 0.044146 *
## Log(theta) 0.190175 0.112989 1.683 0.092352 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.7427 1.5561 3.691 0.000224 ***
## quality -8.3074 3.6816 -2.256 0.024041 *
## income -0.2585 0.2821 -0.916 0.359504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 1.2095
## Number of iterations in BFGS optimization: 26
## Log-likelihood: -722 on 12 Df
We can compare the results with the other models; let us compare log-likelihood (logLik)
| PRM | NBRM | ZIPM |
|---|---|---|
| -1529.431 | -825.5576 | -721.9525 |
We have a clearly improved log-likelihood compared with the two previous models.
The Observed count from RecreationDemand dataset (659 observations) vs. the expected count from a Poisson distribution.
The actual count from RecreationDemand dataset vs. the expected count from ZIPM.
The expected trip counts shows that both the zeros and the remaining counts are captured much better with the ZIPM than in the previous models (especially for ‘0’ trips).
Another model that is able to capture excessively large (or small) numbers of zeros (or low values) is the NBHM. It is more widely used than the ZIPM.
The NBHM consists of two parts (hence it is also called a ‘two-part model’):
The NBHM is somewhat easier to fit than the ZIPM because the resulting likelihood decomposes into two parts that may be maximized separately.
Here it is possible to specify either:
The first variant appears to be more common; however, the Bernoulli specification is more intuitively interpretable as a hurdle. We employ the latter, which is equivalent to the geometric distribution.
##
## Call:
## hurdle(formula = trips ~ . | quality + income, data = RecreationDemand,
## dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.6097 -0.2071 -0.1845 -0.1642 12.1114
##
## Count model coefficients (truncated negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.84193 0.38278 2.200 0.02784 *
## quality 0.17170 0.07234 2.374 0.01762 *
## skiyes 0.62236 0.19013 3.273 0.00106 **
## income -0.05709 0.06452 -0.885 0.37629
## userfeeyes 0.57634 0.38508 1.497 0.13448
## costC 0.05707 0.02169 2.632 0.00850 **
## costS -0.07752 0.01155 -6.713 1.9e-11 ***
## costH 0.01237 0.01490 0.830 0.40640
## Log(theta) -0.53031 0.26114 -2.031 0.04228 *
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76631 0.36230 -7.635 2.25e-14 ***
## quality 1.50291 0.10032 14.981 < 2e-16 ***
## income -0.04467 0.07853 -0.569 0.569
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta: count = 0.5884
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -765.1 on 12 Df
The actual count from RecreationDemand dataset vs. the expected count from NBHM.
Like the ZIPM, the NBHM results in a considerable improvement over the initial Poisson specification.
We compare the results with the other models; let us compare log-likelihood (logLik)
| PRM | NBRM | ZIPM | NBHM |
|---|---|---|---|
| -1529.431 | -825.5576 | -721.9525 | -765.0984 |
Overall, the best model remains the ZIPM (lowest log-likelihood).