Foreword

Output options: ‘tango’ syntax.
Snippets and results.
Sources: ‘Political Analysis Using R, Springer, 2015’ and ‘Applied Econometrics with R, Springer, 2008’.

Modeling count data

In many cases, the dependent variable is of the count type:

number of visits to a webpage in a promotion day,
number of patients in an emergency in a week,
number of tickets sold one month before the show,
number of cars through a street or a speed radar per week,
demand for BBQ chickens each day,
time frequency of tweets per hour,
etc.

Sometimes, count data also include rare occurrences, such as:

getting a car accident within a day of deliveries,
having the same network converter breaking down in two successive weeks and
counting the number of middle manager departures for one department in a year.

These cases are discrete and non-negative integers (count values) measured over a certain finite time period.

The Poisson and negative binomial distributions take into account this unique feature of count data.

The first case addresses the most common count regression models. The case skims through the most common features of count models.

The second case dives a little bit deeper: using different count models, their variations, adding tests and elaborating of flaws and remedies.

Case 1, Counting TV news stories

The number of news stories in a month certainly is an event count. The monthly data are time dependent. We could address the case with a time series methods (ARIMA models) to account for time.

Date	Energy	Unemploy	Approval
Jan-69	2	3.4	59
Feb-69	6	3.4	60
Mar-69	11	3.4	64
Apr-69	5	3.4	60
May-69	0	3.4	63

The dependent variable is Energy, is the number of television news stories related to energy policy in a given month (with the Date variable).

The independent variables, such the Unemployment rate and the president Approval rating, will be documented along the way.

The Poisson regression model (PRM)

TV coverage of energy policy is a function of 6 terms for presidential speeches, an indicator for the Arab oil embargo, an indicator for the Iran hostage crisis, the price of oil, presidential approval, and the unemployment rate.

## 
## Call:
## glm(formula = Energy ~ rmn1173 + grf0175 + grf575 + jec477 + 
##     jec1177 + jec479 + embargo + hostages + oilc + Approval + 
##     Unemploy, family = poisson(link = log), data = pres.energy)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -8.383  -2.994  -1.054   1.536  11.399  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 13.250093   0.329121  40.259  < 2e-16 ***
## rmn1173      0.694714   0.077009   9.021  < 2e-16 ***
## grf0175      0.468294   0.096169   4.870 1.12e-06 ***
## grf575      -0.130568   0.162191  -0.805 0.420806    
## jec477       1.108520   0.122211   9.071  < 2e-16 ***
## jec1177      0.576779   0.155511   3.709 0.000208 ***
## jec479       1.076455   0.095066  11.323  < 2e-16 ***
## embargo      0.937796   0.051110  18.349  < 2e-16 ***
## hostages    -0.094507   0.046166  -2.047 0.040647 *  
## oilc        -0.213498   0.008052 -26.515  < 2e-16 ***
## Approval    -0.034096   0.001386 -24.599  < 2e-16 ***
## Unemploy    -0.090204   0.009678  -9.321  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6009.0  on 179  degrees of freedom
## Residual deviance: 2598.8  on 168  degrees of freedom
## AIC: 3488.3
## 
## Number of Fisher Scoring iterations: 5

The coefficients are not very meaningful although most of them are significant.

We want to take the exponential of a coefficient (the link function is simply a logarithm) to compute the count ratio, which allows us to deduce the percentage change in the expected count for a change in the independent variable.

For instance, if we wanted to interpret the effect of presidential approval.

Coeff.	exp(Coeff.)
-0.034096	0.9664787

As a quick way to get the count ratio and percentage change for every coefficient, we compute \((exp(Coef.) - 1)) * 100\); except for the intercept.

For a percentage point increase in the president’s approval rating, coverage of energy policy change by 0.9664787 or -3.3521266 % on average, controlling for all other variables.

The exponentiated coefficients.

rmn1173	grf0175	grf575	jec477	jec1177	jec479
2.003135	1.597266	0.877597	3.02987	1.780294	2.934259

embargo	hostages	oilc	Approval	Unemploy
2.554346	0.9098211	0.8077536	0.9664787	0.9137448

We can simply read off the percentage change for a one-unit increase in the input, controlling for the other variables.

An intriguing feature of the Poisson distribution is the variance is the same as the mean (‘equidispersion’). Hence, when we model the logarithm of the mean, our model is simultaneously modeling the variance.

Often, however, we find that the variance is wider than we would expect given the covariates — a phenomenon called ‘overdispersion’.

Statistics about Energy, the number of television news stories.

mean	variance
34.32222	1587.449

The means does not equal the variance.

The negative binomial regression model (NBRM)

NBRM offers a solution to this problem by estimating an extra dispersion parameter that allows the conditional variance to differ from the conditional mean.

Notice that the syntax is similar to the glm command, but there is no family option since the comment itself specifies that.

## 
## Call:
## glm.nb(formula = Energy ~ rmn1173 + grf0175 + grf575 + jec477 + 
##     jec1177 + jec479 + embargo + hostages + oilc + Approval + 
##     Unemploy, data = pres.energy, init.theta = 2.149960724, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7702  -0.9635  -0.2624   0.3569   2.2034  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 15.299318   1.291013  11.851  < 2e-16 ***
## rmn1173      0.722292   0.752005   0.960  0.33681    
## grf0175      0.288242   0.700429   0.412  0.68069    
## grf575      -0.227584   0.707969  -0.321  0.74786    
## jec477       0.965964   0.703611   1.373  0.16979    
## jec1177      0.573210   0.702534   0.816  0.41455    
## jec479       1.141528   0.694927   1.643  0.10045    
## embargo      1.140854   0.350077   3.259  0.00112 ** 
## hostages     0.089438   0.197520   0.453  0.65069    
## oilc        -0.276592   0.030104  -9.188  < 2e-16 ***
## Approval    -0.032082   0.005796  -5.536  3.1e-08 ***
## Unemploy    -0.077013   0.037630  -2.047  0.04070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.15) family taken to be 1)
## 
##     Null deviance: 393.02  on 179  degrees of freedom
## Residual deviance: 194.74  on 168  degrees of freedom
## AIC: 1526.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.150 
##           Std. Err.:  0.242 
## 
##  2 x log-likelihood:  -1500.427

The coefficients can be interpreted in the same way than in the PRM because both model the logarithm of the mean. The key addition is the dispersion parameter Theta or \(\theta\) = 2.1499607 with a standard error of 0.2418571.

With equidispersion, \(\theta\) should remain close to 1. This indicates that overdispersion is present in this model.

Therefore, many of the inferences drawn vary between the PRM and NBRM.

The two models

The exponentiated coefficients are presented side-by-side.

PRM

rmn1173	grf0175	grf575	jec477	jec1177	jec479
2.003135	1.597266	0.877597	3.02987	1.780294	2.934259

BNRM

rmn1173	grf0175	grf575	jec477	jec1177	jec479
2.059147	1.33408	0.7964554	2.627319	1.773952	3.131549

PRM (cont’d)

embargo	hostages	oilc	Approval	Unemploy
2.554346	0.9098211	0.8077536	0.9664787	0.9137448

BNRM (cont’d)

embargo	hostages	oilc	Approval	Unemploy
3.12944	1.09356	0.7583639	0.9684274	0.9258779

As the results show, many of the discernible results from the PRM are not discernible in the NBRM.

The AIC.

PRM	NBRM
3488.283	1526.427

The AIC is substantially lower for the NBRM (a better fit).

Visualization

Suppose we wanted to plot the effect of presidential Approval on the number of TV news stories on Energy, based on two models.

We set the value of all dummy variable predictors through their model value of 0, while the price of oilc and Unemployment are set to their mean.

If we fail to insert assignable values for the covariates, the predicted counts will not resemble the actual mean and the size of the effect will not be reasonable.

The way in which we use the predict command to forecast average counts with multiple predictors is exactly the same as for logit and probit models.

In our data the variable approval ranges from 24 % approval to 72.3 %.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   42.00   49.00   48.62   57.00   72.30

Thus, we construct a vector that includes the full range of approval as well as plausible values of all the other predictors:

approval <- seq(24, 72.3, by = 0.10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   36.08   48.15   48.15   60.23   72.30

We create a matrix of hypothetical data values – setting the indicators variable to zero, and continuous variables to their means, and approval to its range of hypothetical values.

inputs.4 <-
  cbind(1, 0, 0, 0, 0, 0, 0, 0, 0,
        mean(pres.energy$oilc), 
        approval,
        mean(pres.energy$Unemploy))

The first line belongs to dummies: constant, rmn1173, grf0175, grf575, jec477, jec1177, jec479, embargo, hostages.

The second and fourth lines are the means of oilc and Unemployment.

The third line is our variable of interest; spanning from 24 % to 72.3 %.

We name the columns of the matrix after the variables in our model.

colnames(inputs.4) <-   c("constant", "rmn1173", "grf0175", "grf575", "jec477", "jec1177", "jec479", "embargo", "hostages", "oilc", "Approval", "Unemploy")

The matrix of predictor values is converted to a data frame.

inputs.4 <- as.data.frame(inputs.4)

Once we have the data frame of predictors in place, we can use is the predict command to forecast the expected counts for the Poisson and negative binomial models:

forecast.poisson <- 
  predict(energy.poisson, 
          newdata = inputs.4, 
          type = "response")

forecast.nb <- 
  predict(energy.nb, 
          newdata = inputs.4, 
          type = "response")

These two lines only differ in the model from which they drop coefficient estimates for the forecast.

The predictions from the two models are similar and show a similar negative effect on approval.

The NBRM has a slightly lower forecast at low values of approval. The NBRM has a slightly shallower effect on approval. The PRM and NVRM converge; the predicted counts overlap high values of approval.

Case 2, Counting recreational boating trips

We use a dataset from the AER package: RecreationDemand.

It is a cross-section dataset on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.

ski	income	userfee	costC	costS	costH
yes	4	no	67.59	68.620	76.800
no	9	no	68.86	70.936	84.780
yes	5	no	58.12	59.465	72.110
no	2	no	15.79	13.750	23.680
yes	3	no	24.02	34.033	34.547

The dependent variable is trips.
quality, a (subjective) quality ranking of the facility.
ski, a factor indicating whether the individual engaged in water-skiing at the lake.
income, household income.
userfee, a factor indicating whether the individual paid a user’s fee at the lake,
costC, costS, costHthree, cost variables representing opportunity costs.

Poisson regression model (PRM)

The regression model.

\[E(y|x) = \mu = e^{(x_i^T~\beta)} \]

Where \(\mu\) is the expected mean.

Count models are similar to logit when writing the formula: family = poisson(link = log) in the glm function:

## 
## Call:
## glm(formula = trips ~ ., family = poisson(link = log), data = RecreationDemand)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.8465   -1.1411   -0.8896   -0.4780   18.6071  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.264993   0.093722   2.827  0.00469 ** 
## quality      0.471726   0.017091  27.602  < 2e-16 ***
## skiyes       0.418214   0.057190   7.313 2.62e-13 ***
## income      -0.111323   0.019588  -5.683 1.32e-08 ***
## userfeeyes   0.898165   0.078985  11.371  < 2e-16 ***
## costC       -0.003430   0.003118  -1.100  0.27131    
## costS       -0.042536   0.001670 -25.467  < 2e-16 ***
## costH        0.036134   0.002710  13.335  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4849.7  on 658  degrees of freedom
## Residual deviance: 2305.8  on 651  degrees of freedom
## AIC: 3074.9
## 
## Number of Fisher Scoring iterations: 7

Almost all regressors are highly significant, but the coefficient estimates are not very meaningful.

We take the exponential of a coefficient (the link function is simply a logarithm) to compute the count ratio.

We then deduce the percentage change in the expected count for a change in the independent variable. For example, 1 % increase in quality changes trips by 1.602758 %, controlling for all other variables.

The count ratio for all coefficients.

Variables	exp(Coeff.)
quality	1.6027580
ski	1.5192453
income	0.8946496
userfee	2.4550945
costC	0.9965762
costS	0.9583556
costH	1.0367944

Poisson assumption and dispersion

The Poisson distribution has the property that the variance equals the mean (‘equidispersion’).

However, the PRM is often plagued by overdispersion, meaning that the variance is larger than the linear predictor permits.

\[var(y|x) = \mu + \alpha \times h(\mu)\] Where \(h\) is a positive function of \(\mu\). Overdispersion (or underdispersion) corresponds to \(\alpha > 0\) (or \(\alpha < 0\)).

The reparameterization.

\[var(y|x) = (1 + \alpha) \times \mu = dispersion \times \mu\] The PRM becomes the quasi-Poisson model with a dispersion parameter.

Dispersion test

We test for overdispersion.

\(H_0:\alpha\) = 0 or no overdispersion.

By default, the above parameterization is used. Alternatively, if the argument trafo is specified, the test is formulated in terms of the parameter \(\alpha\).

## $estimate
##    alpha 
## 1.316051 
## 
## $p.value
## [1] 0.001651024

\(H_0\) is rejected (the p-value is less than 5%).

The PRM for trips data is not well specified in that there appears to be a substantial amount of overdispersion.

One possible remedy is to consider a more flexible distribution that does not impose equality of mean and variance.

The negative binomial regression model (NBRM)

The NBRM is considered a mixture distribution arising from a Poisson distribution with random scale, the latter following a gamma distribution.

\[var(y;\mu;\theta) = \mu + \frac{1}{\theta}\mu^2;~\mu > 0,~\theta > 0\]

Where \(h(\mu) = \mu^2\) and \(\alpha = 1/\theta\) (from the previous form above). \(\mu\) is the mean.

The variance is always larger than the mean, in contrast to the Poisson distribution in which mean equals variance.

As \(\theta\) tends to infinity, \(\mu\) stays constant, the NBRM approaches the PRM.

The geometric distribution is the special case where \(\theta = 1\).

We could use a geometric regression via family = negative.binomial(theta = 1) in the glm function.

For unknown \(\theta\), the function glm.nb is available.

## 
## Call:
## glm.nb(formula = trips ~ ., data = RecreationDemand, init.theta = 0.7292568331, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9727  -0.6256  -0.4619  -0.2897   5.0494  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.121936   0.214303  -5.235 1.65e-07 ***
## quality      0.721999   0.040117  17.998  < 2e-16 ***
## skiyes       0.612139   0.150303   4.073 4.65e-05 ***
## income      -0.026059   0.042453  -0.614    0.539    
## userfeeyes   0.669168   0.353021   1.896    0.058 .  
## costC        0.048009   0.009185   5.227 1.72e-07 ***
## costS       -0.092691   0.006653 -13.931  < 2e-16 ***
## costH        0.038836   0.007751   5.011 5.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.7293) family taken to be 1)
## 
##     Null deviance: 1244.61  on 658  degrees of freedom
## Residual deviance:  425.42  on 651  degrees of freedom
## AIC: 1669.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.7293 
##           Std. Err.:  0.0747 
## 
##  2 x log-likelihood:  -1651.1150

With equidispersion, \(\theta\) should remain close to 1.

The shape parameter of the fitted negative binomial distribution is Theta or \(\theta\) = 0.7292568, suggesting a considerable amount of overdispersion and confirming the results from the overdispersion test.

We note that the negative binomial model represents a substantial improvement in terms of log-likelihood (logLik).

PRM	NBRM
-1529.431	-825.5576

The NBRM is closer to 0 (better model).

PRM – Robust standard errors

Instead of switching to a more general family of distributions, it is also possible to use the Poisson estimates along with a set of standard errors estimated under less restrictive assumptions: the sandwich variance estimators for GLMs.

The resulting estimates are often called Huber-White standard errors (from the PRM).

The sandwich estimator is based on first and second derivatives of the likelihood: the outer product of gradients (OPG) forms the “meat” and the second derivatives the “bread” of the sandwich. Both are asymptotically equivalent under correct specification.

Let us compare them: the Poisson (first row) results with the H-W results (second row).

The third row is Poisson/H-W.

(Intercept)	quality	skiyes	income	userfeeyes	costC	costS	costH
0.094	0.017	0.057	0.020	0.079	0.003	0.002	0.003
0.432	0.049	0.194	0.050	0.247	0.015	0.012	0.009
0.217	0.350	0.295	0.389	0.320	0.212	0.142	0.289

The third row is far from being equal to 1. Such ratio would indicate equality. However, the result indicates that the simple PRM is inadequate.

Dispersion, again

A typical problem with count data regressions arising in microeconometric and marketing applications is that the number of zeros is often much larger than a Poisson or negative binomial regression permit. We can visualize it with a histogram.

The RecreationDemand data contain a large number of zeros; in fact, no less than 63.2776935 % of all respondents report no trips to Lake Somerville, making 0 the ‘most frequent value’.

The actual count from RecreationDemand dataset (659 observations) vs. the expected count from the Poisson distribution.

The PRM misses the actuals by a great deal.

The Poisson regression fitted above only provides 42.0333839 % of zero observations, again suggesting that this model is not satisfactory`the PRM does fit the actual data distribution.

There a number of very high values (more than 10 trips).

This again underlines that there are problems with the previously considered specification. In fact, the variance-to-mean ratio for the trips variable equals 17.6425026, a value usually too large to be accommodated by covariates (a value confirming the dispersion \(\theta\) form above).

The actual count from RecreationDemand dataset (659 observations) vs. the expected count from the Poisson distribution (PRM) and the NBRM count (we estimated above).

We see an improvement with the NBRM. Can we do better?

Alternative models are needed.

Zero-inflated Poisson model (ZIPM)

One such model is the ZIPM, which suggests a mixture specification with a Poisson count component and an additional point mass at zero.

We consider a regression of trips on all further variables for the count part (using a negative binomial distribution) and model the inflation part as a function of quality and income.

\[f_{zeroinfl}(y) = p \times I_{\{0\}}(y) + (1-p) \times f_{count}(y; \mu)\] We fit zero-inflated regression models for count data via maximum likelihood; the zeroinfl function allows for Poisson, geometric, and negative binomial distributions for the count component \(f_{count}(y)\).

We employ all available regressors, while the inflation part, separated by |, only uses quality and income. The inflation part by default uses the logit model, but all other links available in glm may also be used upon setting the argument link appropriately.

## 
## Call:
## zeroinfl(formula = trips ~ . | quality + income, data = RecreationDemand, 
##     dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08885 -0.20037 -0.05696 -0.04509 40.01393 
## 
## Count model coefficients (negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.096634   0.256679   4.272 1.93e-05 ***
## quality      0.168911   0.053032   3.185 0.001447 ** 
## skiyes       0.500694   0.134488   3.723 0.000197 ***
## income      -0.069268   0.043800  -1.581 0.113775    
## userfeeyes   0.542786   0.282801   1.919 0.054944 .  
## costC        0.040445   0.014520   2.785 0.005345 ** 
## costS       -0.066206   0.007745  -8.548  < 2e-16 ***
## costH        0.020596   0.010233   2.013 0.044146 *  
## Log(theta)   0.190175   0.112989   1.683 0.092352 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   5.7427     1.5561   3.691 0.000224 ***
## quality      -8.3074     3.6816  -2.256 0.024041 *  
## income       -0.2585     0.2821  -0.916 0.359504    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 1.2095 
## Number of iterations in BFGS optimization: 26 
## Log-likelihood:  -722 on 12 Df

We can compare the results with the other models; let us compare log-likelihood (logLik)

PRM	NBRM	ZIPM
-1529.431	-825.5576	-721.9525

We have a clearly improved log-likelihood compared with the two previous models.

The Observed count from RecreationDemand dataset (659 observations) vs. the expected count from a Poisson distribution.

The actual count from RecreationDemand dataset vs. the expected count from ZIPM.

The expected trip counts shows that both the zeros and the remaining counts are captured much better with the ZIPM than in the previous models (especially for ‘0’ trips).

The negative binomial hurdle model (NBHM)

Another model that is able to capture excessively large (or small) numbers of zeros (or low values) is the NBHM. It is more widely used than the ZIPM.

The NBHM consists of two parts (hence it is also called a ‘two-part model’):

A binary part (given by a count distribution right-censored at y = 1): is y equal to zero or positive? “Is the hurdle crossed?”"
A count part (given by a count distribution left-truncated at y = 1): if y > 0, how large is y?

The NBHM is somewhat easier to fit than the ZIPM because the resulting likelihood decomposes into two parts that may be maximized separately.

Here it is possible to specify either:

A count distribution right-censored at one or,
A Bernoulli distribution distinguishing between zeros and non-zeros (which is equivalent to the right-censored geometric distribution).

The first variant appears to be more common; however, the Bernoulli specification is more intuitively interpretable as a hurdle. We employ the latter, which is equivalent to the geometric distribution.

## 
## Call:
## hurdle(formula = trips ~ . | quality + income, data = RecreationDemand, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6097 -0.2071 -0.1845 -0.1642 12.1114 
## 
## Count model coefficients (truncated negbin with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.84193    0.38278   2.200  0.02784 *  
## quality      0.17170    0.07234   2.374  0.01762 *  
## skiyes       0.62236    0.19013   3.273  0.00106 ** 
## income      -0.05709    0.06452  -0.885  0.37629    
## userfeeyes   0.57634    0.38508   1.497  0.13448    
## costC        0.05707    0.02169   2.632  0.00850 ** 
## costS       -0.07752    0.01155  -6.713  1.9e-11 ***
## costH        0.01237    0.01490   0.830  0.40640    
## Log(theta)  -0.53031    0.26114  -2.031  0.04228 *  
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76631    0.36230  -7.635 2.25e-14 ***
## quality      1.50291    0.10032  14.981  < 2e-16 ***
## income      -0.04467    0.07853  -0.569    0.569    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 0.5884
## Number of iterations in BFGS optimization: 18 
## Log-likelihood: -765.1 on 12 Df

The actual count from RecreationDemand dataset vs. the expected count from NBHM.

Like the ZIPM, the NBHM results in a considerable improvement over the initial Poisson specification.

We compare the results with the other models; let us compare log-likelihood (logLik)

PRM	NBRM	ZIPM	NBHM
-1529.431	-825.5576	-721.9525	-765.0984

Overall, the best model remains the ZIPM (lowest log-likelihood).

…and Counting

Modeling count data

Case 1, Counting TV news stories

The Poisson regression model (PRM)

The negative binomial regression model (NBRM)

The two models

Visualization

Case 2, Counting recreational boating trips

Poisson regression model (PRM)

Poisson assumption and dispersion

Dispersion test

The negative binomial regression model (NBRM)

PRM – Robust standard errors

Dispersion, again

Zero-inflated Poisson model (ZIPM)

The negative binomial hurdle model (NBHM)