…and Counting

(Tufte Handout)

Foreword

Modeling count data

In many cases, the dependent variable is of the count type: for example, the number of patents granted to a firm, the number of visits to the doctor, or the number of jobs held in a year.

Sometimes, count data also cover rare occurrences, such as earthquakes, plane crashes, or bank failures.

All these cases are discrete, non-negative integers (count values) measured over a certain finite time period.

The Poisson and negative binomial distributions take into account this unique feature of count data.
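A quick simulation (a sketch using base R's rpois and rnbinom) illustrates the key difference between the two distributions: the Poisson is equidispersed, while the negative binomial allows the variance to exceed the mean.

```r
set.seed(123)

# Poisson: mean equals variance ("equidispersion")
x_pois <- rpois(10000, lambda = 4)
c(mean = mean(x_pois), var = var(x_pois))  # both close to 4

# Negative binomial: variance exceeds the mean ("overdispersion")
# here var = mu + mu^2 / size = 4 + 16/1 = 20
x_nb <- rnbinom(10000, size = 1, mu = 4)
c(mean = mean(x_nb), var = var(x_nb))
```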

Patents and R&D expenditures

What is the relationship between the number of patents and the expenditures in R&D by manufacturing firms?

Industrial organizations and government bodies are interested in predicting if the money they invest will yield the benefits they expect.

The dataset contains the number of patents received by a sample of 181 international manufacturing firms and the amount of their R&D expenditure.

p90 and p91 are the numbers of patents for 1990 and 1991, and lr90 and lr91 are the log(R&D expenditures); the logarithm smooths out disparities among the industries. Dummy variables represent five major industries: aerospace, chemistry, computers, machines and instruments, and motor vehicles; food, fuel, metal and others form the reference category. Dummy variables for Japan and the US are also included, with the EU as the comparison group.

## 'data.frame':    181 obs. of  11 variables:
##  $ p91     : int  55 67 55 83 0 4 7 0 0 96 ...
##  $ p90     : int  80 46 42 102 1 11 55 0 1 67 ...
##  $ lr91    : num  6.29 5.15 4.17 6.13 4.87 ...
##  $ lr90    : num  6.16 5.14 4.1 6.15 4.82 ...
##  $ aerosp  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ chemist : int  0 0 1 1 0 0 0 0 0 1 ...
##  $ computer: int  0 0 0 0 0 0 0 0 1 0 ...
##  $ machines: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ vehicles: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ japan   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ us      : int  1 1 1 0 0 0 1 1 1 1 ...

Our objective is to determine the influence of R&D expenditure, industry category and country on the mean (average) number of patents received by the 181 firms.

The OLS model

lm(formula = p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us, data = r_d)

Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

| Variable    | Coeff.    | p-value  | Codes |
|-------------|-----------|----------|-------|
| (Intercept) | -250.8386 | 1.12e-05 | ***   |
| lr90        | 73.1720   | 1.34e-16 | ***   |
| aerosp      | -44.1620  | 0.2171   |       |
| chemist     | 47.0812   | 0.0779   | .     |
| computer    | 33.8564   | 0.2244   |       |
| machines    | 34.3794   | 0.2181   |       |
| vehicles    | -191.7903 | 4.98e-07 | ***   |
| japan       | 26.2385   | 0.5222   |       |
| us          | -76.8539  | 0.0080   | **    |

As expected, there is a positive relationship between the number of patents received (p90) and R&D expenditure (lr90), which is highly statistically significant.

On the other hand, the \(R^2\) (0.4729) and adjusted \(R^2\) (0.4484) are rather weak, and many coefficients are not significant.

Since lr90 is in logarithmic form and the patent variable is in linear form, the R&D coefficient of 73.1720159 suggests that if we increase R&D expenditure by 1 %, the average number of patents received will increase by about 0.73 (that is, 73.17/100) patents, ceteris paribus: in a lin-log model, the slope must be divided by 100 to obtain the effect of a 1 % change in the regressor. However, we should not rely on that result.

Despite these results, OLS regression is not appropriate: the number of patents granted per firm per year is a small, non-negative count, while OLS assumes a continuous (and ideally normally distributed) dependent variable.

This histogram shows the highly skewed distribution of the patent data, which can be confirmed by the coefficients of skewness and kurtosis.

A normally distributed variable should have a skewness of 0 and a kurtosis of 3.

p90 has a skewness of:

## [1] 3.264074

and a kurtosis of:

## [1] 14.44027
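For reference, the two statistics can be computed by hand; a sketch using the central-moment convention, under which a normal variable has kurtosis 3 (not excess kurtosis 0), consistent with the values above:

```r
# Sample skewness: third central moment over the 3/2 power of the second
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Sample kurtosis: fourth central moment over the squared second moment
# (equals 3 for a normal distribution)
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}
```

Applied to r_d$p90, functions of this form return the two values printed above.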

We run a Jarque-Bera (normality) test where:

\(H_0\): observations are normally distributed.

## 
##  Jarque-Bera Normality Test
## 
## data:  r_d$p90
## JB = 1308.5, p-value < 2.2e-16
## alternative hypothesis: greater

The test rejects the \(H_0\); the p90 data distribution is nonnormal.

We recall that in large samples the J-B statistic follows the chi-square distribution with 2 degrees of freedom.
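Concretely, the statistic combines the two sample moments reported above:

\[
JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right) = \frac{181}{6}\left(3.264^2 + \frac{(14.440 - 3)^2}{4}\right) \approx 1308.5,
\]

where \(S\) is the skewness and \(K\) the kurtosis, which reproduces the JB value in the test output.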

It is clear from the histogram that a preponderance of firms received fewer than 200 patents (actually much fewer than this number). In other words, the data follow a Poisson-like distribution: the Poisson distribution is skewed when \(\lambda\), the expected number of occurrences, is small, and its mean equals its variance.

The Poisson regression model (PRM)

The Poisson regression model is nonlinear, necessitating estimation by the method of maximum likelihood (ML).

glm(formula = p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us, family = poisson(link = log), data = r_d)

Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

| Variable    | Coeff.  | p-value   | Codes |
|-------------|---------|-----------|-------|
| (Intercept) | -0.7458 | 3.41e-33  | ***   |
| lr90        | 0.8651  | 0         | ***   |
| aerosp      | -0.7965 | 9.85e-32  | ***   |
| chemist     | 0.7748  | 4.64e-246 | ***   |
| computer    | 0.4689  | 1.99e-85  | ***   |
| machines    | 0.6464  | 8.88e-65  | ***   |
| vehicles    | -1.5056 | 0         | ***   |
| japan       | -0.0039 | 0.8848    |       |
| us          | -0.4189 | 1.52e-73  | ***   |

This time, all variables except japan are highly statistically significant.

The log-likelihood and the AIC replace the \(R^2\) as measures of goodness-of-fit. They are only interpretable relative to comparable models; we have no comparable model yet, but one is coming.

The lr90 coefficient of 0.8651492 suggests that if R&D expenditure increases by 1 %, the average number of patents granted to a firm will increase by about 0.87 %. Note that R&D expenditure is expressed in logarithmic form, and the Poisson model uses a log link, so the coefficient is directly an elasticity: the elasticity of patents granted with respect to R&D expenditure is about 0.87.

We must apply exp(coefficients) to interpret the other coefficients (we leave out the exponentiated lr90, which is already interpreted as an elasticity, and the exponentiated intercept).

| Variable    | Coeff.     | exp(Coeff.) |
|-------------|------------|-------------|
| (Intercept) | -0.7458491 | 0.4743314   |
| lr90        | 0.8651492  | 2.3753604   |
| aerosp      | -0.7965379 | 0.4508873   |
| chemist     | 0.7747515  | 2.1700527   |
| computer    | 0.4688938  | 1.5982253   |
| machines    | 0.6463831  | 1.9086251   |
| vehicles    | -1.5056409 | 0.2218751   |
| japan       | -0.0038934 | 0.9961142   |
| us          | -0.4189376 | 0.6577452   |

Coefficients above 0 (the baseline) and exponentiated coefficients above 1 (the exponentiated baseline) increase p90, and vice versa. The further away from 1 the exponentiated coefficient is, the larger the effect on p90. The percentage effect on p90 is ((exp(Coeff.) − 1) × 100).

What is the interpretation of the machines dummy coefficient of 0.6463831 and its exponentiated value of 1.9086251? For a dummy in a semi-log model, the average number of patents in the machines industry is higher by about 90.86 % than in the comparison category (food, fuel, metal and others, the reference category = 0 %).
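As a check on the arithmetic, using the machines coefficient from the table above:

```r
b_machines <- 0.6463831       # Poisson coefficient for machines (from the table)
exp(b_machines)               # multiplicative effect on the expected count
(exp(b_machines) - 1) * 100   # percentage effect, about +90.86 %
```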

Limitation of the PRM

The Poisson regression results should not be accepted at face value. They are valid only if the assumption (mean equals variance or ‘equidispersion’) of the Poisson distribution underlying the estimated model is correct.

We test for overdispersion:

\(H_0:\alpha\) = 0 or no overdispersion.

## 
##  Overdispersion test
## 
## data:  patents_poisson
## z = 3.7991, p-value = 7.261e-05
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##    58.8568

\(H_0\) is rejected (the p-value is far below the 1 % significance level).

The PRM for patents is not well specified in that there appears to be a substantial amount of overdispersion.

If there is overdispersion, the variance of the count data, p90, is wider than the Poisson model assumes given the covariates. The PRM estimates are therefore inefficient, with standard errors that are biased downward.
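A quick diagnostic that works for any fitted Poisson GLM is the Pearson dispersion statistic, which should be close to 1 under equidispersion. A self-contained sketch on simulated data (the names below are illustrative, not from the patent model):

```r
set.seed(42)

# Simulate counts that are overdispersed by construction
x <- rnorm(500)
mu <- exp(0.5 + 0.8 * x)
y <- rnbinom(500, size = 1, mu = mu)   # var = mu + mu^2 > mu

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: ~1 under equidispersion, well above 1 here
dispersion <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
dispersion
```

The formal test reported above (alternative hypothesis: “true dispersion is greater than 1”) can be run with dispersiontest() from the AER package, assuming that package is installed.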

One possible remedy is to consider a more flexible distribution that does not impose ‘equality of mean and variance’.

The alternative is the negative binomial regression model.

The negative binomial regression model (NBRM)

NBRM directly corrects for overdispersion (allowing the conditional variance to differ from the conditional mean).

glm.nb(formula = p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us, data = r_d, init.theta = 0.777306951, link = log)

Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

| Variable    | Coeff.  | p-value  | Codes |
|-------------|---------|----------|-------|
| (Intercept) | -0.4072 | 0.4661   |       |
| lr90        | 0.8672  | 3.70e-27 | ***   |
| aerosp      | -0.8744 | 0.0165   | *     |
| chemist     | 0.6662  | 0.0122   | *     |
| computer    | -0.1321 | 0.6365   |       |
| machines    | 0.0082  | 0.9769   |       |
| vehicles    | -1.5151 | 3.62e-05 | ***   |
| japan       | 0.1210  | 0.7663   |       |
| us          | -0.6914 | 0.0155   | *     |

If we compare the results of the NBRM given with those of the PRM, we see the differences in the estimated standard errors and p-values.

| Variable    | Coeff.     | exp(Coeff.) |
|-------------|------------|-------------|
| (Intercept) | -0.4072404 | 0.6654842   |
| lr90        | 0.8671738  | 2.3801746   |
| aerosp      | -0.8744358 | 0.4170973   |
| chemist     | 0.6661909  | 1.9468076   |
| computer    | -0.1320571 | 0.8762910   |
| machines    | 0.0081708  | 1.0082043   |
| vehicles    | -1.5150829 | 0.2197900   |
| japan       | 0.1210040  | 1.1286295   |
| us          | -0.6914134 | 0.5008676   |

A closer look shows that the original PRM standard errors are underestimated in several cases; thus, the PRM p-values, and the inferences drawn from them, are unreliable.

Shown side-by-side, the coefficients and exponentiated coefficients of the two models are slightly different; the NB coefficients are the more reliable ones.

The log-likelihood:

|                | Poisson   | NB        |
|----------------|-----------|-----------|
| log-likelihood | -5081.331 | -835.4504 |

The log(likelihood) is clearly improved with the NBRM (getting closer to zero).

The AIC:

|     | Poisson  | NB       |
|-----|----------|----------|
| AIC | 10180.66 | 1690.901 |

The same goes for the AIC (the lower, the better).
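The same comparison can be reproduced on any pair of fitted models with logLik() and AIC(); a sketch on simulated overdispersed data (glm.nb comes from the MASS package, which ships with R; the names are illustrative):

```r
library(MASS)  # for glm.nb

set.seed(42)
x <- rnorm(500)
y <- rnbinom(500, size = 1, mu = exp(0.5 + 0.8 * x))  # overdispersed counts

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# On overdispersed data, NB should beat Poisson on both criteria
c(logLik_poisson = as.numeric(logLik(fit_pois)),
  logLik_nb      = as.numeric(logLik(fit_nb)))
c(AIC_poisson = AIC(fit_pois), AIC_nb = AIC(fit_nb))
```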

Forecast

Here are the statistics for the variable of interest (lr90).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.150   4.350   5.180   5.354   6.280   8.700

We create hypothetical values for our predictor of interest.

lr90 <- seq(3, 9, by = 0.10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     4.5     6.0     6.0     7.5     9.0

We create a matrix of hypothetical data values, setting the indicator variables to zero and lr90, our continuous predictor of interest, to its range of hypothetical values.

p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us

inputs <- cbind(lr90, 0, 0, 0, 0, 0, 0, 0)

We name the columns of the matrix after the variables in our model.

colnames(inputs) <-   c("lr90",
                        "aerosp",
                        "chemist",
                        "computer",
                        "machines",
                        "vehicles",
                        "japan",
                        "us")

We convert the matrix of hypothetical values into a data frame.

inputs <- as.data.frame(inputs)

Once we have the data frame of predictors in place, we can use the predict() function to forecast the expected counts for the Poisson and negative binomial models. We then build the objects for plotting the results.

forecast_poisson <- predict(patents_poisson,
                            newdata = inputs,
                            type = "response")
forecast_poisson <- data.frame(model = 'Poisson',
                               lr90 = inputs$lr90,
                               p90 = forecast_poisson)

forecast_nb <- predict(patents_nb,
                       newdata = inputs,
                       type = "response")
forecast_nb <- data.frame(model = 'NB',
                          lr90 = inputs$lr90,
                          p90 = forecast_nb)

forecast_both <- rbind(forecast_poisson, forecast_nb)


actual <- data.frame(model = 'actual',
                     lr90 = r_d$lr90,
                     p90 = r_d$p90)
# 2 rows of anchor values (one actual observation, duplicated)
# so that all 3 levels (actual, Poisson, NB) appear in the plot
actual_p <- data.frame(model = 'Poisson',
                       lr90 = 6.16,
                       p90 = 80)
actual_nb <- data.frame(model = 'NB',
                        lr90 = 6.16,
                        p90 = 80)

forecast_actual <- rbind(actual_p, actual_nb, actual)

The two forecast lines differ.

Given the overdispersion results, we should rely on the NB forecast; the Poisson forecast is shown for comparison.

Conclusion

An advantage of the NBRM is that it allows for overdispersion. It also provides a direct estimate of the extent of the overdispersion.

Even though the NBRM corrects this flaw of the PRM, it is not perfect either. There are other models besides the PRM and NBRM, such as zero-inflated and hurdle models for counts with an excess of zeros.