Foreword
In many cases, the dependent variable is of the count type:
Sometimes, count data also include rare occurrences, such as:
All these cases involve discrete, non-negative integers (count values) measured over a certain finite time period. The Poisson and negative binomial distributions account for this unique feature of count data.
What is the relationship between the number of patents and the expenditures in R&D by manufacturing firms?
Industrial organizations and government bodies are interested in predicting if the money they invest will yield the benefits they expect.
The dataset contains the number of patents received by a sample of 181 international manufacturing firms and the amount of their R&D expenditure.
p90 and p91 are the numbers of patents for 1990 and 1991, and lr90 and lr91 are the logs of R&D expenditures; the logarithm smooths out disparities among the industries. Dummy variables represent five major industries: aerospace, chemistry, computers, machines and instruments, and motor vehicles; food, fuel, metal and others form the reference category. Dummy variables also mark Japan and the US, with the EU as the comparison group.
## 'data.frame': 181 obs. of 11 variables:
## $ p91 : int 55 67 55 83 0 4 7 0 0 96 ...
## $ p90 : int 80 46 42 102 1 11 55 0 1 67 ...
## $ lr91 : num 6.29 5.15 4.17 6.13 4.87 ...
## $ lr90 : num 6.16 5.14 4.1 6.15 4.82 ...
## $ aerosp : int 0 0 0 0 0 0 1 0 0 0 ...
## $ chemist : int 0 0 1 1 0 0 0 0 0 1 ...
## $ computer: int 0 0 0 0 0 0 0 0 1 0 ...
## $ machines: int 0 0 0 0 0 0 0 0 0 0 ...
## $ vehicles: int 0 0 0 0 0 0 0 0 0 0 ...
## $ japan : int 0 0 0 0 0 0 0 0 0 0 ...
## $ us : int 1 1 1 0 0 0 1 1 1 1 ...
Our objective is to determine the influence of R&D, industry category and the two countries on the average number of patents received by the 181 firms.
lm(formula = p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us, data = r_d)
Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
| | Coeff. | p-value | Codes |
|---|---|---|---|
| (Intercept) | -250.839 | 1.12e-05 | *** |
| lr90 | 73.172 | 1.34e-16 | *** |
| aerosp | -44.162 | 0.217 | |
| chemist | 47.081 | 0.078 | . |
| computer | 33.856 | 0.224 | |
| machines | 34.379 | 0.218 | |
| vehicles | -191.790 | 4.98e-07 | *** |
| japan | 26.239 | 0.522 | |
| us | -76.854 | 0.008 | ** |
As expected, there is a positive relationship between the number of patents received (p90) and R&D expenditure (lr90), and it is highly statistically significant.
On the other hand, the \(R^2\) (0.473) and adjusted \(R^2\) (0.448) are rather weak, and many coefficients are not significant.
Since lr90 is in logarithmic form while the patent variable is linear (a lin-log model), the R&D coefficient of 73.17 suggests that if we increase R&D expenditure by 1%, the average number of patents received will increase by about 0.73 (73.17/100), ceteris paribus. However, we should not rely on that result.
Despite these results, the OLS regression is not appropriate because the number of patents granted per firm per year is usually small.
This histogram shows the highly skewed distribution of the patent data, which can be confirmed by the coefficients of skewness and kurtosis.
A normally distributed variable should have a skewness of 0 and a kurtosis of 3. For p90, the skewness is:

## [1] 3.264074

and the kurtosis is:

## [1] 14.44027
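The two statistics above can be computed with, for example, the moments package (a sketch; it assumes the r_d data frame is loaded and moments is installed):

```r
# Sketch: sample skewness and kurtosis of the 1990 patent counts.
# Assumes r_d is in the workspace and the 'moments' package is installed.
library(moments)

skewness(r_d$p90)  # positive: long right tail
kurtosis(r_d$p90)  # far above the normal distribution's 3
```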
We run a Jarque-Bera (normality) test where:
\(H_0\): observations are normally distributed.
##
## Jarque-Bera Normality Test
##
## data: r_d$p90
## JB = 1308.5, p-value < 2.2e-16
## alternative hypothesis: greater
The test rejects the \(H_0\); the p90 data distribution is nonnormal.
We recall that in large samples the J-B statistic follows the chi-square distribution with 2 degrees of freedom.
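The output above can be produced with, for instance, moments::jarque.test() (an assumption on our part; tseries::jarque.bera.test() is an equivalent alternative):

```r
# Sketch: Jarque-Bera normality test on the 1990 patent counts.
# Assumes r_d is loaded and the 'moments' package is installed.
library(moments)

jarque.test(r_d$p90)
```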
It is clear from the histogram that a preponderance of firms received fewer than 200 patents, and most far fewer than that. In other words, the data follow a Poisson-like distribution: the Poisson distribution is skewed when \(\lambda\), the expected number of occurrences, is small, and its mean equals its variance.
The Poisson regression model (PRM) is nonlinear, necessitating estimation by maximum likelihood (ML).
glm(formula = p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us, family = poisson(link = log), data = r_d)
Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
| | Coeff. | p-value | Codes |
|---|---|---|---|
| (Intercept) | -0.746 | 3.41e-33 | *** |
| lr90 | 0.865 | 0 | *** |
| aerosp | -0.797 | 9.85e-32 | *** |
| chemist | 0.775 | 4.64e-246 | *** |
| computer | 0.469 | 1.99e-85 | *** |
| machines | 0.646 | 8.88e-65 | *** |
| vehicles | -1.506 | 0 | *** |
| japan | -0.004 | 0.885 | |
| us | -0.419 | 1.52e-73 | *** |
This time, all variables except japan are highly statistically significant.
The log-likelihood and the AIC replace the \(R^2\) as measures of goodness-of-fit. They are interpretable only in relation to comparable models; we have no comparable model yet, but one is coming.
The lr90 coefficient of 0.865 suggests that if R&D expenditure increases by 1%, the average number of patents granted to a firm will increase by about 0.87%. Note that R&D expenditure is expressed in logarithmic form, as is the link function; the elasticity of patents granted with respect to R&D expenditure is therefore about 0.87.
To interpret the other coefficients, we must exponentiate them (we rule out the exponentiated lr90 coefficient, since that variable is already log-transformed, as well as the exponentiated intercept).
| | Coeff. | exp(Coeff.) |
|---|---|---|
| (Intercept) | -0.7458491 | 0.4743314 |
| lr90 | 0.8651492 | 2.3753604 |
| aerosp | -0.7965379 | 0.4508873 |
| chemist | 0.7747515 | 2.1700527 |
| computer | 0.4688938 | 1.5982253 |
| machines | 0.6463831 | 1.9086251 |
| vehicles | -1.5056409 | 0.2218751 |
| japan | -0.0038934 | 0.9961142 |
| us | -0.4189376 | 0.6577452 |
Coefficients above 0 (the baseline) and exponentiated coefficients above 1 (the exponentiated baseline) increase p90, and vice versa. The further away from 1 an exponentiated coefficient is, the larger its effect on p90. The percentage effect on p90 is ((exp(Coeff.) − 1) * 100).
What is the interpretation of the machines dummy coefficient of 0.6463831 and its exponentiated value of 1.9086251? For a dummy in a semi-log model: the average number of patents in the machines industry is higher by about 90.86% than in the comparison category (food, fuel, metal and others, the reference category = 0%).
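We can check this percentage-effect arithmetic in base R (the coefficient is copied from the table above):

```r
# Percentage effect of the machines dummy on the expected patent count:
# (exp(coefficient) - 1) * 100.
b_machines <- 0.6463831             # coefficient from the Poisson fit
pct_effect <- (exp(b_machines) - 1) * 100
round(pct_effect, 2)                # about 90.86
```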
The Poisson regression results should not be accepted at face value. They are valid only if the assumption (mean equals variance or ‘equidispersion’) of the Poisson distribution underlying the estimated model is correct.
We test for overdispersion:
\(H_0:\alpha\) = 0 or no overdispersion.
##
## Overdispersion test
##
## data: patents_poisson
## z = 3.7991, p-value = 7.261e-05
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 58.8568
\(H_0\) is rejected (the p-value is well below the 1% level).
The PRM for patents is not well specified in that there appears to be a substantial amount of overdispersion.
If there is overdispersion, the variance of the count data, p90, is larger than the Poisson model implies given the covariates. The PRM estimates are therefore inefficient, with standard errors that are biased downward.
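The test output shown above matches what AER::dispersiontest() produces; a sketch, assuming the AER package is installed and the fitted patents_poisson object exists:

```r
# Sketch: Cameron-Trivedi overdispersion test on the Poisson fit.
# H0: dispersion = 1 (equidispersion).
library(AER)

dispersiontest(patents_poisson)
```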
One possible remedy is to consider a more flexible distribution that does not impose ‘equality of mean and variance’.
The alternative is the negative binomial regression model.
NBRM directly corrects for overdispersion (allowing the conditional variance to differ from the conditional mean).
glm.nb(formula = p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us, data = r_d, init.theta = 0.777306951, link = log)
Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
| | Coeff. | p-value | Codes |
|---|---|---|---|
| (Intercept) | -0.407 | 0.466 | |
| lr90 | 0.867 | 3.70e-27 | *** |
| aerosp | -0.874 | 0.017 | * |
| chemist | 0.666 | 0.012 | * |
| computer | -0.132 | 0.636 | |
| machines | 0.008 | 0.977 | |
| vehicles | -1.515 | 3.62e-05 | *** |
| japan | 0.121 | 0.766 | |
| us | -0.691 | 0.016 | * |
If we compare the results of the NBRM with those of the PRM, we see differences in the estimated standard errors and p-values.
| | Coeff. | exp(Coeff.) |
|---|---|---|
| (Intercept) | -0.4072404 | 0.6654842 |
| lr90 | 0.8671738 | 2.3801746 |
| aerosp | -0.8744358 | 0.4170973 |
| chemist | 0.6661909 | 1.9468076 |
| computer | -0.1320571 | 0.8762910 |
| machines | 0.0081708 | 1.0082043 |
| vehicles | -1.5150829 | 0.2197900 |
| japan | 0.1210040 | 1.1286295 |
| us | -0.6914134 | 0.5008676 |
A closer look shows that the original PRM standard errors are underestimated in several cases; the corresponding p-values are therefore unreliable.
Let us show the coefficients side-by-side.
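One simple way to do this, assuming both fitted model objects are in the workspace:

```r
# Sketch: coefficients of the Poisson and NB fits side by side.
round(cbind(Poisson = coef(patents_poisson),
            NB      = coef(patents_nb)), 4)
```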
The coefficients and exponentiated coefficients are slightly different; the NB coefficients are more reliable.
The log(likelihood).
| Poisson | NB |
|---|---|
| -5081.331 | -835.4504 |
The log(likelihood) is clearly improved with the NBRM (getting closer to zero).
The AIC.
| Poisson | NB |
|---|---|
| 10180.66 | 1690.901 |
The same goes for the AIC (the lower, the better).
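Both statistics can be pulled directly from the fitted objects (a sketch, assuming patents_poisson and patents_nb exist):

```r
# Sketch: goodness-of-fit statistics for the two models.
logLik(patents_poisson)
logLik(patents_nb)
AIC(patents_poisson, patents_nb)  # the lower, the better
```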
Here are the statistics for the variable of interest (lr90).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.150 4.350 5.180 5.354 6.280 8.700
We create hypothetical values for our predictor of interest.
lr90 <- seq(3, 9, by = 0.10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     3.0     4.5     6.0     6.0     7.5     9.0
We create a matrix of hypothetical data values, setting the indicator variables to zero and lr90, the continuous predictor of interest, to its range of hypothetical values.
p90 ~ lr90 + aerosp + chemist + computer + machines + vehicles + japan + us
inputs <- cbind(lr90, 0, 0, 0, 0, 0, 0, 0)

We name the columns of the matrix after the variables in our model.
colnames(inputs) <- c("lr90",
"aerosp",
"chemist",
"computer",
"machines",
"vehicles",
"japan",
"us")

We convert the matrix of hypothetical values into a data frame.
inputs <- as.data.frame(inputs)

Once we have the data frame of predictors in place, we can use the predict command to forecast the expected counts for the Poisson and negative binomial models. We then build the objects for plotting the results.
forecast_poisson <- predict(patents_poisson,
newdata = inputs,
type = "response")
forecast_poisson <- data.frame(model = 'Poisson',
lr90 = inputs$lr90,
p90 = forecast_poisson)
forecast_nb <- predict(patents_nb,
newdata = inputs,
type = "response")
forecast_nb <- data.frame(model = 'NB',
lr90 = inputs$lr90,
p90 = forecast_nb)
forecast_both <- rbind(forecast_poisson, forecast_nb)
actual <- data.frame(model = 'actual',
lr90 = r_d$lr90,
p90 = r_d$p90)
# 2 sets of fake values
# for generating 3 levels (actual, Poisson, NB)
actual_p <- data.frame(model = 'Poisson',
lr90 = 6.16,
p90 = 80)
actual_nb <- data.frame(model = 'NB',
lr90 = 6.16,
p90 = 80)
forecast_actual <- rbind(actual_p, actual_nb, actual)

The two forecast lines differ.
Given the results, we should rely on the NB forecast; the Poisson forecast is shown for comparison.
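The comparison plot can be drawn with, for example, ggplot2 (a sketch using the objects built above):

```r
# Sketch: predicted counts from both models over the hypothetical
# lr90 range, with the observed data shown as grey points.
library(ggplot2)

ggplot(forecast_both, aes(x = lr90, y = p90, colour = model)) +
  geom_line() +
  geom_point(data = actual, colour = "grey60", alpha = 0.6) +
  labs(x = "log(R&D expenditure), 1990",
       y = "Number of patents, 1990",
       colour = "Model")
```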
An advantage of the NBRM is that it allows for overdispersion; it also provides a direct estimate of the extent of the overdispersion.
Even though the NBRM corrects a key flaw of the PRM, it is not perfect either. There are other models besides the PRM and NBRM: