Foreword

Output options: ‘pygments’ syntax, the ‘readable’ theme.
Snippets and results.
Source: ‘Econometrics by Example, Palgrave Macmillan, 2011’.

A word on censored and truncated datasets

Let us say we have a study spanning a given time window (defined by a beginning time period and an ending time period). Censoring occurs for two reasons:

The event does not occur during the time window.
Some individuals or firms (observations) leave the study because another event forces them out. For example, the event is ‘promotion’ or ‘recidivism’ and the individual dies during the study.

Censoring takes three forms:

Left censoring: an individual has already experienced an event by the time the study begins, but we don’t know when it occurred. It may pose a problem in some models.
Interval censoring: an event happened between two time periods (within the time window), but we don’t know exactly when.
Right censoring: an individual has not experienced an event by the time the study ends and we don’t know when or if it will occur.

We assume that censoring is non-informative: the fact that an individual is censored at a particular point in time provides no information about that person’s hazard at that time. There is no way to test this assumption, and there are no available methods to relax it. We can just ignore the problem and hope for the best. Non-parametric, parametric and semi-parametric methods handle left and interval censoring.

Outside survival analysis, censoring may pose problems to OLS models and logit/probit models (whether they are binary, multinomial or ordinal).

This is the limited dependent variable situation. We can remedy it with a Tobit model. By the way, the tobit function of the AER package is an interface to the survreg function of the survival package.

In a truncated sample, we do not even pick up observations that lie outside a certain range. We miss both the dependent and the explanatory variables.

An illustration

We have a sample of 1000 observations. These observations can be consumers. Some of these consumers (1/3) have a habit, buy a product or subscribe to a service. The rest of the group doesn’t.

We want to estimate factors that explain the habit. We can think of a logit model where habit, the regressand, is 1 or 0. We could link a habit to factors.

However, we can only base our analysis on 1/3 of the sample! We have a censored sample. We might have regressors for 1000 observations, but only 333 data points for the regressand.

How could we include the other 2/3 to devise a demand function that would predict the habit based on several factors?

A regressand can be left-censored (we miss values below a threshold) or right-censored (above a threshold).

For example, when we study consumers who recently moved into a market, we have historical consumption levels for only 1/3 of the sample before a given date (left-censored). When we only have numbers about wealthy consumers purchasing a luxury good, we only have 1/3 of the potential consumers (right-censored).

When we have neither the regressand nor the regressors for 2/3 of the consumers, we have a truncated dataset (left or right).
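The difference between the two situations can be sketched with simulated data. The variable names (income, demand), the coefficients and the threshold of 0 are hypothetical choices made for this illustration only; they are not taken from the source.

```r
# Simulated illustration (hypothetical data, not from the source).
set.seed(1)
n <- 1000
income <- rnorm(n, mean = 50, sd = 10)            # regressor, observed for everyone
latent <- -60 + 1.1 * income + rnorm(n, sd = 10)  # latent demand, can be negative

full <- data.frame(income = income, demand = latent)

# Censored sample: all 1000 rows are kept, but demand below 0 is recorded as 0.
censored <- transform(full, demand = pmax(demand, 0))

# Truncated sample: rows with demand at or below 0 are dropped entirely,
# so both the regressand and the regressors are missing for them.
truncated <- subset(full, demand > 0)

nrow(censored)              # still 1000
sum(censored$demand == 0)   # number of censored observations
nrow(truncated)             # fewer than 1000
```

In the censored sample the regressors remain available for every row; in the truncated sample they disappear together with the regressand.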

Working mothers

We have a sample of 753 married women. 428 work outside the home, 325 don’t. We also have socioeconomic factors affecting the work decision.

## Classes 'tbl_df', 'tbl' and 'data.frame':    753 obs. of  28 variables:
##  $ taxableinc  : num  12200 18000 24000 16400 10000 ...
##  $ federaltax  : num  1494 2615 3957 2279 1063 ...
##  $ hsiblings   : num  1 8 4 6 3 8 0 2 6 5 ...
##  $ hfathereduc : num  14 7 7 7 7 7 7 7 7 7 ...
##  $ hmothereduc : num  16 3 10 12 7 7 10 7 7 7 ...
##  $ siblings    : num  4 0 2 5 7 4 8 7 7 0 ...
##  $ lfp         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ hours       : num  1610 1656 1980 456 1568 ...
##  $ kidsl6      : num  1 0 1 0 1 0 0 0 0 0 ...
##  $ kids618     : num  0 2 3 3 2 0 2 0 2 2 ...
##  $ age         : num  32 30 35 34 31 54 37 54 48 39 ...
##  $ educ        : num  12 12 12 12 14 12 16 12 12 12 ...
##  $ wage        : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wage76      : num  2.65 2.65 4.04 3.25 3.6 ...
##  $ hhours      : num  2708 2310 3072 1920 2000 ...
##  $ hage        : num  34 30 40 53 32 57 37 53 52 43 ...
##  $ heduc       : num  12 9 12 10 12 11 12 8 4 12 ...
##  $ hwage       : num  4.03 8.44 3.58 3.54 10 ...
##  $ faminc      : num  16310 21800 21040 7300 27300 ...
##  $ mtr         : num  0.721 0.661 0.692 0.781 0.622 ...
##  $ mothereduc  : num  12 7 12 7 12 14 14 3 7 7 ...
##  $ fathereduc  : num  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemployment: num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ largecity   : num  0 1 0 0 1 1 0 0 0 0 ...
##  $ exper       : num  14 5 15 6 7 33 11 35 24 21 ...
##  $ expersq     : num  196 25 225 36 49 ...
##  $ famincsq    : num  2.66e+08 4.75e+08 4.43e+08 5.33e+07 7.45e+08 ...
##  $ faminceduc  : num  195720 261600 252480 87600 382200 ...

hours is the number of hours each woman works per year. When the value is zero, the woman does not work outside the home.

We plot the entire dataset.

There are socioeconomic factors affecting the decision to work. The regressors we consider are age, education, experience, squared experience, family income, the number of kids under age 6, and husband’s wage.

The OLS regression model

We run an OLS on regressand hours with the entire dataset.

work_ols <- lm(hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage, data = work)

summary(work_ols)

## 
## Call:
## lm(formula = hours ~ age + educ + exper + expersq + faminc + 
##     kidsl6 + hwage, data = work)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1619.1  -491.5   -80.5   484.0  3636.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.298e+03  2.319e+02   5.597 3.06e-08 ***
## age         -2.955e+01  3.864e+00  -7.648 6.32e-14 ***
## educ         5.064e+00  1.256e+01   0.403   0.6868    
## exper        6.852e+01  9.399e+00   7.290 7.91e-13 ***
## expersq     -7.792e-01  3.085e-01  -2.525   0.0118 *  
## faminc       2.899e-02  3.201e-03   9.057  < 2e-16 ***
## kidsl6      -3.956e+02  5.564e+01  -7.110 2.73e-12 ***
## hwage       -7.051e+01  9.025e+00  -7.814 1.89e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 712 on 745 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.3323 
## F-statistic: 54.47 on 7 and 745 DF,  p-value: < 2.2e-16

The coefficients are marginal effects of the socioeconomic regressors on the mean value of the regressand. The \(R^2\) is low (about 0.34). All coefficients are statistically significant except educ.

The truncated OLS regression model

We remove the non-working women.

nrow(work)
work_w <- subset(work, hours != 0)
nrow(work_w)

## [1] 753
## [1] 428

From a complete dataset of 753 observations, the truncated sample counts 428 observations.

We plot the truncated dataset.

We can see that all observations with hours = 0 (on the y-axis) have been removed.

We run an OLS with the truncated dataset.

work_w_ols <- lm(hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage, data = work_w)

summary(work_w_ols)

## 
## Call:
## lm(formula = hours ~ age + educ + exper + expersq + faminc + 
##     kidsl6 + hwage, data = work_w)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1646.7  -517.6    59.9   462.2  3439.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.817e+03  2.964e+02   6.130 2.02e-09 ***
## age         -1.646e+01  5.365e+00  -3.067 0.002301 ** 
## educ        -3.836e+01  1.607e+01  -2.388 0.017398 *  
## exper        4.949e+01  1.373e+01   3.603 0.000352 ***
## expersq     -5.510e-01  4.169e-01  -1.322 0.187010    
## faminc       2.739e-02  3.995e-03   6.855 2.55e-11 ***
## kidsl6      -2.438e+02  9.216e+01  -2.646 0.008455 ** 
## hwage       -6.651e+01  1.284e+01  -5.179 3.47e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 691.8 on 420 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2058 
## F-statistic: 16.81 on 7 and 420 DF,  p-value: < 2.2e-16

The \(R^2\) does not improve; it falls from about 0.34 to about 0.22. Now, educ is significant, but its negative sign is counter-intuitive: more educated women should have a higher incentive to go to work.

In both cases, the OLS estimator is biased as well as inconsistent; that is, the bias does not vanish no matter how large the sample is.

The conditional mean of the error term is nonzero and the error term is correlated with the regressors. As we know, if the error term and the regressors are correlated, the OLS coefficients are biased as well as inconsistent.
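A small simulation may make the bias tangible. This is only a sketch on artificial data: the true slope of 2, the error standard deviation and the censoring point of 0 are assumptions chosen for the illustration, not values from the working-women dataset.

```r
# Simulated illustration (hypothetical data): OLS on censored and truncated samples.
set.seed(42)
n <- 5000
x <- rnorm(n)
y_latent <- 1 + 2 * x + rnorm(n, sd = 2)   # latent regressand, true slope is 2

d <- data.frame(x = x,
                y_latent = y_latent,
                y_censored = pmax(y_latent, 0))  # values below 0 recorded as 0
d_truncated <- d[d$y_latent > 0, ]               # rows below 0 dropped entirely

# OLS slope on the (normally unobservable) latent variable: close to 2.
slope_latent <- unname(coef(lm(y_latent ~ x, data = d))["x"])

# OLS slope on the censored regressand, all rows kept.
slope_censored <- unname(coef(lm(y_censored ~ x, data = d))["x"])

# OLS slope on the truncated sample.
slope_truncated <- unname(coef(lm(y_latent ~ x, data = d_truncated))["x"])

round(c(latent = slope_latent,
        censored = slope_censored,
        truncated = slope_truncated), 2)
```

In this setup both the censored and the truncated slopes fall well below the true value, and drawing a larger sample does not close the gap.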

The censored Tobit

The Tobit model links a latent variable (\(y_i^0\)) to the observed variable hours worked (\(y_i\)), the regressand.

\[y_i^0 = \beta x_i + \varepsilon_i\]

The latent variable can take any value, positive or negative; the observed hours are either positive or zero.

\[y_i = \left \{ \begin{array}{ll} y_i^0 & (y_i^0 > 0),\\ 0 & (y_i^0 \le 0). \end{array} \right.\]

In the probit model, the regressand equals 1 if the latent variable is greater than zero, and zero otherwise.

In the Tobit model, the regressand hours worked equals the latent variable whenever the latent variable is greater than zero. The name ‘Tobit’ is a contraction of the author’s name, Tobin, and probit, itself a coined word.

The AER package provides the tobit function, an interface to the survreg function of the survival package.

We use the full dataset work.

library(AER)

work_tobit <- tobit(hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage, data = work)

summary(work_tobit)

# From summary or stand-alone
lrtest(work_tobit)

## 
## Call:
## tobit(formula = hours ~ age + educ + exper + expersq + faminc + 
##     kidsl6 + hwage, data = work)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            753            325            428              0 
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.126e+03  3.796e+02   2.967 0.003004 ** 
## age         -5.411e+01  6.621e+00  -8.172 3.03e-16 ***
## educ         3.865e+01  2.068e+01   1.868 0.061711 .  
## exper        1.298e+02  1.623e+01   7.999 1.25e-15 ***
## expersq     -1.845e+00  5.097e-01  -3.619 0.000295 ***
## faminc       4.077e-02  5.258e-03   7.754 8.90e-15 ***
## kidsl6      -7.824e+02  1.038e+02  -7.541 4.67e-14 ***
## hwage       -1.055e+02  1.563e+01  -6.751 1.47e-11 ***
## Log(scale)   6.964e+00  3.693e-02 188.549  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 1058 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -3790 on 9 Df
## Wald-statistic: 325.3 on 7 Df, p-value: < 2.22e-16 
## 
## Likelihood ratio test
## 
## Model 1: hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage
## Model 2: hours ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   9 -3789.9                         
## 2   2 -3954.9 -7 330.07  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The function treats the 325 zero values as left-censored because its left limit defaults to 0. We can set all these parameters (dataset, left and right censoring limits, distribution). For example, we could also censor the regressand on the right by setting an upper limit.

The summary reports z-values instead of t-values. All coefficients are highly significant except educ, which is only weakly significant (p-value of about 0.06); its positive sign makes sense this time.

Below the coefficient table, the summary shows the function takes 4 iterations using a normal distribution (Gaussian).

We estimate the Tobit model with the method of maximum likelihood (ML) on the entire dataset.
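To show what the ML step involves, here is a minimal base R sketch that maximizes a Tobit log-likelihood with optim on simulated data. It is not the routine used by tobit or survreg; the function name tobit_negll, the simulated coefficients and the single regressor are hypothetical choices for the illustration.

```r
# Sketch only: hand-written Tobit likelihood on simulated, left-censored data.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- pmax(1 + 2 * x + rnorm(n, sd = 2), 0)   # observed regressand, censored at 0

# Negative log-likelihood; par = (intercept, slope, log(sigma)).
tobit_negll <- function(par, y, x) {
  xb    <- par[1] + par[2] * x
  sigma <- exp(par[3])                       # the log scale keeps sigma positive
  ll <- ifelse(y <= 0,
               pnorm(0, mean = xb, sd = sigma, log.p = TRUE),  # P(latent <= 0)
               dnorm(y, mean = xb, sd = sigma, log = TRUE))    # density of observed y
  -sum(ll)
}

# Start from the (biased) OLS estimates and let optim improve on them.
ols   <- lm(y ~ x)
start <- c(coef(ols), log(sd(residuals(ols))))
fit   <- optim(start, tobit_negll, y = y, x = x, method = "BFGS")

tobit_est <- setNames(c(fit$par[1:2], exp(fit$par[3])),
                      c("intercept", "slope", "sigma"))
round(tobit_est, 2)
```

Censored observations contribute the probability of falling at or below the limit, uncensored ones contribute the normal density. Estimating log(sigma) rather than sigma mirrors the Log(scale) row of the summary.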

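The likelihood ratio test against the intercept-only model gives a p-value below 2.2e-16: the regressors are jointly highly significant. The Log(scale) row is the log of the estimated standard deviation of the error term (scale = 1058); its p-value is not a test of the model.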
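The likelihood ratio statistic and the scale can be checked by hand from the figures printed in the output. The sketch below only reuses the rounded numbers shown above, so small rounding differences with the reported 330.07 are expected.

```r
# Figures copied from the printed output (rounded).
loglik_full <- -3789.9   # model with the seven regressors
loglik_null <- -3954.9   # intercept-only model

# Likelihood ratio statistic: twice the difference in log-likelihoods.
lr_stat <- 2 * (loglik_full - loglik_null)               # about 330
p_value <- pchisq(lr_stat, df = 7, lower.tail = FALSE)   # 7 restrictions

# Back-transform Log(scale) to the standard deviation of the error term.
sigma_hat <- exp(6.964)                                  # about 1058

c(lr_stat = lr_stat, p_value = p_value, sigma_hat = sigma_hat)
```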

The Wald statistic replaces the F-statistic (and \(R^2\)) of the OLS output. The result is part of the summary. It can be recomputed with a function.

# stand-alone
waldtest(work_tobit)

## Wald test
## 
## Model 1: hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage
## Model 2: hours ~ 1
##   Res.Df Df  Chisq Pr(>Chisq)    
## 1    744                         
## 2    751 -7 325.31  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Wald statistic is highly significant: the regressors are jointly significant, even though educ on its own is only weakly significant.

We can compute an \(R^2\)-type measure by squaring the coefficient of correlation between the actual hours and the values estimated by the Tobit model.
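As a minimal sketch, the computation is a one-liner. The helper name pseudo_r2 and the toy vectors are hypothetical; with the fitted model one would presumably pass work$hours and the model predictions, for example those returned by predict(work_tobit), which is an assumption about the objects at hand.

```r
# Squared correlation between actual and fitted values (hypothetical helper).
pseudo_r2 <- function(actual, fitted) {
  cor(actual, fitted)^2
}

# Toy vectors for illustration only, not taken from the dataset.
actual_hours <- c(0, 0, 500, 1200, 1800)
fitted_hours <- c(-200, 100, 600, 1000, 1500)
pseudo_r2(actual_hours, fitted_hours)

# Possible usage with the fitted model (assumes work and work_tobit exist):
# pseudo_r2(work$hours, predict(work_tobit))
```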

Interpretation of the results

## 
## Test of coefficients:
## 
##                Estimate  Std. Error  z value  Pr(>|z|)    
## (Intercept)  1.1263e+03  3.7959e+02   2.9673 0.0030045 ** 
## age         -5.4110e+01  6.6213e+00  -8.1721 3.031e-16 ***
## educ         3.8646e+01  2.0685e+01   1.8684 0.0617113 .  
## exper        1.2983e+02  1.6230e+01   7.9994 1.251e-15 ***
## expersq     -1.8448e+00  5.0968e-01  -3.6194 0.0002953 ***
## faminc       4.0769e-02  5.2578e-03   7.7540 8.904e-15 ***
## kidsl6      -7.8237e+02  1.0375e+02  -7.5409 4.668e-14 ***
## hwage       -1.0551e+02  1.5629e+01  -6.7508 1.470e-11 ***
## Log(scale)   6.9638e+00  3.6933e-02 188.5492 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For example, if the husband’s wages (hwage) go up, on average, a woman will work less in the labor market, ceteris paribus.

Each coefficient gives the marginal impact of that variable on the mean value of the latent variable hours worked. Each coefficient is a risk factor.

The main risk factors are exper, kidsl6, and hwage.

Women with more experience, more education, no kids under 6, and a husband with a low wage are more likely to go to work.

We cannot interpret the Tobit coefficient of each regressor as giving the marginal impact of that regressor on the mean value of the observed regressand hours worked.

Take for instance the impact of age: about -54. If age increases by 1 year, its direct impact on the hours worked per year will be a decrease of about 54 hours per year, ceteris paribus, and the probability of a woman entering the labor force will also decrease.

So we have to multiply -54 by the probability that this will happen. Unless we know the latter, we will not be able to compute the aggregate impact of an increase in age on the hours worked. And this probability calculation depends on all the regressors in the model and their coefficients.
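For a regressand censored at zero with normal errors, the standard result is that the marginal effect on the mean of the observed regressand equals the coefficient times \(\Phi(x\beta/\sigma)\), the probability of being uncensored. The sketch below applies this with the age coefficient and the scale from the output. The helper name tobit_marginal_effect and the value of the linear predictor xb_example are hypothetical: the true value depends on the regressor values chosen for the evaluation.

```r
# Marginal effect on the observed regressand: beta_j * P(uncensored | x).
tobit_marginal_effect <- function(beta_j, xb, sigma) {
  beta_j * pnorm(xb / sigma)
}

beta_age   <- -54.11   # Tobit coefficient of age, from the output
sigma_hat  <- 1058     # scale, from the output
xb_example <- 500      # hypothetical linear predictor for one woman

# Effect of one more year of age on expected observed hours for that woman.
tobit_marginal_effect(beta_age, xb_example, sigma_hat)
```

Because the probability lies between 0 and 1, the effect on observed hours is always smaller in absolute value than the Tobit coefficient itself.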

Violations

The Tobit model assumes that the error term follows the normal distribution with zero mean and constant variance (homoscedasticity). If the errors are heteroscedastic or non-normal, the ML estimates are no longer consistent. For non-normal errors, we can choose another distribution that fits the data better.

Tests

We can run additional tests. For illustration, we use a sandwich covariance estimate.

# Wald-type test
linearHypothesis(work_tobit, c('exper = 0', 'kidsl6 = 0'), vcov = sandwich)

## Linear hypothesis test
## 
## Hypothesis:
## exper = 0
## kidsl6 = 0
## 
## Model 1: restricted model
## Model 2: hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df  Chisq Pr(>Chisq)    
## 1    746                         
## 2    744  2 130.73  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regressors exper and kidsl6 are jointly highly significant. In other words, we reject the null hypothesis that both coefficients are equal to zero.

The truncated regression model

In a truncated sample, we do not have information on the regressand as well as on the regressors that may be associated with the regressand.

In our case, we would not have data on hours worked for 325 women. Therefore, we cannot use the information about socioeconomic variables for these observations.

Above, we ran an OLS on a sub-sample of 428 women only. However, the OLS estimators are inconsistent in this situation. Since the sample is truncated, the assumption that the error term in this model is normally distributed cannot be maintained.

Therefore, we have to use a truncated normal distribution with a nonlinear method of estimation, such as the ML method.
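What such an estimation looks like can be sketched in base R on simulated data. The log-likelihood below divides the normal density by the probability of being observed, \(\Phi(x\beta/\sigma)\), for truncation from below at 0. The function name trunc_negll and the simulated coefficients are hypothetical, and this is an illustration of the technique rather than the code of any particular package.

```r
# Sketch only: truncated-normal regression by ML on simulated data.
set.seed(42)
n <- 20000
x <- rnorm(n)
y_latent <- 1 + 2 * x + rnorm(n, sd = 2)   # true slope is 2
keep <- y_latent > 0                       # rows at or below 0 are never sampled
d <- data.frame(x = x[keep], y = y_latent[keep])

# Negative log-likelihood; par = (intercept, slope, log(sigma)).
trunc_negll <- function(par, y, x) {
  xb    <- par[1] + par[2] * x
  sigma <- exp(par[3])
  ll <- dnorm(y, mean = xb, sd = sigma, log = TRUE) -
    pnorm(xb / sigma, log.p = TRUE)        # minus log P(latent > 0 | x)
  -sum(ll)
}

ols   <- lm(y ~ x, data = d)               # inconsistent on the truncated sample
start <- c(coef(ols), log(sd(residuals(ols))))
fit   <- optim(start, trunc_negll, y = d$y, x = d$x, method = "BFGS")

slope_ols <- unname(coef(ols)["x"])
slope_ml  <- unname(fit$par[2])
round(c(ols = slope_ols, truncated_ml = slope_ml), 2)
```

In this setup the ML slope moves back toward the true value of 2, whereas the OLS slope on the truncated sample stays well below it.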

We use the truncated dataset work_w.

work_tr_tobit <- tobit(hours ~ age + educ + exper + expersq + faminc + kidsl6 + hwage, data = work_w)

summary(work_tr_tobit)

## 
## Call:
## tobit(formula = hours ~ age + educ + exper + expersq + faminc + 
##     kidsl6 + hwage, data = work_w)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            428              0            428              0 
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.817e+03  2.937e+02   6.188 6.08e-10 ***
## age         -1.646e+01  5.315e+00  -3.096 0.001960 ** 
## educ        -3.836e+01  1.592e+01  -2.410 0.015940 *  
## exper        4.949e+01  1.361e+01   3.637 0.000275 ***
## expersq     -5.510e-01  4.130e-01  -1.334 0.182151    
## faminc       2.739e-02  3.957e-03   6.920 4.51e-12 ***
## kidsl6      -2.438e+02  9.129e+01  -2.671 0.007565 ** 
## hwage       -6.651e+01  1.272e+01  -5.228 1.72e-07 ***
## Log(scale)   6.530e+00  3.418e-02 191.047  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 685.3 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -3402 on 9 Df
## Wald-statistic: 119.9 on 7 Df, p-value: < 2.22e-16

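We can see differences in the magnitude and statistical significance of the coefficients between the truncated model and the censored Tobit. The education coefficient educ is positive in the censored Tobit, but is negative in the truncated model. Note, however, that work_w contains no censored observations (0 left-censored, 428 uncensored), so tobit reproduces exactly the coefficients of the OLS regression on work_w. Correcting for truncation would require a likelihood based on the truncated normal distribution, for example with the truncreg function of the truncreg package.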

Since the censored Tobit model uses more information (753 observations) than the truncated regression model (428 observations), estimates obtained from the censored Tobit model are expected to be more efficient.