Foreword

  • Output options: ‘pygments’ syntax, the ‘readable’ theme.
  • Some Snippets and all results.
  • Use of htmlTable
  • Sources:
    • ‘Econometrics by Example, Palgrave Macmillan, 2011’.
    • ‘Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Second Edition, Springer, 2015’.
    • ‘An R Companion to Applied Regression, Second Edition, Sage Publications, 2011’.
    • ‘Introducing Survival and Event History Analysis, Sage Publications, 2011’.


Survival Analysis: Goal and Applications

How to

One good approach is to take a quick look at the Terminology, go through the other sections in order, and then review the Recapitulation and the Terminology.

These terms will come back over and over:

  • Cumulative distribution function (CDF) (distribution of the dependent variable, the disturbances in cases of AFT parametrization).
  • Survivor function (survival over time).
  • Hazard function (instantaneous risk of the event over time).
  • Cumulative hazard function.

These functions are often plotted in the worked examples.

Note that the ‘CDF’ describes the ‘probability distribution’; in what follows, ‘distribution’ is used as a shorthand for the ‘density function’.
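Linking these quantities (with \(F(t)\) the CDF and \(f(t)\) the density of the survival time \(T\)):

```latex
\begin{aligned}
S(t) &= P(T > t) = 1 - F(t) && \text{(survivor function)} \\
h(t) &= \frac{f(t)}{S(t)} && \text{(hazard function)} \\
H(t) &= \int_0^t h(u)\,du = -\log S(t) && \text{(cumulative hazard function)}
\end{aligned}
```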

A ‘covariate’ is an ‘explanatory variable’ or a ‘regressor’.

Survival analysis (SA) goes by various names

  • Survival analysis in biostatistics, medical science, and epidemiology.
  • Reliability analysis in engineering.
  • Duration models within economics.
  • Churn models within marketing.
  • Event history analysis within sociology, demography, psychology and political science.

We sometimes hear of hazard rate analysis (the conditional probability of event occurrence).

The primary goals of SA

  • Estimate and interpret survivor and hazard functions from survival data.
  • Analyze the hazard, or risk, that an event will occur.
  • Assess the impact of explanatory variables (regressors or covariates) on survival time.

Unlike other techniques, survival analysis does not focus only on the outcome: it also analyzes the time to an event.

This enables us to compare the survival between two or more groups and to assess the relationship between explanatory variables and survival time.

Another unique feature is the ability to include time-varying covariates (that change their values over time such as age or education), which is not possible in OLS or logistic regression.

Applications

  • Survival analysis: the time until death from cancer.
  • Reliability or failure time analysis: the time until a pump breaks down.
  • Duration models: the length of time a person is unemployed.
  • Churn models: the length of time a person will subscribe.
    • Churn models are at the core of Customer Lifetime Value (LTV).
  • Event history analysis: a longitudinal record of events in a person’s life such as marriage.
    • Transition analysis: from marriage to divorce.

Approaches

There are three basic approaches to analyzing survival data:

  1. The non-parametric methods (no assumption about the probability distribution); e.g. the Kaplan-Meier estimator.
  2. The parametric methods.
  3. The partially parametric or semi-parametric methods.

We are going to touch base with these three approaches with a dataset about recidivism: convicts who were released from prisons and who were followed up for one year after release. The analysis measures the length of time until rearrest (or not).



1, The Non-parametric Methods: the Kaplan-Meier (KM) Estimator

In non-parametric models, which include the actuarial life table, the Altschuler–Nelson estimator, and the Kaplan-Meier estimator, there is no assumption about the shape of the hazard function or about how covariates may affect that shape. However, non-parametric methods cannot handle the inclusion of multiple covariates.

This is often an excellent descriptive technique to use at the beginning of an analysis.
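For reference, the KM (product-limit) estimator multiplies, over the ordered event times \(t_i \le t\), the conditional probabilities of surviving past each event time, where \(d_i\) is the number of events at \(t_i\) and \(n_i\) the number at risk just before \(t_i\):

```latex
\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```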

Adapting the data

It is possible to create survival data from existing information:

  • The duration of U.N. peacekeeping missions.
  • The time between a terrorist attack and the fall of a government.
  • Business start and end dates, mergers, takeovers or other traceable events.
  • Policy changes or take-up over time.

The central point is that we need the timing of events and an outcome variable.

The survival package

There are several packages. We begin with the survival package.

library(survival)

The package contains several datasets.

str(leukemia) # alternatively, str(aml)
## 'data.frame':    23 obs. of  3 variables:
##  $ time  : num  9 13 13 18 23 28 31 34 45 48 ...
##  $ status: num  1 1 0 1 1 0 1 1 0 1 ...
##  $ x     : Factor w/ 2 levels "Maintained","Nonmaintained": 1 1 1 1 1 1 1 1 1 1 ...

Where:

  • time, the duration of an episode (calculated from the start and end time or censoring).
  • status, the event indicator: status=1 marks the occurrence of the event; status=0 marks right censoring (no data beyond that point).
  • x, the only covariate: whether maintenance chemotherapy against leukemia was given (Maintained) or not (Nonmaintained).

The data are right-censored: even when a patient survived (status=0), we have no data beyond that last data point. This is a snapshot: the analysis covers only this chemotherapy cycle.

   time status             x
1     9      1    Maintained
2    13      1    Maintained
3    13      0    Maintained
4    18      1    Maintained
5    23      1    Maintained
6    28      0    Maintained
7    31      1    Maintained
8    34      1    Maintained
9    45      0    Maintained
10   48      1    Maintained
11  161      0    Maintained
12    5      1 Nonmaintained
13    5      1 Nonmaintained
14    8      1 Nonmaintained
15    8      1 Nonmaintained
16   12      1 Nonmaintained
17   16      0 Nonmaintained
18   23      1 Nonmaintained
19   27      1 Nonmaintained
20   30      1 Nonmaintained
21   33      1 Nonmaintained
22   43      1 Nonmaintained
23   45      1 Nonmaintained

For example, patient 3 was observed for 13 weeks after the chemotherapy cycle and was then censored; we do not know what happened afterwards. Patient 1 died 9 weeks after the beginning of the treatment.

The question is whether the standard course of chemotherapy should be extended (maintained) for additional cycles.

Does the treatment of maintenance chemotherapy significantly prolong a patient’s survival time until death?

Objects

First, we create an object.

with(leukemia, Surv(time, status))
##  [1]   9   13   13+  18   23   28+  31   34   45+  48  161+   5    5    8 
## [15]   8   12   16+  23   27   30   33   43   45

Producing KM estimates

leukemia_fit <- survfit(Surv(time, status) ~ 1, 
                        conf.type = 'log', 
                        conf.int = 0.95, 
                        type = 'kaplan-meier', 
                        error = 'greenwood', 
                        se.fit = TRUE, 
                        data = leukemia)

Interpretation of KM estimates

summary(leukemia_fit$time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   13.75   27.50   32.44   33.75  161.00

We print summary statistics of the time points in the fit; for example, the median and mean in weeks.

summary(leukemia_fit)
## Call: survfit(formula = Surv(time, status) ~ 1, data = leukemia, conf.type = "log", 
##     conf.int = 0.95, type = "kaplan-meier", error = "greenwood", 
##     se.fit = TRUE)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     23       2   0.9130  0.0588       0.8049        1.000
##     8     21       2   0.8261  0.0790       0.6848        0.996
##     9     19       1   0.7826  0.0860       0.6310        0.971
##    12     18       1   0.7391  0.0916       0.5798        0.942
##    13     17       1   0.6957  0.0959       0.5309        0.912
##    18     14       1   0.6460  0.1011       0.4753        0.878
##    23     13       2   0.5466  0.1073       0.3721        0.803
##    27     11       1   0.4969  0.1084       0.3240        0.762
##    30      9       1   0.4417  0.1095       0.2717        0.718
##    31      8       1   0.3865  0.1089       0.2225        0.671
##    33      7       1   0.3313  0.1064       0.1765        0.622
##    34      6       1   0.2761  0.1020       0.1338        0.569
##    43      5       1   0.2208  0.0954       0.0947        0.515
##    45      4       1   0.1656  0.0860       0.0598        0.458
##    48      2       1   0.0828  0.0727       0.0148        0.462

  • time, the ordered event or failure times (deaths).
  • n.risk, the risk set: the number of patients who have survived to at least time t; in other words, the number of subjects at risk at the start of the interval.
  • n.event, the number of failures (deaths) or events at the specified time.
  • survival, the estimated probability that a patient survives beyond that time; the percent of patients surviving a given time or more. For example, in row 1, 21/23 survived, or 91.3%.
  • std.err, an estimate of the standard error of the KM estimate according to the error parameter.
  • lower, upper 95% CI, confidence intervals.

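As a sanity check on the summary above, the first few KM estimates can be reproduced by hand in base R (the times, risk sets, and event counts are copied from the output):

```r
# Manual product-limit calculation for the first five event times
time    <- c(5, 8, 9, 12, 13)     # distinct event times (weeks)
n_risk  <- c(23, 21, 19, 18, 17)  # number at risk just before each time
n_event <- c(2, 2, 1, 1, 1)       # number of deaths at each time

# S(t) is the running product of the conditional survival probabilities
surv <- cumprod(1 - n_event / n_risk)
round(surv, 4)
## [1] 0.9130 0.8261 0.7826 0.7391 0.6957
```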
quantile(leukemia_fit, probs = c(.25, .5, .75))
## $quantile
## 25 50 75 
## 12 27 43 
## 
## $lower
## 25 50 75 
##  8 18 31 
## 
## $upper
## 25 50 75 
## 30 48 NA

We are interested in the median survival time, which is the 50th percentile. In this case, it is 27 weeks, with a 95% confidence interval of 18 to 48 weeks (the lower and upper limits above).

The median is the preferred measure because the mean is biased downward when there are right-censored times greater than the largest event time.

Plotting a univariate KM survival curve

We recall the median and its confidence limits, 18-27-45, where the gray lines cross.

Other plots resulting from transformations.

Comparing two KM survival curves

We can only use one covariate in KM: x.

# x being the sole covariate, a factor
leukemia_fit2 <- survfit(Surv(time, status) ~ x,
                         type = 'kaplan-meier',
                         data = leukemia)

leukemia_fit2
## Call: survfit(formula = Surv(time, status) ~ x, data = leukemia, type = "kaplan-meier")
## 
##                  n events median 0.95LCL 0.95UCL
## x=Maintained    11      7     31      18      NA
## x=Nonmaintained 12     11     23       8      NA

We can now observe that patients in the treatment group (x=Maintained chemotherapy) had a median survival time of 31 weeks, compared to 23 weeks for the control group (x=Nonmaintained).

Since the estimated survival function for the maintained group does not go to zero (which is the case for the non-maintained group), we can determine that the largest observation is a censored value.

Testing differences between two groups

Does the treatment of maintenance chemotherapy significantly prolong a patient’s survival time until death?

In other words, we test whether there is a statistically significant difference between the two survival curves, akin to the two-sample t-test or the one-way ANOVA. There are many possible tests:

  • Mantel-Haenszel test or log-rank test (the most common).
  • Wilcoxon test.
  • Gehan test.
  • Peto and Peto test or Peto-Peto-Prentice test.
  • Tarone-Ware test.
  • Fleming-Harrington test.

The log-rank test

We test the hypothesis that the survival function is identical for the two groups.

# rho=0 (p-value) is the default
survdiff(Surv(time, status) ~ x,
         data = leukemia,
         rho = 0)
## Call:
## survdiff(formula = Surv(time, status) ~ x, data = leukemia, rho = 0)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained    11        7    10.69      1.27       3.4
## x=Nonmaintained 12       11     7.31      1.86       3.4
## 
##  Chisq= 3.4  on 1 degrees of freedom, p= 0.0653

The reported p-value (0.0653) is two-sided. A one-sided test is appropriate here, since we expect the treatment to prolong survival; its p-value is 0.0653/2 = 0.033.
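Using the chi-squared statistic of 3.4 on 1 degree of freedom from the output above, both p-values can be recovered in base R:

```r
chisq <- 3.4
# two-sided p-value of the log-rank statistic (chi-squared, 1 df)
p_two_sided <- pchisq(chisq, df = 1, lower.tail = FALSE)
round(p_two_sided, 3)      # ~0.065, matching the survdiff output
round(p_two_sided / 2, 3)  # ~0.033, the one-sided p-value
```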

There is actually some evidence that maintenance chemotherapy prolongs life in the remission period; the weakness of the effect is probably due to the small sample size.

Alternative or confirmation: stratifying the analysis by a covariate

In a stratified test, we produce the estimates and perform the log-rank tests separately and then combine them to produce an overall statistic.

Contrary to the log-rank test, we avoid (potential) misleading statements on the basis of an aggregated analysis by not acknowledging important confounding or interaction effects.

First, we estimate the model.

leukemia_fit3 <- survfit(Surv(time, status) ~ x,
                         conf.type = 'log',
                         conf.int = 0.95,
                         type = 'kaplan-meier',
                         data = leukemia)

leukemia_fit3
## Call: survfit(formula = Surv(time, status) ~ x, data = leukemia, conf.type = "log", 
##     conf.int = 0.95, type = "kaplan-meier")
## 
##                  n events median 0.95LCL 0.95UCL
## x=Maintained    11      7     31      18      NA
## x=Nonmaintained 12     11     23       8      NA

Then, we divide the estimates by x.

quantile(leukemia_fit3, c(.25, .5, .75))
## $quantile
##                 25 50 75
## x=Maintained    18 31 48
## x=Nonmaintained  8 23 33
## 
## $lower
##                 25 50 75
## x=Maintained    13 18 34
## x=Nonmaintained  5  8 27
## 
## $upper
##                 25 50 75
## x=Maintained    NA NA NA
## x=Nonmaintained 30 NA NA

There is a difference in the median (the same as in the previous model):

  • Maintained: 31 weeks.
  • Nonmaintained: 23 weeks.

Maintenance chemotherapy likely prolongs life in the remission period; confirming the above test.

Nonetheless, it is only possible to compare differences between a limited number of groups, since we can include only one covariate (or a few covariates in some cases we will not cover).

Let us turn to methods where we can include several covariates.



2, The Parametric Methods

The parametric approach is largely used for continuous time data.

There are several parametric models that are used in duration analysis. Each depends on the assumed probability distribution, such as the normal, binomial (logistic), Poisson, etc.

The most common forms of distribution to describe survival time are the exponential (a special case of Weibull), Weibull, log-normal, generalized gamma and log-logistic.

Since the CDF of each of these distributions is known, we can easily derive the corresponding hazard and survival functions. Parametric models not only include assumptions about the distribution of failure or event times, but also about the relationship between the covariates and survival. This is related to whether we opt for one of the two specifications:

  • Proportional hazards (PH).
  • Accelerated failure time (AFT) model.

Proportional Hazards (PH) versus Accelerated Failure Time (AFT) models

  • In the PH models, the covariates have a multiplicative effect on the hazard function.
  • In AFT models, the natural logarithm of the survival time is expressed as a linear function of the covariates (additive effect).
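In symbols, with baseline hazard \(h_0(t)\) and covariate vector \(x\):

```latex
\begin{aligned}
\text{PH:}\quad  & h(t \mid x) = h_0(t)\, e^{x^\top \beta} \\
\text{AFT:}\quad & \log T = x^\top \beta^{*} + \varepsilon
\end{aligned}
```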

A central difference between these models is also the interpretation of the parameter estimates.

  • An advantage of the AFT approach is that the effect of the covariates is described in absolute terms (i.e. a number of months or years) instead of in relative terms (i.e. a hazard ratio).
  • A central assumption in AFT models is that all observations follow the same distribution of the hazard function, but that the time axis varies in a way that some groups pass through stages of the hazard curve faster than others. A common analogy used to understand AFT models is the difference between people-years versus dog-years in relation to survival time.

Although the time equivalents differ, we would still have the same hypothesis about the shape of the hazard function from birth to death. Even though we assume the same hazard function, we would observe a rather sizeable difference: for instance, 90% of humans would survive to the age of 63, compared to 90% of dogs surviving to the age of 9.

A tour of the models and distributions

Summary of selected parametric survival distributions

  • Certain models, exponential and Weibull, can be estimated as both PH and AFT models.
  • Gompertz: only as a PH model.
  • Log-normal, log-logistic, and generalized gamma models: only as an AFT.

We recall that PH-multiplicative, AFT-additive.

The exponential model (PH, AFT)

The exponential model is a special case of the Weibull model with a constant hazard function.

This model is useful for modeling data with a constant hazard, due to the fact that the baseline hazard function is assumed as constant or flat over time.

The risk of an event occurring, conditional on covariates, is the same across all points in time. Regardless of how long the duration is (e.g. treatment, marriage, unemployment), there will still be the same chance of the event occurring over that time period.

Consider the example of a constant-hazard rate of survival of a wine glass: the wine glass breaks or not. Since wine glasses do not wear out, if we assume that the probability of the event (i.e. breaking) is not dependent on how often it has been used, the wine glass has the same probability of breaking whether the glass is 0, 5, 10 or 15 years old.

Any deviation of the hazard function from the flat line is attributed to the covariate effects.

There are several shortcomings of the exponential model:

  • First, it is a ‘memoryless’ distribution: the distribution of the remaining survival time is not affected by the history of survival up to a certain period of time.
  • Second, it has the very restrictive assumption that the distribution is fully determined by a single parameter. The exponential relationship holds when a rate of change is proportional to the size of the quantity that is changing. For example, the probability of surviving an extra 10 years would be the same for 10-year-olds and for 90-year-olds.
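The memorylessness of the exponential distribution can be illustrated in base R; the rate below is an arbitrary illustrative value:

```r
rate  <- 0.1  # arbitrary constant hazard
extra <- 10

# P(T > t + extra | T > t), for any starting age t
surv_extra <- function(t) {
  pexp(t + extra, rate, lower.tail = FALSE) / pexp(t, rate, lower.tail = FALSE)
}

surv_extra(0)   # exp(-rate * extra), about 0.368
surv_extra(90)  # identical: the elapsed time does not matter
```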

There are few applied settings:

  • Mortality of the chronically ill or the oldest-old individuals.
  • Narrow time windows (very short time periods) or highly homogeneous groups.
  • Newton’s law of cooling.
  • Interest accumulated in a bank.
  • Loss of electrical charge.

Piecewise constant exponential model (PH, AFT)

It is similar to the exponential model, with the additional assumption that time is divided into different periods and that the hazard rate is then constant within any of these time periods. In other words, a piecewise exponential model gives us the flexibility of the Cox model (that will be addressed in Section 3) with the added advantage of being able to estimate the shape of the hazard function.

Each time period is given a ‘time dummy variable’ with the estimates for each dummy representing the hazard function for that particular time period.

For example, a piecewise exponential model can examine labour market transitions from unemployment to employment. The transitions out of unemployment can be split into periods (1-6, 6-12 and 12+ months), with the assumption that the transition rate varies across these time periods.

The Weibull model (PH, AFT)

We know the exponential model is a special case of the Weibull model with a constant hazard function. The Weibull hazard function can take other forms (increasing, decreasing, accelerating, decelerating).

PH-multiplicative

The baseline hazard is characterized as monotonic (i.e. constantly going up or down). With the Weibull model, the baseline hazard may be monotonically increasing or decreasing over time, or it may be constant as in the exponential model.

A monotonically increasing hazard rate might be, for example, patients who are not responding to a particular treatment. As survival time increases over time for each patient, coupled with a worsening prognosis, the probability of death increases.

A decreasing hazard could be the survival times of patients recovering from surgery, such as organ transplantation. The potential of dying immediately after surgery is high and decreases over time after the surgery.

AFT-additive

As with the exponential model, the Weibull model can also be specified as an AFT model. The regression parameters are therefore expressed as log duration times. When covariates are included in the model, they impact the scale parameter, with the shape of the distribution remaining constant.

Log-normal and Log-logistic models (AFT-additive)

Both the log-logistic and log-normal models are only implemented in the AFT form and are parametric models that allow for non-monotonic hazard functions.

They accommodate processes with an initially increasing and then decreasing rate (a hump-shaped hazard).
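A quick base-R sketch of this hump shape, using the log-logistic hazard with illustrative (assumed) shape and scale values:

```r
# log-logistic hazard: h(t) = (p/s) * (t/s)^(p-1) / (1 + (t/s)^p)
loglogistic_hazard <- function(t, shape, scale) {
  (shape / scale) * (t / scale)^(shape - 1) / (1 + (t / scale)^shape)
}

t <- c(1, 5, 10, 20, 40)
# the hazard rises to a peak (here at t = 10), then declines
round(loglogistic_hazard(t, shape = 2, scale = 10), 3)
## [1] 0.020 0.080 0.100 0.080 0.047
```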

Other parametric models (AFT-additive)

A relatively prominent model is the generalized gamma model.

It is possible to estimate additional models including the Gaussian, Gompertz, logistic, and Gompertz-Makeham models.

Packages

In the following worked examples, we use the eha package for PH parametrization and the survival package for AFT parametrization.

Exponential model: PH parameterization

Load the data and take a look at a sample of the Rossi dataset about recidivism.

The data pertain to 432 convicts who were released from prisons and who were followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid; half did not receive aid.

The head of the dataset.

  week arrest fin age  race wexp         mar paro prio educ
1   20      1  no  27 black   no not married  yes    3    3
2   17      1  no  18 black   no not married  yes    8    4
3   25      1  no  19 other  yes not married  yes   13    3
4   52      0 yes  23 black  yes     married  yes    1    5
5   52      0  no  19 other  yes not married  yes    3    3
Note: The emp1 to emp52 fields are not shown.

Using the Rossi dataset, we estimate the PH exponential model (PH-multiplicative). The exponential model is a special case of the Weibull model with a constant hazard function.

library(eha)

# shape = 1 for constant hazard function
exp_ph_model <-  phreg(Surv(week, arrest) ~ fin + age + prio,
                       dist = 'weibull',
                       shape = 1,
                       data = Rossi)

summary(exp_ph_model)
## Call:
## phreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "weibull", shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## fin 
##               no    0.489     0         1           (reference)
##              yes    0.511    -0.335     0.715     0.190     0.078 
## age                24.765    -0.065     0.937     0.021     0.002 
## prio                2.841     0.091     1.095     0.027     0.001 
## 
## log(scale)                    3.761               0.500     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    114 
## Total time at risk         19809 
## Max. log. likelihood      -688.38 
## LR test statistic         0.06 
## Degrees of freedom        3 
## Overall p-value           0.995681

Turning to the estimate for financial assistance (fin=yes), we see that it is -0.335, with a p-value of 0.078.

Since this is a PH model, a negative coefficient means that the risk of the event (rearrest) decreases, resulting in a longer duration.

If we take the exponent, \(e^{-0.335}\) = 0.715: the hazard of rearrest for those who received financial assistance (fin=yes) is 72% of the hazard for those who did not (fin=no).
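The conversion from a PH coefficient to a hazard ratio is a one-liner in base R:

```r
coef_fin <- -0.335  # coefficient for fin=yes from the phreg output
round(exp(coef_fin), 3)
## [1] 0.715
```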

Basic plots (the exponential model is a special case of the Weibull model).

We can see the flat, constant hazard curve of the exponential model.

We compute the confidence intervals for the exponentiated coefficient for those who received financial assistance (fin=yes).

Lower 95% CI Upper 95% CI
        0.49         1.04

The exponentiated coefficient of financial aid (0.715) has 95% CI of (0.49, 1.04).

We can also use the parameter estimates to provide an estimate of the time to any of the quartiles (e.g. the median, 0.50).

 0.25  0.50  0.75
42.63 21.32  8.85

The median rearrest time is 21 weeks.

The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1).

Exponential model: AFT parameterization

We estimate the AFT exponential model (AFT-additive).

library(survival)

exp_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
                         dist = 'exponential',
                         data = Rossi)

summary(exp_aft_model)
## 
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "exponential")
##               Value Std. Error     z        p
## (Intercept)  3.7614     0.4998  7.53 5.24e-14
## finyes       0.3352     0.1901  1.76 7.78e-02
## age          0.0650     0.0207  3.14 1.71e-03
## prio        -0.0911     0.0270 -3.38 7.33e-04
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -688.4   Loglik(intercept only)= -702
##  Chisq= 27.2 on 3 degrees of freedom, p= 5.3e-06 
## Number of Newton-Raphson Iterations: 5 
## n= 432

The interpretations of the parameter estimates for PH and AFT models differ.

The PH assumption is useful for a comparison of hazards and the AFT assumption is useful for a comparison of survival times.

The acceleration factor is the key measure in AFT models and is used to evaluate the effect of the covariate(s) on survival time.

  • In parametric AFT models, negative regression coefficients imply decreasing survival times and shorter expected durations.
  • In Cox (and other PH) models, however, negative coefficients imply decreased risk and longer duration times.

This means that the interpretation of the coefficients is essentially the opposite.

The coefficient for financial aid (finyes) is 0.335. Since it has a positive sign, those who receive financial assistance have a longer expected time to arrest (good news).

We generally estimate the acceleration factor, which is simply \(e^{0.335}\) = 1.43. To interpret these results we would say that the provision of financial assistance is effective for delaying rearrest by stretching the survival time by a factor of 1.43.

finyes  age prio
   1.4 1.07 0.91

An acceleration factor > 1 means that financial assistance is beneficial to survival (i.e. staying out of prison), and vice versa. An acceleration factor or hazard ratio of 1 would mean no effect.

Express the AFT results in absolute terms as the number of weeks of survival time until rearrest when receiving financial aid: exp(3.761 + 0.335 * 1).

finyes
60.1

In other words, receiving financial assistance sets the expected time until rearrest to about 60 weeks (good news).
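Both figures follow directly from the survreg coefficients above, in base R:

```r
intercept <- 3.7614  # (Intercept) from the survreg output
coef_fin  <- 0.3352  # finyes coefficient

round(exp(coef_fin), 2)              # acceleration factor: survival time stretched ~1.4x
## [1] 1.4
round(exp(intercept + coef_fin), 1)  # expected weeks until rearrest with financial aid
## [1] 60.1
```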

We can play with the model coefficients and compute different results.


Now, we interpret a continuous variable such as prio (-0.091) in relative terms: 100 * (exp(-0.091) - 1).

prio
-8.7

Holding the covariates of financial assistance and age constant, each additional prior conviction is associated with an 8.7% decrease in the expected time to rearrest (bad news).

The 95% confidence intervals for financial assistance.

Lower 95% CI Upper 95% CI
        0.96         2.03

The hazard ratio for the effect of financial assistance was estimated as \(e^{-0.335} = 0.715\) in the previous PH form of the exponential model (exp_ph_model).

Although the PH and AFT exponential models have different underlying assumptions, they are essentially the same model with the only difference being their parameterization.

Estimate the median survival time.

medfin <- predict(exp_aft_model,
                  type = 'uquantile',
                  p = 0.5,
                  se.fit = TRUE)

medfin1 <- medfin$fit[2]
medfin1.se <- medfin$se.fit[2]  # standard error of the log median
expmedfin <- exp(medfin1)

round(expmedfin, 2)
## [1] 46.35

From the results, we can test the model overall significance.

summary(exp_aft_model)
## 
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "exponential")
##               Value Std. Error     z        p
## (Intercept)  3.7614     0.4998  7.53 5.24e-14
## finyes       0.3352     0.1901  1.76 7.78e-02
## age          0.0650     0.0207  3.14 1.71e-03
## prio        -0.0911     0.0270 -3.38 7.33e-04
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -688.4   Loglik(intercept only)= -702
##  Chisq= 27.2 on 3 degrees of freedom, p= 5.3e-06 
## Number of Newton-Raphson Iterations: 5 
## n= 432

There are two values for the log-likelihood: the full model (intercept and the three covariates) and the null model (intercept only, no covariates). According to the p-value, we can reject the null hypothesis: at least one of the above covariates significantly improves the fit of the model.

Piecewise Exponential model: PH and AFT parameterization

It is possible to reshape our data by chosen cut points to examine how the hazard rate differs across time periods.

Reshaping of data by chosen cut-points

The Rossi data can be divided over the 52-week period into four periods, split at 0, 13, 26 and 39 weeks; in four quarters of 13 weeks each. Recall that financial assistance changed at 13 weeks, which makes 13 weeks a useful cut-point.

Since it is an exponential model, we assume that the hazard is constant within each of these intervals.

Create a new data set where one record is created for each period where the subject was at risk.

Rossi$id <- 1:432

Rossi_piecewise <- survSplit(Rossi,
                             cut = c(0, 13, 26, 39),
                             end = 'week',
                             event = 'arrest',
                             start = 'start',
                             episode = 'i')

# order
Rossi_piecewise <- Rossi_piecewise[order(Rossi_piecewise$id, Rossi_piecewise$start),]

Rossi_piecewise$t0 <- ifelse(Rossi_piecewise$start == 0, 1, 0)
Rossi_piecewise$t13 <- ifelse(Rossi_piecewise$start == 13, 1, 0)
Rossi_piecewise$t26 <- ifelse(Rossi_piecewise$start == 26, 1, 0)
Rossi_piecewise$t39 <- ifelse(Rossi_piecewise$start == 39, 1, 0)

Rossi_piecewise[1:2, c(1:2, 6:8, 61:69)]
##   fin age paro prio educ id start week arrest i t0 t13 t26 t39
## 1  no  27  yes    3    3  1     0   13      0 2  1   0   0   0
## 2  no  27  yes    3    3  1    13   20      1 3  0   1   0   0

The data are right-censored in the ‘last quarter’. Be careful to read one or several rows per individual. Above, the first subject (id = 1) is arrested (arrest = 1) in the 2nd quarter (t13 = 1), on the second row.

Rossi_piecewise[3:10, c(1:2, 6:8, 61:69)]
##    fin age paro prio educ id start week arrest i t0 t13 t26 t39
## 3   no  18  yes    8    4  2     0   13      0 2  1   0   0   0
## 4   no  18  yes    8    4  2    13   17      1 3  0   1   0   0
## 5   no  19  yes   13    3  3     0   13      0 2  1   0   0   0
## 6   no  19  yes   13    3  3    13   25      1 3  0   1   0   0
## 7  yes  23  yes    1    5  4     0   13      0 2  1   0   0   0
## 8  yes  23  yes    1    5  4    13   26      0 3  0   1   0   0
## 9  yes  23  yes    1    5  4    26   39      0 4  0   0   1   0
## 10 yes  23  yes    1    5  4    39   52      0 5  0   0   0   1

Above, individuals 2 and 3 are also arrested in the 2nd quarter (there are 2 rows for each).

Since individual 4 (id = 4) was not arrested (arrest = 0 in all 4 rows), the subject has four splits, at 0, 13, 26 and 39 weeks.

We use only fixed covariates; they are repeated for each line. If we had integrated time-varying covariates, they would change between intervals.

Estimation of the model

The survival and eha packages cannot estimate the piecewise constant exponential model directly.

It is possible, by using the previously reshaped data, to estimate the variation in the hazard across time intervals using the PH or AFT specification, which in substantive terms provides equivalent results.

Each of the period variables representing the quarters is then estimated. For reasons of collinearity, the last quarter (t39) is not included in the model.

The PH-multiplicative model.

library(eha)

# shape = 1 for constant hazard function
pe_ph_model <- phreg(Surv(week, arrest) ~ t0 + t13 + t26 + fin + age + prio,
                     dist = 'weibull',
                     shape = 1,
                     data = Rossi_piecewise)

summary(pe_ph_model)
## Call:
## phreg(formula = Surv(week, arrest) ~ t0 + t13 + t26 + fin + age + 
##     prio, data = Rossi_piecewise, dist = "weibull", shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## t0                  0.113     0.619     1.857     0.284     0.029 
## t13                 0.216     0.525     1.690     0.245     0.032 
## t26                 0.300    -0.011     0.989     0.260     0.967 
## fin 
##               no    0.488     0         1           (reference)
##              yes    0.512    -0.335     0.716     0.190     0.078 
## age                24.772    -0.064     0.938     0.021     0.002 
## prio                2.811     0.094     1.099     0.027     0.001 
## 
## log(scale)                    4.913               0.522     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    114 
## Total time at risk         48682 
## Max. log. likelihood      -785.7 
## LR test statistic         6155.94 
## Degrees of freedom        6 
## Overall p-value           0

From our previous results, we stated that the hazard of rearrest for those who received financial assistance (finyes) is 0.72 times the hazard for those who did not receive financial assistance.

The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1).

During t0, the hazard is 1.86 times the hazard in t39 (the baseline, where t39 = 1). During t13, it is 1.69 times the baseline. During t26, it falls back to roughly the baseline level (0.99).
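These hazard ratios are simply the exponentials of the coefficients (a base-R check, with the values copied from the phreg summary above):

```r
# Hazard ratios relative to the t39 baseline, from the phreg coefficients
coefs <- c(t0 = 0.619, t13 = 0.525, t26 = -0.011)

round(exp(coefs), 3)
##    t0   t13   t26
## 1.857 1.690 0.989
```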

The result plots.

It is also possible to estimate an AFT-additive model using the survreg function.

library(survival)

pe_ph_model2 <- survreg(Surv(week, arrest) ~ t0 + t13 + t26 + fin + age + prio,
                        dist = 'exponential',
                        data = Rossi_piecewise)

summary(pe_ph_model2)
## 
## Call:
## survreg(formula = Surv(week, arrest) ~ t0 + t13 + t26 + fin + 
##     age + prio, data = Rossi_piecewise, dist = "exponential")
##               Value Std. Error      z        p
## (Intercept)  4.9132     0.5218  9.416 4.69e-21
## t0          -0.6188     0.2840 -2.179 2.94e-02
## t13         -0.5248     0.2446 -2.146 3.19e-02
## t26          0.0107     0.2595  0.041 9.67e-01
## finyes       0.3347     0.1902  1.760 7.85e-02
## age          0.0640     0.0207  3.094 1.98e-03
## prio        -0.0944     0.0273 -3.452 5.56e-04
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -785.7   Loglik(intercept only)= -804.5
##  Chisq= 37.56 on 6 degrees of freedom, p= 1.4e-06 
## Number of Newton-Raphson Iterations: 7 
## n= 1573

We recall that the coefficients are not interpreted the same way in the two model types:

                      PH                                   AFT
Positive coefficient  hazard/risk UP, shorter duration     log(duration)/time UP, longer duration
Negative coefficient  hazard/risk DOWN, longer duration    log(duration)/time DOWN, shorter duration
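For the exponential model, this table can be verified directly on the results above: the survreg (AFT) coefficients are the phreg (PH) coefficients with the sign flipped. A base-R check with the values copied from the two summaries:

```r
# For the exponential model, beta_PH = -beta_AFT
# (values copied from the phreg and survreg summaries above)
ph  <- c(t0 =  0.619,  t13 =  0.525,  t26 = -0.011,  fin = -0.335)
aft <- c(t0 = -0.6188, t13 = -0.5248, t26 =  0.0107, fin =  0.3347)

max(abs(ph + aft))  # < 0.001: equal up to rounding
```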

In the PH model, the estimates across the four periods demonstrate that the hazard is not constant over time. The coefficients of the three period dummies t0, t13 and t26 are all contrasted with the last period (t39, the baseline).

We see that the hazard of arrest in the first 13 weeks (t0) is significantly higher than the hazard in the last quarter (the baseline t39).

Visually, it is easier to interpret (this barplot of the AFT coefficients mirrors the barplot above, with the signs flipped).

  • In an AFT model, the more negative the coefficient, the shorter the duration (and the higher the hazard).
  • t39, the baseline, is at y = 0.
  • t0 has a negative coefficient: the hazard is higher than the baseline, the duration is shorter.
  • t13 also has a negative coefficient, though less negative: the hazard is still above the baseline, the duration still shorter.
  • From t0 to t13, there is a decrease in hazard, an increase in duration.
  • t26 has a small positive coefficient: the hazard is slightly below the baseline, the duration slightly longer.
  • From t13 to t26, there is a further decrease in hazard, an increase in duration.
  • From t26 to t39, there is a minor increase in hazard, back up to the baseline.

Both models, pe_ph_model and pe_ph_model2, lead to the same conclusion despite their different parameterizations: the first quarter has the most significant impact on rearrest.

Weibull model: PH parameterization

It is possible to estimate the Weibull model using a PH assumption.

library(eha)

# shape is NULL compared to pe_ph_model
weibull_ph_model <- phreg(Surv(week, arrest) ~ fin + age + prio,
                          dist = 'weibull',
                          data = Rossi)

summary(weibull_ph_model)
## Call:
## phreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "weibull")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## fin 
##               no    0.489     0         1           (reference)
##              yes    0.511    -0.349     0.705     0.190     0.066 
## age                24.765    -0.067     0.935     0.021     0.001 
## prio                2.841     0.098     1.103     0.027     0.000 
## 
## log(scale)                    3.774               0.358     0.000 
## log(shape)                    0.337               0.089     0.000 
## 
## Events                    114 
## Total time at risk         19809 
## Max. log. likelihood      -682.04 
## LR test statistic         29.17 
## Degrees of freedom        3 
## Overall p-value           2.06634e-06

The results are similar to the exponential model (exp_ph_model).

                            finyes     age    prio
exp_ph_model_exp_coef       0.7152  0.9371  1.0953
weibull_ph_model_exp_coef   0.7051  0.9353  1.1027

However, the shape is no longer fixed at 1 as it was in the exponential model (log(shape) = 0.337, i.e. shape ≈ 1.4). We no longer have a constant hazard function. The result plots below illustrate the difference (top 4 plots: exp_ph_model; bottom 4 plots: weibull_ph_model).

We can see the difference between a constant hazard function and nonconstant hazard function.

Weibull model: AFT parameterization

Alternatively.

library(survival)

weibull_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
                             dist = 'weibull',
                             data = Rossi)

summary(weibull_aft_model)
## 
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "weibull")
##               Value Std. Error     z        p
## (Intercept)  3.7738     0.3581 10.54 5.67e-26
## finyes       0.2495     0.1372  1.82 6.90e-02
## age          0.0478     0.0154  3.11 1.89e-03
## prio        -0.0698     0.0201 -3.47 5.11e-04
## Log(scale)  -0.3367     0.0892 -3.77 1.61e-04
## 
## Scale= 0.714 
## 
## Weibull distribution
## Loglik(model)= -682   Loglik(intercept only)= -696.6
##  Chisq= 29.17 on 3 degrees of freedom, p= 2.1e-06 
## Number of Newton-Raphson Iterations: 6 
## n= 432

If we compare the results from the PH and AFT specifications of the Weibull model above, we see that the coefficient estimates differ between the two parameterizations.

The scale in the AFT estimates is 0.714 (the exponential model corresponds to Scale fixed at 1), which means that the hazard rate is increasing at a decreasing rate (scale < 1). The scale is the acceleration factor.

In the PH model, log(scale) is 3.774 and log(shape) is 0.337.

We can also calculate (1 / 0.714) - 1 = 0.4006, which means that a 1% increase in time since release from prison results in a roughly 0.4% increase in the hazard of rearrest.
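For the Weibull model, the two parameterizations are linked by \(\beta_{PH} = -\beta_{AFT} / \sigma\), where \(\sigma\) is the AFT scale. A base-R check with the values copied from the two summaries above:

```r
# Weibull link between parameterizations: beta_PH = -beta_AFT / scale
aft   <- c(fin = 0.2495, age = 0.0478, prio = -0.0698)  # survreg estimates
scale <- 0.714                                          # AFT scale

round(-aft / scale, 3)
##    fin    age   prio
## -0.349 -0.067  0.098
```

The result matches the phreg coefficients of weibull_ph_model.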

Without exploring the details, both models, weibull_ph_model and weibull_aft_model, converge towards the same conclusions.

Just a reminder that:

  • The interpretations of the parameter estimates for PH and AFT models differ.
  • The PH-multiplicative-relative assumption is useful for a comparison of hazards. We interpret PH model results as we interpret logit/probit results: the exponentiated coefficients are hazard ratios (analogous to odds ratios).
  • The AFT-additive-absolute assumption is useful for a comparison of survival times.
  • The acceleration factor (Scale = 0.714) is the key measure in AFT models and is used to evaluate the effect of the covariate(s) on survival time.
  • In parametric AFT models, negative coefficients imply decreasing survival times and shorter expected durations.
  • In PH and Cox models, negative coefficients refer to decreased risk and longer durations.
  • The interpretation of the coefficients is therefore essentially opposite.

Log-normal and Log-logistic models: AFT parameterization

It is also possible to estimate the log-normal and log-logistic models.

The PH result plots for a log-logistic model.

The AFT parametrization for a log-logistic model.

library(survival)

loglogistic_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
                                 dist = 'loglogistic',
                                 data = Rossi)

summary(loglogistic_aft_model)
## 
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "loglogistic")
##               Value Std. Error     z        p
## (Intercept)  3.6572     0.3651 10.02 1.28e-23
## finyes       0.2582     0.1452  1.78 7.55e-02
## age          0.0455     0.0152  3.00 2.66e-03
## prio        -0.0743     0.0219 -3.39 6.97e-04
## Log(scale)  -0.4260     0.0868 -4.91 9.22e-07
## 
## Scale= 0.653 
## 
## Log logistic distribution
## Loglik(model)= -682.9   Loglik(intercept only)= -696.7
##  Chisq= 27.49 on 3 degrees of freedom, p= 4.7e-06 
## Number of Newton-Raphson Iterations: 4 
## n= 432

We see that the scale is 0.653, which is less than 1.0, meaning that the hazard rises at first and then falls: the estimated hazard function follows an inverted U shape.

The PH parametrization plots for a log-normal model.

The AFT parametrization for a log-normal model.

library(survival)

lognormal_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
                               dist = 'lognormal',
                               data = Rossi)

summary(lognormal_aft_model)
## 
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi, 
##     dist = "lognormal")
##               Value Std. Error     z        p
## (Intercept)  3.9785     0.3933 10.12 4.66e-24
## finyes       0.3226     0.1668  1.93 5.31e-02
## age          0.0379     0.0154  2.47 1.36e-02
## prio        -0.0755     0.0266 -2.84 4.45e-03
## Log(scale)   0.2815     0.0766  3.67 2.39e-04
## 
## Scale= 1.33 
## 
## Log Normal distribution
## Loglik(model)= -687.4   Loglik(intercept only)= -697.9
##  Chisq= 20.94 on 3 degrees of freedom, p= 0.00011 
## Number of Newton-Raphson Iterations: 4 
## n= 432

We remember that in PH-multiplicative models, an exponentiated coefficient above 1 indicates an increase in the hazard rate of rearrest compared to the baseline (reducing duration).

In AFT-additive models, the exponentiated estimate of fin, \(e^{0.3226}\) = 1.38, means the expected time to rearrest for those who received financial assistance is 38% greater than for those who did not receive financial assistance (extending duration).

In both models, loglogistic_aft_model and lognormal_aft_model, the estimates are very similar to the Weibull model (weibull_aft_model).

                                 finyes     age    prio
exp_aft_model_exp_coef           1.3982  1.0671  0.9130
weibull_aft_model_exp_coef       1.2834  1.0489  0.9326
loglogistic_aft_model_exp_coef   1.2946  1.0466  0.9284
lognormal_aft_model_exp_coef     1.3807  1.0387  0.9272

We can compare the four models with the log-likelihood (Loglik(model)) statistics.

Since,

\[AIC = -2\log(L) + 2k\] \[BIC = -2\log(L) + k\log(n)\]

where \(\log(L)\) = Loglik(model), \(k\) is the number of estimated parameters and \(n\) the number of observations, the log-likelihood is directly related to the AIC and BIC. We pick the best model, the one that minimizes the AIC and/or BIC.
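The formulas translate directly into base R (the log-likelihood and parameter count below are illustrative values; \(k\) counts all estimated parameters, including the scale):

```r
# AIC and BIC from a model's log-likelihood
aic <- function(logL, k)    -2 * logL + 2 * k
bic <- function(logL, k, n) -2 * logL + k * log(n)

aic(-682, 5)       # e.g. a Loglik(model) of -682 with 5 parameters
bic(-682, 5, 432)  # n = 432 subjects
```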

     exp_AIC  wei_AIC   ll_AIC   ln_AIC
AIC  1409.96  1399.24  1399.34  1401.82

The weibull_aft_model wins!

One final word of caution

Parametric models can have potentially dangerous consequences: they directly model the shape of the hazard rate, even if the parameterization we have chosen is incorrect. If the model choice is incorrect, we risk making incorrect substantive interpretations.

Covariates can also be sensitive to the specification of the distribution, meaning that even the interpretation of the relationship between the covariates and survival time may be incorrect.

Furthermore, parametric models are very sensitive to included or omitted covariates, since the distribution function is conditional on these covariates. For this reason, it is important not only to weigh the choice between semi-parametric and the various parametric models but also to pay close attention to key covariates. Model results often vary considerably depending on the covariates we include or omit.

This makes model building and diagnostics a key step in our analysis.



3. The Semi-parametric Methods

Semi-parametric models let the baseline hazard take any form:

  • Cox proportional-hazard (PH) regression model.
  • Piecewise constant exponential model.

They are flexible:

  • We make no assumption about the shape of the hazard function.
  • Contrary to non-parametric methods, we are able to include multiple covariates.

Cox Proportional Hazard (PH) regression model

Unlike the parametric models, the Cox method does not rely on a particular probability distribution. It generally fits data well, regardless of which parametric model is appropriate.

Special features:

  • Cox models do not have an intercept term; this term is absorbed into the baseline hazard function.
  • Cox models can be written as a linear model for the log-hazard (like the AFT models) or as a multiplicative model for the hazard (like the PH models).

The Cox PH model with time-constant covariates

Fixed covariates are those whose values remain constant over time, such as ‘sex’ or ‘place of birth’ for example.

The Cox PH model with time-varying covariates

Time-dependent variables are those where the values differ over time, such as ‘age’, ‘labour market status’ or ‘treatment status’ for example.

It may be a continuous variable such as ‘age’, or vary discretely over time, such as a time-varying binary variable like ‘monthly employment status’ (0, 1), which may change from time period to time period with no set pattern.

Cox PH regression model with time-constant covariates

Load in the Rossi dataset about recidivism.

The data pertains to 432 convicts who were released from prisons and who were followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid; half did not receive aid.

The head of the dataset.

  week arrest fin age  race wexp         mar paro prio educ
1   20      1  no  27 black   no not married  yes    3    3
2   17      1  no  18 black   no not married  yes    8    4
3   25      1  no  19 other  yes not married  yes   13    3
4   52      0 yes  23 black  yes     married  yes    1    5
5   52      0  no  19 other  yes not married  yes    3    3

Note: the emp1 to emp52 fields are not shown.

The main covariates:

  • fin, yes if the individual received financial aid, no otherwise.
  • age, in years at time of release from prison.
  • race, black or other.
  • wexp, full-time work experience prior to incarceration, yes or no.
  • mar, married at time of release or not.
  • paro, released on parole or not.
  • prio, number of prior convictions.

Estimate a first model with the above covariates.

library(survival)

cox_model1 <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
                    method = 'efron',
                    data = Rossi)

summary(cox_model1)
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp + 
##     mar + paro + prio, data = Rossi, method = "efron")
## 
##   n= 432, number of events= 114 
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)   
## finyes         -0.37942   0.68426  0.19138 -1.983  0.04742 * 
## age            -0.05744   0.94418  0.02200 -2.611  0.00903 **
## raceother      -0.31390   0.73059  0.30799 -1.019  0.30812   
## wexpyes        -0.14980   0.86088  0.21222 -0.706  0.48029   
## marnot married  0.43370   1.54296  0.38187  1.136  0.25606   
## paroyes        -0.08487   0.91863  0.19576 -0.434  0.66461   
## prio            0.09150   1.09581  0.02865  3.194  0.00140 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## finyes            0.6843     1.4614    0.4702    0.9957
## age               0.9442     1.0591    0.9043    0.9858
## raceother         0.7306     1.3688    0.3995    1.3361
## wexpyes           0.8609     1.1616    0.5679    1.3049
## marnot married    1.5430     0.6481    0.7300    3.2614
## paroyes           0.9186     1.0886    0.6259    1.3482
## prio              1.0958     0.9126    1.0360    1.1591
## 
## Concordance= 0.64  (se = 0.027 )
## Rsquare= 0.074   (max possible= 0.956 )
## Likelihood ratio test= 33.27  on 7 df,   p=2.362e-05
## Wald test            = 32.11  on 7 df,   p=3.871e-05
## Score (logrank) test = 33.53  on 7 df,   p=2.11e-05

  • The model coefficients are the \(\beta_i\).
  • Some covariates are binary (fin, paro), others are continuous (age, prio).
  • The exponentiated coefficient represents the multiplicative effects on the hazard: the hazard or risk ratio.
  • A ratio means the ‘odds’, not the ‘probability’.
  • An estimated hazard ratio > 1 indicates the covariate is associated with an increased hazard of recidivism (the event of interest); and vice versa.
  • An estimated hazard ratio = 1 indicates no association between the covariate and hazard.

For example, the hazard ratio for receiving financial aid (finyes) is 0.6843: the hazard of recidivism is about 68% of the hazard for those who did not receive financial aid (the baseline). Since the ratio is less than 1, those who received financial aid have a lower hazard of rearrest.

Fixed-continuous variable

Results as the percentage change in the hazard: \((e^\beta - 1) * 100\).

age, for example: \(e^\beta\) = 0.944. Therefore, (0.944 -1) * 100 = -5.6%. For every one-year increase in an individual’s age at release, the hazard of arrest goes down by an estimated 5.6%. Older prisoners are significantly less likely to be rearrested after release than younger ones.
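This percentage-change computation can be scripted directly (coefficient values copied from the cox_model1 summary above):

```r
# Percentage change in the hazard: (exp(beta) - 1) * 100
pct_change <- function(beta) (exp(beta) - 1) * 100

round(pct_change(-0.05744), 1)  # age:  -5.6 (% per additional year)
round(pct_change( 0.09150), 1)  # prio:  9.6 (% per prior conviction)
```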

Binary variable

The hazard ratio for the not-married group is 1.54, which means that those who are not married have a 1.54 times greater hazard of rearrest than those who are married.

There is a difference in interpretation of the hazard ratio between binary and fixed-continuous covariates. Graphically, a binary variable moves the curve up and down (changing the intercept), while a fixed-continuous variable affects the slope. In the case of age, the slope is decreasing.

Finally, since the hazard ratio is a constant, we could also state it in terms of odds: the odds of rearrest for a person who is not married are 1.5 times those of a married person (50% higher).

Or we could state it in terms of a probability: odds of 1.5 correspond to 1.5 / (1 + 1.5) = 60%, the chance that, of an unmarried and a married person, the unmarried one is the one rearrested.

Here are the exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1, or 100%).

We can then compute the percentage change of the fixed-continuous variables.

The overall significance

The likelihood-ratio test, the Wald test and the score log-rank (chi-square) statistics all evaluate the hypothesis that all coefficients are zero.

Rsquare= 0.074   (max possible= 0.956 )
Likelihood ratio test= 33.27  on 7 df,   p=2.362e-05
Wald test            = 32.11  on 7 df,   p=3.871e-05
Score (logrank) test = 33.53  on 7 df,   p=2.11e-05

These statistics are all significant (p-values < 5%). We can therefore reject the null hypothesis; at least one of the covariates contributes significantly to the explanation of the duration to the event, which in this case is rearrest.

The likelihood-ratio test is generally preferred over other tests.

The intercept

There is no estimate of the intercept; it is already incorporated in the baseline hazard function.

The baseline hazard function is nonparametric. We do not plug in parameters and trace the hazard curve according to a distribution (Weibull, for example). Instead, the hazard is estimated from the observed event times, much as a regression line is estimated from data points.

The hazard at a given time interval is the probability of the given event (e.g. arrest=1, the status of the Surv object) occurring in that time period, given that the subject survived through all of the prior time intervals.

Plotting the estimated survival function

Compare

To assess, for example, how the receipt of financial aid (fin) impacted rearrest times, we can stratify the analysis.

cox_model2 <- coxph(Surv(week, arrest) ~ age + race + wexp + mar + paro + prio + strata(fin),
                    method = 'efron',
                    data = Rossi)

Plotting the results shows the difference.

Cox PH regression model with time-varying covariates

A general hypothesis of the recidivism study was that those who received financial aid would have lower arrest rates than those who did not.

However, financial aid was only provided during the first 13 weeks. When we examine the data, we also see that the proportion of prisoners working full-time increases steadily over the one-year period. For this reason, it is also useful to test a time-varying hypothesis, which predicts that the effect of financial aid is greatest during the first 13 weeks.

Within the Rossi data frame, there are weekly employment status variables (i.e. emp1 to emp52), which can be used as time-varying covariates to predict the time until rearrest.

##   week arrest fin age  race wexp         mar paro prio educ
## 1   20      1  no  27 black   no not married  yes    3    3
##   emp1 emp2 emp3 emp4 emp5 emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13
## 1   no   no   no   no   no   no   no   no   no    no    no    no    no
##   emp14 emp15 emp16 emp17 emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25
## 1    no    no    no    no    no    no    no  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp26 emp27 emp28 emp29 emp30 emp31 emp32 emp33 emp34 emp35 emp36 emp37
## 1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp38 emp39 emp40 emp41 emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49
## 1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp50 emp51 emp52
## 1  <NA>  <NA>  <NA>

Since the first individual was arrested at week = 20, the employment indicators are missing for the subsequent weeks.

The first step is to create a person-period (or long) data frame that contains only the non-missing periods of observation (i.e. for person 1, only up to emp20 – after that it is only NAs).

The variables that we are concerned with during the restructuring of the data are the emp1 to emp52 variables, which chart the employment status of the individuals at each week of the study.

If we look at the first 13 variables only, we see that the first five individuals were not employed within the first three weeks of being released from prison (i.e. emp1 to emp3 equals no).

  week arrest fin age  race wexp         mar paro prio educ emp1 emp2 emp3
1   20      1  no  27 black   no not married  yes    3    3   no   no   no
2   17      1  no  18 black   no not married  yes    8    4   no   no   no
3   25      1  no  19 other  yes not married  yes   13    3   no   no   no
4   52      0 yes  23 black  yes     married  yes    1    5   no   no   no
5   52      0  no  19 other  yes not married  yes    3    3   no   no   no

Creating a subject-period data frame

Restructure the data frame using a custom function: unfold. The unfold function resembles the melt function from the reshape2 package or the gather function from the tidyr package. The unfold function is located here.

We want to go from a wide data frame to a long data frame. The dataset Rossi has 432 rows and 62 columns.

With the new person-period data frame, the time intervals or periods are stacked to form a very long data frame (here, 19809 rows by 14 columns).

## [1] 19809    14

A part of the new Rossi_long data frame.

     start stop arrest.time week arrest fin age  race wexp         mar paro prio educ employed
1.1      0    1           0   20      1  no  27 black   no not married  yes    3    3       no
1.2      1    2           0   20      1  no  27 black   no not married  yes    3    3       no
1.3      2    3           0   20      1  no  27 black   no not married  yes    3    3       no
1.4      3    4           0   20      1  no  27 black   no not married  yes    3    3       no
1.5      4    5           0   20      1  no  27 black   no not married  yes    3    3       no
1.6      5    6           0   20      1  no  27 black   no not married  yes    3    3       no
1.7      6    7           0   20      1  no  27 black   no not married  yes    3    3       no
1.8      7    8           0   20      1  no  27 black   no not married  yes    3    3       no
1.9      8    9           0   20      1  no  27 black   no not married  yes    3    3       no
1.10     9   10           0   20      1  no  27 black   no not married  yes    3    3       no
1.11    10   11           0   20      1  no  27 black   no not married  yes    3    3       no
1.12    11   12           0   20      1  no  27 black   no not married  yes    3    3       no
1.13    12   13           0   20      1  no  27 black   no not married  yes    3    3       no
1.14    13   14           0   20      1  no  27 black   no not married  yes    3    3       no
1.15    14   15           0   20      1  no  27 black   no not married  yes    3    3       no
1.16    15   16           0   20      1  no  27 black   no not married  yes    3    3       no
1.17    16   17           0   20      1  no  27 black   no not married  yes    3    3       no
1.18    17   18           0   20      1  no  27 black   no not married  yes    3    3       no
1.19    18   19           0   20      1  no  27 black   no not married  yes    3    3       no
1.20    19   20           1   20      1  no  27 black   no not married  yes    3    3       no

The first individual was rearrested at 20 weeks. If we examine the first 20 rows, we see that the arrest.time variable has a value of 0 until week 20, when the individual was rearrested, where it then takes the value 1.
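The expansion itself can be sketched in base R on a toy data frame (an illustration of the idea, not the actual unfold implementation):

```r
# Toy person-period expansion: one row per week survived, with the event
# indicator set on the subject's final row only
wide <- data.frame(id = 1:2, week = c(3, 2), arrest = c(1, 0))

long <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i) {
  w <- wide$week[i]
  data.frame(id          = wide$id[i],
             start       = 0:(w - 1),
             stop        = 1:w,
             arrest.time = c(rep(0, w - 1), wide$arrest[i]))
}))

long  # 5 rows: 3 for subject 1 (event in the last one), 2 for subject 2
```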

We estimate the Cox model with time-varying covariates.

library(survival)

cox_model3 <- coxph(Surv(start, stop, arrest.time) ~ fin + age + race + wexp + mar + paro + prio + employed, data = Rossi_long)

summary(cox_model3)
## Call:
## coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + 
##     race + wexp + mar + paro + prio + employed, data = Rossi_long)
## 
##   n= 19809, number of events= 114 
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)    
## finyes         -0.35672   0.69997  0.19113 -1.866  0.06198 .  
## age            -0.04634   0.95472  0.02174 -2.132  0.03301 *  
## raceother      -0.33866   0.71273  0.30960 -1.094  0.27402    
## wexpyes        -0.02555   0.97477  0.21142 -0.121  0.90380    
## marnot married  0.29375   1.34145  0.38303  0.767  0.44314    
## paroyes        -0.06421   0.93781  0.19468 -0.330  0.74156    
## prio            0.08514   1.08887  0.02896  2.940  0.00328 ** 
## employedyes    -1.32832   0.26492  0.25072 -5.298 1.17e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## finyes            0.7000     1.4286    0.4813    1.0180
## age               0.9547     1.0474    0.9149    0.9963
## raceother         0.7127     1.4031    0.3885    1.3075
## wexpyes           0.9748     1.0259    0.6441    1.4753
## marnot married    1.3414     0.7455    0.6332    2.8419
## paroyes           0.9378     1.0663    0.6403    1.3735
## prio              1.0889     0.9184    1.0288    1.1525
## employedyes       0.2649     3.7747    0.1621    0.4330
## 
## Concordance= 0.708  (se = 0.027 )
## Rsquare= 0.003   (max possible= 0.066 )
## Likelihood ratio test= 68.65  on 8 df,   p=9.114e-12
## Wald test            = 56.15  on 8 df,   p=2.632e-09
## Score (logrank) test = 64.48  on 8 df,   p=6.102e-11

Here are the exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1 or 100%).

Compare to the baseline (\((e^\beta - 1) * 100\)). However, we have to disregard the binary variables in this second barplot.

The most striking difference is that the newly included time-varying covariate (employedyes) has the strongest effect of all coefficients in the model. It is also the most significant. This variable measures whether an individual was employed or not each week.

With a hazard ratio of 0.2649 and a percentage change in the hazard of -73.5%, we can conclude that the hazard of rearrest for individuals who were employed full-time is about 26.5% of the hazard for those who were not employed full-time.

We plot the hazard function and the survival curve.

Cox PH regression model with time-varying covariates – Lagged covariates

By lagging a variable, we ensure that we correctly specify that the cause precedes the effect.

Most time-varying covariates should be entered in a lagged form to avoid this problem of simultaneity of cause and effect.

In the previous section, we prepared the data to predict arrests in a given week in relation to the individual’s employment status in the very same week. Being arrested on a Monday makes it impossible to work for the rest of the week.

In this case, being arrested influences one's employment status. But since the data are only registered on a weekly basis, it could appear that the person's unemployment status had influenced their arrest status. The problem of reverse causation is a common one with time-varying covariates and is even more acute when event times are measured in broader units such as months or years.

Creating a subject-period file with lagged variables to reduce problems of causal ordering

To lag the variable, we need to specify the employment status in the previous week. The unfold function has a parameter for that.
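Lagging within a person-period frame amounts to shifting each subject's covariate down by one period. A base-R sketch on toy data (the actual work is done by unfold's lag argument):

```r
# Lag a time-varying covariate by one period within each subject
long <- data.frame(id       = c(1, 1, 1, 2, 2),
                   week     = c(1, 2, 3, 1, 2),
                   employed = c(0, 1, 1, 1, 0))

# Within each id: drop the last value, prepend NA for the first period
long$employed_lag <- ave(long$employed, long$id,
                         FUN = function(x) c(NA, head(x, -1)))

long$employed_lag
## [1] NA  0  1 NA  1
```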

Unfold the data, rerun the survival analysis and print the results.

Rossi_long2 <- unfold(Rossi, 'week', 'arrest', 11:62, 'employed', lag = 1)

library(survival)

cox_model4 <- coxph(Surv(start, stop, arrest.time) ~ fin + age + race + wexp + mar + paro + prio + employed, data = Rossi_long2)

summary(cox_model4)
## Call:
## coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + 
##     race + wexp + mar + paro + prio + employed, data = Rossi_long2)
## 
##   n= 19377, number of events= 113 
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)    
## finyes         -0.35130   0.70377  0.19181 -1.831 0.067035 .  
## age            -0.04977   0.95144  0.02189 -2.274 0.022969 *  
## raceother      -0.32147   0.72508  0.30912 -1.040 0.298369    
## wexpyes        -0.04764   0.95348  0.21323 -0.223 0.823207    
## marnot married  0.34476   1.41165  0.38322  0.900 0.368310    
## paroyes        -0.04710   0.95399  0.19630 -0.240 0.810375    
## prio            0.09199   1.09635  0.02880  3.194 0.001402 ** 
## employedyes    -0.78689   0.45526  0.21808 -3.608 0.000308 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## finyes            0.7038     1.4209    0.4832    1.0250
## age               0.9514     1.0510    0.9115    0.9932
## raceother         0.7251     1.3792    0.3956    1.3290
## wexpyes           0.9535     1.0488    0.6278    1.4481
## marnot married    1.4116     0.7084    0.6661    2.9917
## paroyes           0.9540     1.0482    0.6493    1.4016
## prio              1.0964     0.9121    1.0362    1.1600
## employedyes       0.4553     2.1966    0.2969    0.6981
## 
## Concordance= 0.67  (se = 0.027 )
## Rsquare= 0.002   (max possible= 0.067 )
## Likelihood ratio test= 47.16  on 8 df,   p=1.43e-07
## Wald test            = 43.37  on 8 df,   p=7.483e-07
## Score (logrank) test = 46.4  on 8 df,   p=1.993e-07

The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1 or 100%).

The exponentiated coefficients compared to the baseline (\((e^\beta - 1) * 100\)). We rule out the results on binary variables on this second barplot.

The coefficient for the lagged employment, employedyes, is still highly significant, but its effect is much smaller: the coefficient clearly drops (from -1.33 to -0.79).

The risk of arrest (exp(coef)) for individuals who were employed in the previous week is around 45.5% of the risk for those who were not employed.

The ‘lag’ model is more realistic: the hazard is larger, the duration is shorter, and the survival probability is smaller.

Cox PH regression model with time-varying covariates – Interactions

Interactions with time as time-varying covariates – part 1

Another approach to examining time-dependence is by splitting the episode by a particular time point.

We want to determine whether financial aid would lower rearrest rates. Since financial aid was only provided during the first 13 weeks, we can hypothesize that the effect of financial aid should be stronger for the first 13 weeks and then wane after this period.

Creating episode-splitting at time intervals

We cut or split the dataset at ‘13 weeks’ using the survSplit function and create a data frame named Rossi_split.

Rossi2 <- Rossi
Rossi2$id <- 1:432

Rossi_split <- survSplit(Rossi2, 
                         cut = c(13),
                         end = 'week',
                         event = 'arrest',
                         start = 'start',
                         episode = 'i')

# Reorder
Rossi_split <- Rossi_split[order(Rossi_split$id, Rossi_split$start),]

Rossi_split[1:2,]
##   fin age  race wexp         mar paro prio educ emp1 emp2 emp3 emp4 emp5
## 1  no  27 black   no not married  yes    3    3   no   no   no   no   no
## 2  no  27 black   no not married  yes    3    3   no   no   no   no   no
##   emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13 emp14 emp15 emp16 emp17
## 1   no   no   no   no    no    no    no    no    no    no    no    no
## 2   no   no   no   no    no    no    no    no    no    no    no    no
##   emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25 emp26 emp27 emp28 emp29
## 1    no    no    no  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
## 2    no    no    no  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp30 emp31 emp32 emp33 emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41
## 1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
## 2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49 emp50 emp51 emp52 id
## 1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  1
## 2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  1
##   start week arrest i
## 1     0   13      0 1
## 2    13   20      1 2

Above, the first subject (id=1) had an event (i.e. rearrest) in week 20 and therefore has one split at week 13. The data frame shows 2 rows (i=1 and i=2): the first interval (weeks 1-13), where arrest=0, and the second interval (weeks 14-20), where arrest=1.

There was no event between 0 and 13 weeks. The newly defined start variable takes the value 13 for observations whose survival time is greater than or equal to 13 weeks, and 0 otherwise.

Below, another subject (id=2).

Rossi_split[3:4,]
##   fin age  race wexp         mar paro prio educ emp1 emp2 emp3 emp4 emp5
## 3  no  18 black   no not married  yes    8    4   no   no   no   no   no
## 4  no  18 black   no not married  yes    8    4   no   no   no   no   no
##   emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13 emp14 emp15 emp16 emp17
## 3   no   no   no   no   yes   yes   yes   yes   yes    no    no    no
## 4   no   no   no   no   yes   yes   yes   yes   yes    no    no    no
##   emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25 emp26 emp27 emp28 emp29
## 3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
## 4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp30 emp31 emp32 emp33 emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41
## 3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
## 4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##   emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49 emp50 emp51 emp52 id
## 3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  2
## 4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  2
##   start week arrest i
## 3     0   13      0 1
## 4    13   17      1 2

We create two new Heaviside function variables.

Rossi_split$fin <- as.numeric(Rossi_split$fin)

Rossi_split$hv1 <- Rossi_split$fin*(1 - (Rossi_split$start / 13))

Rossi_split$hv2 <- Rossi_split$fin*(Rossi_split$start / 13)
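A quick base-R check of how the two Heaviside terms behave (with fin set to 1 here purely for illustration): before the split (start = 0) only hv1 is active; from the split onward (start = 13) only hv2 is.

```r
fin   <- 1            # financial aid indicator (illustrative value)
start <- c(0, 13)     # episode start: before and after the 13-week cut

hv1 <- fin * (1 - start / 13)  # active in the first episode only
hv2 <- fin * (start / 13)      # active in the second episode only

cbind(start, hv1, hv2)
#      start hv1 hv2
# [1,]     0   1   0
# [2,]    13   0   1
```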

We run a time-dependent model using these variables.

library(survival)

cox_model5 <- coxph(Surv(week, arrest) ~ age + prio + hv1 + hv2, data = Rossi_split)

summary(cox_model5)
## Call:
## coxph(formula = Surv(week, arrest) ~ age + prio + hv1 + hv2, 
##     data = Rossi_split)
## 
##   n= 844, number of events= 114 
## 
##          coef exp(coef) se(coef)      z Pr(>|z|)    
## age  -0.06705   0.93515  0.02087 -3.212 0.001316 ** 
## prio  0.09955   1.10467  0.02746  3.625 0.000289 ***
## hv1   0.70481   2.02345  0.33179  2.124 0.033651 *  
## hv2  -0.54135   0.58196  0.20345 -2.661 0.007794 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## age     0.9352     1.0693    0.8977    0.9742
## prio    1.1047     0.9052    1.0468    1.1658
## hv1     2.0235     0.4942    1.0560    3.8772
## hv2     0.5820     1.7183    0.3906    0.8671
## 
## Concordance= 0.679  (se = 0.029 )
## Rsquare= 0.054   (max possible= 0.805 )
## Likelihood ratio test= 46.73  on 4 df,   p=1.732e-09
## Wald test            = 41.24  on 4 df,   p=2.398e-08
## Score (logrank) test = 45.21  on 4 df,   p=3.6e-09

The results show a significant hazard ratio (at 0.05 significance level) of 2.023 for the effect of no financial aid in the first 13 weeks (hv1).

This is contrasted with the hazard ratio of 0.582 (significant at the 0.01 level) for second 13 weeks (hv2).

The exponentiated coefficients (increasing or decreasing the risk of rearrest) are plotted against the baseline (y = 1).

The exponentiated coefficients compared to the baseline (\((e^\beta - 1) * 100\)).

\(HR\) being the hazard ratio.

\[t < 13~weeks:~HR = e^{0.7048} = 2.023\]

\[t \geq 13~weeks:~HR = e^{-0.5413} = 0.582\]
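The two hazard ratios follow directly from the estimated coefficients (note the sign on the second one); a base-R check:

```r
b_hv1 <-  0.70481   # coefficient for fin in the first 13 weeks
b_hv2 <- -0.54135   # coefficient for fin after 13 weeks

hr_early <- exp(b_hv1)  # about 2.023: higher rearrest risk early on
hr_late  <- exp(b_hv2)  # about 0.582: lower rearrest risk afterwards

c(hr_early, hr_late)
```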

Note that the variables hv1 and hv2 are the product (interaction) terms \(fin \times g1\) and \(fin \times g2\), g1 and g2 being dummies.

\[if~t < 13,~g1 = 1;~otherwise,~g1 = 0\] \[if~t \geq 13,~g2 = 1;~otherwise,~g2 = 0\]

The estimates should be interpreted just as we would any interaction effect, which in this case has ‘Financial aid = yes’ as the reference group. No financial assistance clearly shows higher rearrest rates in the first weeks, since exp(coef) > 1.

There is a large difference in survival times (i.e. not being rearrested) after 13 weeks for those who did not receive any financial compensation at the beginning, compared to a smaller difference in survival times prior to 13 weeks (exp(coef) < 1).

Overall, those who received financial aid always did better in terms of survival at any time than those who did not.

Cox PH regression model – Tests, interaction, stratification, and measures

Testing overall model adequacy

One way to assess whether your model is adequately specified is the diagnostic approach of the Cox-Snell residuals: if the model is correct, these residuals should follow a unit exponential distribution (a constant hazard of 1), and a plot based on them should form a straight line with a slope equal to 1.

To the extent that this plot forms a 45-degree straight line through the origin, the model is correct.

First, we scale down our recidivism model, eliminating the covariates whose coefficients were not statistically significant in cox_model1.

library(survival)

cox_model6 <- coxph(Surv(week, arrest) ~ fin + age + prio, data = Rossi)

cox_model6
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi)
## 
##           coef exp(coef) se(coef)     z       p
## finyes -0.3470    0.7068   0.1902 -1.82 0.06820
## age    -0.0671    0.9351   0.0209 -3.22 0.00129
## prio    0.0969    1.1017   0.0273  3.56 0.00038
## 
## Likelihood ratio test=29.1  on 3 df, p=2.19e-06
## n= 432, number of events= 114

Second, we need to compute an estimate of the survival function of the residuals using the KM estimator. Finally, we construct the graphic ourselves (there is no such function included in the survival package).

r <- cox_model6$residuals

# recover Cox Snell residual values from modified values
rr <- Rossi$arrest - r
fit <- survfit(Surv(rr, Rossi$arrest) ~ 1) # KM estimate
S <- fit$surv
T <- fit$time

# plot residuals
plot(S ~ T, xlim = range(T), ylim = c(0,1), xlab = 'time', ylab = 'Cox-Snell residual values', col = 'cyan2')

T_0 <- seq(0, max(T), 0.05)
S_0 <- exp(-T_0)

lines(S_0 ~ T_0, col = 'red3', lty = 3, lwd = 2)

Since the line fits the data and there are no serious deviations from the line, the Cox model appears to be adequate. If there were strong deviations, it would suggest that we have omitted a key covariate or that the functional form of one or more of the covariates was incorrect.

It is a fast and simple graphical check. However, other tests will assess the adequacy of a model.

Checking proportional hazards

Tests and graphical diagnostics for proportional hazards may be based on the ‘scaled Schoenfeld residuals’; these can be obtained directly as residuals(model, type = 'scaledsch').

cox_model6
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi)
## 
##           coef exp(coef) se(coef)     z       p
## finyes -0.3470    0.7068   0.1902 -1.82 0.06820
## age    -0.0671    0.9351   0.0209 -3.22 0.00129
## prio    0.0969    1.1017   0.0273  3.56 0.00038
## 
## Likelihood ratio test=29.1  on 3 df, p=2.19e-06
## n= 432, number of events= 114

The coefficient for financial aid, which is the focus of the study, now has a p-value greater than 5%. This p-value is based on a two-sided test; a one-sided test would be appropriate here because we expect the coefficient to be negative.

We test the proportional-hazards assumption (the null hypothesis); we test each covariate.

cox.zph(cox_model6)
##             rho   chisq      p
## finyes -0.00657 0.00507 0.9433
## age    -0.20976 6.54147 0.0105
## prio   -0.08004 0.77288 0.3793
## GLOBAL       NA 7.13046 0.0679

There is strong evidence of non-proportional hazards for age (we reject the null hypothesis for this covariate).

The non-proportionality of age weakens the global test. The global test does not reject the null hypothesis, but its p-value is barely above 5%; the support for proportionality is weak.

These tests are sensitive to linear trends in the hazard.

We can also assess the assumption visually.

The interpretation of these graphs is facilitated by using a smoothing spline, shown on each graph by a solid line; the broken lines represent 2-standard-error envelopes around the fit.

We seek to have straight horizontal lines. Sloped lines are indicative of non-proportional hazards. The assumption of proportional hazards appears to be supported for the binary covariate finyes and the continuous covariate prio, but there appears to be a downward trend in the plot for the continuous covariate age.

Interactions with time as time-varying covariates – part 2

One way of accommodating non-proportional hazards is to build interactions between covariates and time into the Cox regression model; such interactions are themselves time-dependent covariates.

library(survival)

cox_model7 <- coxph(Surv(start, stop, arrest.time) ~
    fin + age + age:stop + prio, data = Rossi_long)

cox_model7
## Call:
## coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + 
##     age:stop + prio, data = Rossi_long)
## 
##              coef exp(coef) se(coef)     z       p
## finyes   -0.34856   0.70570  0.19023 -1.83 0.06690
## age       0.03228   1.03280  0.03943  0.82 0.41301
## prio      0.09818   1.10316  0.02726  3.60 0.00032
## age:stop -0.00383   0.99617  0.00147 -2.61 0.00899
## 
## Likelihood ratio test=36  on 4 df, p=2.85e-07
## n= 19809, number of events= 114

As expected, the coefficient for the interaction is negative and highly statistically significant: the effect of age declines with time.

cox.zph(cox_model7)
##              rho   chisq     p
## finyes   -0.0068 0.00544 0.941
## age      -0.0471 0.36703 0.545
## prio     -0.0849 0.86449 0.352
## age:stop  0.0567 0.55701 0.455
## GLOBAL        NA 1.41901 0.841

All p-values are above 5%. We can now accept the proportional-hazards assumption (the null hypothesis).

The age slope has seemingly lost its downward trend.

Stratification

An alternative to incorporating an interaction in the model is to divide the data into strata based on the value of one or more covariates.

Each stratum is permitted to have a different baseline hazard function, while the coefficients of the remaining covariates are assumed to be constant across all strata.

We do not have to assume a particular form of interaction between the stratifying covariates and time. A disadvantage is the inability to examine the effects of the stratifying covariates. Stratification is most natural when a covariate is binary (or limited to a few categories).

In our example, age takes on many different values, but we can create age categories by dissecting the variable into class intervals.

We use the recode function from the car package to categorize age.

library(car)

Rossi3 <- Rossi

Rossi3$age_cat <- recode(Rossi3$age, 'lo:19=1; 20:25=2; 26:30=3; 31:hi=4')
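An equivalent categorization with base R's cut(), in case car is unavailable (the break points mirror the recode call above; the input ages here are illustrative):

```r
ages <- c(18, 22, 27, 35)  # illustrative values

# Right-closed intervals: (-Inf,19], (19,25], (25,30], (30,Inf]
age_cat <- cut(ages,
               breaks = c(-Inf, 19, 25, 30, Inf),
               labels = 1:4)
age_cat
# [1] 1 2 3 4
# Levels: 1 2 3 4
```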

Most of the individuals in the dataset are in the second age category, 20 to 25.

We estimate a stratified Cox regression model by including a strata function on the righthand side of the model formula. And we test the results.

library(survival)

cox_model8 <- coxph(Surv(week, arrest) ~
    fin + prio + strata(age_cat), data = Rossi3)

cox_model8
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + prio + strata(age_cat), 
##     data = Rossi3)
## 
##           coef exp(coef) se(coef)     z      p
## finyes -0.3408    0.7112   0.1902 -1.79 0.0732
## prio    0.0941    1.0986   0.0270  3.48 0.0005
## 
## Likelihood ratio test=13.4  on 2 df, p=0.00122
## n= 432, number of events= 114
cox.zph(cox_model8)
##            rho  chisq     p
## finyes -0.0183 0.0392 0.843
## prio   -0.0771 0.6859 0.408
## GLOBAL      NA 0.7299 0.694

All p-values are above 5%. We accept the proportional-hazards assumption (the null hypothesis).

The method of stratification has two main advantages over interactions.

First, when we include an interaction term, we are required to choose a particular form of interaction. The stratification approach is more flexible and allows for any kind of change in the effect of a variable over time.

Second, stratification is efficient and takes less computation time.

The first disadvantage is that there are no estimates for the effect of the stratifying variable. For this reason, it would not be advisable to stratify on a key variable.

A second disadvantage is that the interaction effect of the covariate with time may offer more efficient estimates.

Finally, it is not possible to test for the main effect of the stratifying variable or its interaction with time.

Influential observations

Another common problem is that there are influential observations in your data that influence your estimates. It is important to know whether the removal of one observation would result in a substantial increase or decrease of the relative hazard.

type = 'dfbetas' produces the estimated changes in the coefficients divided by their standard errors (type = 'dfbeta' gives the unscaled changes). We compute the dfbeta residuals on cox_model6, the model without interactions or stratification.

dfbeta <- residuals(cox_model6, type = 'dfbeta')

We then plot the residuals per covariate.
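A sketch of one such index plot per covariate. To keep the snippet self-contained, the dfbeta matrix is simulated here; in the document it would be the matrix returned by residuals(cox_model6, type = 'dfbeta') above.

```r
set.seed(1)

# Hypothetical stand-in for the dfbeta matrix (432 subjects, 3 covariates)
dfbeta <- matrix(rnorm(432 * 3, sd = 0.01), ncol = 3,
                 dimnames = list(NULL, c('fin', 'age', 'prio')))

# One index plot per covariate: look for points that stand out
par(mfrow = c(1, 3))
for (j in colnames(dfbeta)) {
  plot(dfbeta[, j], xlab = 'Index', ylab = paste('dfbeta for', j))
  abline(h = 0, lty = 2)
}
```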

Comparing the magnitudes of the largest dfbeta values with the regression coefficients suggests that none of the observations is individually influential.

Some of the dfbeta values for age are large compared with the others.

It appears to be a clustering of older respondents. It might be useful to do further sensitivity analyses.

We need to consider whether the model we are estimating should be changed to accommodate the influential observations.

Nonlinearity

As with linear and generalized linear models, incorrectly specifying the functional form of a covariate in the parametric part of a Cox regression model is a potential problem.

The ‘martingale residuals’ may be plotted against covariates to detect nonlinearity. They may also be used to form partial-residual plots (in the manner of linear and generalized linear models).

Nonlinearity is not an issue for financial aid, because this covariate is dichotomous; we will only compute the ‘martingale residuals’ for two covariates of cox_model6 (the model without interactions or stratification).

res <- residuals(cox_model6, type = 'martingale')

X <- as.matrix(Rossi3[, c('age', 'prio')])
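The section stops short of the plots themselves. A self-contained sketch of the idea follows, with simulated values standing in for the res and X objects above; a smooth through the points that bends away from a horizontal line suggests nonlinearity.

```r
set.seed(2)

# Hypothetical stand-ins for the covariates and martingale residuals
age  <- sample(17:44, 432, replace = TRUE)
prio <- rpois(432, 3)
X    <- cbind(age = age, prio = prio)
res  <- pmin(rnorm(432, sd = 0.7), 1)  # martingale residuals never exceed 1

# Residuals against each covariate, with a lowess smooth
par(mfrow = c(1, 2))
for (j in colnames(X)) {
  scatter.smooth(X[, j], res, xlab = j, ylab = 'Martingale residuals')
  abline(h = 0, lty = 2)
}
```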



4, Recapitulation

We will not tackle the mathematical explanations and keep Greek equations to the minimum. The Terminology section is a good reference for the principal key words.

Methodology

Here is a tentative checklist/workflow for performing survival analyses.

Step 1

What is the shape of the baseline hazard function?

  • Known.
    • Go for a parametric model, specify the shape of hazard (exploratory analysis of the dataset) and choose the appropriate model.
  • Unknown.
    • Go for a Cox model.

Step 2

How do the covariates affect the hazard rate?

  • Covariates go up or down in a multiplicative manner (akin to a logit/probit).
    • Go for the PH assumption: exponential, Weibull, Gompertz.
  • Covariates assumed to multiply the time scale (akin to an OLS model).
    • Go for the AFT assumption: exponential, Weibull, gamma, log-normal, log-logistic.

Step 3

  • Try alternative specifications.
  • Test which model fits your data the best by assessing the model fit.
  • Assess the overall goodness of fit:
    • perform the log-likelihood and likelihood ratio tests,
    • compute and compare the AIC,
    • compare standard errors,
    • test the overall model adequacy with the Cox-Snell residuals (plot),
    • test the proportional hazards assumption (Schoenfeld residuals and plots),
    • check for influential observations (score residuals and plots),
    • assessing nonlinearity (Martingale residual and component-plus-residual plots).

Step 4

  • Deal with violations and problems.
  • Adjust the model, add/remove covariates or change the model.
  • Adopt advanced techniques.

1, Non-parametric Methods: the Kaplan-Meier (KM) Estimator

  • Strengths: simplicity, no assumption about the distribution of the hazard (whether it is constant over time, accelerating, U-shaped, etc). Good for predicting.
  • Weaknesses: takes only one covariate.

Worked examples:

  • leukemia_fit, a base model with no covariate and graphics.
  • leukemia_fit2 with one covariate and graphics.
  • leukemia_fit2 with (binary) stratified results (using the covariate).

Non-parametric methods also include actuarial life tables and the Altschuler–Nelson estimator (both are not covered).

2 & 3, Continuous-time (data) methods

  • 2, Parametric Methods.
    • Proportional Hazard (PH) parametrization.
    • Accelerated Failure Time (AFT) parametrization.
  • 3, Semi-parametric Methods.

2a, A note on PH and AFT parametrizations

  • The PH-multiplicative assumption is useful for a comparison of hazards.
  • A negative coefficient means that the risk of the event decreases, resulting in a longer duration.
  • Done with the eha package.
  • We can plot the results.

  • The AFT-additive assumption is useful for a comparison of survival times.
  • The acceleration factor is the key measure in AFT models and is used to evaluate the effect of the covariate(s) on survival time.
  • Done with the survival package.
  • We cannot plot the results since the focus is on coefficient estimation.

2b, Parametric PH models

  • Strengths: time does not need to be subdivided. Handles left and interval censoring. The use of multiple covariates. Good for predicting once we have computed the covariate coefficients, just like with OLS or logistic regressions.
  • Weaknesses: fitting the right distribution.

Specs:

  • \(h(t)\) is the hazard function; the ‘risk’ of event occurrence given covariates.
  • \(log \big(h(t) \big)\) is the dependent variable; covariates are explanatory variables.
  • \(t\) is the continuous time.

We seek to find the best distribution to fit the hazard function. Among the most common models:

The exponential model

\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2\]

a special case of the Weibull model.

\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2 +c~log(t)\]

The exponential model is the special case where c = 0: the \(c~log(t)\) term vanishes and the hazard is constant over time (equivalently, a Weibull shape parameter of 1).

Worked examples:

  • exp_ph_model and graphics

The piecewise constant exponential model

Similar to the exponential model, with the additional assumption that time is divided into different periods and that the hazard rate is then constant within any of these time periods. In other words, a piecewise exponential model gives us the flexibility of the Cox model with the added advantage of being able to estimate the shape of the hazard function.

Worked examples with 4 time segments:

  • pe_ph_model and graphics (with the eha package).
  • pe_ph_model2 (with the survival package).

The Weibull model

\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2 +c~log(t)\]

  • c (or rho, Scale, acceleration factor, sigma or other names) is the time coefficient.
  • This acceleration factor can be positive (increasing risk) or negative (decreasing risk); in absolute value, it can be less than 1 (decreasing returns) or more than 1 (increasing returns).
  • When we plot the hazard function, we see a curve (positive or negative, accelerating or decelerating).

Worked examples:

  • weibull_ph_model and graphics.

The Gompertz model (not covered)

\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2 +ct\]

  • c is a time coefficient.
  • This accelerator factor can be positive (increasing risk) or negative (decreasing risk).
  • When we plot the hazard function, we see a curve (positive or negative) flattening over time (less than 1).
  • The Gompertz distribution has an S-shape like the logistic distribution.
  • The Gompertz-Makeham model, a special case, states that the human death rate is the sum of an age-independent component and an age-dependent component which increases exponentially with age.

2c, Parametric AFT models

  • Strengths: time does not need to be subdivided. Handles left and interval censoring. The use of multiple covariates. Good for predicting once we have computed the covariate coefficients, just like with OLS or logistic regressions.
  • Weaknesses: fitting the right distribution.

Specs:

  • \(T\) is the elapsed time until an event occurs given covariates.
  • \(log~T\) is the dependent variable; covariates are explanatory variables.
  • \(t\) is the continuous time.
  • \(\varepsilon\) is the disturbances or residuals.

We seek to find the best distribution to fit the disturbances or residuals.

The general model

\[log~T = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon\]

The most common nonmonotonic distributions:

  • Log-normal.
  • Log-logistic.
  • Generalized gamma.

Worked examples:

  • lognormal_aft_model.
  • loglogistic_aft_model.

We can also apply the AFT model to the exponential, Weibull and Gompertz monotonic distributions.

Worked examples:

  • exp_aft_model.
  • weibull_aft_model.

Other notable distributions:

  • Normal.
  • Log-gamma.
  • Logistic.
  • Extreme value.

2d, The function of time

  • Hereby, the graphic displays several hazard functions.
  • Hazard, failure, odds or risk all mean the same thing.
  • A hazard curve, shape, rate or function all mean the same thing.
  • There is a shape (hazard function) for each situation (data distribution): cancer patients, infantile diseases, machines, objects, customers, etc.
  • Non- and semi-parametric models (e.g. KM and Cox) do not rely on a specified hazard function.
  • Monotonic: the Weibull and Gompertz models can be an increasing or decreasing function of time. In other words, the hazard rises (or declines). The Weibull model can even accelerate or decelerate over time. The Gompertz curve flattens out (decelerates) over time. The exponential model, a special case of the Weibull model, is constant over time.
  • Nonmonotonic: other models can first increase, reach a peak and then decrease (hump, U-shaped). Some have fat tails, others have steep climbs.
  • Piecewise: similar to the exponential model, with the additional assumption that time is subdivided and that the hazard rate is then constant within any of these time periods. There can be two periods or a multitude of periods. There can be increase/decreasing steps or ups and downs.
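A base-R sketch of a few of these shapes on one graphic; the parameter values are arbitrary, chosen only to make the curves visually distinct.

```r
t <- seq(0.01, 5, by = 0.01)

h_exp      <- rep(0.5, length(t))   # exponential: constant hazard
h_weibull  <- 0.75 * t^0.5          # Weibull (shape > 1): monotonically increasing
h_gompertz <- 0.3 * exp(0.4 * t)    # Gompertz: exponentially increasing
h_loglog   <- 2 * t / (1 + t^2)     # log-logistic: hump-shaped (peak at t = 1)

matplot(t, cbind(h_exp, h_weibull, h_gompertz, h_loglog),
        type = 'l', lty = 1, lwd = 2,
        xlab = 'time', ylab = 'hazard')
legend('topleft',
       c('exponential', 'Weibull', 'Gompertz', 'log-logistic'),
       col = 1:4, lty = 1, lwd = 2)
```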

Finding the right distribution (density function) to fit the hazard function (PH) or disturbances (AFT) is crucial. Here are a couple of situations:

  • Following graduation, we want to model the ‘hazard’ of promotion for an individual (the event). If we assume the chance of being promoted is roughly constant from one year of experience to the next, the monotonic exponential model applies.
  • On the other hand, the chance for a doctoral graduate to become associate professor rises over time, but at a decreasing rate. Whether or not you are a promising researcher with a bright publishing record will decide your fate. If you are not promoted after a certain time, you might think of reorienting your career outside academia. Monotonic Weibull or Gompertz models: we can run both models and test which fits best.
  • Mortality is a function of age. Youngsters and elders are more prone to getting sick. U-shaped models (nonmonotonic) such as the log-normal or log-logistic models.
  • And so on.

A study of the data a priori indicates which distribution we should opt for. We might consider less common forms such as the log-normal, log-logistic, or generalized gamma distributions; or even the normal, log-gamma, logistic, and extreme value distributions.

Tests a posteriori measure the goodness of fit of the models.

3, Semi-parametric Methods

  • Strengths: can work with time-constant (like the parametric exponential model) and time-varying covariates (like the other parametric models). Handle left censoring and interval censoring. Multiple covariates. No assumptions about the distribution of the hazard (constant over time, accelerating, U-shaped, etc). Good for predicting.
  • Weaknesses: more flexible methods are time-consuming since we must set lots of parameters!

The Cox PH regression models – time-constant models

\[log \big(h(t) \big) = a(t) + \beta_1x_1 + \beta_2x_2\]

Cox models do not have an intercept term; this term is absorbed into the baseline hazard function.

Worked examples:

  • cox_model1 and graphics.

Stratification

\[log \big(h(t) \big) = a_1(t) + \beta_1x_1 + \beta_2x_2\] \[log \big(h(t) \big) = a_2(t) + \beta_1x_1 + \beta_2x_2\]

Worked examples:

  • cox_model2 with a binary stratification (using one covariate) and graphics.
  • cox_model8 where one covariate is split into 4 categories. cox_model6, a base model and graphics, is a comparative basis for cox_model8.

Time-varying models

\[log \big(h(t) \big) = a(t) + \beta_1x_1 + \beta_2x_2(t)\]

Worked examples:

  • cox_model3 (with a ‘long’ dataset instead of a ‘wide’) and graphics.

Time-varying models with a lagged covariate

\[log \big(h(t) \big) = a(t) + \beta_1x_1 + \beta_2x_2(t-1)\]

Worked examples:

  • cox_model4 (with a ‘long’ dataset instead of a ‘wide’) and comparative plots with cox_model3.

Interaction

\[log \big(h(t) \big) = a(t) + (b+ct)x\]

Worked examples:

  • cox_model5 with an interaction with time-varying covariates.
  • cox_model6, a base model and graphics, vs. cox_model7 with a covariate-time interaction and plot graphics.

  • The piecewise constant exponential models can be considered semi-parametric.

Discrete-time (data) Methods (not covered)

These are models with a binary dependent variable: logit, probit, and complementary log-log. These models are related to survival analysis because a discrete event is a nonrepeated event of a single kind.

  • Strengths: good for predicting once we have computed the regressor coefficients, since these are regression models just like OLS. Take multiple covariates. Take multiple functional forms (linear, polynomial).
  • Weaknesses: time must be subdivided into a set of distinct observational units.

We do not cover these models in detail. We only provide an example.

We have doctoral graduates. The event of interest is ‘promotion to associate professor’ during a time window of 10 years following graduation. These events are recorded in discrete time periods since we know only the year in which the promotion occurred. The regression model captures the conditional probability of a promotion during the 10-year time window depending on several explanatory variables.

Key concepts:

  • The ‘risk’ of event occurrence at each point in time, being promoted, diminishes from year 1 to year 10.
  • The hazard rate is the conditional probability that an event will occur at a particular time to a particular individual.

  • Logistic regression model (logit).
    • Linear: the risk is a function of the explanatory variables such as time. The risk is constant over time.
    • Quadratic: the risk can be curvilinear; a U-shaped risk would mean the risk accelerates (rises faster year after year). Beyond a certain point, the risk continues to rise, but at a decreasing rate. Among many potential regressors, we have two explanatory variables: time and squared time.
    • Other polynomial and functional forms (a mix of linear and logarithmic variables).
  • Linear form.

\[log \bigg( \frac{P(t)}{1 - P(t)} \bigg) = \beta_0 + \beta_1x_1 + \beta_2x_2(t)\]

  • Quadratic form.

\[log \bigg( \frac{P(t)}{1 - P(t)} \bigg) = \beta_0 + \beta_1x_1 + \beta_2x_2(t) + \beta_3t + \beta_4t^2\]

  • Probabilistic regression model (probit), an alternative to the logit.
  • A complementary log-log regression model is equivalent to the Cox PH regression model.

\[log \big[ -log(1 - P(t)) \big] = \beta_0 + \beta_1x_1 + \beta_2x_2(t)\]

  • Also, consider tree-based methods to predict events.
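A self-contained sketch of the discrete-time (person-period) approach with simulated data: each graduate contributes one row per year at risk until promotion or the end of the 10-year window, and glm() fits the conditional probability of the event. The variable names (pubs, promoted) and all parameter values are hypothetical.

```r
set.seed(3)
n <- 300

# One illustrative covariate: publication count at graduation
publications <- rpois(n, 5)

# Build person-period data: one row per person-year at risk
rows <- list()
for (i in 1:n) {
  for (yr in 1:10) {
    # Hypothetical true hazard of promotion in year yr
    p <- plogis(-3 + 0.15 * publications[i] + 0.1 * yr)
    event <- rbinom(1, 1, p)
    rows[[length(rows) + 1]] <- data.frame(id = i, year = yr,
                                           pubs = publications[i],
                                           promoted = event)
    if (event == 1) break   # no more rows once the event occurs
  }
}
pp <- do.call(rbind, rows)

# Discrete-time logit: hazard of promotion as a function of time and covariates
fit <- glm(promoted ~ year + pubs, family = binomial, data = pp)
summary(fit)$coefficients
```

Swapping family = binomial for binomial(link = 'cloglog') gives the complementary log-log variant mentioned above.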

A word on censored and truncated dataset

Let us say we have a study spanning a given time window (defined by a beginning and an ending time period). Censoring occurs for two reasons:

  • The event does not occur during the time window.
  • Some individuals or firms (observations) leave the study (another event forcing them to leave). For example, the event is ‘promotion’ or ‘recidivism’ and the individual dies during the study.

  • Left censoring: an individual has already experienced an event by the time the study begins, but we don’t know when it occurred. It may pose a problem in some models.
  • Interval censoring: an event happened between two time periods (the time window), but we don’t know exactly when.
  • Right censoring: an individual has not experienced an event by the time the study ends and we don’t know when or if it will occur.

An individual that is censored at a particular point in time provides no information about that person’s hazard at that time. There is no way to test the assumption that censoring is non-informative, and there are no methods available to relax it; we can just ignore the problem and hope for the best. Non-parametric, parametric and semi-parametric methods handle left and interval censoring.

Outside survival analysis, censoring may pose problems to OLS models. This is a situation of ‘limited dependent variable’. We can remedy the situation with a Tobit model. By the way, the AER package for running Tobit regressions interfaces with the survreg function of the survival package.

In a truncated sample, we do not even pick up observations that lie outside a certain range: we miss both the dependent and the explanatory variables.



5, Going further (with the Rossi dataset in mind…)

(…or for other situations.)

Advanced techniques

There are several techniques that provide guidance on model-building and diagnostics.

There has also been growing attention to multiple kinds of events, repeated and recurrent events such as multiple rearrests or recurring recidivism. Correlation of event times can occur when subjects that experience a single event belong to particular groups or clusters such as a clinic, family, workplace or geographical region. In other cases, we may face multiple relapses from remission, repeated heart attacks or infections, multiple marriages, births, cabinet durations or unemployment episodes. We can use count data models, methods based on gap times, and on times since origin.

Non-parametric KM models are related to actuarial life tables (in insurance) and half-life models. The latter, along with the parametric exponential models, have very specific applied settings. We supplied a couple of examples with the parametric exponential model. We can also think of the duration of radioactive disintegration and medication absorption (elimination).

Semi-parametric Cox PH regression models are related to Poisson regressions (count models). Survival models measure the odds of one or more events while count models account for the number of events.

The Cox model can have communal covariates, which vary in time outside any individual and are common to all individuals at each specific calendar time. Such a covariate is exogenous: for example, the effect of inflation (rent increases) on rearrest (and whether financial aid can curb the event of rearrest).

We make the implicit assumption that all subjects are homogeneous: we assume that they are prone to experience events or failures in the same way, with the timing of events considered independent from one another. In reality, some subjects are more ‘frail’ and thus more likely to experience an event. Frailty models account for this unobserved heterogeneity; they are sometimes also referred to as multi-level or random effect models. Frailty is an unobserved random proportionality factor that modifies the hazard function of a subject. It is a way to introduce random effects, association and unobserved heterogeneity into duration models.
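A quick sketch of a shared frailty term in coxph(), using the rats litter data shipped with the survival package (here litters play the role of the cluster):

```r
library(survival)

# A gamma frailty shared within each litter captures unobserved
# heterogeneity between litters
fit <- coxph(Surv(time, status) ~ rx + frailty(litter), data = rats)
fit
```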

We can also account for competing risks with the Rossi dataset. We could consider two kinds of rearrest: crimes against property (robbery, burglary, larceny, etc.) and other crimes. We might expect that financial aid would reduce the hazard of arrest for crimes against property, where there is a financial incentive to commit the crime. We can account for dependence among different kinds of events and model a cumulative incidence function (multiple events and multi-state models). We can also model entire histories.
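A minimal competing-risks sketch in base survival, using a multi-state status factor on simulated data (the event labels mirror the hypothetical property/other split above; the first factor level is the censoring code):

```r
library(survival)

set.seed(1)
time <- rexp(200)
status <- factor(sample(c("censor", "property", "other"), 200, replace = TRUE),
                 levels = c("censor", "property", "other"))

# survfit on a multi-state Surv object returns Aalen-Johansen
# cumulative incidence curves, one per event type
fit <- survfit(Surv(time, status) ~ 1)
fit
```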

Sequence analysis is another technique highly intertwined with event history analysis, as it offers the ability to model entire life paths or event history trajectories.

We address causality with the example of the lagged variable in the Cox PH regression model with time-varying covariates. There is a lot more we can take into account: causal inference, predictive causality, counterfactual or treatment effect or reverse causality, Aalen’s additive hazards model, dynamic path analysis, and matching.

Packages

  • survival. Handles non-parametric models and Cox PH models. Also handles parametric AFT models, although plotting them is not supported.
  • eha (event history analysis). Handles parametric PH models, and is also capable of handling parametric AFT models.

  • coxme for frailty models in Cox regression.
  • timereg for extending the Cox model and aareg from survival (Aalen’s additive hazard models).
  • cmprsk for competing risks.

And also lattice, mstate (multi-state and competing risks), TraMineR (sequence analysis), cluster (cluster analysis), reshape2 and tidyr (data restructuring).

Books

  • Event History Analysis with R, CRC Press, 2012.
  • Multivariate Survival Analysis and Competing Risks, CRC Press.
  • Survival Analysis, A Self-Learning Text, Springer, 2012.
  • Survival Analysis by Example, Hands-on approach using R, self-published, 2016.
  • Econometrics by Example, Palgrave Macmillan, 2011.

Terminology

  • Event: a qualitative change that occurs at a specific point in time such as death, job changes, promotions, layoffs, retirement, convictions, incarcerations, admission, etc.
  • Duration spell: the length of time before an event occurs.
  • Discrete-time analysis and discrete-time event history: some events occur only at discrete times; every year, monthly, etc.
  • Continuous-time analysis: events are observed along a time continuum, so very few occur at exactly the same time (weekly at best). The statistical techniques used to handle continuous-time SA are different from those used to handle discrete-time SA.
  • Cumulative distribution function (CDF) of time: let \(T\) denote the time until the event happens. If we treat \(T\) as a continuous variable, the distribution of \(T\) is given by the CDF which gives the probability that the event has occurred by duration \(t\).
  • Survivor function: the probability of surviving past time \(t\).
  • Hazard function or hazard rate function: in simple terms, the hazard function is the ratio of the density function to the survivor function for a random variable. Simply stated, it gives the probability that someone fails at time \(t\), given that they have survived up to that point, failure to be understood in the given context.
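These three functions are tied together: \(S(t) = 1 - F(t)\) and \(h(t) = f(t)/S(t)\). For an exponential \(T\), the hazard collapses to the constant rate, which we can check numerically in R:

```r
# Identities: S(t) = 1 - F(t) and h(t) = f(t) / S(t).
# For an exponential T with rate lambda, the hazard is constant: h(t) = lambda.
lambda <- 0.5
t0 <- 2

Ft <- pexp(t0, rate = lambda)   # CDF: P(T <= t0)
St <- 1 - Ft                    # survivor function
ft <- dexp(t0, rate = lambda)   # density
ht <- ft / St                   # hazard

all.equal(ht, lambda)           # TRUE
```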

How do we choose the density and survivor function in practice? Before that, a few more terms:

  • Censoring: a frequently encountered problem in SA is that the data are often censored. Suppose we start following 100 unemployed people at time \(t\) and follow them until time \((t+h)\). Depending on the value we choose for \(h\), there is no guarantee that all 100 people will still be unemployed at time \((t+h)\); some of them will have been reemployed and some will have dropped out of the labor force. Therefore, we will have a censored sample. The sample may be right-censored because we stopped following the sample of the unemployed at time \((t+h)\). Our sample can also be left-censored because we do not know how many of the 100 unemployed were already in that status before time \(t\). In estimating the hazard function we have to take this censoring problem into account. We cover these problems in detail when we discuss the censored and truncated sample regression models.
  • Hazard function with or without covariates (or regressors): in SA our interest lies not only in estimating the hazard function but also in trying to find out whether it depends on some explanatory variables or covariates. Duration dependence: if the hazard function is not constant, there is duration dependence. With positive duration dependence, the probability of exiting the initial state increases the longer a person stays in that state. For example, the longer a person is unemployed, the higher their probability of exiting unemployment. The opposite holds under negative duration dependence.
  • Unobserved heterogeneity: no matter how many covariates we consider, there may be intrinsic heterogeneity among individuals and we may have to account for this.

Ave Imperator, morituri te salutant.2



  1. The criminal recidivism dataset from Rossi, P. H., Berk, R. A., and Lenihan, K. J. (1980). Money, work and crime: Some experimental results. New York, NY: Academic Press.

  2. We who are about to die, salute you!