Foreword
How to
One good approach is to take a peek at the Terminology, go through all the other sections, then review the Recapitulation and return to the Terminology.
These terms will come back over and over:
These functions are often plotted in the worked examples.
Note that the ‘CDF’ (cumulative distribution function) is related to the ‘probability distribution’; in what follows, ‘distribution’ is used as a synonym for ‘density function’.
A ‘covariate’ is an ‘explanatory variable’ or a ‘regressor’.
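The relationships between these functions can be sketched numerically. A minimal base-R illustration, assuming an exponential distribution with a hypothetical rate of 0.1:

```r
# Relationships between the density f, CDF F, survival S and hazard h,
# illustrated with an exponential distribution (rate lambda = 0.1)
lambda <- 0.1
t <- c(1, 5, 10, 20)

f <- dexp(t, rate = lambda)   # density
F <- pexp(t, rate = lambda)   # CDF: P(T <= t)
S <- 1 - F                    # survival function: P(T > t)
h <- f / S                    # hazard: instantaneous event rate given survival to t

# For the exponential distribution, the hazard is constant and equal to lambda
round(h, 10)
```

This is only a sketch of the definitions (S = 1 - F, h = f / S), not part of the worked examples below.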
Survival analysis (SA) goes by various names
We sometimes hear of hazard rate analysis (the conditional probability of event occurrence).
The primary goals of SA
Unlike other techniques, SA focuses not only on the outcome but also on the time to an event.
This enables us to compare the survival between two or more groups and to assess the relationship between explanatory variables and survival time.
Another unique feature is the ability to include time-varying covariates (that change their values over time such as age or education), which is not possible in OLS or logistic regression.
Applications
Approaches
There are three basic approaches to analyzing survival data:
We are going to illustrate these three approaches with a dataset about recidivism: convicts who were released from prisons and who were followed up for one year after release. The analysis measures the length of time until rearrest (or not).
In non-parametric models, which include the actuarial life table, the Altschuler–Nelson estimator, and the Kaplan-Meier estimator, there is no assumption about the shape of the hazard function or about how covariates may affect that shape. However, non-parametric methods cannot handle the inclusion of multiple covariates.
This is often an excellent descriptive technique to use at the beginning of an analysis.
Adapting the data
It is possible to create survival data from existing information:
The central point is that we need the timing of events and an outcome variable.
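For example, if a dataset records entry and exit dates plus an outcome flag, the duration can be derived directly. A minimal sketch with hypothetical column names (start_date, end_date, event):

```r
# Hypothetical example: derive survival data (time, status) from dates
df <- data.frame(
  start_date = as.Date(c("2020-01-01", "2020-02-15", "2020-03-01")),
  end_date   = as.Date(c("2020-06-01", "2020-12-31", "2020-08-15")),
  event      = c(1, 0, 1)  # 1 = event observed, 0 = right-censored
)

# Duration in weeks between entry and event/censoring
df$time   <- as.numeric(difftime(df$end_date, df$start_date, units = "weeks"))
df$status <- df$event
df
```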
The survival package
There are several packages. We begin with the survival package.
library(survival)

The package contains several datasets.
str(leukemia) # alternatively, str(aml)
## 'data.frame': 23 obs. of 3 variables:
## $ time : num 9 13 13 18 23 28 31 34 45 48 ...
## $ status: num 1 1 0 1 1 0 1 1 0 1 ...
## $ x : Factor w/ 2 levels "Maintained","Nonmaintained": 1 1 1 1 1 1 1 1 1 1 ...
Where:

- time, the duration of an episode (calculated from the start and end time or censoring).
- status, the outcome indicator: status=1 marks the occurrence of an event, status=0 marks right censoring (there are no data beyond that point).
- x, the only covariate; x=Maintained indicates maintenance treatment against leukemia.

The data are right-censored. Even when a patient survived (status=0), we have no data beyond that data point. This is a snapshot: this analysis is about this chemotherapy cycle.
|  | time | status | x |
|---|---|---|---|
| 1 | 9 | 1 | Maintained |
| 2 | 13 | 1 | Maintained |
| 3 | 13 | 0 | Maintained |
| 4 | 18 | 1 | Maintained |
| 5 | 23 | 1 | Maintained |
| 6 | 28 | 0 | Maintained |
| 7 | 31 | 1 | Maintained |
| 8 | 34 | 1 | Maintained |
| 9 | 45 | 0 | Maintained |
| 10 | 48 | 1 | Maintained |
| 11 | 161 | 0 | Maintained |
| 12 | 5 | 1 | Nonmaintained |
| 13 | 5 | 1 | Nonmaintained |
| 14 | 8 | 1 | Nonmaintained |
| 15 | 8 | 1 | Nonmaintained |
| 16 | 12 | 1 | Nonmaintained |
| 17 | 16 | 0 | Nonmaintained |
| 18 | 23 | 1 | Nonmaintained |
| 19 | 27 | 1 | Nonmaintained |
| 20 | 30 | 1 | Nonmaintained |
| 21 | 33 | 1 | Nonmaintained |
| 22 | 43 | 1 | Nonmaintained |
| 23 | 45 | 1 | Nonmaintained |
For example, patient 3 survived for 13 weeks from the chemotherapy cycle; we do not know what happened afterwards. Patient 1 died 9 weeks after the beginning of the treatment.
The question is whether the standard course of chemotherapy should be extended (maintained) for additional cycles.
Does the treatment of maintenance chemotherapy significantly prolong a patient’s survival time until death?
Objects
First, we create an object.
with(leukemia, Surv(time, status))
## [1] 9 13 13+ 18 23 28+ 31 34 45+ 48 161+ 5 5 8
## [15] 8 12 16+ 23 27 30 33 43 45
Producing KM estimates
leukemia_fit <- survfit(Surv(time, status) ~ 1,
conf.type = 'log',
conf.int = 0.95,
type = 'kaplan-meier',
error = 'greenwood',
se.fit = TRUE,
data = leukemia)

Interpretation of KM estimates
summary(leukemia_fit$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 13.75 27.50 32.44 33.75 161.00
We print summary statistics of the event times: for example, the median and mean survival time in weeks.
summary(leukemia_fit)
## Call: survfit(formula = Surv(time, status) ~ 1, data = leukemia, conf.type = "log",
## conf.int = 0.95, type = "kaplan-meier", error = "greenwood",
## se.fit = TRUE)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 23 2 0.9130 0.0588 0.8049 1.000
## 8 21 2 0.8261 0.0790 0.6848 0.996
## 9 19 1 0.7826 0.0860 0.6310 0.971
## 12 18 1 0.7391 0.0916 0.5798 0.942
## 13 17 1 0.6957 0.0959 0.5309 0.912
## 18 14 1 0.6460 0.1011 0.4753 0.878
## 23 13 2 0.5466 0.1073 0.3721 0.803
## 27 11 1 0.4969 0.1084 0.3240 0.762
## 30 9 1 0.4417 0.1095 0.2717 0.718
## 31 8 1 0.3865 0.1089 0.2225 0.671
## 33 7 1 0.3313 0.1064 0.1765 0.622
## 34 6 1 0.2761 0.1020 0.1338 0.569
## 43 5 1 0.2208 0.0954 0.0947 0.515
## 45 4 1 0.1656 0.0860 0.0598 0.458
## 48 2 1 0.0828 0.0727 0.0148 0.462
- time, the ordered event or failure times (deaths).
- n.risk, the risk set: the number of patients who have survived to at least time t; in other words, the number of subjects in the risk set at the start of the interval.
- n.event, the number of failures (deaths) or events in the specified time interval.
- survival, the estimated probability that a patient will survive past that time; the percentage of patients that survive a given time or more. For example, in row 1, 21/23 survived, or 91.3%.
- std.err, an estimate of the standard error of the KM estimate according to the error parameter.
- lower, upper 95% CI, confidence intervals.

quantile(leukemia_fit, probs = c(.25, .5, .75))
## $quantile
## 25 50 75
## 12 27 43
##
## $lower
## 25 50 75
## 8 18 31
##
## $upper
## 25 50 75
## 30 48 NA
We are interested in the median survival time, which is the 50th percentile. In this case, it is 27 weeks, with a 95% confidence interval of 18 to 45 (lower and upper limits).
The median is the preferred measure because the mean is biased downward when right-censored times exceed the largest event time.
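When a mean is still wanted, the survival package can report a restricted mean alongside the median. A short sketch, assuming the leukemia_fit object from above:

```r
library(survival)
leukemia_fit <- survfit(Surv(time, status) ~ 1, data = leukemia)

# print.survfit can display the restricted mean survival time, i.e. the
# mean restricted to the largest observed time, together with the median
print(leukemia_fit, print.rmean = TRUE)
```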
Plotting a univariate KM survival curve
We recall the median and its confidence limits: 18-27-45, where the gray lines cross.
Other plots resulting from transformations.
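Such transformations can be produced with the fun argument of plot.survfit. A sketch, assuming the leukemia_fit object from above:

```r
library(survival)
leukemia_fit <- survfit(Surv(time, status) ~ 1, data = leukemia)

# Transformed versions of the KM curve via the fun argument
op <- par(mfrow = c(1, 3))
plot(leukemia_fit, fun = "event",   main = "Cumulative events: 1 - S(t)")
plot(leukemia_fit, fun = "cumhaz",  main = "Cumulative hazard")
plot(leukemia_fit, fun = "cloglog", main = "Complementary log-log")
par(op)
```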
Comparing two KM survival curves
We can only use one covariate in KM: x.
# x being the sole covariate, a factor
leukemia_fit2 <- survfit(Surv(time, status) ~ x,
type = 'kaplan-meier',
data = leukemia)
leukemia_fit2
## Call: survfit(formula = Surv(time, status) ~ x, data = leukemia, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## x=Maintained 11 7 31 18 NA
## x=Nonmaintained 12 11 23 8 NA
We can now observe that patients in the treatment group (x=Maintained chemotherapy) had a median survival time of 31 weeks, compared to 23 weeks in the control group (x=Nonmaintained).
Since the estimated survival function for the maintained group does not go to zero (which is the case for the non-maintained group), we can determine that the largest observation is a censored value.
Testing differences between two groups
Does the treatment of maintenance chemotherapy significantly prolong a patient’s survival time until death?
In other words, whether there is a statistically significant difference between the two survival curves, akin to the two-sample t-test or the one-way ANOVA. There are many possible tests:
The log-rank test
We test the hypothesis that the survival function is identical for the two groups.
# rho=0 (p-value) is the default
survdiff(Surv(time, status) ~ x,
data = leukemia,
rho = 0)
## Call:
## survdiff(formula = Surv(time, status) ~ x, data = leukemia, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained 11 7 10.69 1.27 3.4
## x=Nonmaintained 12 11 7.31 1.86 3.4
##
## Chisq= 3.4 on 1 degrees of freedom, p= 0.0653
The reported p-value is two-sided: it is twice the one-sided p-value of the log-rank statistic. If a one-sided test is appropriate (we expect maintenance to prolong survival), its p-value is 0.0653/2 = 0.033.
Considering the small sample size, there is actually some evidence that maintenance chemotherapy significantly prolongs life in the remission period. The weakness of this effect is probably due to the small sample.
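The rho argument of survdiff selects other members of the G-rho family of tests. For example, rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test, which weights early differences more heavily than the log-rank test:

```r
library(survival)

# Peto & Peto modification of the Gehan-Wilcoxon test (rho = 1):
# puts more weight on early survival differences than the log-rank test
survdiff(Surv(time, status) ~ x,
         data = leukemia,
         rho = 1)
```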
Alternative or confirmation: stratifying the analysis by a covariate
In a stratified test, we produce the estimates and perform the log-rank tests separately and then combine them to produce an overall statistic.
Contrary to the log-rank test, we avoid (potential) misleading statements on the basis of an aggregated analysis by not acknowledging important confounding or interaction effects.
First, we estimate the model.
leukemia_fit3 <- survfit(Surv(time, status) ~ x,
conf.type = 'log',
conf.int = 0.95,
type = 'kaplan-meier',
data = leukemia)
leukemia_fit3
## Call: survfit(formula = Surv(time, status) ~ x, data = leukemia, conf.type = "log",
## conf.int = 0.95, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## x=Maintained 11 7 31 18 NA
## x=Nonmaintained 12 11 23 8 NA
Then, we compute the quantiles of the estimates by group (x).
quantile(leukemia_fit3, c(.25, .5, .75))
## $quantile
## 25 50 75
## x=Maintained 18 31 48
## x=Nonmaintained 8 23 33
##
## $lower
## 25 50 75
## x=Maintained 13 18 34
## x=Nonmaintained 5 8 27
##
## $upper
## 25 50 75
## x=Maintained NA NA NA
## x=Nonmaintained 30 NA NA
There is a difference in the median (the same as in the previous model):
- Maintained: 31 weeks.
- Nonmaintained: 23 weeks.

Maintenance chemotherapy likely prolongs life in the remission period, confirming the above test.
Nonetheless, it is only possible to compare differences between a limited number of groups, by including only one covariate (or a few covariates in some cases we will not cover).
Let us turn to methods where we can include several covariates.
The parametric approach is largely used for continuous time data.
There are several parametric models that are used in duration analysis. Each depends on the assumed probability distribution, such as the normal, binomial (logistic), Poisson, etc.
The most common forms of distribution to describe survival time are the exponential (a special case of Weibull), Weibull, log-normal, generalized gamma and log-logistic.
Since the CDF of each of these distributions is known, we can easily derive the corresponding hazard and survival functions. Parametric models not only include assumptions about the distribution of failure or event times, but also about the relationship between the covariates and survival. This is related to whether we opt for one of the two specifications:
Proportional Hazards (PH) versus Accelerated Failure Time (AFT) models
A central difference between these models is also the interpretation of the parameter estimates.
Although the time equivalent differs, we would still have the same hypothesis about the shape of the hazard function from birth to death. This means that, even though we are assuming the same hazard function, we would observe the rather sizeable difference, for instance, that 90% of humans would survive to the age of 63 compared to 90% of dogs that would survive to the age of 9.
Summary of selected parametric survival distributions
We recall that PH models are multiplicative (in the hazard) and AFT models are additive (in log time).
The exponential model (PH, AFT)
The exponential model is a special case of the Weibull model with a constant hazard function.
This model is useful for modeling data with a constant hazard, because the baseline hazard function is assumed to be constant or flat over time.
The risk of an event occurring, conditional on covariates, is the same across all points in time. Regardless of how long the duration is (e.g. treatment, marriage, unemployment), there will still be the same chance of the event occurring over that time period.
Consider the example of a constant-hazard rate of survival of a wine glass: the wine glass breaks or not. Since wine glasses do not wear out, if we assume that the probability of the event (i.e. breaking) is not dependent on how often it has been used, the wine glass has the same probability of breaking whether the glass is 0, 5, 10 or 15 years old.
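This memoryless property can be checked numerically with base R's exponential functions. A small sketch, assuming a hypothetical rate of 0.1 breaks per year:

```r
# Memorylessness of the exponential distribution: the probability of
# breaking in the next 5 years is the same for a new wine glass and a
# 10-year-old one (hypothetical rate of 0.1 breaks per year)
lambda <- 0.1

# P(break within 5 years) for a new glass
p_new <- pexp(5, rate = lambda)

# P(break within the next 5 years | survived 10 years)
p_old <- (pexp(15, rate = lambda) - pexp(10, rate = lambda)) /
         (1 - pexp(10, rate = lambda))

all.equal(p_new, p_old)  # TRUE: the hazard does not depend on age
```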
Any deviation of the hazard function from the flat line is attributed to the covariate effects.
There are several shortcomings of the exponential model:
There are few applied settings:
Piecewise constant exponential model (PH, AFT)
It is similar to the exponential model, with the additional assumption that time is divided into different periods and that the hazard rate is then constant within any of these time periods. In other words, a piecewise exponential model gives us the flexibility of the Cox model (that will be addressed in Section 3) with the added advantage of being able to estimate the shape of the hazard function.
Each time period is given a ‘time dummy variable’ with the estimates for each dummy representing the hazard function for that particular time period.
For example, a piecewise exponential model can examine labour market transitions from unemployment to employment. Transitions out of unemployment are modeled over time periods (1-6, 6-12 and 12+ months) with the assumption that the transition rate varies across these time periods.
The Weibull model (PH, AFT)
We know the exponential model is a special case of the Weibull model with constant hazard function. The hazard function can take other forms (increasing, decreasing, accelerating, decelerating).
PH-multiplicative
The baseline hazard is characterized as monotonic (i.e. constantly going up or down). With the Weibull model, the baseline hazard may be monotonically increasing or decreasing over time, or it may be constant as in the exponential model.
A monotonically increasing hazard rate might be, for example, patients who are not responding to a particular treatment. As survival time increases over time for each patient, coupled with a worsening prognosis, the probability of death increases.
A decreasing hazard could be the survival times of patients recovering from surgery, such as organ transplantation. The potential of dying immediately after surgery is high and decreases over time after the surgery.
AFT-additive
As with the exponential model, the Weibull model can also be specified as an AFT model. The regression parameters are therefore expressed as log duration times. When covariates are included in the model, they impact the scale parameter, with the shape of the distribution remaining constant.
Log-normal and Log-logistic models (AFT-additive)
Both the log-logistic and log-normal models are only implemented in the AFT form and are parametric models that allow for non-monotonic hazard functions.
They accommodate processes where rates initially increase and then decrease (a hump).
Other parametric models (AFT-additive)
A relatively prominent model is the generalized gamma model.
It is possible to estimate additional models including the Gaussian, Gompertz, logistic, and Gompertz-Makeham models.
Packages
In the following worked examples, we use the eha package for PH parametrization and the survival package for AFT parametrization.
Load the data and take a look at a sample of the Rossi dataset1 about recidivism.
The data pertain to 432 convicts who were released from prisons and who were followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid; half did not receive aid.
The head of the dataset.
|  | week | arrest | fin | age | race | wexp | mar | paro | prio | educ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 |
| 2 | 17 | 1 | no | 18 | black | no | not married | yes | 8 | 4 |
| 3 | 25 | 1 | no | 19 | other | yes | not married | yes | 13 | 3 |
| 4 | 52 | 0 | yes | 23 | black | yes | married | yes | 1 | 5 |
| 5 | 52 | 0 | no | 19 | other | yes | not married | yes | 3 | 3 |
| Note: The 'emp1' to 'emp52' fields are not shown. ||||||||||
Using the Rossi dataset, we estimate the PH exponential model (PH-multiplicative). The exponential model is a special case of the Weibull model with a constant hazard function.
library(eha)
# shape = 1 for constant hazard function
exp_ph_model <- phreg(Surv(week, arrest) ~ fin + age + prio,
dist = 'weibull',
shape = 1,
data = Rossi)
summary(exp_ph_model)
## Call:
## phreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "weibull", shape = 1)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## fin
## no 0.489 0 1 (reference)
## yes 0.511 -0.335 0.715 0.190 0.078
## age 24.765 -0.065 0.937 0.021 0.002
## prio 2.841 0.091 1.095 0.027 0.001
##
## log(scale) 3.761 0.500 0.000
##
## Shape is fixed at 1
##
## Events 114
## Total time at risk 19809
## Max. log. likelihood -688.38
## LR test statistic 0.06
## Degrees of freedom 3
## Overall p-value 0.995681
Turning to the estimate for financial assistance (fin=yes), we see that it is -0.335, with a p-value of 0.078.
Since this is a PH model, a negative coefficient means that the risk of the event (rearrest) decreases, resulting in a longer duration.
If we take the exponent, \(e^{-0.335}\) = 0.715: the hazard of rearrest for those who received financial assistance (fin=yes) is 72% of the hazard for those who did not (fin=no).
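The same hazard-ratio calculation extends to all the coefficients. A base-R sketch using the estimates printed in the summary above:

```r
# Hazard ratios from the PH exponential model: exponentiated coefficients
# (values copied from the summary above)
coefs <- c(finyes = -0.335, age = -0.065, prio = 0.091)
round(exp(coefs), 3)  # finyes ~ 0.715, age ~ 0.937, prio ~ 1.095
```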
Basic plots (the exponential model is a special case of the Weibull model).
We can see the flat, constant hazard curve of the exponential model.
We compute the confidence intervals for the exponentiated coefficient of rearrest for those who received financial assistance (fin=yes).
| Lower 95% CI | Upper 95% CI |
|---|---|
| 0.49 | 1.04 |
The exponentiated coefficient of financial aid (0.715) has 95% CI of (0.49, 1.04).
We can also use the parameter estimates to estimate the time corresponding to any survival quantile (e.g. the median, 0.50).
| 0.25 | 0.50 | 0.75 |
|---|---|---|
| 42.63 | 21.32 | 8.85 |
The median rearrest time is 21 weeks.
The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1).
We estimate the AFT exponential model (AFT-additive).
library(survival)
exp_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
dist = 'exponential',
data = Rossi)
summary(exp_aft_model)
##
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "exponential")
## Value Std. Error z p
## (Intercept) 3.7614 0.4998 7.53 5.24e-14
## finyes 0.3352 0.1901 1.76 7.78e-02
## age 0.0650 0.0207 3.14 1.71e-03
## prio -0.0911 0.0270 -3.38 7.33e-04
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -688.4 Loglik(intercept only)= -702
## Chisq= 27.2 on 3 degrees of freedom, p= 5.3e-06
## Number of Newton-Raphson Iterations: 5
## n= 432
The interpretations of the parameter estimates for PH and AFT models differ.
The PH assumption is useful for a comparison of hazards and the AFT assumption is useful for a comparison of survival times.
The acceleration factor is the key measure in AFT models and is used to evaluate the effect of the covariate(s) on survival time.
This means that the interpretation of the coefficients is essentially the opposite.
The coefficient for financial aid (finyes) is 0.335. Since it has a positive sign, those who receive financial assistance have a longer expected time to arrest (good news).
We generally estimate the acceleration factor, which is simply \(e^{0.335}\) = 1.40. To interpret these results we would say that the provision of financial assistance is effective for delaying rearrest by stretching the survival time by a factor of 1.40.
| finyes | age | prio |
|---|---|---|
| 1.4 | 1.07 | 0.91 |
An acceleration factor > 1 means that financial assistance is beneficial to survival (i.e. staying out of prison), and vice versa. An acceleration factor or hazard ratio of 1 would mean no effect.
Express the AFT results in absolute terms of the number of weeks of additional survival time until rearrest if we receive financial aid: exp(3.761 + 0.335 * 1).
| finyes |
|---|
| 60.1 |
In other words, receiving financial assistance sets the expected time until rearrest to about 60 weeks (good news).
We can play with the model coefficients and compute different results.
Now, we interpret a continuous variable such as prio (-0.091) in relative terms: 100 * (exp(-0.091) - 1).
| prio |
|---|
| -8.7 |
Holding the covariates of financial assistance and age constant, each additional prior conviction is associated with an 8.7% decrease in the expected time to rearrest (bad news).
The 95% confidence intervals for financial assistance.
| Lower 95% CI | Upper 95% CI |
|---|---|
| 0.96 | 2.03 |
The hazard ratio for the effect of financial assistance was estimated as \(e^{-0.335} = 0.715\) in the previous PH form of the exponential model (exp_ph_model).
Although the PH and AFT exponential models have different underlying assumptions, they are essentially the same model with the only difference being their parameterization.
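For the exponential model, the two sets of coefficients are simply sign flips of each other. A base-R check using the estimates printed in the two summaries above:

```r
# For the exponential model, the AFT coefficients are the negatives of
# the PH coefficients (values copied from the two summaries above)
ph  <- c(finyes = -0.335, age = -0.065, prio = 0.091)
aft <- c(finyes = 0.3352, age = 0.0650, prio = -0.0911)

# Up to rounding, aft = -ph: same model, opposite parameterization
round(aft + ph, 3)  # all approximately 0
```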
Estimate the median survival time.
medfin <- predict(exp_aft_model,
type = 'uquantile',
p = 0.5,
se.fit = TRUE)
medfin1 <- medfin$fit[2]
medfin1.se <- medfin$se.fit[2]
expmedfin <- exp(medfin1)
round(expmedfin, 2)
## [1] 46.35
From the results, we can test the model's overall significance.
summary(exp_aft_model)
##
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "exponential")
## Value Std. Error z p
## (Intercept) 3.7614 0.4998 7.53 5.24e-14
## finyes 0.3352 0.1901 1.76 7.78e-02
## age 0.0650 0.0207 3.14 1.71e-03
## prio -0.0911 0.0270 -3.38 7.33e-04
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -688.4 Loglik(intercept only)= -702
## Chisq= 27.2 on 3 degrees of freedom, p= 5.3e-06
## Number of Newton-Raphson Iterations: 5
## n= 432
There are two values for the log-likelihood: the full model (intercept and the three covariates) and the null model (intercept only, no covariates). According to the p-value, we can reject the null hypothesis: at least one of the covariates significantly improves the fit of the model.
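The likelihood-ratio statistic can be reproduced from the two log-likelihoods reported in the summary:

```r
# LR statistic from the two log-likelihoods printed above:
# 2 * (logLik(full model) - logLik(intercept only))
lr <- 2 * (-688.4 - (-702))  # = 27.2, as reported
df <- 3                      # three covariates

pchisq(lr, df = df, lower.tail = FALSE)  # p-value, about 5.3e-06
```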
It is possible to reshape our data by chosen cut points to examine how the hazard rate differs across time periods.
Reshaping of data by chosen cut-points
The Rossi data can be divided over the 52-week period into four periods, split at 0, 13, 26 and 39 weeks; in four quarters of 13 weeks each. Recall that financial assistance changed at 13 weeks, which makes 13 weeks a useful cut-point.
Since it is an exponential model, we assume that the hazard is constant within each of these intervals.
Create a new data set where one record is created for each period where the subject was at risk.
Rossi$id <- 1:432
Rossi_piecewise <- survSplit(Rossi,
cut = c(0, 13, 26, 39),
end = 'week',
event = 'arrest',
start = 'start',
episode = 'i')
# order
Rossi_piecewise <- Rossi_piecewise[order(Rossi_piecewise$id, Rossi_piecewise$start),]
Rossi_piecewise$t0 <- ifelse(Rossi_piecewise$start == 0, 1, 0)
Rossi_piecewise$t13 <- ifelse(Rossi_piecewise$start == 13, 1, 0)
Rossi_piecewise$t26 <- ifelse(Rossi_piecewise$start == 26, 1, 0)
Rossi_piecewise$t39 <- ifelse(Rossi_piecewise$start == 39, 1, 0)
Rossi_piecewise[1:2, c(1:2, 6:8, 61:69)]
## fin age paro prio educ id start week arrest i t0 t13 t26 t39
## 1 no 27 yes 3 3 1 0 13 0 2 1 0 0 0
## 2 no 27 yes 3 3 1 13 20 1 3 0 1 0 0
The data are right-censored in the ‘last quarter’. Be careful to read one or several rows per individual. Above, the first subject (id = 1) is arrested (arrest = 1) in the 2nd quarter (t13 = 1), on the second row.
Rossi_piecewise[3:10, c(1:2, 6:8, 61:69)]
## fin age paro prio educ id start week arrest i t0 t13 t26 t39
## 3 no 18 yes 8 4 2 0 13 0 2 1 0 0 0
## 4 no 18 yes 8 4 2 13 17 1 3 0 1 0 0
## 5 no 19 yes 13 3 3 0 13 0 2 1 0 0 0
## 6 no 19 yes 13 3 3 13 25 1 3 0 1 0 0
## 7 yes 23 yes 1 5 4 0 13 0 2 1 0 0 0
## 8 yes 23 yes 1 5 4 13 26 0 3 0 1 0 0
## 9 yes 23 yes 1 5 4 26 39 0 4 0 0 1 0
## 10 yes 23 yes 1 5 4 39 52 0 5 0 0 0 1
Above, individuals 2 and 3 are also arrested in the 2nd quarter (there are 2 rows for each).
Since individual 4 (id = 4) was not arrested (arrest = 0 in all 4 rows), the subject has four splits, at 0, 13, 26 and 39 weeks.
We use only fixed covariates; they are repeated for each line. If we had integrated time-varying covariates, they would change between intervals.
Estimation of the model
The survival and eha packages cannot estimate the piecewise constant exponential model directly.
It is possible, by using the previously reshaped data, to estimate the variation in the hazard across time intervals using the PH or AFT specification, which in substantive terms provides equivalent results.
Each of the period variables representing the quarters is then estimated. For reasons of collinearity, the last quarter (t39) is not included in the model.
The PH-multiplicative model.
library(eha)
# shape = 1 for constant hazard function
pe_ph_model <- phreg(Surv(week, arrest) ~ t0 + t13 + t26 + fin + age + prio,
dist = 'weibull',
shape = 1,
data = Rossi_piecewise)
summary(pe_ph_model)
## Call:
## phreg(formula = Surv(week, arrest) ~ t0 + t13 + t26 + fin + age +
## prio, data = Rossi_piecewise, dist = "weibull", shape = 1)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## t0 0.113 0.619 1.857 0.284 0.029
## t13 0.216 0.525 1.690 0.245 0.032
## t26 0.300 -0.011 0.989 0.260 0.967
## fin
## no 0.488 0 1 (reference)
## yes 0.512 -0.335 0.716 0.190 0.078
## age 24.772 -0.064 0.938 0.021 0.002
## prio 2.811 0.094 1.099 0.027 0.001
##
## log(scale) 4.913 0.522 0.000
##
## Shape is fixed at 1
##
## Events 114
## Total time at risk 48682
## Max. log. likelihood -785.7
## LR test statistic 6155.94
## Degrees of freedom 6
## Overall p-value 0
From our previous results, we stated that the hazard of rearrest for those who received financial assistance (fin=yes) is 0.72 of the hazard for those who did not receive financial assistance (fin=no).
The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1).
During t0, the hazard is 1.86 times the hazard at t39 (the baseline, where t39 = 1). During t13, the hazard is 1.69 relative to t39. The hazard then falls back to about the baseline level (0.99) during t26.
The result plots.
It is also possible to estimate an AFT-additive model using the survreg function.
library(survival)
pe_ph_model2 <- survreg(Surv(week, arrest) ~ t0 + t13 + t26 + fin + age + prio,
dist = 'exponential',
data = Rossi_piecewise)
summary(pe_ph_model2)
##
## Call:
## survreg(formula = Surv(week, arrest) ~ t0 + t13 + t26 + fin +
## age + prio, data = Rossi_piecewise, dist = "exponential")
## Value Std. Error z p
## (Intercept) 4.9132 0.5218 9.416 4.69e-21
## t0 -0.6188 0.2840 -2.179 2.94e-02
## t13 -0.5248 0.2446 -2.146 3.19e-02
## t26 0.0107 0.2595 0.041 9.67e-01
## finyes 0.3347 0.1902 1.760 7.85e-02
## age 0.0640 0.0207 3.094 1.98e-03
## prio -0.0944 0.0273 -3.452 5.56e-04
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -785.7 Loglik(intercept only)= -804.5
## Chisq= 37.56 on 6 degrees of freedom, p= 1.4e-06
## Number of Newton-Raphson Iterations: 7
## n= 1573
We recall that the interpretation of the coefficients depends on the model type:

|  | PH | AFT |
|---|---|---|
| Positive coefficient | hazard risk UP | log(duration) time UP |
|  | shorter duration | longer duration |
| Negative coefficient | hazard risk DOWN | log(duration) time DOWN |
|  | longer duration | shorter duration |
Since this is an AFT model, a negative period coefficient means a shorter duration, i.e. a higher hazard. The estimates over the four periods demonstrate that the hazard is not constant over time. The coefficients of the three periods t0, t13 and t26 are all contrasted with the last period (the omitted t39).

We see that the hazard of arrest in the first 13 weeks (t0) is significantly higher than the hazard in the last quarter (the baseline t39).

Visually, it is easier to interpret (this barplot is the mirror image of the barplot above, since the AFT coefficients have the opposite signs of the PH coefficients).

- t39, the baseline, is at y = 0.
- t0 has the most negative coefficient: the expected duration is shortest, so the hazard is highest.
- t13 also has a negative coefficient, slightly closer to zero: the hazard is still above the baseline, but lower than during t0.
- t26 has a coefficient close to zero: the hazard is roughly at the baseline level.
- Overall, the hazard decreases from t0 towards the baseline as time passes.

Both models, pe_ph_model and pe_ph_model2, end up with the same conclusion despite the different parameterizations: the first quarter has the most significant impact on rearrest (the highest hazard).
It is possible to estimate the Weibull model using a PH assumption.
library(eha)
# shape is left unspecified (estimated), unlike exp_ph_model where shape = 1
weibull_ph_model <- phreg(Surv(week, arrest) ~ fin + age + prio,
dist = 'weibull',
data = Rossi)
summary(weibull_ph_model)
## Call:
## phreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "weibull")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## fin
## no 0.489 0 1 (reference)
## yes 0.511 -0.349 0.705 0.190 0.066
## age 24.765 -0.067 0.935 0.021 0.001
## prio 2.841 0.098 1.103 0.027 0.000
##
## log(scale) 3.774 0.358 0.000
## log(shape) 0.337 0.089 0.000
##
## Events 114
## Total time at risk 19809
## Max. log. likelihood -682.04
## LR test statistic 29.17
## Degrees of freedom 3
## Overall p-value 2.06634e-06
The results are similar to the exponential model (exp_ph_model).
|  | finyes | age | prio |
|---|---|---|---|
| exp_ph_model_exp_coef | 0.7152 | 0.9371 | 1.0953 |
| weibull_ph_model_exp_coef | 0.7051 | 0.9353 | 1.1027 |
However, the shape is no longer fixed at 1 as in the exponential model results: we no longer have a constant hazard function. The result plots below illustrate the difference (top 4 plots: exp_ph_model; bottom 4 plots: weibull_ph_model).

We can see the difference between a constant and a nonconstant hazard function.
Alternatively.
library(survival)
weibull_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
dist = 'weibull',
data = Rossi)
summary(weibull_aft_model)
##
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 3.7738 0.3581 10.54 5.67e-26
## finyes 0.2495 0.1372 1.82 6.90e-02
## age 0.0478 0.0154 3.11 1.89e-03
## prio -0.0698 0.0201 -3.47 5.11e-04
## Log(scale) -0.3367 0.0892 -3.77 1.61e-04
##
## Scale= 0.714
##
## Weibull distribution
## Loglik(model)= -682 Loglik(intercept only)= -696.6
## Chisq= 29.17 on 3 degrees of freedom, p= 2.1e-06
## Number of Newton-Raphson Iterations: 6
## n= 432
If we compare the results from the PH and AFT specifications of the Weibull model above, we see that the estimates differ, because the parameterizations differ.
The scale in the AFT estimates is 0.714 ('Scale fixed at 1' would indicate the exponential model), which means that the hazard rate is increasing at a decreasing rate (scale < 1). The scale parameter governs how the hazard changes over time.
In the PH model, log(scale) is 3.774 and log(shape) is 0.337.
We can also calculate (1 / 0.714) - 1 = 0.4006, which means that a 1% increase in time since release from prison results in a 0.40% increase in the hazard of rearrest.
Without exploring the details, both models, weibull_ph_model and weibull_aft_model, converge towards the same conclusions.
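The link between the two parameterizations can be made explicit: for the Weibull model, each AFT coefficient equals minus the PH coefficient divided by the shape parameter. A base-R check using the estimates printed above (shape = exp(0.337)):

```r
# Relationship between Weibull PH and AFT coefficients:
# beta_AFT = -beta_PH / shape (values copied from the two summaries above)
shape <- exp(0.337)  # log(shape) from the PH fit
ph    <- c(finyes = -0.349, age = -0.067, prio = 0.098)

aft_hat <- -ph / shape
round(aft_hat, 3)  # close to the AFT estimates 0.2495, 0.0478, -0.0698
```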
Just a reminder that:
Scale = 0.714 (and \(\sigma\) = 0.4006) is the key measure in AFT models and is used to evaluate the effect of the covariate(s) on survival time.

It is also possible to estimate the log-normal and log-logistic models.
The PH result plots for a log-logistic model.
The AFT parametrization for a log-logistic model.
library(survival)
loglogistic_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
dist = 'loglogistic',
data = Rossi)
summary(loglogistic_aft_model)
##
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "loglogistic")
## Value Std. Error z p
## (Intercept) 3.6572 0.3651 10.02 1.28e-23
## finyes 0.2582 0.1452 1.78 7.55e-02
## age 0.0455 0.0152 3.00 2.66e-03
## prio -0.0743 0.0219 -3.39 6.97e-04
## Log(scale) -0.4260 0.0868 -4.91 9.22e-07
##
## Scale= 0.653
##
## Log logistic distribution
## Loglik(model)= -682.9 Loglik(intercept only)= -696.7
## Chisq= 27.49 on 3 degrees of freedom, p= 4.7e-06
## Number of Newton-Raphson Iterations: 4
## n= 432
We see that the scale is 0.653, which is less than 1.0, meaning that the estimated hazard function is non-monotonic: it first rises and then falls over time, following an inverted U-shape.
The PH parametrization plots for a log-normal model.
The AFT parametrization for a log-normal model.
library(survival)
lognormal_aft_model <- survreg(Surv(week, arrest) ~ fin + age + prio,
dist = 'lognormal',
data = Rossi)
summary(lognormal_aft_model)
##
## Call:
## survreg(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi,
## dist = "lognormal")
## Value Std. Error z p
## (Intercept) 3.9785 0.3933 10.12 4.66e-24
## finyes 0.3226 0.1668 1.93 5.31e-02
## age 0.0379 0.0154 2.47 1.36e-02
## prio -0.0755 0.0266 -2.84 4.45e-03
## Log(scale) 0.2815 0.0766 3.67 2.39e-04
##
## Scale= 1.33
##
## Log Normal distribution
## Loglik(model)= -687.4 Loglik(intercept only)= -697.9
## Chisq= 20.94 on 3 degrees of freedom, p= 0.00011
## Number of Newton-Raphson Iterations: 4
## n= 432
We remember that in PH (multiplicative) models, an exponentiated coefficient above 1 indicates an increase in the hazard rate of rearrest compared to the baseline (reducing duration).
In AFT-additive models, the exponentiated estimate of fin, \(e^{0.3226}\) = 1.38, means the expected time to rearrest for those who received financial assistance is 38% greater than for those who did not receive financial assistance (extending duration).
In both models, loglogistic_aft_model and lognormal_aft_model, the estimates are very similar to the Weibull model (weibull_aft_model).
| finyes | age | prio | |
|---|---|---|---|
| exp_aft_model_exp_coef | 1.3982 | 1.0671 | 0.913 |
| weibull_aft_model_exp_coef | 1.2834 | 1.0489 | 0.9326 |
| loglogistic_aft_model_exp_coef | 1.2946 | 1.0466 | 0.9284 |
| lognormal_aft_model_exp_coef | 1.3807 | 1.0387 | 0.9272 |
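The table above can be reproduced directly from the fitted objects (a sketch; exp_aft_model is assumed to be the exponential survreg fit estimated earlier):

```r
# Exponentiated AFT coefficients (time ratios), one row per model
round(rbind(
  exponential = exp(coef(exp_aft_model))[-1],
  weibull     = exp(coef(weibull_aft_model))[-1],
  loglogistic = exp(coef(loglogistic_aft_model))[-1],
  lognormal   = exp(coef(lognormal_aft_model))[-1]
), 4)
```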
We can compare the four models with the log-likelihood (Loglik(model)) statistics.
Since,
\[AIC = -2log(L) + 2k\] \[BIC = -2log(L) + klog(n)\]
where \(log(L)\) = Loglik(model), the log-likelihood is related to the AIC and BIC. We can pick the best model as the one that minimizes the AIC and/or BIC.
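We do not have to compute these by hand; AIC and BIC apply to survreg fits directly. A sketch, assuming the four models above (including the exponential fit exp_aft_model):

```r
models <- list(exp = exp_aft_model, wei = weibull_aft_model,
               ll = loglogistic_aft_model, ln = lognormal_aft_model)
sapply(models, AIC)  # lower is better
sapply(models, BIC)  # heavier penalty per parameter: log(n) instead of 2
```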
| exp_AIC | wei_AIC | ll_AIC | ln_AIC | |
|---|---|---|---|---|
| AIC | 1409.96 | 1399.24 | 1399.34 | 1401.82 |
The weibull_aft_model wins!
One final word of caution
Parametric models can have potentially dangerous consequences: they directly model the shape of the hazard rate even if the parameterization we have chosen is incorrect. If the model choice is incorrect, we risk making incorrect substantive interpretations.
Covariates can also be sensitive to the specification of the distribution, meaning that even interpretations of the relationship between the covariates with survival time may be incorrect.
Furthermore, parametric models are very sensitive to included or omitted covariates, since the distribution function is conditional on these covariates. For this reason, it is important not only to focus on the choice between semi-parametric and the different types of parametric models but also to pay close attention to key covariates. Results often vary widely depending upon the covariates that we do or do not include.
This makes model building and diagnostics a key step in our analysis.
Semi-parametric models let the baseline hazard take any form:
They are flexible:
Unlike the parametric models, the Cox method does not rely on a particular probability distribution. It generally fits data well, regardless of which parametric model is appropriate.
Special features:
The Cox PH model with time-constant covariates
Fixed covariates are those whose values remain constant over time, such as ‘sex’ or ‘place of birth’ for example.
The Cox PH model with time-varying covariates
Time-dependent variables are those where the values differ over time, such as ‘age’, ‘labour market status’ or ‘treatment status’ for example.
It may be a continuous variable such as ‘age’, or vary discretely over time, such as a time-varying binary variable like ‘monthly employment status’ (0, 1), which may change from time period to time period with no set pattern.
Load in the Rossi dataset about recidivism.
The data pertains to 432 convicts who were released from prisons and who were followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid; half did not receive aid.
The head of the dataset.
| week | arrest | fin | age | race | wexp | mar | paro | prio | educ | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 |
| 2 | 17 | 1 | no | 18 | black | no | not married | yes | 8 | 4 |
| 3 | 25 | 1 | no | 19 | other | yes | not married | yes | 13 | 3 |
| 4 | 52 | 0 | yes | 23 | black | yes | married | yes | 1 | 5 |
| 5 | 52 | 0 | no | 19 | other | yes | not married | yes | 3 | 3 |
| Note: The 'emp1 to emp52' fields are not shown. ||||||||||
The main covariates:
- fin: yes if the individual received financial aid, no for no aid.
- age: in years at time of release from prison.
- race: 1 for black, 0 for other.
- wexp: full-time work experience prior to incarceration; 1 for yes, 0 for no.
- mar: 1 for married at time of release, 0 otherwise.
- paro: 1 for released on parole, 0 otherwise.
- prio: number of prior convictions.

Estimate a first model with the above covariates.
library(survival)
cox_model1 <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
method = 'efron',
data = Rossi)
summary(cox_model1)
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
## mar + paro + prio, data = Rossi, method = "efron")
##
## n= 432, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## finyes -0.37942 0.68426 0.19138 -1.983 0.04742 *
## age -0.05744 0.94418 0.02200 -2.611 0.00903 **
## raceother -0.31390 0.73059 0.30799 -1.019 0.30812
## wexpyes -0.14980 0.86088 0.21222 -0.706 0.48029
## marnot married 0.43370 1.54296 0.38187 1.136 0.25606
## paroyes -0.08487 0.91863 0.19576 -0.434 0.66461
## prio 0.09150 1.09581 0.02865 3.194 0.00140 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## finyes 0.6843 1.4614 0.4702 0.9957
## age 0.9442 1.0591 0.9043 0.9858
## raceother 0.7306 1.3688 0.3995 1.3361
## wexpyes 0.8609 1.1616 0.5679 1.3049
## marnot married 1.5430 0.6481 0.7300 3.2614
## paroyes 0.9186 1.0886 0.6259 1.3482
## prio 1.0958 0.9126 1.0360 1.1591
##
## Concordance= 0.64 (se = 0.027 )
## Rsquare= 0.074 (max possible= 0.956 )
## Likelihood ratio test= 33.27 on 7 df, p=2.362e-05
## Wald test = 32.11 on 7 df, p=3.871e-05
## Score (logrank) test = 33.53 on 7 df, p=2.11e-05
Some covariates are binary (fin, race, wexp, mar, paro); others are continuous (age, prio, educ). For example, the hazard ratio for receiving financial aid (finyes) is 0.6843: the hazard of rearrest for those who received financial aid is about 68% of the hazard for those who did not (the baseline). Since it is less than 1, those who received financial aid have a lower hazard of rearrest.
Fixed-continuous variable
Results can be expressed as the percentage change in the hazard: \((e^\beta - 1) * 100\).
age, for example: \(e^\beta\) = 0.944. Therefore, (0.944 -1) * 100 = -5.6%. For every one-year increase in an individual’s age at release, the hazard of arrest goes down by an estimated 5.6%. Older prisoners are significantly less likely to be rearrested after release than younger ones.
Binary variable
The hazard ratio for the not-married group is 1.54, which means that those who are not married have a hazard of rearrest about 1.54 times that of those who are married.
There is a difference in interpretation of the hazard ratio between binary and fixed-continuous covariates. Graphically, a binary variable moves the curve up and down (changing the intercept), while a fixed-continuous variable affects the slope. In the case of age, the slope is decreasing.
Finally, since the hazard ratio is a constant, we could also state it in terms of odds: the odds that a person who is not married gets rearrested are 1.5 times those of a married person (50% higher).
Or we could state it in terms of a probability: 1.5 / (1 + 1.5) = 60% chance that, of an unmarried and a married person, the unmarried one gets rearrested first.
Here are the exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1, or 100%).
We can then compute the percentage change of the fixed-continuous variables.
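A minimal sketch of this computation from the fitted cox_model1:

```r
# Percentage change in the hazard per one-unit increase in each covariate
round((exp(coef(cox_model1)) - 1) * 100, 1)
```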
The overall significance
The likelihood-ratio test, the Wald test and the score log-rank (chi-square) statistics all evaluate the hypothesis that all coefficients are zero.
Rsquare= 0.074 (max possible= 0.956 )
Likelihood ratio test= 33.27 on 7 df, p=2.362e-05
Wald test = 32.11 on 7 df, p=3.871e-05
## Score (logrank) test = 33.53 on 7 df, p=2.11e-05

These statistics are all significant (p-values < 5%). We can then reject the null hypothesis; at least one of the covariates contributes significantly to the explanation of the duration to the event, which in this case is rearrest.
The likelihood-ratio test is generally preferred over other tests.
The intercept
There is no estimate of the intercept; it is already incorporated in the baseline hazard function.
The baseline hazard function is nonparametric: we do not plug in parameters and trace the hazard curve according to a distribution (Weibull, for example). We can see the data points, and from these observations we can estimate the hazard line (just like a regression).
The hazard at a given time interval is the probability of the given event (e.g. arrest=1, the status of the Surv object) occurring in that time period, given that the subject survived through all of the prior time intervals.
Plotting the estimated survival function
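A sketch of how such a plot can be produced; survfit on a coxph object returns the estimated survival curve (by default, at the mean values of the covariates):

```r
library(survival)
plot(survfit(cox_model1),
     xlab = 'weeks', ylab = 'proportion not rearrested')
```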
Compare
To assess, for example, how the receipt of financial aid (fin) impacted rearrest times, we can stratify the analysis.
cox_model2 <- coxph(Surv(week, arrest) ~ age + race + wexp + mar + paro + prio + strata(fin),
method = 'efron',
data = Rossi)

Plotting the results shows the difference.
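A sketch of the stratified plot, with one baseline curve per level of fin (the legend labels assume the factor levels no/yes):

```r
plot(survfit(cox_model2), lty = 1:2,
     xlab = 'weeks', ylab = 'proportion not rearrested')
legend('bottomleft', legend = c('fin = no', 'fin = yes'), lty = 1:2)
```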
A general hypothesis of the recidivism study was that those who received financial aid would have lower arrest rates than those who did not.
However, financial aid was only provided during the first 13 weeks. When we examine the data, we also see that the proportion of prisoners working full-time increases steadily over the one-year period. For this reason, it is also useful to test a time-varying hypothesis, which predicts that the effect of financial aid is greatest during the first 13 weeks.
Within the Rossi data frame, there are weekly employment status variables (i.e. emp1 to emp52), which can be used as time-varying covariates to predict the time until rearrest.
## week arrest fin age race wexp mar paro prio educ
## 1 20 1 no 27 black no not married yes 3 3
## emp1 emp2 emp3 emp4 emp5 emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13
## 1 no no no no no no no no no no no no no
## emp14 emp15 emp16 emp17 emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25
## 1 no no no no no no no <NA> <NA> <NA> <NA> <NA>
## emp26 emp27 emp28 emp29 emp30 emp31 emp32 emp33 emp34 emp35 emp36 emp37
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## emp38 emp39 emp40 emp41 emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## emp50 emp51 emp52
## 1 <NA> <NA> <NA>
Since the first individual was arrested in week 20, the employment indicators are missing for the subsequent weeks.
The first step is to create a person-period (or long) data frame that contains only the non-missing periods of observation (i.e. for person 1, only up to emp20 – after that it is only NAs).
The variables that we are concerned with during the restructuring of the data are the emp1 to emp52 variables, which chart the employment status of the individuals at each week of the study.
If we look at the first 13 variables only, we see that the first five individuals were not employed within the first three weeks of being released from prison (i.e. emp1 to emp3 equals no).
| week | arrest | fin | age | race | wexp | mar | paro | prio | educ | emp1 | emp2 | emp3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no | no | no |
| 2 | 17 | 1 | no | 18 | black | no | not married | yes | 8 | 4 | no | no | no |
| 3 | 25 | 1 | no | 19 | other | yes | not married | yes | 13 | 3 | no | no | no |
| 4 | 52 | 0 | yes | 23 | black | yes | married | yes | 1 | 5 | no | no | no |
| 5 | 52 | 0 | no | 19 | other | yes | not married | yes | 3 | 3 | no | no | no |
Creating a subject-period data frame
Restructure the data frame using a custom function, unfold, which resembles the melt function from the reshape2 package or the gather function from the tidyr package. The unfold function is located here.
We want to go from a wide data frame to a long data frame. The dataset Rossi has 432 rows and 62 columns.
With the new person-period data frame, the time intervals or periods are stacked to form a very long data frame of i rows by j columns.
## [1] 19809 14
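For intuition, here is a rough base-R sketch of the restructuring unfold performs for a single subject (person_period is a hypothetical helper, not the actual unfold code): one row per week observed, with start/stop times, an event indicator that stays 0 until the final week, and that week's employment status.

```r
person_period <- function(d, i) {
  w <- d$week[i]  # number of weeks subject i was observed
  data.frame(start = 0:(w - 1),
             stop = 1:w,
             arrest.time = c(rep(0, w - 1), d$arrest[i]),
             d[i, c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio')],
             employed = unlist(d[i, paste0('emp', 1:w)]),
             row.names = NULL)
}
# Stacking all subjects would give a data frame like Rossi_long:
# do.call(rbind, lapply(1:nrow(Rossi), person_period, d = Rossi))
```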
A part of the new Rossi_long data frame.
| start | stop | arrest.time | week | arrest | fin | age | race | wexp | mar | paro | prio | educ | employed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | 0 | 1 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.2 | 1 | 2 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.3 | 2 | 3 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.4 | 3 | 4 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.5 | 4 | 5 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.6 | 5 | 6 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.7 | 6 | 7 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.8 | 7 | 8 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.9 | 8 | 9 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.10 | 9 | 10 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.11 | 10 | 11 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.12 | 11 | 12 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.13 | 12 | 13 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.14 | 13 | 14 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.15 | 14 | 15 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.16 | 15 | 16 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.17 | 16 | 17 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.18 | 17 | 18 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.19 | 18 | 19 | 0 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
| 1.20 | 19 | 20 | 1 | 20 | 1 | no | 27 | black | no | not married | yes | 3 | 3 | no |
The first individual was rearrested at 20 weeks. If we examine the first 20 cases, we see that the arrest.time variable has a value of 0 until week 20, when the individual was rearrested, where it turns to a value of 1.
We estimate the Cox model with time-varying covariates.
library(survival)
cox_model3 <- coxph(Surv(start, stop, arrest.time) ~ fin + age + race + wexp + mar + paro + prio + employed, data = Rossi_long)
summary(cox_model3)
## Call:
## coxph(formula = Surv(start, stop, arrest.time) ~ fin + age +
## race + wexp + mar + paro + prio + employed, data = Rossi_long)
##
## n= 19809, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## finyes -0.35672 0.69997 0.19113 -1.866 0.06198 .
## age -0.04634 0.95472 0.02174 -2.132 0.03301 *
## raceother -0.33866 0.71273 0.30960 -1.094 0.27402
## wexpyes -0.02555 0.97477 0.21142 -0.121 0.90380
## marnot married 0.29375 1.34145 0.38303 0.767 0.44314
## paroyes -0.06421 0.93781 0.19468 -0.330 0.74156
## prio 0.08514 1.08887 0.02896 2.940 0.00328 **
## employedyes -1.32832 0.26492 0.25072 -5.298 1.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## finyes 0.7000 1.4286 0.4813 1.0180
## age 0.9547 1.0474 0.9149 0.9963
## raceother 0.7127 1.4031 0.3885 1.3075
## wexpyes 0.9748 1.0259 0.6441 1.4753
## marnot married 1.3414 0.7455 0.6332 2.8419
## paroyes 0.9378 1.0663 0.6403 1.3735
## prio 1.0889 0.9184 1.0288 1.1525
## employedyes 0.2649 3.7747 0.1621 0.4330
##
## Concordance= 0.708 (se = 0.027 )
## Rsquare= 0.003 (max possible= 0.066 )
## Likelihood ratio test= 68.65 on 8 df, p=9.114e-12
## Wald test = 56.15 on 8 df, p=2.632e-09
## Score (logrank) test = 64.48 on 8 df, p=6.102e-11
Here are the exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1 or 100%).
Compare to the baseline (\((e^\beta - 1) * 100\)). However, we have to set aside the binary variables on this second barplot.
The most striking difference is that the newly included time-varying covariate (employedyes) has the strongest effect of all coefficients in the model. It is also the most significant. This variable measures whether an individual was employed or not each week.
With a hazard ratio of 0.2649 and a percentage change in the hazard of -73.5%, we can conclude that the hazard of rearrest for individuals who were employed full-time is about 26.5% of the hazard for those who were not employed full-time.
We plot the hazard function and the survival curve.
By lagging a variable, we ensure that we correctly specify that the cause precedes the effect.
Most time-varying covariates should be entered in a lagged form to avoid this problem of simultaneity of cause and effect.
In the previous section, we prepared the data to predict arrests in a given week in relation to the individual’s employment status in the very same week. Being arrested on a Monday makes it impossible to work for the rest of the week.
In this case, being arrested influences one’s employment status. But since data is only registered on a weekly basis, it could appear that the person’s unemployment status had influenced their arrest status. The problem of reverse causation is a common one with time-varying covariates and even more acute when event times are measured in broader units such as a month or a year.
Creating a subject-period file with lagged variables to reduce problems of causal ordering
To lag the variable, we need to specify the employment status in the previous week. The unfold function has a parameter for that.
Unfold the data, rerun the survival analysis and print the results.
Rossi_long2 <- unfold(Rossi, 'week', 'arrest', 11:62, 'employed', lag = 1)
library(survival)
cox_model4 <- coxph(Surv(start, stop, arrest.time) ~ fin + age + race + wexp + mar + paro + prio + employed, data = Rossi_long2)
summary(cox_model4)
## Call:
## coxph(formula = Surv(start, stop, arrest.time) ~ fin + age +
## race + wexp + mar + paro + prio + employed, data = Rossi_long2)
##
## n= 19377, number of events= 113
##
## coef exp(coef) se(coef) z Pr(>|z|)
## finyes -0.35130 0.70377 0.19181 -1.831 0.067035 .
## age -0.04977 0.95144 0.02189 -2.274 0.022969 *
## raceother -0.32147 0.72508 0.30912 -1.040 0.298369
## wexpyes -0.04764 0.95348 0.21323 -0.223 0.823207
## marnot married 0.34476 1.41165 0.38322 0.900 0.368310
## paroyes -0.04710 0.95399 0.19630 -0.240 0.810375
## prio 0.09199 1.09635 0.02880 3.194 0.001402 **
## employedyes -0.78689 0.45526 0.21808 -3.608 0.000308 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## finyes 0.7038 1.4209 0.4832 1.0250
## age 0.9514 1.0510 0.9115 0.9932
## raceother 0.7251 1.3792 0.3956 1.3290
## wexpyes 0.9535 1.0488 0.6278 1.4481
## marnot married 1.4116 0.7084 0.6661 2.9917
## paroyes 0.9540 1.0482 0.6493 1.4016
## prio 1.0964 0.9121 1.0362 1.1600
## employedyes 0.4553 2.1966 0.2969 0.6981
##
## Concordance= 0.67 (se = 0.027 )
## Rsquare= 0.002 (max possible= 0.067 )
## Likelihood ratio test= 47.16 on 8 df, p=1.43e-07
## Wald test = 43.37 on 8 df, p=7.483e-07
## Score (logrank) test = 46.4 on 8 df, p=1.993e-07
The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1 or 100%).
The exponentiated coefficients compared to the baseline (\((e^\beta - 1) * 100\)). We set aside the binary variables on this second barplot.
The coefficient for the lagged employment, employedyes, is still highly significant, but its effect is clearly much smaller.
The hazard of arrest (exp(coef)) for individuals who were employed is around 45.5% of the hazard for those who were not employed.
The ‘lag’ model is more realistic: the hazard is larger, the duration is shorter, and the survival probability is smaller.
Interactions with time as time-varying covariates – part 1
Another approach to examining time-dependence is by splitting the episode by a particular time point.
We want to determine whether financial aid would lower rearrest rates. Since financial aid was only provided during the first 13 weeks, we can hypothesize that the effect of financial aid should be stronger for the first 13 weeks and then wane after this period.
Creating episode-splitting at time intervals
We cut or split the dataset at 13 weeks using the survSplit function and create a data frame named Rossi_split.
Rossi2 <- Rossi
Rossi2$id <- 1:432
Rossi_split <- survSplit(Rossi2,
cut = c(13),
end = 'week',
event = 'arrest',
start = 'start',
episode = 'i')
# Reorder
Rossi_split <- Rossi_split[order(Rossi_split$id, Rossi_split$start),]
Rossi_split[1:2,]
## fin age race wexp mar paro prio educ emp1 emp2 emp3 emp4 emp5
## 1 no 27 black no not married yes 3 3 no no no no no
## 2 no 27 black no not married yes 3 3 no no no no no
## emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13 emp14 emp15 emp16 emp17
## 1 no no no no no no no no no no no no
## 2 no no no no no no no no no no no no
## emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25 emp26 emp27 emp28 emp29
## 1 no no no <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 no no no <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## emp30 emp31 emp32 emp33 emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49 emp50 emp51 emp52 id
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 1
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 1
## start week arrest i
## 1 0 13 0 1
## 2 13 20 1 2
Above, the first subject (id=1) had an event (i.e. rearrest) in week 20 and therefore has one split at week 13. The data frame shows 2 rows (i=1 and i=2): the first 13 weeks (1-13), where arrest=0, and the second interval (weeks 14-20, ending at the rearrest), where arrest=1.
There was no event between 0 and 13 weeks. The newly defined start variable has the value of 13 for observations with a survival time greater than or equal to 13, and 0 otherwise.
Below, another subject (id=2).
Rossi_split[3:4,]
## fin age race wexp mar paro prio educ emp1 emp2 emp3 emp4 emp5
## 3 no 18 black no not married yes 8 4 no no no no no
## 4 no 18 black no not married yes 8 4 no no no no no
## emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13 emp14 emp15 emp16 emp17
## 3 no no no no yes yes yes yes yes no no no
## 4 no no no no yes yes yes yes yes no no no
## emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25 emp26 emp27 emp28 emp29
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## emp30 emp31 emp32 emp33 emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49 emp50 emp51 emp52 id
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2
## start week arrest i
## 3 0 13 0 1
## 4 13 17 1 2
We create two new Heaviside function variables.
Rossi_split$fin <- as.numeric(Rossi_split$fin)
Rossi_split$hv1 <- Rossi_split$fin*(1 - (Rossi_split$start / 13))
Rossi_split$hv2 <- Rossi_split$fin*(Rossi_split$start / 13)

We run a time-dependent model using these variables.
library(survival)
cox_model5 <- coxph(Surv(week, arrest) ~ age + prio + hv1 + hv2, data = Rossi_split)
summary(cox_model5)
## Call:
## coxph(formula = Surv(week, arrest) ~ age + prio + hv1 + hv2,
## data = Rossi_split)
##
## n= 844, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age -0.06705 0.93515 0.02087 -3.212 0.001316 **
## prio 0.09955 1.10467 0.02746 3.625 0.000289 ***
## hv1 0.70481 2.02345 0.33179 2.124 0.033651 *
## hv2 -0.54135 0.58196 0.20345 -2.661 0.007794 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 0.9352 1.0693 0.8977 0.9742
## prio 1.1047 0.9052 1.0468 1.1658
## hv1 2.0235 0.4942 1.0560 3.8772
## hv2 0.5820 1.7183 0.3906 0.8671
##
## Concordance= 0.679 (se = 0.029 )
## Rsquare= 0.054 (max possible= 0.805 )
## Likelihood ratio test= 46.73 on 4 df, p=1.732e-09
## Wald test = 41.24 on 4 df, p=2.398e-08
## Score (logrank) test = 45.21 on 4 df, p=3.6e-09
The results show a significant hazard ratio (at the 0.05 level) of 2.023 for the effect of no financial aid in the first 13 weeks (hv1).
This contrasts with the hazard ratio of 0.582 (significant at the 0.01 level) for the second 13 weeks (hv2).
The exponentiated coefficients (increasing/decreasing the risk of rearrest) and the baseline (y = 1).
The exponentiated coefficients compared to the baseline (\((e^\beta - 1) * 100\)).
\(HR\) being the hazard ratio.
\[t < 13~weeks:~HR = e^{0.7048} = 2.023\]
\[t \geq 13~weeks:~HR = e^{-0.5414} = 0.582\]
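These hazard ratios follow directly from the estimated coefficients:

```r
exp(0.70481)   # hv1: hazard ratio before 13 weeks, about 2.023
exp(-0.54135)  # hv2: hazard ratio from 13 weeks onward, about 0.582
```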
Note that the variables hv1 and hv2 are the product terms or interaction terms between \(fin \times g1\) and \(fin \times g2\). g1 and g2 being dummies.
\[if~t < 13,~g1 = 1;~otherwise,~g1 = 0\] \[if~t \geq 13,~g2 = 1;~otherwise,~g2 = 0\]
The estimates should be interpreted just as we would any interaction effect, which in this case has ‘Financial aid = yes’ as the reference group. No financial assistance clearly shows higher rearrest rates in the first weeks, since exp(coef) > 1.
There is a large difference in survival times (i.e. not being rearrested) after 13 weeks for those who did not receive any financial compensation at the beginning, in comparison to a smaller difference in survival times prior to 13 weeks (exp(coef) < 1).
Overall, those who received financial aid always did better in terms of survival at any time than those who did not.
One way to assess whether the model is adequately specified is the diagnostic approach based on the Cox-Snell residuals: if the model is correct, these residuals should behave like a sample from a unit exponential distribution, so the residual plot should follow a straight line with a slope equal to 1.
To the extent that this plot forms a 45-degree straight line through the origin, the model is correct.
First, we scale down our recidivism model, eliminating the covariates whose coefficients were not statistically significant in cox_model1.
library(survival)
cox_model6 <- coxph(Surv(week, arrest) ~ fin + age + prio, data = Rossi)
cox_model6
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi)
##
## coef exp(coef) se(coef) z p
## finyes -0.3470 0.7068 0.1902 -1.82 0.06820
## age -0.0671 0.9351 0.0209 -3.22 0.00129
## prio 0.0969 1.1017 0.0273 3.56 0.00038
##
## Likelihood ratio test=29.1 on 3 df, p=2.19e-06
## n= 432, number of events= 114
Second, we need to compute an estimate of the survival function of the residuals using the KM estimator. Finally, we construct the graphic (there is no such function included in the survival package).
r <- cox_model6$residuals
# recover the Cox-Snell residuals: status minus the martingale residuals
rr <- Rossi$arrest - r
fit <- survfit(Surv(rr, Rossi$arrest) ~ 1) # KM estimate
S <- fit$surv
T <- fit$time
# plot residuals
plot(S ~ T, xlim = range(T), ylim = c(0,1), xlab = 'time', ylab = 'Cox-Snell residual values', col = 'cyan2')
T_0 <- seq(0, max(T), 0.05)
S_0 <- exp(-T_0)
lines(S_0 ~ T_0, col = 'red3', lty = 3, lwd = 2)

Since the line fits the data and there are no serious deviations from the line, the Cox model appears to be adequate. If there were strong deviations, it would suggest that we have omitted a key covariate or that the functional form of one or more of the covariates is incorrect.
It is a fast and simple graphical check. However, other tests will assess the adequacy of a model.
Tests and graphical diagnostics for proportional hazards may be based on the ‘scaled Schoenfeld residuals’; these can be obtained directly as residuals(model, type = 'scaledsch').
cox_model6
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi)
##
## coef exp(coef) se(coef) z p
## finyes -0.3470 0.7068 0.1902 -1.82 0.06820
## age -0.0671 0.9351 0.0209 -3.22 0.00129
## prio 0.0969 1.1017 0.0273 3.56 0.00038
##
## Likelihood ratio test=29.1 on 3 df, p=2.19e-06
## n= 432, number of events= 114
The coefficient for financial aid, which is the focus of the study, has a p-value greater than 5%. This p-value is based on a two-sided test; a one-sided test would be appropriate because we expect the coefficient to be negative.
We test the proportional-hazards assumption (the null hypothesis); we test each covariate.
cox.zph(cox_model6)
## rho chisq p
## finyes -0.00657 0.00507 0.9433
## age -0.20976 6.54147 0.0105
## prio -0.08004 0.77288 0.3793
## GLOBAL NA 7.13046 0.0679
There is strong evidence of non-proportional hazards for age (we reject the null hypothesis for this covariate).
age weakens the global test. The global test does not reject the null hypothesis, but its p-value (0.0679) is barely above 5%, so the evidence in favour of proportional hazards is weak.
These tests are sensitive to linear trends in the hazard.
Visually.
The interpretation of these graphs is facilitated by using a smoothing spline, shown on each graph by a solid line; the broken lines represent 2-standard-error envelopes around the fit.
We seek straight horizontal lines; sloped lines are indicative of non-proportional hazards. The assumption of proportional hazards appears to be supported for the binary covariate finyes and the continuous covariate prio, but there appears to be a downward trend in the plot for the continuous covariate age.
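These diagnostic plots come directly from the cox.zph object; a sketch:

```r
zph <- cox.zph(cox_model6)
plot(zph)               # one panel per covariate, spline fit with 2-SE bands
plot(zph, var = 'age')  # or a single covariate
```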
One way of accommodating non-proportional hazards is to build interactions between covariates and time into the Cox regression model; such interactions are themselves time-dependent covariates.
library(survival)
cox_model7 <- coxph(Surv(start, stop, arrest.time) ~
fin + age + age:stop + prio, data = Rossi_long)
cox_model7
## Call:
## coxph(formula = Surv(start, stop, arrest.time) ~ fin + age +
## age:stop + prio, data = Rossi_long)
##
## coef exp(coef) se(coef) z p
## finyes -0.34856 0.70570 0.19023 -1.83 0.06690
## age 0.03228 1.03280 0.03943 0.82 0.41301
## prio 0.09818 1.10316 0.02726 3.60 0.00032
## age:stop -0.00383 0.99617 0.00147 -2.61 0.00899
##
## Likelihood ratio test=36 on 4 df, p=2.85e-07
## n= 19809, number of events= 114
As expected, the coefficient for the interaction is negative and highly statistically significant: the effect of age declines with time.
cox.zph(cox_model7)
## rho chisq p
## finyes -0.0068 0.00544 0.941
## age -0.0471 0.36703 0.545
## prio -0.0849 0.86449 0.352
## age:stop 0.0567 0.55701 0.455
## GLOBAL NA 1.41901 0.841
All p-values are above 5%. We can now retain the proportional-hazards assumption (we fail to reject the null hypothesis).
The age slope has seemingly lost its downward trend.
An alternative to incorporating an interaction in the model is to divide the data into strata based on the value of one or more covariates.
Each stratum is permitted to have a different baseline hazard function, while the coefficients of the remaining covariates are assumed to be constant across all strata.
We do not have to assume a particular form of interaction between the stratifying covariates and time. A disadvantage is the inability to examine the effects of the stratifying covariates. Stratification is most natural when a covariate is binary (or limited to a few categories).
In our example, age takes on many different values, but we can create age categories by dividing the variable into class intervals.
We use the recode function from the car package to categorize age.
library(car)
Rossi3 <- Rossi
Rossi3$age_cat <- recode(Rossi3$age, 'lo:19=1; 20:25=2; 26:30=3; 31:hi=4')

Most of the individuals in the dataset are in the second age category, 20 to 25.
We estimate a stratified Cox regression model by including a strata function on the right-hand side of the model formula. We then test the results.
library(survival)
cox_model8 <- coxph(Surv(week, arrest) ~
fin + prio + strata(age_cat), data = Rossi3)
cox_model8
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + prio + strata(age_cat),
## data = Rossi3)
##
## coef exp(coef) se(coef) z p
## finyes -0.3408 0.7112 0.1902 -1.79 0.0732
## prio 0.0941 1.0986 0.0270 3.48 0.0005
##
## Likelihood ratio test=13.4 on 2 df, p=0.00122
## n= 432, number of events= 114
cox.zph(cox_model8)
## rho chisq p
## finyes -0.0183 0.0392 0.843
## prio -0.0771 0.6859 0.408
## GLOBAL NA 0.7299 0.694
All p-values are above 5%. We retain the proportional-hazards assumption (we fail to reject the null hypothesis).
The method of stratification has two main advantages over interactions.
First, when we include an interaction term, we are required to choose a particular form of interaction. The stratification approach is more flexible and allows for any kind of change in the effect of a variable over time.
Second, stratification is efficient and takes less computation time.
The first disadvantage is that there are no estimates for the effect of the stratifying variable. For this reason, it would not be advisable to stratify on a key variable.
A second disadvantage is that the interaction effect of the covariate with time may offer more efficient estimates.
Finally, it is not possible to test for the main effect of the stratifying variable or its interaction with time.
Another common problem is influential observations: data points that exert an outsized influence on the estimates. It is important to know whether the removal of one observation would result in a substantial increase or decrease of a relative hazard.
type = 'dfbetas' produces the estimated changes in the coefficients divided by their standard errors; type = 'dfbeta' leaves them unscaled. We compute the dfbeta values on cox_model6, the model without interactions or stratification.
dfbeta <- residuals(cox_model6, type = 'dfbeta')

We then plot the residuals per covariate.
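A minimal sketch of these per-covariate index plots (again with the built-in lung data as a stand-in for the Rossi model):

```r
library(survival)

# Stand-in model on the built-in lung data
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
dfbeta <- residuals(fit, type = "dfbeta")

# One index plot per coefficient; points far from zero flag
# observations that move that coefficient the most
par(mfrow = c(1, ncol(dfbeta)))
for (j in seq_len(ncol(dfbeta))) {
  plot(dfbeta[, j], ylab = names(coef(fit))[j])
  abline(h = 0, lty = 2)
}
```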
Comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is individually influential.
Some of the dfbeta values for age are large compared with the others; there appears to be a cluster of older respondents. It might be useful to run further sensitivity analyses and to consider whether the model needs to be changed to accommodate these influential observations.
As with linear and generalized linear models, incorrectly specifying the functional form of a covariate in the parametric part of a Cox regression model is a potential problem.
The ‘martingale residuals’ may be plotted against covariates to detect nonlinearity. They may also be used to form partial-residual plots (in the manner of linear and generalized linear models).
Nonlinearity is not an issue for financial aid, because this covariate is dichotomous; we only compute the ‘martingale residuals’ for two covariates of cox_model6 (the model without interactions or stratification).
res <- residuals(cox_model6, type = 'martingale')
X <- as.matrix(Rossi3[, c('age', 'prio')])

We will not tackle the mathematical explanations and will keep Greek equations to a minimum. The Terminology section is a good reference for the principal key words.
Here is a tentative checklist/workflow for performing survival analyses.
Step 1
What is the shape of the baseline hazard function?
Step 2
How do the covariates affect the hazard rate?
Step 3
Step 4
Worked examples:
- leukemia_fit, a base model with no covariate, and graphics.
- leukemia_fit2 with one covariate and graphics.
- leukemia_fit2 with (binary) stratified results (using the covariate).

Non-parametric methods also include actuarial life tables and the Altschuler–Nelson estimator (both not covered).
- eha package. We can plot the results.
- survival package. We cannot plot the results since the focus is on coefficient estimation.
Specs:
We seek to find the best distribution to fit the hazard function. Among the most common models:
The exponential model
\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2\]
It is a special case of the Weibull model:
\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2 +c~log(t)\]
With c = 0, the term c log(t) vanishes and the hazard is constant over time (this corresponds to a Weibull shape parameter of 1).
Worked examples:
- exp_ph_model and graphics.

The piecewise constant exponential model
Similar to the exponential model, with the additional assumption that time is divided into different periods and that the hazard rate is then constant within any of these time periods. In other words, a piecewise exponential model gives us the flexibility of the Cox model with the added advantage of being able to estimate the shape of the hazard function.
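One way to sketch such a model with the survival package alone (using the built-in lung data as a stand-in; the cut points are arbitrary) is to split follow-up time with survSplit and fit a Poisson regression with a log-exposure offset:

```r
library(survival)

# Stand-in data with a 0/1 event indicator
lung$death <- as.integer(lung$status == 2)

# Split each subject's follow-up at arbitrary cut points
lung2 <- survSplit(Surv(time, death) ~ ., data = lung,
                   cut = c(200, 400, 600), episode = "segment")
lung2$exposure <- lung2$time - lung2$tstart

# Constant hazard within each segment: Poisson regression with an offset
pe_fit <- glm(death ~ factor(segment) + age + sex + offset(log(exposure)),
              family = poisson, data = lung2)
```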
Worked examples with 4 time segments:
- pe_ph_model and graphics (with the eha package).
- pe_ph_model2 (with the survival package).

The Weibull model
\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2 +c~log(t)\]
Worked examples:
- weibull_ph_model and graphics.

The Gompertz model (not covered)
\[log \big(h(t) \big) = \beta_0 + \beta_1x_1 + \beta_2x_2 +ct\]
Specs:
We seek to find the best distribution to fit the disturbances or residuals.
The general model
\[log~T = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon\]
The most common nonmonotonic distributions:
Worked examples:
- lognormal_aft_model.
- loglogistic_aft_model.

We can also apply the AFT model to the exponential, Weibull and Gompertz monotonic distributions.
Worked examples:
- exp_aft_model.
- weibull_aft_model.

Other notable distributions:
Finding the right distribution (density function) to fit the hazard function (PH) or disturbances (AFT) is crucial. Here are a couple of situations:
An a priori study of the data may indicate which distribution we should opt for. We might consider less common forms such as the log-normal, log-logistic, or generalized gamma distributions; or even the normal, log-gamma, logistic, and extreme-value distributions.
Tests a posteriori measure the goodness of fit of the models.
The Cox PH regression models – time-constant models
\[log \big(h(t) \big) = a(t) + \beta_1x_1 + \beta_2x_2\]
Cox models do not have an intercept term; this term is absorbed into the baseline hazard function.
Worked examples:
- cox_model1 and graphics.

Stratification
\[log \big(h(t) \big) = a_1(t) + \beta_1x_1 + \beta_2x_2\] \[log \big(h(t) \big) = a_2(t) + \beta_1x_1 + \beta_2x_2\]
Worked examples:
- cox_model2 with a binary stratification (using one covariate) and graphics.
- cox_model8 where one covariate is split into 4 categories.
- cox_model6, a base model and graphics, as a comparative basis for cox_model8.

Time-varying models
\[log \big(h(t) \big) = a(t) + \beta_1x_1 + \beta_2x_2(t)\]
Worked examples:
- cox_model3 (with a ‘long’ dataset instead of a ‘wide’ one) and graphics.

Time-varying models with a lagged covariate
\[log \big(h(t) \big) = a(t) + \beta_1x_1 + \beta_2x_2(t-1)\]
Worked examples:
- cox_model4 (with a ‘long’ dataset instead of a ‘wide’ one) and comparative plots with cox_model3.

Interaction
\[log \big(h(t) \big) = a(t) + (b+ct)x\]
Worked examples:
- cox_model5 with an interaction with time-varying covariates.
- cox_model6, a base model and graphics, vs. cox_model7 with a covariate-time interaction and plot graphics.
The piecewise constant exponential models can be considered semi-parametric.
These are models with a binary dependent variable: logit, probit, and complementary log-log. These models are related to survival analysis because a discrete event is a nonrepeated event of a single kind.
We do not cover these models in detail; we only provide an example.
We have doctoral graduates. The event of interest is ‘promotion to associate professor’ during a time window of 10 years following graduation. These events are recorded in discrete time periods since we only know the year in which the promotion occurred. The regression model captures the conditional probability of a promotion during the 10-year window depending on several explanatory variables.
Key concepts:
The hazard rate is the conditional probability that an event will occur at a particular time to a particular individual.
The linear form:
\[log \bigg( \frac{P(t)}{1 - P(t)} \bigg) = \beta_0 + \beta_1x_1 + \beta_2x_2(t)\]
\[log \bigg( \frac{P(t)}{1 - P(t)} \bigg) = \beta_0 + \beta_1x_1 + \beta_2x_2(t) + \beta_3t + \beta_4t^2\]
\[log \big[ -log(1 - P(t)) \big] = \beta_0 + \beta_1x_1 + \beta_2x_2(t)\]
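A minimal simulated sketch of the logit form (all variable names hypothetical): stack the data as one row per person and period, then fit an ordinary logistic regression; swapping in family = binomial("cloglog") gives the complementary log-log form.

```r
set.seed(1)

# Person-period data: one row per individual id and discrete period t
pp <- expand.grid(id = 1:200, t = 1:10)
pp$x <- rnorm(nrow(pp))               # a hypothetical covariate
pp$event <- rbinom(nrow(pp), 1, plogis(-3 + 0.5 * pp$x + 0.1 * pp$t))
# (a full analysis would drop each id's rows after its first event)

# Discrete-time hazard: logit of the conditional event probability,
# with a quadratic in time
fit <- glm(event ~ x + t + I(t^2), family = binomial, data = pp)
```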
Let us say we have a study spanning a given time window (defined by a beginning and an ending time period). Censoring occurs for two reasons:
Some individuals or firms (observations) leave the study because another event forces them out. For example, the event of interest is ‘promotion’ or ‘recidivism’ and the individual dies during the study.
Right censoring: an individual has not experienced an event by the time the study ends and we don’t know when or if it will occur.
An individual who is censored at a particular point in time provides no information about that person’s hazard at that time. There is no way to test the assumption that censoring is non-informative, and there are no available methods to relax that assumption; we can only ignore the problem and hope for the best. Non-parametric, parametric and semi-parametric methods handle left and interval censoring.
Outside survival analysis, censoring may pose problems to OLS models. This is a situation of ‘limited dependent variable’. We can remedy the situation with a Tobit model. By the way, the AER package for running Tobit regressions interfaces with the survreg function of the survival package.
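A minimal simulated sketch of that connection (hypothetical data): a Tobit model is a Gaussian regression with left censoring at zero, which can be expressed directly with survreg — this is the machinery that AER's tobit() wraps.

```r
library(survival)

set.seed(1)
x <- rnorm(200)
y_star <- 1 + 2 * x + rnorm(200)   # latent outcome
y <- pmax(y_star, 0)               # observed outcome, left-censored at zero

# Tobit as a left-censored Gaussian regression; y > 0 marks uncensored rows
tobit_fit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")
```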
In a truncated sample, we do not even pick up observations that lie outside a certain range: we miss both the dependent and the explanatory variables.
There are several techniques that provide guidance on model-building and diagnostics.
There has also been growing attention to multiple kinds of events, repeated and recurrent events such as multiple rearrests or recurring recidivism. Correlation of event times can occur when subjects that experience a single event belong to particular groups or clusters such as a clinic, family, workplace or geographical region. In other cases, we may face multiple relapses from remission, repeated heart attacks or infections, multiple marriages, births, cabinet durations or unemployment episodes. We can use count data models, methods based on gap times, and on times since origin.
Non-parametric KM models are related to actuarial life tables (in insurance) and half-life models. The latter, along with the parametric exponential models, have very specific applied settings. We supplied a couple of examples with the parametric exponential model. We can also think of the duration of radioactive disintegration and medication absorption (elimination).
Semi-parametric Cox PH regression models are related to Poisson regressions (count models). Survival models measure the risk and timing of one or more events, while count models account for the number of events.
Cox models can include communal covariates: covariates that vary over time outside any individual and are common to all individuals at each specific calendar time. Such an exogenous variable could be the effect of inflation (rent increases) on rearrest (and whether financial aid can curb rearrest).
We make the implicit assumption that all subjects are homogeneous. We assume that they are prone to experience events or failures in the same way, with the timing of events considered as independent from one another. Some subjects are more ‘frail’. Thus, these subjects are more likely to experience an event. Frailty models account for the unobserved heterogeneity. Frailty models are sometimes also referred to as multi-level or random effect models. Frailty is an unobserved random proportionality factor that modifies the hazard function of a subject. It is a way to introduce random effects, association and unobserved heterogeneity into duration models.
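The coxme package fits such models; a quicker sketch uses the frailty() term already available in survival (built-in lung data as a stand-in, with the institution code as the cluster):

```r
library(survival)

# Shared gamma frailty: a random multiplicative effect per institution,
# capturing unobserved heterogeneity between clusters
frailty_fit <- coxph(Surv(time, status) ~ age + frailty(inst),
                     data = lung)
```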
We can also account for competing risks with the Rossi dataset. We could consider two kinds of rearrest: crimes against property (robbery, burglary, larceny, etc.) and other crimes. We might expect that financial aid would reduce the hazard of arrests for crimes against property, where there is a financial incentive to commit the crime. We can account for dependence among different kinds of events and model (multiple events and multi-state models) a cumulative incidence function. We can also model entire histories.
Sequence analysis is another technique highly intertwined with event history analysis, as it offers the ability to model entire life paths or event history trajectories.
We address causality with the example of the lagged variable in the Cox PH regression model with time-varying covariates. There is a lot more we can take into account: causal inference, predictive causality, counterfactual or treatment effect or reverse causality, Aalen’s additive hazards model, dynamic path analysis, and matching.
- survival. Handles non-parametric models and Cox PH models. Also handles parametric AFT models, although graphics is not possible.
- eha (event history analysis). Handles parametric PH models. Also capable of handling parametric AFT models.
- coxme for frailty models in Cox regression.
- timereg for extending the Cox model, and aareg from survival (Aalen’s additive hazard models).
- cmprsk for competing risks.
And also lattice, mstate (multistate and competing risks), TraMineR (sequence analysis), cluster (cluster analysis), reshape2 and tidyr (data restructuring).
How do we choose the density and survivor function in practice? Before:
Ave Imperator, morituri te salutant (‘Hail, Emperor, those who are about to die salute you’).2