Survival analysis concerns a special kind of outcome variable: the time until an event occurs
For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer
We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment
Sounds like a regression problem. But there is an important complication: some of the patients have survived until the end of the study. Such a patient's survival time is said to be censored
We do not want to discard this subset of surviving patients, since the fact that they survived at least five years amounts to valuable information
Non-medical Examples
The applications of survival analysis extend far beyond medicine. For example, consider a company that wishes to model churn, the event when customers cancel subscription to a service
The company might collect data on customers over some time period, in order to predict each customer's time to cancellation
However, presumably not all customers will have cancelled their subscription by the end of this time period; for such customers, the time to cancellation is censored
Survival analysis is a very well-studied topic within statistics. However, it has received relatively little attention in the machine learning community
Survival and Censoring Times
For each individual, we suppose that there is a true failure or event time T, as well as a true censoring time C
The survival time T represents the time at which the event of interest occurs (such as death)
By contrast, the censoring time C is the time at which censoring occurs: for example, the time at which the patient drops out of the study or the study ends
We observe either the survival time T or else the censoring time C. Specifically, we observe the random variable
Y=min(T,C)
If the event occurs before censoring (i.e. T<C) then we observe the true survival time T; if censoring occurs before the event (T>C) then we observe the censoring time. We also observe a status indicator
$$\delta = \begin{cases} 1 & \text{if } T \le C \\ 0 & \text{if } T > C \end{cases}$$
Finally, in our dataset we observe n pairs (Y, δ), which we denote as (y1,δ1),…,(yn,δn)
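As a small illustration of these definitions, the sketch below builds the observed pairs (yi, δi) from simulated event and censoring times (purely hypothetical data; in practice T and C are never both observed for the same individual):

```python
# A minimal sketch on hypothetical data: constructing the observed pairs (y_i, delta_i)
# from simulated event times T and censoring times C, to illustrate Y = min(T, C)
# and delta = 1 if T <= C, 0 otherwise.
import numpy as np

rng = np.random.default_rng(0)
n = 5
T = rng.exponential(scale=40.0, size=n)       # simulated event times, in months
C = rng.uniform(low=0.0, high=60.0, size=n)   # simulated censoring times

y = np.minimum(T, C)             # observed time: Y = min(T, C)
delta = (T <= C).astype(int)     # status indicator: 1 if the event was observed, 0 if censored

for yi, di in zip(y, delta):
    print(f"y = {yi:5.1f}, delta = {di}")
```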
Here is an illustration of censored survival data
For patients 1 and 3, the event was observed
Patient 2 was alive when the study ended
Patient 4 dropped out of the study
A Closer Look at Censoring
Suppose that a number of patients drop out of a cancer study early because they are very sick
An analysis that does not take into consideration the reason why the patients dropped out will likely overestimate the true average survival time
Similarly, suppose that males who are very sick are more likely to drop out of the study than females who are very sick
Then a comparison of male and female survival times may wrongly suggest that males survive longer than females
In general, we need to assume that, conditional on the features, the event time T is independent of the censoring time C
The two examples above violate the assumption of independent censoring
The Survival Curve
The survival function (or curve) is defined as
S(t)=Pr(T>t)
This decreasing function quantifies the probability of surviving past time t
For example, suppose that a company is interested in modeling customer churn
Let T represent the time that a customer cancels a subscription to the company's service
Then S(t) represents the probability that a customer cancels later than time t
The larger the value of S(t), the less likely that the customer will cancel before time t
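As a quick sanity check of this definition, here is a small simulation, assuming (purely hypothetically) that the time-to-churn T is exponential with mean 24 months, in which case S(t) = exp(−t/24):

```python
# A quick Monte Carlo illustration of the survival function S(t) = Pr(T > t),
# under the hypothetical assumption that T is exponential with mean 24 months.
import numpy as np

rng = np.random.default_rng(1)
T = rng.exponential(scale=24.0, size=100_000)  # simulated churn times

for t in (6, 12, 24, 48):
    empirical = (T > t).mean()        # fraction of customers who cancel after time t
    exact = np.exp(-t / 24.0)         # closed form for this exponential: S(t) = exp(-t/24)
    print(f"t = {t:2d}: empirical S(t) = {empirical:.3f}, exact = {exact:.3f}")
```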
Estimating the Survival Curve
Consider the BrainCancer dataset, which contains the survival times for patients with primary brain tumors undergoing treatment with stereotactic radiation methods
Only 53 of the 88 patients were still alive at the end of the study
Suppose we'd like to estimate S(20)=Pr(T>20), the probability that a patient survives for at least 20 months
It is tempting to simply compute the proportion of patients who are known to have survived past 20 months, that is, the proportion of patients for whom Y>20
This turns out to be 48/88 or approximately 55%
However, this does not seem quite right: 17 of the 40 patients who did not survive to 20 months were actually censored, and this analysis implicitly assumes they died before 20 months
Hence it is probably an underestimate
Kaplan-Meier Survival Curve
Each point in the solid step-like curve shows the estimated probability of surviving past the time indicated on the horizontal axis
The estimated probability of survival past 20 months is 71%, which is quite a bit higher than the naive estimate of 55% presented earlier
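The curve is computed with the Kaplan-Meier (product-limit) estimator, which multiplies the estimated conditional survival probabilities at each observed death time: S(t) is estimated by the product over death times dk ≤ t of (1 − qk/rk), where rk patients are at risk and qk die at dk. Below is a minimal sketch of this calculation on hypothetical data (the numbers are made up; the BrainCancer data itself is not used here), together with the naive proportion for comparison:

```python
# A minimal sketch of the Kaplan-Meier product-limit estimator on hypothetical data.
# At each unique death time d_k, r_k patients are at risk and q_k die, and
# S_hat(t) is the product over d_k <= t of (1 - q_k / r_k).
import numpy as np

def kaplan_meier(y, delta, t):
    """Estimate S(t) = Pr(T > t) from observed times y and status indicators delta."""
    y, delta = np.asarray(y, float), np.asarray(delta, int)
    death_times = np.unique(y[delta == 1])
    s = 1.0
    for d in death_times[death_times <= t]:
        r_k = np.sum(y >= d)                    # number at risk just before time d
        q_k = np.sum((y == d) & (delta == 1))   # number of deaths at time d
        s *= 1.0 - q_k / r_k
    return s

# Hypothetical observed data (months, status); censored observations have delta = 0.
y     = [3, 8, 12, 15, 21, 26, 30, 34, 40, 45]
delta = [1, 0,  1,  1,  0,  1,  0,  1,  0,  0]
print("naive Pr(Y > 20):   ", np.mean(np.array(y) > 20))             # ignores censoring
print("Kaplan-Meier S(20): ", round(kaplan_meier(y, delta, 20), 3))  # accounts for censoring
```

As in the BrainCancer example, the Kaplan-Meier estimate here comes out higher than the naive proportion, because censored observations are not treated as deaths.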
The Log-Rank Test
We wish to compare the survival of males to that of females
Shown are the Kaplan-Meier survival curves for the two groups
Females seem to fare a little better up to about 50 months, but then the two curves both level off to about 50%
How can we carry out a formal test of equality of the two survival curves?
At first glance, a two-sample t-test seems like an obvious choice, but the presence of censoring again creates a complication
To overcome this challenge, we will conduct a log-rank test
Recall that d1<d2<⋯<dK are the unique death times among the non-censored patients, rk is the number of patients at risk at time dk, and qk is the number of patients who died at time dk
We further define r1k and r2k to be the number of patients in groups 1 and 2, respectively, who are at risk at time dk
Similarly, we define q1k and q2k to be the number of patients in groups 1 and 2, respectively, who died at time dk
Note that r1k+r2k=rk and q1k+q2k=qk
Details of the Test Statistic
At each death time dk, we construct a 2×2 table of counts of the form

              Group 1          Group 2          Total
Died          q1k              q2k              qk
Survived      r1k − q1k        r2k − q2k        rk − qk
Total         r1k              r2k              rk
Note that if the death times are unique (i.e. no two individuals die at the same time), then one of q1k and q2k equals one, and the other equals zero
To test H0:E(X)=0 for some random variable X, one approach is to construct a test statistic of the form
$$W = \frac{X - E(X)}{\sqrt{\mathrm{Var}(X)}}$$
where E(X) and Var(X) are the expectation and variance, respectively, of X under H0
In order to construct the log-rank test statistic, we compute a quantity that takes exactly the form above, with X = ∑_{k=1}^{K} q1k, where q1k appears in the top left of the table above
The resulting formula for the log-rank test statistic is

$$W = \frac{\sum_{k=1}^{K} \left( q_{1k} - \frac{q_k r_{1k}}{r_k} \right)}{\sqrt{\sum_{k=1}^{K} \frac{q_k \,(r_{1k}/r_k)\,(1 - r_{1k}/r_k)\,(r_k - q_k)}{r_k - 1}}}$$

where the expectation and variance of q1k under H0 are the mean and variance of a hypergeometric distribution applied to each 2×2 table
When the sample size is large, the log-rank test statistic W has approximately a standard normal distribution
This can be used to compute a p-value for the null hypothesis that there is no difference between the survival curves in the two groups
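To make this construction concrete, here is a minimal sketch that assembles W from the counts qk, rk, q1k, r1k on hypothetical two-group data (the data and the helper function are made up for illustration); under H0, E(q1k) and Var(q1k) are the hypergeometric mean and variance for each table:

```python
# A minimal sketch of the log-rank statistic W on hypothetical two-group data.
# At each death time d_k we compute q_1k and r_1k (group 1) and q_k and r_k (overall),
# then use the hypergeometric mean and variance of q_1k under H0.
import numpy as np
from scipy.stats import norm

def log_rank(y1, d1, y2, d2):
    y = np.concatenate([y1, y2])
    delta = np.concatenate([d1, d2])
    group = np.concatenate([np.ones(len(y1)), 2 * np.ones(len(y2))])
    num, var = 0.0, 0.0
    for dk in np.unique(y[delta == 1]):
        at_risk = y >= dk
        r_k, r_1k = at_risk.sum(), (at_risk & (group == 1)).sum()
        died = (y == dk) & (delta == 1)
        q_k, q_1k = died.sum(), (died & (group == 1)).sum()
        num += q_1k - r_1k * q_k / r_k                                  # q_1k - E(q_1k)
        if r_k > 1:
            var += q_k * (r_1k / r_k) * (1 - r_1k / r_k) * (r_k - q_k) / (r_k - 1)
    W = num / np.sqrt(var)
    return W, 2 * norm.sf(abs(W))   # two-sided p-value from the N(0,1) approximation

# Hypothetical data: observed times and status indicators for two groups.
y1 = np.array([4., 10., 14., 20., 31., 45.]); d1 = np.array([1, 1, 0, 1, 0, 1])
y2 = np.array([6., 12., 19., 27., 33., 50.]); d2 = np.array([1, 0, 1, 1, 1, 0])
W, p = log_rank(y1, d1, y2, d2)
print(f"W = {W:.2f}, two-sided p-value = {p:.2f}")
```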
Comparing the survival times of females and males on the BrainCancer data gives a log-rank test statistic of
W=1.2
which corresponds to a two-sided p-value of 0.2
Hence, we cannot reject the null hypothesis that the survival curves for males and females are the same
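If the lifelines package is available (an assumption; it is not used in the text), the same kind of comparison can be run directly:

```python
# Hedged sketch: a log-rank comparison via the lifelines package, assumed installed.
import numpy as np
from lifelines.statistics import logrank_test

y1 = np.array([4., 10., 14., 20., 31., 45.]); d1 = np.array([1, 1, 0, 1, 0, 1])
y2 = np.array([6., 12., 19., 27., 33., 50.]); d2 = np.array([1, 0, 1, 1, 1, 0])

result = logrank_test(y1, y2, event_observed_A=d1, event_observed_B=d2)
# lifelines reports a chi-squared form of the statistic (roughly W squared) and a p-value
print(result.test_statistic, result.p_value)
```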
Regression Models with a Survival Response
We now consider the task of fitting a regression model to survival data
We wish to predict the true survival time T
Since the observed quantity Y=min(T,C) is positive and may have a long right tail, we might be tempted to fit a linear regression of log(Y) on X
But censoring again creates a problem
To overcome this difficulty, we instead make use of a sequential construction, similar to the idea used for the Kaplan-Meier survival curve
The Hazard Function
The hazard function or hazard rate, also known as the force of mortality, is formally defined as
$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t \mid T > t)}{\Delta t}$$
where T is the (true) survival time
It is the death rate in the instant after time t, given survival up to that time
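As a quick illustration of this definition (using a hypothetical exponential survival time, which has a constant hazard), the conditional death rate just after t can be estimated by simulation and compared with the known rate:

```python
# A small numerical illustration of the hazard definition, assuming (hypothetically)
# that T is exponential with rate 0.05 per month: the estimated conditional death
# rate Pr(t < T <= t + dt | T > t) / dt is then roughly constant and equal to 0.05.
import numpy as np

rng = np.random.default_rng(2)
T = rng.exponential(scale=1 / 0.05, size=1_000_000)   # simulated survival times
dt = 0.1
for t in (5.0, 20.0, 50.0):
    alive = T > t
    h_hat = np.mean(T[alive] <= t + dt) / dt   # conditional death rate just after t
    print(f"t = {t:4.1f}: estimated hazard = {h_hat:.3f}")
```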
The hazard function is the basis for the Proportional Hazards Model
The Proportional Hazards Model
The proportional hazards assumption states that
$$h(t \mid x_i) = h_0(t) \exp\left(\sum_{j=1}^{p} x_{ij} \beta_j\right)$$
where h0(t)≥0 is an unspecified function, known as the baseline hazard
It is the hazard function for an individual with features xi1=⋯=xip=0
The name proportional hazards arises from the fact that the hazard function for an individual with feature vector xi is some unknown function h0(t) times the factor
$$\exp\left(\sum_{j=1}^{p} x_{ij} \beta_j\right)$$
The quantity
$$\exp\left(\sum_{j=1}^{p} x_{ij} \beta_j\right)$$
is called the relative risk for the feature vector xi=(xi1,⋯,xip), relative to that for the feature vector xi=(0,⋯,0)
What does it mean that the baseline hazard function h0(t) is unspecified?
Basically, we make no assumption about its functional form
We allow the instantaneous probability of death at time t, given that one has survived at least until time t, to take any form
This means that the hazard function is very flexible and can model a wide range of relationships between the covariates and survival time
Our only assumption is that a one-unit increase in xij corresponds to an increase in h(t∣xi) by a factor of exp(βj)
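A tiny numerical check of this interpretation, using a made-up baseline hazard h0(t) and a made-up coefficient β = log 2: the ratio h(t ∣ x+1) / h(t ∣ x) equals exp(β) = 2 at every t, no matter what h0 looks like:

```python
# Hedged sketch: under proportional hazards, a one-unit increase in x multiplies the
# hazard by exp(beta), regardless of the (hypothetical) baseline hazard chosen here.
import numpy as np

def h0(t):                       # an arbitrary, increasing baseline hazard (an assumption)
    return 0.01 * t ** 1.5

def hazard(t, x, beta=np.log(2.0)):
    return h0(t) * np.exp(x * beta)

for t in (1.0, 10.0, 100.0):
    ratio = hazard(t, x=1.0) / hazard(t, x=0.0)
    print(f"t = {t:6.1f}: h(t | x=1) / h(t | x=0) = {ratio:.2f}")   # always 2.00
```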
Here is an example with p=1 and a binary covariate xi∈{0,1}
Top row: the log hazard and the survival function under the model are shown (green for xi=0 and black for xi=1). Because of the proportional hazards assumption, the log hazard functions differ by a constant, and the survival functions do not cross
Bottom row: the proportional hazards assumption does not hold
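Anticipating the estimation machinery described next, here is a hedged sketch of what fitting the model looks like in practice, using the lifelines package on simulated data (the package, the column names, and the data-generating choices are assumptions, not part of the text):

```python
# Hedged sketch: fitting a Cox proportional hazards model on simulated data with
# lifelines (assumed installed). Column names 'time', 'status', and 'x' are made up.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 200
x = rng.binomial(1, 0.5, size=n)                                  # a single binary feature
T = rng.exponential(scale=30.0 / np.exp(np.log(2.0) * x), size=n) # true beta = log 2
C = rng.uniform(0.0, 60.0, size=n)                                # independent censoring times
df = pd.DataFrame({
    "time": np.minimum(T, C),
    "status": (T <= C).astype(int),
    "x": x,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="status")
print(cph.params_)            # estimated beta
print(np.exp(cph.params_))    # estimated relative risk, which should be near 2 here
```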
Partial Likelihood
Because the form of the baseline hazard is unknown, we cannot simply plug h(t∣xi) into the likelihood and then estimate β = (β1,…,βp)^T by maximum likelihood
The magic of Cox's proportional hazards model lies in the fact that it is possible to estimate β without having to specify the form of h0(t)
To accomplish this, we make use of the same sequential-in-time logic that we used to derive the Kaplan-Meier survival curve and the log-rank test
Then the total hazard at failure time yi for the at-risk observations is
$$\sum_{i': y_{i'} \ge y_i} h_0(y_i) \exp\left(\sum_{j=1}^{p} x_{i'j} \beta_j\right)$$
Therefore, the probability that the ith observation is the one to fail at time yi (as opposed to one of the other observations in the risk set) is