[ML&DL] 9. Survival Analysis

KBC · December 14, 2024

Survival Analysis

  • Survival analysis concerns a special kind of outcome variable: the time until an event occurs
  • For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer
  • We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment
  • This sounds like a regression problem. But there is an important complication: some of the patients have survived until the end of the study. Such a patient's survival time is said to be censored
  • We do not want to discard this subset of surviving patients, since the fact that they survived at least five years amounts to valuable information

Non-medical Examples

  • The applications of survival analysis extend far beyond medicine. For example, consider a company that wishes to model churn, the event when customers cancel their subscription to a service
  • The company might collect data on customers over some time period, in order to predict each customer's time to cancellation
  • However, presumably not all customers will have cancelled their subscription by the end of this time period; for such customers, the time to cancellation is censored
  • Survival analysis is a very well-studied topic within statistics. However, it has received relatively little attention in the machine learning community

Survival and Censoring Times

  • For each individual, we suppose that there is a true failure or event time $T$, as well as a true censoring time $C$
  • The survival time $T$ represents the time at which the event of interest occurs (such as death)
  • By contrast, the censoring time $C$ is the time at which censoring occurs: for example, the time at which the patient drops out of the study or the study ends

  • We observe either the survival time $T$ or else the censoring time $C$. Specifically, we observe the random variable
    $$Y=\min(T,C)$$
  • If the event occurs before censoring (i.e. $T<C$) then we observe the true survival time $T$; if censoring occurs before the event ($T>C$) then we observe the censoring time $C$. We also observe a status indicator
    $$\delta=\begin{cases}1&\text{if }T\leq C\\0&\text{if }T>C\end{cases}$$
  • Finally, in our dataset we observe $n$ pairs $(Y,\delta)$, which we denote as $(y_1,\delta_1),\dots,(y_n,\delta_n)$
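As a minimal sketch of this setup, the observed pairs $(y_i,\delta_i)$ can be constructed from hypothetical event and censoring times (the distributions below are made up for illustration; in practice only $Y$ and $\delta$ are observed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical true event times T and censoring times C (in months)
T = rng.exponential(scale=30.0, size=n)       # true survival times
C = rng.uniform(low=10.0, high=60.0, size=n)  # censoring times

Y = np.minimum(T, C)          # observed time Y = min(T, C)
delta = (T <= C).astype(int)  # status: 1 if the event was observed, 0 if censored

for y, d in zip(Y, delta):
    print(f"y = {y:5.1f}, delta = {d}")
```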

  • Here is an illustration of censored survival data
  • For patients 1 and 3, the event was observed
  • Patient 2 was alive when the study ended
  • Patient 4 dropped out of the study

A Closer Look at Censoring

  • Suppose that a number of patients drop out of a cancer study early because they are very sick
  • An analysis that does not take into consideration the reason why the patients dropped out will likely overestimate the true average survival time
  • Similarly, suppose that males who are very sick are more likely to drop out of the study than females who are very sick
  • Then a comparison of male and female survival times may wrongly suggest that males survive longer than females
  • In general, we need to assume that, conditional on the features, the event time $T$ is independent of the censoring time $C$
  • The two examples above violate the assumption of independent censoring

The Survival Curve

  • The survival function (or curve) is defined as
    $$S(t)=\Pr(T>t)$$
  • This decreasing function quantifies the probability of surviving past time $t$
  • For example, suppose that a company is interested in modeling customer churn
  • Let $T$ represent the time that a customer cancels a subscription to the company's service
  • Then $S(t)$ represents the probability that a customer cancels later than time $t$
  • The larger the value of $S(t)$, the less likely that the customer will cancel before time $t$

Estimating the Survival Curve

  • Consider the BrainCancer dataset, which contains the survival times for patients with primary brain tumors undergoing treatment with stereotactic radiation methods
  • Only 53 of the 88 patients were still alive at the end of the study
  • Suppose we'd like to estimate $S(20)=\Pr(T>20)$, the probability that a patient survives for at least 20 months
  • It is tempting to simply compute the proportion of patients who are known to have survived past 20 months, that is, the proportion of patients for whom $Y>20$
  • This turns out to be $48/88$, or approximately 55%
  • However, this does not seem quite right: 17 of the 40 patients who did not survive to 20 months were actually censored, and this analysis implicitly assumes they died before 20 months
  • Hence it is probably an underestimate of $S(20)$

Kaplan-Meier Survival Curve

  • Each point in the solid step-like curve shows the estimated probability of surviving past the time indicated on the horizontal axis
  • The estimated probability of surviving past 20 months is 71%, which is quite a bit higher than the naive estimate of 55% presented earlier
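The Kaplan-Meier estimator is simple to compute from the $(y_i,\delta_i)$ pairs: at each unique death time $d_k$, the running survival estimate is multiplied by $(1-q_k/r_k)$, where $r_k$ is the number at risk and $q_k$ the number of deaths at $d_k$. A minimal sketch on toy data (hypothetical values, not the BrainCancer data):

```python
import numpy as np

def kaplan_meier(y, delta):
    """Return the unique death times and the KM survival estimate just after each."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    death_times = np.unique(y[delta == 1])
    surv, s = [], 1.0
    for d in death_times:
        r = np.sum(y >= d)                   # r_k: number at risk just before d
        q = np.sum((y == d) & (delta == 1))  # q_k: number of deaths at d
        s *= 1.0 - q / r
        surv.append(s)
    return death_times, np.array(surv)

# Toy data: 8 observations, 3 of them censored
y     = [3, 5, 5, 8, 12, 16, 20, 25]
delta = [1, 1, 0, 1, 0, 1, 0, 1]
times, surv = kaplan_meier(y, delta)
for t, s in zip(times, surv):
    print(f"S({t:.0f}) ≈ {s:.3f}")
```

Note that censored observations never trigger a downward step, but they do shrink the risk set $r_k$ for later death times, which is exactly how the information "survived at least this long" is used.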

The Log-Rank Test

  • We wish to compare the survival of males to that of females
  • Shown are the Kaplan-Meier survival curves for the two groups
  • Females seem to fare a little better up to about 50 months, but then the two curves both level off to about 50%
  • How can we carry out a formal test of equality of the two survival curves?
  • At first glance, a two-sample $t$-test seems like an obvious choice, but the presence of censoring again creates a complication
  • To overcome this challenge, we will conduct a log-rank test

  • Recall that $d_1<d_2<\cdots<d_K$ are the unique death times among the non-censored patients, $r_k$ is the number of patients at risk at time $d_k$, and $q_k$ is the number of patients who died at time $d_k$
  • We further define $r_{1k}$ and $r_{2k}$ to be the number of patients in groups 1 and 2, respectively, who are at risk at time $d_k$
  • Similarly, we define $q_{1k}$ and $q_{2k}$ to be the number of patients in groups 1 and 2, respectively, who died at time $d_k$
  • Note that $r_{1k}+r_{2k}=r_k$ and $q_{1k}+q_{2k}=q_k$
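These per-death-time counts are straightforward to tabulate. A sketch on toy data (hypothetical values; groups labeled 1 and 2 as above):

```python
import numpy as np

# Toy censored survival data with a group label for each observation
y     = np.array([4, 6, 6, 9, 11, 14])
delta = np.array([1, 1, 0, 1, 1, 0])
group = np.array([1, 2, 1, 2, 1, 2])

for d in np.unique(y[delta == 1]):        # unique death times d_1 < ... < d_K
    at_risk = y >= d                      # still under observation just before d
    died = (y == d) & (delta == 1)
    r1 = (at_risk & (group == 1)).sum()   # r_1k
    r2 = (at_risk & (group == 2)).sum()   # r_2k
    q1 = (died & (group == 1)).sum()      # q_1k
    q2 = (died & (group == 2)).sum()      # q_2k
    print(f"d_k={d}: r1k={r1}, r2k={r2}, q1k={q1}, q2k={q2}")
```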

Details of the Test Statistic

  • At each death time $d_k$, we construct a $2\times 2$ table of counts of the form shown above
  • Note that if the death times are unique (i.e. no two individuals die at the same time), then one of $q_{1k}$ and $q_{2k}$ equals one, and the other equals zero
  • To test $\mathcal{H}_0:E(X)=0$ for some random variable $X$, one approach is to construct a test statistic of the form
    $$W=\frac{X-E(X)}{\sqrt{\text{Var}(X)}}$$
    where $E(X)$ and $\text{Var}(X)$ are the expectation and variance, respectively, of $X$ under $\mathcal{H}_0$
  • In order to construct the log-rank test statistic, we compute a quantity that takes exactly the form above, with $X=\sum^K_{k=1}q_{1k}$, where $q_{1k}$ is given in the top left of the table above

  • The resulting formula for the log-rank test statistic is
    $$W=\frac{\sum^K_{k=1}(q_{1k}-E(q_{1k}))}{\sqrt{\sum^K_{k=1}\text{Var}(q_{1k})}}=\frac{\sum^K_{k=1}\left(q_{1k}-\frac{q_k}{r_k}r_{1k}\right)}{\sqrt{\sum^K_{k=1}\frac{q_k(r_{1k}/r_k)(1-r_{1k}/r_k)(r_k-q_k)}{r_k-1}}}$$
  • When the sample size is large, the log-rank test statistic $W$ has approximately a standard normal distribution
  • This can be used to compute a $p$-value for the null hypothesis that there is no difference between the survival curves in the two groups
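The formula above translates directly into code. A sketch with hypothetical data (here the groups are labeled 1 and 0 rather than 1 and 2; note that swapping the group labels simply flips the sign of $W$, and $|W|$ is compared against a standard normal):

```python
import numpy as np

def log_rank(y, delta, group):
    """Two-sample log-rank statistic W, computed term by term from the formula."""
    y, delta, group = map(np.asarray, (y, delta, group))
    num, var = 0.0, 0.0
    for d in np.unique(y[delta == 1]):               # unique death times d_k
        at_risk = y >= d
        r, r1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        died = (y == d) & (delta == 1)
        q, q1 = died.sum(), (died & (group == 1)).sum()
        num += q1 - q * r1 / r                       # q_1k - E(q_1k)
        if r > 1:                                    # Var(q_1k) term
            var += q * (r1 / r) * (1 - r1 / r) * (r - q) / (r - 1)
    return num / np.sqrt(var)

# Toy data (hypothetical values)
y     = np.array([4., 6., 9., 10., 12., 15., 18., 20.])
delta = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group = np.array([1, 1, 0, 1, 0, 0, 1, 0])
W = log_rank(y, delta, group)
print(W)
```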

  • Comparing the survival times of females and males on the BrainCancer data gives a log-rank test statistic of $W=1.2$, which corresponds to a two-sided $p$-value of $0.2$
  • Thus, we cannot reject the null hypothesis of no difference between the two survival curves

Regression Models with a Survival Response

  • We now consider the task of fitting a regression model to survival data
  • We wish to predict the true survival time $T$
  • Since the observed quantity $Y=\min(T,C)$ is positive and may have a long right tail, we might be tempted to fit a linear regression of $\log(Y)$ on $X$
  • But censoring again creates a problem
  • To overcome this difficulty, we instead make use of a sequential construction, similar to the idea used for the Kaplan-Meier survival curve

The Hazard Function

  • The hazard function or hazard rate - also known as the force of mortality - is formally defined as
    $$h(t)=\lim_{\Delta t\rightarrow 0}\frac{\Pr(t<T\leq t+\Delta t\mid T>t)}{\Delta t}$$
    where $T$ is the (true) survival time
  • It is the death rate in the instant after time $t$, given survival up to that time
  • The hazard function is the basis for the Proportional Hazards Model
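As a small numerical sanity check of the limit definition: for an exponential survival time $T$ with rate $\lambda$, we have $S(t)=e^{-\lambda t}$, and the hazard works out to the constant $h(t)=\lambda$ (the memoryless property). A sketch with assumed illustrative values:

```python
import math

# Assumed values for illustration: rate lam, evaluation time t, small increment dt
lam, t, dt = 0.3, 5.0, 1e-6

def S(u):
    """Survival function of an Exponential(rate=lam) time: S(u) = exp(-lam*u)."""
    return math.exp(-lam * u)

# Finite-difference version of the definition:
# Pr(t < T <= t+dt | T > t) / dt = (S(t) - S(t+dt)) / (S(t) * dt)
h = (S(t) - S(t + dt)) / (S(t) * dt)
print(h)  # close to lam = 0.3
```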

The Proportional Hazards Model

  • The proportional hazards assumption states that
    $$h(t|x_i)=h_0(t)\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)$$
    where $h_0(t)\geq 0$ is an unspecified function, known as the baseline hazard
  • It is the hazard function for an individual with features $x_{i1}=\cdots=x_{ip}=0$
  • The name proportional hazards arises from the fact that the hazard function for an individual with feature vector $x_i$ is some unknown function $h_0(t)$ times the factor
    $$\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)$$
  • This quantity is called the relative risk for the feature vector $x_i=(x_{i1},\dots,x_{ip})$, relative to that for the feature vector $x_i=(0,\dots,0)$
  • What does it mean that the baseline hazard function $h_0(t)$ is unspecified?
  • Basically, we make no assumption about its functional form
  • We allow the instantaneous probability of death at time $t$, given that one has survived at least until time $t$, to take any form
  • This means that the hazard function is very flexible and can model a wide range of relationships between the covariates and survival time
  • Our only assumption is that a one-unit increase in $x_{ij}$ corresponds to an increase in $h(t|x_i)$ by a factor of $\exp(\beta_j)$
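That last point is easy to verify numerically: under the proportional hazards model, the ratio of hazards for two individuals does not depend on $t$ or on $h_0$. A sketch with $p=1$ and assumed illustrative values for $\beta$ and the baseline hazard:

```python
import math

beta = 0.5                            # assumed coefficient, for illustration

def h0(t):
    """An arbitrary nonnegative baseline hazard (its form never matters below)."""
    return 0.1 + 0.02 * t

def h(t, x):
    """Proportional hazards model with p = 1: h(t|x) = h0(t) * exp(x * beta)."""
    return h0(t) * math.exp(x * beta)

# A one-unit increase in x multiplies the hazard by exp(beta), at every time t:
for t in (1.0, 5.0, 10.0):
    ratio = h(t, 2.0) / h(t, 1.0)
    print(round(ratio, 6), round(math.exp(beta), 6))
```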

  • Here is an example with $p=1$ and a binary covariate $x_i\in\{0,1\}$
  • Top row: the log hazard and the survival function under the model are shown (green for $x_i=0$ and black for $x_i=1$). Because of the proportional hazards assumption, the log hazard functions differ by a constant, and the survival functions do not cross
  • Bottom row: the proportional hazards assumption does not hold

Partial Likelihood

  • Because the form of the baseline hazard is unknown, we cannot simply plug $h(t|x_i)$ into the likelihood and then estimate $\beta=(\beta_1,\dots,\beta_p)^T$ by maximum likelihood
  • The magic of Cox's proportional hazards model lies in the fact that it is in fact possible to estimate $\beta$ without having to specify the form of $h_0(t)$
  • To accomplish this, we make use of the same sequential in time logic that we used to derive the Kaplan-Meier survival curve and the log-rank test
  • Assume that the failure times are unique. Then the total hazard at failure time $y_i$ over the observations still at risk (those with $y_{i'}\geq y_i$) is
    $$\sum_{i':y_{i'}\geq y_i}h_0(y_i)\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)$$

  • Therefore, the probability that the $i$th observation is the one to fail at time $y_i$ (as opposed to one of the other observations in the risk set) is
    $$\frac{h_0(y_i)\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}h_0(y_i)\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}=\frac{\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}$$
  • Notice that the unspecified baseline hazard function $h_0(y_i)$ cancels out of the numerator and denominator

  • The partial likelihood is simply the product of these probabilities over all of the uncensored observations
    $$\text{PL}(\beta)=\prod_{i:\delta_i=1}\frac{\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}$$
  • Critically, the partial likelihood is valid regardless of the true value of $h_0(t)$, making the model very flexible and robust
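The partial likelihood (here in negative-log form) can be written down and maximized directly. A sketch for $p=1$ on toy data; a crude grid search stands in for the Newton-type optimization used by real Cox model software:

```python
import numpy as np

def neg_log_partial_likelihood(beta, x, y, delta):
    """Negative log partial likelihood for p = 1 (assumes no tied event times)."""
    x, y, delta = map(np.asarray, (x, y, delta))
    eta = x * beta                             # linear predictor x_i * beta
    nll = 0.0
    for i in np.where(delta == 1)[0]:          # product over uncensored obs
        risk = y >= y[i]                       # risk set {i' : y_i' >= y_i}
        nll -= eta[i] - np.log(np.sum(np.exp(eta[risk])))
    return nll

# Toy data (hypothetical values): a binary covariate, times, and status
x     = np.array([0., 1., 0., 1., 1., 0.])
y     = np.array([5., 3., 9., 4., 10., 12.])
delta = np.array([1, 1, 0, 1, 1, 1])

grid = np.linspace(-3, 3, 601)
vals = [neg_log_partial_likelihood(b, x, y, delta) for b in grid]
beta_hat = grid[int(np.argmin(vals))]
print(f"estimated beta ≈ {beta_hat:.2f}")
```

Note that the baseline hazard $h_0$ appears nowhere in the code, mirroring its cancellation in the formula above.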

Relative Risk Functions at each Failure Time

  • For example, suppose the uncensored failures occur for observations 1, 3, and 5, at times $y_1<y_3<y_5$. The partial likelihood is then the product of the relative risk terms
    $$RR_1(\beta)=\frac{\exp\left(\sum^p_{j=1}x_{1j}\beta_j\right)}{\sum_{i':y_{i'}\geq y_1}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}$$
    $$RR_3(\beta)=\frac{\exp\left(\sum^p_{j=1}x_{3j}\beta_j\right)}{\sum_{i':y_{i'}\geq y_3}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}$$
    $$RR_5(\beta)=\frac{\exp\left(\sum^p_{j=1}x_{5j}\beta_j\right)}{\sum_{i':y_{i'}\geq y_5}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}$$

All contents are written based on the GIST Machine Learning & Deep Learning lecture (Instructor: Prof. Sun-dong Kim)
