[Statistical Modeling] Chat GPT와 함께 하는 생존 분석

허상범·2023년 10월 2일

통계 분석 및 모델링 공부를 꾸준히 하려고 하는데, 모 게임사의 데이터 분석가 JD를 보니 아래와 같은 내용이 있었다.

중급 이상의 통계분석 능력(회귀/인과/생존분석 등)을 갖추신 분

회귀, 인과의 경우에는 어떤 것인지 대강 알고 있는데 생존 분석의 경우에는 와이블 분포, 포아송 분포와 같이 데이터가 특이한 확률분포를 따르는 경우에 적용되는 내용으로 알고 있어 한 번 파봐야겠다는 생각을 했었다.

일단 생존 분석의 정의를 보자.

Survival analysis is a branch of statistics that studies how long it takes for certain instances to occur.

예를 들어, 발생 확률이 높지 않은 특정 사고(accident)가 발생하기까지 어느 정도가 걸릴지를 예측하는 모델링이 생존 분석 모델링이다. Airbnb의 경우에는 고객 구매 취소율 등을 예측할 때 생존 분석을 사용한다는 이야기를 언뜻 들었던 것 같기도 하고.. Uber의 경우에도 비슷하게 배차 예약 취소율 등을 모델링할 때 사용하지 않나 싶다.

생존 분석은 직관적으로 생각해도 생존 함수의 구조와 개형이 중요할 것으로 보여진다. 학부 4학년때 들었던 빅데이터통계읽기 수업에서 와이블 분포를 다뤘던 기억이 나는데 물론 가물가물하다..

좌우지간.이번에도 Chat GPT를 사용하여 내가 생소한 분야에 대해 최소한의 일반적 구조와 얼개를 잡고, 필요로 하는 데이터셋과 모델의 코드를 달라고 요청했다. (참고로, 생존 분석에 사용되는 가장 일반적인 파이썬 라이브러리는 lifelines라는 라이브러리임을 확인하여 곧바로 해당 라이브러리를 사용한 코드를 달라고 요청했다.

Q1.

I want to know a survival modeling process utilizing the 'lifelines' library. Can you send me a sample code of the model ?

A1.

import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines import CoxPHFitter
from lifelines import WeibullAFTFitter
from sklearn.model_selection import train_test_split

# Load your dataset (assuming you have a CSV file named 'survival_data.csv')
data = pd.read_csv('survival_data.csv')

# Data Preprocessing: Assume 'duration' is time until event and 'event' is 1 if event occurred, 0 otherwise
# For demonstration purposes, let's assume you have features 'feature1', 'feature2', etc.
X = data[['feature1', 'feature2']]
y = data[['duration', 'event']]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Kaplan-Meier Estimator for Survival Curve
kmf = KaplanMeierFitter()
kmf.fit(y_train['duration'], event_observed=y_train['event'])
kmf.plot_survival_function()
# Optionally, you can also plot survival curves for different groups using kmf.plot()

# Cox Proportional Hazard Model
cph = CoxPHFitter()
cph.fit(pd.concat([X_train, y_train], axis=1), duration_col='duration', event_col='event')
cph.plot_partial_effects_on_outcome(covariates='feature1', values=[0, 1, 2])  # Example partial effects plot

# Weibull Accelerated Failure Time (AFT) Model
wft = WeibullAFTFitter()
wft.fit(pd.concat([X_train, y_train], axis=1), duration_col='duration', event_col='event')
wft.plot_partial_effects_on_outcome(covariates='feature2', values=[0, 1, 2])  # Example partial effects plot

# Evaluate models on the test set
# For example, you can use cph.predict_expectation(X_test) to get survival time predictions from the Cox model

# Note: This is a simplified example. You might need to handle missing data, perform feature engineering,
#       and fine-tune models based on your specific dataset and requirements.
.

한 가지 문제는 나에게 Raw Data가 없기 때문에 Raw Data를 같이 달라고 요청했다. 생존 분석과 같은 통계 모델링에서 사용하는 Raw Data는 무슨 형태로 구조화되어 있어야 하는지 알지 못하기 때문이다.

그랬더니 아래와 같이 생존 분석 모델 학습에 사용될 Synthetic dataset을 하나 만들어주더라. GPT 센세....

import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Number of samples in the dataset
num_samples = 1000

# Generate random features for the dataset
feature1 = np.random.normal(loc=0, scale=1, size=num_samples)
feature2 = np.random.uniform(low=0, high=1, size=num_samples)

# Generate random durations (time to event)
# For demonstration purposes, let's assume durations follow a Weibull distribution
shape = 2  # Shape parameter for the Weibull distribution
scale = 10  # Scale parameter for the Weibull distribution
durations = np.random.weibull(shape, size=num_samples) * scale

# Generate binary event indicators (1 if event occurred, 0 otherwise)
# Let's assume events occur if feature1 is positive and feature2 is greater than 0.5
events = (feature1 > 0) & (feature2 > 0.5)

# Create a DataFrame with the generated data
data = pd.DataFrame({
    'feature1': feature1,
    'feature2': feature2,
    'duration': durations,
    'event': events
})

# Save the synthetic dataset to a CSV file
data.to_csv('synthetic_survival_data.csv', index=False)

# Print a sample of the generated dataset
print(data.head())

결론

그래서 데이터 생성 => 모델 테스트/트레이닝셋 분할 => 예측 3분할로 나눠진 코드는 개인 Github Repo에 잘 저장해두었다 !

데이터가 어떤 확률 분포를 따르는지에 따라 각 생존 분석 함수의 개형이 아래와 같이 다르게 그려지는 것을 확인할 수 있다.

결과 해석의 문제 : 생존 분석 차트 x, y axis의 의미는 ?

그냥 얼렁뚱땅 넘어가려다가 빠르게 생존 분석 그래프를 읽는 방법을 GPT 센세에게 여쭤보았다.

Q2. While comprehending the visualization result of wft.plot() within the lifelines library, I don't know the meanings of the x and y axes. Is the y-axis a value of probability ?

A2.

When you use wft.plot() in the context of the lifelines library in Python, it generally refers to a visualization related to the Weibull Accelerated Failure Time (AFT) model. Specifically, wft.plot() is used to visualize the survival function based on the Weibull distribution, which represents the probability of an event (such as failure) not occurring before a certain time.

In the context of lifelines:

X-Axis: The x-axis typically represents time or the duration until the event of interest occurs. It shows the range of time periods over which you're estimating the survival probabilities.

Y-Axis: The y-axis represents the estimated survival probabilities. Survival probability is a measure of the likelihood that an event (such as failure) has not occurred by a given time. In the case of the Weibull distribution, the y-axis values represent these probabilities. The survival function, denoted as $S(t)$ , gives the probability that the event has not occurred by time $t$ . It's important to note that these values range between 0 and 1, where 1 represents a 100% probability that the event has not occurred, and 0 represents a 0% probability (or certainty) that the event has not occurred.

So, in summary, the y-axis values in the wft.plot() visualization represent survival probabilities, not raw probabilities. It shows how the probability of survival changes over different time intervals according to the Weibull AFT model. The visualization helps you understand the survival behavior of your data based on the Weibull distribution.

정리

마지막 질문에서 굉장히 좋은 정리가 이루어진 것은...

Survival( $S(t)$ ) $:=$ 특정 Black Swan Event가 발생하지 않을 확률
- 즉, 시간 $t$ 를 독립변수로 갖고, 종속변수로 정상 상태가 유지될 확률값을 모델링하는 것이 생존 분석이다. 좀 더 쉽게 풀면 결국 특정 사고로부터 생존할 확률을 모델링하자는 것이다. (보험사에서 생존 분석을 하는 이유 !)