[Python] Titanic Survivor Prediction

이소티 · August 28, 2023

ML


1. Project Overview

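The Titanic, made famous by the movie! This dataset contains very detailed information about the passengers who were on board.
Let's explore whether the main features are related to passenger survival, and then check how likely the movie's male and female leads would have been to survive.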

2. Data Visualization

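import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# `titanic` is assumed to be a DataFrame already loaded from the Titanic passenger data
titanic.head()

First, let's look at the overall survival counts.

titanic['survived'].value_counts()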

f, ax = plt.subplots(1,2,figsize=(18,8))

titanic['survived'].value_counts().plot.pie(explode=[0,0.1], autopct='%1.1f%%', shadow=True, ax=ax[0])

ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')

sns.countplot(x = 'survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - Survived')

plt.show()

About 61.8% of the passengers did not survive, which is more than 800 people.
About 38.2% survived, which is roughly 500 people.

2-1. Survival by Sex

f, ax = plt.subplots(1,2,figsize=(18,8))


sns.countplot(x = 'sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of Passengers of Sex')
ax[0].set_ylabel('')


sns.countplot(x = 'sex', hue = 'survived', data=titanic, ax=ax[1])
ax[1].set_title('Sex:Survived and UnSurvived')

plt.show()

Roughly twice as many men as women boarded the ship.
Among the survivors, however, women outnumber men by about two to one.

First takeaway: men were less likely to survive.
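
The counts above can also be expressed as a survival rate per sex. A minimal sketch, assuming a small made-up DataFrame (`toy` is hypothetical, not the real passenger data) with the same `sex` and `survived` columns:

```python
import pandas as pd

# Hypothetical toy data, not the real passenger list
toy = pd.DataFrame({
    'sex':      ['male', 'male', 'male', 'male', 'female', 'female'],
    'survived': [0, 0, 0, 1, 1, 1],
})

# Because survived is 0/1, the group mean is the share of survivors
rate_by_sex = toy.groupby('sex')['survived'].mean()
print(rate_by_sex)
```

The same call on the real data, `titanic.groupby('sex')['survived'].mean()`, should give the comparable summary.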

2-2. Survival by Passenger Class

pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)

Second takeaway: first-class passengers were much more likely to survive.
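
Raw counts can be hard to compare when the classes differ in size. As a hedged sketch, `pd.crosstab` accepts `normalize='index'` to turn each row into proportions; the data below is a hypothetical toy sample (`toy`), not the real passenger list:

```python
import pandas as pd

# Hypothetical toy data, not the real passenger list
toy = pd.DataFrame({
    'pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize='index' divides each row by its row total,
# so every row shows the share of each outcome within that class
rates = pd.crosstab(toy['pclass'], toy['survived'], normalize='index')
print(rates)
```

Reading across a row then gives the share of non-survivors and survivors within that class.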

Could it be that first class simply carried more women?

grid = sns.FacetGrid(titanic, row='pclass', col='sex', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.8, bins=20)
grid.add_legend();

The histograms show that third class carried many men, especially men in their twenties.

2-3. Survival by Age

Before that, let's bin age into five categories.

titanic['age_cat'] = pd.cut(titanic['age'],  bins=[0,7,15,30,60,100], include_lowest=True, labels=['baby', 'teen', 'young', 'adult', 'old'])

titanic.head()
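
To see where the bin edges fall, here is a minimal sketch that applies the same `pd.cut` call to a handful of made-up ages (the `ages` Series is an assumption for illustration):

```python
import pandas as pd

ages = pd.Series([0, 7, 8, 15, 16, 30, 31, 60, 61])

# Same bins and labels as above; intervals are closed on the right,
# and include_lowest=True lets the first bin also contain age 0
age_cat = pd.cut(ages, bins=[0, 7, 15, 30, 60, 100], include_lowest=True,
                 labels=['baby', 'teen', 'young', 'adult', 'old'])
print(pd.DataFrame({'age': ages, 'age_cat': age_cat}))
```

An age that sits exactly on an edge, such as 7 or 15, falls into the lower category. Rows with a missing age get a missing `age_cat` as well.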

plt.figure(figsize=(12,4))

plt.subplot(131)
sns.barplot(x='pclass', y='survived', data=titanic)

plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)

plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)

plt.subplots_adjust(top=1, bottom=0.1, left=0.1, right=1, hspace=0.5, wspace=0.5)

Did being in a higher class, being younger, and being female each improve the odds of survival?

2-4. Survival by Social Status

Next, let's add a 'status' column. A passenger's status can be read from the title embedded in their name.

for idx, dataset in titanic.iterrows() :
    print(dataset['name'])

import re

# Print the title portion of each name, e.g. ", Mr." or ", the Countess."
for idx, dataset in titanic.iterrows() :
    tmp = dataset['name']
    print(re.search(r',\s\w+(\s\w+)?\.', tmp).group())

title = []

# Strip the leading ", " and the trailing "." to keep only the title itself
for idx, dataset in titanic.iterrows() :
    tmp = dataset['name']
    title.append(re.search(r',\s\w+(\s\w+)?\.', tmp).group()[2:-1])

titanic['title'] = title
titanic.head()
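
The regular expression is easier to follow on a few concrete strings. A minimal sketch, assuming illustrative names in the "Surname, Title. Given names" format (the `names` list is for demonstration only):

```python
import re

# Illustrative names in the "Surname, Title. Given names" format
names = [
    'Braund, Mr. Owen Harris',
    'Allen, Miss. Elisabeth Walton',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]

# ", " + one or two words + "."; the optional second word covers "the Countess"
pattern = re.compile(r',\s\w+(\s\w+)?\.')

# [2:-1] drops the leading ", " and the trailing "."
titles = [pattern.search(name).group()[2:-1] for name in names]
print(titles)
```

Note that `search` returns `None` when a name has no matching title, so calling `.group()` on such a name would raise an `AttributeError`.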

# Split the rare titles into male and female nobility.
# Note: 'Master' was a form of address for young boys rather than a noble title,
# and 'Dr' is not sex-specific; both are grouped under Rare_m here.

Rare_f = ['Dona', 'Lady', 'the Countess']
Rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Dr', 'Master', 'Jonkheer']

for each in Rare_f :
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')

for each in Rare_m :
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')

titanic[['title', 'survived']].groupby(['title'], as_index=False).mean()

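Noble men had a much lower survival rate than even commoner women (commoner women = Miss, Mrs).
In other words, commoner men had the lowest survival rate, followed by noble men, then commoner women, with noble women the highest.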

3. Data Preprocessing

Before modeling, let's take another look at the data.

titanic.info()

3-1. Converting Column Types

The output shows that the sex column has dtype object. Let's convert it to numbers with LabelEncoder.

# 1. Import and create the label encoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# 2. Fit on the target column
le.fit(titanic['sex'])

# 3. Transform the target column (classes are sorted, so female -> 0, male -> 1)
titanic['gender'] = le.transform(titanic['sex'])
titanic.head()
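
Which number stands for which sex matters later, when a passenger is described by hand. A minimal sketch of how `LabelEncoder` assigns codes, using a short made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['male', 'female', 'female', 'male'])

# classes_ is sorted, and each label's position is its integer code
print(list(le.classes_))

codes = le.transform(['female', 'male'])
print(list(codes))
```

With these labels, `female` becomes 0 and `male` becomes 1, regardless of the order in which they appear in the data.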

3-2. Missing Values

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]

titanic.info()
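
The two filters above keep only rows where both age and fare are present. As a hedged alternative, `DataFrame.dropna` with `subset` should do the same in one call; the sketch below checks this on a hypothetical toy frame (`toy`), not on the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in age and fare
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, 40.0],
    'fare': [7.25, 8.05, np.nan, 53.1],
})

# Boolean-mask filtering, one column at a time
masked = toy[toy['age'].notnull()]
masked = masked[masked['fare'].notnull()]

# One call: drop rows that are missing either column
dropped = toy.dropna(subset=['age', 'fare'])

print(masked.equals(dropped))
```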

4. Modeling

Now let's train a machine learning model to predict survival.

First, split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']]
y = titanic['survived']

# Hold out 20% of the rows for testing and train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth=4, random_state=13)

dt.fit(X_train, y_train)

pred = dt.predict(X_test)

print(accuracy_score(y_test,pred))

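Now let's check DiCaprio's probability of survival.

# ['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']
# Describe DiCaprio's character using the features available in this data:
# third class, 18 years old, travelling alone, low fare, male (gender = 1)

import numpy as np

dicaprio = np.array([[3, 18, 0, 0, 5, 1]])
# Column 1 of predict_proba is the probability of class 1 (survived)
print('Dicaprio : ', dt.predict_proba(dicaprio)[0, 1])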

DiCaprio's predicted probability of survival came out to about 23 percent.

Next, let's check Winslet's probability of survival.

# ['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']

winslet = np.array([[1, 16, 1, 1, 100, 0]])
print('Winslet : ', dt.predict_proba(winslet)[0,1])
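
The `[0, 1]` index picks the probability of class 1 for the first row. A minimal sketch of why that works, using a hypothetical one-feature toy dataset (`X_toy`, `y_toy`, and `clf` are made up for illustration and unrelated to the model above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature, label is 1 when the feature is >= 5
X_toy = np.arange(10).reshape(-1, 1)
y_toy = (X_toy.ravel() >= 5).astype(int)

clf = DecisionTreeClassifier(max_depth=1, random_state=13).fit(X_toy, y_toy)

# Columns of predict_proba follow clf.classes_, here [0, 1]
proba = clf.predict_proba(np.array([[8]]))
print(clf.classes_, proba)

# Probability of class 1 for the first (and only) row
p_survive = proba[0, 1]
```

For a decision tree, this probability is the share of training samples of that class in the leaf the input lands in, so it is better read as a rough estimate than as a calibrated chance of survival.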