Survival Rates of Titanic Passengers with Machine Learning

해소리 · May 24, 2022


First, get the Titanic sample data.
Titanic data on PinkWink
Open the downloaded file in VS Code, load it with pd.read_excel, and check that the Titanic data was read in correctly.
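As a rough sketch of that loading step, the snippet below reads the file with pd.read_excel and checks that the columns used later in this post are present. The file name titanic.xls and the helper missing_columns are assumptions made for illustration, not something from the original download page, so adjust the path to wherever you saved the file.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the downloaded file; change it to your own path.
DATA_PATH = Path("titanic.xls")

# Columns that the rest of this post refers to.
EXPECTED_COLUMNS = ["pclass", "survived", "sex", "age"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


if DATA_PATH.exists():
    # Reading .xls files requires the xlrd package (openpyxl for .xlsx).
    titanic = pd.read_excel(DATA_PATH)
    print(titanic.head())
    print("missing columns:", missing_columns(titanic))
else:
    print(f"{DATA_PATH} not found; download the data first.")
```

If the printed list of missing columns is empty, the snippets that follow should find the columns they need.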

We will be using matplotlib and seaborn, so import them.

import matplotlib.pyplot as plt
import seaborn as sns

If seaborn is not installed yet, install it first:

!pip install -U seaborn

Passenger Survival and Death Rates

Next, we plot the data with seaborn.
The plots show the share of passengers who survived and the share who died.

# Two side-by-side plots: survival share (pie) and raw counts (bar)
f, ax = plt.subplots(1, 2, figsize=(16, 8))

titanic['survived'].value_counts().plot.pie(ax=ax[0], autopct='%1.1f%%', shadow=True, explode=[0, 0.05])
ax[0].set_title('pie plot - survived')
ax[0].set_ylabel('')

sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('count plot - survived')

plt.show()
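The percentages drawn on the pie chart can also be checked numerically. The sketch below uses a tiny made-up survived column (not the real Titanic data) to show how value_counts produces both the counts and the proportions.

```python
import pandas as pd

# Toy stand-in for the real data: 1 = survived, 0 = died (made-up values).
toy = pd.DataFrame({"survived": [0, 0, 0, 1, 1, 0, 1, 0]})

# Raw counts per class, sorted from most to least frequent.
counts = toy["survived"].value_counts()

# Same thing as proportions; these are the numbers a pie chart labels.
rates = toy["survived"].value_counts(normalize=True)

print(counts)
print(rates)
```

Because value_counts sorts by frequency, the first pie wedge is the larger group, and the second entry of explode pulls out the smaller one.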

Next, tabulate the number of survivors and deaths for each passenger class, including the totals.

pd.crosstab(titanic['pclass'],titanic['survived'], margins=True)
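To see what margins=True adds, here is a minimal sketch on a small made-up table (the values are not from the Titanic data). It also shows normalize="index", which turns each row into within-class rates.

```python
import pandas as pd

# Made-up toy data: passenger class and survival flag.
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Counts per (class, survived) pair; margins=True appends "All" totals.
table = pd.crosstab(toy["pclass"], toy["survived"], margins=True)
print(table)

# Row-normalised version: share of deaths/survivors within each class.
rate = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")
print(rate)
```

The "All" row and column hold the totals, so the bottom-right cell is the overall passenger count.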

Then use plt.hist inside a FacetGrid to draw an age histogram for each combination of class and sex.

grid =sns.FacetGrid(titanic, row='pclass',col= 'sex', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.8, bins= 20)
grid.add_legend();

Mortality by Age

To look at mortality by age group, start by plotting the age distribution of the passengers.

import plotly.express as px

# Interactive histogram of passenger ages
fig = px.histogram(titanic, x='age')
fig.show()

Judging by the area of the bars, passengers in their twenties and young children make up the largest groups.

Mortality by Class

Write code that shows mortality for each passenger class.

grid =sns.FacetGrid(titanic, row='pclass',col= 'survived', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.5, bins= 20)
grid.add_legend();

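From these plots we can see that the higher the class, the higher the survival rate.

Next, split the passengers into the age groups 'baby', 'teen', 'young', 'adult', and 'old' and compare them.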

titanic['age_cat']= pd.cut(titanic['age'], bins=[0,7,15,30,60,100],
                            include_lowest=True,
                            labels= [ 'baby','teen','young','adult','old'])
titanic.head()
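The bin edges are easy to get wrong, so here is a small sketch of how pd.cut assigns ages that sit exactly on a boundary. The ages are made up for illustration; the bins and labels are the same as above.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 7, 8, 15, 29, 30, 45, 80])

# Bins are right-inclusive: (0, 7], (7, 15], (15, 30], (30, 60], (60, 100].
# include_lowest=True also puts age 0 into the first bin.
age_cat = pd.cut(
    ages,
    bins=[0, 7, 15, 30, 60, 100],
    include_lowest=True,
    labels=["baby", "teen", "young", "adult", "old"],
)

print(pd.DataFrame({"age": ages, "age_cat": age_cat}))
```

An age of exactly 7 lands in 'baby' and exactly 30 in 'young'. Rows with a missing age stay missing in age_cat, so they are left out of the bar plot for that column.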

Plot the grouped titanic data as bar charts.

plt.figure(figsize=(14,6))
plt.subplot(131)
sns.barplot(x= 'pclass',y= 'survived',data=titanic)
plt.subplot(132)
sns.barplot(x= 'age_cat',y= 'survived',data=titanic)
plt.subplot(133)
sns.barplot(x= 'sex',y= 'survived',data=titanic)
plt.show()

Survival Rate by Noble Title

Before computing survival rates by title, we need to tidy up the titles in the passenger list.

pd.crosstab(titanic['title'], titanic['sex'])

This shows which titles are used by men and which by women.
Let's pull out just the titles and take a look.

titanic['title'].unique()

array(['Miss', 'Master', 'Mr', 'Mrs', 'Col', 'Mme', 'Dr', 'Major', 'Capt',
'Lady', 'Sir', 'Mlle', 'Dona', 'Jonkheer', 'the Countess', 'Don',
'Rev', 'Ms'], dtype=object)

These are the unique titles in the data.

Now let's group the titles by sex.

titanic['title'] = titanic['title'].replace('Mlle','Miss')
titanic['title'] = titanic['title'].replace('Ms','Miss')
titanic['title'] = titanic['title'].replace('Mme','Mrs')

Rare_f= ['Dona','Lady','the Countess']
Rare_m= ['Capt','Col','Don','Major','Rev','Sir','Master','Dr','Jonkheer']
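The post does not show how the title column was created or how the two lists are applied, so the following is only a sketch under assumptions: that titles are taken from a name column formatted like "Surname, Title. Given names", and that the rare titles are relabelled as 'Rare_f' and 'Rare_m' (hypothetical label names). The names and survival flags are invented for illustration.

```python
import pandas as pd

# Invented passengers; the "Surname, Title. Given names" format is assumed.
toy = pd.DataFrame({
    "name": [
        "Doe, Miss. Jane",
        "Doe, Mr. John",
        "Roe, the Countess. of (Anna)",
        "Poe, Col. Edgar",
        "Moe, Mlle. Marie",
        "Low, Dona. Carmen",
    ],
    "survived": [1, 0, 1, 0, 1, 1],
})

# Take the text between the comma and the first period as the title.
toy["title"] = toy["name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Merge spelling variants, as in the replace calls above.
toy["title"] = toy["title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

Rare_f = ["Dona", "Lady", "the Countess"]
Rare_m = ["Capt", "Col", "Don", "Major", "Rev", "Sir", "Master", "Dr", "Jonkheer"]

# Collapse each rare list into a single label (label names are hypothetical).
toy["title"] = toy["title"].replace(Rare_f, "Rare_f")
toy["title"] = toy["title"].replace(Rare_m, "Rare_m")

print(toy[["title", "survived"]].groupby("title", as_index=False).mean())
```

After this step only a handful of title values remain, which keeps the grouped survival table short enough to read at a glance.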

The code below shows the survival rate for each title.

titanic[['title', 'survived']].groupby(['title'],as_index=False).mean()

This shows that the survival rate rises in the order commoner men → noble men → commoner women → noble women.

Checking DiCaprio's Survival Probability

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth=4, random_state=13)
dt.fit(x_train, y_train)

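The machine learning step relies on the two imports above: DecisionTreeClassifier from sklearn.tree for the model and accuracy_score from sklearn.metrics for evaluation. x_train and y_train are the training features and labels, which must be prepared before calling fit.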
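Since the post does not show where x_train and y_train come from, here is a self-contained sketch of the whole step on synthetic data. Everything about the data is an assumption: the six feature names and their order (pclass, age, sibsp, parch, fare, gender), the gender encoding (0 = female, 1 = male), the 80/20 split, and the made-up rule that generates the labels. Only the classifier settings match the code above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic table (not real passenger data).
rng = np.random.default_rng(13)
n = 200
toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "age": rng.integers(1, 70, n),
    "sibsp": rng.integers(0, 3, n),
    "parch": rng.integers(0, 3, n),
    "fare": rng.uniform(5, 100, n).round(2),
    "gender": rng.integers(0, 2, n),  # assumed encoding: 0 = female, 1 = male
})

# Made-up labelling rule so the tree has something to learn.
score = 0.6 * (toy["gender"] == 0) + 0.3 * (toy["pclass"] == 1)
toy["survived"] = (rng.random(n) < 0.1 + score).astype(int)

# Assumed feature order; a query row must use the same order.
features = ["pclass", "age", "sibsp", "parch", "fare", "gender"]
x_train, x_test, y_train, y_test = train_test_split(
    toy[features], toy["survived"], test_size=0.2, random_state=13
)

dt = DecisionTreeClassifier(max_depth=4, random_state=13)
dt.fit(x_train, y_train)
print("accuracy:", accuracy_score(y_test, dt.predict(x_test)))

# One hypothetical passenger, passed as a DataFrame to keep column names.
passenger = pd.DataFrame([[3, 18, 0, 0, 5, 1]], columns=features)
proba = dt.predict_proba(passenger)
print(dict(zip(dt.classes_, proba[0])))
```

The columns of predict_proba follow dt.classes_, so with labels 0 and 1 the second value is the predicted probability of survival. The numbers printed here come from synthetic data and say nothing about the real passengers.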

import numpy as np

dicaprio = np.array([[3,18,0,0,5,1]])
print('dicaprio:', dt.predict_proba(dicaprio))

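The np.array holds DiCaprio's feature values in the same column order as the training data: 3 for third class, 18 for his age, followed by the values for accompanying family members, the ticket fare, and so on. Passing it to the model gives dicaprio: [[0.7704918 0.2295082]].
The first value is the predicted probability of class 0 (did not survive) and the second is the probability of class 1 (survived).
So DiCaprio's predicted survival probability is 0.2295082, or about 23%.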

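Today I used machine learning to estimate the survival rates of Titanic passengers and DiCaprio's survival probability.

This was my first time learning machine learning, but between seeing that data can be used to make predictions and picking up the basic principles, I lost track of time.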
