Titanic dataset

eunbi kim · April 28, 2024

machine learning python

EDA

Survivor prediction - Decision Tree

What were DiCaprio's chances of surviving???? In this post I ran an analysis in which a decision tree predicts the answer.

EDA

Looking at the data

import pandas as pd

titanic_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/titanic.xls'
titanic = pd.read_excel(titanic_url)

titanic.head()

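Column information:
pclass: passenger (cabin) class
survived: whether the passenger survived
name: name
sex: sex
age: age
sibsp: number of siblings/spouses aboard
parch: number of parents/children aboard
fare: fare paid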

Plotting survival

import matplotlib.pyplot as plt
import seaborn as sns

titanic['survived'].value_counts()

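The survived column takes only the values 0 (did not survive) and 1 (survived).

To draw several plots at once, use subplots.
The canvas for two side-by-side plots is set up like this: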

plt.subplots(1, 2, figsize=(16,8))

I drew a pie plot and a count plot.

f, ax = plt.subplots(1, 2, figsize=(18,8))

titanic['survived'].value_counts().plot.pie(ax=ax[0], autopct='%1.1f%%', shadow=True, explode=[0, 0.05])
ax[0].set_title('Pie plot - survived')
ax[0].set_ylabel('')

sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - survived')

plt.show()

-> The survival rate is 38.2%; about 500 people survived.
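The count and the rate can also be read straight from the column rather than from the chart. A minimal sketch on a small made-up Series (`toy` is a hypothetical stand-in for `titanic['survived']`, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for titanic['survived']; the values are made up.
toy = pd.Series([0, 1, 0, 0, 1], name='survived')

counts = toy.value_counts()               # number of rows per class
rates = toy.value_counts(normalize=True)  # same, as fractions of the total

print(counts[1], rates[1])
```

Because the column is 0/1, `toy.mean()` gives the same rate as `rates[1]`.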

Survival by sex

f, ax = plt.subplots(1, 2, figsize=(18,8))

sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of passengers of sex')
ax[0].set_ylabel('')

sns.countplot(x='sex', data=titanic, hue='survived', ax=ax[1])
ax[1].set_title('Sex : survived')

plt.show()

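-> Men were less likely to survive.

About half as many women as men were aboard.
Of the fewer than 500 female passengers, more than 300 survived.
Of the more than 800 men, just under 700 did not survive.
That is roughly five times as many as the women who did not survive.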

Survival rate by wealth

crosstab: uses the first argument (pclass) as the index and splits the second argument (survived) into one column per value.

pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)

-> First-class passengers were much more likely to survive.
Women also had a high survival rate, so were there many women in first class?
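One way to put numbers on both observations is `normalize='index'`, which turns each crosstab row into fractions. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy passengers; the values are made up for illustration.
toy = pd.DataFrame({
    'pclass':   [1, 1, 1, 3, 3, 3, 3, 3],
    'sex':      ['female', 'female', 'male', 'male', 'male', 'male', 'female', 'male'],
    'survived': [1, 1, 0, 0, 0, 1, 1, 0],
})

# Survival rate per class: each row sums to 1.
rate_by_class = pd.crosstab(toy['pclass'], toy['survived'], normalize='index')

# Share of women and men within each class.
sex_by_class = pd.crosstab(toy['pclass'], toy['sex'], normalize='index')

print(rate_by_class)
print(sex_by_class)
```

Applied to the real frame, the second table would answer the question of how many first-class passengers were women.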

FacetGrid: I drew age histograms for each combination of pclass and sex.

grid = sns.FacetGrid(titanic, row='pclass', col='sex', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.8, bins=20)
grid.add_legend();

-> Third class had many men, especially men in their 20s.

Passengers by age

import plotly.express as px

fig = px.histogram(titanic, x='age')
fig.show()

-> Children and people from their late teens to mid-30s made up most of the passengers.

Survival by age within each passenger class

grid = sns.FacetGrid(titanic, row='pclass', col='survived', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.5, bins=20)
grid.add_legend();

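-> A higher cabin class does seem to go with a higher survival rate.

Can we see the number of survivors by class, age, and sex all at once?

Grouping age into 5 bins

pandas cut: splits the age column into the intervals given in bins -> stored in a new age_cat column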

titanic['age_cat'] = pd.cut(titanic['age'], bins=[0, 7, 15, 30, 60, 100],
       include_lowest=True,
       labels=['baby', 'teen', 'young', 'adult', 'old'])
titanic.head()
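Which bin an age on an edge falls into is easy to get wrong, so here is a small check with the same bins and labels. The `ages` values are hypothetical, chosen to sit on and around the edges:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 7, 8, 15, 30, 31, 60, 61, None])

cats = pd.cut(ages, bins=[0, 7, 15, 30, 60, 100],
              include_lowest=True,
              labels=['baby', 'teen', 'young', 'adult', 'old'])

# Bins are closed on the right: 7 is still 'baby', 30 is still 'young'.
# include_lowest=True keeps age 0 in the first bin; missing ages stay NaN.
print(cats.tolist())
```

Rows with a missing age therefore get no age_cat value.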

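Drawing the plots - subplot

subplot(131), subplot(132), and subplot(133) refer to the first, second, and third cells of a 1-row, 3-column grid.
The thin line at the top of each bar is the confidence interval.

The plots are by pclass, by age_cat, and by sex.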

plt.figure(figsize=(14, 6))
plt.subplot(131) 
sns.barplot(x='pclass', y='survived', data=titanic)


plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)


plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)

First class, baby, and female were the groups with the best chances of survival.
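The bar heights in these plots are group means of the 0/1 survived column (seaborn's barplot uses the mean as its default estimator), so the same numbers can be obtained with a groupby. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy passengers; the values are made up for illustration.
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'survived': [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate.
rate_by_sex = toy.groupby('sex')['survived'].mean()

print(rate_by_sex)
```

Swapping `'sex'` for `'pclass'` or `'age_cat'` on the real frame would give the heights of the other two plots.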

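And if you were a poor (third-class) man in your late 20s... -> the pattern suggests that the vulnerable were helped first.
T_T

Let's look more closely at survival by age for men and women

Use subplots with 1 row and 2 columns (= two plots side by side)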

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6))

women = titanic[titanic['sex'] == 'female']
men = titanic[titanic['sex'] == 'male']

ax = sns.histplot(women[women['survived']==1]['age'], bins=20, label='survived', ax=axes[0], alpha=0.5) 
ax = sns.histplot(women[women['survived']==0]['age'], bins=40, label='not survived', ax=axes[0], alpha=0.5) 
ax.legend(); ax.set_title('Female')

ax = sns.histplot(men[men['survived']==1]['age'], bins=18, label='survived', ax=axes[1], alpha=0.5) 
ax = sns.histplot(men[men['survived']==0]['age'], bins=40, label='not survived', ax=axes[1], alpha=0.5) 
ax.legend(); ax.set_title('Male')

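-> The difference in survival between men and women shows up clearly in these plots.
Women survived at a higher rate overall, and men at a lower rate.

Looking at social status (nobility/commoner)

A passenger's social status can be read from the name.

name has the structure [family name, title, given name]. -> The title part can be matched with the regular expression \,\s\w+(\s\w+)?\.
-> a comma, one whitespace character, a run of word characters, then optionally another whitespace character and another run of word characters (so one or two words), ending with a period
-> the slice [2:-1] drops the leading comma and space and the trailing period
-> the result is added as the title column.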

import re

title = []

for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    title.append(re.search(r'\,\s\w+(\s\w+)?\.', tmp).group()[2:-1])  # raw string avoids invalid-escape warnings
    
titanic['title'] = title
titanic.head()
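The pattern and the [2:-1] slice can be checked in isolation on a couple of name strings in the dataset's "family name, title. given name" format. `extract_title` is a hypothetical helper introduced only for this sketch, and the pattern is an equivalent raw-string form of the one in the loop:

```python
import re

# Equivalent to the pattern used in the loop, written as a raw string.
TITLE_PATTERN = re.compile(r',\s\w+(\s\w+)?\.')

def extract_title(name):
    """Return the title between the comma and the first period (hypothetical helper)."""
    matched = TITLE_PATTERN.search(name).group()  # e.g. ', Miss.'
    return matched[2:-1]  # drop the leading ', ' and the trailing '.'

print(extract_title('Allen, Miss. Elisabeth Walton'))
print(extract_title('Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)'))
```

This prints `Miss` and `the Countess`; the optional second word in the pattern is what lets two-word titles through.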

Titles by sex:

pd.crosstab(titanic['title'], titanic['sex'])

titanic['title'].unique()

Mlle and Mme are said to mean the same as Miss and Mrs, so I replaced them with Miss/Mrs (and Ms with Miss).

titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')

Rare_f = ['Dona', 'Lady', 'the Countess']
Rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Dr', 'Master', 'Jonkheer']

Loop over the lists of noble titles and replace each one with Rare_f / Rare_m

for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')

for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')

titanic['title'].unique()

-> The titles are now grouped into noble women (Rare_f), noble men (Rare_m), commoner women (Miss, Mrs), and commoner men (Mr).
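The same cleanup can be written as one dictionary passed to `replace`, which keeps all the rules in a single place. A sketch on a hypothetical sample of titles (`titles` and `mapping` are illustrative names, and the mapping covers only the sample):

```python
import pandas as pd

# Hypothetical sample of titles, not the full column.
titles = pd.Series(['Mr', 'Mlle', 'Ms', 'Mme', 'Lady', 'Sir', 'Mrs'])

# One mapping instead of several replace calls and loops.
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
           'Lady': 'Rare_f', 'Sir': 'Rare_m'}

cleaned = titles.replace(mapping)

print(cleaned.tolist())
```

Values that are not keys of the mapping, such as Mr and Mrs here, are left unchanged.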

Survival rate grouped by title:

titanic[['title', 'survived']].groupby(['title'], as_index=False).mean()

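-> Noble men had a much lower survival rate than even commoner women.
Survival rate order: commoner men < noble men < commoner women < noble women

The story goes that the second officer proposed to the captain that women and children be put in the lifeboats first, and the captain approved...
-> which would explain why women and children had high survival rates!

Survivor prediction - Decision Tree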

The pclass, sex, age, sibsp, parch, and fare columns look usable as features.

For machine learning, every column has to be numeric,
e.g. female/male encoded as 0/1.

LabelEncoder from the preprocessing module

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(titanic['sex'])

le.classes_

-> Because the encoder was fit on sex, its classes are female and male.

transform() is what converts female/male to 0/1. -> stored in a new gender column

titanic['gender'] = le.transform(titanic['sex'])
titanic.head()
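Which label gets which number follows from the sorted order of `classes_`. A small check on a hypothetical list of labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['male', 'female', 'female'])

# classes_ is sorted, so 'female' -> 0 and 'male' -> 1.
codes = le.transform(['male', 'female', 'female'])

# inverse_transform maps the numbers back to the original labels.
labels = le.inverse_transform([0, 1])

print(list(le.classes_), list(codes), list(labels))
```

So in the gender column, 0 stands for female and 1 for male.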

Drop rows with missing values: keep only the rows where age and fare are notnull()

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
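The two notnull() filters can also be written as a single `dropna` call with `subset`. The sketch below compares both on a small made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age and one missing fare.
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, 40.0],
    'fare': [7.25, 8.05, np.nan, 30.0],
})

# Two-step filter, as in the post.
step = toy[toy['age'].notnull()]
step = step[step['fare'].notnull()]

# Equivalent single call: drop rows where either column is missing.
dropped = toy.dropna(subset=['age', 'fare'])

print(len(toy), len(dropped))
```

Either form keeps the original index labels of the surviving rows.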

Select the features and split the data

from sklearn.model_selection import train_test_split

X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']]
y = titanic['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=13)
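Note what `test_size=0.8` does here: 80% of the rows go to the test set, so the tree is trained on only 20% of the data. The sketch below shows the resulting sizes on ten hypothetical samples, next to `test_size=0.2`, a split that is seen more often in tutorials:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples; only the sizes of the split matter here.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.8, as in the post: 80% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.8, random_state=13)

# test_size=0.2: 80% of the rows are used for training instead.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=13)

print(len(X_tr), len(X_te), len(X_tr2), len(X_te2))
```

Whether the smaller training set is intended is not stated in the post; the reported accuracy below belongs to the 0.8 setting as written.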

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth=4, random_state=13)
dt.fit(X_train, y_train)

pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))

-> 0.7655502392344498

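What were our man DiCaprio's chances of survival?

DiCaprio's character in the movie Titanic has to be described in the same order as ['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender'].
He was in third class, 18 years old, had no parents, siblings, or children aboard, paid a fare of 5 dollars, and was male.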

import numpy as np

dicaprio = np.array([[3, 18, 0, 0, 5, 1]])
print('Dicaprio : ', dt.predict_proba(dicaprio))

-> Dicaprio : [[0.7704918 0.2295082]]

-> column 0: probability of not surviving, column 1: probability of surviving
His predicted survival probability is about 23%.
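The column order of predict_proba follows the fitted model's `classes_`, and each row is the class mix of the training samples in the leaf the input falls into. A minimal sketch on hypothetical one-feature data (`tree`, `X`, and `y` are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: three samples at x=0 (two of class 0, one of class 1)
# and one sample at x=1 (class 1).
X = np.array([[0], [0], [0], [1]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=1, random_state=13).fit(X, y)

# Columns of predict_proba follow tree.classes_, here [0, 1].
proba = tree.predict_proba(np.array([[0], [1]]))

print(tree.classes_, proba)
```

The x=0 leaf holds two class-0 samples and one class-1 sample, so its row is about [0.67, 0.33]. A value of exactly 0.0 or 1.0 only means the leaf was pure in the training data, not that the outcome is certain.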

The heroine, Winslet

winslet = np.array([[1, 16, 1, 1, 100, 0]])
print('winslet : ', dt.predict_proba(winslet)[0, 1])  # [0, 1]: row 0, column 1 = probability of surviving

winslet : 1.0

-> She survives with a predicted probability of 100%.
