Decision Tree (Titanic)

offpython · May 15, 2023

Machine Learning


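Loading the file

import pandas as pd
import matplotlib.pyplot as plt  # used by the plots in the EDA section
import seaborn as sns            # used by the plots in the EDA section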

titanic_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/titanic.xls'
titanic = pd.read_excel(titanic_url)

EDA

Overall survival

f, ax = plt.subplots(1, 2, figsize=(18, 8))

titanic['survived'].value_counts().plot.pie(ax=ax[0], autopct='%1.1f%%', shadow=True, explode=[0, 0.05])
ax[0].set_title('Pie plot - survived')
ax[0].set_ylabel('')

sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - survived')

plt.show()

Survival by sex

f, ax = plt.subplots(1, 2, figsize=(16, 8))

sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of passengers of sex')
ax[0].set_ylabel('')

sns.countplot(x='sex', data=titanic, hue='survived',ax=ax[1])
ax[1].set_title('Sex : survived ')

There were roughly twice as many male passengers as female passengers, yet fewer men than women survived.
-> Men were less likely to survive.
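The counts in the plot can be turned into rates with a groupby. The following is a minimal sketch on a small made-up table (the `toy` frame is hypothetical, not the Titanic data). Because `survived` is coded 0/1, its mean within each group is that group's survival rate.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic table
toy = pd.DataFrame({
    'sex':      ['male', 'male', 'male', 'male', 'female', 'female'],
    'survived': [0, 0, 0, 1, 1, 0],
})

# survived is 0/1, so the group mean equals the survival rate
rate_by_sex = toy.groupby('sex')['survived'].mean()
print(rate_by_sex)
```

On the real data the same line would be applied to `titanic` instead of `toy`.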

Survival rate by economic status (passenger class)

pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)

First-class passengers had a very high chance of survival, and women also had a high survival rate.
=> Were there many women in first class?

grid = sns.FacetGrid(titanic, row='pclass', col='sex', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.8, bins=20)
grid.add_legend();

-> Third class had many men (especially men in their twenties).
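The same question can also be checked numerically rather than by reading histograms. This sketch uses `pd.crosstab` with `normalize='index'` to get the share of each sex within each class. The `toy` frame is invented for illustration and does not reflect the real passenger mix.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic table
toy = pd.DataFrame({
    'pclass': [1, 1, 1, 1, 3, 3, 3, 3],
    'sex':    ['female', 'female', 'male', 'male',
               'male', 'male', 'male', 'female'],
})

# normalize='index' makes every row (class) sum to 1
share = pd.crosstab(toy['pclass'], toy['sex'], normalize='index')
print(share)
```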

Passengers by age

import plotly.express as px

fig = px.histogram(titanic, x='age')
fig.show()

-> There were many children and many passengers in their 20s and 30s.

Survival by class across age groups

grid = sns.FacetGrid(titanic, row='pclass', col='survived', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.5, bins=20)
grid.add_legend();

-> The higher the cabin class, the higher the survival rate.

Binning age into five categories

titanic['age_cat'] = pd.cut(titanic['age'], bins=[0, 7, 15, 30, 60, 100],
                            include_lowest=True,
                            labels= ['baby', 'teen', 'young', 'adult', 'old'])
titanic.head()
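To see where the bin edges fall, here is a small sketch with a few hand-picked ages (the `ages` series is illustrative, not taken from the dataset). `pd.cut` uses right-closed intervals by default, so 7 falls in 'baby' and 8 in 'teen'. `include_lowest=True` is what lets an age of exactly 0 land in the first bin.

```python
import pandas as pd

# Hand-picked ages that sit on or next to the bin edges
ages = pd.Series([0, 7, 8, 15, 30, 31, 60, 80])

age_cat = pd.cut(ages, bins=[0, 7, 15, 30, 60, 100],
                 include_lowest=True,
                 labels=['baby', 'teen', 'young', 'adult', 'old'])

print(list(age_cat))
# ['baby', 'baby', 'teen', 'teen', 'young', 'adult', 'adult', 'old']
```

Rows with a missing age get a missing category as well, since `pd.cut` leaves NaN inputs as NaN.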

Visualizing survival rates by age category, sex, and class

plt.figure(figsize=(14, 6))

plt.subplot(131)
sns.barplot(x='pclass', y='survived', data=titanic)

plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)

plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)

plt.show()

-> Being young, female, and in first class favored survival.

A closer look at survival by age for men and women

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

women = titanic[titanic['sex']=='female']
men = titanic[titanic['sex']=='male']

# sns.histplot is axes-level and accepts ax=; sns.displot is figure-level and does not
ax = sns.histplot(women[women['survived']==1]['age'], bins=20, label='survived', ax=axes[0], kde=False)
ax = sns.histplot(women[women['survived']==0]['age'], bins=40, label='not survived', ax=axes[0], kde=False)
ax.legend(); ax.set_title('Female')

# filter the men frame with its own mask (the original indexed women with a men mask)
ax = sns.histplot(men[men['survived']==1]['age'], bins=18, label='survived', ax=axes[1], kde=False)
ax = sns.histplot(men[men['survived']==0]['age'], bins=40, label='not survived', ax=axes[1], kde=False)
ax.legend(); ax.set_title('Male')

Extracting titles from passenger names

import re

title = []
for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    title.append(re.search(r',\s\w+(\s\w+)?\.', tmp).group()[2:-1])
    
titanic['title'] = title
titanic.head()
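The loop above can also be written without `iterrows`. As a sketch of one alternative (not the approach used in this post), `Series.str.extract` applies a regex with a capture group to every name at once. The three names below are illustrative strings in the dataset's "Surname, Title. Given names" format.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. Given names" format
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Braund, Mr. Owen Harris',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# The capture group keeps only the title, so no [2:-1] slicing is needed.
# (?:\s\w+)? allows two-word titles such as "the Countess".
titles = names.str.extract(r',\s(\w+(?:\s\w+)?)\.', expand=False)

print(titles.tolist())
# ['Miss', 'Mr', 'the Countess']
```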

Grouping titles by social status (nobility)

pd.crosstab(titanic['title'], titanic['sex'])

titanic['title'].unique()
# array(['Miss', 'Master', 'Mr', 'Mrs', 'Col', 'Mme', 'Dr', 'Major', 'Capt', 'Lady', 'Sir', 'Mlle', 'Dona', 'Jonkheer', 'the Countess', 'Don', 'Rev', 'Ms'], dtype=object)
       
titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Miss')

Rare_f = ['Dona', 'Lady', 'the Countess']
Rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Dr', 'Master', 'Jonkheer']

for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')
    
for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')    
    
titanic['title'].unique()
# array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)
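The repeated `replace` calls can be collapsed into a single dictionary lookup. This is a sketch of that alternative using the same groupings as above. The `titles` series is a small hypothetical sample, and `mapping` is a name introduced here for illustration.

```python
import pandas as pd

# Small hypothetical sample of raw titles
titles = pd.Series(['Mlle', 'Ms', 'Mme', 'Lady', 'Col', 'Mr', 'Mrs'])

# Same groupings as in the post, expressed as one mapping
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Miss'}
mapping.update({t: 'Rare_f' for t in ['Dona', 'Lady', 'the Countess']})
mapping.update({t: 'Rare_m' for t in ['Capt', 'Col', 'Don', 'Major', 'Rev',
                                      'Sir', 'Dr', 'Master', 'Jonkheer']})

# Values that are not keys of the mapping (Mr, Mrs) are left unchanged
grouped = titles.replace(mapping)

print(grouped.tolist())
# ['Miss', 'Miss', 'Miss', 'Rare_f', 'Rare_m', 'Mr', 'Mrs']
```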

Interpreting the results

titanic[['title', 'survived']].groupby(['title'], as_index=False).mean()

-> Survival rate: noble women > common women > noble men > common men

Machine Learning

A quick look at the structure

titanic.info()

Converting columns to numbers for ML

Label encoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(titanic['sex'])

titanic['gender'] = le.transform(titanic['sex'])
titanic.head()
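It helps to check which number each label received, since the `gender` value is entered by hand later when making predictions. `LabelEncoder` sorts the classes, so with the two values 'female' and 'male' the codes come out as 0 and 1. A small sketch on a hand-written list:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])

# classes_ is sorted; the position of each class is its integer code
print(list(le.classes_))  # ['female', 'male']
print(list(codes))        # [1, 0, 0, 1]
```

So `gender == 1` means male and `gender == 0` means female.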

Dropping rows with missing values

titanic = titanic[titanic['age'].notnull()]
titanic =  titanic[titanic['fare'].notnull()]
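The two filters above are equivalent to a single `dropna` call restricted to the columns of interest. The sketch below shows this on a made-up frame (`toy` is hypothetical). Note that rows missing only `cabin` are kept, because `cabin` is not listed in `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in age, fare and cabin
toy = pd.DataFrame({
    'age':   [22.0, np.nan, 30.0, 40.0],
    'fare':  [7.25, 8.05, np.nan, 50.0],
    'cabin': [np.nan, 'C85', np.nan, np.nan],
})

# Two-step filtering, as in the post
step = toy[toy['age'].notnull()]
step = step[step['fare'].notnull()]

# One-step equivalent: drop rows where age OR fare is missing
one_call = toy.dropna(subset=['age', 'fare'])

print(list(step.index), list(one_call.index))
```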

titanic.info()

Understanding the features -> splitting the data

titanic.columns
# Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest', 'age_cat','title', 'gender'], dtype='object')

from sklearn.model_selection import train_test_split

X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']]
y = titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
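The split above is purely random. As an optional variation that this post does not use, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 25% positives
X_toy = pd.DataFrame({'feature': range(20)})
y_toy = pd.Series([0] * 15 + [1] * 5)

# stratify=y_toy preserves the 75/25 class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=13, stratify=y_toy)

print(len(X_tr), len(X_te))    # 16 4
print(y_tr.sum(), y_te.sum())  # 4 1
```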

Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth=2, random_state=13)
dt.fit(X_train, y_train)

pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))

# 0.7559808612440191

What are DiCaprio's chances of survival?

import numpy as np

# ['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']
decaprio = np.array([[3, 18, 0, 0, 5, 1]])
print('Decaprio: ', dt.predict_proba(decaprio)[0,1])

# Decaprio:  0.1507537688442211

What are Winslet's chances of survival?

winslet = np.array([[1, 16, 1, 1, 100, 0]])
print('Winslet: ', dt.predict_proba(winslet)[0,1])

# Winslet:  0.9326424870466321
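For a decision tree, `predict_proba` returns the class proportions of the training samples in the leaf that the query lands in. This explains why a shallow tree can only output a handful of distinct probabilities. The sketch below shows this on a tiny made-up dataset with a single `gender` feature (the data is hypothetical, not the Titanic table).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 4 rows with gender=0 (3 survived),
# 5 rows with gender=1 (1 survived)
X = pd.DataFrame({'gender': [0, 0, 0, 0, 1, 1, 1, 1, 1]})
y = [1, 1, 1, 0, 0, 0, 0, 0, 1]

# A depth-1 tree makes one split, giving one leaf per gender value
tree = DecisionTreeClassifier(max_depth=1, random_state=13).fit(X, y)

# Column 1 of predict_proba is the probability of class 1 (survived)
query = pd.DataFrame({'gender': [0, 1]})
proba = tree.predict_proba(query)[:, 1]

print(proba)  # leaf fractions: 3/4 and 1/5
```

Passing a DataFrame with the training column names, as done here, also avoids the feature-name warning that scikit-learn emits when a model fitted on a DataFrame is queried with a bare NumPy array.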
