ML_Titanic

조천룡·2023년 7월 13일

머신러닝

Machine Learning

목록 보기

3/26

타이타닉 정보

데이터 탐색적 분석

데이터 읽기

!pip install plotly_express

import pandas as pd

titanic_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/titanic.xls'
titanic = pd.read_excel(titanic_url)

titanic.head()

생존 상황

생존률 : 38.2%

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(1,2, figsize=(16,8))
titanic['survived'].value_counts().plot.pie(ax=ax[0], autopct='%1.1f%%', shadow=True, explode=[0,0.05])
ax[0].set_title('Pie plot_survived')
ax[0].set_ylabel('')

sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot_survived')

plt.show()

성별에 따른 생존 상황

남성의 생존 가능성이 더 낮다.

f, ax = plt.subplots(1,2, figsize=(16,8))

sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of passengers of sex')
ax[0].set_ylabel('')

sns.countplot(x='sex', data=titanic, hue='survived', ax=ax[1])
ax[1].set_title('Sex : Survived')

plt.show()

경제력 대비 생존률

1등실의 생존 가능성이 아주 높다.
여성의 생존률도 높다.
1등실의 여성 비율이 높나?

pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)

선실 등급별 성별 상황

3등실에는 남성이 많았다. 특히 20대 남성

grid = sns.FacetGrid(titanic, row='pclass', col='sex', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.8, bins=20)
grid.add_legend();

나이별 승객 현황

아이들과 20~30대가 많다.

import plotly_express as px 

fig = px.histogram(titanic, x='age')
fig.show()

연령별로 등실별 생존률 파악

선실 등급이 높으면 생존률이 높은듯 하다.

grid = sns.FacetGrid(titanic, row='pclass', col='survived', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.5, bins=20)
grid.add_legend();

나이별 생존자

나이를 5단계로 분류

titanic['age_cat'] = pd.cut(titanic['age'], bins=[0, 7, 15, 30, 60, 100],
       include_lowest=True,
       labels=['baby','teen','young','adult','old'])
titanic.head()

나이, 성별, 등급별 생존자 수를 한번에 파악

어리고, 여성이고, 1등실일 수록 생존율이 유리했을까?

plt.figure(figsize=(14,6))

plt.subplot(131)
sns.barplot(x='pclass', y='survived', data=titanic)

plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)

plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)

남/여 나이별 생존 상황을 보다 더 자세히 파악

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6))

women = titanic[titanic['sex'] == 'female']
men = titanic[titanic['sex'] == 'male']

ax = sns.histplot(women[women['survived']==1]['age'], bins=20, label='survived', ax=axes[0], kde=False)
ax = sns.histplot(women[women['survived']==0]['age'], bins=40, label='not survived', ax=axes[0], kde=False)
ax.legend(); ax.set_title('Female')

ax = sns.histplot(men[men['survived']==1]['age'], bins=18, label='survived', ax=axes[1], kde=False)
ax = sns.histplot(men[men['survived']==0]['age'], bins=48, label='not survived', ax=axes[1], kde=False)
ax.legend(); ax.set_title('male')

탑승객의 이름에서 신분을 파악

정규식을 이용해서 문장 사이의 신분에 대한 정보를 얻을 수 있다.
사회적 신분만 얻을 수 있다.

import re
title = []
for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    title.append(re.search('\,\s\w+(\s\w+)?\.',tmp).group()[2:-1])
titanic['title'] = title
titanic.head()

성별로 본 귀족

pd.crosstab(titanic['title'], titanic['sex'])

titanic['title'].unique()

사회적 신분 정리

titanic['title'] = titanic['title'].replace('Mlle','Miss')
titanic['title'] = titanic['title'].replace('Ms','Miss')
titanic['title'] = titanic['title'].replace('Mme','Mrs')


Rare_f = ['Dona','Lady','the Countess']
Rare_m = ['Capt','Col','Don','Major','Rev','Sir','Dr','Master','Jonkheer']

for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each,'Rare_f')
    
for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each,'Rare_m')
    
titanic['title'].unique()

titanic[['title','survived']].groupby(['title'], as_index=False).mean()

머신러닝을 이용한 생존자 예측

구조 확인

Label Encode를 이용하여 해당 컬럼을 숫자로 변경

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(titanic['sex'])

# le.classes_

titanic['gender'] = le.transform(titanic['sex'])
titanic.head()

결측치 처리

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
titanic.info()

먼저 특성을 선택하고 데이터를 나누기

특성은 pclass, age, sibsp, parch, fare, gender로 선정

from sklearn.model_selection import train_test_split

X = titanic[['pclass','age','sibsp','parch','fare','gender']]
y = titanic['survived']

X_train , X_test , y_train , y_test = train_test_split(X, y, test_size=0.8, random_state=13)

Desicion Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth=4, random_state=13)
dt.fit(X_train,y_train)

pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))

디카프리오의 생존률

import numpy as np

dicaprio = np.array([[3,18,0,0,5,1]])
print('Dicaprio: ', dt.predict_proba(dicaprio)[0,1])

윈슬릿의 생존률

wislet = np.array([[1,16,1,0,100,0]])
print('Winslet: ', dt.predict_proba(wislet)[0,1])

조천룡

10√2 Data

이전 포스트

ML_Titanic

Machine Learning

타이타닉 정보

데이터 탐색적 분석

데이터 읽기

생존 상황

성별에 따른 생존 상황

경제력 대비 생존률

선실 등급별 성별 상황

나이별 승객 현황

연령별로 등실별 생존률 파악

나이별 생존자

나이, 성별, 등급별 생존자 수를 한번에 파악

남/여 나이별 생존 상황을 보다 더 자세히 파악

탑승객의 이름에서 신분을 파악

성별로 본 귀족

머신러닝을 이용한 생존자 예측

구조 확인

Label Encode를 이용하여 해당 컬럼을 숫자로 변경

결측치 처리

먼저 특성을 선택하고 데이터를 나누기

Desicion Tree

ML_Intro

0개의 댓글

관련 채용 정보