💯[Python EDA 6] EDA for Machine Learning

김미연 · August 23, 2023

[My Own Notes] Python EDA


EDA for Predicting Survivors with Titanic Data

Dataset: Titanic (Kaggle)

  • Uses Jupyter notebook
  • Uses Python

1. Setting

# Import the libraries needed for the analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading the data

# Load the train.csv file
titanic = pd.read_csv('./data/titanic/train.csv')

3. Checking the data

Three things you must always check when doing EDA for machine learning 🌟🌟🌟
1) Check whether there are missing values

titanic[titanic.isnull().any(axis=1)]
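Filtering with `any(axis=1)` lists the incomplete rows, but a per-column count is often easier to scan. A minimal sketch, assuming a small made-up frame named `toy` in place of `titanic` (the values are illustrative, not taken from the Kaggle file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column, largest first.
missing_count = toy.isnull().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column.
missing_ratio = toy.isnull().mean()

print(missing_count)
print(missing_ratio)
```

The same two expressions should work unchanged on `titanic`.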

2) Check whether any column has dtype object

  • If there are any, they have to be either dropped or converted to numbers

# Method 1
titanic.info()

# Method 2
titanic.columns[titanic.dtypes == 'object'] # used a lot 🌟🌟🌟

3) Check the distribution of the target value (the value to predict)

titanic['Survived'].value_counts()
sns.countplot(data=titanic, x='Survived', palette='Set3')
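Raw counts are easier to compare across datasets when expressed as proportions. The sketch below uses a made-up label series (not the real Titanic labels) and also derives the accuracy a model would reach by always predicting the majority class, which can serve as a rough baseline when reading an accuracy score later on.

```python
import pandas as pd

# Made-up 0/1 labels standing in for titanic['Survived'].
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name='Survived')

# Share of each class instead of raw counts.
class_ratio = survived.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = class_ratio.max()

print(class_ratio)
print(majority_baseline)
```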

Besides the three checks above, the following is also looked at frequently.

4) correlation matrix heatmap

sns.heatmap(data=titanic.corr(numeric_only=True), annot=True, fmt='.3f', cmap='Blues_r') # numeric_only=True skips the object columns (required on pandas 2.0+)
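When the heatmap gets crowded, it can help to read off only the target's column of the correlation matrix. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration), sorting features by the absolute value of their correlation with `Survived`:

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass': [3, 1, 2, 3, 1, 3],
    'Fare': [7.25, 71.28, 13.0, 8.05, 53.1, 8.46],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
})

# Correlation matrix over numeric columns only; 'Sex' is left out.
corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first.
target_corr = corr['Survived'].drop('Survived').sort_values(key=abs, ascending=False)

print(target_corr)
```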

4. EDA on missing values

# Analysis of the Cabin column
cond1 = titanic['Cabin'].isnull()
print(len(titanic[cond1])) # 687
print(len(titanic[~cond1])) # 204

Cabin is empty in most rows (687 of 891), so we decide to drop the column itself.
Before dropping it, however, we run one more analysis to see whether some other feature can be extracted from it.

# Rows where the Cabin column is not NaN
cabin = titanic[~cond1]

# Rows where the Cabin column is NaN
cabin_nan = titanic[cond1]

display(cabin.describe()) # display() is available in Jupyter notebook
display(cabin_nan.describe())

Rows that have a Cabin value tend to show a higher survival rate.
We therefore decide to create a new feature, is_cabin (whether the Cabin value is present or missing).
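The comparison between the two groups can also be reduced to a single number per group. The sketch below is one way to do it, on a made-up frame (`toy`, the cabin strings, and the group labels `has_cabin` / `no_cabin` are all assumptions for illustration): the mean of a 0/1 `Survived` column within each group is that group's survival rate.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Titanic frame.
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, np.nan, 'E46', np.nan, 'B20'],
    'Survived': [1, 0, 0, 1, 1, 0],
})

# Label each row by whether Cabin is present.
cabin_status = (
    toy['Cabin'].notnull()
    .map({True: 'has_cabin', False: 'no_cabin'})
    .rename('cabin_status')
)

# Mean of a 0/1 column is the survival rate within each group.
survival_by_cabin = toy.groupby(cabin_status)['Survived'].mean()

print(survival_by_cabin)
```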

5. Preprocessing

1) Preprocessing for missing values

  • Add an 'is_cabin' column (0 or 1)
  • Missing 'Age' values: replace with the mean of Age
  • Drop the rows with a missing 'Embarked' value (2 rows)

2) Drop columns that are not needed for prediction

  • Drop the 'PassengerId', 'Name', 'Ticket', and 'Cabin' columns

3) Preprocessing for columns with dtype object: Ordinal Encoding (conversion to numbers)

  • Ordinal Encoding: maps each distinct string to an integer code (with pd.factorize, in order of first appearance)
# Add the 'is_cabin' column (1 if Cabin is present, 0 if it is missing)
titanic['is_cabin'] = titanic['Cabin'].notnull().astype(int)
# titanic['is_cabin'] = titanic['Cabin'].mask(cond1, 0).mask(~cond1, 1) # same as above

# Replace missing 'Age' values with the mean of Age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())

# Drop columns that are not needed for prediction
titanic = titanic.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

# Drop the rows with a missing 'Embarked' value
titanic = titanic.dropna(subset=['Embarked'])
titanic

# Ordinal Encoding
titanic['Sex'] = pd.factorize(titanic['Sex'])[0]
titanic['Embarked'] = pd.factorize(titanic['Embarked'])[0]
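`pd.factorize` returns two things: the integer codes and the distinct labels they stand for. The preprocessing above keeps only the codes (`[0]`), so it may be worth looking at the second return value once to know which number means which label. A small sketch on a made-up series:

```python
import pandas as pd

# Made-up values standing in for titanic['Embarked'].
embarked = pd.Series(['S', 'C', 'S', 'Q', 'C'])

# codes: one integer per row; uniques: the labels, in order of first appearance.
codes, uniques = pd.factorize(embarked)

# Rebuild the label -> code mapping for later reference.
mapping = {label: code for code, label in enumerate(uniques)}

print(list(codes))
print(mapping)
```

Because codes follow the order of first appearance, the numbers carry no meaningful ranking. Missing values, if any remain, are encoded as -1.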

6. A quick run of Machine Learning

from sklearn.linear_model import LogisticRegression

X = titanic.drop(columns=['Survived']) # define feature vector
y = titanic['Survived'] # define target value

clf = LogisticRegression() # define model
clf.fit(X, y) # fitting(=training)
clf.score(X, y) # Get Accuracy (fraction of correct predictions), measured on the training data itself # 0.80427
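The score above is computed on the same rows the model was trained on, so it says little about unseen passengers. A common alternative is to hold out part of the data. The sketch below illustrates the idea on synthetic data (the random features, the 25% test size, and `max_iter=1000` are all assumptions, not the post's setup); with the real data, `X` and `y` from above would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features and labels, used only to keep the example self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 25% of the rows, keeping the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on rows the model has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out rows

print(train_acc, test_acc)
```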

7. [Extra] Visualizing the data analysis

# Titanic survival analysis results
# NOTE: df_ex3 is not defined in this post. It is assumed to be a Titanic DataFrame that keeps
# the original 'Sex' labels and has an extra 'A or C' (adult or child) column.
# errorbar=None needs seaborn >= 0.12; on older versions use ci=None instead.
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 5), constrained_layout=True)

# Number of dead and survivors
sns.countplot(data=df_ex3, y='Sex', hue='Survived', palette='Set1', ax=ax[0][0])
ax[0][0].legend(title='', labels=['dead', 'survivors'])
ax[0][0].set(xlabel='', ylabel='', title='Number of dead & survivors')

# Survival rate by gender, per Pclass
sns.barplot(data=df_ex3, x='Pclass', y='Survived', hue='Sex', palette='Set2', ax=ax[0][1], errorbar=None)
ax[0][1].legend(title='')
ax[0][1].set(ylabel='', title='Survival rate by gender')

# Ratio of passengers aged 15 or under
sns.barplot(data=df_ex3, x='Pclass', y='Age', palette='RdPu_r', ax=ax[1][0], errorbar=None, estimator=lambda x: (x <= 15).mean())
ax[1][0].set(ylim=(0, 0.3), ylabel='', title='Ratio under 15 years')

# Survival rate of children vs. adults
sns.barplot(data=df_ex3, x='Pclass', y='Survived', palette='Set3', hue='A or C', ax=ax[1][1], errorbar=None)
ax[1][1].set(ylim=(0, 1),  ylabel='', title='Survival rate by Age')
ax[1][1].legend(title='', loc='upper right')
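The plotting code above reads from `df_ex3` and its 'A or C' column, neither of which this post defines. The sketch below is one guess at how such a frame might be built. It assumes 'A or C' means adult or child, reuses the 15-year cutoff from the third plot, and invents the labels `'child'` and `'adult'`; the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; in practice this would be the raw Titanic frame before encoding.
raw = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [38.0, 4.0, 15.0, np.nan],
    'Survived': [1, 1, 0, 0],
})

df_ex3 = raw.copy()

# Hypothetical labeling: 15 or under counts as a child.
# A missing Age compares as False, so those rows end up labeled 'adult' here.
df_ex3['A or C'] = np.where(df_ex3['Age'] <= 15, 'child', 'adult')

print(df_ex3)
```

Rows with a missing Age may deserve separate handling (for example, dropping them before plotting) rather than being counted as adults.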
