[Kaggle] Titanic - Machine Learning from Disaster

ByungJik_Oh · April 16, 2025



💡 Problem

Using the given data, build a model that predicts whether each passenger aboard the Titanic survived (survival).

🔥 Model used for prediction: DNN (Deep Neural Network) - binary classification


📖 Dataset


📒 Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.metrics import classification_report

📚 Raw Data Loading

# Raw Data Loading
df = pd.read_csv('/content/drive/MyDrive/KDT/data/Titanic/train.csv')
display(df)


📚 Feature Selection

df1 = df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], axis=1, inplace=False)

First, I dropped features that are unnecessary or redundant. PassengerId, Name, and Ticket were dropped because I judged them unrelated to survival, and Fare was dropped because it carries much the same meaning as Pclass. Cabin, where more than about 70% of the values are missing, was dropped as well: with a relatively small dataset, arbitrarily imputing that many missing values could distort the data as a whole.
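The two judgments above (Cabin is mostly missing, Fare overlaps with Pclass) can be checked before dropping anything. The following is a minimal sketch on a toy frame with made-up values, not the real train.csv; the names `toy`, `missing_ratio`, and `rho` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare":   [80.0, 71.3, 26.0, 13.0, 7.9, 8.1, 7.3, 7.8],
    "Cabin":  ["C85", None, None, None, None, "G6", None, None],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Pearson correlation between Pclass and Fare; a strongly negative value
# suggests the two columns carry overlapping information.
rho = toy["Pclass"].corr(toy["Fare"])
print(round(rho, 2))
```

Running the same two checks on the real `df` would show whether the 70% figure and the Pclass/Fare overlap hold for the actual data.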


📚 Handling SibSp and Parch

df1['Family'] = df1['SibSp'] + df1['Parch']
df2 = df1.drop(['SibSp', 'Parch'], axis=1, inplace=False)

SibSp holds the number of siblings and spouses aboard, and Parch holds the number of parents and children aboard. Since the two have similar meanings, I added them together into a new column (Family).


📚 Handling Sex and Embarked

df2['Sex'] = np.where(df2['Sex'] == 'female', 0, 1)

embarked_mapping = {'S' : 0, 'C' : 1, 'Q' : 2}
df2['Embarked'] = df2['Embarked'].map(embarked_mapping)

์„ฑ๋ณ„์„ ๋‹ด๊ณ  ์žˆ๋Š” ์ด์ง„ ๋ฐ์ดํ„ฐ Sex ์ปฌ๋Ÿผ์„ ๋ชจ๋ธ์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์—ฌ์ž๋Š” 0, ๋‚จ์ž๋Š” 1๋กœ ๋ณ€ํ™˜ํ•˜์˜€๊ณ , ์Šน๊ฐ๋“ค์ด ์–ด๋””์„œ ํƒ‘์Šนํ•˜์˜€๋Š”์ง€๋ฅผ ๋‹ด๊ณ  ์žˆ๋Š” Embarked ์ปฌ๋Ÿผ ๋˜ํ•œ S(Southampton)๋Š” 0, C(Cherbourg)๋Š” 1, Q(Queenstown)์€ 2๋กœ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋กœ ์ฒ˜๋ฆฌํ•ด์ฃผ์—ˆ๋‹ค.


📚 Handling Missing Values

df2.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 6 columns):
#  #   Column    Non-Null Count  Dtype  
# ---  ------    --------------  -----  
#  0   Survived  891 non-null    int64  
#  1   Pclass    891 non-null    int64  
#  2   Sex       891 non-null    int64  
#  3   Age       714 non-null    float64
#  4   Embarked  889 non-null    float64
#  5   Family    891 non-null    int64  
# dtypes: float64(2), int64(4)
# memory usage: 41.9 KB

The DataFrame processed so far is structured as shown above, and info() shows that the Age and Embarked columns still contain missing values.

df2['Age'] = df2['Age'].fillna(value=df2['Age'].median(), axis=0)

df2['Embarked'] = df2['Embarked'].ffill()

๋”ฐ๋ผ์„œ ๊ฒฐ์ธก์น˜๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ๋งŽ์€ Age์—ด๊ณผ ๊ฐ™์€ ๊ฒฝ์šฐ๋Š” median(์ค‘์•™๊ฐ’)์œผ๋กœ ์ฑ„์›Œ์ฃผ์—ˆ๊ณ , ๊ฒฐ์ธก์น˜๊ฐ€ 2๊ฐœ ๋ฐ–์— ์กด์žฌํ•˜์ง€ ์•Š์€ Embarked์—ด์€ ๊ฐ ๊ฒฐ์ธก์น˜์˜ ์•ž์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™€ ์ฑ„์›Œ์ฃผ์—ˆ๋‹ค.


📚 Handling Outliers

plt.boxplot(df2['Age'].values)
plt.show()

All the other features hold binary or categorical data, whereas Age takes continuous real values, so I checked it for outliers first.

ํ™•์ธ ๊ฒฐ๊ณผ, ๋ช‡๊ฐœ์˜ ์ด์ƒ์น˜๊ฐ€ ๋ฐœ๊ฒฌ๋˜์—ˆ์ง€๋งŒ ๋ชจ๋‘ ์‹ค์กด๊ฐ€๋Šฅํ•œ ๋‚˜์ด๋ผ๊ณ  ํŒ๋‹จ์„ ๋‚ด๋ฆด ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ ์ด์ƒ์น˜๋Š” ๋”ฐ๋กœ ๋Œ€์ฒด ๋˜๋Š” ์‚ญ์ œ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜์ง€ ์•Š์•˜๋‹ค.


📚 Binning

df2.loc[df2['Age'] < 8, 'Age'] = 0
df2.loc[(df2['Age'] >= 8) & (df2['Age'] < 20), 'Age'] = 1
df2.loc[(df2['Age'] >= 20) & (df2['Age'] < 50), 'Age'] = 2
df2.loc[(df2['Age'] >= 50) & (df2['Age'] < 80), 'Age'] = 3
df2.loc[df2['Age'] >= 80, 'Age'] = 4

df2['Age'].value_counts()

์Šน๊ฐ๋“ค์˜ ์ƒ์กด ์—ฌ๋ถ€๋Š” ๋‚˜์ด์— ๋”ฐ๋ผ ์ƒ์กด ํ™•๋ฅ ์ด ๋‹ฌ๋ผ์งˆ ๊ฒƒ์ด๋ผ๊ณ  ํŒ๋‹จํ•˜์˜€๋‹ค. ์ด์— ๋”ฐ๋ผ ์Šน๊ฐ๋“ค์˜ ์—ฐ๋ น๋Œ€์— ๋”ฐ๋ผ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋กœ ๊ตฌ๊ฐ„ํ™” ์ฒ˜๋ฆฌ๋ฅผ ํ•ด์ฃผ์—ˆ๋‹ค.


📚 Normalization

x_data = df2.drop('Survived', axis=1, inplace=False).values
t_data = df2['Survived'].values

scaler = MinMaxScaler()
scaler.fit(x_data)
x_data_norm = scaler.transform(x_data)

๊ฐ feature๋งˆ๋‹ค ๋ฐ์ดํ„ฐ์˜ scale์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ๋…๋ฆฝ๋ณ€์ˆ˜์™€ ์ข…์†๋ณ€์ˆ˜๋ฅผ ๋‚˜๋ˆ„๊ณ  ๋…๋ฆฝ๋ณ€์ˆ˜์— ๋Œ€ํ•ด Min-Max Scaling ์ฒ˜๋ฆฌ๋ฅผ ํ•˜์˜€๋‹ค.


📚 Data Split

x_data_train_norm, x_data_test_norm, t_data_train, t_data_test = \
train_test_split(x_data_norm,
                 t_data,
                 test_size=0.2,
                 stratify=t_data)

๋ชจ๋ธ ํ•™์Šต ํ›„ ๋ชจ๋ธ ๊ฒ€์ฆ์„ ์œ„ํ•ด ํ•™์Šต๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„์–ด์ฃผ์—ˆ๋‹ค.


📚 DNN Model Implementation

model = Sequential()

model.add(Flatten(input_shape=(5,)))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(optimizer=Adam(learning_rate=1e-2),
              loss='binary_crossentropy',
              metrics=['acc'])

es_callback = EarlyStopping(monitor='val_loss',
                            patience=5,
                            restore_best_weights=True,
                            verbose=1)

model.fit(x_data_train_norm,
          t_data_train,
          epochs=1000,
          validation_split=0.2,
          batch_size=100,
          callbacks=[es_callback],
          verbose=1)

Because this is a binary classification (binary logistic) model, I used 'sigmoid' as the activation function of the output layer and 'binary_crossentropy' as the loss function, which gives a very simple DNN model.
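To show what the output activation and the loss compute, here is a minimal NumPy sketch. The logits and labels are made-up values, and the `eps` clipping is an assumption added to avoid log(0); Keras handles this internally in its own way.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(t, y, eps=1e-7):
    """Mean of -[t*log(y) + (1-t)*log(1-y)]; eps guards against log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return float(np.mean(-(t * np.log(y) + (1 - t) * np.log(1 - y))))

logits = np.array([2.0, -1.0, 0.0])  # hypothetical outputs of the last Dense layer
t = np.array([1.0, 0.0, 1.0])        # true labels

probs = sigmoid(logits)
loss = binary_crossentropy(t, probs)

print(probs, loss)
```

The loss is small when the predicted probability is close to the label and grows quickly as a confident prediction turns out wrong, which is why it pairs naturally with a sigmoid output.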


📚 Model Evaluation

result = model.predict(x_data_test_norm)
result = np.where(result >= 0.5, 1, 0).reshape(-1)

print(classification_report(t_data_test, result))
#               precision    recall  f1-score   support

#            0       0.79      0.94      0.85       110
#            1       0.85      0.59      0.70        69

#     accuracy                           0.80       179
#    macro avg       0.82      0.77      0.78       179
# weighted avg       0.81      0.80      0.80       179

๋ชจ๋ธ ํ‰๊ฐ€ ๊ฒฐ๊ณผ F1 Score๊ฐ€ 0.8๋กœ ์ถœ๋ ฅ๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.


📚 Processing the Test Set and the Submission Data

# Test Data Loading
test_df = pd.read_csv('/content/drive/MyDrive/KDT/data/Titanic/test.csv')

# Test Data Preprocessing
test_df1 = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], axis=1, inplace=False)

# Family = SibSp + Parch
test_df1['Family'] = test_df1['SibSp'] + test_df1['Parch']
test_df2 = test_df1.drop(['SibSp', 'Parch'], axis=1, inplace=False)

# Encode Sex
test_df2['Sex'] = np.where(test_df2['Sex'] == 'female', 0, 1)

# Encode Embarked
embarked_mapping = {'S' : 0, 'C' : 1, 'Q' : 2}
test_df2['Embarked'] = test_df2['Embarked'].map(embarked_mapping)

# Handle missing values
test_df2['Age'] = test_df2['Age'].fillna(test_df2['Age'].median(), axis=0)

test_df2['Embarked'] = test_df2['Embarked'].ffill()

# ์ด์ƒ์น˜ ์ฒ˜๋ฆฌ
plt.boxplot(test_df2['Age'].values)
plt.show()

# Age Binning
test_df2.loc[test_df2['Age'] < 8, 'Age'] = 0
test_df2.loc[(test_df2['Age'] >= 8) & (test_df2['Age'] < 20), 'Age'] = 1
test_df2.loc[(test_df2['Age'] >= 20) & (test_df2['Age'] < 50), 'Age'] = 2
test_df2.loc[(test_df2['Age'] >= 50) & (test_df2['Age'] < 80), 'Age'] = 3
test_df2.loc[test_df2['Age'] >= 80, 'Age'] = 4

# Normalization
test_data_norm = scaler.transform(test_df2.values)

# Prediction
test_result = model.predict(test_data_norm)
test_result = np.where(test_result >= 0.5, 1, 0).reshape(-1)

# Submission Data Loading
submission = pd.read_csv('/content/drive/MyDrive/KDT/data/Titanic/gender_submission.csv')

# Fill in the predictions and export
submission['Survived'] = test_result

submission.to_csv('Titanic_DNN.csv', index=False)

I loaded the test data, preprocessed it in exactly the same way as the training data, wrote the predictions into the submission file, and submitted it.
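Since the same steps are written out twice, once for train.csv and once for test.csv, one way to keep them in sync is to wrap them in a single function. The following is a sketch under stated assumptions, not the post's code: `preprocess` and `age_median` are hypothetical names, the frames hold made-up rows, and, unlike the post (which fills test ages with the test set's own median), the training median is passed in for both frames.

```python
import numpy as np
import pandas as pd

def preprocess(raw, age_median):
    """Apply the post's preprocessing steps to a raw Titanic-style frame."""
    out = pd.DataFrame({
        "Pclass": raw["Pclass"],
        "Sex": np.where(raw["Sex"] == "female", 0, 1),
        "Age": raw["Age"].fillna(age_median),
        "Embarked": raw["Embarked"].map({"S": 0, "C": 1, "Q": 2}).ffill(),
        "Family": raw["SibSp"] + raw["Parch"],
    })
    # Same age boundaries as the Binning section, as left-closed intervals.
    bins = [-np.inf, 8, 20, 50, 80, np.inf]
    out["Age"] = pd.cut(out["Age"], bins=bins, right=False, labels=False)
    return out

# Toy rows standing in for train.csv and test.csv (hypothetical values).
train_raw = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, np.nan],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 2],
    "Embarked": ["S", "C", None],
})
test_raw = pd.DataFrame({
    "Pclass": [3], "Sex": ["male"], "Age": [np.nan],
    "SibSp": [0], "Parch": [0], "Embarked": ["Q"],
})

# Assumption: the median is computed on the training rows only.
age_median = train_raw["Age"].median()
train_x = preprocess(train_raw, age_median)
test_x = preprocess(test_raw, age_median)

print(train_x)
print(test_x)
```

Both frames come out with the same five columns in the same order (Pclass, Sex, Age, Embarked, Family), which is what `scaler.transform` and the model's `input_shape=(5,)` rely on.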


✨ Results


💭 Afterthoughts

To practice a range of data preprocessing steps, I took the very simple DNN model I studied recently and practiced building a machine learning model on the best-known dataset of all, the Titanic Data Set. The small amount of data is probably partly to blame, but the results were not good, so I need to think and study more about whether there is a better approach.


🔗 Problem Source

https://www.kaggle.com/competitions/titanic

