[Kaggle] Titanic Data Analysis (Titanic - Machine Learning from Disaster) - 2

KyeongHun Kim · February 6, 2024

Last time, we went through preprocessing the dataset. Now let's choose a model, train it, and make the final submission!

📌 Previous post
[Kaggle] Titanic Data Analysis (Titanic - Machine Learning from Disaster) - 1

✍🏻 Train and Prediction

Dataset Split

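The provided dataset is already split into train and test data, but the test data has no labels, so let's set aside part of the train data as validation data to check model performance.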

from sklearn.model_selection import train_test_split
X_train = train_df.drop("Survived", axis = 1)
y_train = train_df["Survived"]

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 77)

X_train.shape, X_val.shape, y_train.shape, y_val.shape
>>>
((712, 6), (179, 6), (712,), (179,))
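As a side note, a plain random split can leave the train and validation sets with slightly different survival ratios. Below is a minimal sketch of a stratified split, assuming synthetic stand-in data (`X_demo`, `y_demo` are made up for illustration and are not the real Titanic columns). The `stratify` argument of `train_test_split` keeps the class ratio nearly the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set: 891 rows, 6 features,
# and roughly 38% positive labels (values chosen only for illustration).
rng = np.random.default_rng(77)
X_demo = rng.normal(size=(891, 6))
y_demo = (rng.random(891) < 0.38).astype(int)

# stratify=y_demo keeps the class ratio almost identical in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=77, stratify=y_demo
)

print(X_tr.shape, X_va.shape)
print(f"positive rate  train: {y_tr.mean():.3f}  valid: {y_va.mean():.3f}")
```

With the real data, this would amount to passing `stratify = y_train` in the call above; whether it changes the final score is not something this sketch can tell.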

Choosing a Model

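The target column "Survived" takes only the values 0 and 1, so this is a binary classification problem. With that in mind, let's pick tree-based models and a Naive Bayes model, both of which handle binary classification, and train them.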

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

Model Training and Prediction Results

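Let's train the selected models on the preprocessed data and get their predictions. Accuracy is a commonly used metric for classification problems, so we use it to evaluate performance here.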
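For reference, accuracy is simply the number of correct predictions divided by the number of samples. Here is a small worked example, assuming eight hypothetical passengers (the labels and predictions are made up), that computes it by hand and compares it with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions for eight passengers (illustrative only).
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Accuracy = number of correct predictions / number of samples.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

print(manual_accuracy)                 # 6 correct out of 8 -> 0.75
print(accuracy_score(y_true, y_pred))  # same value from scikit-learn
```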

# RandomForest Prediction
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_val)
print(f'{accuracy_score(y_val, rf_pred):.2f}')
>>>
0.85

# Decision Tree Prediction
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_val)
print(f'{accuracy_score(y_val, dt_pred):.2f}')
>>>
0.84

# Gaussian NB Prediction
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_pred = nb_clf.predict(X_val)
print(f'{accuracy_score(y_val, nb_pred):.2f}')
>>>
0.75
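A single validation split of 179 rows is fairly small, so a gap like 0.85 vs 0.84 may just be noise. One way to get a steadier comparison is k-fold cross-validation. The following is a rough sketch on synthetic data from `make_classification` (an assumption for illustration; in practice you would pass the real features and labels instead).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=77
)

models = {
    "RandomForest": RandomForestClassifier(random_state=77),
    "DecisionTree": DecisionTreeClassifier(random_state=77),
    "GaussianNB": GaussianNB(),
}

# Evaluate every model on the same 5 folds and report mean and spread.
cv_scores = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name:>12}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Fixing `random_state` on the tree-based models also makes the numbers reproducible from run to run.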

Predicting without tuning any hyperparameters, the RandomForest model gave the best result. Let's use GridSearchCV to look for the best parameters.

param_rf = {"n_estimators" : [50, 100, 200, 300], "max_depth": [10, 20, 30, 50]}
rf_clf = RandomForestClassifier()
grs_rf = GridSearchCV(rf_clf, param_grid = param_rf, scoring = 'accuracy', cv = 2)
grs_rf.fit(X_train, y_train)
print(grs_rf.best_estimator_)
>>>
RandomForestClassifier(max_depth=10, n_estimators=200)
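One detail that is easy to miss: `GridSearchCV` fits clones of the estimator you pass in, so the original object stays unfitted, and the tuned model lives in `best_estimator_` (available because `refit=True` is the default). Below is a minimal sketch on synthetic data; `base_clf`, `search`, and the small grid are names and values assumed for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=77
)

base_clf = RandomForestClassifier(random_state=77)
search = GridSearchCV(
    base_clf,
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"{search.best_score_:.3f}")  # mean cross-validated accuracy

# The search fitted clones, so base_clf itself was never fitted.
try:
    check_is_fitted(base_clf)
    base_is_fitted = True
except NotFittedError:
    base_is_fitted = False
print("base_clf fitted:", base_is_fitted)

# The refitted best model is the one to use for prediction.
best_clf = search.best_estimator_
demo_pred = best_clf.predict(X_demo[:5])
```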

Submitting the Results

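import pandas as pd

test = test_df.copy()
# GridSearchCV fits clones of rf_clf, so rf_clf itself is still unfitted here;
# predict with the refitted best estimator instead.
test_pred = grs_rf.best_estimator_.predict(test)

result = pd.DataFrame({"PassengerId": test_id, "Survived": test_pred})
result.to_csv("/content/drive/MyDrive/kaggle/titanic/submission.csv", index = False)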
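Before uploading, it can help to sanity-check the submission frame. The sketch below uses a hypothetical helper, `check_submission` (not part of pandas or Kaggle's tooling), and a tiny made-up frame standing in for `result`; for the real file, the expected row count would be the length of the test set.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df["PassengerId"].is_unique, "duplicate PassengerId"
    assert df["Survived"].isin([0, 1]).all(), "Survived must be 0 or 1"

# Tiny hypothetical frame, standing in for `result` above.
demo_result = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(demo_result, expected_rows=3)
print("submission looks OK")
```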

Submitting this result gave a Score of 0.76555, which ranked 12,200th. Let's look for things that can be improved.

✍🏻 Model Reselection and Result Submission

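There are several ways to improve a model, such as taking another close look at the data or trying a different model. Looking through the competition Leaderboard, I found that other people were using gradient boosting models, which build an ensemble of trees by applying the idea of gradient descent to the loss.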

params = {"n_estimators" : [50, 100, 200, 300], "max_depth": [10, 20, 30, 50], "learning_rate": [0.001, 0.003, 0.005, 0.01]}
model = GradientBoostingClassifier()
grid_model = GridSearchCV(model, param_grid = params, scoring = 'accuracy')
grid_model.fit(X_train, y_train)

print(grid_model.best_score_)
print(grid_model.best_params_)
>>>
0.8019994090416626
{'learning_rate': 0.003, 'max_depth': 10, 'n_estimators': 200}
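Note that `best_score_` is the mean cross-validated accuracy on the training folds, so it is not directly comparable to the single-split validation accuracies from earlier. To compare on equal footing, the tuned model can be scored on the held-out validation set. Here is a rough sketch on synthetic data; the data, the deliberately small grid `small_grid`, and the variable names are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=77
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=77
)

# A deliberately small grid so the sketch runs quickly.
small_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=77),
    param_grid=small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_tr, y_tr)

# search.predict delegates to the refitted best estimator.
val_acc = accuracy_score(y_va, search.predict(X_va))
print(f"mean CV accuracy   : {search.best_score_:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

With the real data, the same check would be `accuracy_score(y_val, grid_model.predict(X_val))`.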