Today
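Lecture
Study notes
- Reviewed the machine learning lecture: went back over the Kaggle Titanic survivor prediction code using scikit-learn and typed it out while following along
- Took the evaluation part of the course (recall, accuracy)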
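To remind myself why the evaluation part separates recall from accuracy, here is a minimal hand-written sketch. The labels are made-up toy data, not Titanic results, and the two helper functions are my own illustration rather than scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical toy labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # a "model" that always predicts 0

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(acc, rec)
```

A model that always predicts the majority class still gets a decent accuracy here while finding none of the positives, which is why accuracy alone can mislead on imbalanced labels.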
Result
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic_df = pd.read_csv('./titanic_train.csv')
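# [Data preprocessing]
# Fill missing values by assigning the result back to each column
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('N')
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('N')
# Keep only the first character of Cabin (the deck letter)
titanic_df['Cabin'] = titanic_df['Cabin'].str[:1]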
# [Encoding]
from sklearn import preprocessing
features = ['Cabin', 'Sex', 'Embarked']
# Convert each categorical string column to integer codes
for feature in features:
    le = preprocessing.LabelEncoder()
    le = le.fit(titanic_df[feature])
    titanic_df[feature] = le.transform(titanic_df[feature])
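As a note to self on what the encoding step does: the sketch below mimics label encoding by hand, assuming codes are assigned in sorted order of the distinct values (which is how LabelEncoder orders its classes). The function name `label_encode` and the sample values are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Hypothetical Embarked-like column, with 'N' standing in for missing values.
codes, classes = label_encode(['S', 'C', 'Q', 'S', 'N'])
print(classes)
print(codes)
```

The codes are arbitrary integers, so they imply an ordering that the categories do not really have. Tree-based models usually tolerate this, but it can distort linear models such as logistic regression.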
# Additional preprocessing of the independent variables (features)
x_titanic_df = titanic_df.drop('Survived', axis=1)
x_titanic_df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
# Extract the survival labels (target); the Survived column is unchanged
# by the preprocessing above, so the CSV does not need to be read again
y_titanic_df = titanic_df['Survived']
# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_titanic_df, y_titanic_df, test_size = 0.2)
# [Apply machine learning algorithms]
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Set up the models
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression()
# Train the models and predict
dt.fit(x_train, y_train)
dt_predict = dt.predict(x_test)
rf.fit(x_train, y_train)
rf_predict = rf.predict(x_test)
lr.fit(x_train, y_train)
lr_predict = lr.predict(x_test)
# Check accuracy
predict = [dt_predict, rf_predict, lr_predict]
scores = []
for i in predict:
    score = accuracy_score(y_test, i)
    scores.append(score)
# Mean accuracy across the three models
np.mean(scores)
# [Cross-validation]
# cross_val_score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(dt, x_titanic_df, y_titanic_df, cv=5)
# Index of the fold with the highest accuracy
np.argmax(scores)
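To picture what cv = 5 means, here is a minimal sketch of how k-fold splitting partitions the sample indices. It is plain, unshuffled k-fold written by hand; for a classifier with an integer cv, scikit-learn uses stratified folds instead, so treat this only as an illustration of the idea. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test fold exactly once, so the five scores together cover the whole dataset.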
# GridSearchCV
from sklearn.model_selection import GridSearchCV
params = {'max_depth': [2, 3, 5, 10, 15, 30],
          'min_samples_split': [2, 3, 5, 8],
          'min_samples_leaf': [1, 5, 8, 10]}
grid_dt = GridSearchCV(dt, param_grid=params, scoring='accuracy', cv=5)
# Time the fit itself; constructing GridSearchCV does no work
%time grid_dt.fit(x_train, y_train)
grid_dt.best_params_
best_estimator = grid_dt.best_estimator_
grid_dt.best_score_
# Predict with the best estimator and check accuracy
predict = best_estimator.predict(x_test)
accuracy_score(y_test, predict)
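For a sense of how much work the grid search above does, this small sketch enumerates the parameter combinations by hand. It only counts candidates and does not train anything; the variable names `candidates`, `n_candidates`, and `n_cv_fits` are my own.

```python
from itertools import product

# Same grid as in the GridSearchCV call above.
params = {
    'max_depth': [2, 3, 5, 10, 15, 30],
    'min_samples_split': [2, 3, 5, 8],
    'min_samples_leaf': [1, 5, 8, 10],
}
cv = 5

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(params.keys(), values)) for values in product(*params.values())]

n_candidates = len(candidates)   # 6 * 4 * 4 combinations
n_cv_fits = n_candidates * cv    # each candidate is fit once per fold
print(n_candidates, n_cv_fits)
print(candidates[0])
```

With the default refit=True, GridSearchCV then fits the best combination once more on the full training set, which is what best_estimator_ returns.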
Tomorrow
Summary
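- I have only used TensorFlow very superficially, but I got the impression that scikit-learn might be the lower-level library of the two.
- I will study only as far as regression and classification, then spend my time reviewing the earlier lectures and building a portfolio of personal projects.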