[머신러닝] 와인데이터분석

쩡이·2023년 9월 20일

ML

목록 보기
4/14

와인 데이터 분석

데이터 불러오기

import pandas as pd

red_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine, white_wine 합치기
(단, red, white 구분해주는 컬럼 추가 red는 1, white는 0)

red_wine['color'] = 1
white_wine['color'] = 0

wine = pd.concat([red_wine, white_wine])

quality컬럼은 어떻게 구성돼있나?

wine['quality'].unique()

출력 : 
array([5, 6, 7, 4, 8, 3, 9], dtype=int64)

판다스에서 특정 컬럼 값의 개수 세는 법

wine['quality'].value_counts()

등급 histogram

import plotly.express as px

fig = px.histogram(wine, x='quality')
fig.show()

레드/화이트 와인별 등급 histogram

fig = px.histogram(wine, x='quality', color='color')
fig.show()

레드와인과 화이트와인 분류

Decision Tree를 이용하여 레드와인인지 화이트와인인지 분류해보자.

X = wine.drop(['color'], axis=1)
y = wine['color'] #label 데이터

#train/test용 나누기
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=13)
#잘 나눠졌는지 확인, 개수 확인
np.unique(y_train, return_counts=True)

(array([0, 1], dtype=int64), array([3913, 1284], dtype=int64))

train/test용 데이터가 와인 종류에 따라 어느 정도 구분되었을지 알아보자

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(x=X_train['quality'], name='Train'))
fig.add_trace(go.Histogram(x=X_test['quality'], name='Test'))

fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.7)
fig.show()

#결정나무 훈련
from sklearn.tree import DecisionTreeClassifier

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train,y_train)

#accuracy 확인
from sklearn.metrics import accuracy_score

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

accuracy_score(y_train, y_pred_tr)
출력 : 0.9553588608812776
accuracy_score(y_test, y_pred_test)
출력 : 0.9569230769230769

train/test용 accuracy가 크게 차이가 나지 않는 것으로 보아 성능은 유사하다

와인 데이터의 몇 개 항목의 boxplot을 그려보자

fig = go.Figure()
fig.add_trace(go.Box(y=X['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=X['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=X['quality'], name='quality'))
fig.show()

컬럼들간의 범위 격차가 심한 경우 MinMaxScaler와 StandardScaler를 사용해본다.

def px_box(target_df):
    fig = go.Figure()
    fig.add_trace(go.Box(y=target_df['fixed acidity'], name='fixed acidity'))
    fig.add_trace(go.Box(y=target_df['chlorides'], name='chlorides'))
    fig.add_trace(go.Box(y=target_df['quality'], name='quality'))
    fig.show()

MinMaxScaler

px_box(X_mms_pd)

StandardScaler

px_box(X_ss_pd)

결정나무에서는 이런 전처리는 의미를 가지지 않는다..
주로 cost function을 최적화할 때 유효할 때가 있다
MinMaxScaler와 StandardScaler 중 어떤 것이 좋을 지는 해봐야 알 수 있다.

MinMaxScaler를 적용해서 학습

X_train, X_test, y_train, y_test = train_test_split(X_mms_pd, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

Train Acc :  0.9553588608812776
test Acc :  0.9569230769230769

StandardScaler를 적용해서 학습

X_train, X_test, y_train, y_test = train_test_split(X_ss_pd, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

Train Acc :  0.9553588608812776
test Acc :  0.9569230769230769

레드/화이트 와인을 구분하는 중요한 특성은 무엇일지 알아보자
중요도는 max_dept에 따라 달라진다.

dict(zip(X_train.columns, wine_tree.feature_importances_))

와인 맛의 분류

quality가 여러단계로 나눠져있어서 간단하게 quality 컬럼을 이진화해보자

wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]

레드/화이트 와인 분류와 동일하게 데이터를 나누고, fit 하고
accuracy를 알아보자

X = wine.drop(['taste'], axis=1)
y = wine['taste']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

결과 train, test 모두 1이 나왔다.
100%가 나왔다는 것은 뭔가 잘못된 것이라는 것이다.
어떻게 구분된 건지 알아보자

import matplotlib.pyplot as plt
import sklearn.tree as tree

plt.figure(figsize=(12,8))
tree.plot_tree(wine_tree, feature_names=X.columns )


taste컬럼은 지우고 이 데이터의 원본인 quality는 그대로 두었다. quality도 삭제해서 위 과정을 다시 해보자

#위 과정을 다시
X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

Train Acc : 0.7294593034442948
Test Acc : 0.7161538461538461

어떤 특성이 와인의 맛을 결정할까?

import matplotlib.pyplot as plt
import sklearn.tree as tree

plt.figure(figsize=(12,8))
tree.plot_tree(wine_tree, feature_names=X.columns, rounded=True, filled=True)
plt.show()

알코올이 꽤 중요해보인다.

Pipeline

앞서 iris, wine 데이터를 공부해보면 혼돈이 크고, 데이터 전처리와 여러 알고리즘의 반복 실행, 하이퍼 파라미터의 튜닝 과정을 번갈아 하다 보면 코드의 실행 순서에 혼돈이 올 수 있다.
그래서 Pipeline을 사용해보자

앞선 wine 데이터를 이용하면

import pandas as pd

red_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1
white_wine['color'] = 0

wine = pd.concat([red_wine, white_wine])

X = wine.drop(['color'], axis=1)
y = wine['color'] 

여기서 test_train_split은 Pipeline 내부가 아니다

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

estimators = [
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier())
]

pipe = Pipeline(estimators)

#set_params
pipe.set_params(clf__max_depth=2)
pipe.set_params(clf__random_state=13)

#train/test용 나누기
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=13,stratify=y)

pipe.fit(X_train,y_train)
#원래는 scaler를 통과시키고 clf를 학습시키는 과정을 거쳐야함
#pipe를 만들어두어서 이를 fit하기만 하면 알아서 scaler 통과후 clf 학습을 함

이후 accuracy 확인

from sklearn.metrics import accuracy_score

y_pred_tr = pipe.predict(X_train)
y_pred_test = pipe.predict(X_test)

print('Train Acc : ', accuracy_score(y_train,y_pred_tr))
print('Test Acc : ', accuracy_score(y_test,y_pred_test))

Train Acc :  0.9657494708485664
Test Acc :  0.9576923076923077

하이퍼파라미터 튜닝

교차검증

나에게 주어진 데이터에 적용한 모델의 성능을 정확히 표현하는데 유용하다

import numpy as np
from sklearn.model_selection import KFold

X = np.array([
    [1,2], [3,4], [1,2], [3,4]
])
y = np.array([1,2,3,4])
kf = KFold(n_splits=2) #몇등분할지

for train_idx, test_idx in kf.split(X):
    print('Train idx : ', train_idx)
    print('Test idx : ', test_idx)
    
    
Train idx :  [2 3]
Test idx :  [0 1]
Train idx :  [0 1]
Test idx :  [2 3]
for train_idx, test_idx in kf.split(X):
    print('---idx')
    print(train_idx, test_idx)
    print('--- train data')
    print(X[train_idx])
    print('---validation data')
    print(X[test_idx])


---idx
[2 3] [0 1]
--- train data
[[1 2]
 [3 4]]
---validation data
[[1 2]
 [3 4]]
---idx
[0 1] [2 3]
--- train data
[[1 2]
 [3 4]]
---validation data
[[1 2]
 [3 4]]

다시 와인데이터에서

import pandas as pd

red_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1
white_wine['color'] = 0

wine = pd.concat([red_wine, white_wine])

wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]

X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('test Acc : ', accuracy_score(y_test, y_pred_test))

Train Acc :  0.7294593034442948
test Acc :  0.7161538461538461

가장 기본적인 형태로 교차검증 해보자

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

for train_idx, test_idx in kfold.split(X):
    print(len(train_idx), len(test_idx))

각각의 fold에 대한 학습 후 accuracy 확인

cv_accuracy = []

for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_test)
    cv_accuracy.append(accuracy_score(y_test,pred))

cv_accuracy

각 acc의 분산이 크지않다면 평균을 대표값으로

np.mean(cv_accuracy)

0.709578255462782

StratifiedKFold을 이용하는 방법

from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cv_accuracy = []

for train_idx, test_idx in skfold.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_test)
    cv_accuracy.append(accuracy_score(y_test,pred))

다음은 가장 많이 사용하는 방법으로 보다 간편한 교차검증 방법인 cross_val_score를 알아보자

from sklearn.model_selection import cross_val_score

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cross_val_score(wine_tree_cv, X, y, scoring=None, cv=skfold)

array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595])
#값을 변경해야할때 일일이 찾아서 변경하기 어려우니 함수를 사용하면 간단함
def skfold_dt(depth):
    from sklearn.model_selection import cross_val_score

    skfold = StratifiedKFold(n_splits=5)
    wine_tree_cv = DecisionTreeClassifier(max_depth=depth, random_state=13)

    print(cross_val_score(wine_tree_cv, X, y, scoring=None, cv=skfold))
    
skfold_dt(3)
[0.56846154 0.68846154 0.71439569 0.73210162 0.75673595]

train score와 함께 보고 싶다면

from sklearn.model_selection import cross_validate
cross_validate(wine_tree_cv, X, y, scoring=None, cv=skfold, return_train_score=True)

{'fit_time': array([0.00653815, 0.00642395, 0.01053619, 0.00875425, 0.0085156 ]),
 'score_time': array([0.00299287, 0.00301027, 0.00151706, 0.00356221, 0.00328612]),
 'test_score': array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595]),
 'train_score': array([0.74773908, 0.74696941, 0.74317045, 0.73509042, 0.73258946])}
#앞쪽 값에서는 과적합 현상도 보인다.

하이퍼파라미터 튜닝

모델의 성능을 확보하기 위해 조절하는 설정 값

feature engineering
특성을 관찰하고 특성에서 모델이 보다 잘 학습결과를 이끌어낼 수 있도록 특성을 찾거나 바꾸는 작업

max_depth 를 바꿔가면서 튜닝

GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {'max_depth' : [2, 4, 7, 10]}

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
gridsearch = GridSearchCV(estimator=wine_tree, param_grid=params, cv=5)
gridsearch.fit(X,y)
#cv : cross validation

GridSearchCV 결과

import pprint

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(gridsearch.cv_results_)


{   'mean_fit_time': array([0.00710855, 0.01062279, 0.01638565, 0.02294984]),
    'mean_score_time': array([0.00209279, 0.00160728, 0.00210023, 0.00183864]),
    'mean_test_score': array([0.6888005 , 0.66356523, 0.65340854, 0.64401587]),
    'param_max_depth': masked_array(data=[2, 4, 7, 10],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object),
    'params': [   {'max_depth': 2},
                  {'max_depth': 4},
                  {'max_depth': 7},
                  {'max_depth': 10}],
    'rank_test_score': array([1, 2, 3, 4]),
    'split0_test_score': array([0.55230769, 0.51230769, 0.50846154, 0.51615385]),
    'split1_test_score': array([0.68846154, 0.63153846, 0.60307692, 0.60076923]),
    'split2_test_score': array([0.71439569, 0.72363356, 0.68360277, 0.66743649]),
    'split3_test_score': array([0.73210162, 0.73210162, 0.73672055, 0.71054657]),
    'split4_test_score': array([0.75673595, 0.7182448 , 0.73518091, 0.72517321]),
    'std_fit_time': array([0.00162137, 0.0007936 , 0.00069895, 0.00153315]),
    'std_score_time': array([0.0001181 , 0.00048919, 0.00064841, 0.00047341]),
    'std_test_score': array([0.07179934, 0.08390453, 0.08727223, 0.07717557])}

bestestimator : 최고의 성능을 가진 모델

만약 pipeline을 적용한 모델에 GridSearch를 적용하고 싶다면

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

estimators = [('scaler', StandardScaler()),
              ('clf', DecisionTreeClassifier(random_state=13))]

pipe = Pipeline(estimators)

param_grid = [{'clf__max_depth' : [2,4,7,10]}]

GridSearch = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
GridSearch.fit(X,y)

위에서 구한 성능결과를 표로 정리해보면

import pandas as pd

score_df = pd.DataFrame(GridSearch.cv_results_)
score_df[['params', 'rank_test_score', 'mean_test_score', 'std_test_score']]

0개의 댓글