ML) Logistic Regression


0. Logistic Regression

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

  • the probability that a given event occurs
  • the probability that a sample belongs to a given class
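
✅ As a minimal sketch of the idea (the score below is a made-up number): logistic regression computes a linear score and passes it through the logistic (sigmoid) function to get a probability between 0 and 1.

import numpy as np

def sigmoid(z):
    # logistic function: maps any real-valued score to (0, 1)
    return 1 / (1 + np.exp(-z))

z = 2.5                 # hypothetical linear score w·x + b
print(sigmoid(z))       # 0.924..., read as P(class 1)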

1. Setup

1) Imports and data

import numpy as np
import pandas as pd

# dataset -> wine
from sklearn.datasets import load_wine

# model - logistic regression
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
wine = load_wine()
print(wine['DESCR'])

✅Number of Instances: 178
✅Number of Attributes: 13 numeric, predictive attributes and the class
✅Attribute Information: (feature)
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
✅class: class_0 / class_1 / class_2
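
A quick optional way to eyeball the 13 features and the class balance, using the objects loaded above (sketch, not part of the original flow):

df = pd.DataFrame(wine['data'], columns=wine['feature_names'])
df['target'] = wine['target']
print(df.head())
print(df['target'].value_counts())   # class balance: 59 / 71 / 48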

2) train / test set

X = wine['data']
y = wine['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
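
Optionally, stratify=y keeps the class proportions the same in the train and test splits, which matters for small datasets like this one; a sketch of that variant (the _s names are just placeholders):

# alternative split that preserves the 59/71/48 class ratio in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)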

2. KNN practice (for comparison)


from sklearn.neighbors import KNeighborsClassifier
knn = Pipeline([('scaler', StandardScaler()),
                ('knn', KNeighborsClassifier(n_neighbors=5))])

knn.fit(X_train, y_train)
print('train', knn.score(X_train, y_train))
print('test', knn.score(X_test, y_test))
  • train - 0.9774436090225563
  • test - 0.9555555555555556
proba = knn.predict_proba(X_test)

print(proba[:5])

# output
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
  • The first case is assigned to class 0, the third case to class 2, and so on.
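
The predicted class is simply the column with the highest probability; a quick sanity check (sketch):

# argmax over the probability columns reproduces knn.predict
print(np.argmax(proba[:5], axis=1))   # [0 0 2 0 1]
print(knn.predict(X_test[:5]))        # same labels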

3. Logistic regression 1 - binary classification

1) Select only two classes for binary classification

select_index = (y_train == 0) | (y_train == 1)
X_train_select = X_train[select_index]
y_train_select = y_train[select_index]

2) logistic regression

a. Model training


lr_pipe = Pipeline([('scaler', StandardScaler()),
                    ('lr', LogisticRegression(C=20, max_iter=1000))])

lr_pipe.fit(X_train_select, y_train_select)
lr_pipe.score(X_train_select, y_train_select)

b. Model inspection

(1) Checking the coefficients and intercept

## a Pipeline stores each step by name (dict style), so access the regression step with ['lr']
print(lr_pipe['lr'].coef_, lr_pipe['lr'].intercept_)
  • coefficient
    [[-3.05594772 -0.61682265 -2.17362749 2.61797599 -0.28516922 -0.61354747
    0.05602002 -0.19137026 0.2360696 -1.40516461 0.73530275 -1.01007836
    -3.12745238]]
  • intercept
    [0.02149047]
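
The decision score is just the scaled features dotted with the coefficients plus the intercept; this sketch reproduces decision_function by hand:

# z = X_scaled · w + b, computed manually from the fitted pipeline
X_scaled = lr_pipe['scaler'].transform(X_train_select[:10])
z = X_scaled @ lr_pipe['lr'].coef_[0] + lr_pipe['lr'].intercept_[0]
print(np.round(z, 3))   # matches lr_pipe.decision_function(X_train_select[:10])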

(2) decision function

decisions = lr_pipe.decision_function(X_train_select[:10])
decisions
  • Output
    [ -8.77430986, 9.09262204, 7.80654027, -10.8719392 ,
    8.55098343, -11.29391432, -11.04927381, 6.64744888,
    5.44921957, -7.9400244 ]

✅ These are the raw scores before the sigmoid (logistic) function is applied.
The sigmoid converts these scores into probabilities.

(3) probability

  • Result of applying the logistic function
from scipy.special import expit

print(np.round(expit(decisions), 3))

[0. 1. 1. 0. 1. 0. 0. 0.999 0.996 0. ]
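
expit is just the logistic function; a hand-rolled version gives the same values (sketch):

# expit(z) == 1 / (1 + exp(-z))
manual = 1 / (1 + np.exp(-decisions))
print(np.allclose(manual, expit(decisions)))   # True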


  • probability
np.round(lr_pipe.predict_proba(X_train_select[:10]), 3)

[1.   , 0.   ],
[0.   , 1.   ],
[0.   , 1.   ],
[1.   , 0.   ],
[0.   , 1.   ],
[1.   , 0.   ],
[1.   , 0.   ],
[0.001, 0.999],
[0.004, 0.996],
[1.   , 0.   ]

✅ The probability of each case being assigned to class 0 or class 1


(4) predict
lr_pipe.predict(X_train_select[0:10])

[0, 1, 1, 0, 1, 0, 0, 1, 1, 0]

✅ The predicted class labels
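
In the binary case, predict is equivalent to thresholding the decision score at 0 (i.e. probability 0.5); a quick check (sketch):

# score > 0  <=>  P(class 1) > 0.5  <=>  predicted class 1
print((decisions > 0).astype(int))   # [0 1 1 0 1 0 0 1 1 0], same as predict above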

4. Logistic regression 2 - multiclass classification

1) logistic regression

a. Model training

lr_pipe.fit(X_train, y_train)

print('train', lr_pipe.score(X_train, y_train))
print('test', lr_pipe.score(X_test, y_test))

train -> 1.0
test -> 0.9777777777777777

b. decision function

decision = lr_pipe.decision_function(X_test[:5])
print(np.round(decision, 2))

 [ 7.6  -4.46 -3.14]
 [ 7.58 -7.   -0.57]
 [-2.26 -2.6   4.85]
 [ 6.82 -2.52 -4.3 ]
 [-2.55  8.22 -5.67]

✅ These are the raw scores before conversion to probabilities.
In the multiclass case they are converted with the softmax function, not the sigmoid.

c. probability

(1) predict_proba

proba1 = lr_pipe.predict_proba(X_train[0:5])
print(np.round(proba1, 3))

 [1.    0.    0.   ]
 [0.    1.    0.   ]
 [0.    1.    0.   ]
 [0.    0.004 0.996]
 [1.    0.    0.   ]

(2) The softmax function

✅ The softmax function is used in multinomial (multiclass) logistic regression.

  • It takes the decision score of every class and converts the scores into the probability of the sample belonging to each class among all classes.

  • The result is a probability distribution over the classes: each row sums to 1 (a hand-rolled version is sketched below).
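
A numerically stable softmax for reference (sketch; subtracting the row max avoids overflow and does not change the result):

def softmax_manual(z):
    # exponentiate the shifted scores, then normalize each row to sum to 1
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# gives the same values as scipy.special.softmax(z, axis=1)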


from scipy.special import softmax

# recompute decision scores for the same training rows used by predict_proba above,
# then apply softmax row-wise
decision_train = lr_pipe.decision_function(X_train[:5])
proba = softmax(decision_train, axis=1)
print(np.round(proba, decimals=3))


 [1.    0.    0.   ]
 [0.    1.    0.   ]
 [0.    1.    0.   ]
 [0.    0.004 0.996]
 [1.    0.    0.   ]

d. coefficient / intercept


print(lr_pipe['lr'].coef_, lr_pipe['lr'].intercept_)
  • coefficient - after the 3-class fit, coef_ has shape (3, 13), one row of 13 coefficients per class; one row shown:
    [-3.05594772 -0.61682265 -2.17362749 2.61797599 -0.28516922 -0.61354747
    0.05602002 -0.19137026 0.2360696 -1.40516461 0.73530275 -1.01007836
    -3.12745238]
  • intercept - intercept_ has one entry per class; one shown:
    [0.02149047]
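
Since coef_ now has one row per class, a labeled DataFrame makes it easier to read (sketch):

coef_df = pd.DataFrame(lr_pipe['lr'].coef_,
                       columns=wine['feature_names'],
                       index=wine['target_names'])   # class_0 / class_1 / class_2
print(coef_df.round(3))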

✅ Sorting the coefficients (coef_[0]) from largest to smallest

import pandas as pd

pd.Series(lr_pipe['lr'].coef_[0], wine['feature_names']).sort_values(ascending=False)
# output
alcalinity_of_ash               2.617976
hue                             0.735303
proanthocyanins                 0.236070
flavanoids                      0.056020
nonflavanoid_phenols           -0.191370
magnesium                      -0.285169
total_phenols                  -0.613547
malic_acid                     -0.616823
od280/od315_of_diluted_wines   -1.010078
color_intensity                -1.405165
ash                            -2.173627
alcohol                        -3.055948
proline                        -3.127452

e. Model tuning and retraining

(1) Finding the best hyperparameters


from sklearn.model_selection import GridSearchCV

lr_pipe3 = Pipeline([('scaler', StandardScaler()),
               ('lr', LogisticRegression(max_iter= 1000))])

params={'lr__C': [0.001, 0.01, 0.1, 1, 10, 100]}


gs = GridSearchCV(lr_pipe3,
                  param_grid=params,
                  cv=5,        # number of CV folds (the default is 5 in recent scikit-learn)
                  n_jobs=-1)   # use all CPU cores


gs.fit(X_train, y_train)
             
print('best params', gs.best_params_)
print('best score', gs.best_score_)
print('estimator', gs.best_estimator_)

best params -> {'lr__C': 10}
best score -> 0.9849002849002849
estimator -> Pipeline(steps=[('scaler', StandardScaler()), ('lr', LogisticRegression(C=10, max_iter=1000))])
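
To see how accuracy varied across the C grid, cv_results_ holds the per-candidate fold averages (sketch):

results = pd.DataFrame(gs.cv_results_)
print(results[['param_lr__C', 'mean_test_score', 'std_test_score']])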

(2) Final model


lr_final = gs.best_estimator_
# note: with refit=True (the default), best_estimator_ is already refit on the
# whole training set, so this fit is redundant but harmless
lr_final.fit(X_train, y_train)

print('train set', lr_final.score(X_train, y_train))
print('test set', lr_final.score(X_test, y_test))

  • train set - 1.0
  • test set - 0.9777777777777777

✅ Why GridSearchCV's best score differs from the final model's scores:

  • best_score_ is the mean accuracy over the 5 cross-validation folds of the training set, so it differs from the single train/test scores above.
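
best_score_ can be reproduced with a plain 5-fold cross-validation of the tuned pipeline on the training set (sketch):

from sklearn.model_selection import cross_val_score

# mean 5-fold CV accuracy on the training data, ~0.985 like gs.best_score_
print(cross_val_score(lr_final, X_train, y_train, cv=5).mean())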

decision3 = lr_final.decision_function(X_test[:5])
print(np.round(decision3, 2))

 [ 6.82 -3.82 -3.  ]
 [ 6.77 -6.02 -0.75]
 [-2.14 -2.14  4.28]
 [ 6.16 -2.08 -4.08]
 [-2.28  7.26 -4.98]

proba = lr_final.predict_proba(X_test[0:5])
print(np.round(proba, 3))

 [1.    0.    0.   ]
 [0.999 0.    0.001]
 [0.002 0.002 0.997]
 [1.    0.    0.   ]
 [0.    1.    0.   ]

from scipy.special import softmax

proba1 = softmax(decision3, axis=1)
print(np.round(proba1, decimals=3))

 [1.    0.    0.   ]
 [0.999 0.    0.001]
 [0.002 0.002 0.997]
 [1.    0.    0.   ]
 [0.    1.    0.   ]

print(lr_final.predict(X_test[0:5]))
-> [0 0 2 0 1]
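
As in the binary case, predict just takes the argmax over the class scores; a quick check (sketch):

# the predicted label is the column with the highest probability / score
print(proba1.argmax(axis=1))      # [0 0 2 0 1]
print(decision3.argmax(axis=1))   # same labels from the raw scores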
