Regression.Araboza

양우민 · January 28, 2022

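What Is Regression Analysis?

Regression analysis is a method long used in statistics. Based on observed data, it models the relationship between continuous variables and measures how well the model fits.

Example: the relationship between parents' heights and their children's heights.

As in this example, assuming that the relationship between two variables is a straight line and analyzing it on that basis is called linear regression analysis. It applies to many real-world cases, so it is worth trying whenever a problem falls within its basic assumptions (linearity, independence, homoscedasticity, and normality of the variables involved).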

Linear Regression Analysis

Linear regression analysis is a regression technique that models the linear relationship between a dependent variable Y and one or more independent variables X.

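1. Linear Regression Modeling

Linear regression equation:

$y = \beta x + \epsilon$

β = regression coefficient (parameter)

ϵ = error term between the dependent and independent variables

y, x = data

Given data, linear regression modeling estimates suitable parameter values from the data and keeps refining the model based on those estimates. In short, it makes the linear equation with the estimated parameters fit the given data well.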
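The estimation step described above can be sketched with NumPy. This is a minimal illustration, assuming the single-variable form y = βx + ϵ with no intercept; the synthetic data, `true_beta`, and the noise level are made up for the example and do not come from the original post.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2.0 * x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
true_beta = 2.0
y = true_beta * x + rng.normal(0.0, 0.5, size=200)

# Least-squares estimate of beta for the no-intercept model y = beta * x + eps:
# beta_hat = sum(x * y) / sum(x * x)
beta_hat = np.sum(x * y) / np.sum(x * x)

# Residuals are the estimated errors (eps) under the fitted line
residuals = y - beta_hat * x

print("estimated beta:", beta_hat)
print("mean squared residual:", np.mean(residuals ** 2))
```

The estimate should land close to the value used to generate the data, which is what "fitting the linear equation to the data" means in practice.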

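2. Linear Regression Notation in Machine Learning

Linear regression equation in machine learning:

$H = Wx + b$

H (Hypothesis) = hypothesis

W (Weight) = weight

b (bias) = bias

In deep learning or machine learning, finding a regression model means finding W (the weights) and b (the bias) from the given data. W and b are often high-dimensional matrices rather than plain scalars, and the more parameters there are, the larger the model becomes and the harder it is to train.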
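To make the point about W and b being arrays concrete, here is a small gradient-descent sketch in NumPy. The toy data, `true_W`, `true_b`, the learning rate, and the step count are all assumptions chosen for illustration. Samples are stored as rows, so the hypothesis is computed as `X @ W + b`, the row-wise equivalent of H = Wx + b.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: 100 samples, 3 features, 1 target
X = rng.normal(size=(100, 3))
true_W = np.array([[1.5], [-2.0], [0.5]])   # shape (3, 1)
true_b = 0.7
y = X @ true_W + true_b + rng.normal(scale=0.01, size=(100, 1))

# Parameters to learn: W is a (3, 1) matrix, b is a length-1 vector
W = np.zeros((3, 1))
b = np.zeros(1)

learning_rate = 0.1
for _ in range(500):
    H = X @ W + b                          # hypothesis, shape (100, 1)
    error = H - y
    grad_W = 2.0 * X.T @ error / len(X)    # gradient of mean squared error w.r.t. W
    grad_b = 2.0 * error.mean(axis=0)      # gradient of mean squared error w.r.t. b
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print("learned W:", W.ravel())
print("learned b:", b)
print("number of parameters:", W.size + b.size)
```

Even this tiny model has four parameters; adding features or outputs grows W accordingly, which is the growth in model size the paragraph above refers to.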

3. Linear Regression Model in Python

💡 ( Tip ) The dataset is the Boston house prices dataset that ships with the scikit-learn library.

# NOTE: load_boston was removed in scikit-learn 1.2, so this example needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
from sklearn import model_selection
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Load the data
boston = load_boston()
data, price = boston['data'], boston['target']
x_train, x_test, y_train, y_test = model_selection.train_test_split(data, price, test_size=0.2)

df = pd.DataFrame(x_train, columns=boston['feature_names'])
print("Shape of the Boston dataset: ", data.shape)
print("Shape of price: ", price.shape)
print("Shape of the Boston train dataset: ", x_train.shape)
print("Shape of the Boston test dataset: ", x_test.shape)

df.head() # preview the first rows of the data

📄Output

The result above shows that the Boston dataset consists of 506 rows and 13 attributes, and that the price corresponding to each row is stored in price.

Description of each attribute

CRIM : per capita crime rate by town

ZN : proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS : proportion of non-retail business acres per town

CHAS : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX : nitric oxides concentration (parts per 10 million)

RM : average number of rooms per dwelling

AGE : proportion of owner-occupied units built prior to 1940

DIS : weighted distances to five Boston employment centres

RAD : index of accessibility to radial highways

TAX : full-value property-tax rate per $10,000

PTRATIO : pupil-teacher ratio by town

B : 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town

LSTAT : % lower status of the population

MEDV : Median value of owner-occupied homes in $1000's

Applying Linear Regression Analysis to the Boston Dataset

# Example: apply linear regression to each attribute of the Boston dataset
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10,35))
fig.suptitle('Boston dataset - (X:Y = each attr: price) with R2', fontsize=16, y=0.9)

for i in range(data.shape[1]): # look at the i-th attribute (column) of the Boston dataset

    single_attr, attr_name = data[:, i].reshape(-1, 1), boston['feature_names'][i] # data and name of the i-th attribute
    estimator = LinearRegression() # linear regression model

    # Fit with x = single_attr and y = price; the model finds W and b by least squares
    estimator.fit(single_attr, price) 

    # Predicted Y values obtained by feeding X into the regression model fitted above
    pred_price = estimator.predict(single_attr)

    score = metrics.r2_score(price, pred_price) # coefficient of determination (R2)

    # Create the subplot
    ax = fig.add_subplot(7, 2, i+1)
    ax.scatter(single_attr, price) # scatter plot of the actual data
    ax.plot(single_attr, pred_price, color='red') # trend line of the linear regression model
    ax.set_title("{} x price, R2 score={:.3f}".format(attr_name ,score)) # subplot title
    ax.set_xlabel(attr_name) # x-axis
    ax.set_ylabel('price') # y-axis

📄Output

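Logistic Regression Analysis

Logistic regression is a supervised learning algorithm that predicts the probability that a data point belongs to a category and, based on that probability, classifies it into the more likely category.

1. Logistic Regression Equation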

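The following expression is called the odds (the probability that an event occurs divided by the probability that it does not):

$\text{Odds} = \frac{P(Y=0 \mid x)}{1 - P(Y=0 \mid x)}$

Taking the logarithm of the odds gives the following expression, called the log-odds, which is modeled as a linear function of the p independent variables:

$\log \frac{P(Y=0 \mid x)}{1 - P(Y=0 \mid x)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$

Using this equation, we can find the regression coefficients β that explain the given data. What we actually want is the probability itself, such as the probability that the dependent variable is 0 or 1, so rearranging the log-odds equation for P(Y=0∣x) yields:

$P(Y=0 \mid x) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)\right)}$

This is exactly the form of the sigmoid function that appears so often in machine learning and deep learning.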
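The relationship between log-odds and the sigmoid can be checked numerically. The sketch below is illustrative only: the helper names `log_odds` and `sigmoid` and the sample probabilities are assumptions, not part of the original post.

```python
import numpy as np

def log_odds(p):
    """Log of the odds p / (1 - p)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of log_odds: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.4, 0.9])
z = log_odds(p)          # negative for p < 0.5, positive for p > 0.5
recovered = sigmoid(z)   # applying the sigmoid recovers the original probabilities

print("log-odds:", z)
print("recovered probabilities:", recovered)

# Classify by picking the more likely outcome (threshold 0.5)
labels = (recovered >= 0.5).astype(int)
print("labels:", labels)
```

Because the sigmoid undoes the log-odds transform, a model that is linear in the log-odds can still output a proper probability.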

2. Logistic Regression in Python

💡 ( Tip ) The dataset is the breast cancer dataset that ships with the scikit-learn library.

# Logistic regression example: breast cancer dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data
cancer=load_breast_cancer()

# y = 0 (malignant tumor), y = 1 (benign tumor)
cancer_X, cancer_y= cancer.data, cancer['target']
train_X, test_X, train_y, test_y = train_test_split(cancer_X, cancer_y, test_size=0.1, random_state=10) # split into train and test datasets
print("Total number of patients: {}".format(len(cancer_X)))
print("Number of attributes: {}".format(len(cancer_X[0])))
print("Patients in the train dataset: {}".format(len(train_X)))
print("Patients in the test dataset: {}".format(len(test_X)))
cancer_df = pd.DataFrame(cancer_X, columns=cancer['feature_names'])
cancer_df.head()

📄Output

Logistic regression example

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

LR = LogisticRegression() # logistic regression model
LR.fit(train_X, train_y) # train the model on the breast cancer train data
pred = LR.predict(test_X) # predict labels for the test data with the trained model

# Report comparing the model's predictions with the actual labels
print(classification_report(test_y, pred))

📄Output

Softmax Function and Cross Entropy

1. Softmax Function

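The softmax function classifies data into multiple categories rather than just two.

$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

Looking closely at the formula, the value for each category lies between 0 and 1, and another key property is that the softmax values over all categories sum to 1. The softmax function amplifies the difference between large and small log-odds. Passing the log-odds of every category through the softmax function at the end makes it clear which category the data belongs to, and the result is expressed with one-hot encoding, which encodes the largest value as 1 and all the others as 0.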
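The properties described above can be verified with a few lines of NumPy. This is a minimal sketch: the `softmax` helper and the example scores are assumptions for illustration, and subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max keeps exp() from overflowing."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Assumed scores (log-odds) for three categories
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

# One-hot encoding: 1 at the largest probability, 0 elsewhere
one_hot = np.zeros_like(probs)
one_hot[np.argmax(probs)] = 1.0

print("probabilities:", probs)
print("sum:", probs.sum())
print("one-hot:", one_hot)
```

Each probability falls strictly between 0 and 1, the probabilities sum to 1, and the one-hot vector marks the category with the largest score.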

2. Cross Entropy

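Cross entropy is used as the loss function for softmax.

$H(p, q) = -\sum_{x} p(x) \log q(x)$

Because cross entropy is a loss function, the weights are trained in the direction that decreases H(p,q) as they become better optimized. Here p(x) is the true category value of the data, and q(x) is the softmax output.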
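A short NumPy sketch shows how H(p,q) behaves for a good and a bad prediction. The `cross_entropy` helper, the small `eps` clip that guards against log(0), and the example vectors are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps avoids log(0)."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])           # true category, one-hot encoded
q_good = np.array([0.9, 0.05, 0.05])    # confident, correct prediction
q_bad = np.array([0.2, 0.5, 0.3])       # mostly wrong prediction

loss_good = cross_entropy(p, q_good)
loss_bad = cross_entropy(p, q_bad)
print("loss (good prediction):", loss_good)
print("loss (bad prediction):", loss_bad)
```

The loss is smaller when the softmax output puts more probability on the true category, which is why minimizing it pushes the weights toward correct predictions.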

3. Softmax and Cross Entropy Example Using the Breast Cancer Dataset

import tensorflow as tf
from tensorflow import keras

n_dense=30
n_train_epoch=20
num_classes = 2 # malignant, benign

model=keras.models.Sequential()
model.add(keras.layers.Dense(num_classes, use_bias=True, activation='softmax', input_shape=(30,)))

model.summary()
model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

# Train the model
model.fit(train_X, train_y, epochs=n_train_epoch)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(test_X, test_y, verbose=1)
print("test_loss: {} ".format(test_loss))
print("test_accuracy: {}".format(test_accuracy))

📄Output

import tensorflow as tf
from tensorflow import keras

n_dense=30
n_train_epoch=20
num_classes = 2 # malignant, benign

model=keras.models.Sequential()

# Add three layers
model.add(keras.layers.Dense(n_dense, input_shape=(30,), use_bias=True))
model.add(keras.layers.Dense(n_dense,  use_bias=True))
model.add(keras.layers.Dense(n_dense,  use_bias=True))

model.add(keras.layers.Dense(num_classes, use_bias=True, activation='softmax'))

model.summary()
model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

# Train the model
model.fit(train_X, train_y, epochs=n_train_epoch)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(test_X, test_y, verbose=1)
print("test_loss: {} ".format(test_loss))
print("test_accuracy: {}".format(test_accuracy))