[Machine Learning] Ensemble

syEON · September 1, 2023

Machine Learning

Ensemble: when weak models are combined properly, the result is a more accurate and more robust model.
The main types of ensemble are voting, bagging, boosting, and stacking.

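The algorithms covered here (Random Forest, XGBoost, LightGBM) are refinements of the Decision Tree. Cross-validation is not built into the estimators themselves; it is applied separately when evaluating or tuning them.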

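voting

A method that decides the final prediction by a vote over the predictions of several models.
Hard voting: the final result is the class predicted by the most models (the mode).
Soft voting: the predicted class probabilities of all models are averaged per label, and the label with the highest average is selected.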
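Hard and soft voting can disagree on the same sample. The following is a minimal sketch with made-up probabilities (the three "models" and their numbers are hypothetical, not taken from any trained model), only meant to show how each rule is computed.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample from three models.
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.90, 0.10],  # model A is very sure of class 0
    [0.40, 0.60],  # model B leans toward class 1
    [0.45, 0.55],  # model C leans toward class 1
])

# Hard voting: each model casts one vote for its most likely class,
# and the most frequent class (the mode) wins.
votes = probas.argmax(axis=1)
hard_result = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities per class, then pick the
# class with the highest average.
mean_proba = probas.mean(axis=0)
soft_result = mean_proba.argmax()

print("votes:", votes)
print("hard voting picks class", hard_result)
print("soft voting picks class", soft_result)
```

Here two of the three models vote for class 1, so hard voting picks class 1, while the strong confidence of model A pulls the averaged probability toward class 0. In scikit-learn the same choice is made with the voting="hard" or voting="soft" argument of VotingClassifier.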

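bagging

Uses models built on the same type of algorithm.
Sampling is random with replacement (bootstrap): the same row may appear more than once in a sample.
For a categorical target (classification), results are aggregated by voting.
For a numeric target (regression), results are aggregated by averaging.
Representative algorithm: Random Forest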
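The bagging steps above can be sketched by hand. This is only an illustration under assumptions: the synthetic dataset from make_classification, the 15 shallow trees, and the seed values are all arbitrary choices, not something from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for the demo).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rng = np.random.default_rng(1)
n_models = 15  # odd, so a binary majority vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices WITH replacement,
    # so some rows repeat and others are left out.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=3, random_state=1)
    models.append(tree.fit(X[idx], y[idx]))

# Collect every model's prediction: shape (n_models, n_samples).
all_preds = np.array([m.predict(X) for m in models])

# Classification: aggregate by majority vote over the 0/1 labels.
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("unique rows in the last bootstrap sample:", len(np.unique(idx)), "of", len(X))
print("training accuracy of the vote:", (bagged_pred == y).mean())
```

For a regression target, the vote in the last step would be replaced by a plain mean of the predictions. scikit-learn packages the same loop as BaggingClassifier and BaggingRegressor.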

Random Forest

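Samples the data randomly.
When building each tree, the features are also chosen at random (not every feature is used).

Because Random Forest is built on the DecisionTree, its hyperparameters and attributes are almost the same. The difference is n_estimators, which sets the number of Decision Trees. The default is 100 (n_estimators) bootstrap datasets, so 100 models are trained, each on a different dataset. The models' results are aggregated by averaging or voting.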
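To make the aggregation step concrete, the sketch below checks that a fitted RandomForestRegressor holds one tree per bootstrap dataset and that its prediction equals the average of the individual trees. The synthetic data from make_regression is an assumption for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for the demo).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=1)

# n_estimators is left at its default value.
forest = RandomForestRegressor(random_state=1).fit(X, y)

# The fitted trees are stored in estimators_.
n_trees = len(forest.estimators_)

# Average the predictions of the individual trees by hand...
tree_preds = np.array([tree.predict(X) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# ...and compare with what the forest itself returns.
matches = np.allclose(manual_average, forest.predict(X))

print("number of trees:", n_trees)
print("forest prediction equals the mean of the trees:", matches)
```

For RandomForestClassifier the aggregation is a vote instead; scikit-learn's implementation averages the trees' class probabilities rather than counting hard votes.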

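Parameters
random_state: random seed. Fix it before tuning so results are reproducible!
n_jobs: number of CPU cores to use (-1 uses all cores)
max_depth: maximum depth a tree can grow to. Used to prevent overfitting
n_estimators: number of trees in the ensemble
max_features: maximum number of features considered at each split. Used to prevent overfitting
min_samples_split: minimum number of samples a node must have to be split. default=2. Used to prevent overfitting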

Use RandomForestRegressor for regression and RandomForestClassifier for classification.

# Import
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, r2_score
#from sklearn.metrics import confusion_matrix, classification_report

# Declare (x_train, x_test, y_train, y_test are assumed to be prepared already)
model = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=1)
#model = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=1)

# Train
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)

# Evaluate with the same metrics as any other regression/classification model
print(mean_absolute_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred))
#print(classification_report(y_test, y_pred))

boosting

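Trains by repeatedly reducing the error.
• Models built on the same type of algorithm are trained sequentially.
• Data that the previous model failed to predict correctly is given more weight, and the next model trains and predicts with those weights.
• The name comes from repeatedly boosting these weights as training proceeds.
• Its strong predictive performance has made it the leading approach in ensemble learning.
• It often performs better than bagging, but it trains more slowly and can overfit.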
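The idea of shrinking the error one model at a time can be sketched as a small loop. Note that this sketch uses residual fitting, the squared-error form of gradient boosting (the family XGBoost and LightGBM belong to), rather than the sample-reweighting variant described in the bullets. The synthetic data, the learning rate of 0.1, and the 50 rounds are assumptions for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for the demo).
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=1)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target.
prediction = np.full(len(y), y.mean())
errors = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # The next tree is trained on what the ensemble still gets wrong.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, residual)
    # Add a damped correction to the running prediction.
    prediction = prediction + learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print("training MSE before boosting:", errors[0])
print("training MSE after boosting: ", errors[-1])
```

The loop is strictly sequential because each tree needs the residuals left by the previous ones, which is why boosting tends to train more slowly than bagging. The training error keeps falling with more rounds, so the number of rounds and the tree depth are the usual levers against overfitting.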

XGBoost(eXtreme Gradient Boosting)

Feature: supports parallel training.

Library: xgboost (it is not part of sklearn, so install it if missing: !pip install xgboost)

XGBClassifier and XGBRegressor cover classification and regression, so use the one that matches the task.

from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import confusion_matrix, classification_report

# Declare
model = XGBClassifier(max_depth=5, random_state=1)
# Train (recent xgboost versions expect class labels encoded as 0..n_classes-1)
model.fit(x_train, y_train)
# Predict
y_pred = model.predict(x_test)
# Evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

LightGBM

Similar to XGBoost, but a lighter model.

from lightgbm import LGBMClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Declare
model = LGBMClassifier(max_depth=5, random_state=1)
# Train
model.fit(x_train, y_train)
# Predict
y_pred = model.predict(x_test)
# Evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

stacking

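A method that uses the predictions of several models as the training data for a final model.
Example)
• Obtain three sets of predictions with KNN, Logistic Regression, and XGBoost models, then
use those predictions as the training data for the final model, a Random Forest.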
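The example above could look roughly like this with scikit-learn's StackingClassifier. This is a sketch under assumptions: the data is synthetic, and GradientBoostingClassifier stands in for XGBoost so that the block runs with scikit-learn alone (XGBClassifier can be dropped in at the same place if xgboost is installed).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (an assumption for the demo).
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Base models whose predictions become the final model's input.
base_models = [
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("gb", GradientBoostingClassifier(random_state=1)),  # stand-in for XGBoost
]

# The final model is a Random Forest trained on the base predictions.
# cv=5 means those predictions are produced out-of-fold, so the final
# model does not learn from predictions on rows the base models saw.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(random_state=1),
    cv=5,
)
stack.fit(x_train, y_train)

y_pred = stack.predict(x_test)
# transform() exposes the base-model predictions fed to the final model.
meta_features = stack.transform(x_test)

print("meta-feature shape:", meta_features.shape)
print("test accuracy:", stack.score(x_test, y_test))
```

Generating the base predictions out-of-fold is the main design point: if the final model were trained on predictions the base models made for their own training rows, it would learn to trust them more than it should.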
