Machine Learning - Ensemble Methods

dumbbelldore · January 1, 2025

1. Ensemble Methods

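An approach that combines several models so that the group makes better decisions than any single ML model, compensating for each model's weaknesses.
Representative examples are bagging, voting, boosting, and stacking.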
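
To see why combining models can help, the sketch below computes the accuracy of a plain majority vote over n classifiers. It assumes that every classifier is correct with the same probability p and that their errors are independent, which real models rarely satisfy. The function name `majority_vote_accuracy` is hypothetical and used only for illustration.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, picks the right answer (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd n so that ties cannot occur")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))

single = 0.7  # assumed accuracy of one model
for n in (1, 3, 11):
    print(n, round(majority_vote_accuracy(single, n), 3))
```

Under these idealized assumptions the vote becomes more accurate as n grows. The gain shrinks when the models make correlated errors, which is why the methods below try to keep the base models diverse.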

2. Bagging

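Models of the same type are each trained on a random sample drawn with replacement from the training data (bootstrapping), and their predictions are averaged (regression) or decided by majority vote (classification).
The well-known Random Forest is a form of bagging over decision trees that also considers only a random subset of features at each split; it trains relatively quickly and generally performs well.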
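
The bootstrapping step can be looked at in isolation. The following is a minimal sketch using only the standard library; `bootstrap_indices` is a hypothetical helper, not part of scikit-learn. It draws as many row indices as there are rows, with replacement, and measures how many distinct rows end up in the sample.

```python
import random

def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Draw n_samples row indices uniformly at random WITH replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(42)
n_samples = 10_000
sample = bootstrap_indices(n_samples, rng)

# Some rows are drawn several times, others not at all.
unique_fraction = len(set(sample)) / n_samples
print(f"rows drawn at least once: {unique_fraction:.3f}")  # close to 1 - 1/e
print(f"rows left out (out-of-bag): {1 - unique_fraction:.3f}")
```

Because each model sees a different resampled view of the data, the trained models differ from one another, and averaging them reduces the variance of a single high-variance learner such as a deep decision tree.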

# Load data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

wine = pd.read_csv("./data/wine.csv")

X = wine.drop(columns="type")
y = wine["type"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Shared helper: fit the estimator and print its test-set performance
def print_perf(estimator):
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    print(classification_report(y_test, y_pred))
    
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging
bg = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42,
)

print_perf(bg)

## Output
#               precision    recall  f1-score   support
# 
#            0       0.99      1.00      0.99       959
#            1       0.99      0.96      0.97       341
# 
#     accuracy                           0.99      1300
#    macro avg       0.99      0.98      0.98      1300
# weighted avg       0.99      0.99      0.99      1300

from sklearn.ensemble import RandomForestClassifier

# Random Forest
rf = RandomForestClassifier(random_state=42)

print_perf(rf)

## Output
#               precision    recall  f1-score   support
# 
#            0       0.99      1.00      1.00       959
#            1       1.00      0.98      0.99       341
# 
#     accuracy                           0.99      1300
#    macro avg       0.99      0.99      0.99      1300
# weighted avg       0.99      0.99      0.99      1300

3. Voting

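Models of different types are trained on the same data, and their predictions are averaged (regression) or decided by vote (classification).
Soft Voting: averages the class probabilities output by each model and picks the class with the highest average (usable only with models that can return probabilities)
Hard Voting: takes a majority vote over the class labels predicted by each model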
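
The two rules can disagree on the same sample. The sketch below uses made-up probabilities from three hypothetical models; `hard_vote` and `soft_vote` are illustrative helpers written for this example, not the scikit-learn implementation.

```python
from collections import Counter

# Hypothetical predicted probabilities [P(class 0), P(class 1)] from three
# models for a single sample; the numbers are made up for illustration.
probas = [
    [0.55, 0.45],  # model A leans slightly towards class 0
    [0.55, 0.45],  # model B leans slightly towards class 0
    [0.10, 0.90],  # model C is very confident in class 1
]

def hard_vote(probas):
    """Each model casts one vote for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities across models, then take the arg max."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

hard_result = hard_vote(probas)
soft_result = soft_vote(probas)
print("hard:", hard_result, "soft:", soft_result)  # hard: 0 soft: 1
```

Hard voting ignores how confident each model is, so two weak votes for class 0 outweigh one strong vote for class 1. Soft voting keeps that confidence, which helps only when the probabilities are reasonably well calibrated.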

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Voting (soft)
vs = VotingClassifier(
    [("LR", LogisticRegression(solver="liblinear", random_state=42)),
     ("KNN", KNeighborsClassifier()),
     ("DT", DecisionTreeClassifier(random_state=42))],
    voting="soft"
)

print_perf(vs)

## Output
#               precision    recall  f1-score   support
# 
#            0       0.99      1.00      0.99       959
#            1       0.99      0.96      0.98       341
# 
#     accuracy                           0.99      1300
#    macro avg       0.99      0.98      0.98      1300
# weighted avg       0.99      0.99      0.99      1300

# Voting (hard)
vh = VotingClassifier(
    [("LR", LogisticRegression(solver="liblinear", random_state=42)),
     ("KNN", KNeighborsClassifier()),
     ("DT", DecisionTreeClassifier(random_state=42))],
    voting="hard"
)

print_perf(vh)

## Output
#               precision    recall  f1-score   support
# 
#            0       0.98      1.00      0.99       959
#            1       0.99      0.95      0.97       341
# 
#     accuracy                           0.98      1300
#    macro avg       0.98      0.97      0.98      1300
# weighted avg       0.98      0.98      0.98      1300

4. Boosting

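Models are trained sequentially: samples that the previous model got wrong are given larger weights, so the next model focuses on them and compensates for the earlier errors. This reweighting is how AdaBoost works; gradient boosting variants instead fit each new model to the errors that remain (the gradient of the loss).
Well-known algorithms such as AdaBoost, Gradient Boosting, XGBoost, and LightGBM are all based on boosting.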
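
The reweighting idea can be traced by hand for a single round. This is a minimal sketch of the classic binary AdaBoost update, assuming five equally weighted samples of which one is misclassified; `adaboost_round` is a hypothetical helper, and library implementations may scale the learner weight differently.

```python
import math

def adaboost_round(weights, is_wrong):
    """One reweighting step of classic binary AdaBoost.

    weights  : current sample weights (they sum to 1)
    is_wrong : True where the current weak learner misclassified the sample
    Returns (alpha, new_weights).
    """
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    if not 0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - err) / err)  # influence of this learner in the final vote
    # Increase the weights of misclassified samples and decrease the rest.
    raw = [w * math.exp(alpha if wrong else -alpha) for w, wrong in zip(weights, is_wrong)]
    total = sum(raw)
    return alpha, [w / total for w in raw]  # renormalize so the weights sum to 1

weights = [0.2] * 5
is_wrong = [False, False, True, False, False]
alpha, new_weights = adaboost_round(weights, is_wrong)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
# 0.693 [0.125, 0.125, 0.5, 0.125, 0.125]
```

After one round the single misclassified sample carries half of the total weight, so the next weak learner is pushed hard to get it right.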

from sklearn.ensemble import AdaBoostClassifier

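# AdaBoost; note that the `algorithm` argument is deprecated in recent
# scikit-learn releases and may need to be dropped depending on your version
ada = AdaBoostClassifier(random_state=42, algorithm="SAMME", n_estimators=100)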
print_perf(ada)

## Output
#               precision    recall  f1-score   support
# 
#            0       0.99      1.00      0.99       959
#            1       0.99      0.98      0.98       341
# 
#     accuracy                           0.99      1300
#    macro avg       0.99      0.99      0.99      1300
# weighted avg       0.99      0.99      0.99      1300

5. Stacking

A meta learner is trained on the predictions of several base models of different types and produces the final prediction.
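
The two levels can be sketched without any library. The toy classes below (`ThresholdModel`, `LookupMetaLearner`) and the data are hypothetical and exist only to show the data flow. The sketch assumes the common practice, also used by scikit-learn's StackingClassifier, of training the meta learner on out-of-fold predictions so that it never sees predictions a base model made on its own training rows.

```python
from collections import Counter, defaultdict
from statistics import mean

class ThresholdModel:
    """Hypothetical toy base learner: predicts 1 when one feature
    exceeds that feature's mean on the training rows."""
    def __init__(self, feature):
        self.feature = feature
    def fit(self, X, y):
        self.cut = mean(row[self.feature] for row in X)
        return self
    def predict(self, X):
        return [int(row[self.feature] > self.cut) for row in X]

def out_of_fold(make_model, X, y, n_folds=4):
    """Predict every row with a model that never saw that row during training."""
    n = len(X)
    oof = [None] * n
    for k in range(n_folds):
        val = list(range(k, n, n_folds))  # rows held out in this fold
        held_out = set(val)
        train = [i for i in range(n) if i not in held_out]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        for i, pred in zip(val, model.predict([X[i] for i in val])):
            oof[i] = pred
    return oof

class LookupMetaLearner:
    """Hypothetical meta learner: majority label per combination of base predictions."""
    def fit(self, Z, y):
        groups = defaultdict(list)
        for z, label in zip(Z, y):
            groups[tuple(z)].append(label)
        self.table = {z: Counter(ls).most_common(1)[0][0] for z, ls in groups.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, Z):
        return [self.table.get(tuple(z), self.default) for z in Z]

# Made-up data: the label is 1 only when both features are large.
X = [[1, 1], [2, 8], [8, 2], [9, 9], [1, 9], [9, 1], [8, 8], [2, 2],
     [3, 7], [7, 3], [7, 7], [3, 3]]
y = [int(a > 5 and b > 5) for a, b in X]

factories = [lambda: ThresholdModel(0), lambda: ThresholdModel(1)]
# Level 1: out-of-fold predictions of each base model become the meta features.
columns = [out_of_fold(f, X, y) for f in factories]
meta_X = [list(row) for row in zip(*columns)]
# Level 2: the meta learner is trained on those predictions, not on the raw features.
meta = LookupMetaLearner().fit(meta_X, y)
# For new data, base models refit on all rows produce the meta features.
base_models = [f().fit(X, y) for f in factories]
new_rows = [[9, 8], [9, 2]]
new_meta_X = [list(row) for row in zip(*(m.predict(new_rows) for m in base_models))]
stacked_pred = meta.predict(new_meta_X)
print(stacked_pred)  # [1, 0]
```

Neither base model can express "both features are large" on its own, but the meta learner recovers that rule from the pair of base predictions, which is the point of stacking.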

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

stk = StackingClassifier(
    estimators=[
        ("KNN", KNeighborsClassifier()),
        ("DT", DecisionTreeClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(solver="liblinear", random_state=42),  # meta learner
)

print_perf(stk)

## Output
#               precision    recall  f1-score   support
# 
#            0       0.99      0.99      0.99       959
#            1       0.97      0.96      0.97       341
# 
#     accuracy                           0.98      1300
#    macro avg       0.98      0.98      0.98      1300
# weighted avg       0.98      0.98      0.98      1300

*This post is based on excerpts from the course materials of the Zerobase Data Job School.
