Predictive Analytics with Ensemble Learning

oyoi · March 28, 2024

AI ECC

ECC_Artificial Intelligence with Python


1 Building learning models with Ensemble Learning

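1 Ensemble learning

A method that combines several models so that they produce better results than any single model used alone.

2 Why ensemble models are effective

They reduce the risk of selecting a poor model, for example one that suffers from bias in the training data or from overfitting.
Because they draw on a variety of models, they generalize well to new data.

3 Diversity is essential for ensemble learning to pay off.

Use a variety of models that differ from one another
Use different training parameters for each model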
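
To see why combining models helps, here is a minimal sketch under an idealized assumption: a binary task where every model is correct with the same probability p and the models make their errors independently. Real models are never fully independent, which is exactly why diversity matters. The function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters, each correct
    with probability p, picks the right answer (n is assumed to be odd)."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Accuracy of the vote as the ensemble grows, with each model at 60% accuracy
for n in (1, 5, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the accuracy of the vote rises with the number of models. If all the models made the same mistakes, the vote would be no better than a single model.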

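2 What are Decision Trees and how to build a Decision Tree classifier

1 Decision trees

A decision tree is a structure that creates branches which split the data according to rules, progressively subdividing the data along those branches until a final decision can be reached. Through training, it builds the branches, classifies the data, and predicts an output value (label). The learning algorithm generates rules from the relationship between the input data and the target labels in the training data, and these rules are represented as the nodes of the tree.

To build the best tree from the data, you need the concept of entropy. Here, entropy is a measure of the uncertainty of information. A decision tree should be built so that uncertainty decreases at each level; that is, entropy should decrease as you move down the tree.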
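
As a rough illustration of the entropy idea, the sketch below uses the Shannon entropy H = -Σ p_i log2(p_i) over the class proportions p_i in a node, and measures how much a split lowers it (information gain). The helper names entropy and information_gain and the toy labels are made up for this example; scikit-learn's trees use Gini impurity by default and only use entropy when asked to.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]          # perfectly mixed node
weak_split = ([0, 0, 0, 1], [0, 1, 1, 1])  # children are still mixed
pure_split = ([0, 0, 0, 0], [1, 1, 1, 1])  # children are pure

print('Parent entropy:', entropy(parent))
print('Gain of weak split:', round(information_gain(parent, *weak_split), 4))
print('Gain of pure split:', round(information_gain(parent, *pure_split), 4))
```

A split that produces purer children removes more uncertainty, so a tree built this way prefers it.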

3 Building a decision tree based classifier

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Helper from the local utilities module (not part of scikit-learn)
from utilities import visualize_classifier

# Load input data
input_file = 'data_decision_trees.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

# Separate input data into two classes based on labels
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])

# Visualize input data
plt.figure()
plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='black',
            edgecolors='black', linewidth=1, marker='x')
plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white',
            edgecolors='black', linewidth=1, marker='o')
plt.title('Input data')

# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=5)

# Decision Trees classifier
params = {'random_state': 0, 'max_depth': 4}
classifier = DecisionTreeClassifier(**params)
classifier.fit(X_train, y_train)
visualize_classifier(classifier, X_train, y_train, 'Training dataset') 

y_test_pred = classifier.predict(X_test) 
visualize_classifier(classifier, X_test, y_test, 'Test dataset') 

# Evaluate classifier performance 
class_names = ['Class-0', 'Class-1'] 
print("\n" + "#"*40) 
print("\nClassifier performance on training dataset\n") 
print(classification_report(y_train, classifier.predict(X_train), target_names=class_names)) 
print("#"*40 + "\n") 
 
print("#"*40) 
print("\nClassifier performance on test dataset\n") 
print(classification_report(y_test, y_test_pred, target_names=class_names)) 
print("#"*40 + "\n") 
 
plt.show()

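+) Precision measures how accurate the classifications are, and recall is the fraction of the relevant data points that are classified correctly. A good classifier shows both high precision and high recall, but the two generally trade off against each other. The F1 score, the harmonic mean of precision and recall, is therefore used as a single summary measure.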
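
The relationship between the three numbers can be sketched from raw counts. The function name precision_recall_f1 and the counts below (true positives, false positives, false negatives) are made up for illustration; classification_report computes the same quantities per class.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts for one class."""
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print('precision =', round(p, 3), 'recall =', round(r, 3), 'F1 =', round(f1, 3))
```

Because the harmonic mean is pulled toward the smaller value, a classifier cannot get a high F1 score by being strong on only one of the two measures.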

3 What are Random Forests and Extremely Random Forests, and how to build classifiers based on them

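1 Random forests

One of the methods used in ensemble learning, in which many different decision tree models are built and used together. The full training data is randomly divided into several subsets, and a decision tree is trained on each subset to produce a variety of models. Random forests ensure diversity by splitting the training data at random.

One of the biggest advantages of random forests is that they help avoid overfitting. While a tree is being built, each node splits the data by class, and at each level the threshold that best reduces entropy is chosen. This split does not consider every feature; instead, a random subset of features is selected and used to split the data. Adding randomness can increase bias, but averaging reduces variance, and the result is a more robust model.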

import argparse

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from utilities import visualize_classifier

# Argument parser
def build_arg_parser():
      parser = argparse.ArgumentParser(description='Classify data using '
                  'Ensemble Learning techniques')
      parser.add_argument('--classifier-type', dest='classifier_type',
                  required=True, choices=['rf', 'erf'],
                  help="Type of classifier to use; can be either 'rf' or 'erf'")
      return parser

if __name__=='__main__':
      # Parse the input arguments
      args = build_arg_parser().parse_args()
      classifier_type = args.classifier_type

      # Load input data
      input_file = 'data_random_forests.txt'
      data = np.loadtxt(input_file, delimiter=',')
      X, y = data[:, :-1], data[:, -1]

      # Separate input data into three classes based on labels
      class_0 = np.array(X[y==0])
      class_1 = np.array(X[y==1])
      class_2 = np.array(X[y==2])

      # Visualize input data
      plt.figure()
      plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='white',
                              edgecolors='black', linewidth=1, marker='s')
      plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white',
                              edgecolors='black', linewidth=1, marker='o')
      plt.scatter(class_2[:, 0], class_2[:, 1], s=75, facecolors='white',
                              edgecolors='black', linewidth=1, marker='^')
      plt.title('Input data')

      # Split data into training and testing datasets
      X_train, X_test, y_train, y_test = train_test_split(
                  X, y, test_size=0.25, random_state=5)
                  
      # Ensemble Learning classifier 
      params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0} 
      
      if classifier_type == 'rf': 
            classifier = RandomForestClassifier(**params) 
      else: 
            classifier = ExtraTreesClassifier(**params) 
            
      classifier.fit(X_train, y_train) 
      visualize_classifier(classifier, X_train, y_train, 'Training dataset')
      
      y_test_pred = classifier.predict(X_test) 
      visualize_classifier(classifier, X_test, y_test, 'Test dataset')
      
      # Evaluate classifier performance 
      class_names = ['Class-0', 'Class-1', 'Class-2'] 
      print("\n" + "#"*40) 
      print("\nClassifier performance on training dataset\n") 
      print(classification_report(y_train, classifier.predict(X_train), target_names=class_names)) 
      print("#"*40 + "\n") 
 
      print("#"*40) 
      print("\nClassifier performance on test dataset\n") 
      print(classification_report(y_test, y_test_pred, target_names=class_names)) 
      print("#"*40 + "\n")

$ python3 random_forests.py --classifier-type rf

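2 Extremely random forests

Extremely random forests take the randomness one step further. In addition to using a randomly selected subset of the input features, they also choose the thresholds at random. These randomly generated thresholds are used as the splitting rules, which reduces the variance of the model even further. As a result, extremely random forests tend to produce smoother decision boundaries than random forests.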

$ python3 random_forests.py --classifier-type erf
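
The data file used by the script is not included in this post, so here is a self-contained sketch that trains both classifiers on synthetic data instead. The three cluster centers passed to make_blobs are arbitrary values chosen for this illustration; the model parameters are the same ones used in the script.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data (a stand-in for data_random_forests.txt)
X, y = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [4, 7]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0}
scores = {}
for name, cls in [('rf', RandomForestClassifier), ('erf', ExtraTreesClassifier)]:
    model = cls(**params).fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(name, 'test accuracy:', round(scores[name], 3))
```

On data this well separated both models should score about the same; the difference between them shows up mainly in the shape of the decision boundaries on noisier data.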

4 Estimating the confidence measure of the predictions

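Estimating the confidence of a prediction is an important task in machine learning. The class probabilities predicted by a classifier can serve as that confidence measure.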

      # Compute confidence (continues the main block of the script above)
      test_datapoints = np.array([[5, 5], [3, 6], [6, 4], [7, 2], [4, 4], [5, 2]])

      print("\nConfidence measure:")
      for datapoint in test_datapoints:
            probabilities = classifier.predict_proba([datapoint])[0]
            predicted_class = 'Class-' + str(np.argmax(probabilities))
            print('\nDatapoint:', datapoint)
            print('Predicted class:', predicted_class)

      # Visualize the datapoints
      visualize_classifier(classifier, test_datapoints,
                  [0]*len(test_datapoints),
                  'Test datapoints')

      plt.show()
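
The snippet above prints only the predicted class. As a small self-contained sketch, the probabilities themselves can be printed as the confidence value. The synthetic data and the three test points below are made up for this illustration and are not the data used in the post.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic three-class data, one cluster per class
X, y = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [4, 7]],
                  cluster_std=1.0, random_state=0)
classifier = ExtraTreesClassifier(n_estimators=100, max_depth=4,
                                  random_state=0).fit(X, y)

# Two points at cluster centers and one point between the clusters
test_datapoints = np.array([[0, 0], [4, 3], [8, 0]])
probabilities = classifier.predict_proba(test_datapoints)

for datapoint, probs in zip(test_datapoints, probabilities):
    predicted_class = 'Class-' + str(np.argmax(probs))
    print('Datapoint:', datapoint, '| Predicted class:', predicted_class,
          '| Confidence:', round(float(np.max(probs)), 3))
```

Each row of predict_proba sums to 1, and the largest entry can be read as how strongly the ensemble agrees on the predicted class.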

5 Dealing with class imbalance

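A classifier's performance depends heavily on the data it is trained on. Ideally every class would have the same number of data points, but that is rarely the case in practice. This kind of imbalance can be handled algorithmically.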

$ python3 class_imbalance.py balance
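
The script itself is not shown in this post. One common algorithmic remedy is to weight each class inversely to its frequency, which is what scikit-learn does when a classifier is given class_weight='balanced'. That the balance argument above turns on this option is an assumption. The sketch below reproduces the documented formula, n_samples / (n_classes * class_count), on made-up labels; the helper name balanced_class_weights is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights following the formula documented for
    class_weight='balanced': n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Made-up imbalanced labels: 90 points of class 0, 10 points of class 1
y = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(y)
print(weights)
```

The rare class gets a much larger weight, so mistakes on it cost more during training and the classifier can no longer do well by ignoring it.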

6 Finding optimal training parameters using grid search

Grid search is used to find the best parameters for a classifier. You specify a range of values for each parameter, and grid search automatically tries the combinations and reports the best one.

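from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
parameter_grid = [ {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
                   {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]}
                 ]

metrics = ['precision_weighted', 'recall_weighted']

# X_train, X_test, y_train, y_test come from the train/test split shown earlier
for metric in metrics:
      print("\n##### Searching optimal parameters for", metric)

      classifier = GridSearchCV(
                  ExtraTreesClassifier(random_state=0),
                  parameter_grid, cv=5, scoring=metric)
      classifier.fit(X_train, y_train)

      print("\nGrid scores for the parameter grid:")
      # grid_scores_ was removed; cv_results_ holds the same information
      results = classifier.cv_results_
      for params, avg_score in zip(results['params'], results['mean_test_score']):
            print(params, '-->', round(avg_score, 3))

      print("\nBest parameters:", classifier.best_params_)

      y_pred = classifier.predict(X_test)
      print("\nPerformance report:\n")
      print(classification_report(y_test, y_pred))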
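
To make "tries the combinations" concrete, the sketch below expands the same parameter grid by hand. The helper expand_grid is made up for this illustration; it only imitates how a list of sub-grids is enumerated and is not scikit-learn's implementation.

```python
from itertools import product

parameter_grid = [
    {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
    {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]},
]

def expand_grid(grid):
    """List every parameter combination; each dict in the grid is
    expanded on its own, and the results are concatenated."""
    combos = []
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos

candidates = expand_grid(parameter_grid)
for candidate in candidates:
    print(candidate)
print('Number of candidates:', len(candidates))
```

The two sub-grids give 5 + 4 candidates rather than 5 x 4, because each sub-grid varies only one parameter. With cv=5, every candidate is trained and scored five times for each metric.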

7 Computing relative feature importance

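When working with a dataset of N-dimensional data, not all features are equally important. The AdaBoost regressor is an algorithm that can compute the importance of each feature, which helps improve performance. AdaBoost builds several learners over multiple stages, updating the weights of the data points at each stage to improve performance and determine the final output.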

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Load housing data
# Note: load_boston was removed in scikit-learn 1.2, so this line needs an
# older release; on newer releases another regression dataset must be substituted.
housing_data = datasets.load_boston()

# Shuffle the data
X, y = shuffle(housing_data.data, housing_data.target, random_state=7)

# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=7)
            
# AdaBoost Regressor model 
regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),  
            n_estimators=400, random_state=7) 
regressor.fit(X_train, y_train)

# Evaluate performance of AdaBoost regressor 
y_pred = regressor.predict(X_test) 
mse = mean_squared_error(y_test, y_pred) 
evs = explained_variance_score(y_test, y_pred ) 
print("\nADABOOST REGRESSOR") 
print("Mean squared error =", round(mse, 2)) 
print("Explained variance score =", round(evs, 2))

# Extract feature importances 
feature_importances = regressor.feature_importances_ 
feature_names = housing_data.feature_names 

# Normalize the importance values  
feature_importances = 100.0 * (feature_importances / max(feature_importances)) 

# Sort the values and flip them 
index_sorted = np.flipud(np.argsort(feature_importances))

# Arrange the X ticks
pos = np.arange(index_sorted.shape[0]) + 0.5 

# Plot the bar graph 
plt.figure() 
plt.bar(pos, feature_importances[index_sorted], align='center') 
plt.xticks(pos, feature_names[index_sorted]) 
plt.ylabel('Relative Importance') 
plt.title('Feature importance using AdaBoost regressor') 
plt.show()
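
Since the housing dataset is not available in recent scikit-learn releases, here is a self-contained sketch of the same idea on synthetic data. The data comes from make_regression with shuffle=False, so only the first two of the five features actually influence the target; the sample size, noise level, and number of estimators are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: features 0 and 1 are informative, 2 to 4 are noise
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       noise=1.0, shuffle=False, random_state=7)

regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                              n_estimators=50, random_state=7)
regressor.fit(X, y)

# Scale so the most important feature is 100, then sort in descending order
importances = regressor.feature_importances_
relative = 100.0 * importances / importances.max()
order = np.argsort(relative)[::-1]

for idx in order:
    print('feature', idx, '-->', round(float(relative[idx]), 1))
```

The informative features should come out on top, while the noise features get importances close to zero.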

8 Predicting traffic using Extremely Random Forests regressor
