[Ensemble] Gradient Boosting Classifier

안암동컴맹·2024년 3월 10일
0

Machine Learning

목록 보기
40/103

Gradient Boosting Classifier

Introduction

Gradient Boosting Classifier is a powerful machine learning algorithm that belongs to the ensemble methods family, specifically to boosting techniques. It works by sequentially adding predictors (typically decision trees), where each one corrects its predecessor. Unlike other boosting methods that adjust the weights of observations, Gradient Boosting focuses on minimizing the loss function, a differentiable function that measures the error of the model.

Background and Theory

Gradient Boosting combines multiple weak learners (models that perform slightly better than random guessing) to create a strong learner in a sequential manner. The core principle is to construct new models that predict the residuals or errors of prior models and then add these new models to the ensemble.

Mathematical Foundations

Let's denote our data set as (xi,yi)(x_i, y_i), where xix_i represents the features and yiy_i the target variable for the ithi^\text{th} observation. The goal is to find a function F(x)F(x) that maps input features xx to predictions as close as possible to the target values yy.

The algorithm starts with a model that makes constant predictions:

F0(x)=arg minγi=1NL(yi,γ)F_0(x) = \text{arg min}\gamma \sum_{i=1}^{N} L(y_i, \gamma)

where L(yi,γ)L(y_i, \gamma) is the loss function evaluated at the initial prediction γ\gamma and the true target value yiy_i.

At each stage m=1m = 1 to MM, where MM is the total number of trees, the algorithm fits a new model hm(x)h_m(x) to the negative gradient of the loss function evaluated at the previous model's predictions:

rim=[L(yi,F(xi))F(xi)]F(x)=Fm1(x)r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F{m-1}(x)}

Then, it finds the best-fit model hm(x)h_m(x) to these residuals. The model hm(x)h_m(x) is typically a decision tree.

Next, it computes a multiplier γm\gamma_m for the model hm(x)h_m(x) that minimizes the loss when added to the existing model Fm1(x)F_{m-1}(x):

γm=arg min γi=1NL(yi,Fm1(xi)+γhm(xi))\gamma_m = \text{arg min}~\gamma \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))

The model is then updated:

Fm(x)=Fm1(x)+νγmhm(x)F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)

where ν\nu is the learning rate, a parameter that scales the contribution of each hm(x)h_m(x).

The process repeats for MM iterations or until a stopping condition is reached, resulting in a final model:

FM(x)=F0(x)+νm=1Mγmhm(x)F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} \gamma_m h_m(x)

Algorithmic Contemplation

Let’s try generalizing the problem Gradient Boosting algorithm is about to solve. We want to find the function F^H\hat{F}\in\mathcal{H} which minimizes the expected loss function in terms of pp-dimensional loss function LL, where H={FF:RpR}\mathcal{H}=\{F\mid F:\mathbb{R}^p\rightarrow\mathbb{R}\}.

F^=arg minFH Ex,y[L(y,F(x))]\hat{F}=\underset{F\in\mathcal{H}}{\text{arg min}}~E_{x,y}[L(y,F(x))]

Unfortunately, H\mathcal{H} is an enormous set for the algorithm to find optimal F^\hat{F}. Therefore, some constraints are needed, which constrains H\mathcal{H} to:

H={FF:RpR, F=C+m=1Mγmfm, fiH, γiR, i=1,,M}\mathcal{H}^*=\left\{ F\mid F: \mathbb{R}^p\rightarrow\mathbb{R},~F=C+\sum_{m=1}^M\gamma_mf_m,~f_i\in\mathcal{H},~\gamma_i\in \mathbb{R},~i=1,\ldots,M \right\}

where MM is a natural number.

In short, a structure which can be expressed through a linear combination of a constant and several functions is added, rather than a simple generic function.

Moreover, since the distributions of x,yx,y are unknown, it is highly demanding to figure out the expected loss. Therefore, we have to find F^\hat{F} that minimizes the experienced loss function, which can be derived by utilizing a finite amount of data (xi,yi), i=1,,n(x_i,y_i),~i=1,\ldots,n.

F^=arg minFH1ni=1nL(yi,F(xi))\hat{F}=\underset{F\in\mathcal{H}^*}{\text{arg min}}\frac{1}{n}\sum_{i=1}^nL(y_i,F(x_i))

This can be found by following several steps:

F0=arg minγi=1nL(yi,γ)F_0=\underset{\gamma}{\text{arg min}}\sum_{i=1}^nL(y_i,\gamma)
Fm=Fm1+arg minfmHi=1nL(yi,Fm1(Xi)+fm(xi))F_m=F_{m-1}+\underset{f_m\in\mathcal{H}}{\text{arg min}}\sum_{i=1}^nL(y_i,F_{m-1}(X_i)+f_m(x_i))

At this point, Gradient Boosting reforms the last equation by substituting the minimizing part with the gradient descent method as following:

Fm(x)=Fm1(x)γi=1nL(yi,Fm1(Xi)+fm(xi))F_m(x)=F_{m-1}(x)-\gamma\sum_{i=1}^n\nabla L(y_i,F_{m-1}(X_i)+f_m(x_i))

FmF_m can be approximated with respect to the positive integer γ\gamma. This is the reason why the term Gradient is included in the name of this boosting algorithm.

Here, the gradient can be computed as:

L(yi,Fm1(Xi)+fm(xi))=L(yi,Fm1(Xi)+fm(xi))fm(xi)\nabla L(y_i,F_{m-1}(X_i)+f_m(x_i))=\frac{\partial L(y_i,F_{m-1}(X_i)+f_m(x_i))}{\partial f_m(x_i)}

which in most cases(except for MSE loss), are very complitated to compute. Therefore, we take the 1st-order Taylor approximation with respect to fmf_m, which necessarily requires the loss function to be differentiable.

L(yi,Fm1(Xi)+fm(xi)) L(yi,Fm1(Xi))+fm(xi)[L(yi,F(xi))F(xi)]F(x)=Fm1(x)L(y_i, F_{m-1}(X_i) + f_m(x_i)) \approx \ L(y_i, F_{m-1}(X_i)) + f_m(x_i) \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)}

If we differentiate the right-hand term of the above equation with respect to fm(xi)f_m(x_i), we can approximate the L(yi,Fm1(Xi)+fm(xi))\nabla L(y_i,F_{m-1}(X_i)+f_m(x_i)) by:

L(yi,Fm1(Xi)+fm(xi))fm(xi)[L(yi,F(xi))F(xi)]F(x)=Fm1(x)\frac{\partial L(y_i,F_{m-1}(X_i)+f_m(x_i))}{\partial f_m(x_i)}\approx\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)}

And therefore:

Fm(x)=Fm1(x)γi=1n[L(yi,F(xi))F(xi)]F(x)=Fm1(x)F_m(x)=F_{m-1}(x)-\gamma\sum_{i=1}^n\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)}

Additionaly, we can optimize γ\gamma in order to minimize the residuals:

γm=arg minγ>0i=1nL(yi,Fm1(xi)γ[L(yi,F(xi))F(xi)]F(x)=Fm1(x))\gamma_m=\underset{\gamma>0}{\text{arg min}}\sum_{i=1}^nL\left( y_i,F_{m-1}(x_i)-\gamma\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} \right)

As a result, we can define the pseudo-residual rimr_{im} as following:

rim=[L(yi,F(xi))F(xi)]F(x)=Fm1(x)for i=1,,nr_{im}=-\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)}\quad \text{for}~i=1,\ldots,n
γm=arg minγ>0i=1nL(yi,Fm1(xi)+γrim)\gamma_m=\underset{\gamma>0}{\text{arg min}}\sum_{i=1}^nL\left( y_i,F_{m-1}(x_i)+\gamma r_{im} \right)

For the next step, we train a new function gmg_m, which accepts xix_i as an input variable and rimr_{im} as an output variable, by approximating rimr_{im} with respect to xix_i. Then we can redefine γm\gamma_m as following:

γm=arg minγ>0i=1nL(yi,Fm1(xi)+γgm(xi))\gamma_m=\underset{\gamma>0}{\text{arg min}}\sum_{i=1}^nL\left( y_i,F_{m-1}(x_i)+\gamma g_m(x_i) \right)

Consequently, FmF_m can be finally set as Fm(x)=Fm1(x)+lγmgm(x)F_m(x)=F_{m-1}(x)+l\cdot\gamma_m\cdot g_m(x) by adopting ll, which is for the prevention of overfitting.

From previous explanations, hi(x), i=1,,Mh_i(x),~i=1,\ldots,M are in fact γigi(x)\gamma_i\cdot g_i(x).

Loss Functions

In the Gradient Boosting Classifier, the choice of the loss function is crucial as it directly influences the model's performance by determining how well the predictions fit the actual data. Here are two popular loss functions and their mathematical integration into the algorithm:

Binary Cross-Entropy Loss

For a binary classification with target variable y{0,1}y \in \{0,1\} and predicted probability p=σ(F(x))p = \sigma(F(x)) where σ()\sigma(\cdot) is the sigmoid function, the Binary Cross-Entropy loss for a single observation is given by:

L(y,p)=ylog(p)(1y)log(1p)L(y, p) = -y\log(p) - (1 - y)\log(1 - p)

In the context of Gradient Boosting, the gradient (negative gradient of the loss with respect to F(x)F(x)) that we fit the next model to is:

rim=L(yi,p)F(xi)=yipir_{im} = -\frac{\partial L(y_i, p)}{\partial F(x_i)} = y_i - p_i

where pi=σ(Fm1(xi))p_i = \sigma(F_{m-1}(x_i)) is the predicted probability based on the model at iteration m1m-1.

Multinomial Deviance Loss

For a classification problem with KK classes, let yy be the true class encoded as a one-hot vector (y1,y2,,yK)(y_1, y_2, \ldots, y_K), where yk=1y_k = 1 for the correct class and yj=0y_j = 0 for all jkj \neq k. Let pk=P(Y=kX=x)p_k = P(Y = k \mid X = x) be the predicted probability that observation X=xX = x belongs to class kk. The Multinomial Deviance Loss for a single observation is given by:

L(y,p)=k=1Kyklog(pk)L(y, p) = -\sum_{k=1}^{K} y_k \log(p_k)

The gradient of this loss function with respect to the model's output Fk(x)F_k(x) for class kk is:

rimk=ykpkr_{imk} = y_k - p_k

where pk=eFk(x)/j=1KeFj(x)p_k = {e^{F_k(x)}}/{\sum_{j=1}^{K} e^{F_j(x)}} is the predicted probability for class kk, computed using the softmax function over the outputs for all classes.

Exponential Loss

For a binary classification problem with target variable y{1,1}y \in \{-1,1\} (note the difference in encoding compared to the binary cross-entropy loss, where targets are usually 00 and 11), and a prediction model F(x)F(x), the Exponential Loss for a single observation is given by:

L(y,F(x))=exp(yF(x))L(y, F(x)) = \exp(-y F(x))

The negative gradient of this loss function with respect to the prediction F(x)F(x), which is what the next model in the sequence will predict, is given by:

rim=L(yi,F(xi))F(xi)=yiexp(yiFm1(xi))r_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} = y_i \exp(-y_i F_{m-1}(x_i))

This formulation effectively weights the update of each point by how poorly it was predicted by the previous model: the worse the prediction (in terms of sign and magnitude of the margin yF(x)y F(x)), the higher the weight of the corresponding point for the next model to focus on.

Procedural Steps

  1. Initialization: Start with a constant model F0(x)F_0(x).
  2. For m=1m = 1 to MM:
    • Compute the residuals rimr_{im}, the negative gradient of the loss function.
    • Fit a model hm(x)h_m(x) to these residuals.
    • Find the optimal multiplier γm\gamma_m.
    • Update the model Fm(x)=Fm1(x)+νγmhm(x)F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x).
  3. Output: The final model FM(x)F_M(x).

Implementation

Parameters

  • base_estimator: Estimator, default = DecisionTreeRegressor()
    Base estimator for training multiple models
  • n_estimators: int, default = 100
    Number of base estimators to fit
  • learning_rate: float, default = 0.1
    Shrinking factor of the contribution of each estimator
  • subsampling: float, default = 1.0
    Fraction of samples to be used for fitting each estimator

Examples

Test on standardized and reduced(Rn×64Rn×2\mathbb{R}^{n\times64}\rightarrow\mathbb{R}^{n\times2}) digits dataset with only 5 classes:

from luma.ensemble.boost import GradientBoostingClassifier
from luma.preprocessing.scaler import StandardScaler
from luma.model_selection.split import TrainTestSplit
from luma.reduction.linear import KernelPCA
from luma.visual.evaluation import DecisionRegion, ROCCurve

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

X, y = load_digits(n_class=5, return_X_y=True)
indices = np.random.choice(X.shape[0], size=500)

X_sample = X[indices]
y_sample = y[indices]

sc = StandardScaler()
X_sample_std = sc.fit_transform(X_sample)

kpca = KernelPCA(n_components=2, gamma=0.01, kernel='rbf')
X_kpca = kpca.fit_transform(X_sample_std)

X_train, X_test, y_train, y_test = TrainTestSplit(X_kpca, y_sample, 
                                                  test_size=0.3).get

gb = GradientBoostingClassifier(n_estimators=100,
                                learning_rate=0.1,
                                subsample=1.0,
                                max_depth=3)

gb.fit(X_train, y_train)

X_cat = np.concatenate((X_train, X_test))
y_cat = np.concatenate((y_train, y_test))

score = gb.score(X_cat, y_cat)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

dec = DecisionRegion(gb, X_cat, y_cat)
dec.plot(ax=ax1)
ax1.set_title(f'GradientBoostingClassifier [Acc: {score:.4f}]')

pr = ROCCurve(y_cat, gb.predict_proba(X_cat))
pr.plot(ax=ax2, show=True)

Applications

Gradient Boosting Classifier is used in a variety of domains due to its flexibility and accuracy. Common applications include but are not limited to:

  • Fraud detection in banking.
  • Customer churn prediction.
  • Disease outbreak prediction.
  • Demand forecasting in retail.

Strengths and Limitations

Strengths

  • Highly Accurate: Often provides very high accuracy across a wide range of applications.
  • Flexibility: Can handle different types of data and is adaptable to various loss functions.

Limitations

  • Prone to Overfitting: Especially with noisy data and without proper regularization.
  • Computationally Intensive: Can be slower to train due to the sequential nature of boosting.
  • Parameter Tuning: Requires careful tuning of parameters and stopping criteria to avoid overfitting and underfitting.

Advanced Topics

  • Regularization Techniques: Techniques like shrinkage (learning rate) and stochastic gradient boosting can help prevent overfitting.
  • Loss Functions: Beyond the typical use with classification and regression tasks, custom loss functions can be implemented for specific applications.

References

  1. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics.
  2. Natekin, A., & Knoll, A. (2013). "Gradient boosting machines, a tutorial." Frontiers in Neurorobotics.
profile
𝖪𝗈𝗋𝖾𝖺 𝖴𝗇𝗂𝗏. 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝗋 𝖲𝖼𝗂𝖾𝗇𝖼𝖾 & 𝖤𝗇𝗀𝗂𝗇𝖾𝖾𝗋𝗂𝗇𝗀

0개의 댓글