Gradient Boosting Regressor
Introduction
Gradient Boosting Regressor is a potent machine learning algorithm that is part of the ensemble methods family, particularly within the boosting techniques category. It operates by sequentially incorporating predictors (commonly decision trees), with each one amending the errors of its predecessor. Unlike other boosting approaches that modify the weights of observations, Gradient Boosting concentrates on minimizing the loss function, a differentiable function that quantifies the model's error, making it highly effective for regression tasks.
Background and Theory
Gradient Boosting combines multiple weak learners (models that perform slightly better than random guessing) to create a strong learner in a sequential manner. The core principle is to construct new models that predict the residuals or errors of prior models and then add these new models to the ensemble.
Mathematical Foundations
Let's denote our dataset as $(x_i, y_i)$, where $x_i$ represents the features and $y_i$ the target variable for the $i$-th observation. The goal is to find a function $F(x)$ that maps input features $x$ to predictions as close as possible to the target values $y$.
The algorithm starts with a model that makes constant predictions:
$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
where $L(y_i, \gamma)$ is the loss function evaluated at the initial prediction $\gamma$ and the true target value $y_i$.
At each stage $m = 1$ to $M$, where $M$ is the total number of trees, the algorithm fits a new model $h_m(x)$ to the negative gradient of the loss function evaluated at the previous model's predictions:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
Then, it finds the best-fit model $h_m(x)$ for these residuals. The model $h_m(x)$ is typically a decision tree.
Next, it computes a multiplier $\gamma_m$ for the model $h_m(x)$ that minimizes the loss when added to the existing model $F_{m-1}(x)$:
$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)$$
The model is then updated:
$$F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$$
where $\nu$ is the learning rate, a parameter that scales the contribution of each $h_m(x)$.
The process repeats for $M$ iterations or until a stopping condition is reached, resulting in a final model:
$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} \gamma_m h_m(x)$$
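As a concrete check, consider the squared loss $L(y, F) = \frac{1}{2}(y - F)^2$ (a common convention; the factor $\frac{1}{2}$ merely simplifies the derivative). The pseudo-residuals then reduce to the ordinary residuals, so each new tree simply fits what the current ensemble still gets wrong:
$$r_{im} = -\left[\frac{\partial}{\partial F(x_i)} \tfrac{1}{2}\big(y_i - F(x_i)\big)^2\right]_{F(x) = F_{m-1}(x)} = y_i - F_{m-1}(x_i)$$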
Algorithmic Contemplation
Let's generalize the problem that the Gradient Boosting algorithm sets out to solve. We want to find the function $\hat{F} \in \mathcal{H}$ that minimizes the expected loss $L$, where $\mathcal{H} = \{F \mid F: \mathbb{R}^p \to \mathbb{R}\}$ and $p$ is the number of input features.
$$\hat{F} = \arg\min_{F \in \mathcal{H}} \mathbb{E}_{x, y}\big[L(y, F(x))\big]$$
Unfortunately, $\mathcal{H}$ is far too large a set for the algorithm to search for the optimal $\hat{F}$. Therefore, some constraints are needed, which restrict $\mathcal{H}$ to:
$$\mathcal{H}^* = \left\{ F \;\middle|\; F: \mathbb{R}^p \to \mathbb{R},\ F = C + \sum_{m=1}^{M} \gamma_m f_m,\ f_i \in \mathcal{H},\ \gamma_i \in \mathbb{R},\ i = 1, \dots, M \right\}$$
where $M$ is a natural number.
In short, rather than searching over arbitrary functions, we restrict ourselves to functions that can be expressed as a linear combination of a constant and several base functions.
Moreover, since the joint distribution of $x, y$ is unknown, the expected loss cannot be computed directly. Therefore, we instead find the $\hat{F}$ that minimizes the empirical loss, computed from a finite amount of data $(x_i, y_i)$, $i = 1, \dots, n$:
$$\hat{F} = \arg\min_{F \in \mathcal{H}^*} \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i))$$
This can be found by following several steps:
$$F_0 = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$
$$F_m = F_{m-1} + \arg\min_{f_m \in \mathcal{H}} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big)$$
At this point, Gradient Boosting reformulates the last equation by replacing the minimization with a gradient descent step, as follows:
$$F_m(x) = F_{m-1}(x) - \gamma \sum_{i=1}^{n} \nabla L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big)$$
$F_m$ can thus be approximated using a positive step size $\gamma$. This is why the term Gradient appears in the name of this boosting algorithm.
Here, the gradient can be computed as:
$$\nabla L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big) = \frac{\partial L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big)}{\partial f_m(x_i)}$$
which in most cases (except for the MSE loss) is very complicated to compute. Therefore, we take the first-order Taylor approximation with respect to $f_m$, which requires the loss function to be differentiable.
$$L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big) \approx L\big(y_i, F_{m-1}(x_i)\big) + f_m(x_i) \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
If we differentiate the right-hand side of the above equation with respect to $f_m(x_i)$, we can approximate $\nabla L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big)$ by:
$$\frac{\partial L\big(y_i, F_{m-1}(x_i) + f_m(x_i)\big)}{\partial f_m(x_i)} \approx \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
And therefore:
$$F_m(x) = F_{m-1}(x) - \gamma \sum_{i=1}^{n} \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
Additionally, we can optimize $\gamma$ in order to minimize the resulting loss:
$$\gamma_m = \arg\min_{\gamma > 0} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}\right)$$
As a result, we can define the pseudo-residual $r_{im}$ as follows:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \dots, n$$
$$\gamma_m = \arg\min_{\gamma > 0} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma r_{im}\big)$$
For the next step, we train a new function $g_m$, which takes $x_i$ as input and $r_{im}$ as output, i.e., it approximates $r_{im}$ as a function of $x_i$. Then we can redefine $\gamma_m$ as follows:
$$\gamma_m = \arg\min_{\gamma > 0} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma g_m(x_i)\big)$$
Consequently, $F_m$ can finally be set as $F_m(x) = F_{m-1}(x) + l \cdot \gamma_m \cdot g_m(x)$, where the shrinkage factor $l$ is introduced to help prevent overfitting.
Connecting this back to the earlier notation, the $g_m(x)$ fitted here play the role of the $h_m(x)$ from the Mathematical Foundations section, and $l$ corresponds to the learning rate $\nu$.
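To make the derivation concrete, here is a minimal sketch of a single boosting step for an arbitrary differentiable loss. It is an illustration only: it uses scikit-learn's DecisionTreeRegressor for $g_m$ and SciPy's minimize_scalar for the $\gamma_m$ line search, and the function names (boosting_step, squared_loss, squared_loss_grad) are hypothetical rather than part of luma's API.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, y, F_prev, loss, loss_grad, lr=0.1, max_depth=3):
    # Pseudo-residuals: r_im = -dL/dF evaluated at the current predictions
    r = -loss_grad(y, F_prev)
    # Fit g_m(x) to the pseudo-residuals (x_i -> r_im)
    g_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
    g_pred = g_m.predict(X)
    # Line search for gamma_m > 0 that minimizes the total loss
    result = minimize_scalar(lambda gamma: loss(y, F_prev + gamma * g_pred),
                             bounds=(1e-8, 10.0), method='bounded')
    gamma_m = result.x
    # Shrunken update: F_m = F_{m-1} + l * gamma_m * g_m(x)
    return F_prev + lr * gamma_m * g_pred, g_m, gamma_m

# Example usage with the squared loss L(y, F) = (y - F)^2
squared_loss = lambda y, F: np.sum((y - F) ** 2)
squared_loss_grad = lambda y, F: 2 * (F - y)

X = np.random.rand(100, 1)
y = np.sin(3 * X).ravel()
F = np.full(100, y.mean())  # F_0: the best constant model under the squared loss
F, g_1, gamma_1 = boosting_step(X, y, F, squared_loss, squared_loss_grad)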
Loss Functions
In the Gradient Boosting Regressor, the choice of the loss function is crucial as it directly influences the model's performance by determining how well the predictions fit the actual data. Here are three popular loss functions and their mathematical integration into the algorithm:
Mean Squared Error (MSE) Loss
For regression tasks where the target variable $y$ represents the true value and the model's prediction is denoted by $\hat{y}$, the Mean Squared Error (MSE) loss for a single observation is defined as:
$$L(y, \hat{y}) = (y - \hat{y})^2$$
When applying Gradient Boosting for regression, the negative gradient of the loss with respect to the model's prediction $\hat{y}$ (used to fit the next model) is:
$$r_{im} = -\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i} = 2(y_i - \hat{y}_i)$$
where $\hat{y}_i$ is the prediction for the $i$-th observation based on the model up to iteration $m-1$.
Mean Absolute Error (MAE) Loss
For the same regression context, the Mean Absolute Error (MAE) loss, which measures the magnitude of the error between the predicted and actual value without regard to its direction, is defined for a single observation as:
$$L(y, \hat{y}) = |y - \hat{y}|$$
The negative gradient for the MAE loss with respect to the model's prediction $\hat{y}$ is:
$$r_{im} = -\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i} = \operatorname{sign}(y_i - \hat{y}_i)$$
where $\operatorname{sign}(x)$ returns 1 if $x > 0$, $-1$ if $x < 0$, and 0 if $x = 0$, and $\hat{y}_i$ is as previously defined.
Huber Loss
Huber Loss is used in regression and is less sensitive to outliers in data than the squared error loss. It combines the properties of both MSE and MAE. The loss function for a single observation can be defined piecewise as:
$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \le \delta \\ \delta\,|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
where $\delta$ is a hyperparameter that dictates the transition between the squared loss and the absolute loss.
The negative gradient for the Huber loss with respect to $\hat{y}$ is:
$$r_{im} = -\frac{\partial L_\delta(y_i, \hat{y}_i)}{\partial \hat{y}_i} = \begin{cases} y_i - \hat{y}_i & \text{for } |y_i - \hat{y}_i| \le \delta \\ \delta \cdot \operatorname{sign}(y_i - \hat{y}_i) & \text{otherwise} \end{cases}$$
where $\hat{y}_i$ and $\delta$ are as defined above. This formulation allows the model's sensitivity to outliers to be adjusted during the gradient boosting process.
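The three pseudo-residual formulas above translate directly into code. The following NumPy sketch is purely illustrative; the function names are hypothetical and not tied to any particular library.
import numpy as np

def mse_negative_gradient(y, y_pred):
    # r = -d/dy_pred (y - y_pred)^2 = 2 * (y - y_pred)
    return 2.0 * (y - y_pred)

def mae_negative_gradient(y, y_pred):
    # r = -d/dy_pred |y - y_pred| = sign(y - y_pred)
    return np.sign(y - y_pred)

def huber_negative_gradient(y, y_pred, delta=1.0):
    # Quadratic region: r = y - y_pred; linear region: r = delta * sign(y - y_pred)
    diff = y - y_pred
    return np.where(np.abs(diff) <= delta, diff, delta * np.sign(diff))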
Procedural Steps
- Initialization: Start with a constant model $F_0(x)$.
- For $m = 1$ to $M$:
  - Compute the residuals $r_{im}$, the negative gradient of the loss function.
  - Fit a model $h_m(x)$ to these residuals.
  - Find the optimal multiplier $\gamma_m$.
  - Update the model: $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$.
- Output: The final model $F_M(x)$.
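Putting these steps together, a from-scratch sketch of the training loop might look as follows. This is a minimal illustration assuming the squared loss and scikit-learn's DecisionTreeRegressor as the base learner; the luma implementation documented below differs in its API and supported options.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # Step 1: for the squared loss, the optimal constant model F_0 is the mean of y
    F0 = y.mean()
    F = np.full(len(y), F0)
    trees = []
    for m in range(n_estimators):
        # Step 2a: pseudo-residuals (for the squared loss, the ordinary residuals)
        residuals = y - F
        # Step 2b: fit h_m(x) to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # Steps 2c-2d: F_m = F_{m-1} + nu * h_m(x); with the squared loss and a
        # regression tree, gamma_m is absorbed into the leaf values
        F = F + learning_rate * tree.predict(X)
        trees.append(tree)
    return F0, trees

def gb_predict(X, F0, trees, learning_rate=0.1):
    # Step 3: F_M(x) = F_0 + nu * sum of all tree contributions
    return F0 + learning_rate * sum(tree.predict(X) for tree in trees)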
Implementation
Parameters
base_estimator : Estimator, default = DecisionTreeRegressor()
Base estimator for training multiple models

n_estimators : int, default = 100
Number of base estimators to fit

learning_rate : float, default = 0.1
Shrinkage factor applied to the contribution of each estimator

subsampling : float, default = 1.0
Fraction of samples to be used for fitting each estimator

loss : Literal['mse', 'mae', 'huber'], default = 'mse'
Type of loss function

delta : float, default = 1.0
Threshold at which the Huber loss transitions from MSE-like to MAE-like behavior (used when loss='huber')
Examples
Test on a synthesized dataset of the curve $y = \sin 3x \cdot \cos e^{x/2} + \epsilon$:
from luma.ensemble.boost import GradientBoostingRegressor
from luma.preprocessing.scaler import StandardScaler
from luma.metric.regression import RootMeanSquaredError
from luma.visual.evaluation import ResidualPlot
import matplotlib.pyplot as plt
import numpy as np
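# Synthesize noisy samples of y = sin(3x) * cos(e^(x/2))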
np.random.seed(42)
X = np.linspace(-4, 4, 400).reshape(-1, 1)
y = (np.sin(3 * X) * np.cos(np.exp(X / 2))).flatten()
y += 0.15 * np.random.randn(400)
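# Standardize the target values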
sc = StandardScaler()
y_trans = sc.fit_transform(y)
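# Fit the gradient boosting regressor with the MAE loss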
gb = GradientBoostingRegressor(n_estimators=50,
                               learning_rate=0.1,
                               subsample=1.0,
                               loss='mae',
                               max_depth=3)
gb.fit(X, y_trans)
y_pred = gb.predict(X)
score = gb.score(X, y_trans, metric=RootMeanSquaredError)
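# Plot the individual trees, the ensemble prediction, and the residuals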
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
ax1.scatter(X, y_trans,
            s=10, c='black', alpha=0.3,
            label=r'$y=\sin{3x}\cdot\cos{e^{x/2}}+\epsilon$')
for tree in gb:
    ax1.plot(X, tree.predict(X), c='pink', alpha=0.1)
ax1.plot(X, y_pred, lw=2, c='crimson', label='Predicted Plot')
ax1.legend(loc='upper right')
ax1.set_xlabel('x')
ax1.set_ylabel('y (Standardized)')
ax1.set_title(f'GradientBoostingRegressor [RMSE: {score:.4f}]')
res = ResidualPlot(gb, X, y_trans)
res.plot(ax=ax2)
plt.tight_layout()
plt.show()
Applications
The Gradient Boosting Regressor is used in a variety of domains due to its flexibility and accuracy. Common applications include but are not limited to:
- Fraud detection in banking.
- Customer churn prediction.
- Disease outbreak prediction.
- Demand forecasting in retail.
Strengths and Limitations
Strengths
- Highly Accurate: Often provides very high accuracy across a wide range of applications.
- Flexibility: Can handle different types of data and is adaptable to various loss functions.
Limitations
- Prone to Overfitting: Especially with noisy data and without proper regularization.
- Computationally Intensive: Can be slower to train due to the sequential nature of boosting.
- Parameter Tuning: Requires careful tuning of parameters and stopping criteria to avoid overfitting and underfitting.
Advanced Topics
- Regularization Techniques: Techniques like shrinkage (learning rate) and stochastic gradient boosting can help prevent overfitting.
- Loss Functions: Beyond the typical use with classification and regression tasks, custom loss functions can be implemented for specific applications.