The Random Forest Regressor is an ensemble machine learning algorithm for regression tasks: it predicts a continuous value for a new observation by aggregating the predictions of multiple decision trees. The method is based on the principle of bagging (Bootstrap Aggregating), which improves the stability and accuracy of machine learning models by combining the predictions of several models trained on resampled versions of the data.
Random Forests are an extension of Decision Trees. A Decision Tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The main problem with decision trees, especially deep ones, is that they tend to overfit the training data, making them poor at generalizing to unseen data.
Random Forest mitigates this by building a 'forest' of trees, each trained on a random subset of the data and features, which makes the model less sensitive to noise in the training data and reduces overfitting. For regression tasks, the final prediction is typically the average of the predictions from all trees.
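The variance-reduction effect of this averaging can be seen with a toy simulation. The sketch below is plain NumPy and not part of luma; it treats each tree as an unbiased but noisy estimator, so averaging 25 of them cuts the variance roughly 25-fold when their errors are independent.

```python
import numpy as np

# Toy illustration (not part of luma): averaging many noisy, unbiased
# estimators shrinks the variance of the combined estimate, which is
# the statistical idea behind bagging.
rng = np.random.default_rng(42)

true_value = 1.0
n_estimators, n_repeats = 25, 10_000

# Each "estimator" is the true value plus independent unit-variance noise,
# standing in for a single overfit decision tree's prediction.
single = true_value + rng.normal(0.0, 1.0, size=n_repeats)
averaged = true_value + rng.normal(0.0, 1.0, size=(n_repeats, n_estimators)).mean(axis=1)

print(f'Variance of a single estimator : {single.var():.4f}')    # ~1.0
print(f'Variance of a 25-tree average  : {averaged.var():.4f}')  # ~1/25
```

In a real forest the trees' errors are correlated because they see overlapping data, so the reduction is smaller; the bootstrap and random feature subsets exist precisely to decorrelate the trees.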
The algorithm follows several key steps (a from-scratch sketch of these steps appears after the list):

1. Draw a bootstrap sample (random sampling with replacement) of the training data for each tree.
2. Grow a decision tree on each sample, optionally considering only a random subset of the features at each split.
3. Repeat until the desired number of trees has been built.
4. To predict, pass the new observation through every tree and average the individual predictions.
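The sketch below is a minimal, self-contained rendering of these steps, with one-split 'stumps' standing in for luma's full decision trees; it is illustrative only, not luma's implementation.

```python
import numpy as np

def fit_stump(X, y):
    """Fit a depth-1 regression tree: one threshold, two leaf means."""
    best = None
    for threshold in np.unique(X[:, 0]):
        left, right = y[X[:, 0] <= threshold], y[X[:, 0] > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_mean, right_mean = best
    return lambda X_new: np.where(X_new[:, 0] <= threshold, left_mean, right_mean)

def fit_bagged_stumps(X, y, n_trees=100, seed=42):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):                        # step 3: repeat n_trees times
        idx = rng.integers(0, len(X), size=len(X))  # step 1: bootstrap sample
        trees.append(fit_stump(X[idx], y[idx]))     # step 2: fit a tree on it
        # (random feature subsets are omitted since X has a single feature)
    # step 4: predict by averaging the individual trees' outputs
    return lambda X_new: np.mean([tree(X_new) for tree in trees], axis=0)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.normal(size=200)
predict = fit_bagged_stumps(X, y)
print(predict(X[:5]))  # smooth averaged predictions near sin(x)
```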
Mathematically, if we denote $y$ as the target variable and $\mathbf{x}$ as the input features, the prediction for a new observation is given by:

$$
\hat{y} = \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{x})
$$

where $f_t(\mathbf{x})$ is the prediction of the $t$-th tree, and $T$ is the number of trees in the forest.
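As a quick worked example of this formula: with $T = 3$ trees predicting $f_1(\mathbf{x}) = 2.0$, $f_2(\mathbf{x}) = 2.4$, and $f_3(\mathbf{x}) = 1.9$ (hypothetical values), the forest outputs $\hat{y} = (2.0 + 2.4 + 1.9)/3 = 2.1$.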
The `RandomForestRegressor` estimator exposes the following hyperparameters:

- `n_trees` : int, default = 100
- `max_depth` : int, default = 10
- `min_samples_split` : int, default = 2
- `min_samples_leaf` : int, default = 1
- `max_features` : int, default = None
- `min_variance_decrease` : float, default = 0.0
- `max_leaf_nodes` : int, default = 1
- `bootstrap` : bool, default = True
- `bootstrap_feature` : bool, default = False
- `n_features` : int or Literal['auto'], default = 'auto'

Test on the synthesized dataset of the curve $y = \cos{2x} \cdot \sin{e^{-\left\lceil x/2 \right\rceil}} + \epsilon$:
```python
from luma.ensemble.forest import RandomForestRegressor
from luma.preprocessing.scaler import StandardScaler
from luma.metric.regression import RootMeanSquaredError
from luma.visual.evaluation import ResidualPlot

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Synthesize a noisy 1-D regression dataset from the curve above
X = np.linspace(-4, 4, 400).reshape(-1, 1)
y = (np.cos(2 * X) * np.sin(np.exp(-np.ceil(X / 2)))).flatten()
y += 0.2 * np.random.randn(400)

# Standardize the target values
sc = StandardScaler()
y_trans = sc.fit_transform(y)

# Fit a forest of 10 trees, each grown on a bootstrap sample
forest = RandomForestRegressor(n_trees=10,
                               max_depth=7,
                               bootstrap=True)
forest.fit(X, y_trans)

y_pred = forest.predict(X)
score = forest.score(X, y_trans, metric=RootMeanSquaredError)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# Left panel: data, each individual tree's fit, and the averaged fit
ax1.scatter(X, y_trans,
            s=10, c='black', alpha=0.3,
            label=r'$y=\cos{2x}\cdot$' +
                  r'$\sin{e^{-\left\lceil x/2\right\rceil}}+\epsilon$')

for tree in forest.trees:
    ax1.plot(X, tree.predict(X), c='violet', alpha=0.2)

ax1.plot(X, y_pred, lw=2, c='purple', label='Predicted Plot')
ax1.legend(loc='upper right')
ax1.set_xlabel('x')
ax1.set_ylabel('y (Standardized)')
ax1.set_title(f'RandomForestRegressor [RMSE: {score:.4f}]')

# Right panel: residuals of the forest's predictions
res = ResidualPlot(forest, X, y_trans)
res.plot(ax=ax2)

plt.tight_layout()
plt.show()
```
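Note that the example fits and scores on the same data, so the reported RMSE is optimistic. Continuing the session above, a simple hold-out evaluation can be sketched with a plain NumPy split (the split itself is not a luma utility; only the `fit` and `score` calls already shown are assumed):

```python
# Hold out 25% of the points for testing with a random permutation.
idx = np.random.permutation(len(X))
train, test = idx[:300], idx[300:]

holdout_forest = RandomForestRegressor(n_trees=10, max_depth=7, bootstrap=True)
holdout_forest.fit(X[train], y_trans[train])

test_rmse = holdout_forest.score(X[test], y_trans[test], metric=RootMeanSquaredError)
print(f'Hold-out RMSE: {test_rmse:.4f}')
```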
- Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
- Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R news 2.3 (2002): 18-22.