The Random Forest Classifier is a powerful and versatile machine learning algorithm used for classification tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees. Random forests correct for decision trees' habit of overfitting to their training set, providing a more generalized model.
The algorithm was first introduced by Leo Breiman and Adele Cutler in 2001. Random Forests belong to the ensemble learning family, which means they use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
A Random Forest Classifier builds multiple decision trees and merges them together to get a more accurate and stable prediction. The core concept behind random forests is the idea of "bagging" or Bootstrap Aggregating, where multiple models (in this case, trees) are trained on different parts of the same training set and then averaged to improve the stability and accuracy.
Each tree in the forest is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Additionally, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result, the correlation between trees in the forest is decreased, leading to a decrease in the forest's variance without an increase in bias.
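To make these two sources of randomness concrete, here is a minimal sketch of how a single tree's training data and a node's candidate features might be drawn. This is not luma's internals; the helper names are hypothetical and the data is a toy example:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_sample(X, y, rng):
    # Draw n indices with replacement: on average ~63% of unique rows appear.
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    return X[idx], y[idx]

def candidate_features(n_features, k, rng):
    # At each split, only a random subset of k features is considered,
    # which decorrelates the trees in the forest.
    return rng.choice(n_features, size=k, replace=False)

X = rng.normal(size=(150, 8))          # toy data: 150 samples, 8 features
y = rng.integers(0, 2, size=150)       # toy binary labels

X_boot, y_boot = bootstrap_sample(X, y, rng)
feats = candidate_features(X.shape[1], k=3, rng=rng)
print(feats)  # e.g., the 3 feature indices examined at one node
```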
Given a training set $X = \{x_1, x_2, \ldots, x_n\}$ with corresponding labels $y = \{y_1, y_2, \ldots, y_n\}$, a random forest classifier constructs a collection of decision trees $\{T_1, T_2, \ldots, T_B\}$, each trained on a bootstrap sample of the training data.
The prediction of the random forest, $\hat{y}$, for an input sample $x$ is given by:
$$\hat{y} = \text{mode}\{T_1(x), T_2(x), \ldots, T_B(x)\}$$
where $\text{mode}$ denotes the statistical mode (i.e., the most frequent label among the predictions of the individual trees).
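Numerically, this aggregation step is just a majority vote over the per-tree predictions. A minimal NumPy/SciPy illustration, assuming each row below holds one tree's predicted labels for four input samples:

```python
import numpy as np
from scipy import stats

# Predictions of B = 5 trees for 4 input samples (one row per tree).
tree_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 1],
    [0, 1, 2, 0],
])

# Column-wise mode = the forest's majority-vote prediction per sample.
y_hat = stats.mode(tree_preds, axis=0, keepdims=False).mode
print(y_hat)  # [0 1 2 1]
```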
Parameters:

- `n_trees` : int, default = 100
- `max_depth` : int, default = 10
- `criterion` : Literal['gini', 'entropy'], default = 'gini'
- `min_samples_split` : int, default = 2
- `min_samples_leaf` : int, default = 1
- `max_features` : int, default = None
- `min_impurity_decrease` : float, default = 0.0
- `max_leaf_nodes` : int, default = 1
- `bootstrap` : bool, default = True
- `bootstrap_feature` : bool, default = False
- `n_features` : int | Literal['auto'], default = 'auto'

Test on the standardized and dimensionality-reduced (via `KernelPCA`) digits dataset with only 5 classes:
from luma.ensemble.forest import RandomForestClassifier
from luma.preprocessing.scaler import StandardScaler
from luma.model_selection.split import TrainTestSplit
from luma.reduction.linear import KernelPCA
from luma.visual.evaluation import DecisionRegion, ConfusionMatrix

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Load the first 5 classes of the digits dataset and draw
# a random subsample of 500 rows (without replacement).
X, y = load_digits(n_class=5, return_X_y=True)
indices = np.random.choice(X.shape[0], size=500, replace=False)

X_sample = X[indices]
y_sample = y[indices]

# Standardize features, then project onto 2 components with RBF kernel PCA.
sc = StandardScaler()
X_sample_std = sc.fit_transform(X_sample)

kpca = KernelPCA(n_components=2, gamma=0.01, kernel='rbf')
X_kpca = kpca.fit_transform(X_sample_std)

X_train, X_test, y_train, y_test = TrainTestSplit(X_kpca, y_sample,
                                                  test_size=0.3).get

forest = RandomForestClassifier(n_trees=10,
                                max_depth=100,
                                criterion='gini',
                                min_impurity_decrease=0.01,
                                bootstrap=True)
forest.fit(X_train, y_train)

# Score on the full dataset (train + test combined) for the plot title.
X_cat = np.concatenate((X_train, X_test))
y_cat = np.concatenate((y_train, y_test))
score = forest.score(X_cat, y_cat)

# Left panel: decision regions; right panel: confusion matrix.
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

dec = DecisionRegion(forest, X_cat, y_cat, cmap='Spectral')
dec.plot(ax=ax1)
ax1.set_title(f'RandomForestClassifier [Acc: {score:.4f}]')

conf = ConfusionMatrix(y_cat, forest.predict(X_cat), cmap='BuPu')
conf.plot(ax=ax2, show=True)
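Note that the accuracy above is computed over the combined train and test folds, which is convenient for labeling the plot but optimistic as a performance estimate. A quick sanity check on the held-out split alone, using the same `score` method as above:

```python
# Accuracy on the 30% held-out test split only (unseen during fitting).
test_score = forest.score(X_test, y_test)
print(f'Hold-out accuracy: {test_score:.4f}')
```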
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
- Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.