A Decision Tree Classifier is a non-parametric supervised learning method used for classification (the same tree-based approach also handles regression). It models a sequence of feature-based decisions and their outcomes as a tree structure. Decision Trees are easy to interpret, handle both categorical and numerical data, and can visually represent the decision-making process, making them a popular choice across various domains.
The core idea behind Decision Trees is to mimic human decision-making by splitting the data into subsets based on feature values. This process is repeated recursively, resulting in a tree-like model of decisions. At each node, the split is chosen as the feature (and threshold) that best separates the data according to a specific criterion, such as Gini impurity or information gain.
When constructing a Decision Tree Classifier, the mathematical formulations governing the selection of splits are crucial for understanding its decision-making process. Let's delve deeper into these formulations, particularly focusing on Gini impurity, Entropy, and Information Gain, as these are central to how decision trees decide where to split the data.
Gini Impurity measures how often a randomly chosen element of the set would be mislabeled if it were labeled randomly according to the distribution of labels in the subset. For a set $S$ with $C$ classes, the Gini Impurity is:

$$Gini(S) = 1 - \sum_{i=1}^{C} p_i^2$$

where $p_i$ is the proportion of elements in $S$ belonging to class $i$. For a binary classification problem, this simplifies to $Gini(S) = 2p(1 - p)$, where $p$ is the proportion of elements of one class. The goal is to minimize Gini impurity at each split, which corresponds to increasing the purity of the subsets formed by the split.
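To make the formula concrete, here is a minimal NumPy sketch; the helper name gini_impurity is used only for this illustration and is not part of luma.

```python
import numpy as np

def gini_impurity(labels: np.ndarray) -> float:
    """Gini impurity of a 1-D array of class labels: 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0 -> pure node
print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5 -> two balanced classes (2 * 0.5 * 0.5)
```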
Entropy measures the amount of disorder or uncertainty in the data. In the context of a decision tree, it quantifies the impurity or randomness in the class labels of the dataset. The entropy of a set $S$ with $C$ classes is defined as:

$$H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

where $p_i$ is the proportion of elements in $S$ belonging to class $i$.
Entropy reaches its maximum value when all classes in the set are equally represented, indicating the highest uncertainty. A set with only one class (pure) has an entropy of 0.
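Analogously, a minimal sketch of the entropy computation (again, the helper name is illustrative only):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

print(entropy(np.array([1, 1, 1, 1])))  # 0.0 -> pure set
print(entropy(np.array([0, 1, 0, 1])))  # 1.0 -> two equally represented classes
```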
Information Gain is the reduction in entropy (impurity) achieved by partitioning the data according to a particular attribute. It measures how well a given attribute separates the training examples according to their target classification. The Information Gain for a split of set $S$ on attribute $A$ is calculated as:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

where $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.
The attribute that maximizes the Information Gain is chosen for the split at each node of the tree. This process aims to increase the homogeneity of the subsets, which ideally would have entropy values moving closer to 0 (increasing purity).
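Putting the two quantities together, here is a sketch of the information gain of a candidate partition; the helpers are defined only for this example.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, subsets):
    """Entropy of the parent set minus the size-weighted entropy of its subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = np.array([0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 1]), np.array([1, 1, 1])
print(round(information_gain(parent, [left, right]), 3))  # 0.459
```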
Decision Trees use different algorithms to decide on the best split. The most common are ID3 (Iterative Dichotomiser 3), C4.5 (a successor of ID3), and CART (Classification and Regression Trees). Each algorithm has its own methodology for selecting the best split:

- ID3 chooses the attribute with the highest Information Gain.
- C4.5 uses the Gain Ratio, which normalizes Information Gain by the intrinsic information of the split to reduce the bias toward many-valued attributes.
- CART builds binary trees using Gini Impurity for classification (and variance reduction for regression).
The process of determining the best split involves calculating the chosen metric (e.g., Gini Impurity, Entropy) for every possible split and then selecting the one that results in the highest gain (for Information Gain) or the lowest impurity (for Gini Impurity).
Suppose we're considering a split on a particular feature. We calculate the metric for each subset that would result from the split and then combine these to get a weighted sum or average that represents the entire dataset after the split. The improvement (or gain) from a split is the difference between the metric before the split and this weighted sum after the split.
For Information Gain specifically, the gain for a potential split on attribute $A$ is calculated as follows:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$
For Gini Impurity, the improvement from a potential split can be measured as the decrease in impurity:

$$\Delta Gini(S, A) = Gini(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, Gini(S_v)$$
The split that provides the highest Information Gain or the largest decrease in Gini Impurity is selected as the best split.
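The sketch below illustrates this procedure for a single numeric feature, scanning candidate thresholds and keeping the one with the largest Gini decrease; gini and best_threshold are illustrative helpers, not luma API.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_threshold(x, y):
    """Threshold on feature x giving the largest decrease in weighted Gini impurity."""
    parent = gini(y)
    best_gain, best_t = 0.0, None
    for t in np.unique(x)[:-1]:  # candidate cut points between observed values
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - weighted > best_gain:
            best_gain, best_t = parent - weighted, t
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, gain = best_threshold(x, y)
print(t, gain)  # 3.0 0.5 -> splitting at x <= 3.0 yields two pure children
```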
Let's consider a practical example with a small dataset containing two features, say $X_1$ and $X_2$, where we want to determine the best feature to split on at the current node; a worked sketch with made-up numbers follows below.
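The dataset for this example is not given, so the sketch below uses a hypothetical node of eight samples in which $X_1$ tracks the class label closely while $X_2$ is nearly uninformative; all names and numbers are made up for illustration.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(y, feature):
    """Information gain of splitting labels y on a categorical feature."""
    weighted = sum(np.mean(feature == v) * entropy(y[feature == v])
                   for v in np.unique(feature))
    return entropy(y) - weighted

X1 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X2 = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

print(round(info_gain(y, X1), 3))  # 0.549
print(round(info_gain(y, X2), 3))  # 0.049
```

Because $X_1$ yields the larger Information Gain, it would be chosen for the split at this node.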
In summary, the mathematical formulation and consideration for the best split in a Decision Tree involve analyzing the potential decrease in impurity or increase in information gain across all possible splits. The choice of metric and the method for calculating this improvement are fundamental to the tree's ability to learn from the training data effectively and generalize well to unseen data.
- max_depth: int, default = 10
- criterion: Literal['gini', 'entropy'], default = 'gini'
- min_samples_split: int, default = 2
- min_samples_leaf: int, default = 1
- max_features: int, default = None
- min_impurity_decrease: float, default = 0.01
- max_leaf_nodes: int, default = None
- random_state: int, default = None
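As a minimal usage sketch, and assuming the constructor accepts the parameters listed above as keyword arguments, a classifier could be configured directly like this:

```python
from luma.classifier.tree import DecisionTreeClassifier

# Values mirror the documented defaults; random_state is set only for this sketch
tree = DecisionTreeClassifier(max_depth=10,
                              criterion='gini',
                              min_samples_split=2,
                              min_samples_leaf=1,
                              min_impurity_decrease=0.01,
                              random_state=42)
```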
Test on synthesized dataset with PCA and RandomizedSearchCV:
from luma.classifier.tree import DecisionTreeClassifier
from luma.preprocessing.scaler import StandardScaler
from luma.reduction.linear import PCA
from luma.model_selection.split import TrainTestSplit
from luma.model_selection.search import RandomizedSearchCV
from luma.visual.evaluation import DecisionRegion, ConfusionMatrix
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
# Synthesize a 4-class dataset with 5 features
X, y = make_classification(n_samples=400,
                           n_classes=4,
                           n_features=5,
                           n_informative=4,
                           n_redundant=1,
                           n_clusters_per_class=1,
                           class_sep=2.0,
                           random_state=0)
X_train, X_test, y_train, y_test = TrainTestSplit(X, y,
                                                  test_size=0.2,
                                                  random_state=42).get
# Scale and project with statistics/components fitted on the training set only
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
# Randomized hyperparameter search with 5-fold cross-validation
param_dist = {'max_depth': [5, 10, 20, 50],
              'criterion': ['gini', 'entropy'],
              'min_impurity_decrease': np.linspace(0, 0.2, 10)}
rand = RandomizedSearchCV(estimator=DecisionTreeClassifier(),
                          param_dist=param_dist,
                          max_iter=50,
                          cv=5,
                          refit=True,
                          random_state=42,
                          verbose=True)
rand.fit(X_train_pca, y_train)
tree_best = rand.best_model
# Visualize the tuned model on the combined (train + test) data
X_concat = np.concatenate((X_train_pca, X_test_pca))
y_concat = np.concatenate((y_train, y_test))
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
dec = DecisionRegion(tree_best, X_concat, y_concat, cmap='plasma')
dec.plot(ax=ax1)
conf = ConfusionMatrix(y_concat, tree_best.predict(X_concat))
conf.plot(ax=ax2, show=True)
# Best params: {
# 'max_depth': 5,
# 'criterion': 'entropy',
# 'min_impurity_decrease': 0.2
# }
# Best score: 0.940625
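As a quick follow-up (not part of the original run), the tuned model can also be scored on the held-out test split alone, reusing the variables defined above:

```python
# Accuracy of the tuned tree on the 20% held-out split
test_acc = np.mean(tree_best.predict(X_test_pca) == y_test)
print(f'Test accuracy: {test_acc:.4f}')
```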
- Breiman, L., Friedman, J., Olshen, R.A., & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth.
- Quinlan, J.R. (1986). "Induction of Decision Trees". Machine Learning, 1(1), 81-106.
- Rokach, L. & Maimon, O. (2005). Decision Trees for Machine Learning. Data Mining and Knowledge Discovery Handbook, Springer.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer Texts in Statistics.