Decision Tree

been_29 · August 9, 2024

💡 Decision Tree

  • A tree-structured model that learns classification rules by automatically discovering patterns in the data
  • The central question: which criterion should each split use so that the resulting classification rules are as efficient as possible?



Decision Tree Components

  • Node : The point at which the data is split
    • Root Node : Located at the top of the tree, the starting point of tree splitting.
    • Internal Node : Nodes branched from the root node, providing additional splitting criteria.
    • Leaf Node : The final node of the tree that is not split any further.
  • Branch : Connections between nodes, with each branch representing a splitting condition.
  • Splitting Criteria : The rule for dividing data at each node.
  • Depth : The length of the longest path from the Root Node to a Leaf Node.
    • The deeper the tree, the more complex it becomes, increasing the risk of overfitting.
    • If the tree is too shallow, the model is too simple and may underfit, failing to capture the structure of the data.
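To make the components above concrete, here is a minimal sketch that fits a tree with scikit-learn and counts its nodes, leaves, and depth. It assumes scikit-learn is installed; the variable names (`n_internal`, `root_feature`, and so on) are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

t = clf.tree_  # low-level array representation of the fitted tree
n_nodes = t.node_count

# A leaf has no children; scikit-learn stores the same sentinel (-1)
# in children_left and children_right for leaf nodes.
is_leaf = t.children_left == t.children_right
n_leaves = int(is_leaf.sum())
n_internal = n_nodes - n_leaves  # split nodes, including the root

# Node 0 is the root; its splitting criterion is a feature and a threshold.
root_feature = iris.feature_names[t.feature[0]]
root_threshold = t.threshold[0]

print("nodes:", n_nodes, "| leaves:", n_leaves, "| split nodes:", n_internal)
print("depth:", clf.get_depth())
print("root split:", root_feature, "<=", root_threshold)
```

Because every split in a scikit-learn tree is binary, the total node count is always `2 * n_leaves - 1`.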






Node creation process

  1. Root node selection : Set the root node to include all the data
  2. Determine the optimal splitting criterion : Select the attribute that best splits the data based on specific criteria (e.g., Gini Impurity, Entropy, etc.)
  3. Data splitting : Split the data based on the selected attribute and create new nodes
  4. Iteration : Repeat the above process on each resulting subset to create child nodes. When no further split is possible or a predefined stopping condition (e.g., max_depth, min_samples_split) is met, create a Leaf Node
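The four steps above can be sketched as a small recursive function. This is a simplified illustration, not how scikit-learn is implemented: it assumes numeric features, uses Gini impurity as the criterion, and tries midpoints between distinct values as candidate thresholds. The helper names (`best_split`, `build`, `predict_one`) are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Step 2: return the (feature, threshold) with the lowest weighted Gini,
    or None if no split improves on the unsplit node."""
    best, best_score = None, gini(y)
    n = len(y)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        # Candidate thresholds: midpoints between consecutive distinct values
        for thr in (values[:-1] + values[1:]) / 2:
            left = X[:, f] <= thr
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best_score:
                best, best_score = (f, thr), score
    return best

def build(X, y, depth=0, max_depth=3, min_samples_split=2):
    """Steps 1, 3, 4: start from all data, split, and recurse until a stop condition."""
    split = None
    if depth < max_depth and len(y) >= min_samples_split:
        split = best_split(X, y)
    if split is None:
        # Leaf Node: predict the majority class of the samples that reached it
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "label": values[np.argmax(counts)]}
    f, thr = split
    left = X[:, f] <= thr
    return {
        "leaf": False, "feature": f, "threshold": thr,
        "left": build(X[left], y[left], depth + 1, max_depth, min_samples_split),
        "right": build(X[~left], y[~left], depth + 1, max_depth, min_samples_split),
    }

def predict_one(node, x):
    """Follow the splitting conditions from the root down to a leaf."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

# Toy data: one feature, two classes separated between 2.0 and 3.0
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
root = build(X, y)
print(root["feature"], root["threshold"])
print(predict_one(root, np.array([1.2])), predict_one(root, np.array([3.7])))
```

On this toy data the root splits at the midpoint between the two classes, and both children are pure, so they become leaves immediately.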






Homogeneity-based splitting criteria

  • Gini Impurity
    • A measure of how mixed the classes are within a dataset
    • A smaller value means the data is more homogeneous (dominated by a single class); 0 means every sample belongs to the same class
    • formula
      $Gini = 1 - \sum_i p_i^2$
    • $p_i$ represents the proportion of class $i$
  • Entropy
    • A measure of the degree of mixture within the data
    • A larger value indicates that the data is distributed across a variety of classes
    • formula
      $Entropy = -\sum_i p_i \log_2 p_i$
  • Information Gain
    • Calculate the difference in entropy before and after the split, and choose the splitting criterion with the highest information gain
    • formula
      $Information\ Gain = Entropy(parent) - \sum_{child} \frac{|child|}{|parent|} \, Entropy(child)$
  • Variance Reduction
    • Primarily used in regression trees, it calculates the difference in variance before and after the split
    • Choose the criterion with the largest variance reduction
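The measures listed above are short enough to compute directly. The following is a minimal NumPy sketch, assuming labels are given as arrays and that variance reduction weights each child by its share of the samples (the same weighting as information gain). The function names are illustrative.

```python
import numpy as np

def class_proportions(labels):
    """p_i: the proportion of each class that actually appears in labels."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_proportions(labels)
    # Only observed classes are counted, so every p_i > 0 and log2 is defined
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

def variance_reduction(parent, children):
    """Regression analogue: parent variance minus size-weighted child variance."""
    n = len(parent)
    weighted = sum(len(c) / n * np.var(c) for c in children)
    return np.var(parent) - weighted

# Classification: a perfectly mixed two-class node split into two pure children
parent = np.array([0, 0, 1, 1])
pure_split = [np.array([0, 0]), np.array([1, 1])]
print("gini:", gini(parent))
print("entropy:", entropy(parent))
print("information gain:", information_gain(parent, pure_split))

# Regression: targets that separate into a low group and a high group
targets = np.array([1.0, 2.0, 9.0, 10.0])
groups = [targets[:2], targets[2:]]
print("variance reduction:", variance_reduction(targets, groups))
```

For the mixed two-class parent, Gini is 0.5 and entropy is 1 bit, the maximum for two classes. Splitting it into two pure children recovers the full 1 bit as information gain.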






Main Hyperparameters

  • Max Depth : Set the maximum depth of the tree to prevent it from becoming too deep and overfitting
  • Minimum Samples
    • Minimum Samples Split : The minimum number of samples required to split a node
    • Minimum Samples Leaf : The minimum number of samples required to be in a leaf node
  • Max Features : The maximum number of features (attributes) to consider when making a split
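  • Pruning : The process of removing branches from the tree to prevent overfitting
    • Post-Pruning : Grow the full tree first, then remove branches that contribute little
    • Pre-Pruning : Stop growth early by limiting the tree while it is being built (e.g., Max Depth, Minimum Samples)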
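In scikit-learn these hyperparameters map onto constructor arguments of `DecisionTreeClassifier`. The sketch below is one possible configuration; the specific values (3, 10, 5, 2, 0.02) are arbitrary choices for illustration, not recommendations. Post-pruning is shown through `ccp_alpha`, scikit-learn's cost-complexity pruning parameter.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: limits applied while the tree is being grown
pre_pruned = DecisionTreeClassifier(
    max_depth=3,           # Max Depth
    min_samples_split=10,  # Minimum Samples Split
    min_samples_leaf=5,    # Minimum Samples Leaf
    max_features=2,        # Max Features considered at each split
    random_state=0,
).fit(X, y)

# Post-pruning: grow the tree, then prune it back with cost-complexity pruning.
# A larger ccp_alpha removes more branches; 0.02 is an illustrative value.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# Smallest number of training samples found in any leaf of the pre-pruned tree
t = pre_pruned.tree_
smallest_leaf = t.n_node_samples[t.children_left == -1].min()

print("pre-pruned depth:", pre_pruned.get_depth(), "| smallest leaf:", smallest_leaf)
print("leaves, full vs post-pruned:", full.get_n_leaves(), post_pruned.get_n_leaves())
```

Comparing the leaf counts of `full` and `post_pruned` shows how much of the tree the pruning step removed.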






Decision Tree Model Using Iris

Decision Tree Model Code Example Using the Iris Dataset

# Import the necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create and train the decision tree model
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Visualize the decision tree
plt.figure(figsize=(20,10))
tree.plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()
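As a follow-up to the plot, the learned rules can also be printed as text, and the model can be checked on held-out data. The example above trains on the full dataset, so the train/test split here is an added assumption, as are the `test_size=0.3` and `max_depth=3` settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Hold out 30% of the samples so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Text rendering of the splitting conditions, one line per node
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

accuracy = clf.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Each indented line in the printed rules corresponds to one branch of the tree, and the `class:` lines are the leaf nodes.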