# Decision Trees and Random Forests

dougieduk · April 2, 2022

# Elements of a decision tree

• Nodes
- Split for the value of a certain attribute

• Edges
- Outcome of a split to next node

• Root
- The node that performs the first split

• Leaves
- Terminal nodes that predict the outcome

# Intuitions behind the split

• We try to choose the variable that splits the tree most cleanly
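As a concrete sketch, here is a tiny decision tree fit with scikit-learn; the two-feature toy dataset is made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: two binary attributes, labels depend only on the first one
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

# criterion="entropy" makes the tree pick splits by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# The tree learns to split on the first attribute, the "cleanest" split
print(tree.predict([[1, 0]]))  # -> [1]
```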

# Concept of Impurity

• How do you define clean?
- Entropy and information gain are the mathematical methods for choosing the best split. Refer to the reading assignment.

• Information gain is the (beginning entropy) − (weighted sum of the entropies of the terminal nodes)

• Example of entropy
- A node with 3 reds and 3 greens is perfectly mixed, so its entropy is at the maximum: −(0.5 · log₂0.5) − (0.5 · log₂0.5) = 1 bit

• The Gini index could also be used:
- Gini index = 1 − Σ pᵢ², where pᵢ is the proportion of class i in the node
- For the 3-red/3-green node: 1 − (0.5² + 0.5²) = 0.5

# Information Gain

• Calculating information gain
- Beginning entropy: 0.815
- Weighted entropy of the leaf nodes: 0.6075
- Information gain: 0.815 − 0.6075 = 0.2075

• We repeat this process until the information gain is less than a certain threshold (e.g. 0.1)
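The calculation above can be sketched in plain Python. The 3-red/3-green node from the entropy example is reused; the split into two pure children is an assumed illustration, not the dataset behind the 0.2075 figure:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# A perfectly mixed node (3 reds, 3 greens) has the maximum entropy of 1 bit
print(entropy([3, 3]))  # -> 1.0

# Splitting it into two pure children yields the maximum gain of 1 bit
print(information_gain([3, 3], [[3, 0], [0, 3]]))  # -> 1.0
```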

# Information Gain Ratio (GR)

• The more nodes you create, the higher the information gain. But is this a good model?
- That's where GR comes in
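A minimal sketch of the gain-ratio idea, assuming the C4.5 definition (information gain divided by the split information, i.e. the entropy of the branch sizes), which penalizes splits into many tiny branches:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """C4.5-style gain ratio: information gain / split information."""
    n = sum(parent_counts)
    sizes = [sum(child) for child in children_counts]
    weighted = sum(s / n * entropy(child)
                   for s, child in zip(sizes, children_counts))
    gain = entropy(parent_counts) - weighted
    split_info = entropy(sizes)  # entropy of the branch sizes (denominator)
    return gain / split_info

# One clean two-way split: gain = 1 bit, split info = 1 bit -> ratio 1.0
print(gain_ratio([3, 3], [[3, 0], [0, 3]]))  # -> 1.0
```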

• How you calculate GR
- GR = (information gain) / (split information), where split information is the entropy of the branch sizes

• Example
- An example model
- The denominator (split information) of each model
- Calculating GR

# Random Forest

• Decision trees tend to overfit. Therefore you create multiple decision trees and let the trees do the voting

• This is one of the ensemble machine learning methods
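A minimal sketch of this voting ensemble using scikit-learn's RandomForestClassifier; the dataset is synthetic and only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, just for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 100 trees, each trained on a bootstrap sample of the rows;
# the forest's prediction is the majority vote of the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```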

# Bagging

• Selecting random rows (with replacement) and random features from the data

• Bagging Features
- If you choose the features to split on, like the traditional decision tree model, the models are likely to start splitting with the same feature (the strongest feature)

• In this case, the trees are likely to be highly correlated
• Therefore, you randomly choose the features
• The number of features to consider at each split is typically the square root of the total number of features
• Bagging Rows
- You also do the same for rows
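The two bagging steps above can be sketched in plain Python; the row count, feature count, and feature names are made up for illustration:

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_rows, n_features = 150, 9
feature_names = [f"f{i}" for i in range(n_features)]  # placeholder names

# Bagging rows: draw row indices WITH replacement (a bootstrap sample),
# so some rows repeat and some are left out
bootstrap_rows = [random.randrange(n_rows) for _ in range(n_rows)]

# Bagging features: at each split, consider only sqrt(total features)
k = int(math.sqrt(n_features))                 # 3 of 9 features
candidate_features = random.sample(feature_names, k)

# Duplicates in bootstrap_rows show the sampling was with replacement
print(len(bootstrap_rows), len(set(bootstrap_rows)))
print(candidate_features)
```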

The contents of this post belong to Jose Portilla's Python for Data Science and Machine Learning Bootcamp and 유나의 공부.