This note is based on a lecture by Professor 최영우 @ Sookmyung Women's University
Overfitting occurs when a model does not generalize beyond the training dataset.
A model should predict well on instances that it has not yet seen.
When the model is not complex enough:
not very accurate on either the training or the test data >> Underfitting
As the model gets too complex (e.g., more terms or variables):
- more accurate on the training data
- less accurate on the test (= holdout) data
>> Overfitting! (See the sketch below.)
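A minimal sketch of this trade-off, assuming scikit-learn and a synthetic dataset (my illustration, not code from the lecture; the sample sizes and depth values are arbitrary): as `max_depth` grows, training accuracy approaches 1.0 while test accuracy eventually drops.

```python
# Sweep tree depth and compare train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; sizes and depths are illustrative choices only.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for depth in (1, 3, 5, 10, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```

Shallow trees score poorly on both splits (underfitting); very deep trees score near 1.0 on training data but worse on the test split (overfitting).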
To avoid overfitting:
1-1. Holdout evaluation
1-2. Problems with holdout evaluation
2-1. Cross-Validation
2-2. k-fold Cross-Validation
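A minimal sketch of both ideas, assuming scikit-learn, the breast-cancer dataset, and k=5 (all illustrative choices, not from the lecture): the holdout score shifts with the particular random split, which is the problem k-fold cross-validation addresses by testing every instance exactly once and averaging the fold scores.

```python
# Holdout evaluation and its instability, then k-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout: accuracy depends on which instances happen to land in the
# test split (the problem noted in 1-2).
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"holdout split {seed}: accuracy={acc:.3f}")

# k-fold CV: every instance is used for testing exactly once; the k fold
# scores are averaged into a more stable estimate.
scores = cross_val_score(model, X, y, cv=5)
print("5-fold scores:", scores.round(3), "mean:", round(scores.mean(), 3))
```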
Flexibility
Decision tree > Logistic Regression
Logistic Regression: can't model the full complexity of the data as the data becomes larger (but it overfits less)
Decision tree: can model more complex regularities with larger training sets (but it overfits more)
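A rough sketch of this contrast, again with scikit-learn on synthetic data (illustrative, not from the lecture): the unpruned tree typically fits the training set almost perfectly and shows a larger train/test gap than the less flexible logistic regression.

```python
# Compare a flexible model (decision tree) with a less flexible one
# (logistic regression) on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [("logistic regression", LogisticRegression(max_iter=5000)),
          ("decision tree (unpruned)", DecisionTreeClassifier(random_state=0))]
for name, model in models:
    model.fit(X_tr, y_tr)
    # A large train/test gap signals overfitting; the unpruned tree usually
    # reaches ~1.0 on training data, the linear model does not.
    print(f"{name}: train={model.score(X_tr, y_tr):.3f}, "
          f"test={model.score(X_te, y_te):.3f}")
```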
- Stop growing the tree before it gets too complex
- Grow the tree until it's too large, then "prune" it back to reduce its size (a subtree is replaced with a leaf if this replacement does not reduce accuracy)
- Build trees with different numbers of nodes and pick the best (all three strategies are sketched below)
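A sketch of the three strategies mapped onto scikit-learn's tree controls (my mapping, not the lecture's code): pre-pruning via `max_depth`/`min_samples_leaf`, post-pruning via scikit-learn's cost-complexity pruning (its `ccp_alpha` mechanism, which may differ from the exact pruning rule above), and a best-of-sizes search over `max_leaf_nodes`.

```python
# The three complexity-control strategies for decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. Pre-pruning: stop growing before the tree gets too complex.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             random_state=0).fit(X_tr, y_tr)

# 2. Post-pruning: grow fully, then prune back. scikit-learn implements this
#    as cost-complexity pruning; a middle alpha is picked here just for the
#    sketch (properly, alpha would be chosen by validation or CV).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)

# 3. Build trees of different sizes and pick the best. The test split is
#    reused for selection only to keep the sketch short; a separate
#    validation set would be the sound choice.
best = max((DecisionTreeClassifier(max_leaf_nodes=n, random_state=0).fit(X_tr, y_tr)
            for n in (4, 8, 16, 32, 64)),
           key=lambda t: t.score(X_te, y_te))

for name, tree in [("pre-pruned", pre), ("post-pruned", post),
                   ("best of sizes", best)]:
    print(f"{name}: leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_te, y_te):.3f}")
```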
Cross-validation vs. nested cross-validation:
- Cross-validation: used when the value of the complexity parameter is fixed
- Nested cross-validation: used when we don't know the best value of a complexity parameter and want to find it (contrasted in the sketch below)
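A sketch of the distinction, assuming scikit-learn and a tree's `max_depth` as the complexity parameter (illustrative choices): plain cross-validation scores a model whose `max_depth` is fixed in advance, while nested cross-validation wraps a parameter search (inner loop, GridSearchCV) inside cross_val_score (outer loop, generalization estimate).

```python
# Plain CV (fixed complexity) vs. nested CV (search for the best complexity).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Cross-validation: max_depth is fixed in advance; CV just estimates how
# well this particular model generalizes.
fixed = DecisionTreeClassifier(max_depth=3, random_state=0)
print("plain 5-fold CV:", cross_val_score(fixed, X, y, cv=5).mean().round(3))

# Nested cross-validation: the inner loop (GridSearchCV) picks the best
# max_depth on each outer training fold; the outer loop scores the whole
# search procedure on data the search never saw.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, 8, None]}, cv=5)
print("nested 5-fold CV:", cross_val_score(search, X, y, cv=5).mean().round(3))
```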