1. Concept of mean encoding
(1) Using target to generate features
1) why does it work
- Label encoding gives an arbitrary order with no correlation to the target
- Mean encoding helps to separate zeros from ones (see the sketch after this list)
- The model reaches a better loss with shorter trees
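A minimal sketch of the difference, assuming a toy DataFrame `df` with a categorical column `cat` and a binary `target` (all names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'cat':    ['a', 'b', 'a', 'c', 'b', 'a'],
    'target': [ 1,   0,   1,   0,   0,   1 ],
})

# Label encoding: categories get arbitrary integer codes, uncorrelated with the target.
df['cat_label'] = df['cat'].astype('category').cat.codes

# Mean encoding: each category is replaced by its mean target value,
# so the encoded feature orders categories from "mostly 0" to "mostly 1".
df['cat_mean'] = df['cat'].map(df.groupby('cat')['target'].mean())
print(df)
```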
2) what will you learn
- construct encodings
- correctly validate them
- extend them
(2) Indicators of usefulness
(3) Ways to use the target variable
2. Regularization
Four types:
- Cross-Validation loop inside the training data
- Smoothing
- Adding random noise
- Sorting and Calculating the expanding mean
(1) Cross Validation
- Robust and Intuitive
- Usually 4-5 folds are enough
- Beware of extreme cases like Leave-One-Out (LOO)
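A sketch of the CV-loop regularization under the same assumed `df`/`cat`/`target` names: each row's encoding is estimated only on the other folds, so a row never sees its own target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cv_mean_encode(df, col, target, n_splits=5):
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr_idx, val_idx in kf.split(df):
        # Estimate category means on the training folds only ...
        means = df.iloc[tr_idx].groupby(col)[target].mean()
        # ... and map them onto the held-out fold.
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(means).values
    # Categories unseen in the training folds fall back to the global mean.
    return encoded.fillna(df[target].mean())
```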
(2) Smoothing
- α controls the amount of regularization
- Only works together with some other regularization method
encoding = (mean(target) * nrows + globalmean * alpha) / (nrows + alpha)
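A sketch of this formula, assuming the same `df`/`cat`/`target` names; the `alpha=10` default is only illustrative:

```python
def smooth_mean_encode(df, col, target, alpha=10):
    globalmean = df[target].mean()
    agg = df.groupby(col)[target].agg(['mean', 'count'])
    # Rare categories are pulled toward the global mean; frequent ones keep their own mean.
    smooth = (agg['mean'] * agg['count'] + globalmean * alpha) / (agg['count'] + alpha)
    return df[col].map(smooth)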
(3) Add random noise
- Noise degrades the quality of the encoding on the training data, which counteracts target leakage
- Usually used together with LOO; the amount of noise is hard to tune
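A sketch, assuming `cat_mean` is a mean-encoding column built as above; the noise scale is a hyperparameter and `0.01` is only an illustrative value:

```python
import numpy as np

rng = np.random.default_rng(0)
df['cat_mean_noisy'] = df['cat_mean'] + rng.normal(0, 0.01, size=len(df))
```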
(4) Expanding mean
- Least amount of leakage
- No hyper parameters
- Irregular encoding quality
- Built into CatBoost
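A sketch of the expanding mean with the same assumed names; the row order acts as an implicit time axis, so shuffling `df` first yields a different encoding:

```python
# Cumulative target sum and count over previous occurrences of the same category.
cumsum = df.groupby('cat')['target'].cumsum() - df['target']
cumcnt = df.groupby('cat').cumcount()
# The first occurrence of a category yields 0/0 = NaN; fill with the global mean.
df['cat_exp_mean'] = (cumsum / cumcnt).fillna(df['target'].mean())
```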
Which one to use? The CV loop and the expanding mean are usually the practical choices!
3. Extensions and Generalization
(1) Regression and multiclass
- More statistics for regression tasks: percentiles, std, distribution bins (see the sketch after this list)
- Introducing new information for one vs all classifiers in multiclass tasks
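For regression, a sketch of mapping several per-category statistics back as features (names assumed; for multiclass, the same idea gives one encoding per class via one-vs-all):

```python
grp = df.groupby('cat')['target']

# Each statistic of the per-category target distribution becomes its own feature.
df['cat_median'] = df['cat'].map(grp.median())
df['cat_std']    = df['cat'].map(grp.std())
df['cat_p90']    = df['cat'].map(grp.quantile(0.9))
```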
(2) Many-to-many relations
- Cross product of entities
- Statistics from vectors
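A sketch with a hypothetical users-and-apps example: the many-to-many relation is unrolled into a long table of (user, app) pairs, each app is mean-encoded, and the per-user vector of encodings is collapsed with statistics:

```python
import pandas as pd

# Toy "long" table: one row per (user, app) pair, user-level target repeated per row.
long = pd.DataFrame({
    'user':   [1, 1, 2, 2, 2],
    'app':    ['chess', 'maps', 'maps', 'mail', 'chess'],
    'target': [1, 1, 0, 0, 0],
})

# Mean-encode the app, then collapse each user's vector of encodings with statistics.
long['app_mean'] = long['app'].map(long.groupby('app')['target'].mean())
user_feats = long.groupby('user')['app_mean'].agg(['mean', 'min', 'max'])
print(user_feats)
```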
(3) Interactions and numerical features
- Analyzing the fitted model
- Binning numeric features and selecting interactions (see the sketch after this list)
- If two features appear in two neighboring nodes of a tree, the tree is treating their interaction as meaningful; by iterating over all trees of the fitted model we can count how often each feature interaction occurs and encode the most frequent ones
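A sketch of encoding such an interaction, assuming a numeric column `num` alongside the `cat` and `target` names used above:

```python
import pandas as pd

# Bin the numeric feature into deciles, then treat (category, bin) as one new category.
df['num_bin'] = pd.qcut(df['num'], q=10, labels=False, duplicates='drop')
df['inter'] = df['cat'].astype(str) + '_' + df['num_bin'].astype(str)

# Mean-encode the interaction; in practice this still needs the regularization above.
df['inter_mean'] = df['inter'].map(df.groupby('inter')['target'].mean())
```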
(4) Correct Validation reminder
1) Local experiments:
- Estimate encodings on X_tr
- Map them to X_tr and X_val
- Regularize on X_tr
- Validate model on X_tr/X_val split
2) Submission
- Estimate encodings on whole train data
- Map them to train and test set
- Regularize on Train
- Fit on train
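A sketch of the local-experiment recipe above (for a submission, replace X_tr with the whole train set and X_val with the test set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X_tr, X_val = train_test_split(df, test_size=0.2, random_state=0)
X_tr, X_val = X_tr.copy(), X_val.copy()

# Estimate encodings on X_tr only (regularization, e.g. the CV loop, also stays inside X_tr).
means = X_tr.groupby('cat')['target'].mean()

# Map to both parts; categories unseen in X_tr fall back to the X_tr global mean.
X_tr['cat_mean'] = X_tr['cat'].map(means)
X_val['cat_mean'] = X_val['cat'].map(means).fillna(X_tr['target'].mean())
```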
(5) Pros and cons
- Pros: an easy way to transform categorical variables; a good basis for further feature engineering
- Cons: requires careful validation; raises the risk of overfitting; significant gains only on specific datasets