1. Concept of mean encoding
(1) Using target to generate features
1) why does it work
- Label encoding gives an arbitrary order with no correlation to the target
- Mean encoding helps to separate zeros from ones (see the sketch after this list)
- The model reaches a better loss with shorter trees
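A minimal sketch of the difference, assuming a toy DataFrame `df` with a categorical column `cat` and a binary `target` (all names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'cat':    ['a', 'b', 'a', 'c', 'b', 'a'],
    'target': [ 1,   0,   1,   0,   0,   1 ],
})

# Label encoding: categories get arbitrary integer codes, uncorrelated with the target.
df['cat_label'] = df['cat'].astype('category').cat.codes

# Mean encoding: each category is replaced by its mean target value,
# so the encoded feature orders categories from "mostly 0" to "mostly 1".
df['cat_mean'] = df['cat'].map(df.groupby('cat')['target'].mean())
print(df)
```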
2) what will you learn
- construct encodings
- correctly validate them
- extend them
(2) Indicators of usefulness
(3) Ways to use the target variable
2. Regularization
Four types:
- Cross-Validation loop inside the training data
- Smoothing
- Adding random noise
- Sorting and Calculating the expanding mean
(1) Cross Validation
- Robust and Intuitive
- Usually 4-5 folds are enough
- Beware of extreme cases like Leave-One-Out (LOO)
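A sketch of the CV-loop regularization under the same assumed `df`/`cat`/`target` names: each row's encoding is estimated only on the other folds, so a row never sees its own target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cv_mean_encode(df, col, target, n_splits=5):
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr_idx, val_idx in kf.split(df):
        # Estimate category means on the training folds only ...
        means = df.iloc[tr_idx].groupby(col)[target].mean()
        # ... and map them onto the held-out fold.
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(means).values
    # Categories unseen in the training folds fall back to the global mean.
    return encoded.fillna(df[target].mean())
```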
(2) Smoothing
- α controls the amount of regularization
- Only works together with some other regularization method
encoding = (mean(target) * nrows + globalmean * alpha) / (nrows + alpha)
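A sketch of this formula, assuming the same `df`/`cat`/`target` names; the `alpha=10` default is only illustrative:

```python
def smooth_mean_encode(df, col, target, alpha=10):
    globalmean = df[target].mean()
    agg = df.groupby(col)[target].agg(['mean', 'count'])
    # Rare categories are pulled toward the global mean; frequent ones keep their own mean.
    smooth = (agg['mean'] * agg['count'] + globalmean * alpha) / (agg['count'] + alpha)
    return df[col].map(smooth)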
(3) Add random noise
- Noise degrades the quality of the encoding on the training data, which counteracts target leakage
- Usually used together with LOO; the amount of noise is hard to tune
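A sketch, assuming `cat_mean` is a mean-encoding column built as above; the noise scale is a hyperparameter and `0.01` is only an illustrative value:

```python
import numpy as np

rng = np.random.default_rng(0)
df['cat_mean_noisy'] = df['cat_mean'] + rng.normal(0, 0.01, size=len(df))
```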
(4) Expanding mean
- Least amount of leakage
- No hyper parameters
- Irregular encoding quality
- Built into CatBoost
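A sketch of the expanding mean with the same assumed names; the row order acts as an implicit time axis, so shuffling `df` first yields a different encoding:

```python
# Cumulative target sum and count over previous occurrences of the same category.
cumsum = df.groupby('cat')['target'].cumsum() - df['target']
cumcnt = df.groupby('cat').cumcount()
# The first occurrence of a category yields 0/0 = NaN; fill with the global mean.
df['cat_exp_mean'] = (cumsum / cumcnt).fillna(df['target'].mean())
```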
Which one to use? The CV loop and the expanding mean are usually the practical choices!
3. Extensions and Generalization
(1) Regression and multiclass
- More statistics for regression tasks: percentiles, std, distribution bins (see the sketch after this list)
- Introducing new information for one vs all classifiers in multiclass tasks
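For regression, a sketch of mapping several per-category statistics back as features (names assumed; for multiclass, the same idea gives one encoding per class via one-vs-all):

```python
grp = df.groupby('cat')['target']

# Each statistic of the per-category target distribution becomes its own feature.
df['cat_median'] = df['cat'].map(grp.median())
df['cat_std']    = df['cat'].map(grp.std())
df['cat_p90']    = df['cat'].map(grp.quantile(0.9))
```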
(2) Many-to-many relations
- Cross product of entities
- Statistics from vectors
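A sketch with a hypothetical users-and-apps example: the many-to-many relation is unrolled into a long table of (user, app) pairs, each app is mean-encoded, and the per-user vector of encodings is collapsed with statistics:

```python
import pandas as pd

# Toy "long" table: one row per (user, app) pair, user-level target repeated per row.
long = pd.DataFrame({
    'user':   [1, 1, 2, 2, 2],
    'app':    ['chess', 'maps', 'maps', 'mail', 'chess'],
    'target': [1, 1, 0, 0, 0],
})

# Mean-encode the app, then collapse each user's vector of encodings with statistics.
long['app_mean'] = long['app'].map(long.groupby('app')['target'].mean())
user_feats = long.groupby('user')['app_mean'].agg(['mean', 'min', 'max'])
print(user_feats)
```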
(3) Interactions and numerical features
- Analyzing the fitted model
- Binning numeric features and selecting interactions (see the sketch after this list)
- If two features appear in two neighboring nodes of a tree, the tree is treating their interaction as meaningful; by iterating over all trees of the fitted model we can count how often each feature interaction occurs and encode the most frequent ones
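A sketch of encoding such an interaction, assuming a numeric column `num` alongside the `cat` and `target` names used above:

```python
import pandas as pd

# Bin the numeric feature into deciles, then treat (category, bin) as one new category.
df['num_bin'] = pd.qcut(df['num'], q=10, labels=False, duplicates='drop')
df['inter'] = df['cat'].astype(str) + '_' + df['num_bin'].astype(str)

# Mean-encode the interaction; in practice this still needs the regularization above.
df['inter_mean'] = df['inter'].map(df.groupby('inter')['target'].mean())
```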
(4) Correct Validation reminder
1) Local experiments:
- Estimate encodings on X_tr
- Map them to X_tr and X_val
- Regularize on X_tr
- Validate model on X_tr/X_val split
2) Submission
- Estimate encodings on whole train data
- Map them to train and test set
- Regularize on Train
- Fit on train
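A sketch of the local-experiment recipe above (for a submission, replace X_tr with the whole train set and X_val with the test set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X_tr, X_val = train_test_split(df, test_size=0.2, random_state=0)
X_tr, X_val = X_tr.copy(), X_val.copy()

# Estimate encodings on X_tr only (regularization, e.g. the CV loop, also stays inside X_tr).
means = X_tr.groupby('cat')['target'].mean()

# Map to both parts; categories unseen in X_tr fall back to the X_tr global mean.
X_tr['cat_mean'] = X_tr['cat'].map(means)
X_val['cat_mean'] = X_val['cat'].map(means).fillna(X_tr['target'].mean())
```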
(5) Pros and cons
- Pros: an easy way to transform categorical variables; a good basis for further feature engineering
- Cons: requires careful validation; raises the risk of overfitting; significant gains only on specific datasets