[Coursera] How to win a data science competition - Week 3, Lecture 2


1. Concept of mean encoding

(1) Using target to generate features

1) why does it work

  • label encoding assigns an essentially random order to categories, with no correlation to the target
  • mean encoding helps to separate the zeros from the ones (see the sketch after this list)
  • the classes become separable with shorter trees, so the model can reach a better loss
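
A minimal sketch of plain (unregularized) mean encoding on a made-up toy frame; the `city` and `target` column names are illustrative and reused in the later sketches.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'B', 'C'],
    'target': [ 1,   0,   1,   1,   1,   0 ],
})

# Plain mean encoding: replace each category with the mean target over its rows.
means = df.groupby('city')['target'].mean()
df['city_mean_enc'] = df['city'].map(means)
print(df)
```

Used like this on the training data, the feature leaks the target, which is exactly why the regularization schemes in the next section are needed.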

2) what will you learn

  • construct encodings
  • correctly validate them
  • extend them

(2) Indicators of usefulness

(3) ways to use target variable

2. Regularization

4 types

  • Cross-Validation loop inside the training data
  • Smoothing
  • Adding random noise
  • Sorting and Calculating the expanding mean

(1) Cross Validation

  • Robust and intuitive (see the sketch below)
  • Usually 4-5 folds are enough
  • Be careful with extreme cases such as Leave-One-Out, which can still leak the target
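
A minimal sketch of the out-of-fold scheme, reusing the hypothetical `city`/`target` columns from the first sketch; the fold count and the fallback to the global mean are illustrative choices, not the lecture's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cv_mean_encode(train, col, target, n_splits=5, seed=0):
    """Encode each row with category means estimated on the other folds only."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(train):
        fold_means = train.iloc[tr_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = train[col].iloc[val_idx].map(fold_means).values
    # Categories never seen in the other folds fall back to the global mean.
    return encoded.fillna(train[target].mean())

# train['city_enc'] = cv_mean_encode(train, 'city', 'target')
```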

(2) Smoothing

  • $\alpha$ controls the amount of regularization
  • Only works together with some other regularization method

    $\dfrac{mean(target) \cdot nrows + globalmean \cdot \alpha}{nrows + \alpha}$
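
A sketch of the smoothing formula above, with the same hypothetical `city`/`target` columns; `alpha=10.0` is only an example value and would need tuning.

```python
import pandas as pd

def smoothed_mean_encode(train, col, target, alpha=10.0):
    """Shrink each category mean toward the global mean; alpha acts roughly
    like a number of 'average' pseudo-rows added to every category."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['mean', 'count'])
    smooth = (stats['mean'] * stats['count'] + global_mean * alpha) / (stats['count'] + alpha)
    return train[col].map(smooth).fillna(global_mean)
```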

(3) Add random noise

  • Noise degrades the quality of the encoding on the train data; how much noise to add is hard to tune
  • Usually used together with LOO (see the sketch below)
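
A sketch of leave-one-out encoding with additive Gaussian noise; `noise_std` is an arbitrary illustration value, and choosing it well is exactly the hard part.

```python
import numpy as np
import pandas as pd

def loo_encode_with_noise(train, col, target, noise_std=0.05, seed=0):
    """Leave-one-out mean encoding plus Gaussian noise (training rows only)."""
    rng = np.random.default_rng(seed)
    grp = train.groupby(col)[target]
    sums, cnts = grp.transform('sum'), grp.transform('count')
    denom = (cnts - 1).replace(0, np.nan)            # categories seen only once
    loo = ((sums - train[target]) / denom).fillna(train[target].mean())
    return loo + rng.normal(0.0, noise_std, size=len(train))
```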

(4) Expanding mean

  • Least amount of leakage
  • No hyper parameters
  • Irregular encoding quality
  • Built into CatBoost
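
A sketch of the expanding-mean scheme with the same hypothetical columns; the row order matters, so in practice the data is shuffled or sorted once before encoding.

```python
import numpy as np
import pandas as pd

def expanding_mean_encode(train, col, target):
    """Encode each row using only the target values of earlier rows of the same category."""
    cumsum = train.groupby(col)[target].cumsum() - train[target]    # target sum of earlier rows
    cumcnt = train.groupby(col).cumcount().replace(0, np.nan)       # count of earlier rows
    return (cumsum / cumcnt).fillna(train[target].mean())           # first occurrence: no history
```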

Which one to use? The CV loop and the expanding mean are usually the practical choices.

3. Extensions and Generalization

(1) Regression and multiclass

  • For regression tasks, more statistics can be used: percentiles, standard deviation, distribution bins
  • For multiclass tasks, mean encoding introduces new information for the one-vs-all classifiers (see the sketch below)
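
A sketch of both extensions under made-up column names: a regression target `price` summarized with several statistics per category, and a multiclass `label` turned into one encoded column per class. These are the plain, unregularized versions.

```python
import pandas as pd

def q10(s): return s.quantile(0.10)
def q90(s): return s.quantile(0.90)

def add_target_stat_encodings(train):
    """Regression and multiclass mean-encoding extensions on hypothetical columns."""
    # Regression target `price`: several statistics per category, not just the mean.
    stats = train.groupby('city')['price'].agg(['mean', 'std', 'median', q10, q90])
    stats.columns = ['price_' + c for c in ('mean', 'std', 'median', 'p10', 'p90')]
    train = train.join(stats, on='city')

    # Multiclass target `label`: one-vs-all class frequencies within each category.
    class_freq = (pd.get_dummies(train['label'])
                    .groupby(train['city']).mean()
                    .add_prefix('city_enc_class_'))
    return train.join(class_freq, on='city')
```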

(2) Many-to-many relations

  • Cross product of entities: build a "long" representation with one row per pair
  • Statistics from vectors: encode in the long table, then collapse each entity's vector of encodings into statistics (see the sketch below)
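
A sketch with a made-up user/apps example: build the long (cross-product) representation, mean-encode inside it, then collapse each user's vector of encodings back into fixed-length statistics.

```python
import pandas as pd

# Hypothetical data: each user carries a list of installed apps and a binary target.
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'apps':    [['chess', 'maps'], ['maps', 'mail'], ['mail', 'chess', 'maps']],
    'target':  [1, 0, 1],
})

# Long representation: one row per (user, app) pair.
pairs = users.explode('apps').rename(columns={'apps': 'app'})

# Mean-encode the app in the long table (unregularized here for brevity),
# then summarize each user's vector of app encodings with a few statistics.
pairs['app_enc'] = pairs['app'].map(pairs.groupby('app')['target'].mean())
user_feats = pairs.groupby('user_id')['app_enc'].agg(['mean', 'min', 'max', 'std'])
print(user_feats)
```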

(3) Interactions and numerical features

  • Analyze the fitted model
  • Bin numeric features and select interactions
  • If two features appear in neighboring nodes of a tree, the model is treating them as interacting; by iterating over every tree of the model we can count how many times each feature interaction occurs (see the sketch below)
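
A hedged sketch of the counting idea for scikit-learn tree ensembles: walk each fitted tree's `tree_.feature` and `tree_.children_left`/`tree_.children_right` arrays and count parent/child node pairs that split on different features. The helper name and the gradient-boosting usage example are assumptions, not the lecture's code.

```python
from collections import Counter
from itertools import chain

def count_feature_interactions(trees):
    """Count how often two different features split in adjacent (parent/child) nodes."""
    counts = Counter()
    for est in trees:
        t = est.tree_
        for parent in range(t.node_count):
            f_parent = t.feature[parent]
            if f_parent < 0:                      # leaf node, no split feature
                continue
            for child in (t.children_left[parent], t.children_right[parent]):
                f_child = t.feature[child]
                if f_child >= 0 and f_child != f_parent:
                    counts[tuple(sorted((f_parent, f_child)))] += 1
    return counts

# e.g. for a fitted sklearn GradientBoostingClassifier `gbm`:
# top_pairs = count_feature_interactions(chain.from_iterable(gbm.estimators_)).most_common(10)
```

Frequently co-occurring pairs are candidates for explicit interaction features, e.g. concatenating the two categorical columns and mean-encoding the result.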

(4) Correct Validation reminder

1) Local experiments:

  • Estimate encodings on X_tr
  • Map them to X_tr and X_val
  • Regularize on X_tr
  • Validate model on X_tr/X_val split
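
A sketch of the local-experiment order of operations, reusing the hypothetical `cv_mean_encode` helper and `city`/`target` columns from the earlier sketches; the model and metric are placeholders.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_tr, X_val = train_test_split(train, test_size=0.2, random_state=0)
X_tr, X_val = X_tr.copy(), X_val.copy()

# 1) Estimate regularized encodings on X_tr only (CV loop sketch from above).
X_tr['city_enc'] = cv_mean_encode(X_tr, 'city', 'target')

# 2) Map plain X_tr category means onto X_val; the X_val target is never touched.
tr_means = X_tr.groupby('city')['target'].mean()
X_val['city_enc'] = X_val['city'].map(tr_means).fillna(X_tr['target'].mean())

# 3) Fit on X_tr and validate on the X_tr/X_val split.
feats = ['city_enc']                              # plus any other features
model = GradientBoostingClassifier().fit(X_tr[feats], X_tr['target'])
print(roc_auc_score(X_val['target'], model.predict_proba(X_val[feats])[:, 1]))
```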

2) Submission

  • Estimate encodings on whole train data
  • Map them to train and test set
  • Regularize on Train
  • Fit on train

(5) Pros and cons

  • Pros: categorical variables are easy to transform; a solid basis for feature engineering
  • Cons: validation must be done carefully; the risk of overfitting goes up; only effective on some specific datasets