[Coursera] How to win a data science competition - Week 4, Lecture 1

환공지능 · July 9, 2021

Coursera kaggle machine learning

1. Hyperparameter tuning

(1) How do we tune hyperparameters

1) Select the most influential parameters

2) Understand exactly how they influence the training

3) Tune them, either manually or automatically

(2) Optimization software

: Hyperopt
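
To make "automatic" tuning concrete, here is a minimal sketch of a search loop. It uses plain random search from the standard library, not Hyperopt itself, so it only illustrates the general idea of sampling configurations and keeping the best one. The function names, the search space, and the toy objective are all assumptions made for illustration.

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample n_trials configurations from space and return the best one.

    space maps a parameter name to either a (low, high) tuple for a
    uniform float range or a list of discrete choices.
    """
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {}
        for name, spec in space.items():
            if isinstance(spec, tuple):
                low, high = spec
                params[name] = rng.uniform(low, high)
            else:
                params[name] = rng.choice(spec)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Toy objective (hypothetical): stands in for a validation loss that
# happens to be lowest at max_depth=7 and subsample=0.8.
def toy_objective(params):
    return (params["max_depth"] - 7) ** 2 + (params["subsample"] - 0.8) ** 2

space = {"max_depth": [3, 5, 7, 9, 11], "subsample": (0.5, 1.0)}
best_params, best_loss = random_search(toy_objective, space, n_trials=200)
```

In a real competition the objective would train a model and return its validation loss, which is exactly the part that makes each trial expensive.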

(3) Color-coding legend

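1) (red) Underfitting (bad)

2) Good fit and generalization (good)

3) (green) Overfitting (bad)

- red : increasing the parameter impedes fitting, reduces overfitting, and lowers the model's degrees of freedom, moving the model from overfitting toward underfitting.
- green : increasing the parameter fits the train set more closely. Increase it if the model underfits and decrease it if the model overfits; raising it moves the model from underfitting toward overfitting.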
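
The legend can be read as a small decision rule. The sketch below encodes it, assuming scores where higher is better; the thresholds `good_enough` and `max_gap` are arbitrary placeholders, not values from the course.

```python
def diagnose(train_score, valid_score, good_enough=0.8, max_gap=0.05):
    """Classify a fit from train/validation scores (higher is better).

    The thresholds are illustrative assumptions only.
    """
    if train_score - valid_score > max_gap:
        return "overfitting"
    if train_score < good_enough:
        return "underfitting"
    return "good fit"

def suggest(kind, state):
    """Direction to move a 'red' or 'green' parameter for a given state."""
    if kind not in ("red", "green"):
        raise ValueError("kind must be 'red' or 'green'")
    if state == "good fit":
        return "keep"
    if kind == "red":
        # Increasing a red parameter constrains the model.
        return "increase" if state == "overfitting" else "decrease"
    # Increasing a green parameter lets the model fit the train set more closely.
    return "decrease" if state == "overfitting" else "increase"
```

For example, `suggest("red", diagnose(0.99, 0.80))` says to increase a red parameter, because a large train/validation gap signals overfitting.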

(4) Tree-based models

1) XGBoost & LightGBM

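Both models build trees one after another to optimize the given objective
max_depth : maximum depth of a tree; try starting at 7
num_leaves : when trees are deep, adjust the number of leaves
subsample : acts as a kind of regularization
colsample_~ : if the model overfits, lowering these values can help
eta : a learning weight, analogous to the learning rate in gradient descent
num_round : the number of training steps to run, i.e. the number of trees to build
seed : changing the seed can produce a completely different model, but if the random seed has little effect on the model, it is fine to change it.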
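
As a rough sketch, the parameters above might be collected as follows. The keys use XGBoost-style names; every value except `max_depth = 7` is a placeholder chosen for illustration, and `with_seeds` is a hypothetical helper for checking how sensitive a model is to the seed.

```python
# XGBoost-style parameter names. Only max_depth=7 comes from the notes;
# the other values are placeholders, not recommendations.
xgb_params = {
    "max_depth": 7,
    "subsample": 0.8,         # row sampling, a form of regularization
    "colsample_bytree": 0.8,  # column sampling; lower it when overfitting
    "eta": 0.1,               # learning rate
    "seed": 0,
}
num_round = 500  # number of boosting rounds, i.e. trees to build

def with_seeds(params, seeds):
    """Return one copy of params per seed, for a seed-sensitivity check."""
    return [{**params, "seed": s} for s in seeds]

seed_variants = with_seeds(xgb_params, [0, 1, 2])
```

Training once per entry in `seed_variants` and comparing validation scores shows whether the seed meaningfully changes the model.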

2) RandomForest & ExtraTrees

n_estimators : number of trees; try a range of values and judge from a plot of score against tree count
max_depth : can be set to None; 7 is a common starting point
min_samples_leaf : acts as regularization
criterion : Gini or entropy
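
Reading the number of trees off a plot amounts to finding where the validation curve flattens. A minimal sketch of that idea, with made-up scores and an assumed tolerance:

```python
def smallest_sufficient(curve, tolerance=0.001):
    """Pick the smallest tree count whose score is within tolerance of the best.

    curve is a list of (n_estimators, validation_score) pairs, higher is better.
    """
    best = max(score for _, score in curve)
    for n, score in sorted(curve):
        if best - score <= tolerance:
            return n

# Hypothetical validation scores, for illustration only.
curve = [(10, 0.810), (50, 0.842), (100, 0.851), (200, 0.8515), (400, 0.8516)]
chosen = smallest_sufficient(curve)
```

Here the curve is essentially flat beyond 100 trees, so adding more mostly costs training time.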

3) Neural Network

4) Linear Model

TIPS

Do not spend too much time on hyperparameter tuning; try it only when you have run out of ideas or features
Be patient: GBDT and NN models may need thousands of rounds to fit

2. Practical guide

(1) Define your goals. What you can get out of your participation
1) To learn more about an interesting problem
2) To get acquainted with new software tools
3) To hunt for a medal

(2) After you enter a competition
Sort all parameters by these principles:
1) Importance
2) Feasibility
3) Understanding

(3) Fast and dirty is always better
1) Do not worry too much about code quality
2) Save only the important things, as dataframes
3) Rent a bigger server

(4) Software Development
1) Use good variable names
2) Keep your research reproducible : fix the random seed, write down exactly how each feature was generated, use version control
3) Reuse code : same code for train and test stages
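
A small sketch of points 2) and 3): a fixed seed for anything random, and one feature function shared by the train and test stages. The column names (`price`, `items`, `weekday`) and the features themselves are hypothetical.

```python
import random

SEED = 42  # arbitrary fixed value; any constant works

def make_features(row):
    """Feature logic shared by the train and test stages (hypothetical columns)."""
    return {
        "price_per_item": row["price"] / max(row["items"], 1),
        "is_weekend": int(row["weekday"] >= 5),
    }

def sample_rows(rows, k, seed=SEED):
    """Draw a reproducible sample by using a dedicated, seeded generator."""
    return random.Random(seed).sample(rows, k)

train_rows = [
    {"price": 10.0, "items": 2, "weekday": 5},
    {"price": 9.0, "items": 3, "weekday": 1},
]
test_rows = [{"price": 4.0, "items": 0, "weekday": 6}]

# The same function runs on both sets, so the features cannot drift apart.
train_features = [make_features(r) for r in train_rows]
test_features = [make_features(r) for r in test_rows]
```

Using a dedicated `random.Random(seed)` instance rather than the global generator keeps the result stable even if other code also draws random numbers.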

3. Competition Pipeline

(1) Understand the Problem (1 day)

Type of problem, e.g. image, text, optimization
Size of the data
Check the hardware requirements
Check the software requirements
Check the metric being tested on

(2) EDA (1-2 days)

plot histograms of variables
plot features versus the target variable and vs time.
consider univariate predictability metrics, e.g. IV, R, AUC
binning numerical features and correlation matrices
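
As one example of a univariate predictability metric, the AUC of a single feature against a binary target can be computed directly from its definition: the probability that a random positive has a higher feature value than a random negative, with ties counted as half. This is a simple quadratic-time sketch, fine for exploration on small samples.

```python
def univariate_auc(feature, target):
    """AUC of one numeric feature against a 0/1 target, ties counted as 0.5."""
    pos = [x for x, y in zip(feature, target) if y == 1]
    neg = [x for x, y in zip(feature, target) if y == 0]
    if not pos or not neg:
        raise ValueError("target must contain both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value near 0.5 suggests the feature alone carries little signal, while values near 0 or 1 suggest it separates the classes well (in one direction or the other).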

(3) Define CV Strategy

Most Important Step
People have won by just selecting the right way to validate
If time matters? Time-based validation
If train and test have different characteristics? Stratified validation
If the data is completely random? Random validation (random K-fold)
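
The three strategies differ only in how row indices are assigned to train and validation. The following standard-library sketch shows one possible version of each; the function names and signatures are assumptions for illustration, not a particular library's API.

```python
import random

def time_based_split(timestamps, cutoff):
    """Train on rows before cutoff, validate on rows at or after it."""
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    valid_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, valid_idx

def _folds_to_splits(folds):
    """Yield (train, valid) index lists, holding out one fold at a time."""
    for i, valid in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, valid

def random_kfold(n_samples, k=5, seed=0):
    """Shuffle all indices, then deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return list(_folds_to_splits(folds))

def stratified_kfold(labels, k=5, seed=0):
    """Deal each label's indices round-robin so folds keep the label mix."""
    rng = random.Random(seed)
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return list(_folds_to_splits(folds))
```

Whichever scheme is used, the goal is for the validation split to mimic how the test set was separated from the train set.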

(4) Feature Engineering (3-4 days)

Different problems require different feature engineering!
Look at similar competitions and check how people approached them!
Image : scaling, shifting, rotation, CNN
Text : tf-idf, svd, stemming, stop-word removal
Time series : lags, weighted averaging, exponential smoothing
Categorical : target encoding, frequency encoding, one-hot encoding, label encoding
Numerical : scaling, binning, derivatives, outlier removal, dimensionality reduction
Interactions : multiplication, division, concatenation
Recommenders : transactional history, item popularity, frequency of purchase
Can be automated using selection with CV!
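
Two of the listed techniques are simple enough to sketch in a few lines: lag features for time series and frequency encoding for categorical variables. Both functions below are illustrative standard-library versions with assumed names.

```python
from collections import Counter

def lag_feature(series, lag):
    """Value observed lag steps earlier; None where no history exists."""
    values = list(series)
    return ([None] * lag + values)[:len(values)]

def frequency_encode(values):
    """Replace each category by its relative frequency in values."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]
```

For instance, `lag_feature([10, 20, 30, 40], 1)` shifts the series down by one step, and `frequency_encode(["a", "b", "a", "a"])` maps "a" to 0.75 and "b" to 0.25.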

(5) Modeling (3-4 days)

Image : CNN
Text : GBM, Linear, DL, KNN, LIBFM
Time series : autoregressive models, ARIMA, GBM, Linear, LSTM
Categorical : GBM, Linear, DL, LIBFM
Numerical : GBM, Linear, DL, SVM
Interaction : GBM, Linear, DL
Recommenders : CF, DL, LIBFM, GBM

(6) Ensembling (3-4 days)

A way of using multiple learning algorithms together to get better predictive performance

All this time, predictions on internal validation and test are saved.
Different ways to combine, from averaging to multilayer stacking
Helps to average a few low-correlated predictions with good scores
Stacking process repeats the modelling process
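
The simplest combination, averaging, together with a correlation check between saved predictions, might look like the sketch below. The prediction values are made up for illustration, and the helper names are assumptions.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equally long prediction lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def average_predictions(predictions, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]

# Hypothetical saved validation predictions from two models.
model_a = [0.2, 0.8, 0.6, 0.1]
model_b = [0.4, 0.6, 0.9, 0.3]
correlation = pearson(model_a, model_b)
blended = average_predictions([model_a, model_b])
```

A lower correlation between two models with similarly good scores usually means averaging them has more to gain.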
