[Coursera] How to win a data science competition - Week 2, Lecture 2

환공지능 · July 7, 2021

Coursera kaggle machine learning


1. Validation and Overfitting

(1) Private Leaderboard
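Rankings can change between the public and private leaderboards. This happens when:
1) Competitors ignore validation and submit whatever scored best on the public leaderboard
2) The public and private data are not consistent with each other, and competitors did not account for this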

(2) Validation

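The model must also fit unseen data.
How well it does so depends on how the train set and test set are chosen.
Validation is used to evaluate the model's performance after training.
If you take the model that scores best on validation as the best model and keep tuning against that score, you can overfit to the validation set. In other words, the model may end up not fitting the test set.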

(3) Underfitting and Overfitting

Overfitting in machine learning and overfitting in a competition are conceptually different.
Overfitting in machine learning: capturing noise, capturing patterns which do not generalize to test data
Overfitting in a competition: low model quality on test data, which was unexpected given the validation scores

(4) Conclusion

1) Validation helps us evaluate the quality of the model.

2) Validation helps us select the model that is expected to perform best on unseen (test) data.

3) Underfitting refers to not capturing enough patterns in the data.

4) Overfitting refers to capturing noise, or capturing patterns which do not generalize to test data.

5) In competitions, overfitting refers to low model quality on test data, which was unexpected given the validation scores.

2. Validation Strategies

(1) Hold out

sklearn.model_selection.ShuffleSplit
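Split the data into two groups, A and B
Use A as the train set and B as the set to predict on (the validation set)
Measure model quality from the predictions on B and tune hyperparameters accordingly
Pros: useful when there is plenty of data, or when you want to see how the same model scores under different splits
Cons: the way the data is split has a large effect on the measured score.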
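
The hold-out steps above can be sketched roughly as follows. This is a minimal illustration, not the lecture's own code: the synthetic dataset, the `LogisticRegression` model, the 80/20 ratio, and the accuracy metric are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# n_splits=1 yields a single hold-out split: 80% for group A, 20% for group B.
splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(X))

model = LogisticRegression(max_iter=1000)
model.fit(X[train_idx], y[train_idx])        # train on A
valid_pred = model.predict(X[valid_idx])     # predict on B

# The score on B is what you would compare across hyperparameter settings.
holdout_score = accuracy_score(y[valid_idx], valid_pred)
print(f"hold-out accuracy: {holdout_score:.3f}")
```

Changing `random_state` gives a different split of the same data, which is one way to see how strongly the split affects the score.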

(2) K-fold

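ngroup = k
sklearn.model_selection.KFold
Split the train data into K folds
Iterate over the folds: retrain on all folds except the current one, then predict on the current fold
Pros: the mean and variance of the per-fold loss can be measured and used to improve the model; every sample is used for both training and validation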
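
As a rough sketch of the K-fold loop described above, the example below collects one loss per fold and then reports the mean and variance. The dataset, model, `n_splits=5`, and the use of log loss are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_losses = []
for train_idx, valid_idx in kf.split(X):
    # Retrain from scratch on every fold except the current one.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Predict on the current fold and record its loss.
    proba = model.predict_proba(X[valid_idx])
    fold_losses.append(log_loss(y[valid_idx], proba, labels=[0, 1]))

mean_loss = float(np.mean(fold_losses))
var_loss = float(np.var(fold_losses))
print(f"mean loss: {mean_loss:.4f}, variance: {var_loss:.6f}")
```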

(3) Leave-one-out

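ngroup = len(train) (the number of samples in the train set)
sklearn.model_selection.LeaveOneOut
Useful when the dataset is small
Retrain on all samples except the current one, then predict on the current sample
Slower than the other schemes, because one model is fit per sample.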
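
A minimal sketch of leave-one-out on a deliberately small dataset is shown below. The 40-sample synthetic dataset and the `LogisticRegression` model are assumptions. Note that the number of model fits equals the number of samples, which is where the runtime cost comes from.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# Small synthetic dataset, used only for illustration (assumption).
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one split (and one model fit) per sample

hits = []
for train_idx, valid_idx in loo.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])            # all samples but one
    pred = model.predict(X[valid_idx])[0]            # the single held-out sample
    hits.append(int(pred == y[valid_idx][0]))

loo_accuracy = float(np.mean(hits))
print(f"{n_fits} fits, leave-one-out accuracy: {loo_accuracy:.3f}")
```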

3. Data Splitting Strategies

(1) Random rows in validation

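Useful when the rows are independent of each other
When each row is a person, the rows are usually independent; ex) if the people are family members or coworkers, the rows are dependent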

(2) Time based split

Data before a certain point in time goes to train, and data after it goes to test
Moving window validation
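
One way to sketch moving window validation is with `sklearn.model_selection.TimeSeriesSplit`, which is a swapped-in tool and not something the lecture notes name. The 12-row index, the window sizes, and the assumption that rows are already sorted by time are all illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 rows assumed to be sorted by time already (assumption).
time_index = np.arange(12)

# max_train_size caps the training window so it slides forward
# instead of growing; test_size fixes the validation window length.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=4, test_size=2)

windows = []
for train_idx, valid_idx in tscv.split(time_index):
    windows.append((train_idx.tolist(), valid_idx.tolist()))
    print("train:", train_idx.tolist(), "valid:", valid_idx.tolist())
```

In every window the training rows come strictly before the validation rows, so the model is never validated on data older than what it was trained on.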

(3) Different approaches to validation

by ID
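
A split by ID can be sketched with `sklearn.model_selection.GroupKFold`, which keeps all rows sharing an ID on the same side of the split. This is one possible tool, not necessarily the one used in the lecture, and the `user_id` column below is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 10 rows belonging to 5 hypothetical users, 2 rows each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
user_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])  # hypothetical ID column

gkf = GroupKFold(n_splits=5)
leaks = []
for train_idx, valid_idx in gkf.split(X, y, groups=user_id):
    # IDs that appear in both train and validation would be a leak.
    shared = set(user_id[train_idx]) & set(user_id[valid_idx])
    leaks.append(len(shared))

print("IDs shared between train and validation per fold:", leaks)
```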

4. Problems Occurring During Validation

(1) Causes of score difference

Too little data
Data that is too diverse or inconsistent

(2) Submission differs from validation

Even K-fold validation has variation
Too little data on leaderboard
Train and test are from different distributions
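
To get a feel for the first point, that even K-fold scores vary, the sketch below reruns 5-fold validation with several shuffle seeds and compares the resulting mean scores. The dataset, model, metric, and number of seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

seed_scores = []
for seed in range(5):
    # Same model and data; only the fold assignment changes.
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
    seed_scores.append(float(scores.mean()))

spread = max(seed_scores) - min(seed_scores)
averaged = float(np.mean(seed_scores))
print(f"per-seed means: {np.round(seed_scores, 3)}")
print(f"spread: {spread:.4f}, average over seeds: {averaged:.4f}")
```

If the spread is comparable to the improvement you are trying to measure, a single K-fold run is probably not enough to tell two models apart.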

(3) LB shuffle

The LB score is consistently higher than, consistently lower than, or uncorrelated with the validation score
Randomness of data
Too little data
Public and private are from different distributions

