[Coursera] How to win a data science competition - Week 4, Lecture 2

환공지능 · July 21, 2021


CatBoost

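Compared with algorithms such as LightGBM and XGBoost, CatBoost with tuned parameters outperforms all the other libraries in terms of quality on the datasets, the one exception being cases where LightGBM is better than untuned CatBoost.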

Problems

(1) Categorical features
(2) Parameter tuning
(3) Prediction speed
(4) Overfitting
(5) Training speed

Numerical data and Categorical data
Ways to handle categorical data when using CatBoost:

(1) One-hot encoding : the original feature is removed and a new binary variable is added for each category
(2) Number of appearances : use how often each category appears in the dataset as a new feature
(3) Statistics with label usage on a random permutation of the data : compute some statistics using the label values of the objects
(4) Statistics on feature combinations : use combinations of numeric or categorical features
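
The four approaches above can be sketched in plain Python. This is a minimal illustration, not CatBoost's internal implementation: the column names (`city`, `device`, `clicked`), the helper functions, and the smoothing form `(sum + prior) / (count + 1)` with `prior=0.5` are all assumptions chosen for the example.

```python
import random
from collections import Counter, defaultdict


def one_hot(values):
    """Replace one categorical column with one binary column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def count_encode(values):
    """Number of appearances: map each category to how often it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def ordered_target_stat(values, labels, prior=0.5, seed=0):
    """Label statistic computed over a random permutation of the rows.

    Each row is encoded using only the labels of rows that come before it
    in the permutation, so a row's own label never leaks into its feature.
    """
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)
    sums = defaultdict(float)  # running sum of labels per category
    seen = defaultdict(int)    # running row count per category
    encoded = [0.0] * len(values)
    for i in order:
        c = values[i]
        encoded[i] = (sums[c] + prior) / (seen[c] + 1)
        sums[c] += labels[i]
        seen[c] += 1
    return encoded


def combine(col_a, col_b):
    """Feature combination: treat a pair of values as one new category."""
    return [f"{a}|{b}" for a, b in zip(col_a, col_b)]


# Toy data (hypothetical): two categorical columns and a binary label.
city = ["seoul", "busan", "seoul", "seoul", "busan"]
device = ["pc", "pc", "mobile", "pc", "mobile"]
clicked = [1, 0, 1, 0, 1]

print(one_hot(city)[0])
print(count_encode(city))
print(ordered_target_stat(city, clicked))
print(combine(city, device))
```

The combined column returned by `combine` is itself categorical, so it can be fed back into `count_encode` or `ordered_target_stat`.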

How CatBoost reduces overfitting

Overfitting detector
Evaluating custom metrics during training
User defined metrics and loss functions
NaN feature support
Cross validation
Staged predict and metric calculation on given dataset
Feature importances
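
As a rough illustration of what an overfitting detector does, the sketch below stops training once the validation loss has failed to improve for a fixed number of iterations. The function name `best_iteration_with_patience`, the `wait` argument, and the loss values are hypothetical; this is a simplified patience rule, not CatBoost's own code.

```python
def best_iteration_with_patience(val_losses, wait=3):
    """Return (stop_iteration, best_iteration) for a list of validation losses.

    Training stops once `wait` consecutive iterations fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= wait:
            return i, best_iter
    return len(val_losses) - 1, best_iter


# Made-up validation losses: improving until iteration 3, then getting worse.
losses = [0.70, 0.61, 0.55, 0.53, 0.54, 0.56, 0.57, 0.60]
stop_at, best_at = best_iteration_with_patience(losses, wait=3)
print(stop_at, best_at)  # 6 3
```

In CatBoost itself, a rule of this kind is configured through the `od_type="Iter"` and `od_wait` parameters, with a validation set passed to `fit` via `eval_set`.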

Training parameters

Number of trees + Learning rate
Tree depth
L2 regularization
Bagging temperature
Random strength
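
The parameters listed above map onto CatBoost constructor arguments roughly as follows. The values are illustrative assumptions, not settings recommended in the lecture, and the data names in the comments (`X_train`, `y_train`, `cat_columns`) are hypothetical.

```python
# Illustrative values only; they are assumptions, not settings from the lecture.
params = {
    "iterations": 1000,           # number of trees
    "learning_rate": 0.03,        # tune together with the number of trees
    "depth": 6,                   # tree depth
    "l2_leaf_reg": 3.0,           # L2 regularization
    "bootstrap_type": "Bayesian", # bagging_temperature applies to Bayesian bootstrap
    "bagging_temperature": 1.0,   # bagging temperature
    "random_strength": 1.0,       # random strength
}

# With the catboost package installed, the dictionary could be passed on as:
#   from catboost import CatBoostClassifier
#   model = CatBoostClassifier(**params)
#   model.fit(X_train, y_train, cat_features=cat_columns)
for name, value in params.items():
    print(f"{name} = {value}")
```

The number of trees and the learning rate are usually adjusted as a pair: a lower learning rate generally needs more trees to reach the same fit.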
