[Coursera] How to win a data science competition - Week 3, Lecture 1

환공지능 · July 8, 2021


1. Regression metrics review

Metrics : numeric values that you are trying to optimize
A metric is a number that expresses how the model you built is evaluated.
Each metric carries a different meaning, so the metric used for evaluation should be chosen according to the problem you are trying to solve.

(1) MSE : Mean Square Error

MSE = \frac {1}{N} \sum_{i=1}^{N}{(y_{i} - \hat{y}_{i})^2}

MSE measures the average squared error of our predictions.

(2) RMSE : Root Mean Square Error

RMSE = \sqrt{\frac {1}{N} \sum_{i=1}^{N}{(y_{i} - \hat{y}_{i})^2}}= \sqrt{MSE}

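Minimizing RMSE also minimizes MSE, because the square root is monotonically increasing.
RMSE is more intuitive, since it is on the same scale as the target.
The two can still behave differently in gradient-based models: the gradient of RMSE is the gradient of MSE scaled by 1 / (2 RMSE), which acts like a different learning rate.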
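To make the MSE/RMSE relationship concrete, here is a minimal sketch in plain Python. The data and the two "models" are made-up values for illustration only.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical targets and predictions from two hypothetical models.
y_true = [3.0, 5.0, 2.0, 7.0]
pred_a = [2.0, 5.0, 4.0, 8.0]
pred_b = [3.0, 9.0, 2.0, 7.0]

# Both metrics rank the two models the same way.
print(mse(y_true, pred_a), rmse(y_true, pred_a))
print(mse(y_true, pred_b), rmse(y_true, pred_b))
```

Model A is better under both metrics, which is what the monotonicity of the square root guarantees.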

(3) R-squared

Optimizing R-squared is equivalent to optimizing MSE, because R-squared is MSE rescaled by a constant that does not depend on the predictions.
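A small sketch of why this holds, using the usual definition of R-squared as one minus the ratio of the model's MSE to the MSE of the mean-constant baseline. The sample values are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R-squared = 1 - MSE(model) / MSE(predicting the target mean)."""
    mean_y = sum(y_true) / len(y_true)
    baseline = mse(y_true, [mean_y] * len(y_true))
    return 1.0 - mse(y_true, y_pred) / baseline

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
pred_a = [2.0, 5.0, 4.0, 8.0]
pred_b = [3.0, 9.0, 2.0, 7.0]

print(r_squared(y_true, pred_a), r_squared(y_true, pred_b))
```

The baseline term depends only on the targets, so lowering MSE always raises R-squared. A model worse than the mean baseline gets a negative value.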

(4) MAE : Mean Absolute Error

MAE = \frac {1}{N} \sum_{i=1}^{N}{|y_{i} - \hat{y}_{i}|}

1) Do you have outliers in the data? Use MAE.
2) Are you sure they are outliers? Use MAE.
3) Or are they just unexpected values that we should still care about? Use MSE.
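The robustness of MAE can be seen from the best constant prediction: the mean minimizes MSE, while the median minimizes MAE. The sketch below uses made-up targets with one outlier.

```python
import statistics

def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets; 100.0 plays the role of the outlier.
targets = [10.0, 11.0, 9.0, 10.0, 100.0]

mean_const = statistics.mean(targets)      # best constant for MSE
median_const = statistics.median(targets)  # best constant for MAE

print(mean_const, median_const)
print(mae(targets, [mean_const] * len(targets)))
print(mae(targets, [median_const] * len(targets)))
```

The mean is pulled far toward the outlier, while the median stays with the bulk of the data.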

(5) (R)MSPE, MAPE

Shop 1 : predicted 9, sold 10, MSE = 1
Shop 2 : predicted 999, sold 1000, MSE = 1
The error at Shop 1 is far more critical, yet MSE is the same for both shops. Measuring the error relative to the target fixes this.

MSPE = \frac {100\%}{N} \sum_{i=1}^{N}{\left(\frac {y_{i}-\hat{y}_{i}} {y_{i}}\right)^2}

MAPE = \frac {100\%}{N} \sum_{i=1}^{N}{\left|\frac {y_{i}-\hat{y}_{i}} {y_{i}}\right|}
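Applying the two formulas to the shop example shows the difference that MSE hides. This is a minimal sketch; the function names are my own.

```python
def mspe(y_true, y_pred):
    """Mean squared percentage error, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(((y - p) / y) ** 2 for y, p in zip(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred))

# The two shops from the example above, scored separately.
shop1_mape = mape([10.0], [9.0])
shop2_mape = mape([1000.0], [999.0])

print(shop1_mape, shop2_mape)
```

Shop 1 comes out at roughly 10 percent and Shop 2 at roughly 0.1 percent, even though the absolute error is 1 in both cases.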

(6) (R)MSLE

RMSLE = \sqrt{ \frac {1}{N} \sum_{i=1}^{N}{(\log(y_{i}+1)- \log(\hat{y}_{i}+1))^2}}

RMSLE is the same as RMSE computed in log space.
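The log-space equivalence can be checked directly. This sketch computes RMSLE from the formula and compares it with RMSE on log(1 + x)-transformed values; the numbers are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    sq = sum((math.log1p(y) - math.log1p(p)) ** 2 for y, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))

# Hypothetical non-negative targets and predictions.
y_true = [10.0, 1000.0, 50.0]
y_pred = [9.0, 999.0, 80.0]

direct = rmsle(y_true, y_pred)
via_log_space = rmse([math.log1p(y) for y in y_true],
                     [math.log1p(p) for p in y_pred])
print(direct, via_log_space)
```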

2. Classification metrics review

Notation

N - number of objects
L - number of classes
$y$ - ground truth
$\hat{y}$ - predictions
$[a = b]$ - indicator function
Soft labels (soft predictions) are the classifier's scores
Hard labels (hard predictions) are $\arg\max_{i} f_{i}(x)$, or $[f(x) > b]$ for a threshold $b$

(1) Accuracy

The simplest metric for measuring the performance of a classifier.
Best constant : predict the most frequent class

(2) Logarithmic Loss

Best constant : set $a_{i}$ to the frequency of the $i$-th class
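A minimal sketch of binary log loss and its best constant, in plain Python. The clipping constant `eps` and the toy labels are my own choices, not something from the lecture.

```python
import math

def binary_logloss(y_true, y_prob, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical imbalanced labels: 10% positives.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
freq = sum(labels) / len(labels)

loss_freq = binary_logloss(labels, [freq] * len(labels))  # constant = class frequency
loss_half = binary_logloss(labels, [0.5] * len(labels))   # constant = 0.5

print(freq, loss_freq, loss_half)
```

Predicting the class frequency gives a lower loss than any other constant, including 0.5.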

(3) Area under ROC curve

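Only used for binary tasks.
Accuracy needs one specific threshold; AUC instead considers all possible thresholds.
It depends only on the ordering of the predictions, not on their absolute values.
In the lecture's figure, red objects count as TP and green objects as FP. Going through the objects sorted by predicted score, move 1 step along the TP axis for each red object and 1 step along the FP axis for each green one; the area under the resulting curve is the AUC.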
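AUC can also be read as the fraction of (positive, negative) pairs that the scores order correctly, which makes the dependence on ordering explicit. The sketch below computes it that way; the labels and scores are made up, and ties are counted as half a correct pair.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of correctly ordered (positive, negative) pairs."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # a tie counts as half
    return correct / (len(pos) * len(neg))

# Hypothetical labels and scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc = auc_by_pairs(labels, scores)
# A monotonic transformation changes the values but not the ordering.
auc_cubed = auc_by_pairs(labels, [s ** 3 for s in scores])
print(auc, auc_cubed)
```

Here 8 of the 9 pairs are ordered correctly, and cubing the scores leaves the result unchanged.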

(4) (Quadratic Weighted) Kappa

3. General approaches for metrics optimization

(1) Loss vs metric

Target metric is what we want to optimize.
Optimization loss is what the model actually optimizes.

(2) Target Metrics Optimization

1) Just run the right model : MSE, LogLoss
2) Preprocess the train set and optimize another metric : MSPE, MAPE, RMSLE
3) Optimize another metric and postprocess the predictions : Accuracy, Kappa
4) Write a custom loss function

4. Regression metrics optimization

Usable libraries

Tree-based

xgboost
lightGBM
sklearn.RandomForestRegressor

Linear models

sklearn.<>Regression
sklearn.SGDRegressor
Vowpal Wabbit

Neural Nets

PyTorch
Keras
TensorFlow
etc.

(1) RMSE, MSE, R-squared

Synonyms : L2 loss
A loss function is implemented in all of the libraries above.

(2) MAE

Synonyms : L1 loss, Median regression
A loss function is implemented in the libraries above, except xgboost and the linear models.

(3) MSPE, MAPE

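MSPE is a weighted version of MSE and MAPE is a weighted version of MAE, so both are easy to optimize with an existing loss function plus sample weights.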
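A sketch of the weighting idea, assuming weights proportional to 1 / y^2 for MSPE and 1 / |y| for MAPE, normalized to sum to 1. The helper names are hypothetical, and the data reuses the two shops from the earlier example.

```python
def mspe_weights(y_true):
    """Sample weights proportional to 1 / y^2, normalized to sum to 1."""
    raw = [1.0 / (y * y) for y in y_true]
    total = sum(raw)
    return [w / total for w in raw]

def mape_weights(y_true):
    """Sample weights proportional to 1 / |y|, normalized to sum to 1."""
    raw = [1.0 / abs(y) for y in y_true]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_mse(y_true, y_pred, weights):
    """MSE in which each squared error is multiplied by its sample weight."""
    return sum(w * (y - p) ** 2 for w, y, p in zip(weights, y_true, y_pred))

def mspe(y_true, y_pred):
    """Mean squared percentage error, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(((y - p) / y) ** 2 for y, p in zip(y_true, y_pred))

y_true = [10.0, 1000.0]
y_pred = [9.0, 999.0]

w = mspe_weights(y_true)
# MSPE equals the weighted MSE up to a factor that depends only on the targets.
scale = 100.0 / len(y_true) * sum(1.0 / (y * y) for y in y_true)
print(mspe(y_true, y_pred), scale * weighted_mse(y_true, y_pred, w))
```

Because the scale factor does not depend on the predictions, minimizing the weighted MSE also minimizes MSPE. In practice such weights could be passed to a library that accepts per-sample weights, for example through a `sample_weight` argument where the estimator's `fit` method supports one.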

(4) RMSLE
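Since RMSLE is RMSE in log space, one way to optimize it is to transform the target with log(1 + y), fit a model with MSE loss, and transform the predictions back with exp(z) - 1. The sketch below uses a constant predictor as a stand-in for a real model, purely to keep the example self-contained; the data is made up.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    sq = sum((math.log1p(y) - math.log1p(p)) ** 2 for y, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))

# Hypothetical targets spanning several orders of magnitude.
targets = [1.0, 10.0, 100.0, 1000.0]

# Step 1: move the target to log space.
log_targets = [math.log1p(y) for y in targets]

# Step 2: fit with MSE loss. Stand-in "model": the constant that minimizes
# MSE in log space, which is the mean of the transformed targets.
z_hat = sum(log_targets) / len(log_targets)

# Step 3: map the prediction back to the original scale.
prediction = math.expm1(z_hat)

plain_mean = sum(targets) / len(targets)
print(rmsle(targets, [prediction] * len(targets)))
print(rmsle(targets, [plain_mean] * len(targets)))
```

The back-transformed constant scores better on RMSLE than the plain mean of the targets, which is the best constant for MSE in the original space.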

5. Classification metrics optimization

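(1) Logloss : Just run the right model!

Random forest classifiers (e.g. sklearn's RandomForestClassifier) tend to do poorly in terms of log loss, so their predictions should be calibrated with one of the following:
Platt scaling : just fit logistic regression to your predictions
Isotonic regression : just fit isotonic regression to your predictions
Stacking : just fit XGBoost or a neural net to your predictions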
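As an illustration of the first option, here is a minimal Platt-scaling sketch: a one-feature logistic regression fitted to raw scores with plain gradient descent. The learning rate, iteration count, and toy data are assumptions of mine; in practice the calibrator would be fitted on held-out predictions with a library implementation rather than on the same data it is evaluated on.

```python
import math

def logloss(labels, probs, eps=1e-15):
    """Binary log loss with clipping to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_platt(scores, labels, lr=0.5, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical overconfident scores: two of the eight confident guesses are wrong.
raw_scores = [0.01, 0.02, 0.03, 0.05, 0.95, 0.97, 0.98, 0.99]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, labels)
calibrated = [sigmoid(a * s + b) for s in raw_scores]

print(logloss(labels, raw_scores), logloss(labels, calibrated))
```

The calibrated probabilities are pulled away from 0 and 1, which lowers the log loss while keeping the ordering of the scores.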

(2) Accuracy : Fit any metric and tune the threshold!
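Threshold tuning can be as simple as trying every observed score as a cut-off and keeping the one with the best accuracy. The sketch below does that on made-up scores that all sit below 0.5, so the default threshold performs badly.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(y_true, y_pred)) / len(y_true)

def tune_threshold(y_true, scores):
    """Try each distinct score as a threshold; return the best (threshold, accuracy)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = accuracy(y_true, [int(s >= t) for s in scores])
        if acc > best_acc:  # keep the first threshold that reaches the best accuracy
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical labels and soft predictions.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.22, 0.3, 0.35, 0.4]

default_acc = accuracy(labels, [int(s >= 0.5) for s in scores])
best_t, best_acc = tune_threshold(labels, scores)
print(default_acc, best_t, best_acc)
```

The threshold should be tuned on validation data, not on the data used for the final evaluation.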

(3) AUC
1) Pointwise loss

The loss is computed for each object individually, and the total loss is the sum or the average of the losses over all objects.

2) Pairwise loss

The loss takes the predictions and labels for a pair of objects; it returns 0 when the pair is ordered correctly and a value greater than 0 when the order is wrong.
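One possible loss with exactly this behavior is a hinge on the score difference of each (positive, negative) pair. This is only an illustrative choice; smooth variants such as a pairwise logistic loss are also common. The data is made up.

```python
def pairwise_loss(y_true, scores):
    """Average hinge loss over all (positive, negative) pairs.

    A pair contributes 0 when the positive is scored at least as high as the
    negative, and the size of the violation otherwise.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            total += max(0.0, n - p)
    return total / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
well_ordered = pairwise_loss(labels, [0.1, 0.2, 0.8, 0.9])
one_swap = pairwise_loss(labels, [0.1, 0.85, 0.8, 0.9])  # a negative outranks a positive
print(well_ordered, one_swap)
```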

(4) Quadratic weighted Kappa
