shenanigans.log

Linear Models

JOY JHJEONG · December 12, 2021

Simple Regression

Comparing Classification & Regression

property	supervised classification	regression
output type	discrete (class labels)	continuous (number)
what are you trying to find?	decision boundary	best fit line
evaluation	accuracy	sum of squared error

Baseline Model
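A reference model with minimal performance, used as the bar that the trained model has to beat
- Classification
  The most frequent class of the target
- Regression
  The mean of the target
- Time-series regression
  The value at the previous time stamp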
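The three baselines can be sketched in a few lines of plain Python. The helper names and the toy data below are made up for illustration.

```python
from collections import Counter
from statistics import mean


def classification_baseline(y_train):
    """Always predict the most frequent class seen in training."""
    return Counter(y_train).most_common(1)[0][0]


def regression_baseline(y_train):
    """Always predict the mean of the training target."""
    return mean(y_train)


def time_series_baseline(series):
    """Predict each value as the value at the previous time stamp.

    The returned list lines up with series[1:].
    """
    return series[:-1]


# Toy data, invented for this example.
labels = ["spam", "ham", "ham", "ham", "spam"]
prices = [10.0, 12.0, 14.0, 16.0]

majority_class = classification_baseline(labels)  # "ham"
mean_price = regression_baseline(prices)          # 13.0
lagged = time_series_baseline(prices)             # [10.0, 12.0, 14.0]
```

Any model worth keeping should score better than these on the same metric.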
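Using a predictive model
- Prediction
  The value estimated by the fitted model
- Residual
  The difference (distance) between the observed value and the predicted value in the sample
- Error
  The difference (distance) between the observed value and the predicted value in the population
- RSS (Residual Sum of Squares), also called SSE (Sum of Squared Errors)
  The sum of the squared residuals
  The cost function of a regression model
- Training
  The process of finding the model that minimizes the cost function
- Regression line
  The line that minimizes RSS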
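A minimal sketch of RSS and the line that minimizes it, using the closed-form least-squares solution for one feature. The function names and the data points are assumptions for the example, not from the original notes.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def rss(xs, ys, slope, intercept):
    """Residual sum of squares: the cost function being minimized."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Toy data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

slope, intercept = fit_line(xs, ys)      # about 1.98 and 0.05
best_rss = rss(xs, ys, slope, intercept)

# Mean baseline: a flat line at the mean of y.
baseline_rss = rss(xs, ys, 0.0, sum(ys) / len(ys))
```

Nudging the slope or intercept away from the fitted values can only increase RSS, which is what makes this the regression line.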

Reference

Comparing Classification and Regression

Multiple Regression

Training and testing
Train the model on the training set, then evaluate its accuracy on the test set.
- Time-series data is used to predict the future from the past, so the older data becomes the training set and the most recent data becomes the test set.
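A sketch of a time-ordered split in plain Python. The `chronological_split` helper, the column names, and the rows are hypothetical.

```python
def chronological_split(rows, time_key, test_fraction=0.25):
    """Sort by time, then keep the oldest rows for training
    and the most recent rows for testing."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]


# Toy rows, invented for this example. ISO-formatted date strings
# sort correctly as plain strings.
rows = [
    {"date": "2021-03-01", "sales": 30},
    {"date": "2021-01-01", "sales": 10},
    {"date": "2021-04-01", "sales": 40},
    {"date": "2021-02-01", "sales": 20},
]

train, test = chronological_split(rows, "date")
```

Shuffling before the split would leak future information into training, so the order matters here.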
Evaluation metrics for regression models
- MSE (Mean Squared Error)
  $\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$
- MAE (Mean Absolute Error)
  $\frac{1}{n}\sum_{i=1}^{n}\left | y_{i} - \hat{y}_{i} \right |$
- R-squared
  How well a regression line predicts or estimates actual values
  - Compares |estimated - mean| with |actual - mean|
  $\frac{\sum_{i=1}^{n}(\hat{y}_{i} - \bar{y})^{2}}{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}}$
  - The closer the predictions are to the observed values, the closer $R^2$ gets to 1.
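The three metrics in plain Python, as a sketch with made-up numbers. Here $R^2$ is computed as 1 - RSS/TSS, which equals the explained/total ratio for a least-squares line with an intercept.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Toy values, invented for this example.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse_value = mse(y_true, y_pred)       # 0.5
mae_value = mae(y_true, y_pred)       # 0.5
r2_value = r_squared(y_true, y_pred)  # 0.9
```

`sklearn.metrics` provides `mean_squared_error`, `mean_absolute_error`, and `r2_score` for the same calculations.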
Bias and variance
- Bias
  The gap between the actual values and the model's predictions on the training set
- Variance
  How much the quality of the fit changes from one data set to another
- Overfit
  A squiggly line fits the training set really well, but not the test set.
  - High variance / Low bias

Ridge Regression

One-hot Encoding
Converts a categorical variable into numeric form
Every category becomes its own column and adds a dimension, so it is a poor fit when there are too many categories (high cardinality).
- Categorical variable
  - Nominal
  - Ordinal
- Cardinality
  In mathematics, the cardinality of a set means the number of its elements.
- category_encoders library
  Applies one-hot encoding only to the categorical columns
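A small sketch of one-hot encoding. It uses `pandas.get_dummies` rather than category_encoders to keep dependencies light, and the data frame is made up.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "city": ["Seoul", "Busan", "Seoul", "Daegu"],
        "price": [10, 7, 12, 8],
    }
)

# Cardinality of the categorical column: number of distinct values.
cardinality = df["city"].nunique()  # 3

# Only the listed column is encoded; numeric columns pass through.
encoded = pd.get_dummies(df, columns=["city"])
encoded_columns = list(encoded.columns)
# ['price', 'city_Busan', 'city_Daegu', 'city_Seoul']
```

One column turned into three here. A column with thousands of distinct values would add thousands of columns, which is the high-cardinality problem.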
Feature selection
```
from sklearn.feature_selection import SelectKBest
```
- Feature engineering
  The process of creating features that suit the task
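A sketch of how `SelectKBest` might be used on a regression problem. The synthetic data, the choice of `f_regression` as the scoring function, and `k=2` are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only features 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept.
selected_idx = selector.get_support(indices=True)
```

In practice the selector should be fit on the training set only and then applied to the test set with `transform`.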
Ridge regression
Finds a new line that doesn't fit the training set as well
- Ridge regression penalty
  The ridge regression line, which has a small amount of bias due to the penalty, has less variance.
- Regularization
  A technique that modifies the model to reduce overfitting and improve generalization
  - Ridge regression regularizes by adding a little bias in exchange for lower variance
- lambda
  The penalty value that controls the strength of regularization
  - RidgeCV
    Selects the best penalty (lambda) by cross-validation
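A sketch comparing plain least squares with ridge regression, then letting `RidgeCV` pick the penalty. scikit-learn calls lambda `alpha`. The synthetic data and the candidate alphas are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV

# Small, noisy synthetic data set (invented for this example).
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 5))
true_coef = np.array([4.0, -3.0, 0.0, 0.0, 2.0])
y = X @ true_coef + rng.normal(scale=1.0, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L2 penalty shrinks the coefficient vector toward zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

# RidgeCV tries each candidate penalty and keeps the best one.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = float(ridge_cv.alpha_)
```

A larger `alpha` means more shrinkage: more bias, less variance. With `alpha=0` ridge reduces to ordinary least squares.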

Reference

Regularization Part 1 : Ridge (L2) Regression

Logistic Regression

It can be used to classify samples.

Data Sets
- Training set
  Construct classifier
- Validation set
  Pick algorithm + knob settings
  - It is hard to get a model fully right in a single training run, so several differently tuned models are trained on the training set and then compared on the validation set to pick the one that learned best
- Test set
  Estimate future error rate
  - Measures the model's generalization performance properly, once, at the very end
- Split randomly to avoid bias
- Model selection
  - A validation set is needed to see the effect of hyperparameter tuning during model selection
  - K-fold cross-validation
    Used when the data set is relatively small
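A sketch of a random train / validation / test split, plus k-fold splitting on the non-test data. The 60/20/20 proportions, the 5 folds, and the placeholder arrays are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder data: 50 samples, 2 features (invented for this example).
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# First hold out the test set, then split the rest into train / validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# With little data, rotate the validation role across k folds instead.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [
    (len(train_idx), len(val_idx))
    for train_idx, val_idx in kfold.split(X_trainval)
]
```

The test set stays out of both the single split and the k-fold loop, so it can still give one honest estimate at the end.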
Classification problems
- Evaluation metrics
  - Accuracy $= (TP+TN)/Total$
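The accuracy formula worked through on a small made-up set of binary labels (1 = positive, 0 = negative):

```python
# Toy labels, invented for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 6 / 8 = 0.75
```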
Logistic Regression vs Linear Regression
- Logistic regression predicts whether something is True or False, instead of predicting something continuous.
- Instead of fitting a line to the data, logistic regression fits an 'S' shaped 'logistic function'.
- How the line is fit to the data
  - With linear regression, we fit the line using 'least squares'.
  - Logistic regression uses something called 'maximum likelihood'.
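A sketch of the logistic function and the log-likelihood that maximum likelihood tries to maximize. The one-feature model, the parameter names `w` and `b`, and the data are assumptions for the example.

```python
import math


def sigmoid(z):
    """The 'S' shaped logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(xs, ys, w, b):
    """Log-likelihood of binary labels under p = sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Toy data, invented for this example: negatives on the left, positives on the right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

ll_flat = log_likelihood(xs, ys, w=0.0, b=0.0)  # every p is 0.5
ll_good = log_likelihood(xs, ys, w=2.0, b=0.0)  # curve follows the labels
```

Training searches for the `w` and `b` with the highest log-likelihood, in the same way that least squares searches for the line with the lowest RSS.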
Logistic Regression Model
- OneHotEncoder
  Handles categorical data
- SimpleImputer
  Handles missing values
- StandardScaler
  Standardizes features to mean 0 and standard deviation 1 so that they share the same scale
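One way these pieces might fit together, sketched with scikit-learn's own `OneHotEncoder` (the notes may have used the category_encoders version instead). The column names and the tiny data frame are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value (invented for this example).
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
        "sex": ["m", "f", "f", "f", "m", "m", "f", "f"],
    }
)
y = [0, 1, 1, 1, 0, 0, 1, 1]

# Numeric column: fill missing values, then standardize.
# Categorical column: one-hot encode.
preprocess = ColumnTransformer(
    [
        ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ]
)

model = make_pipeline(preprocess, LogisticRegression())
model.fit(df, y)

predictions = model.predict(df)          # class labels
probabilities = model.predict_proba(df)  # one column per class
```

Putting the preprocessing inside the pipeline means the imputer and scaler are fit on the training data only, which avoids leaking validation or test statistics.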

Reference

Overfitting 4: training, validation, testing
