[Coursera] How to win a data science competition - Week 1, Lecture 3

환공지능 · July 6, 2021

Coursera kaggle machine learning

1. Overview

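(1) Contents

Feature Preprocessing : preparing the raw data for a model
Feature Generation : creating new features from the data
Their dependence on a model type : both preprocessing and feature generation depend on the model being used.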

(2) Features

numeric
categorical
ordinal
datetime
coordinates

2. Numeric features

(1) Numeric feature : data expressed as numbers

(2) Non-tree

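Non-tree models such as linear models and k-NN give different results depending on feature scaling
Gradient descent methods and neural networks also require appropriate scaling
Scaling Methods
- MinMaxScaler : rescales to a fixed range ([0, 1] by default); use when the range is bounded
- StandardScaler : rescales to mean 0 and std 1; commonly applied before PCA
- Outliers : clip values to the 1st-99th percentile range
- RankTransformation : available as scipy.stats.rankdata
- LogTransformation : np.log(1+x), pulls large values closer to the mean
- Raising to the power < 1 : np.sqrt(x + 2/3)
Rather than relying on a single scaling technique, combine several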
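The scaling methods listed above can be sketched roughly as follows. The toy array and variable names are made up for illustration; only the library calls come from the notes.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature with one large outlier (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]
x_standard = StandardScaler().fit_transform(x)  # mean 0, std 1

# Clip to the 1st-99th percentile range to limit the effect of outliers.
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transform: sorted values end up equally spaced.
x_rank = rankdata(x.ravel())

# Log and power (< 1) transforms pull large values closer to the rest.
x_log = np.log1p(x)          # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)

# One way to "combine several": stack differently scaled copies side by side.
features = np.hstack([x_minmax, x_log, x_sqrt])
```

Stacking columns is only one possible reading of "combine"; another is to train separate models on differently preprocessed data and blend their predictions.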

(3) Tree-based model : not affected by scaling
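A small experiment can illustrate this point. The sketch below fits the same decision tree on raw features and on log-transformed features and compares the predictions on the training rows. The data is synthetic and the model settings are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.integers(1, 1000, size=(200, 2)).astype(float)  # synthetic features
y = X[:, 0] * 2 + rng.normal(size=200)                   # synthetic target

# Same tree settings, once on raw values and once on a monotonic transform.
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log1p(X), y)

pred_raw = tree_raw.predict(X)
pred_log = tree_log.predict(np.log1p(X))

# A monotonic transform keeps the order of the values, so the tree can
# split the training rows into the same groups in both cases.
same_predictions = np.allclose(pred_raw, pred_log)
```

Repeating the comparison with a linear model or k-NN would generally give different predictions for the two versions of the data, which is the contrast the notes draw.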

3. Categorical and Ordinal features

(1) Categorical : category data that cannot be expressed as a number

(2) Ordinal : discrete like categorical data, but the values can be compared in order like numbers

(3) Label Encoding

Converts each category to a unique number
sklearn.preprocessing.LabelEncoder or pandas.factorize
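The two functions assign codes in different orders, which a short example makes visible. The column values here are invented for the illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(["S", "C", "Q", "S"])  # made-up categorical column

# LabelEncoder assigns codes in sorted (alphabetical) order of the classes.
sorted_codes = LabelEncoder().fit_transform(embarked)   # [2, 0, 1, 2]

# pandas.factorize assigns codes in order of first appearance.
appearance_codes, uniques = pd.factorize(embarked)      # [0, 1, 2, 0]
```

Order of appearance can be the more useful choice when the rows are already sorted in a meaningful way.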

(4) Frequency Encoding

Encodes each category by how often it occurs
Useful when the frequency of a value is related to the target
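A minimal sketch of frequency encoding with pandas, assuming a hypothetical `embarked` column:

```python
import pandas as pd

df = pd.DataFrame({"embarked": ["S", "C", "Q", "S", "S", "C"]})  # made-up data

# Share of rows taken by each category: S -> 3/6, C -> 2/6, Q -> 1/6.
freq = df["embarked"].value_counts(normalize=True)

# Replace each category by its share.
df["embarked_freq"] = df["embarked"].map(freq)
```

If two categories happen to have the same frequency they receive the same code, so they become indistinguishable to the model.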

(5) One-hot-encoding

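Creates one binary (0/1) column for each unique category value
Can be less efficient for tree-based models
The number of columns grows, which can make training harder.
pandas.get_dummies or sklearn.preprocessing.OneHotEncoder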
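As a quick illustration with pandas (the column name and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})  # made-up data

# One new column per unique value; each row has exactly one "hot" column.
one_hot = pd.get_dummies(df["embarked"], prefix="embarked")
# Columns: embarked_C, embarked_Q, embarked_S
```

When a feature has many unique values, sklearn.preprocessing.OneHotEncoder may be the more practical option because it returns a sparse matrix by default.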

4. Datetime and Coordinates

(1) Datetime

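Periodicity : split the datetime into components such as week, month, season, year, and hour and add them as features
Time since : row-independent (time since a fixed moment, e.g. 1970-01-01) or row-dependent (e.g. days left until the next holiday)
Difference between dates : datetime_feature1 - datetime_feature2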
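The three kinds of datetime features above might be generated along these lines. The column names, the dates, and the single fixed holiday are all assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2021-07-06 14:30", "2021-12-24 09:00"]),
    "last_purchase": pd.to_datetime(["2021-07-01", "2021-12-20"]),
})  # made-up data

# Periodicity: components of the timestamp.
df["hour"] = df["sale_date"].dt.hour
df["day_of_week"] = df["sale_date"].dt.dayofweek  # Monday = 0
df["month"] = df["sale_date"].dt.month
df["year"] = df["sale_date"].dt.year

# Time since a fixed moment (row-independent).
df["days_since_epoch"] = (df["sale_date"] - pd.Timestamp("1970-01-01")).dt.days

# Time until an event (row-dependent in general; one hypothetical holiday here).
holiday = pd.Timestamp("2021-12-25")
df["days_to_holiday"] = (holiday - df["sale_date"]).dt.days

# Difference between two datetime columns.
df["days_since_last_purchase"] = (df["sale_date"] - df["last_purchase"]).dt.days
```

In a real dataset the "next holiday" would normally be looked up per row from a holiday calendar rather than hard-coded.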

(2) Coordinates

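Latitude and longitude
Tree models struggle to separate data when the numeric features follow a linear pattern (for example a diagonal boundary on the map), so slightly rotated coordinates can be added as new features. This helps the model make more precise splits on the map.
Nearby places can be used to generate additional features
Coordinates can be used for clustering
Aggregated statistics can be used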
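A rough sketch of three of these ideas: rotated coordinates, distance to a nearby point, and an aggregated count per area. The points, the reference location, the 45-degree angle, and the one-degree grid are all arbitrary assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "lat": [37.57, 37.52, 35.18],
    "lon": [126.98, 127.04, 129.08],
})  # made-up points

# Rotated coordinates: the same points seen in axes turned by 45 degrees.
angle = np.deg2rad(45)
df["rot45_x"] = df["lon"] * np.cos(angle) - df["lat"] * np.sin(angle)
df["rot45_y"] = df["lon"] * np.sin(angle) + df["lat"] * np.cos(angle)

# Distance to a hypothetical reference point (planar approximation, in degrees).
center_lat, center_lon = 37.5665, 126.9780
df["dist_to_center"] = np.hypot(df["lat"] - center_lat, df["lon"] - center_lon)

# Aggregated statistic: number of points in the same coarse grid cell.
df["cell"] = df["lat"].round(0).astype(str) + "_" + df["lon"].round(0).astype(str)
df["points_in_cell"] = df.groupby("cell")["lat"].transform("count")
```

The same groupby pattern extends to other aggregates, such as the mean price of objects in a cell, when such a column exists.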

5. Handling missing values

(1) Missing values are commonly filled with placeholder values such as -1 or 99

(2) Plotting a histogram for each feature can reveal how the organizers handled missing values.
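The same check can be approximated numerically. The sketch below uses value counts in place of eyeballing a histogram: a single value that takes up a large share of an otherwise continuous feature is a candidate placeholder. The data is synthetic, with -999 planted as the placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)  # synthetic feature
values[:100] = -999  # pretend the organizers replaced NaN with -999

# Count how often each distinct value occurs.
uniques, counts = np.unique(values, return_counts=True)

suspect = uniques[counts.argmax()]     # most frequent value
share = counts.max() / len(values)     # fraction of rows holding it
```

Here `suspect` is -999 and `share` is 0.1; in a histogram the same thing would show up as an isolated spike far from the rest of the distribution.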

(3) Handling missing values by model

Tree : -999, -1, etc
Simple Linear, Neural Network : mean, median
isnull feature : add a binary indicator of missingness, or reconstruct the value where possible
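The fill strategies above could look like this in pandas. The `age` column and its values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 41.0]})  # made-up data

# Binary indicator that records which rows were missing.
df["age_isnull"] = df["age"].isnull().astype(int)

# Out-of-range constant: a tree can isolate these rows with a single split.
df["age_tree"] = df["age"].fillna(-999)

# Median fill: keeps the values in a sensible range for linear models and NNs.
df["age_linear"] = df["age"].fillna(df["age"].median())
```

Creating the indicator before filling matters, because after `fillna` the information about which rows were missing is gone.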

(4) Model performance can change depending on how this is done, so handle it with care.

6. Conclusion

(1) Scaling and Rank for numeric features:

Tree-based models do not depend on them
Non-Tree-based models hugely depend on them

(2) Most often used preprocessings are:

MinMaxScaler - to [0,1]
StandardScaler - to mean == 0, std == 1
Rank - sets spaces between sorted values to be equal
np.log(1+x) and np.sqrt(1+x) for strange values

(3) Feature generation is powered by:

prior knowledge
Exploratory Data Analysis(EDA)

(4) Various Features

numeric
categorical
ordinal
datetime
coordinates

(5) Handling missing values

Lecture link : Coursera
