[Coursera] How to win a data science competition - Week 4, Lecture 3

환공지능 · July 26, 2021

1. Ensemble Method

  • Combining different machine learning models to obtain a more powerful prediction.
  • Methods range from simple averaging to various weighted-averaging schemes, as sketched below.
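A minimal sketch of the two basic combining schemes, simple and weighted averaging; the prediction arrays and the 0.7/0.3 weights are illustrative assumptions, not from the lecture.

import numpy as np

# predictions from two hypothetical models on the same test set
preds_a = np.array([0.2, 0.8, 0.5])
preds_b = np.array([0.4, 0.6, 0.7])

simple_avg = (preds_a + preds_b) / 2          # simple averaging
weighted_avg = 0.7 * preds_a + 0.3 * preds_b  # weighted averaging; the weights
                                              # would be tuned on validation data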

2. Bagging

Bagging means averaging slightly different versions of the same model to improve accuracy.

(1) Why Bagging?
: Prediction errors come from bias (underfitting) and variance (overfitting); averaging many slightly different models keeps the bias roughly the same while reducing the variance.

(2) Parameters that control bagging

  • Changing the seed
  • Row sampling or bootstrapping (sketched below)
  • Shuffling
  • Column sampling (sketched below)
  • Model-specific parameters
  • Number of models, or bags
  • Parallelism
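Before the full example, a hedged sketch of what row sampling (bootstrapping) and column sampling mean for a single bag; it assumes train and test are pandas DataFrames and y is a Series, as in the example below, and the 80% column fraction is illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=1)

# bootstrap the rows (sample with replacement) and keep 80% of the columns
row_idx = rng.choice(len(train), size=len(train), replace=True)
col_idx = rng.choice(train.shape[1], size=int(0.8 * train.shape[1]), replace=False)

model = RandomForestRegressor()
model.fit(train.iloc[row_idx, col_idx], y.iloc[row_idx])
preds = model.predict(test.iloc[:, col_idx])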

(3) Example of bagging

# train is the training data
# test is the test data
# y is the target variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1

bagged_prediction = np.zeros(test.shape[0])

for n in range(bags):
    model.set_params(random_state=seed + n)  # update the seed for each bag
    model.fit(train, y)
    preds = model.predict(test)
    bagged_prediction += preds

# take the average of the predictions
bagged_prediction /= bags

3. Boosting

A form of weighted averaging of models where each model is built sequentially, taking the performance of the previous models into account.

(1) Weight based boosting
Compute a weight for each sample according to a certain rule and add the weights as an extra input for the next model; a minimal sketch follows the list below.

  • Learning rate
  • Number of estimators
  • Input model - can be anything that accepts weights
  • Sub boosting type : AdaBoost, LogitBoost
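A minimal sketch of the weight-based idea in AdaBoost style, assuming illustrative names X, y, X_test with a binary target y in {-1, +1}; a conceptual sketch, not the lecture's code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_estimators = 10
learning_rate = 0.5
weights = np.ones(len(y)) / len(y)  # start from uniform sample weights
models, alphas = [], []

for _ in range(n_estimators):
    stump = DecisionTreeClassifier(max_depth=1)   # weak input model
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.clip(np.sum(weights * (pred != y)), 1e-10, 1 - 1e-10)
    alpha = learning_rate * 0.5 * np.log((1 - err) / err)  # model weight
    weights *= np.exp(-alpha * y * pred)          # upweight misclassified rows
    weights /= weights.sum()                      # renormalize
    models.append(stump)
    alphas.append(alpha)

# final prediction: sign of the weighted vote of all models
final = np.sign(sum(a * m.predict(X_test) for a, m in zip(alphas, models)))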

(2) Residual based boosting
Compute the error according to a certain rule and use the residual of the old prediction (y minus the previous prediction) as the new target for the next model; see the sketch after the list below.

  • Learning rate
  • Number of estimators
  • Row sampling
  • Column sampling
  • Input model - preferably trees
  • Sub boosting type : Fully gradient based, Dart
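A minimal sketch of residual-based (gradient) boosting with squared error, reusing train, test, and y from the earlier examples; the depth and learning-rate values are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

n_estimators = 100
learning_rate = 0.1
prediction = np.zeros(len(y))             # current ensemble prediction on train
test_prediction = np.zeros(test.shape[0])

for _ in range(n_estimators):
    residuals = y - prediction            # the new target is the current error
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(train, residuals)            # each tree fits the residuals
    prediction += learning_rate * tree.predict(train)
    test_prediction += learning_rate * tree.predict(test)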

This is the approach behind the dominant implementations such as XGBoost, LightGBM, H2O's GBM, and CatBoost!
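As a hedged usage sketch, the parameters listed above map directly onto these libraries' knobs; here with xgboost (assuming it is installed), all values illustrative.

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,      # number of estimators
    learning_rate=0.05,    # learning rate
    subsample=0.8,         # row sampling
    colsample_bytree=0.8,  # column sampling
)
model.fit(train, y)
preds = model.predict(test)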

4. Stacking

Stacking means making predictions with a number of models on a hold-out set and then training a different meta model on these predictions.
It is the most popular form in predictive modeling and is usually applied at the final stage.

(1) Stacking Example

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# train, test, and y are the same objects as in the bagging example
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)

model1 = RandomForestRegressor()
model2 = LinearRegression()

# fit both base models on the training half
model1.fit(training, ytraining)
model2.fit(training, ytraining)

# predict on the hold-out half
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

# predict on the test data
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# stack the predictions column-wise
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# specify the meta model
meta_model = LinearRegression()
# fit the meta model on the stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)

(2) Things to consider

  • With time-sensitive data, respect time (see the sketch below)
  • Diversity is as important as performance
  • Diversity may come from different algorithms or different input features
  • Performance plateaus after N models
  • The meta model is normally modest (a simple model)
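A hedged sketch of the time-respecting split: replace the random train_test_split from the example above with a split along the time axis, assuming (for illustration) that train is sorted in ascending time order.

# train assumed sorted by time; the 50% cutoff is arbitrary
cutoff = int(len(train) * 0.5)
training, valid = train.iloc[:cutoff], train.iloc[cutoff:]  # past vs. "future"
ytraining, yvalid = y.iloc[:cutoff], y.iloc[cutoff:]
# base models fit on the past; meta features are built on the later part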

5. StackNet

A scalable meta-modelling methodology that utilizes stacking to combine multiple models in a neural-network-like architecture of multiple levels.
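Not the StackNet library itself, but a hedged illustration of the multi-level idea using scikit-learn's StackingRegressor: nesting one stack inside another yields a small "network" in which every node is a model.

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# level 1: two base models combined by a linear meta model
level1 = StackingRegressor(
    estimators=[("rf", RandomForestRegressor()), ("ridge", Ridge())],
    final_estimator=LinearRegression(),
)
# level 2: the whole level-1 stack becomes one node alongside another model
level2 = StackingRegressor(
    estimators=[("stack1", level1), ("rf2", RandomForestRegressor())],
    final_estimator=LinearRegression(),
)
level2.fit(train, y)
final_predictions = level2.predict(test)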

6. Tips and Tricks

(1) 1st level tips

  • 2-3 gradient boosted trees (lightgbm, xgboost, catboost)
  • 2-3 Neural Net (keras, pytorch)
  • 1-2 ExtraTrees/Random Forest
  • 1-2 linear models, SVM
  • 1-2 KNN
  • 1 factorization machine (libfm)
  • 1 SVM with nonlinear kernel if size/memory allows

(2) Subsequent level tips

1) Simpler algorithms

  • gradient boosted trees with small depth like 2-3
  • linear models with high regularization
  • extra trees
  • shallow networks
  • KNN with braycurtis distance
  • Brute forcing a search for the best linear weights based on CV (a sketch follows this list)
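A minimal sketch of that brute-force weight search, reusing preds1/preds2, test_preds1/test_preds2, and yvalid from the stacking example; the 0.01 grid step and the MSE metric are illustrative choices.

import numpy as np
from sklearn.metrics import mean_squared_error

best_w, best_score = 0.0, float("inf")
for w in np.linspace(0, 1, 101):   # candidate weights 0.00, 0.01, ..., 1.00
    score = mean_squared_error(yvalid, w * preds1 + (1 - w) * preds2)
    if score < best_score:
        best_w, best_score = w, score

blended_test = best_w * test_preds1 + (1 - best_w) * test_preds2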

2) Feature engineering

  • pairwise differences between meta features (sketched below)
  • row-wise statistics like avg or std (sketched below)
  • standard feature selection techniques
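A small illustration of both ideas on the stacked_predictions matrix from the stacking example (one column per base model):

import numpy as np

diffs = stacked_predictions[:, 0] - stacked_predictions[:, 1]  # pairwise difference
row_avg = stacked_predictions.mean(axis=1)                     # row-wise average
row_std = stacked_predictions.std(axis=1)                      # row-wise std

meta_features = np.column_stack((stacked_predictions, diffs, row_avg, row_std))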

3) Be mindful of target leakage!
