[Coursera] How to win a data science competition - Week 4, Lecture 3

환공지능 · July 26, 2021

1. Ensemble Method

  • Combining different machine learning models to obtain a more powerful prediction.
  • Methods range from simple averaging to various weighted-averaging schemes, as sketched below.
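A minimal sketch of the two basic combining schemes, simple and weighted averaging; the prediction arrays and the 0.7/0.3 weights are illustrative assumptions, not from the lecture.

import numpy as np

# predictions from two hypothetical models on the same test set
preds_a = np.array([0.2, 0.8, 0.5])
preds_b = np.array([0.4, 0.6, 0.7])

simple_avg = (preds_a + preds_b) / 2          # simple averaging
weighted_avg = 0.7 * preds_a + 0.3 * preds_b  # weighted averaging; the weights
                                              # would be tuned on validation data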

2. Bagging

Bagging means averaging slightly different versions of the same model to improve accuracy.

(1) Why Bagging?
: Prediction errors come from bias (underfitting) and variance (overfitting); averaging many slightly different models keeps the bias roughly the same while reducing the variance.

(2) Parameters that control bagging

  • Changing the seed
  • Row sampling or bootstrapping (sketched below)
  • Shuffling
  • Column sampling (sketched below)
  • Model-specific parameters
  • Number of models, or bags
  • Parallelism
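Before the full example, a hedged sketch of what row sampling (bootstrapping) and column sampling mean for a single bag; it assumes train and test are pandas DataFrames and y is a Series, as in the example below, and the 80% column fraction is illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=1)

# bootstrap the rows (sample with replacement) and keep 80% of the columns
row_idx = rng.choice(len(train), size=len(train), replace=True)
col_idx = rng.choice(train.shape[1], size=int(0.8 * train.shape[1]), replace=False)

model = RandomForestRegressor()
model.fit(train.iloc[row_idx, col_idx], y.iloc[row_idx])
preds = model.predict(test.iloc[:, col_idx])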

(3) Example of bagging

# train is the training data
# test is the test data
# y is the target variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1

bagged_prediction = np.zeros(test.shape[0])

for n in range(bags):
    model.set_params(random_state=seed + n)  # update the seed for each bag
    model.fit(train, y)
    preds = model.predict(test)
    bagged_prediction += preds

# take the average of the predictions
bagged_prediction /= bags

3. Boosting

A form of weighted averaging of models where each model is built sequentially, taking the performance of the previous models into account.

(1) Weight based boosting
Compute a weight for each sample according to a certain rule and add the weights as an extra input for the next model; a minimal sketch follows the list below.

  • Learning rate
  • Number of estimators
  • Input model - can be anything that accepts weights
  • Sub boosting type : AdaBoost, LogitBoost
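A minimal sketch of the weight-based idea in AdaBoost style, assuming illustrative names X, y, X_test with a binary target y in {-1, +1}; a conceptual sketch, not the lecture's code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_estimators = 10
learning_rate = 0.5
weights = np.ones(len(y)) / len(y)  # start from uniform sample weights
models, alphas = [], []

for _ in range(n_estimators):
    stump = DecisionTreeClassifier(max_depth=1)   # weak input model
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.clip(np.sum(weights * (pred != y)), 1e-10, 1 - 1e-10)
    alpha = learning_rate * 0.5 * np.log((1 - err) / err)  # model weight
    weights *= np.exp(-alpha * y * pred)          # upweight misclassified rows
    weights /= weights.sum()                      # renormalize
    models.append(stump)
    alphas.append(alpha)

# final prediction: sign of the weighted vote of all models
final = np.sign(sum(a * m.predict(X_test) for a, m in zip(alphas, models)))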

(2) Residual based boosting
Compute the error according to a certain rule and use the residual of the old prediction (y minus the previous prediction) as the new target for the next model; see the sketch after the list below.

  • Learning rate
  • Number of estimators
  • Row sampling
  • Column sampling
  • Input model - preferably trees
  • Sub boosting type : Fully gradient based, Dart
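A minimal sketch of residual-based (gradient) boosting with squared error, reusing train, test, and y from the earlier examples; the depth and learning-rate values are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

n_estimators = 100
learning_rate = 0.1
prediction = np.zeros(len(y))             # current ensemble prediction on train
test_prediction = np.zeros(test.shape[0])

for _ in range(n_estimators):
    residuals = y - prediction            # the new target is the current error
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(train, residuals)            # each tree fits the residuals
    prediction += learning_rate * tree.predict(train)
    test_prediction += learning_rate * tree.predict(test)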

This is the approach behind the dominant implementations such as XGBoost, LightGBM, H2O's GBM, and CatBoost!
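As a hedged usage sketch, the parameters listed above map directly onto these libraries' knobs; here with xgboost (assuming it is installed), all values illustrative.

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,      # number of estimators
    learning_rate=0.05,    # learning rate
    subsample=0.8,         # row sampling
    colsample_bytree=0.8,  # column sampling
)
model.fit(train, y)
preds = model.predict(test)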

4. Stacking

Stacking means making predictions with a number of models on a hold-out set and then training a different meta model on these predictions.
It is the most popular form in predictive modeling and is usually applied at the final stage.

(1) Stacking Example

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# train, test, and y are the same objects as in the bagging example
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)

model1 = RandomForestRegressor()
model2 = LinearRegression()

# fit both base models on the training half
model1.fit(training, ytraining)
model2.fit(training, ytraining)

# predict on the hold-out half
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

# predict on the test data
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# stack the predictions column-wise
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# specify the meta model
meta_model = LinearRegression()
# fit the meta model on the stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)

(2) Things to consider

  • With time-sensitive data, respect time (see the sketch below)
  • Diversity is as important as performance
  • Diversity may come from different algorithms or different input features
  • Performance plateaus after N models
  • The meta model is normally modest (a simple model)
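A hedged sketch of the time-respecting split: replace the random train_test_split from the example above with a split along the time axis, assuming (for illustration) that train is sorted in ascending time order.

# train assumed sorted by time; the 50% cutoff is arbitrary
cutoff = int(len(train) * 0.5)
training, valid = train.iloc[:cutoff], train.iloc[cutoff:]  # past vs. "future"
ytraining, yvalid = y.iloc[:cutoff], y.iloc[cutoff:]
# base models fit on the past; meta features are built on the later part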

5. StackNet

A scalable meta-modelling methodology that utilizes stacking to combine multiple models in a neural-network-like architecture of multiple levels.
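Not the StackNet library itself, but a hedged illustration of the multi-level idea using scikit-learn's StackingRegressor: nesting one stack inside another yields a small "network" in which every node is a model.

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# level 1: two base models combined by a linear meta model
level1 = StackingRegressor(
    estimators=[("rf", RandomForestRegressor()), ("ridge", Ridge())],
    final_estimator=LinearRegression(),
)
# level 2: the whole level-1 stack becomes one node alongside another model
level2 = StackingRegressor(
    estimators=[("stack1", level1), ("rf2", RandomForestRegressor())],
    final_estimator=LinearRegression(),
)
level2.fit(train, y)
final_predictions = level2.predict(test)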

6. Tips and Tricks

(1) 1st level tips

  • 2-3 gradient boosted trees (lightgbm, xgboost, catboost)
  • 2-3 Neural Net (keras, pytorch)
  • 1-2 ExtraTrees/Random Forest
  • 1-2 linear models, SVM
  • 1-2 KNN
  • 1 factorization machine (libfm)
  • 1 SVM with nonlinear kernel if size/memory allows

(2) Subsequent level tips

1) Simpler algorithms

  • gradient boosted trees with small depth like 2-3
  • linear models with high regularization
  • extra trees
  • shallow networks
  • KNN with braycurtis distance
  • Brute forcing a search for the best linear weights based on CV (a sketch follows this list)
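A minimal sketch of that brute-force weight search, reusing preds1/preds2, test_preds1/test_preds2, and yvalid from the stacking example; the 0.01 grid step and the MSE metric are illustrative choices.

import numpy as np
from sklearn.metrics import mean_squared_error

best_w, best_score = 0.0, float("inf")
for w in np.linspace(0, 1, 101):   # candidate weights 0.00, 0.01, ..., 1.00
    score = mean_squared_error(yvalid, w * preds1 + (1 - w) * preds2)
    if score < best_score:
        best_w, best_score = w, score

blended_test = best_w * test_preds1 + (1 - best_w) * test_preds2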

2) Feature engineering

  • pairwise differences between meta features (sketched below)
  • row-wise statistics like avg or std (sketched below)
  • standard feature selection techniques
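A small illustration of both ideas on the stacked_predictions matrix from the stacking example (one column per base model):

import numpy as np

diffs = stacked_predictions[:, 0] - stacked_predictions[:, 1]  # pairwise difference
row_avg = stacked_predictions.mean(axis=1)                     # row-wise average
row_std = stacked_predictions.std(axis=1)                      # row-wise std

meta_features = np.column_stack((stacked_predictions, diffs, row_avg, row_std))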

3) Be mindful of target leakage!
