💠 AIchemist 7th Session | Regression Kaggle Transcription

yellowsubmarine372 · November 13, 2023


01. Bike Rental Demand Prediction

์ž์ „๊ฑฐ ๋Œ€์—ฌ ์‹œ์Šคํ…œ
โ€ข ๋„์‹œ ์ „์—ญ์˜ ํ‚ค์˜ค์Šคํฌ ์œ„์น˜ ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ตํ•ด ํšŒ์›๊ฐ€์ž…,๋Œ€์—ฌ ๋ฐ ์ž์ „๊ฑฐ ๋ฐ˜ํ™˜ ํ”„๋กœ์„ธ์Šค๊ฐ€ ์ž๋™ํ™”๋˜๋Š” ์ž์ „๊ฑฐ๋ฅผ ๋Œ€์—ฌํ•˜๋Š” ์ˆ˜๋‹จ
โ€ข ์ „์„ธ๊ณ„ 500๊ฐœ ์ด์ƒ์˜ ์ž์ „๊ฑฐ ๋Œ€์—ฌ ํ”„๋กœ๊ทธ๋žจ์ด ์žˆ์Œ
=> ์ž์ „๊ฑฐ ๋Œ€์—ฌ ์‹œ์Šคํ…œ์—์„œ ์ƒ์„ฑ๋œ ๋ฐ์ดํ„ฐ๋Š” ์—ฌํ–‰ ๊ธฐ๊ฐ„, ์ถœ๋ฐœ ์œ„์น˜, ๋„์ฐฉ ์œ„์น˜ ๋ฐ ๊ฒฝ๊ณผ ์‹œ๊ฐ„์ด ๋ช…์‹œ์ ์œผ๋กœ ๊ธฐ๋ก๋˜๊ธฐ ๋•Œ๋ฌธ์— ์„ผ์„œ ๋„คํŠธ์›Œํฌ๋กœ์„œ ๊ธฐ๋Šฅํ•˜๋ฉฐ, ์ด๋Š” ๋„์‹œ์˜ ์ด๋™์„ฑ์„ ์—ฐ๊ตฌํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ ๊ฐ€๋Šฅ

Bike Sharing Demand

Data Description

  • datetime: hourly date + timestamp
  • season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday: 1 = a holiday such as a national holiday (excluding Saturday/Sunday weekends), 0 = not a holiday
  • workingday: 1 = a weekday that is neither a weekend (Sat/Sun) nor a holiday, 0 = a weekend or holiday
  • weather
    • 1 = clear, or partly cloudy
    • 2 = mist, mist + cloudy
    • 3 = light snow, light rain + thunderstorm
    • 4 = heavy snow/rain, thunder/lightning
  • temp: temperature (Celsius)
  • atemp: "feels like" temperature (Celsius)
  • humidity: relative humidity
  • windspeed: wind speed
  • casual: number of rentals by non-registered users
  • registered: number of rentals by registered users
  • count: total number of rentals

๋ฐ์ดํ„ฐ ํด๋ Œ์ง• ๋ฐ ๊ฐ€๊ณต๊ณผ ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”

๋ชจ๋ธ์„ ํ•™์Šตํ•ด ๋Œ€์—ฌ ํšŸ์ˆ˜(count)๋ฅผ ์˜ˆ์ธก

  • ๋ฐ์ดํ„ฐ ํ™•์ธ

>> Only the datetime column is of object type
Split it into four attributes: year/month/day/hour

# Convert the string column to datetime type.
bike_df['datetime'] = bike_df.datetime.apply(pd.to_datetime)

# Extract year, month, day, and hour from the datetime type
bike_df['year'] = bike_df.datetime.apply(lambda x: x.year)
bike_df['month'] = bike_df.datetime.apply(lambda x: x.month)
bike_df['day'] = bike_df.datetime.apply(lambda x: x.day)
bike_df['hour'] = bike_df.datetime.apply(lambda x: x.hour)
bike_df.head(3)
  • Drop columns

casual + registered = count
These two columns are highly correlated with the target and risk distorting the prediction, so drop both.


  • Examine the distribution of the target, count

(1) Visualization

fig, axs = plt.subplots(figsize=(16, 8), ncols=4, nrows=2)
cat_features = ['year', 'month', 'season', 'weather', 'day', 'hour', 'holiday', 'workingday']
# For every column in cat_features, visualize the sum of count per column value as a bar plot
for i, feature in enumerate(cat_features):
    row = int(i/4)
    col = i%4
    # Use seaborn's barplot to show the sum of count for each column value
    sns.barplot(x=feature, y='count', data=bike_df, ax=axs[row][col])

(Bar plots of count for each of the 8 categorical columns)

Looking at count by year, bike rentals increase as time passes. By month, June through September are high; by season, summer and fall are high; for weather, clear or misty conditions are high; by hour, the morning and evening commute hours are relatively high; day shows little difference; and for holiday/workingday, weekdays are slightly higher.


  • ๋ชจ๋ธ ํ›ˆ๋ จ ์‹œ์ž‘

๊ฐ ํšŒ๊ท€๋ชจ๋ธ๋ณ„๋กœ RMSLE ์ถœ๋ ฅ (์ž์ „๊ฑฐ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๊ฐ€์žฅ ์ ํ•ฉํ•œ ํšŒ๊ท€๋ชจ๋ธ ์ฐพ๊ธฐ)

RMSLE (Root Mean Squared Logarithmic Error)

1) Less sensitive to outliers (the value does not fluctuate much even when outliers are present)
2) Measures relative error (RMSE grows as the absolute magnitude of the values grows, but RMSLE stays the same as long as the relative magnitudes are the same)
3) Puts a larger penalty on underestimation
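For reference, with predictions $p_i$ and actual values $a_i$ over $n$ samples:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}$$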

Use log1p() to prevent underflow (restore with expm1(), which computes exp(x) − 1)
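A quick numpy check of that round trip (illustrative only):

import numpy as np

x = np.array([0.0, 1e-15, 100.0])
y = np.log1p(x)    # log(1 + x); accurate even for tiny x where 1 + x would lose precision
np.expm1(y)        # exp(y) - 1 restores the original array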

RMSLE implementation

from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Use log1p() instead of log() when computing RMSLE, to avoid NaN issues during the log conversion
def rmsle(y, pred):
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)
    squared_error = (log_y - log_pred) ** 2
    rmsle = np.sqrt(np.mean(squared_error))
    return rmsle

# Compute RMSE with scikit-learn's mean_squared_error()
def rmse(y, pred):
    return np.sqrt(mean_squared_error(y, pred))

# Compute RMSLE, RMSE, and MAE all at once
def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    # MAE uses scikit-learn's mean_absolute_error()
    mae_val = mean_absolute_error(y, pred)
    print('RMSLE: {0:.3f}, RMSE: {1:.3f}, MAE: {2:.3f}'.format(rmsle_val, rmse_val, mae_val))

Log Transformation, Feature Encoding, and Model Training/Prediction/Evaluation

1. Check whether the target values are normally distributed

  • ์˜ค๋ฅ˜๊ฐ’ ๋น„๊ต (์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’ ์ฐจ์ด ๋น„๊ต)
def get_top_error_data(y_test, pred, n_tops = 5):
    # DataFrame์— ์ปฌ๋Ÿผ๋“ค๋กœ ์‹ค์ œ ๋Œ€์—ฌํšŸ์ˆ˜(count)์™€ ์˜ˆ์ธก ๊ฐ’์„ ์„œ๋กœ ๋น„๊ต ํ•  ์ˆ˜ ์žˆ๋„๋ก ์ƒ์„ฑ. 
    result_df = pd.DataFrame(y_test.values, columns=['real_count'])
    result_df['predicted_count']= np.round(pred)
    result_df['diff'] = np.abs(result_df['real_count'] - result_df['predicted_count'])
    # ์˜ˆ์ธก๊ฐ’๊ณผ ์‹ค์ œ๊ฐ’์ด ๊ฐ€์žฅ ํฐ ๋ฐ์ดํ„ฐ ์ˆœ์œผ๋กœ ์ถœ๋ ฅ. 
    print(result_df.sort_values('diff', ascending=False)[:n_tops])
    
get_top_error_data(y_test,pred,n_tops=5)

This confirms that the prediction error is large.

▶︎ First check whether the distribution of the target values is skewed (a normal distribution is best)

The values are heavily concentrated (skewed) in the 0–200 range.

  • Apply a log transform (sketched below)
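A minimal sketch of the transform, assuming bike_df still contains the count column (X_features and y_target_log are the names the later code uses):

# Separate the target and log-transform it; restore predictions later with np.expm1()
y_target = bike_df['count']
X_features = bike_df.drop(['count'], axis=1, inplace=False)
y_target_log = np.log1p(y_target)
y_target_log.hist()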

// Adding a further transformation toward a normal distribution here would probably improve performance as well

2. Encoding the individual features

year, hour, month, etc. are expressed as numbers but are all categorical features.
Numeric category values fed to linear regression as-is would unduly influence the model through their magnitude, so apply one-hot encoding.

# One-hot encode features such as 'year', 'month', 'day', 'hour'
X_features_ohe = pd.get_dummies(X_features, columns =['year', 'month', 'day', 'hour', 'holiday',
                                                     'workingday', 'season', 'weather'])
  • Linear regression models
# Split into train/test data based on the one-hot encoded feature set
X_train, X_test, y_train, y_test = train_test_split(X_features_ohe, y_target_log,
                                                   test_size =0.3, random_state=0)

# ๋ชจ๋ธ๊ณผ ํ•™์Šต/ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ์„ฑ๋Šฅ ํ‰๊ฐ€ ์ˆ˜์น˜๋ฅผ ๋ฐ˜ํ™˜
def get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=False):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    if is_expm1:
        y_test = np.expm1(y_test)
        pred = np.expm1(pred)
    print('###', model.__class__.__name__, '###')
    evaluate_regr(y_test, pred)
#end of function get_model_predict

# ๋ชจ๋ธ๋ณ„๋กœ ํ‰๊ฐ€ ์ˆ˜ํ–‰
lr_reg = LinearRegression()
ridge_reg = Ridge(alpha=10)
lasso_reg = Lasso(alpha = 0.01)

for model in [lr_reg, ridge_reg, lasso_reg]:
    get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=True)
  • ๋ Œ๋ค ํฌ๋ ˆ์ŠคํŠธ, GBM, XGBoost, LightGBM
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

#๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ,GBM , XGBoos(t, LightGBM model๋ณ„๋กœ ํ‰๊ฐ€ ์ˆ˜ํ–‰
rf_reg = RandomForestRegressor(n_estimators=500)
gbm_reg = GradientBoostingRegressor(n_estimators=500)
xgb_reg = XGBRegressor(n_estimators = 500)
lgbm_reg = LGBMRegressor(n_estimators = 500)

for model in [rf_reg, gbm_reg, xgb_reg, lgbm_reg]:
    # Depending on the version, XGBoost may error when given a DataFrame, so convert to ndarray.
    get_model_predict(model, X_train.values, X_test.values, y_train.values, 
                      y_test.values, is_expm1=True)


02. Kaggle House Prices: Advanced Regression Techniques

House price prediction
79 explanatory variables describe nearly every aspect of residential homes in Ames, Iowa.
=> The competition challenges you to use them to predict the final price of each home.

House Prices - Advanced Regression Techniques

Data Description

  • SalePrice: the property's sale price (in dollars)
  • GrLivArea: above-ground living area size
  • CentralAir: central air conditioning
  • OverallQual: overall material and finish quality
  • OverallCond: overall condition rating
  • RoofStyle: type of roof
  • 1stFlrSF: first-floor square feet
  • PavedDrive: paved driveway
  • Fence: fence quality
  • SaleType: type of sale
  • LotFrontage: linear feet of street connected to the property
  • Street: type of road access

๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ ๋ฐ ๊ฐ€๊ณต

The target is SalePrice. Of the 80 features, 43 are string-typed, and some features have many nulls (PoolQC, MiscFeature, Alley, and Fence each have more than 1,000 null values).

  • Check whether the target is normally distributed
plt.title('Original Sale Price Histogram')
plt.xticks(rotation=45)
sns.histplot(house_df['SalePrice'], kde=True)
plt.show()

The target values are skewed to the left of center, departing from a normal distribution.
Log-transform with log1p(), then restore the final predictions with expm1().

plt.title('Log Transformed Sale Price Histogram')
log_SalePrice = np.log1p(house_df['SalePrice'])
sns.histplot(log_SalePrice, kde=True)
plt.show()

I don't get this. If you convert back, doesn't it become identical to the original, making the transform pointless?

Eunbyul 💬
Apply the log transform only while training, and restore the data that keeps being processed to its original scale.
Only training uses the normalized values.

  • Drop null-heavy features

Drop PoolQC, MiscFeature, Alley, Fence, and FireplaceQu.
For the remaining null features, impute the numeric ones with the mean.

# Log-transform SalePrice
original_SalePrice = house_df['SalePrice']
house_df['SalePrice'] = np.log1p(house_df['SalePrice'])

# Drop columns with too many nulls plus unneeded columns
house_df.drop(['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1, inplace=True)
# Numeric null columns that are not dropped are imputed with the mean
house_df['LotFrontage'].fillna(house_df['LotFrontage'].mean(), inplace=True)
house_df['MasVnrArea'].fillna(house_df['MasVnrArea'].mean(), inplace=True)
house_df['GarageYrBlt'].fillna(house_df['GarageYrBlt'].mean(), inplace=True)
# LotFrontage, MasVnrArea, and GarageYrBlt contain nulls

# Extract the names and types of the features that still have null values
null_column_count = house_df.isnull().sum()[house_df.isnull().sum() > 0]
print('## Types of features with nulls:\n', house_df.dtypes[null_column_count.index])

Fixed the part where the book's code raises an error.


  • One-hot encode the string features

get_dummies() automatically one-hot encodes string features and converts null values to 0, so no separate null-imputation logic is needed (a one-line sketch follows).
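A sketch of the encoding call; house_df_ohe is the name the later code expects:

# get_dummies() one-hot encodes every remaining object (string) column
house_df_ohe = pd.get_dummies(house_df)
print('Shape before/after encoding:', house_df.shape, house_df_ohe.shape)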


Training/Predicting/Evaluating Linear Regression Models

The target SalePrice was log-transformed, so the model's predictions are log-transformed SalePrice values.
>> Therefore, applying plain RMSE to the prediction errors automatically measures RMSLE (RMSE on the log scale).

def get_rmse(model):
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rmse = np.sqrt(mse)
    print(model.__class__.__name__, 'log-transformed RMSE:', np.round(rmse, 3))
    return rmse
  • Train the linear regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

y_target = house_df_ohe['SalePrice']
X_features = house_df_ohe.drop('SalePrice', axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.2, random_state=156)

# Helper that calls get_rmse() for every model (needed by the call below)
def get_rmses(models):
    rmses = []
    for model in models:
        rmses.append(get_rmse(model))
    return rmses

# Train, predict, and evaluate LinearRegression, Ridge, and Lasso
lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)

ridge_reg = Ridge()
ridge_reg.fit(X_train, y_train)

lasso_reg = Lasso()
lasso_reg.fit(X_train, y_train)

models = [lr_reg, ridge_reg, lasso_reg]
get_rmses(models)
[Output]

LinearRegression log-transformed RMSE: 0.01
Ridge log-transformed RMSE: 0.01
Lasso log-transformed RMSE: 0.018
[0.010481899993240616, 0.00997580081727205, 0.01795924469011489]

>> Lasso regression performs noticeably worse.

  • Visualize the regression coefficients

Plot bar charts of the 10 largest and 10 smallest regression coefficients (one way to do it is sketched below).
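A sketch of pulling the top/bottom coefficients for plotting, assuming the fitted models and X_features from above:

def get_top_bottom_coef(model, n=10):
    # Map coefficients back to feature names, then take the extremes
    coef = pd.Series(model.coef_, index=X_features.columns)
    coef_high = coef.sort_values(ascending=False).head(n)
    coef_low = coef.sort_values(ascending=False).tail(n)
    return coef_high, coef_low

coef_high, coef_low = get_top_bottom_coef(lasso_reg)
sns.barplot(x=pd.concat([coef_high, coef_low]).values,
            y=pd.concat([coef_high, coef_low]).index)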

๋ผ์˜์˜ ์ „์ฒด์  ํšŒ๊ท€ ๊ณ„์ˆ˜๊ฐ’์ด ๋งค์šฐ ์ž‘๊ณ , YearBuilt๊ฐ€ ๊ฐ€์žฅ ํฌ๊ณ  ๋‹ค๋ฅธ ํ”ผ์ฒ˜์˜ ํšŒ๊ท€ ๊ณ„์ˆ˜๋Š” ๋„ˆ๋ฌด ์ž‘์Œ.

I. Improving the Lasso Regression

1. Improve the training-data split

Split the training set into 5 cross-validation folds

from sklearn.model_selection import cross_val_score

def get_avg_rmse_cv(models):
    for model in models:
        # Run cross_val_score() on the full, unsplit data; print each model's CV RMSE values and their mean
        rmse_list = np.sqrt(-cross_val_score(model, X_features, y_target,
                                            scoring="neg_mean_squared_error", cv=5))
        rmse_avg = np.mean(rmse_list)
        print('\n{0} CV RMSE values: {1}'.format(model.__class__.__name__, np.round(rmse_list, 3)))
        print('{0} CV mean RMSE: {1}'.format(model.__class__.__name__, np.round(rmse_avg, 3)))
        
# Print the CV RMSE values for the ridge_reg and lasso_reg models trained earlier
models = [ridge_reg, lasso_reg]
get_avg_rmse_cv(models)

2. Find the optimal alpha hyperparameter for the Ridge and Lasso models

from sklearn.model_selection import GridSearchCV

def print_best_params(model, params):
    grid_model = GridSearchCV(model, param_grid=params, 
                             scoring='neg_mean_squared_error', cv=5)
    grid_model.fit(X_features, y_target)
    rmse = np.sqrt(-1*grid_model.best_score_)
    print('{0} 5-fold CV best mean RMSE: {1}, best alpha: {2}'.format(model.__class__.__name__,
                                                             np.round(rmse, 4), grid_model.best_params_))
ridge_params = {'alpha': [0.05, 0.1, 1, 5, 8, 10, 12, 15, 20]}
lasso_params = {'alpha': [0.001, 0.005, 0.008, 0.05, 0.03, 0.1, 0.5, 1, 5, 10]}
print_best_params(ridge_reg, ridge_params)
print_best_params(lasso_reg, lasso_params)

Re-run training/prediction/evaluation with the optimal alpha values (sketched below).
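A sketch of that re-run, with the alpha values assumed for illustration (not necessarily what the grid search returns):

# Retrain with the best alphas from the grid search (values assumed)
ridge_reg = Ridge(alpha=12)
ridge_reg.fit(X_train, y_train)
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(X_train, y_train)
models = [lr_reg, ridge_reg, lasso_reg]
get_rmses(models)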

II. Data Preprocessing

1. Data distribution of the feature set

If the feature set contains excessively skewed features, regression prediction performance can suffer.

  • skew()
    Skewness: a statistic that shows the degree of asymmetry in a data distribution

์™œ๋„ 1์ด์ƒ์˜ ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ”ผ์ฒ˜๋งŒ ์ถ”์ถœํ•ด ์™œ๊ณก ์ •๋„๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋กœ๊ทธ ๋ณ€ํ™˜์„ ์ ์šฉ. ์ˆซ์žํ˜• ํ”ผ์ฒ˜์˜ ์นผ๋Ÿผ index ๊ฐ์ฒด๋ฅผ ์ถ”์ถœํ•ด ์ˆซ์žํ˜• ์นผ๋Ÿผ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ apply lambda์‹ skew()๋ฅผ ํ˜ธ์ถœํ•ด ์ˆซ์žํ˜• ํ”ผ์ฒ˜์˜ ์™œ๊ณก ์ •๋„ ์ถœ๋ ฅ

from scipy.stats import skew

# Extract the column index of the numeric (non-object) features.
features_index = house_df.dtypes[house_df.dtypes != 'object'].index
# Indexing house_df with a column index returns the matching column dataset. Call skew() via apply lambda
skew_features = house_df[features_index].apply(lambda x: skew(x))
# Extract only the columns with skewness of 1 or more.
skew_features_top = skew_features[skew_features > 1]
print(skew_features_top.sort_values(ascending=False))

โ—๏ธ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ ์นดํ…Œ๊ณ ๋ฆฌ ์ˆซ์žํ˜• ํ”ผ์ฒ˜๋Š” ์ œ์™ธ - ์ธ์ฝ”๋”ฉ ์‹œ ๋‹น์—ฐํžˆ ์™œ๊ณก๋  ๊ฐ€๋Šฅ์„ฑ ํผ

Log-transform the extracted highly skewed features

house_df[skew_features_top.index] = np.log1p(house_df[skew_features_top.index])

2. Outlier data

Analyze the data distribution of GrLivArea, the feature with the largest regression coefficient in all three models

Treat them as outlier data and delete them all

(But since everything has been log-transformed, the outlier conditions must reflect that -> bound the thresholds with log1p(x))

# Both GrLivArea and SalePrice were log-transformed, so build the outlier conditions accordingly.
cond1 = house_df_ohe['GrLivArea'] > np.log1p(4000)
cond2 = house_df_ohe['SalePrice'] < np.log1p(500000)
outlier_index = house_df_ohe[cond1 & cond2].index
# Drop the rows flagged as outliers
house_df_ohe.drop(outlier_index, axis=0, inplace=True)

"This does not mean you should perform flawless data preprocessing before applying machine learning algorithms. Performing rough data processing and model optimization first, and then iteratively repeating various data-processing techniques and hyperparameter-based model optimization on top of that, is the desirable model-building process." (p. 390, Python Machine Learning Perfect Guide)


Final Prediction by Blending Regression Models' Predictions

Blend the prediction values of the individual regression models and predict the final regression value from the mix
➢ You can simply combine the predictions of different models!

- Given predictions from models A and B, add 40% of A's prediction and 60% of B's prediction to get the final regression value

Ex) Regression model A predictions: [100, 80, 60]
    Regression model B predictions: [120, 80, 50]

Final blended prediction: [100*0.4 + 120*0.6, 80*0.4 + 80*0.6, 60*0.4 + 50*0.6] = [112, 80, 54]

pred = 0.4 * ridge_pred + 0.6 * lasso_pred

Compute each model's predictions, then get the RMSE of the individual models and of the final blended model

def get_rmse_pred(preds):
    for key in preds.keys():
        pred_value = preds[key]
        mse = mean_squared_error(y_test, pred_value)
        rmse = np.sqrt(mse)
        print('{0} model RMSE: {1}'.format(key, rmse))

# Train the individual models
ridge_reg = Ridge(alpha=8)
ridge_reg.fit(X_train, y_train)
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(X_train, y_train)
# Predict with the individual models
ridge_pred = ridge_reg.predict(X_test)
lasso_pred = lasso_reg.predict(X_test)

# Blend the individual predictions into the final prediction
pred = 0.4 * ridge_pred + 0.6 * lasso_pred
preds = {'Final blend': pred,
         'Ridge': ridge_pred,
         'Lasso': lasso_pred}
# Print RMSE for the final blend and the individual models
get_rmse_pred(preds)

Regression Prediction with a Stacking Ensemble

I don't remember a single thing about the stacking ensemble we covered in classification...

Two kinds of models are needed: individual base models, and a final meta model that is trained on a training set built from those base models' predictions

  • The feature dataset the final meta model trains on is the individual models' predictions (made from the original training feature set) combined in stacked form
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# Function that creates the training and test data the final meta model will use, from each base model.
def get_stacking_base_datasets(model, X_train_n, y_train_n, X_test_n, n_folds):
    # Create a KFold with the given n_folds.
    kf = KFold(n_splits=n_folds, shuffle=False)
    # Initialize the numpy arrays that will hold the meta model's training data.
    train_fold_pred = np.zeros((X_train_n.shape[0], 1))
    test_pred = np.zeros((X_test_n.shape[0], n_folds))
    print(model.__class__.__name__, ' model started')
    
    for folder_counter, (train_index, valid_index) in enumerate(kf.split(X_train_n)):
        # From the input training data, extract the fold data the base model will train on and predict.
        print('\t fold set:', folder_counter, ' started')
        X_tr = X_train_n[train_index] 
        y_tr = y_train_n[train_index] 
        X_te = X_train_n[valid_index]  
        
        # Train the base model on the training portion of this fold.
        model.fit(X_tr, y_tr)       
        # Predict the validation portion of this fold and store the result.
        train_fold_pred[valid_index, :] = model.predict(X_te).reshape(-1, 1)
        # Predict the original test data with the base model trained on this fold and store the result.
        test_pred[:, folder_counter] = model.predict(X_test_n)
            
    # Average the per-fold predictions of the original test data to build the meta model's test data.
    test_pred_mean = np.mean(test_pred, axis=1).reshape(-1, 1)    
    
    # train_fold_pred is the meta model's training data; test_pred_mean is its test data.
    return train_fold_pred, test_pred_mean

Inside the function, each individual model re-extracts training data from the K-fold splits of the original training data, trains, predicts, and stores the results; the stored predictions are later used as the meta model's training feature set. The base model trained on each fold's training data also predicts the original test data passed in as an argument, and those per-fold predictions are averaged to create the meta model's test data.
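The ridge_train/lasso_train/xgb_train/lgbm_train arrays used below come from calling this function once per base model; a sketch assuming XGBoost and LightGBM regressors (xgb_reg, lgbm_reg) were created alongside the tuned Ridge and Lasso models:

# The function indexes ndarrays, so convert the DataFrames first
X_train_n = X_train.values
X_test_n = X_test.values
y_train_n = y_train.values

ridge_train, ridge_test = get_stacking_base_datasets(ridge_reg, X_train_n, y_train_n, X_test_n, 5)
lasso_train, lasso_test = get_stacking_base_datasets(lasso_reg, X_train_n, y_train_n, X_test_n, 5)
xgb_train, xgb_test = get_stacking_base_datasets(xgb_reg, X_train_n, y_train_n, X_test_n, 5)
lgbm_train, lgbm_test = get_stacking_base_datasets(lgbm_reg, X_train_n, y_train_n, X_test_n, 5)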


  • Apply the final meta model, a Lasso model
# Combine the training and test datasets returned by the individual models in stacked form.
Stack_final_X_train = np.concatenate((ridge_train, lasso_train, 
                                      xgb_train, lgbm_train), axis=1)
Stack_final_X_test = np.concatenate((ridge_test, lasso_test, 
                                     xgb_test, lgbm_test), axis=1)

# Use a Lasso model as the final meta model.
meta_model_lasso = Lasso(alpha=0.0005)

# Fit on the new training data built from the base models' predictions, predict, and measure RMSE.
meta_model_lasso.fit(Stack_final_X_train, y_train)
final = meta_model_lasso.predict(Stack_final_X_test)
mse = mean_squared_error(y_test, final)
rmse = np.sqrt(mse)
print('Final RMSE of the stacking regression model:', rmse)

03. Medical Cost Personal

๊ฐœ์ธ ๋ณดํ—˜๋ฃŒ ์˜ˆ์ธก
์—ฌ๋Ÿฌ feature์„ ๊ฐ€์ง„ ์‚ฌ๋žŒ์˜ ๋ณดํ—˜๋ฃŒ๋ฅผ ์˜ˆ์ธก (age, sex, bmi, children, smoker, region)
=> ์˜๋ฃŒ๋ณดํ—˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•ด ํ•œ ์‚ฌ๋žŒ์ด ๋ณดํ—˜๋ฃŒ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋‚ผ์ง€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํšŒ๊ท€ ๋ฌธ์ œ

Analysis approach

  • Preprocess the data feature by feature
  • Remove outlier data
  • Linear Model, Decision Tree, Ensemble (Random Forest, AdaBoost, Gradient Boosting), Boosting (XGBoost, LightGBM)

Medical Cost Personal

Reference notebook source

Data Description

Age: age of the insured
Sex: sex of the insured
BMI: body mass index of the insured
Children: number of the insured's children
Smoker: whether the insured smokes (yes / no)
Region: the region where the insured lives (Southeast / Southwest / Northeast / Northwest)
Charges: insurance premium

Exploratory Data Analysis

  • Object-typed features will need one-hot encoding later
  • No NaN or Null values, so no separate null handling is needed

  • One issue: the standard deviation of children is similar to its mean
  • charges alone is on a different order of magnitude! -> But it is the dependent variable, so no scaling is needed (scaling would be essential if it were an independent variable)

Visualization

Check the feature distributions

(The improvements above are worth considering -> will get to them step by step!)

์ƒ๊ด€๊ด€๊ณ„

์ƒ๊ด€๊ด€๊ณ„
๋‘ ๋ณ€์ˆ˜ ์‚ฌ์ด ์ƒ๊ด€๊ด€๊ณ„ ์ •๋„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ˆ˜์น˜
๋‘ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ์‚ฐ์ ๋„์—์„œ ์ ๋“ค์ด ์–ผ๋งˆ๋‚˜ ์ง์„ ์— ๊ฐ€๊นŒ์šด๊ฐ€์˜ ์ •๋„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฐ ์“ฐ์ด๋Š” ์ฒ™๋„
(ํฐ ์ƒ๊ด€๊ณ„์ˆ˜ ๊ฐ’์ด ํ•ญ์ƒ ๋‘ ๋ณ€์ˆ˜ ์‚ฌ์ด ์–ด๋–ค ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ์˜๋ฏธํ•˜์ง€ ์•Š์Œ

The correlation coefficient between the age and charges features is high >> clearly linear in the plot
Improving the age feature ▶ leads to improving the whole model!!

Binning age

# Set up the age bins
bins = [0, 20, 25, 30, 35, 40, 45, 50, 55, 60, np.inf]
age_bin = pd.cut(df['age'], bins=bins, labels=[i+1 for i in range(len(bins)-1)])
df['age_bin'] = age_bin
df.head()

Outlier Detection

Checking for outliers with the box plot we learned in the classification session:

bmi has outliers!! They need to be removed~!

So how do we remove them?
A well-known outlier-removal technique:

Outlier removal using the IQR

IQR reference material

# Identify outliers from the IQR (Q3 - Q1)
bmi_q1 = df['bmi'].quantile(q=0.25)
bmi_q3 = df['bmi'].quantile(q=0.75)
iqr = bmi_q3 - bmi_q1

# Values outside (q1 - 1.5*iqr) and (q3 + 1.5*iqr) are outliers
condi1 = (df['bmi'] < (bmi_q1 - (1.5 * iqr)))
condi2 = (df['bmi'] > (bmi_q3 + (1.5 * iqr)))
outliers = df[condi1 | condi2]
outliers['bmi'] 
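The snippet above only flags the outliers; actually removing them is then one line (a sketch, dropping by index):

# Remove the rows whose bmi falls outside the 1.5*IQR fences
df = df.drop(outliers.index, axis=0)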

Scaling, Transforming and Encoding

  • Introduction to scaling

// The PPT copyright belongs to me~

  • Try each scaling method
# Apply Yeo-Johnson power transformation, quantile transformation, and log transformation to the numeric variables
# Scaling
to_scale = ['age', 'bmi', 'children', 'charges']
df_to_scale = df[to_scale].copy()

quantile = QuantileTransformer(n_quantiles=100, random_state=42, output_distribution='normal') # map to a normal distribution using 100 quantiles
power = PowerTransformer(method='yeo-johnson') # transform each feature to be closer to a normal distribution
q_scaled = quantile.fit_transform(df_to_scale)
yj = power.fit_transform(df_to_scale)

q_scaled_df = pd.DataFrame(q_scaled, columns=to_scale)
scaled_df = pd.DataFrame(yj, columns=to_scale)
logged_df = pd.DataFrame(np.log1p(df_to_scale), columns=to_scale)

fig, ax = plt.subplots(4, 4, figsize=(40, 30))

for i in range(4):
    idx = 0
    for j in range(4): # columns of the subplots
        colname = to_scale[idx]
        if i == 0 :
            ax[i][j].hist(df_to_scale[colname], bins = 30)
            ax[i][j].set_xlabel(colname)
            ax[i][j].set_ylabel('Frequency')
        elif i == 1:
            ax[i][j].hist(scaled_df[colname], bins = 30)
            ax[i][j].set_xlabel(colname)
            ax[i][j].set_ylabel('Transformed Frequency')
        elif i == 2:
            ax[i][j].hist(q_scaled_df[colname], bins = 30)
            ax[i][j].set_xlabel(colname)
            ax[i][j].set_ylabel('Transformed Frequency')
        elif i == 3:
            ax[i][j].hist(logged_df[colname], bins = 30)
            ax[i][j].set_xlabel(colname)
            ax[i][j].set_ylabel('Logged Frequency')
        
        idx += 1                 

Choose the QuantileTransformer, which made the target and BMI most normally distributed!

Re-process the training and test sets with QuantileTransformer

#Quantile Transformation
to_scale = ['age', 'bmi']

quantile = QuantileTransformer(n_quantiles=10, random_state=0, output_distribution='normal')

for col in to_scale:
    quantile.fit(X_train[[col]])
    X_train[col] = quantile.transform(X_train[[col]]).flatten()
    X_test[col] = quantile.transform(X_test[[col]]).flatten()

Since only the BMI feature was rescaled here, standard scaling is also applied to align the other features' units and distributions with BMI (a sketch follows).
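A sketch of that standardization step, assuming StandardScaler is fit on the training set only and children is the remaining numeric feature:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
num_cols = ['children']  # assumed remaining numeric column
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])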

  • One-Hot encoding

Convert the categorical (string) features to integer form
▶ The sex, region, and smoker features need converting (a sketch follows)
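A sketch with pandas get_dummies, assuming the encoding happens on the full df before the train/test split:

# Turn sex, smoker, and region into 0/1 indicator columns
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'])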

Model Selection

๋งŽ์€ ํšŒ๊ท€ ๋ชจ๋ธ๋“ค์„ ์ ์šฉํ•ด ์–ด๋–ค ๋ชจ๋ธ์ด ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์€์ง€๋ฅผ ํŒ๋‹จํ•˜์ž!

# Set up default models, then evaluate their performance with cross-validation
lr = LinearRegression()
enet = ElasticNet(random_state=42)
dt = DecisionTreeRegressor(random_state=42)
rf = RandomForestRegressor(random_state=42)
ada = AdaBoostRegressor(random_state=42)
gbr = GradientBoostingRegressor(random_state=42)
xgb = XGBRegressor(random_state=42)
lgbm = LGBMRegressor(random_state=42)

models = [lr, enet, dt, rf, ada, gbr, xgb, lgbm]

# Evaluation metric: RMSE
for model in models:
    name = model.__class__.__name__
    scores = cross_val_score(model, X=X_train, y=y_train, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    mse = (-1) * np.mean(scores) # scoring is negative MSE, so multiply by -1 to restore the sign
    print('Model %s - RMSE: %.4f' % (name, np.sqrt(mse)))
[Output]

Model LinearRegression - RMSE: 6415.7267
Model ElasticNet - RMSE: 9609.0652
Model DecisionTreeRegressor - RMSE: 6377.2354
Model RandomForestRegressor - RMSE: 4917.3047
Model AdaBoostRegressor - RMSE: 5211.7117
Model GradientBoostingRegressor - RMSE: 4695.0703
Model XGBRegressor - RMSE: 5302.2534
Model LGBMRegressor - RMSE: 4827.5845

Hyperparameter Tuning

Tune the hyperparameters of the Gradient Boosting Regressor, LightGBM, and Random Forest.
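A minimal GridSearchCV sketch for one of the three (the grid values are illustrative, not the ones actually used):

from sklearn.model_selection import GridSearchCV

# Illustrative grid for GradientBoostingRegressor
params = {'n_estimators': [100, 300, 500],
          'learning_rate': [0.05, 0.1],
          'max_depth': [3, 4]}
grid = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid=params,
                    scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print('Best params:', grid.best_params_)
print('Best CV RMSE: %.4f' % np.sqrt(-grid.best_score_))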
