๐Ÿ  ์„œ์šธ์‹œ ๋ถ€๋™์‚ฐ ๊ฐ€๊ฒฉ ์˜ˆ์ธก ํ”„๋กœ์ ํŠธ ํšŒ๊ณ  (with ๋จธ์‹ ๋Ÿฌ๋‹)

์ง„์ •ยท2025๋…„ 5์›” 18์ผ

๋ถ€ํŠธ์บ ํ”„์—์„œ ์ง„ํ–‰ํ•œ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ฒฝ์ง„๋Œ€ํšŒ๋ฅผ ํ†ตํ•ด, ๊ณต๊ณต๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์„œ์šธ์‹œ ์•„ํŒŒํŠธ ๊ฐ€๊ฒฉ์„ ์˜ˆ์ธกํ•˜๋Š” ํ”„๋กœ์ ํŠธ๋ฅผ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋ถ€ํ„ฐ ๋ชจ๋ธ๋ง, ๊ทธ๋ฆฌ๊ณ  AutoML๊นŒ์ง€์˜ ํ๋ฆ„์„ ์ •๋ฆฌํ•˜๋ฉด์„œ, ์‹ค์ „ ๊ฐ๊ฐ์„ ์ตํžˆ๋Š” ๋ฐ ํฐ ๋„์›€์ด ๋˜์—ˆ๋˜ ๊ฒฝํ—˜์„ ์ž‘์„ฑํ•ด๋ด…๋‹ˆ๋‹ค.


๐Ÿ“ฆ ๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ

  • train.csv: training dataset containing transactions (about 1.11M rows)

  • test.csv: test set for prediction (about 9K rows)

  • bus_feature.csv: bus-stop statistics per apartment complex

  • subway_feature.csv: subway statistics per apartment complex

๐Ÿงผ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋ฐ ํŠน์ง•

  1. One-Hot Encoding: applied to categorical variables (district, transaction type, etc.)

  2. Missing-value handling: variables with a high missing ratio and low importance were dropped or imputed with the mean/mode

  3. ์ง€ํ•˜์ฒ /๋ฒ„์Šค ์ •๋ณด ๋ณ‘ํ•ฉ: ๋‹จ์ง€ ๋‹จ์œ„๋กœ aggregation ํ›„ ์•„ํŒŒํŠธ ๋ฐ์ดํ„ฐ์™€ ์กฐ์ธ

import matplotlib.pyplot as plt

# Quick EDA: exclusive area vs. transaction price
plt.scatter(train_df['전용면적(㎡)'], train_df['거래금액(만원)'], alpha=0.3)
plt.xlabel("전용면적(㎡)")
plt.ylabel("거래금액(만원)")
plt.title("전용면적 vs 거래금액")
plt.show()

๐Ÿ”น A) ์ด์ƒ์น˜ ๋ฐ ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

  • ๊ฑฐ๋ž˜๊ธˆ์•ก(๋งŒ์›) ์ปฌ๋Ÿผ์— ๋น„์ •์ƒ์ ์œผ๋กœ ๋‚ฎ์€ ๊ฐ’(์˜ˆ: 0์› ๊ฑฐ๋ž˜)์„ ์ œ์™ธ ์ฒ˜๋ฆฌ
  • ๊ฑด์ถ•๋…„๋„๊ฐ€ 1900๋…„ ์ดํ•˜์ด๊ฑฐ๋‚˜ ์ด์ƒํ•œ ๊ฐ’์œผ๋กœ ์ž…๋ ฅ๋œ ๊ฒฝ์šฐ ์ œ๊ฑฐ
  • ํ•ด์ œ์‚ฌ์œ ๋ฐœ์ƒ์ผ, ๋“ฑ๊ธฐ์‹ ์ฒญ์ผ์ž ๋“ฑ ๊ฒฐ์ธก ๋น„์œจ์ด 80% ์ด์ƒ์ธ ๋ณ€์ˆ˜๋Š” ์ œ๊ฑฐ

๐Ÿ”น B) ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ ์ „์ฒ˜๋ฆฌ

  • ์‹œ๊ตฐ๊ตฌ, ๋„๋กœ๋ช…, ์•„ํŒŒํŠธ๋ช…, ๊ฑฐ๋ž˜์œ ํ˜• ๋“ฑ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋Š” One-Hot Encoding ๋˜๋Š” Label Encoding ์ ์šฉ
  • ํŠนํžˆ ์•„ํŒŒํŠธ๋ช…์€ ๊ณ ์œ ๊ฐ’์ด ๋งŽ์•„ ํ‰๊ท  ๊ฑฐ๋ž˜๊ธˆ์•ก์„ groupbyํ•˜์—ฌ target encoding ์‹œ๋„
train_df["์•„ํŒŒํŠธ๋ช…_ํ‰๊ท ๊ฐ€"] = train_df.groupby("์•„ํŒŒํŠธ๋ช…")["๊ฑฐ๋ž˜๊ธˆ์•ก(๋งŒ์›)"].transform("mean")

๐Ÿ”น C) ๋ฉด์ , ์ธต, ๊ฑด์ถ•๋…„๋„ ๊ด€๋ จ ๋ณ€ํ™˜

  • ์ „์šฉ๋ฉด์ ์€ ๋กœ๊ทธ ๋ณ€ํ™˜ํ•˜์—ฌ ์Šค์ผ€์ผ ์กฐ์ •
  • ์ธต์€ ๋‚ฎ์€์ธต/๊ณ ์ธต ์—ฌ๋ถ€๋ฅผ ๊ตฌ๊ฐ„์œผ๋กœ ๋‚˜๋ˆ  ๋ฒ”์ฃผํ™”
  • ๊ฑด์ถ•๋…„๋„๋Š” ๊ฒฝ๊ณผ์—ฐ์ˆ˜ = ๊ณ„์•ฝ๋…„์›” - ๊ฑด์ถ•๋…„๋„ ๋กœ ํŒŒ์ƒ ๋ณ€์ˆ˜ ์ƒ์„ฑ

๐Ÿ”น D) ์™ธ๋ถ€ ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ (๋ฒ„์Šค & ์ง€ํ•˜์ฒ )

  • Aggregated bus_feature.csv and subway_feature.csv per apartment complex, then merged on road name or district
  • Used them as proximity-to-transit indicators (e.g., number of subway stations, average distance, number of lines)

๐Ÿค– ๋ชจ๋ธ๋ง

๐Ÿ”ธ 1. ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ๋ง

  • Hyperparameters: starting from the defaults, tuned num_leaves, max_depth, learning_rate, etc. via grid search
  • Prevented overfitting with early stopping and an eval_set
from lightgbm import LGBMRegressor, early_stopping

model = LGBMRegressor(num_leaves=64, max_depth=7, learning_rate=0.05, n_estimators=1000)
# In LightGBM >= 4.0, early_stopping_rounds is no longer a fit() argument; use a callback
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], callbacks=[early_stopping(50)])

๐Ÿ”ธ 2. ๊ต์ฐจ๊ฒ€์ฆ ์ „๋žต

  • Stratified KFold: binned the transaction price into buckets and stratified on them
  • Averaged the validation RMSE over K=5 or K=10 folds
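Since StratifiedKFold needs discrete labels, the continuous price can be bucketed first (a sketch with synthetic prices; `q=10` is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the price column
y = pd.Series(np.random.default_rng(0).lognormal(10, 1, size=1000))

# Quantile-bin the continuous target so it can serve as stratification labels
bins = pd.qcut(y, q=10, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(y, bins)]
# Every validation fold now sees a similar price distribution
```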

๐Ÿ”ธ 3. ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ์œ„ํ•œ ์•™์ƒ๋ธ”

  • ๊ฐœ๋ณ„ ๋ชจ๋ธ(LGBM, XGB, CatBoost)์˜ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ Weighted Average ๋ฐฉ์‹์œผ๋กœ ์•™์ƒ๋ธ”
  • ์˜ˆ: final_pred = 0.5 lgb_pred + 0.3 xgb_pred + 0.2 * cat_pred

๐Ÿ”ธ 4. AutoML ์ ์šฉ (AutoGluon)

  • AutoGluon์€ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ์ž๋™์œผ๋กœ ํ•™์Šตํ•˜๊ณ , stacking ๋ฐ bagging์„ ์ˆ˜ํ–‰
  • ๋‹จ์ˆœํžˆ .fit() ๋ฉ”์„œ๋“œ๋งŒ ํ˜ธ์ถœํ•˜๋ฉด ๋‚ด๋ถ€์ ์œผ๋กœ ์ˆ˜์‹ญ ๊ฐœ ๋ชจ๋ธ์„ ์‹คํ—˜ํ•˜๊ณ  ์กฐํ•ฉ
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label='๊ฑฐ๋ž˜๊ธˆ์•ก(๋งŒ์›)', eval_metric='rmse').fit(train_data=train_df)

โœ… ์„ฑ๋Šฅ ์˜ˆ์‹œ (LightGBM)


โš™๏ธ AutoML ์ ์šฉ (AutoGluon)

๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„์—์„œ๋Š” Amazon์˜ AutoML ํ”„๋ ˆ์ž„์›Œํฌ์ธ AutoGluon์„ ํ™œ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์—ฌ๋Ÿฌ ๋ชจ๋ธ(LGBM, XGB, NeuralNet ๋“ฑ)์„ ์กฐํ•ฉํ•˜๊ณ , Stacking/Bagging ์•™์ƒ๋ธ”๊นŒ์ง€ ์ž๋™์œผ๋กœ ์ˆ˜ํ–‰

[์žฅ์ ]
๋ณต์žกํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ์—†์ด๋„ ์ค€์ˆ˜ํ•œ ์„ฑ๋Šฅ
๋‹ค์–‘ํ•œ ๋ชจ๋ธ ์กฐํ•ฉ์„ ํ†ตํ•œ ์Šคํƒœํ‚น ์•™์ƒ๋ธ”
Validation Set ๋ถ„ํ•  ์ž๋™ํ™”

from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="๊ฑฐ๋ž˜๊ธˆ์•ก(๋งŒ์›)").fit(train_data=train_df)

๐Ÿ” ํšŒ๊ณ  ๋ฐ ๋งˆ๋ฌด๋ฆฌ

  • EDA์™€ ํ”ผ์ฒ˜ ์—”์ง€๋‹ˆ์–ด๋ง์—์„œ ์‹œ๊ฐ„ ์†Œ์š” (๊ฐ€์žฅ ๋งŽ์ด ๋˜๊ณ  ๊ฐ€์žฅ ์ค‘์š”ํ•œ๋“ฏ)

  • AutoML์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ baseline ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๊ณ  ๋น„๊ตํ•˜๋Š” ๊ณผ์ •์ด ์ค‘์š”ํ•จ์„ ๋А๋‚Œ

  • ํ–ฅํ›„์—๋Š” ์‹œ๊ณ„์—ด ๊ณ ๋ ค ๋˜๋Š” ์ง€์—ญ๋ณ„ ๋ชจ๋ธ ๋ถ„๋ฆฌ ๋“ฑ์˜ ๊ณ ๋„ํ™”๋ฅผ ๊ณ ๋ คํ•  ์˜ˆ์ •

  • ๋ถ€๋™์‚ฐ์ด๋ผ๋Š” ์‹ค์ƒํ™œ๊ณผ ๋ฐ€์ ‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ค„๋ณด๋ฉฐ, ๋จธ์‹ ๋Ÿฌ๋‹์ด ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ์˜์‚ฌ๊ฒฐ์ •์— ๋„์›€์„ ์ค„ ์ˆ˜ ์žˆ๋Š”์ง€ ์ฒด๊ฐํ•  ์ˆ˜ ์žˆ์—ˆ์Œ

0๊ฐœ์˜ ๋Œ“๊ธ€