📒 ML

Kimdongki · July 5, 2024

📌 Checking the dataset

CSV

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197428 entries, 0 to 197427
Data columns (total 16 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   market_id                                     196441 non-null  float64
 1   created_at                                    197428 non-null  object 
 2   actual_delivery_time                          197421 non-null  object 
 3   store_id                                      197428 non-null  int64  
 4   store_primary_category                        192668 non-null  object 
 5   order_protocol                                196433 non-null  float64
 6   total_items                                   197428 non-null  int64  
 7   subtotal                                      197428 non-null  int64  
 8   num_distinct_items                            197428 non-null  int64  
 9   min_item_price                                197428 non-null  int64  
 10  max_item_price                                197428 non-null  int64  
 11  total_onshift                                 181166 non-null  float64
 12  total_busy                                    181166 non-null  float64
 13  total_outstanding_orders                      181166 non-null  float64
 14  estimated_order_place_duration                197428 non-null  int64  
 15  estimated_store_to_consumer_driving_duration  196902 non-null  float64
dtypes: float64(6), int64(7), object(3)
memory usage: 24.1+ MB

📌 Handling missing values

The missing values do not overlap much across columns, and most of them sit in columns where filling in a mean would be hard to justify, so I decided to drop every row that contains one.

I also tried imputing the missing values instead, but it made little difference to the results.

df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 176016 entries, 0 to 197427
Data columns (total 16 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   market_id                                     176016 non-null  object 
 1   created_at                                    176016 non-null  object 
 2   actual_delivery_time                          176016 non-null  object 
 3   store_id                                      176016 non-null  int64  
 4   store_primary_category                        176016 non-null  object 
 5   order_protocol                                176016 non-null  float64
 6   total_items                                   176016 non-null  int64  
 7   subtotal                                      176016 non-null  int64  
 8   num_distinct_items                            176016 non-null  int64  
 9   min_item_price                                176016 non-null  int64  
 10  max_item_price                                176016 non-null  int64  
 11  total_onshift                                 176016 non-null  float64
 12  total_busy                                    176016 non-null  float64
 13  total_outstanding_orders                      176016 non-null  float64
 14  estimated_order_place_duration                176016 non-null  int64  
 15  estimated_store_to_consumer_driving_duration  176016 non-null  float64
dtypes: float64(5), int64(7), object(4)
memory usage: 22.8+ MB
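
The choice between dropping and filling can be tried out on a small scale first. The sketch below is illustrative only: the four-row frame and its values are hypothetical (only the column names come from the dataset). It shows how `isna().sum()` counts missing values per column and how `dropna()` compares with mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the real dataset.
toy = pd.DataFrame({
    'total_onshift': [10.0, np.nan, 30.0, 20.0],
    'total_busy': [8.0, 5.0, np.nan, 12.0],
    'subtotal': [3000, 1500, 4200, 2500],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column mean
filled = toy.fillna(toy.mean(numeric_only=True))

print(missing_per_column)
print(len(toy), '->', len(dropped), 'rows after dropna')
print(filled)
```

Here the gaps fall in different rows, so dropping removes two of the four rows while imputation keeps all of them. On the real data the same comparison is a trade-off between losing rows and introducing filled-in values.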

📌 Preprocessing

Create a delivery_duration column by subtracting created_at from actual_delivery_time and dividing the difference, in seconds, by 60 to get minutes.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Convert the timestamp strings to datetime64
df['created_at'] = pd.to_datetime(df['created_at'])
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'])

# Compute the delivery time in minutes
df['delivery_duration'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds() / 60

# Drop the columns that are no longer needed
df = df.drop(['created_at', 'actual_delivery_time'], axis=1)

# Impute the numeric columns and one-hot encode the categorical column
categorical_cols = ['store_primary_category']
numerical_cols = df.columns.drop(['store_primary_category', 'delivery_duration'])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Separate the features from the target
X = df.drop('delivery_duration', axis=1)
y = df['delivery_duration']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Fit the preprocessor on the training set only, then apply it to both sets
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

📌 Modeling

I decided to use XGBoost for the model.
I tried Random Forest first, but after about 30 minutes training still had not finished, for reasons I could not determine, so I switched to a different algorithm.

XGBoost is built on the boosting algorithm.
Boosting is an important concept, often introduced as the next step after Random Forest among tree ensembles.
Random Forest builds each tree independently, whereas boosting builds trees sequentially, so that what the earlier trees learned feeds into the training of the next one.
This loosely resembles how an RNN or DNN passes information forward from one step or layer to the next.
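
The sequential idea can be sketched in a few lines. This is a minimal illustration of boosting with squared error, not XGBoost's actual implementation (which adds regularization and second-order gradient information, among other things). The synthetic data, the one-split "stump" trees, and the helper names `fit_stump` and `predict_stump` are all assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic one-feature regression problem (hypothetical data)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean of the target
mse_history = []
for _ in range(50):
    residual = y - prediction           # what the ensemble still gets wrong
    stump = fit_stump(x, residual)      # the next tree is fit to those errors
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f'MSE after round 1: {mse_history[0]:.4f}')
print(f'MSE after round 50: {mse_history[-1]:.4f}')
```

Each round depends on the predictions accumulated so far, which is why the trees cannot be built independently the way they are in Random Forest.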

import xgboost as xgb

# Create the DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set the XGBoost parameters
params = {
    'objective': 'reg:squarederror',  # squared-error (MSE) loss
    'eval_metric': 'rmse',  # use RMSE as the evaluation metric
    'learning_rate': 0.1,
    'max_depth': 6,
    'seed': 42,
    # GPU training; XGBoost 2.0+ deprecates 'gpu_hist' in favor of
    # tree_method='hist' together with device='cuda'
    'tree_method': 'gpu_hist'
}

# Monitor the training process
num_boost_round = 100
evals = [(dtrain, 'train'), (dtest, 'eval')]
progress = {}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=evals,
    evals_result=progress,
    verbose_eval=True
)

📌 Visualization

import matplotlib.pyplot as plt

train_rmse = progress['train']['rmse']
eval_rmse = progress['eval']['rmse']

plt.figure(figsize=(10, 7))
plt.plot(train_rmse, label='Train RMSE')
plt.plot(eval_rmse, label='Eval RMSE')
plt.xlabel('Number of Rounds')
plt.ylabel('RMSE')
plt.title('RMSE over Training Rounds')
plt.legend()
plt.show()

📌 Evaluation

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Evaluate the final model
y_pred = model.predict(dtest)

# Compute the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Compute the share of under-predictions
under_predictions = np.sum(y_pred < y_test)
under_prediction_ratio = under_predictions / len(y_test)

print(f"y_pred: {y_pred}")
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Square Error: {rmse}')
print(f'Under-prediction ratio: {under_prediction_ratio}')
y_pred: [51.80057  60.28056  37.950516 ... 50.746944 47.010056 37.548664]
Mean Absolute Error: 10.915870883989538
Mean Squared Error: 295.4411207884746
Root Mean Square Error: 17.1884007629702
Under-prediction ratio: 0.4160314028899761
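
To make these metrics concrete, here is a small worked example. The four delivery times and predictions are hypothetical values chosen so the arithmetic is easy to follow. The variable names are also just for illustration.

```python
import numpy as np

# Hypothetical delivery times in minutes (actual vs. predicted)
y_true = np.array([30.0, 45.0, 50.0, 60.0])
y_hat = np.array([35.0, 40.0, 50.0, 48.0])

errors = y_hat - y_true                 # [5, -5, 0, -12]
mae = np.mean(np.abs(errors))           # (5 + 5 + 0 + 12) / 4
mse = np.mean(errors ** 2)              # (25 + 25 + 0 + 144) / 4
rmse = np.sqrt(mse)
under_ratio = np.mean(y_hat < y_true)   # share of orders that took longer than predicted

print(mae, mse, rmse, under_ratio)
```

The single 12-minute miss pulls the RMSE well above the MAE, because squaring weights large errors more heavily. The same reading applies to the results above: an RMSE of about 17.19 against an MAE of about 10.92 suggests some large misses, and an under-prediction ratio of about 0.416 means roughly 41.6% of test orders took longer than the model predicted.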
