Bike Sharing Demand prediction (Kaggle)
- Language: Python
- IDE: Jupyter Notebook
1. Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
2. Load the dataset
- season: the season, ordinal-encoded as 1-4 (values with an inherent order)
- weather: 1 = clear, 2 = cloudy, 3 = light snow or rain, 4 = heavy rain, heavy snow, or hail (decoded into labels in the sketch below)
train = pd.read_csv("bike/train.csv", parse_dates = ["datetime"])
display(train.head(2))
train.shape
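The integer weather codes are easier to sanity-check as readable labels; a quick sketch (the label strings are mine, summarizing the bullet above, not values stored in the file):
weather_labels = {1: "clear", 2: "cloudy", 3: "light snow/rain", 4: "heavy rain/snow/hail"}
train["weather"].map(weather_labels).value_counts()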
2.1. Check DataFrame info and descriptive statistics
train.info()
train.describe()
3. Preprocessing
3.1. Extract date and time components
train["year"] = train["datetime"].dt.year
train["month"] = train["datetime"].dt.month
train["day"] = train["datetime"].dt.day
train["hour"] = train["datetime"].dt.hour
train["minute"] = train["datetime"].dt.minute
train["second"] = train["datetime"].dt.second
train["dayofweek"] = train["datetime"].dt.dayofweek
4. EDA
4.1. Histograms
train.hist(figsize=(10,10), bins=50);
4.2. Scatter plots
sns.scatterplot(data=train, x="windspeed", y="count")
sns.scatterplot(data=train, x="temp", y="atemp")
4.3. Bar plots
sns.barplot(data=train, x="weather", y="count")
sns.barplot(data=train, x="month", y="count", hue="year")
4.4. Grouping
Grouping months by season confirms the encoding described in section 2: each season covers three consecutive months.
train.groupby("season")["month"].unique()
season
1 [1, 2, 3]
2 [4, 5, 6]
3 [7, 8, 9]
4 [10, 11, 12]
Name: month, dtype: object
4.5. Create derived features
train["year-month"] = train["datetime"].astype(str).str[:7]
4.6. Log transform
train["count_log1p"] = np.log1p(train["count"])  # log1p(x) = log(x + 1); compresses the long right tail
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 3))
sns.kdeplot(train["count"], ax=axes[0])
sns.kdeplot(train["count_log1p"], ax=axes[1])
5. Build the training and prediction datasets
5.1. Specify the label and the features to use
label_name = "count_log1p"  # train on the log-transformed target so predictions can be inverted with expm1 (section 6.4)
feature_names = ['holiday', 'weather', ...]
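test has not been defined at this point; before the split below, test.csv must be loaded and given the same derived date columns as train. A minimal sketch, assuming the same directory as train.csv and that the date parts from section 3.1 appear in feature_names:
test = pd.read_csv("bike/test.csv", parse_dates=["datetime"])
for part in ["year", "month", "day", "hour", "dayofweek"]:
    test[part] = getattr(test["datetime"].dt, part)  # mirror the train preprocessing in 3.1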
5.2. Split into train and test sets
X_train = train[feature_names]
X_test = test[feature_names]
y_train = train[label_name]
6. Modeling
6.1. RF
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42, n_jobs=-1)
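Before tuning, a quick untuned baseline helps judge whether the search below actually improves anything; a sketch using 5-fold cross-validation (not part of the original notebook):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1)
-scores.mean()  # average RMSE on the log1p scale; lower is better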
6.2. RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {"max_depth": np.random.randint(15, 30, 10),     # 10 candidate depths drawn from [15, 30)
                       "max_features": np.random.uniform(0.8, 1, 10)}  # 10 candidate fractions drawn from [0.8, 1)
reg = RandomizedSearchCV(model, param_distributions=param_distributions,
scoring='neg_root_mean_squared_error',
n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)
reg.fit(X_train, y_train)
best_model = reg.best_estimator_
reg.best_score_  # negative RMSE because scoring='neg_root_mean_squared_error'; closer to 0 is better
from sklearn.model_selection import cross_val_predict
y_valid_predict = cross_val_predict(best_model, X_train, y_train, n_jobs=-1)  # out-of-fold predictions (default 5-fold CV)
6.3. Evaluation
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, y_valid_predict)         # MSE on the log1p scale
mean_squared_error(y_train, y_valid_predict) ** 0.5  # RMSE on the log1p scale, i.e. the competition's RMSLE
6.4. Fit and predict
y_predict = best_model.fit(X_train, y_train).predict(X_test)
y_predict[:5]
sns.barplot(x=best_model.feature_importances_, y=best_model.feature_names_in_)
df_submit = pd.read_csv("bike/sampleSubmission.csv")  # assumed filename, mirroring the train.csv path
df_submit["count"] = np.expm1(y_predict)  # expm1 inverts log1p, restoring predictions to the count scale
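Kaggle expects the submission file written without the pandas index; a short sketch (the output filename is arbitrary):
df_submit.to_csv("submit.csv", index=False)
df_submit.head(2)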