Bike Sharing Demand prediction (Kaggle)
- Language: Python
- IDE: Jupyter Notebook
1. Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
2. Load the dataset
- season: the season, ordinal-encoded as 1-4 (values with an inherent order)
- weather: 1 = clear, 2 = cloudy, 3 = light snow or rain, 4 = heavy rain, heavy snow, or hail (decoded into labels in the sketch below)
train = pd.read_csv("bike/train.csv", parse_dates = ["datetime"])
display(train.head(2))
train.shape
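The integer weather codes are easier to sanity-check as readable labels; a quick sketch (the label strings are mine, summarizing the bullet above, not values stored in the file):
weather_labels = {1: "clear", 2: "cloudy", 3: "light snow/rain", 4: "heavy rain/snow/hail"}
train["weather"].map(weather_labels).value_counts()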
2.1. Check DataFrame info and descriptive statistics
train.info()
train.describe()
3. Preprocessing
3.1. Extract date and time components
train["year"] = train["datetime"].dt.year
train["month"] = train["datetime"].dt.month
train["day"] = train["datetime"].dt.day
train["hour"] = train["datetime"].dt.hour
train["minute"] = train["datetime"].dt.minute
train["second"] = train["datetime"].dt.second
train["dayofweek"] = train["datetime"].dt.dayofweek
4. EDA
4.1. Histograms
train.hist(figsize=(10,10), bins=50);
4.2. Scatter plots
sns.scatterplot(data=train, x="windspeed", y="count")
sns.scatterplot(data=train, x="temp", y="atemp")
4.3. Bar plots
sns.barplot(data=train, x="weather", y="count")
sns.barplot(data=train, x="month", y="count", hue="year")
4.4. Grouping
Grouping months by season confirms the encoding described in section 2: each season covers three consecutive months.
train.groupby("season")["month"].unique()
season
1 [1, 2, 3]
2 [4, 5, 6]
3 [7, 8, 9]
4 [10, 11, 12]
Name: month, dtype: object
4.5. Create derived features
train["year-month"] = train["datetime"].astype(str).str[:7]
4.6. Log transform
train["count_log1p"] = np.log1p(train["count"])  # log1p(x) = log(x + 1); compresses the long right tail
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 3))
sns.kdeplot(train["count"], ax=axes[0])
sns.kdeplot(train["count_log1p"], ax=axes[1])
5. Build the training and prediction datasets
5.1. Specify the label and the features to use
label_name = "count_log1p"  # train on the log-transformed target so predictions can be inverted with expm1 (section 6.4)
feature_names = ['holiday', 'weather', ...]
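test has not been defined at this point; before the split below, test.csv must be loaded and given the same derived date columns as train. A minimal sketch, assuming the same directory as train.csv and that the date parts from section 3.1 appear in feature_names:
test = pd.read_csv("bike/test.csv", parse_dates=["datetime"])
for part in ["year", "month", "day", "hour", "dayofweek"]:
    test[part] = getattr(test["datetime"].dt, part)  # mirror the train preprocessing in 3.1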
5.2. Split into train and test sets
X_train = train[feature_names]
X_test = test[feature_names]
y_train = train[label_name]
6. Modeling
6.1. RF
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42, n_jobs=-1)
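Before tuning, a quick untuned baseline helps judge whether the search below actually improves anything; a sketch using 5-fold cross-validation (not part of the original notebook):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1)
-scores.mean()  # average RMSE on the log1p scale; lower is better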
6.2. RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {"max_depth": np.random.randint(15, 30, 10),     # 10 candidate depths drawn from [15, 30)
                       "max_features": np.random.uniform(0.8, 1, 10)}  # 10 candidate fractions drawn from [0.8, 1)
reg = RandomizedSearchCV(model, param_distributions=param_distributions,
scoring='neg_root_mean_squared_error',
n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)
reg.fit(X_train, y_train)
best_model = reg.best_estimator_
reg.best_score_  # negative RMSE because scoring='neg_root_mean_squared_error'; closer to 0 is better
from sklearn.model_selection import cross_val_predict
y_valid_predict = cross_val_predict(best_model, X_train, y_train, n_jobs=-1)  # out-of-fold predictions (default 5-fold CV)
6.3. Evaluation
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, y_valid_predict)         # MSE on the log1p scale
mean_squared_error(y_train, y_valid_predict) ** 0.5  # RMSE on the log1p scale, i.e. the competition's RMSLE
6.4. Fit and predict
y_predict = best_model.fit(X_train, y_train).predict(X_test)
y_predict[:5]
sns.barplot(x=best_model.feature_importances_, y=best_model.feature_names_in_)
df_submit = pd.read_csv("bike/sampleSubmission.csv")  # assumed filename, mirroring the train.csv path
df_submit["count"] = np.expm1(y_predict)  # expm1 inverts log1p, restoring predictions to the count scale
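Kaggle expects the submission file written without the pandas index; a short sketch (the output filename is arbitrary):
df_submit.to_csv("submit.csv", index=False)
df_submit.head(2)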