Kaggle: Machine Learning Competitions (Version 01)

daeungdaeung · July 8, 2021

This post covers the exercise for lesson 7, Machine Learning Competitions, of Kaggle's Intro to Machine Learning course.

The task is to enter a competition using the provided data (housing data). Starter code is provided.

I wrote the code in a Notebook provided by Kaggle.

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Set up filepaths
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
    
# Import helpful libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the data, and separate the target
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice

From here on, I perform feature selection based on data analysis; the reasoning behind each step is in a post I wrote earlier.
~~(To keep the post looking neat, all comments in the code are written in English...)~~
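As a rough illustration of the kind of check behind the first step (dropping columns with many missing values), here is a minimal sketch. The toy DataFrame and the 0.15 threshold are assumptions made for the example, not values taken from the earlier post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing table
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# Fraction of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns whose missing fraction exceeds a threshold
# (0.15 is an arbitrary choice for this sketch)
threshold = 0.15
candidates = missing_ratio[missing_ratio > threshold].index.tolist()
print(candidates)
```

Running the same two lines on `home_data` would give a ranked list from which a drop list like the one below could be chosen by hand.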

# Drop features with many missing values
delete_features = [
    'PoolQC',
    'MiscFeature',
    'Alley',
    'Fence',
    'FireplaceQu',
    'LotFrontage',
    'GarageCond',
    'GarageType',
    'GarageYrBlt',
    'GarageFinish',
    'GarageQual',
    'BsmtExposure',
    'BsmtFinType2',
    'BsmtFinType1',
    'BsmtCond',
    'BsmtQual',
    'MasVnrArea',
    'MasVnrType'
]

tmp_data = home_data.drop(columns=delete_features)

# Find the rows with a null value in the 'Electrical' feature and drop them
null_indices = tmp_data[tmp_data['Electrical'].isnull()].index.tolist()
tmp_data = tmp_data.drop(null_indices)

# Delete outliers
tmp = tmp_data[tmp_data['GrLivArea'] > 4000]
outliers_indices = tmp[tmp['SalePrice'] < 200000].index.tolist()

tmp_data = tmp_data.drop(outliers_indices)

# Log-transform 'SalePrice' and 'GrLivArea' so that their
# distributions are closer to a normal distribution
import numpy as np

tmp_data['SalePrice'] = np.log(tmp_data['SalePrice'])
tmp_data['GrLivArea'] = np.log(tmp_data['GrLivArea'])

# Preprocess 'TotalBsmtSF': flag houses that have a basement,
# then log-transform only the positive areas (log(0) is undefined)
tmp_data['HasBsmt'] = (tmp_data['TotalBsmtSF'] > 0).astype(int)
has_bsmt = tmp_data['HasBsmt'] == 1

# Cast to float first so the log values fit the column's dtype
tmp_data['TotalBsmtSF'] = tmp_data['TotalBsmtSF'].astype(float)
tmp_data.loc[has_bsmt, 'TotalBsmtSF'] = np.log(tmp_data.loc[has_bsmt, 'TotalBsmtSF'])

# Label: y
y = tmp_data.SalePrice
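To see why the log transform in the code above helps, the sketch below compares skewness before and after on a small right-skewed series. The prices are made-up numbers for illustration only. It also shows that `np.exp` undoes `np.log`, which matters later because a model trained on the log of `SalePrice` predicts on the log scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (one expensive house stretches the tail)
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 180_000, 250_000, 700_000])

# Sample skewness before and after the log transform
skew_before = prices.skew()
skew_after = np.log(prices).skew()
print(skew_before, skew_after)

# np.exp inverts np.log, so log-scale values can be mapped back to dollars
restored = np.exp(np.log(prices))
```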

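Hmm... I'm not sure why, but the categorical features couldn't be used with the random forest model, so I decided to drop everything except four features. (Most likely this is because scikit-learn estimators expect numeric input, so string-valued columns have to be encoded first.)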

4 features: GrLivArea, TotalBsmtSF, OverallQual, YearBuilt
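For reference, one common way to make categorical columns usable is one-hot encoding. The following is a minimal sketch on made-up data, not something this post's model uses; the column values and targets are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: one numeric and one string-valued (categorical) column
toy_X = pd.DataFrame({
    "OverallQual": [7, 6, 7, 8, 5, 6],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"],
})
toy_y = pd.Series([12.2, 12.1, 12.0, 12.4, 11.7, 11.9])

# One-hot encode: each category becomes its own indicator column
encoded_X = pd.get_dummies(toy_X, columns=["Street"])
print(encoded_X.columns.tolist())

# With only numeric/indicator columns, the model fits without errors
model = RandomForestRegressor(n_estimators=10, random_state=1)
model.fit(encoded_X, toy_y)
preds = model.predict(encoded_X)
```

In a real run the training and test sets would need the same set of indicator columns, which is one reason I kept to four numeric features here.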

# Select columns corresponding to features, and preview the data
features = ['GrLivArea', 'TotalBsmtSF', 'OverallQual', 'YearBuilt']
X = tmp_data[features]
X.head()

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define a random forest model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
# The target was log-transformed above, so convert back to dollars before scoring
rf_val_mae = mean_absolute_error(np.exp(val_y), np.exp(rf_val_predictions))

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

Above, the training data was split into training and validation sets.
The code below refits the RandomForest model on the entire training data without splitting.

rf_model.fit(X, y)

Below is the code that loads the test data and applies the trained model to it.
One of the samples in the test data had a null value.
Its TotalBsmtSF value was null, so I filled it with the TotalBsmtSF mean, 1046.11797.
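Before the actual code, here is a small sketch of how a null row can be located and filled with the column mean without hard-coding the index. The toy frame is hypothetical and stands in for the test features; the notebook code follows after it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the test features
toy_test = pd.DataFrame({
    "GrLivArea": [896, 1329, 1629, 1604],
    "TotalBsmtSF": [882.0, 1329.0, np.nan, 926.0],
})

# Index labels of the rows where TotalBsmtSF is missing
null_rows = toy_test[toy_test["TotalBsmtSF"].isnull()].index.tolist()
print(null_rows)

# Column mean (NaN values are skipped by default), then fill the gaps
bsmt_mean = toy_test["TotalBsmtSF"].mean()
toy_test["TotalBsmtSF"] = toy_test["TotalBsmtSF"].fillna(bsmt_mean)
```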

# path to file you will use for predictions
test_data_path = '../input/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

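# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features].copy()  # copy so the assignments below modify test_X itself
test_X.loc[660, 'TotalBsmtSF'] = 1046.11797

# Apply the same log transforms that were applied to the training features
test_X['GrLivArea'] = np.log(test_X['GrLivArea'])
has_bsmt = test_X['TotalBsmtSF'] > 0
test_X.loc[has_bsmt, 'TotalBsmtSF'] = np.log(test_X.loc[has_bsmt, 'TotalBsmtSF'])

# make predictions which we will submit.
# The model predicts log(SalePrice), so convert back to dollars
test_preds = np.exp(rf_model.predict(test_X))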
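The post stops at the predictions. As a sketch of the remaining step, and assuming the competition expects a CSV with `Id` and `SalePrice` columns (an assumption, since the submission format is not shown in this post), writing the file could look like this. The ids and prices are made-up stand-ins for `test_data.Id` and `test_preds`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data.Id and test_preds
ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([125000.0, 158000.0, 182500.0])

output = pd.DataFrame({"Id": ids, "SalePrice": preds})

# Written to a temp directory here; in a Kaggle notebook the
# path would typically be just 'submission.csv'
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
output.to_csv(out_path, index=False)

print(pd.read_csv(out_path))
```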