[HUFSTUDY] Kaggle Getting Started Data Analysis - Spaceship Titanic

Uomnf97 · August 21, 2022

Spaceship Titanic

Predict which passengers are transported to an alternate dimension

  • HUFS Data Scientist : Kim Juwon, Cho Kwonwhi, Baek Gunwoo

Summary :

  • This data analysis was done by Juwon Kim in preparation for ML modeling
  • It uses pandas histograms, a heatmap to check correlations, and Seaborn for visualization
  • Data preprocessing consisted of one-hot encoding, handling missing values, and standardization

1. Data Analysis

  • Accurate data analysis is required to train an ML model correctly.
  • For machine learning, the train and test data were loaded. The number of columns, their names, the target, and the relationships between variables were analyzed using data analysis techniques such as histograms, heat maps, and clustering.

Import Libraries

  • pandas, numpy, matplotlib.pyplot, seaborn, tqdm
# Data Analyze
import pandas as pd
import numpy as np
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Progress bars for loops over the data
import tqdm.notebook as tqdm

Data Description

  • train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    • PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    • HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    • CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    • Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    • Destination - The planet the passenger will be debarking to.
    • Age - The age of the passenger.
    • VIP - Whether the passenger has paid for special VIP service during the voyage.
    • RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    • Name - The first and last names of the passenger.
    • Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
  • test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
  • sample_submission.csv - A submission file in the correct format.
    • PassengerId - Id for each passenger in the test set.
    • Transported - The target. For each passenger, predict either True or False.

The ML goal is to predict whether each passenger was transported to another dimension

0. Load Data

  • Load Train, Test Data
train_data = pd.read_csv("./dataset/train.csv")
train_data

1. Data Preprocessing

  • Handle missing values (fill them with 0), convert categorical data into numerical form with one-hot encoding, and then standardize the features.
  • We checked how many missing values each column has using the isna() and sum() functions in pandas.
# Checking train data
train_data.isna().sum()

  • Experiment 1: see how many rows dropna() eliminates
    • 2,087 rows were removed
    • About one-fourth of the rows (2,087 of 8,693) were eliminated, which reduces the amount of data available for training
    • Still, both dropna() and fillna() will be tried during ML modeling
train_data.dropna()
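
The comparison in Experiment 1 can be written down explicitly. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data is an assumption, not the Kaggle file); it counts missing values per column and the share of rows that dropna() would remove.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [24.0, 31.0, np.nan, 45.0],
    "VIP": [False, False, True, False],
})

missing_per_column = toy.isna().sum()   # missing values in each column
rows_before = len(toy)
rows_after = len(toy.dropna())          # rows without any missing value
dropped = rows_before - rows_after
dropped_share = dropped / rows_before

print(missing_per_column)
print(f"dropna() removes {dropped} of {rows_before} rows ({dropped_share:.0%})")
```

Running the same three lines on train_data instead of `toy` should reproduce the row count reported above.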

# fill missing value with 0
train_data=train_data.fillna(0)
train_data.isna().sum()
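
Filling everything with 0 treats a missing planet, a missing age, and a missing bill the same way. As a hedged sketch of one alternative (not what this notebook does), fillna() also accepts a dict with one fill value per column. The toy data, the "Unknown" label, and the choice of the median are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "HomePlanet": pd.Series(["Earth", None, "Earth", "Mars"], dtype=object),
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Spa": [0.0, 100.0, np.nan, 0.0],
})

# What the post does: every missing value becomes 0, whatever the column means.
filled_zero = toy.fillna(0)

# Hypothetical alternative: one fill value per column.
fill_values = {
    "HomePlanet": "Unknown",        # explicit label for a missing category
    "Age": toy["Age"].median(),     # median of the observed ages
    "Spa": 0.0,                     # assume no recorded spending
}
filled_per_column = toy.fillna(value=fill_values)

print(filled_zero)
print(filled_per_column)
```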

  • The main pandas dtypes are int, float, bool, datetime64, category, and object
  • An object column usually holds strings
print(train_data.columns)
print(train_data.dtypes)
print("rows, columns:", train_data.shape)
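
Instead of reading the dtypes printout by eye, the columns can be grouped by dtype with select_dtypes(). This is a small sketch on a toy frame (the `toy` data is an assumption); the same calls should work on train_data.

```python
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "HomePlanet": pd.Series(["Earth", "Mars"], dtype=object),
    "Age": [24.0, 31.0],
    "Transported": [True, False],
})

object_columns = toy.select_dtypes(include="object").columns.tolist()
bool_columns = toy.select_dtypes(include="bool").columns.tolist()
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print("object :", object_columns)
print("bool   :", bool_columns)
print("numeric:", numeric_columns)
```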

  • We have to convert the bool and object columns to numeric form
    • object : PassengerId, HomePlanet, CryoSleep, Cabin, Destination, VIP, Name
    • bool : Transported
  • Encode the bool target as 0/1
train_data["Transported"] = train_data["Transported"].astype(int)
  • Check the object columns
# Count how often each value occurs in every object column
from collections import Counter

PassengerId = dict(Counter(train_data["PassengerId"]))
HomePlanet = dict(Counter(train_data["HomePlanet"]))
CryoSleep = dict(Counter(train_data["CryoSleep"]))
Cabin = dict(Counter(train_data["Cabin"]))
Destination = dict(Counter(train_data["Destination"]))
VIP = dict(Counter(train_data["VIP"]))
Name = dict(Counter(train_data["Name"]))

print(PassengerId)

![](https://velog.velcdn.com/images/uonmf97/post/baa63f82-082e-4558-a465-851ce597fcad/image.png)

print(HomePlanet)

{'Europa': 2131, 'Earth': 4602, 'Mars': 1759, 0: 201}

print(CryoSleep)

{False: 5656, True: 3037}

print(Cabin)

print(Destination)

{'TRAPPIST-1e': 5915, 'PSO J318.5-22': 796, '55 Cancri e': 1800, 0: 182}

print(VIP)

{False: 8494, True: 199}

print(Name)
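
One detail worth noticing in the counts above: HomePlanet and Destination show a separate 0 key for filled-in missing values, but CryoSleep and VIP do not. In Python, 0 and False are equal and hash the same, so as dictionary keys they collapse into one entry. If those two columns had missing values before fillna(0), they are now counted as False. A tiny sketch with made-up values:

```python
from collections import Counter

# 0 stands for a missing value that was filled in with fillna(0)
values = [False, True, 0, False, 0]

counts = dict(Counter(values))
print(counts)  # {False: 4, True: 1}
```

The two 0 entries are added to the False count because False was the first key seen.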

  • VIP and CryoSleep turn out to hold boolean data
  • Name and PassengerId are not needed (they are identifiers, not categorical data)
  • Destination and HomePlanet each have only four categories (including the 0 used for missing values), so they can be one-hot encoded
  • Cabin needs to be split into deck, num, and side
  • Convert VIP and CryoSleep to 0/1
train_data["VIP"] = train_data["VIP"].astype('int')
train_data["CryoSleep"] = train_data["CryoSleep"].astype('int')
  • One-hot encode Destination and HomePlanet
des = pd.get_dummies(train_data['Destination'], prefix = 'Destination')
hpt = pd.get_dummies(train_data['HomePlanet'], prefix = 'HomePlanet')
train_data = train_data.drop(['Destination', 'HomePlanet','Name','PassengerId'],axis=1)
train_data = pd.concat([train_data, des, hpt], axis=1)
train_data
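
To see what get_dummies() produces, here is a minimal sketch on a toy frame (the `toy` data is an assumption). Recent pandas versions return boolean dummy columns by default, so the sketch passes dtype=int to get 0/1 columns.

```python
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"],
    "Age": [24.0, 31.0, 45.0],
})

# One 0/1 column per category, named <prefix>_<category>
des = pd.get_dummies(toy["Destination"], prefix="Destination", dtype=int)

# Replace the original column with its dummy columns
encoded = pd.concat([toy.drop(columns=["Destination"]), des], axis=1)
print(encoded)
```

Each row has exactly one 1 among the Destination_ columns, which is what makes the encoding "one-hot".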

# Split Cabin (deck/num/side) into three columns; rows without a valid cabin get 0
deck = []
num=[]
side=[]

for i in tqdm.tqdm(range(len(train_data["Cabin"]))):
    temp = (str(train_data.iloc[i]["Cabin"]).split('/'))
    if len(temp) == 3:
        deck.append(temp[0])
        num.append(int(temp[1]))
        side.append(temp[2])
    else :
        deck.append(0)
        num.append(0)
        side.append(0)
train_data["deck"]=deck
train_data["num"] =num
train_data["side"] =side

de = pd.get_dummies(train_data['deck'], prefix = 'deck')
si = pd.get_dummies(train_data['side'], prefix = 'side')
# Cabin is now represented by deck, num, and side, so drop the raw column too
train_data = train_data.drop(['deck', 'side', 'Cabin'],axis=1)
train_data = pd.concat([train_data, de, si], axis=1)
train_data
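
The row-by-row Cabin split above can also be done with pandas string methods. This is a sketch under assumptions: the `toy` data is made up, and the placeholder for a missing cabin is the string "0" rather than the integer 0 (the resulting dummy column names, such as deck_0, stay the same).

```python
import pandas as pd

# Toy stand-in: 0 marks a cabin that was missing and filled with fillna(0)
toy = pd.DataFrame({"Cabin": pd.Series(["B/0/P", "F/12/S", 0, "G/3/S"], dtype=object)})

# Split "deck/num/side" into three columns; short rows are padded with missing values
parts = toy["Cabin"].astype(str).str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

valid = parts["side"].notna()  # rows that really had all three parts
toy["deck"] = parts["deck"].where(valid, "0")
toy["num"] = pd.to_numeric(parts["num"]).fillna(0).astype(int)
toy["side"] = parts["side"].fillna("0")

print(toy)
```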

train_data.hist(figsize=(30,20))

  • Check the mean, std, and the 25%, 50%, and 75% quantiles using describe()
train_data.describe()

  • Most of the distributions are skewed, so standardization was applied

Check Correlation

target = train_data['Transported']
norm = train_data.drop('Transported', axis = 1)
# z-score standardization: (x - mean) / std
train_data_normed = (norm- norm.mean())/norm.std()
train_data_normed
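
A quick way to check the standardization is to confirm that every column ends up with mean 0 and standard deviation 1. The sketch below does this on a toy frame (the `toy` data is an assumption). Note that a constant column has a standard deviation of 0, so the formula would turn it into NaN.

```python
import pandas as pd

# Toy stand-in for the numeric features; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Spa": [0.0, 0.0, 100.0, 300.0],
})

# Same formula as above; std() uses the sample standard deviation (ddof=1)
standardized = (toy - toy.mean()) / toy.std()

print(standardized.mean())  # close to 0 for every column
print(standardized.std())   # close to 1 for every column
```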

analysis = pd.merge(train_data_normed, train_data['Transported'],
                left_index = True, right_index=True)
# Check linear relationships
plt.figure(figsize=(16,16))
sns.heatmap(train_data.corr(), linewidths=.5, cmap = 'Blues', annot=True)
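
With this many columns the annotated heatmap gets hard to read. As a complement, the column of the correlation matrix that belongs to the target can be sorted by absolute value. This is a sketch on made-up toy data (an assumption, not results from the Kaggle file).

```python
import pandas as pd

# Toy stand-in for the encoded train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "CryoSleep":   [1, 0, 1, 0, 1, 0],
    "RoomService": [0, 200, 0, 50, 0, 400],
    "Age":         [30, 25, 41, 33, 19, 52],
    "Transported": [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["Transported"]
    .drop("Transported")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a useful feature.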

#pairplot with Seaborn
sns.pairplot(analysis[['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Transported']],hue='Transported')
plt.show()

sns.pairplot(analysis[['Spa', 'VRDeck', 'Destination_0', 'Destination_55 Cancri e', 'Destination_TRAPPIST-1e', 'Destination_PSO J318.5-22', 'Transported']],hue='Transported')
plt.show()

sns.pairplot(analysis[['HomePlanet_0', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'num', 'Transported']],hue='Transported')
plt.show()

sns.pairplot(analysis[['deck_0', 'deck_A', 'deck_B', 'deck_C', 'deck_D', 'deck_E', 'Transported']],hue='Transported')
plt.show()

sns.pairplot(analysis[['deck_G', 'deck_T', 'side_0', 'side_P', 'side_S', 'Transported']],hue='Transported')
plt.show()
