Naver Project (Spaceship_Titanic_classification)

Jacob Kim · January 31, 2024

Naver project

Naver Project Week 2

`Spaceship Titanic with DL`

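Background

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We have received a transmission from four light-years away, and things are not looking good.
The Spaceship Titanic was an interstellar passenger liner launched a month ago. With about 13,000 passengers on board, the vessel set out on its maiden voyage, carrying emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.
While rounding Alpha Centauri en route to its first destination, 55 Cancri E, the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a fate similar to that of its namesake from 1000 years before. The ship stayed intact, but almost half of the passengers were transported to another dimension!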

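Data description

PassengerId
- A unique ID for each passenger. Each ID takes the form gggg_pp, where gggg identifies the group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
HomePlanet
- The planet the passenger departed from, typically the planet where they live.
CryoSleep
- Whether the passenger chose to be put into cryosleep for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
Cabin
- The cabin number where the passenger is staying. It takes the form deck/num/side, where side is either P for Port or S for Starboard.
Destination
- The planet where the passenger will disembark.
Age
- The passenger's age.
VIP
- Whether the passenger paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
- The amount the passenger was billed at each of the Spaceship Titanic's luxury amenities.
Name
- The passenger's first and last names.
Transported
- Whether the passenger was transported to another dimension. This is the target label.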
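
The `PassengerId` and `Cabin` formats described above lend themselves to simple string splits. Below is a minimal pandas sketch on a few made-up rows. The new column names (`Group`, `GroupNum`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are illustrative choices, not part of the dataset, and the rest of this post does not use them.

```python
import pandas as pd

# Toy rows in the same format as the dataset (values are made up for illustration).
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", "A/0/S"],
})

# PassengerId is "gggg_pp": travel group, then the passenger's number within the group.
df[["Group", "GroupNum"]] = df["PassengerId"].str.split("_", expand=True)

# Cabin is "deck/num/side", where side is P (Port) or S (Starboard).
df[["Deck", "CabinNum", "Side"]] = df["Cabin"].str.split("/", expand=True)

# Group size is a simple derived feature: how many passengers share a group ID.
df["GroupSize"] = df.groupby("Group")["PassengerId"].transform("count")

print(df)
```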

Import libraries

import pandas as pd
import tensorflow as tf

Data Load

Read CSV files with pandas

train_data = pd.read_csv("/content/drive/MyDrive/Lecture/양재AI_NLP_basic_to_LLMs/2주차/DL_exercise/DL 미니 프로젝트/2_Spaceship_Titanic/spaceship_titanic_train_data.csv")
train_labels = pd.read_csv("/content/drive/MyDrive/Lecture/양재AI_NLP_basic_to_LLMs/2주차/DL_exercise/DL 미니 프로젝트/2_Spaceship_Titanic/spaceship_titanic_train_labels.csv")

test_data = pd.read_csv("/content/drive/MyDrive/Lecture/양재AI_NLP_basic_to_LLMs/2주차/DL_exercise/DL 미니 프로젝트/2_Spaceship_Titanic/spaceship_titanic_test_data.csv")
test_labels = pd.read_csv("/content/drive/MyDrive/Lecture/양재AI_NLP_basic_to_LLMs/2주차/DL_exercise/DL 미니 프로젝트/2_Spaceship_Titanic/spaceship_titanic_test_labels.csv")

train = pd.concat([train_data, train_labels], axis=1)
test = pd.concat([test_data, test_labels], axis=1)
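
`pd.concat(..., axis=1)` aligns rows by index label, not by position. Freshly read CSVs both get a default 0..n-1 index, so the call above pairs each row with its label, but that stops being true once either frame has been filtered or re-indexed. A small sketch with made-up frames shows the failure mode and one way to guard against it:

```python
import pandas as pd

# Made-up feature and label frames; the labels carry a shifted index on purpose.
data = pd.DataFrame({"Age": [39.0, 24.0, 58.0]})
labels = pd.DataFrame({"Transported": [False, True, False]}, index=[1, 2, 3])

# Misaligned indexes: concat takes the union of index labels and pads with NaN.
misaligned = pd.concat([data, labels], axis=1)
print(misaligned.shape)  # 4 rows instead of 3

# Resetting both indexes pairs rows by position.
aligned = pd.concat(
    [data.reset_index(drop=True), labels.reset_index(drop=True)], axis=1
)
assert len(aligned) == len(data) == len(labels)
print(aligned)
```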

Preprocessing

Fill in missing values, then connect the data to a data loader

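# Backward fill: each missing value takes the next valid value below it.
# (fillna(method='bfill') is deprecated in recent pandas; bfill() is equivalent.)
train = train.bfill()
test = test.bfill()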
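
Backward fill copies the next valid value upward, so a missing value in the last row of a column has nothing to copy from and stays `NaN`. The sketch below shows this on a made-up frame, along with one possible fallback (chaining a forward fill). Whether that fallback suits this dataset is an assumption; the original notebook does not do it.

```python
import numpy as np
import pandas as pd

# Made-up frame with a gap in the middle and a gap in the last row.
df = pd.DataFrame({
    "Age": [30.0, np.nan, 41.0, np.nan],
    "HomePlanet": ["Earth", None, "Mars", None],
})

# Backward fill: each gap takes the next valid value below it.
filled = df.bfill()
remaining = filled.isna().sum()
print(remaining)  # the last row is still missing in both columns

# Assumed fallback: forward fill whatever bfill could not reach.
filled = filled.ffill()
assert not filled.isna().any().any()
print(filled)
```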

train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep          bool
Cabin            object
Destination      object
Age             float64
VIP                bool
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

# Some dtypes cannot be converted to tensors as-is
train['HomePlanet'] = train['HomePlanet'].astype('category')
train['CryoSleep'] = train['CryoSleep'].map({True: 1, False: 0})
train['VIP'] = train['VIP'].map({True: 1, False: 0})
train['Transported'] = train['Transported'].map({True: 1, False: 0})

test['HomePlanet'] = test['HomePlanet'].astype('category')
test['CryoSleep'] = test['CryoSleep'].map({True: 1, False: 0})
test['VIP'] = test['VIP'].map({True: 1, False: 0})
test['Transported'] = test['Transported'].map({True: 1, False: 0})

train

test

Data Loader

batch_size = 4

def df_to_dataset(dataframe, label_name="Transported", shuffle=True, batch_size=batch_size):
    dataframe = dataframe.copy()
    labels = dataframe.pop(label_name)
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
        ds = ds.repeat()
    ds = ds.batch(batch_size)

    return ds

train_ds = df_to_dataset(train)
train_ds

#_BatchDataset element_spec=({'PassengerId': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'HomePlanet': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'CryoSleep': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'Cabin': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Destination': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Age': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VIP': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'RoomService': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'FoodCourt': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'ShoppingMall': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Spa': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VRDeck': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Name': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

test_ds = df_to_dataset(test, shuffle=False)
test_ds

#_BatchDataset element_spec=({'PassengerId': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'HomePlanet': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'CryoSleep': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'Cabin': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Destination': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Age': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VIP': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'RoomService': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'FoodCourt': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'ShoppingMall': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Spa': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VRDeck': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Name': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

for t, l in train_ds:
  print(t, l)
  break

for t, l in test_ds:
  print(t, l)
  break

#OUTPUT
{'PassengerId': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'5722_01', b'0453_01', b'8383_06', b'5220_01'], dtype=object)>, 'HomePlanet': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'Earth', b'Europa', b'Earth', b'Mars'], dtype=object)>, 'CryoSleep': <tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 1, 0, 0])>, 'Cabin': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'G/928/S', b'B/14/S', b'G/1368/P', b'D/165/P'], dtype=object)>, 'Destination': <tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'TRAPPIST-1e', b'TRAPPIST-1e', b'TRAPPIST-1e', b'TRAPPIST-1e'],
      dtype=object)>, 'Age': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([44., 19., 57., 21.])>, 'VIP': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'False', b'False', b'False', b'False'], dtype=object)>, 'RoomService': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([  69.,    0.,    0., 2206.])>, 'FoodCourt': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([  0.,   0., 839.,   0.])>, 'ShoppingMall': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([ 17.,   0.,   1., 353.])>, 'Spa': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([176.,   0.,   0.,   0.])>, 'VRDeck': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([2.105e+03, 0.000e+00, 2.000e+00, 1.900e+01])>, 'Name': <tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'Joyces Ington', b'Aldun Taptiritty', b'Jorgie Gibbsonton',
       b'Jackok Cooki'], dtype=object)>} tf.Tensor([0 1 1 0], shape=(4,), dtype=int64)
{'PassengerId': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'0337_02', b'2891_01', b'8998_01', b'1771_01'], dtype=object)>, 'HomePlanet': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'Mars', b'Earth', b'Earth', b'Earth'], dtype=object)>, 'CryoSleep': <tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 0, 1, 0])>, 'Cabin': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'F/63/S', b'G/460/S', b'G/1449/S', b'G/291/P'], dtype=object)>, 'Destination': <tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'TRAPPIST-1e', b'TRAPPIST-1e', b'TRAPPIST-1e', b'TRAPPIST-1e'],
      dtype=object)>, 'Age': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([19., 18., 41., 35.])>, 'VIP': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'False', b'False', b'False', b'False'], dtype=object)>, 'RoomService': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([417.,   4.,   0.,   0.])>, 'FoodCourt': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([349., 904.,   0., 338.])>, 'ShoppingMall': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([634.,   0.,   0., 436.])>, 'Spa': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([3., 0., 0., 0.])>, 'VRDeck': <tf.Tensor: shape=(4,), dtype=float64, numpy=array([1.057e+03, 1.000e+00, 0.000e+00, 0.000e+00])>, 'Name': <tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'Weros Perle', b'Gleney Ortinericey', b'Gerry Englence',
       b'Antone Cardner'], dtype=object)>} tf.Tensor([1 0 0 1], shape=(4,), dtype=int64)
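
To see what `df_to_dataset` yields without the full CSV, the sketch below runs the same steps on a made-up three-column frame. Because the shuffled dataset calls `repeat()`, it never ends on its own, which is why training later needs `steps_per_epoch`. The toy values and the helper name `toy_dataset` are illustrative only, and the columns are passed as NumPy arrays here as a defensive choice.

```python
import pandas as pd
import tensorflow as tf

# Made-up frame with one string, one numeric, and one label column.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth", "Mars", "Earth"],
    "RoomService": [0.0, 69.0, 0.0, 2206.0, 417.0, 4.0],
    "Transported": [0, 1, 1, 0, 1, 0],
})

def toy_dataset(dataframe, label_name="Transported", shuffle=True, batch_size=4):
    dataframe = dataframe.copy()
    labels = dataframe.pop(label_name).to_numpy()
    features = {name: col.to_numpy() for name, col in dataframe.items()}
    # Each element is (dict of feature name -> scalar tensor, label).
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
        ds = ds.repeat()  # infinite stream; the caller must bound it
    return ds.batch(batch_size)

train_like = toy_dataset(toy)
# take(3) bounds the infinite dataset: 3 batches of 4 drawn from only 6 rows.
train_sizes = [int(labels.shape[0]) for _, labels in train_like.take(3)]
print(train_sizes)

eval_like = toy_dataset(toy, shuffle=False)
# Without repeat(), 6 rows give one full batch and one partial batch.
eval_sizes = [int(labels.shape[0]) for _, labels in eval_like]
print(eval_sizes)
```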

list(set(train['VIP']))

#['True', 'False']

Preprocessing with layers

inputs = {
  'CryoSleep': tf.keras.Input(shape=(), dtype='int64'),
  'HomePlanet': tf.keras.Input(shape=(), dtype='string'),
  'RoomService': tf.keras.Input(shape=(), dtype='float64'),
  'VIP': tf.keras.Input(shape=(), dtype='string'),
  'Cabin': tf.keras.Input(shape=(), dtype='string')
}

split_text = tf.strings.split(inputs['Cabin'], sep="/")

# Convert the 0/1 index to one-hot; e.g. [1] -> [0, 1].
type_output = tf.keras.layers.CategoryEncoding(num_tokens=2, output_mode='one_hot')(inputs['CryoSleep'])
print(type_output.shape)
dense_type = tf.keras.layers.Dense(2, activation='relu')(type_output)
print(dense_type.shape)

vip = tf.keras.layers.StringLookup(vocabulary=["True", "False"], num_oov_indices=0, output_mode='one_hot')(inputs['VIP'])
print(vip.shape)

# Convert HomePlanet strings to integer indices (index 0 is reserved for out-of-vocabulary values).
size_output = tf.keras.layers.StringLookup(vocabulary=list(set(train['HomePlanet'])))(inputs['HomePlanet'])
size_output = tf.keras.layers.Reshape([-1])(size_output)
# print(size_output.shape)
dense_size = tf.keras.layers.Dense(2, activation='relu')(size_output)
# print(dense_size.shape)

# Normalize the numeric inputs; e.g. [2.0] -> [0.0].
weight_output = tf.keras.layers.Normalization(
      axis=None, mean=2.0, variance=1.0)(inputs['RoomService'])
weight_output = tf.keras.layers.Reshape([-1])(weight_output)
# print(weight_output.shape)
dense_weight = tf.keras.layers.Dense(2, activation='relu')(weight_output)
# print(dense_weight.shape)

x = tf.concat([dense_type, dense_size, dense_weight], -1)  # concatenate along the last (feature) axis -> (batch, features)
x = tf.keras.layers.Dense(4, activation='relu')(x)
x = tf.keras.layers.Dense(4, activation='relu')(x)

outputs = tf.keras.layers.Dense(1)(x)  # raw logit; the sigmoid is applied inside the BCE loss (from_logits=True)

# outputs = {
#   'CryoSleep': type_output,
#   'HomePlanet': size_output,
#   'RoomService': weight_output,
#   'VIP': vip,
#   's': split_text
# }

preprocessing_model = tf.keras.Model(inputs, outputs)

#(None, 2)
#(None, 2)
#/usr/local/lib/python3.10/dist-packages/numpy/core/numeric.py:2463: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
#  return bool(asarray(a1 == a2).all())
#(None, 2)

Input and preprocessing Layers

String Look Up

tf.strings.split(train['Cabin'], "/")[:,-1:].numpy()

array([[b'P'],
       [b'P'],
       [b'S'],
       ...,
       [b'P'],
       [b'P'],
       [b'P']], dtype=object)

inputs = {
    'Cabin': tf.keras.Input(shape=(), dtype='string')
}

# Split the cabin string on "/"
cabin_split = tf.strings.split(inputs['Cabin'], "/")

# Keep only the last element (the side)
cabin_last = cabin_split.to_tensor()[:, -1]

# One-hot encode the side with a StringLookup layer
cabin_output = tf.keras.layers.StringLookup(vocabulary=["S", "P"], num_oov_indices=1, output_mode='one_hot')(cabin_last)

outputs = {
    'Cabin': cabin_output
}

model = tf.keras.Model(inputs, outputs)

for t, l in train_ds:
    pred = model(t)
    print(pred['Cabin'])
    break

tf.Tensor(
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]], shape=(4, 3), dtype=float32)
/usr/local/lib/python3.10/dist-packages/keras/src/engine/functional.py:642: UserWarning: Input dict contained keys ['PassengerId', 'HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Name'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
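
The three output columns come from `num_oov_indices=1`: column 0 is reserved for out-of-vocabulary strings, followed by one column per vocabulary entry in the order given. A minimal check with hand-written inputs (the value `"X"` is a made-up unknown side):

```python
import tensorflow as tf

# Same configuration as above: one OOV slot, then "S", then "P".
side_lookup = tf.keras.layers.StringLookup(
    vocabulary=["S", "P"], num_oov_indices=1, output_mode="one_hot")

# "X" is not in the vocabulary, so it lands in the OOV column (index 0).
encoded = side_lookup(tf.constant(["S", "P", "X"])).numpy()
print(encoded)
# Columns: [OOV, S, P]
```

This matches the output above, where a Starboard cabin is encoded as `[0. 1. 0.]` and a Port cabin as `[0. 0. 1.]`.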

s = "F/575/P"
tf.strings.split(s, sep="/")

#<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'F', b'575', b'P'], dtype=object)>

list(set(train['HomePlanet']))

#['Mars', 'Earth', 'Europa']

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=list(set(train['HomePlanet'])),
    num_oov_indices=0,
    output_mode='one_hot')
"""
int 0, 1, 2 (idx)
one_hot [1, 0, 0]
multi_hot [1, 0, 1] (several values per input)
"""

string_lookup_layer(list(train['HomePlanet'].to_numpy()))

#<tf.Tensor: shape=(6954, 3), dtype=float32, numpy=
#array([[0., 1., 0.],
#       [0., 1., 0.],
#       [0., 0., 1.],
#       ...,
#       [0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 1.]], dtype=float32)>
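
Because the iteration order of a `set` of strings can change between Python sessions, `list(set(...))` may assign the one-hot columns differently on each run. One option is to sort the vocabulary so the column order is reproducible. A sketch with a made-up column standing in for `train['HomePlanet']`:

```python
import pandas as pd
import tensorflow as tf

# Made-up stand-in for train['HomePlanet'].
home_planet = pd.Series(["Mars", "Earth", "Europa", "Earth"])

# sorted() fixes the column order: Earth, Europa, Mars.
vocab = sorted(home_planet.unique())
lookup = tf.keras.layers.StringLookup(
    vocabulary=vocab, num_oov_indices=0, output_mode="one_hot")

print(lookup.get_vocabulary())
encoded = lookup(tf.constant(home_planet.tolist())).numpy()
print(encoded)
```

With a fixed order, a saved model and a later preprocessing run agree on which column means which planet.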

Category Encoding

train['CryoSleep']

0       0
1       0
2       1
3       0
4       1
       ..
6949    0
6950    0
6951    0
6952    0
6953    0
Name: CryoSleep, Length: 6954, dtype: int64

one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=2, output_mode='one_hot')
one_hot_layer(list(train['CryoSleep'].to_numpy()))

<tf.Tensor: shape=(6954, 2), dtype=float32, numpy=
array([[1., 0.],
       [1., 0.],
       [0., 1.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]], dtype=float32)>
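
What the layer computes can be reproduced by hand. The NumPy sketch below is not the Keras implementation, only an equivalent calculation on small made-up inputs. It covers the `one_hot` mode used here and the `multi_hot` mode noted in the String Look Up section.

```python
import numpy as np

num_tokens = 2
cryo_sleep = np.array([0, 0, 1, 0, 1])  # made-up 0/1 values

# one_hot: row i of the identity matrix is the one-hot vector for index i.
one_hot = np.eye(num_tokens, dtype=np.float32)[cryo_sleep]
print(one_hot)

# multi_hot: each sample holds several indices; mark every index that appears.
samples = [[0, 2], [1], [0, 1, 2]]  # made-up index lists over 3 tokens
multi_hot = np.zeros((len(samples), 3), dtype=np.float32)
for row, indices in enumerate(samples):
    multi_hot[row, indices] = 1.0
print(multi_hot)
```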

Normalization

normalization_layer = tf.keras.layers.Normalization(mean=1.0, variance=10.0)

# The layer first subtracts mean from every input. With mean=1.0, an input of [1., 2., 3.] becomes [0., 1., 2.].
# It then divides by the square root of variance, here sqrt(10.), giving [0., 1/sqrt(10), 2/sqrt(10)]. (normalization)

print(normalization_layer(train['RoomService'][:5]).numpy())
print(train['RoomService'][:5].to_numpy())

#[-0.31622776 -0.31622776 -0.31622776 -0.31622776 -0.31622776]
#[0. 0. 0. 0. 0.]
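
The printed value can be checked by hand: with `mean=1.0` and `variance=10.0`, an input of 0 maps to (0 - 1) / sqrt(10) ≈ -0.316. The sketch below repeats that arithmetic in NumPy and also shows statistics computed from the data instead of fixed constants, which is what the layer's `adapt()` method is meant to provide. The sample amounts are made up, and `normalize` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def normalize(x, mean, variance):
    # Subtract the mean, then divide by the standard deviation.
    return (x - mean) / np.sqrt(variance)

# Reproduce the notebook's setting: mean=1.0, variance=10.0, inputs all zero.
zeros = np.zeros(5, dtype=np.float32)
fixed = normalize(zeros, mean=1.0, variance=10.0)
print(fixed)  # each entry is -1/sqrt(10)

# Alternative: use the data's own statistics (made-up RoomService amounts).
room_service = np.array([0.0, 69.0, 0.0, 2206.0, 417.0])
scaled = normalize(room_service, room_service.mean(), room_service.var())
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```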

Model Train

preprocessing_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

max_epochs = 150

history = preprocessing_model.fit(train_ds,
                                  epochs=max_epochs,
                                  steps_per_epoch=len(train) // batch_size,
                                  validation_data=test_ds,
                                  validation_steps=len(test) // batch_size)

1738/1738 [==============================] - 6s 3ms/step - loss: 0.5712 - accuracy: 0.7172 - val_loss: 0.5622 - val_accuracy: 0.7247
Epoch 144/150
1738/1738 [==============================] - 5s 3ms/step - loss: 0.5745 - accuracy: 0.7145 - val_loss: 0.5625 - val_accuracy: 0.7247
Epoch 145/150
1738/1738 [==============================] - 5s 3ms/step - loss: 0.5732 - accuracy: 0.7158 - val_loss: 0.5622 - val_accuracy: 0.7247
Epoch 146/150
1738/1738 [==============================] - 7s 4ms/step - loss: 0.5716 - accuracy: 0.7168 - val_loss: 0.5623 - val_accuracy: 0.7247
Epoch 147/150
1738/1738 [==============================] - 5s 3ms/step - loss: 0.5734 - accuracy: 0.7155 - val_loss: 0.5628 - val_accuracy: 0.7247
Epoch 148/150
1738/1738 [==============================] - 5s 3ms/step - loss: 0.5690 - accuracy: 0.7185 - val_loss: 0.5622 - val_accuracy: 0.7247
Epoch 149/150
1738/1738 [==============================] - 6s 3ms/step - loss: 0.5718 - accuracy: 0.7184 - val_loss: 0.5621 - val_accuracy: 0.7247
Epoch 150/150
1738/1738 [==============================] - 5s 3ms/step - loss: 0.5727 - accuracy: 0.7142 - val_loss: 0.5621 - val_accuracy: 0.7247
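
The final `Dense(1)` layer has no activation and the loss uses `from_logits=True`, so the model's outputs are raw logits rather than probabilities. To turn them into 0/1 predictions, apply a sigmoid and threshold at 0.5, which is the same as thresholding the logit at 0. A NumPy sketch with made-up logits standing in for the output of `preprocessing_model.predict(...)`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits, shaped (batch, 1) like the model's output.
logits = np.array([[-1.2], [0.3], [2.0], [0.6]])

probs = sigmoid(logits)            # probabilities in (0, 1)
preds = (probs > 0.5).astype(int)  # 1 = Transported, 0 = not
print(probs.ravel())
print(preds.ravel())

# Thresholding the probability at 0.5 equals thresholding the logit at 0.
assert np.array_equal(preds, (logits > 0).astype(int))
```

A logit of 0.3 corresponds to a probability of about 0.57, so it counts as class 1 here. Comparing the raw logit against 0.5 instead would label it class 0, so it is worth checking which threshold an accuracy metric applies when the model outputs logits.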
