Sentiment Classification_IMDB_baseline

Jacob Kim · January 31, 2024

Naver Project Week3


Sentiment Classification

Task

  • Classify each of the 50,000 movie reviews from the IMDB site as positive or negative.
  • The goal is to implement the deep-learning training pipeline yourself, everything except loading the data.

Dataset

  • IMDB datasets

Base code

  • Dataset: split into train, val, and test sets
  • Input data shape: (batch_size, max_sequence_length)
  • Output data shape: (batch_size, 1)
  • Architecture:
    • A simple RNN-based classification model as a guide (see the sketch just below this list)
    • Embedding - SimpleRNN - Dense (with Sigmoid)
    • Use tf.keras.layers
  • Training
    • Use model.fit
  • Evaluation
    • Use model.evaluate on the test dataset
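
For reference, here is a minimal, self-contained sketch of the baseline described above (Embedding - SimpleRNN - Dense with sigmoid). The vocabulary size, embedding size, and RNN width below are illustrative placeholders, not the values used later in this post.

import tensorflow as tf
from tensorflow.keras import layers

# Minimal baseline sketch: Embedding -> SimpleRNN -> Dense(sigmoid)
baseline = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),   # token id -> dense vector
    layers.SimpleRNN(32),                                # last hidden state summarizes the review
    layers.Dense(1, activation='sigmoid')                # probability of a positive review
])
baseline.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])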

Try some techniques

  • Adjust the number of training epochs
  • Change the model architecture (custom model) (see the sketch just below this list)
    • Use other recurrent cells (LSTM, GRU, etc.)
    • Use dropout layers
  • Adjust the embedding size
    • Or train on one-hot vectors instead
  • Vary the number of words in the vocabulary
  • Change the padding option
  • Data augmentation (if possible)
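
As one example of these techniques, the sketch below (reusing the imports from the previous sketch) swaps the SimpleRNN for an LSTM and adds dropout. The layer sizes and dropout rates are illustrative assumptions, not the configuration trained later in this post.

# Variant sketch: LSTM cell plus dropout layers
variant = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.LSTM(64, dropout=0.2),          # dropout applied to the LSTM inputs
    layers.Dropout(0.3),                   # extra dropout before the classifier
    layers.Dense(1, activation='sigmoid')
])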

Natural language processing workflow

The overall workflow is roughly: load the data, pad/truncate the reviews to a fixed length, build the model, train it, and evaluate it on the test set.

Import modules

use_colab = True
assert use_colab in [True, False]
from google.colab import drive
drive.mount('/content/drive')
# Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import time
import shutil
import tarfile

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

import tensorflow as tf

from tensorflow.keras import layers

os.environ["CUDA_VISIBLE_DEVICES"]="0"

Load Data

  • We use all 50,000 movie reviews from IMDB.
  • tf.keras.datasets already provides a nicely preprocessed version of this dataset, so it is easy to download and use.
  • The original data is raw text, but tf.keras.datasets.imdb comes already tokenized.
# Load training and eval data from tf.keras
imdb = tf.keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
train_labels = train_labels.astype(np.float64)
test_labels = test_labels.astype(np.float64)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 0s 0us/step
print("Train-set size: ", len(train_data))
print("Test-set size:  ", len(test_data))
#Train-set size:  25000
#Test-set size:   25000

Print the raw data

  • Let's look at the data exactly as it comes out of the loader.
print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
print("sequence length: {}".format(len(train_data[0])))
sequence length: 218
  • Let's check the label information
    • 0.0 means a negative review
    • 1.0 means a positive review
# negative sample
index = 1
print("text: {}\n".format(train_data[index]))
print("label: {}".format(train_labels[index]))
text: [1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]

label: 0.0
# positive sample
index = 200
print("text: {}\n".format(train_data[index]))
print("label: {}".format(train_labels[index]))
text: [1, 14, 9, 6, 227, 196, 241, 634, 891, 234, 21, 12, 69, 6, 6, 176, 7, 4, 804, 4658, 2999, 667, 11, 12, 11, 85, 715, 6, 176, 7, 1565, 8, 1108, 10, 10, 12, 16, 1844, 2, 33, 211, 21, 69, 49, 2009, 905, 388, 99, 2, 125, 34, 6, 2, 1274, 33, 4, 130, 7, 4, 22, 15, 16, 6424, 8, 650, 1069, 14, 22, 9, 44, 4609, 153, 154, 4, 318, 302, 1051, 23, 14, 22, 122, 6, 2093, 292, 10, 10, 723, 8721, 5, 2, 9728, 71, 1344, 1576, 156, 11, 68, 251, 5, 36, 92, 4363, 133, 199, 743, 976, 354, 4, 64, 439, 9, 3059, 17, 32, 4, 2, 26, 256, 34, 2, 5, 49, 7, 98, 40, 2345, 9844, 43, 92, 168, 147, 474, 40, 8, 67, 6, 796, 97, 7, 14, 20, 19, 32, 2188, 156, 24, 18, 6090, 1007, 21, 8, 331, 97, 4, 65, 168, 5, 481, 53, 3084]

label: 1.0

Prepare dataset

Convert the integers back to words

  • Let's confirm that the data we are working with really is review text.
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
  return ' '.join([reverse_word_index.get(i, '?') for i in text])

Print the decoded text

print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
decode_review(train_data[0])
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

Padding and truncating data using pad sequences

  • The reviews all have different lengths; let's unify them to a single fixed length.
from tensorflow.keras.preprocessing.sequence import pad_sequences
num_seq_length = np.array([len(tokens) for tokens in list(train_data) + list(test_data)])
train_seq_length = np.array([len(tokens) for tokens in train_data], dtype=np.int32)
test_seq_length = np.array([len(tokens) for tokens in test_data], dtype=np.int32)
max_seq_length = 256
  • Fraction of reviews shorter than the max length:
print(np.sum(num_seq_length < max_seq_length) / len(num_seq_length))
# 0.8052
  • Setting max_seq_length to 256 covers about 80% of the dataset.
  • The remaining ~20% of the reviews are longer than 256 tokens.
  • Reviews longer than the chosen max_seq_length are usually truncated.
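
If you want to tune this trade-off, a small check like the one below (using the num_seq_length array computed above) prints the coverage for a few candidate lengths; the candidate values are arbitrary examples.

for candidate in [128, 256, 384, 512]:
    coverage = np.mean(num_seq_length < candidate)
    print("maxlen={:4d} -> {:.1%} of reviews fit without truncation".format(candidate, coverage))
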
# There are two padding options: 'pre' and 'post'.
pad = 'pre'
# pad = 'post'
train_data_pad = pad_sequences(train_data,
                               maxlen=max_seq_length,
                               padding=pad,
                               value=word_index["<PAD>"])
test_data_pad = pad_sequences(test_data,
                              maxlen=max_seq_length,
                              padding=pad,
                              value=word_index["<PAD>"])
print(train_data_pad.shape)
print(test_data_pad.shape)
# (25000, 256)
# (25000, 256)
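
To see what the two options do, here is a quick illustrative comparison on a toy sequence (the numbers are made up):

toy = [[11, 22, 33]]
print(pad_sequences(toy, maxlen=6, padding='pre',  value=0))   # [[ 0  0  0 11 22 33]]
print(pad_sequences(toy, maxlen=6, padding='post', value=0))   # [[11 22 33  0  0  0]]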

Print the padded data

index = 0
print("text: {}\n".format(decode_review(train_data[index])))
print("token: {}\n".format(train_data[index]))
print("pad: {}".format(train_data_pad[index]))
text: <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

token: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

pad: [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32]

Create a validation set

num_val_data = 5000
val_data_pad = train_data_pad[:num_val_data]
train_data_pad_partial = train_data_pad[num_val_data:]

val_labels = train_labels[:num_val_data]
train_labels_partial = train_labels[num_val_data:]
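
Since the validation set is simply the first 5,000 training examples, a quick optional check confirms the split sizes and that both splits keep a reasonable mix of positive and negative labels:

print(len(train_data_pad_partial), len(val_data_pad))      # 20000 5000
print(train_labels_partial.mean(), val_labels.mean())      # fraction of positive reviews in each split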

Construct the datasets

batch_size = 1024

# for train
train_dataset = tf.data.Dataset.from_tensor_slices((train_data_pad_partial, train_labels_partial))
train_dataset = train_dataset.shuffle(10000).repeat().batch(batch_size=batch_size)
print(train_dataset)

# for test
test_dataset = tf.data.Dataset.from_tensor_slices((test_data_pad, test_labels))
test_dataset = test_dataset.batch(batch_size=batch_size)
print(test_dataset)

# for valid
valid_dataset = tf.data.Dataset.from_tensor_slices((val_data_pad, val_labels))
valid_dataset = valid_dataset.batch(batch_size=batch_size)
print(valid_dataset)
#<BatchDataset shapes: ((None, 256), (None,)), types: (tf.int32, tf.float64)>
#<BatchDataset shapes: ((None, 256), (None,)), types: (tf.int32, tf.float64)>
#<BatchDataset shapes: ((None, 256), (None,)), types: (tf.int32, tf.float64)>
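
Because the training dataset calls .repeat(), it is infinite, so model.fit below needs an explicit steps_per_epoch (and validation_steps for the batched validation set). With the sizes used here the arithmetic is simple:

steps_per_epoch = len(train_data_pad_partial) // batch_size      # 20000 // 1024 = 19
validation_steps = int(np.ceil(len(val_data_pad) / batch_size))  # ceil(5000 / 1024) = 5
print(steps_per_epoch, validation_steps)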

Setup hyper-parameters

# Set the hyperparameters
# (values chosen to match the model summary and training log shown below)
max_epochs = 10
embedding_size = 256
vocab_size = 10000   # must match num_words passed to imdb.load_data

# the save point
if use_colab:
    checkpoint_dir ='./drive/My Drive/train_ckpt/sentimental/exp1'
    if not os.path.isdir(checkpoint_dir):
        os.makedirs(checkpoint_dir)
else:
    checkpoint_dir = 'sentimental/exp1'

Build the model

Embedding layer

  • The embedding layer maps each integer index in the vocabulary (num_words entries) to a dense vector of size embedding_size.
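
As a quick standalone illustration (not part of the model built below), an Embedding layer turns a batch of integer token ids into a 3-D tensor of dense vectors:

demo_emb = layers.Embedding(input_dim=10000, output_dim=256)
demo_ids = tf.constant([[1, 14, 22, 16, 0, 0]])   # one padded sequence of 6 token ids
print(demo_emb(demo_ids).shape)                    # (1, 6, 256): (batch, seq_len, embedding_size)
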
model = tf.keras.Sequential()
model.add(layers.Embedding(vocab_size, embedding_size))  # (batch, seq_len) -> (batch, seq_len, 256)

# Feel free to build your own stack with model.add; the layers below are the choice used in this post.
model.add(layers.Bidirectional(layers.GRU(32, return_sequences=True)))  # keep per-timestep outputs for the next GRU
model.add(layers.Bidirectional(layers.GRU(32)))                          # final state summarizes the review
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))  # probability of a positive review
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 256)         2560000   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 64)          55680     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                18816     
_________________________________________________________________
dense (Dense)                (None, 16)                1040      
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 9         
=================================================================
Total params: 2,635,681
Trainable params: 2,635,681
Non-trainable params: 0
_________________________________________________________________
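
As a quick sanity check, the parameter counts in the summary add up: the embedding contributes 10,000 × 256 = 2,560,000; each direction of the first GRU(32) has 3 × 32 × (256 + 32) weights plus 2 × 3 × 32 biases (Keras GRUs keep two bias vectors per gate by default), i.e. 27,840 parameters, so the bidirectional layer has 55,680; the second bidirectional GRU, whose inputs are 64-dimensional, has 2 × (3 × 32 × (64 + 32) + 2 × 3 × 32) = 18,816; and the dense layers add 64 × 16 + 16 = 1,040, 16 × 8 + 8 = 136, and 8 × 1 + 1 = 9, for a total of 2,635,681.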

Compile the model

# The learning rate below is an assumption (the original left it as a TODO);
# binary cross-entropy is the natural loss for a single sigmoid output.
optimizer = tf.keras.optimizers.Adam(1e-3)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])

Train the model

cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_dir,
                                                 save_weights_only=True,
                                                 monitor='val_loss',
                                                 mode='auto',
                                                 save_best_only=True,
                                                 verbose=1)
history = model.fit(train_dataset,
                    epochs=max_epochs,
                    validation_data=valid_dataset,
                    steps_per_epoch=len(train_data_pad_partial) // batch_size,      # 19 steps, matching the log below
                    validation_steps=int(np.ceil(len(val_data_pad) / batch_size)),
                    callbacks=[cp_callback])
Epoch 1/10
19/19 [==============================] - 126s 6s/step - loss: 0.6891 - accuracy: 0.5939 - val_loss: 0.6774 - val_accuracy: 0.6765

Epoch 00001: val_loss improved from inf to 0.67736, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 2/10
19/19 [==============================] - 105s 6s/step - loss: 0.6089 - accuracy: 0.7507 - val_loss: 0.5100 - val_accuracy: 0.7571

Epoch 00002: val_loss improved from 0.67736 to 0.50999, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 3/10
19/19 [==============================] - 106s 6s/step - loss: 0.3875 - accuracy: 0.8328 - val_loss: 0.4383 - val_accuracy: 0.7944

Epoch 00003: val_loss improved from 0.50999 to 0.43826, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 4/10
19/19 [==============================] - 106s 6s/step - loss: 0.2686 - accuracy: 0.8956 - val_loss: 0.4098 - val_accuracy: 0.8296

Epoch 00004: val_loss improved from 0.43826 to 0.40978, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 5/10
19/19 [==============================] - 106s 6s/step - loss: 0.1885 - accuracy: 0.9321 - val_loss: 0.4133 - val_accuracy: 0.8362

Epoch 00005: val_loss did not improve from 0.40978
Epoch 6/10
19/19 [==============================] - 106s 6s/step - loss: 0.1261 - accuracy: 0.9597 - val_loss: 0.4417 - val_accuracy: 0.8501

Epoch 00006: val_loss did not improve from 0.40978
Epoch 7/10
19/19 [==============================] - 105s 6s/step - loss: 0.0815 - accuracy: 0.9772 - val_loss: 0.5225 - val_accuracy: 0.8362

Epoch 00007: val_loss did not improve from 0.40978
Epoch 8/10
19/19 [==============================] - 105s 6s/step - loss: 0.0783 - accuracy: 0.9752 - val_loss: 0.5447 - val_accuracy: 0.8418

Epoch 00008: val_loss did not improve from 0.40978
Epoch 9/10
19/19 [==============================] - 105s 6s/step - loss: 0.0530 - accuracy: 0.9861 - val_loss: 0.5841 - val_accuracy: 0.8416

Epoch 00009: val_loss did not improve from 0.40978
Epoch 10/10
19/19 [==============================] - 106s 6s/step - loss: 0.0340 - accuracy: 0.9928 - val_loss: 0.6312 - val_accuracy: 0.8406

Epoch 00010: val_loss did not improve from 0.40978
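
The log shows classic overfitting: training accuracy keeps climbing while val_loss stops improving after epoch 4, and the checkpoint callback keeps that best epoch. One optional addition (a suggestion only, not something used in this run) is an EarlyStopping callback:

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=3,
                                              restore_best_weights=True)
# e.g. pass callbacks=[cp_callback, early_stop] to model.fit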

Test the model

  • Let's evaluate the model on the test dataset.
model.load_weights(checkpoint_dir)
# <tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f9ad5c70a50>
results = model.evaluate(test_dataset)
# loss
print("loss value: {:.3f}".format(results[0]))
# accuracy
print("accuracy value: {:.3f}".format(results[1]))
25/25 [==============================] - 14s 553ms/step - loss: 0.4232 - accuracy: 0.8255
loss value: 0.423
accuracy value: 0.825
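
Finally, to run the trained model on a new review, the raw text has to go through the same preprocessing (word_index lookup plus padding). Below is a minimal sketch that assumes a naive lowercase/whitespace tokenization, which only approximates the tokenization used to build the original dataset:

def encode_review(text, maxlen=max_seq_length):
    # Naive tokenization: lowercase, split on spaces, map unknown/out-of-vocabulary words to <UNK>
    tokens = [word_index["<START>"]]
    for w in text.lower().split():
        idx = word_index.get(w, word_index["<UNK>"])
        tokens.append(idx if idx < vocab_size else word_index["<UNK>"])
    return pad_sequences([tokens], maxlen=maxlen,
                         padding=pad, value=word_index["<PAD>"])

sample = encode_review("this film was just brilliant and i loved it")
print(model.predict(sample))   # a value close to 1.0 means a positive prediction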