๐Ÿงฉ Earning the TensorFlow Certification - Part 8. Practice (Sarcasm)

vinca ยท January 3, 2023

๐ŸŒ• AI/DL - TensorFlow Certification

๋ชฉ๋ก ๋ณด๊ธฐ
8/11
post-thumbnail

Classifying Sarcasm

  • RNN ์„ ํ™œ์šฉํ•œ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ (Text Classification)

NLP QUESTION
For this task you will build a classifier for the sarcasm dataset
The classifier should have a final layer with 1 neuron activated by sigmoid as shown.
It will be tested against a number of sentences that the network hasn't previously seen
And you will be scored on whether sarcasm was correctly detected in those sentences

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ
์ด ์ž‘์—…์—์„œ๋Š” sarcasm ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์ž‘์„ฑํ•ฉ๋‹ˆ๋‹ค.
๋ถ„๋ฅ˜๊ธฐ๋Š” 1 ๊ฐœ์˜ ๋‰ด๋Ÿฐ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ sigmoid ํ™œ์„ฑํ•จ์ˆ˜๋กœ ๊ตฌ์„ฑ๋œ ์ตœ์ข… ์ธต์„ ๊ฐ€์ ธ์•ผํ•ฉ๋‹ˆ๋‹ค.

์ œ์ถœ๋  ๋ชจ๋ธ์€ ๋ฐ์ดํ„ฐ์…‹์ด ์—†๋Š” ์—ฌ๋Ÿฌ ๋ฌธ์žฅ์— ๋Œ€ํ•ด ํ…Œ์ŠคํŠธ๋ฉ๋‹ˆ๋‹ค.
๊ทธ๋ฆฌ๊ณ  ๋‹น์‹ ์€ ๊ทธ ๋ฌธ์žฅ์—์„œ sarcasm ํŒ๋ณ„์ด ์ œ๋Œ€๋กœ ๊ฐ์ง€๋˜์—ˆ๋Š”์ง€์— ๋”ฐ๋ผ ์ ์ˆ˜๋ฅผ ๋ฐ›๊ฒŒ ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค


Solution

Summary of steps

  1. import: import the required modules.
  2. Preprocessing: perform the data preprocessing needed for training.
  3. Modeling (model): define the model.
  4. Compile: compile the model.
  5. Fit: train the model.

1. Import

Import the required modules.

import json
import tensorflow as tf
import numpy as np
import urllib.request

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint

2.1 Preprocessing (Load dataset)

Download the required dataset. The sarcasm.json file is fetched directly from the URL below with urllib.

url = 'https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json'
urllib.request.urlretrieve(url, 'sarcasm.json')

2.2 Preprocessing (Load the JSON file)

## Load the JSON file
with open('sarcasm.json') as f:
    datas = json.load(f)

Let's print the first 5 records of datas.

  โ€ข article_link: URL of the news article
  โ€ข headline: headline of the news article
  โ€ข is_sarcastic: whether the article is sarcastic (sarcastic: 1, normal: 0)
datas[:5]

2.3 Preprocessing: building the dataset (sentences, labels)

Create empty lists (sentences, labels).

  โ€ข X (Feature): sentences
  โ€ข Y (Label): labels
sentences = []
labels = []
for data in datas:
    sentences.append(data['headline'])
    labels.append(data['is_sarcastic'])

Print the first five sentences and their labels.

sentences[:5]
labels[:5]

2.4 Preprocessing (Train / Validation Set split)

Split the dataset at the 20,000-sample mark.

training_size = 20000
train_sentences = sentences[:training_size]
train_labels = labels[:training_size]
validation_sentences = sentences[training_size:]
validation_labels = labels[training_size:]

2.5 Preprocessing Step 1. Define the Tokenizer

Tokenize the words.

  โ€ข num_words: the maximum vocabulary size; the most frequent words are kept first.
  โ€ข oov_token: how to represent words that are not in the vocabulary.
vocab_size = 1000
oov_tok = "<OOV>"
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

2.6 Preprocessing Step 2. Fit the Tokenizer on the training sentences

fit_on_texts builds the vocabulary from the training sentences.

tokenizer.fit_on_texts(train_sentences)
for key, value in tokenizer.word_index.items():
    print('{}  \t======>\t {}'.format(key, value))
    if value == 25:
        break


ํ† ํฐํ™”๋œ ๋‹จ์–ด ์‚ฌ์ „์˜ ๊ฐฏ์ˆ˜

len(tokenizer.word_index)

๋‹จ์–ด์‚ฌ์ „์€ dictionary ํ˜•ํƒœ๋กœ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

์ฆ‰, ๋‹จ์–ด๋ฅผ key๋กœ ์ž…๋ ฅํ•˜๋ฉด ๊ฐ’์„ return ํ•ฉ๋‹ˆ๋‹ค.

word_index = tokenizer.word_index
word_index['trump']
word_index['hello']

2.7 Preprocessing Step 3. Convert the sentences into token sequences

texts_to_sequences converts sentences into numbers. It must be applied separately to both the Train Set and the Validation Set.

train_sequences = tokenizer.texts_to_sequences(train_sentences)
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)

๋ณ€ํ™˜๋œ Sequences ํ™•์ธ

train_sequences[:5]
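What fit_on_texts and texts_to_sequences do can be sketched in plain Python. This is an assumed simplification of the real Tokenizer (which also lowercases and strips punctuation), and the toy texts are made up for illustration:

```python
from collections import Counter

# Sketch: rank words by frequency, reserve index 1 for the OOV token
# (as Keras does), then map each sentence to its word indices.
texts = ["the cat sat", "the dog sat down"]
counts = Counter(word for text in texts for word in text.split())

OOV = 1  # like oov_token='<OOV>', which Keras assigns index 1
word_index = {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}

def to_sequence(text):
    return [word_index.get(word, OOV) for word in text.split()]

# 'ran' is not in the vocabulary, so it maps to the OOV index.
assert to_sequence("the cat ran") == [word_index["the"], word_index["cat"], OOV]
```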

2.8 Preprocessing Step 4. Make the sequences the same length

Pass in three options.

  โ€ข maxlen: the maximum sentence length; longer sentences are cut down.
  โ€ข truncating: whether to cut from the front ('pre') or the back ('post') when a sentence is longer than maxlen.
  โ€ข padding: whether to pad the front ('pre') or the back ('post') when a sentence is shorter than maxlen.
# maximum number of words in a sentence
max_length = 120

# where to truncate a sentence
trunc_type='post'

# where to pad a sentence
padding_type='post'
train_padded = pad_sequences(train_sequences, maxlen=max_length, truncating=trunc_type, padding=padding_type)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
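The effect of padding='post' and truncating='post' can be sketched in a few lines of plain Python (illustration only; pad_sequences itself also supports 'pre' and returns a numpy array):

```python
# Minimal sketch of pad_sequences(maxlen=..., padding='post', truncating='post')
# applied to a single sequence.
def pad_post(seq, maxlen, value=0):
    seq = seq[:maxlen]                           # truncating='post': cut the tail
    return seq + [value] * (maxlen - len(seq))   # padding='post': fill the tail

assert pad_post([5, 3, 8], maxlen=5) == [5, 3, 8, 0, 0]        # padded at the end
assert pad_post([1, 2, 3, 4, 5, 6], maxlen=5) == [1, 2, 3, 4, 5]  # cut at the end
```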

๋ณ€ํ™˜๋œ Sequences ํ™•์ธ

train_padded.shape

2.9 Preprocessing Step 5. Convert the labels to numpy arrays

Since the model cannot take a Python list, convert the labels to numpy arrays.

train_labels = np.array(train_labels)
validation_labels = np.array(validation_labels)

Embedding Layer

It compresses a high-dimensional representation into a low-dimensional one.
With one-hot encoding, each word would be a 1,000-dimensional vector; the Embedding layer reduces it to 16 dimensions, which alleviates the sparsity problem.

embedding_dim = 16
  • ๋ณ€ํ™˜ ์ „
sample = np.array(train_padded[0])
sample

  • ๋ณ€ํ™˜ ํ›„
x = Embedding(vocab_size, embedding_dim, input_length=max_length)
x(sample)[0]
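Under the hood, an Embedding layer is just a table lookup. The sketch below (random weights standing in for trained ones, no TensorFlow needed) shows that looking up a token id in a (vocab_size ร— embedding_dim) table is equivalent to multiplying its one-hot vector by the same matrix:

```python
import numpy as np

vocab_size, embedding_dim = 1000, 16
rng = np.random.default_rng(0)
table = rng.normal(size=(vocab_size, embedding_dim))  # stand-in for trained weights

token_id = 42
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Row lookup and one-hot matrix product yield the same 16-dim vector.
assert np.allclose(table[token_id], one_hot @ table)
```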

3. ๋ชจ๋ธ ์ •์˜ (Sequential)

์ด์ œ Modeling์„ ํ•  ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค.

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

๋ชจ๋ธ ๊ฒฐ๊ณผ ์š”์•ฝ

model.summary()
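The counts reported by model.summary() can be reproduced by hand. Per direction, a Keras LSTM layer has 4ยท((input_dim + units)ยทunits + units) parameters (4 gates, each with input weights, recurrent weights, and a bias), and the Bidirectional wrapper doubles that:

```python
vocab_size, embedding_dim, units = 1000, 16, 64

def lstm_params(input_dim, units):
    # 4 gates (input, forget, cell, output), each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

embedding = vocab_size * embedding_dim           # 16,000
bilstm1 = 2 * lstm_params(embedding_dim, units)  # 41,472
bilstm2 = 2 * lstm_params(2 * units, units)      # 98,816 (input is 128-dim)
dense1 = dense_params(2 * units, 32)             # 4,128
dense2 = dense_params(32, 16)                    # 528
out = dense_params(16, 1)                        # 17

total = embedding + bilstm1 + bilstm2 + dense1 + dense2 + out
assert total == 160_961
```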

4. ์ปดํŒŒ์ผ (compile)

  1. optimizer๋Š” ๊ฐ€์žฅ ์ตœ์ ํ™”๊ฐ€ ์ž˜๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ 'adam'์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  2. loss๋Š” ์ด์ง„ ๋ถ„๋ฅ˜์ด๊ธฐ ๋•Œ๋ฌธ์— binary_crossentropy๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
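For a single sample, binary cross-entropy is just the negative log of the probability assigned to the true class. A minimal sketch (mirroring the per-sample formula, with the epsilon clipping Keras applies to avoid log(0)):

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    # Clip the prediction away from 0 and 1 so the logs stay finite.
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

# A confident correct prediction has low loss; a confident wrong one, high loss.
assert bce(1, 0.9) < bce(1, 0.1)
assert abs(bce(1, 0.9) + math.log(0.9)) < 1e-9
```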

ModelCheckpoint: creating a checkpoint

Create a ModelCheckpoint to save the best model at each epoch, based on val_loss.

  โ€ข checkpoint_path sets the file name the model will be saved to.
  โ€ข Declare the ModelCheckpoint and set the appropriate options.
checkpoint_path = 'my_checkpoint.ckpt'
checkpoint = ModelCheckpoint(checkpoint_path, 
                             save_weights_only=True, 
                             save_best_only=True, 
                             monitor='val_loss',
                             verbose=1)

5. Training (fit)

epochs=10
history = model.fit(train_padded, train_labels, 
                    validation_data=(validation_padded, validation_labels),
                    callbacks=[checkpoint],
                    epochs=epochs)

Load Weights after training (ModelCheckpoint)

After training completes, you must call load_weights.
Otherwise, there was no point in creating the ModelCheckpoint.

# enter the file name the checkpoint was saved to
model.load_weights(checkpoint_path)

Visualizing loss and accuracy

Plotting the training loss

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 9))
plt.plot(np.arange(1, epochs+1), history.history['loss'])
plt.plot(np.arange(1, epochs+1), history.history['val_loss'])
plt.title('Loss / Val Loss', fontsize=20)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['loss', 'val_loss'], fontsize=15)
plt.show()

accuracy (์ •ํ™•๋„)์— ๋Œ€ํ•œ ์‹œ๊ฐํ™”

plt.figure(figsize=(12, 9))
plt.plot(np.arange(1, epochs+1), history.history['acc'])
plt.plot(np.arange(1, epochs+1), history.history['val_acc'])
plt.title('Acc / Val Acc', fontsize=20)
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'], fontsize=15)
plt.show()
