Text Generation with RNN

Jacob Kim · January 31, 2024

Naver project

Naver Project Week3


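Let's look at how an RNN can be used to generate text.
We will implement a model that generates new text based on a given text.

Limitations of generation

Some of the generated sentences are grammatical, but most of them do not read naturally.

The model has not learned the meaning of words, but keep the following in mind:

The data is character-based. When training started, the model did not know how to spell an English word, or even that words are a unit of text.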

Setup

Connecting Drive

use_colab = True
assert use_colab in [True, False]

# Mount Google Drive only when running on Colab
if use_colab:
    from google.colab import drive
    drive.mount('/content/drive')

# Mounted at /content/drive

import tensorflow as tf

import numpy as np
import os
import time

Downloading the Shakespeare dataset

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

#Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
#1115394/1115394 [==============================] - 0s 0us/step

Reading the data

# Read the file and decode it
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# Number of characters in the text
print('Length of text: {}'.format(len(text)))

# Length of text: 1115394

# Print the first 250 characters of the text
print(text[:250])


#First Citizen:
#Before we proceed any further, hear me speak.

#All:
#Speak, speak.

#First Citizen:
#You are all resolved rather to die than to famish?

#All:
#Resolved. resolved.

#First Citizen:
#First, you know Caius Marcius is chief enemy to the people.

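# Print the number of unique characters in the file
vocab = sorted(set(text)) # the set of characters in the loaded text, sorted
print('{} unique characters'.format(len(vocab)))

# 65 unique characters

Processing the text

Vectorizing the text

The text has to be converted to numbers before it can be used for training.
Each character is mapped to an integer index, and these indices are what the model trains on.

# Create a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab) # the reverse lookup; keeping both directions (idx <=> char) is important

text_as_int = np.array([char2idx[c] for c in text]) # list comprehension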

char2idx

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}

idx2char

# array(['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?',
#      'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
#      'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
#      'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
#      'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
#          dtype='<U1')

print(text_as_int)
text_as_int.shape

#[18 47 56 ... 45  8  0]
#(1115394,)

Every character of the text, from position 0 to the end, has now been converted to an index.

print('{')
for char,_ in zip(char2idx, range(65)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ';' :  11,
  '?' :  12,
  'A' :  13,
  'B' :  14,
  'C' :  15,
  'D' :  16,
  'E' :  17,
  'F' :  18,
  'G' :  19,
  'H' :  20,
  'I' :  21,
  'J' :  22,
  'K' :  23,
  'L' :  24,
  'M' :  25,
  'N' :  26,
  'O' :  27,
  'P' :  28,
  'Q' :  29,
  'R' :  30,
  'S' :  31,
  'T' :  32,
  'U' :  33,
  'V' :  34,
  'W' :  35,
  'X' :  36,
  'Y' :  37,
  'Z' :  38,
  'a' :  39,
  'b' :  40,
  'c' :  41,
  'd' :  42,
  'e' :  43,
  'f' :  44,
  'g' :  45,
  'h' :  46,
  'i' :  47,
  'j' :  48,
  'k' :  49,
  'l' :  50,
  'm' :  51,
  'n' :  52,
  'o' :  53,
  'p' :  54,
  'q' :  55,
  'r' :  56,
  's' :  57,
  't' :  58,
  'u' :  59,
  'v' :  60,
  'w' :  61,
  'x' :  62,
  'y' :  63,
  'z' :  64,

# Show how the first 13 characters of the text map to indices
print ('{} ---- Index ---- > {}'.format(repr(text[:13]), text_as_int[:13]))
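The round trip between characters and indices can also be checked on a much smaller input. The following is a minimal sketch in plain Python; the toy string and the `toy_` names are assumptions made for illustration and are not part of the notebook.

```python
# Toy text standing in for the Shakespeare corpus (an assumption for illustration)
toy_text = "hello world"

# Sorted unique characters form the vocabulary
toy_vocab = sorted(set(toy_text))

# Character -> index, and index -> character (a plain list works as the reverse lookup)
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = list(toy_vocab)

# Encode the text as a list of integers, then decode it back
encoded = [toy_char2idx[ch] for ch in toy_text]
decoded = ''.join(toy_idx2char[i] for i in encoded)

print(toy_vocab)
print(encoded)
print(decoded == toy_text)
```

Decoding the encoded list gives back the original string, which is the property the model relies on when its predicted indices are turned back into text.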

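The prediction task

Given a character, or a sequence of characters, what is the most probable next character?

This is the task the model is trained to perform.
The input to the model is a sequence of characters, and the model is trained to predict the output.
The output is the character that follows at each time step.

Creating training examples and targets

Next, divide the text into example sequences.
Each input sequence contains seq_length characters from the text.
For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.
So the text is split into chunks of seq_length + 1 characters.
- For example, say seq_length is 4 and the text is "Hello". The input sequence would be "Hell" and the target sequence "ello".
To do this, first use the tf.data.Dataset.from_tensor_slices function to convert the text vector into a stream of character indices.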
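The chunking described above can be tried on a plain string before any TensorFlow is involved. This is only a sketch: `make_examples` is a hypothetical helper written for this illustration, and it assumes that a trailing chunk shorter than seq_length + 1 is simply discarded.

```python
def make_examples(text, seq_length):
    """Split text into (input, target) pairs, each seq_length characters long."""
    chunk_size = seq_length + 1
    examples = []
    # Step through the text one full chunk at a time; a trailing partial chunk is dropped
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        # Input is the chunk without its last character, target is the chunk shifted by one
        examples.append((chunk[:-1], chunk[1:]))
    return examples

print(make_examples("Hello", 4))
print(make_examples("Hello world", 4))
```

With "Hello" and a sequence length of 4 this yields the single pair ("Hell", "ello"), matching the example in the text.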

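# Suppose we want the model to produce the word "Hello".
# The training data then has to be set up as follows:
# input : H -> e -> l -> l
# output : e -> l -> l -> o

# Maximum sequence length (in characters) for a single input
seq_length = 100
# Each example consumes seq_length + 1 characters (the input plus the shifted target)
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int) # yields one character index at a time

for i in char_dataset.take(5):
    print(i.numpy())
    print(idx2char[i.numpy()]) # map the index back to its character to check it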

18
F
47
i
56
r
57
s
58
t

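The batch method lets us choose how many characters are grouped into each sequence.

# Group the character stream into chunks of seq_length + 1 characters;
# drop_remainder=True discards the final, shorter chunk
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()]))) # repr shows newline characters as \n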

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'

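For each sequence, the map method applies a simple function that duplicates the chunk and shifts it to form the input text and the target text.

Because the model generates text one character at a time, the character that follows each input character is used as its target.

# Hello -> the chunk that was loaded (seq_length + 1 characters)
# input : Hell # seq_length characters
# output : ello # seq_length characters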

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Print the input and target of the first example:

for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data: ', repr(''.join(idx2char[target_example.numpy()])))

#Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
#Target data:  'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

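Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "F" and tries to predict the index for "i" as the next character.
At the next time step it does the same thing, but the RNN considers the context from the previous time steps in addition to the current input character.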

for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("iter {:4d}".format(i))
    print("  inputs: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  generated text: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

#iter    0
#  inputs: 18 ('F')
#  generated text: 47 ('i')
#iter    1
#  inputs: 47 ('i')
#  generated text: 56 ('r')
#iter    2
#  inputs: 56 ('r')
#  generated text: 57 ('s')
#iter    3
#  inputs: 57 ('s')
#  generated text: 58 ('t')
#iter    4
#  inputs: 58 ('t')
#  generated text: 1 (' ')

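Creating training batches

tf.data was used to split the text into manageable sequences.
Before feeding this data into the model, it has to be shuffled and packed into batches.

# Batch size
BATCH_SIZE = 32

dataset = dataset.shuffle(10000).batch(BATCH_SIZE, drop_remainder=True)

for t,l in dataset:
    print(t)
    print(l)
    break
# The dataset, which started out as a stream of single characters, now yields batches of whole sequences.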

tf.Tensor(
[[ 0 21 31 ... 53 56  1]
 [45 53 52 ... 57 51 39]
 [50 42  1 ...  8  0  0]
 ...
 [24 13 26 ...  1 57 47]
 [58  1 58 ... 45  1 51]
 [46 43  1 ... 42  1 58]], shape=(32, 100), dtype=int64)
tf.Tensor(
[[21 31 13 ... 56  1 53]
 [53 52  7 ... 51 39 50]
 [42  1 57 ...  0  0 24]
 ...
 [13 26 16 ... 57 47 58]
 [ 1 58 53 ...  1 51 53]
 [43  1 47 ...  1 58 46]], shape=(32, 100), dtype=int64)
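As a quick sanity check, the number of batches per epoch follows from the text length, the sequence length, and the batch size. The sketch below redoes that arithmetic in plain Python; the variable names are chosen for this illustration, and it assumes both batching steps drop their incomplete remainder, as the drop_remainder=True arguments indicate.

```python
text_length = 1115394   # length of the text printed earlier
seq_length = 100
batch_size = 32

# Each example uses seq_length + 1 characters; the incomplete last chunk is dropped
num_sequences = text_length // (seq_length + 1)

# Batches are also built with drop_remainder=True
steps_per_epoch = num_sequences // batch_size

print(num_sequences)
print(steps_per_epoch)
```

The result, 345 steps, agrees with the 345/345 progress counter in the training log.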

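Building the model

Use tf.keras.Sequential to define the model.

This example uses three layers to define the model:

tf.keras.layers.Embedding : the input layer. A trainable lookup table that maps the integer code of each character to a vector with embedding_dim dimensions.
tf.keras.layers.LSTM : a type of RNN with size units = rnn_units (a GRU layer could also be used here).
tf.keras.layers.Dense : the output layer, with vocab_size outputs.

For each character, the model looks up the embedding, runs the LSTM one time step with the embedding as input, and applies the Dense layer to produce logits predicting the log-likelihood of the next character:

# Size of the vocabulary, in characters
vocab_size = len(vocab)

# Embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024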

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])

    return model

model = build_model(
    vocab_size = vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

for layer in model.layers:
   print(layer.output_shape)

#(32, None, 256)
#(32, None, 1024)
#(32, None, 65)

Trying out the model

Now run the model to check that it behaves as expected.

First, check the shape of the output.

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch size, sequence length, vocabulary size)")

# (32, 100, 65) # (batch size, sequence length, vocabulary size)

In the example above the input sequence length is 100, but the model can be run on inputs of any length.

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (32, None, 256)           16640     
                                                                 
 lstm (LSTM)                 (32, None, 1024)          5246976   
                                                                 
 dense (Dense)               (32, None, 65)            66625     
                                                                 
=================================================================
Total params: 5330241 (20.33 MB)
Trainable params: 5330241 (20.33 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
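The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard parameterization of each layer (an embedding table, an LSTM with four gates that each have input weights, recurrent weights and a bias, and a dense layer with a bias); the variable names are chosen for this illustration.

```python
vocab_size = 65
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: weight matrix plus bias
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The four numbers match the Param # column and the total reported by model.summary().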

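Training the model

This can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

Optimizer, Loss function

tf.keras.losses.sparse_categorical_crossentropy computes the loss directly from integer labels, without converting them to one-hot vectors.

Because the model returns logits, the from_logits flag has to be set.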

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch size, sequence length, vocabulary size)")
print("Loss: ", example_batch_loss.numpy().mean())

#Prediction shape:  (32, 100, 65)  # (batch size, sequence length, vocabulary size)
#Loss:  4.173869
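The untrained loss of about 4.17 is close to ln(65), which is what cross-entropy gives when all 65 characters are predicted as equally likely. The sketch below computes the loss for a single label from raw logits in plain Python; the function name and the all-zero logits are assumptions for illustration, not the Keras implementation.

```python
import math

def sparse_cross_entropy_from_logits(label, logits):
    """Cross-entropy of one integer label against unnormalized logits."""
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]

vocab_size = 65
uniform_logits = [0.0] * vocab_size   # a model with no preference between characters
loss_value = sparse_cross_entropy_from_logits(label=18, logits=uniform_logits)

print(loss_value, math.log(vocab_size))
```

A starting loss near this value suggests the freshly initialized model is producing a nearly uniform distribution over the vocabulary.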

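Preparing for training

Use the tf.keras.Model.compile method to configure the training procedure.
Use tf.keras.optimizers.Adam with its default arguments together with the loss function defined above, for example model.compile(optimizer='adam', loss=loss).

Configuring checkpoints

Use tf.keras.callbacks.ModelCheckpoint so that checkpoints are saved during training.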

# the save point
if use_colab:
    checkpoint_dir ='./drive/My Drive/train_ckpt/text_gen/exp1'
    if not os.path.isdir(checkpoint_dir):
        os.makedirs(checkpoint_dir)
else:
    checkpoint_dir = 'text_gen/exp1'

cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_dir,
                                                 save_weights_only=True,
                                                 monitor='loss',
                                                 mode='auto',
                                                 save_best_only=True,
                                                 verbose=1)
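With monitor='loss' and save_best_only=True, a checkpoint is written only when the monitored loss improves on the best value seen so far. The rule can be sketched in plain Python as follows; `epochs_that_save` is a hypothetical helper, the loss values are made up for illustration, and the sketch assumes that lower is better for a quantity named loss.

```python
def epochs_that_save(losses):
    """Return the 1-based epochs at which a 'save best only' rule would write a checkpoint."""
    best = float('inf')
    saved = []
    for epoch, current in enumerate(losses, start=1):
        # A lower monitored loss counts as an improvement
        if current < best:
            best = current
            saved.append(epoch)
    return saved

# Made-up loss values for illustration (not from the training run in this post)
print(epochs_that_save([2.2, 1.6, 1.7, 1.5]))
```

In this made-up run the third epoch is skipped because its loss is worse than the best so far.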

Running the training

EPOCHS=5 # roughly 5 to 10 epochs

history = model.fit(dataset,
                    epochs=EPOCHS,
                    callbacks=[cp_callback])

Epoch 1/5
344/345 [============================>.] - ETA: 0s - loss: 2.2324
Epoch 1: loss improved from inf to 2.23098, saving model to ./drive/My Drive/train_ckpt/text_gen/exp1
345/345 [==============================] - 20s 45ms/step - loss: 2.2310
Epoch 2/5
345/345 [==============================] - ETA: 0s - loss: 1.6181
Epoch 2: loss improved from 2.23098 to 1.61809, saving model to ./drive/My Drive/train_ckpt/text_gen/exp1
345/345 [==============================] - 18s 44ms/step - loss: 1.6181
Epoch 3/5
345/345 [==============================] - ETA: 0s - loss: 1.4512
Epoch 3: loss improved from 1.61809 to 1.45120, saving model to ./drive/My Drive/train_ckpt/text_gen/exp1
345/345 [==============================] - 17s 44ms/step - loss: 1.4512
Epoch 4/5
345/345 [==============================] - ETA: 0s - loss: 1.3682
Epoch 4: loss improved from 1.45120 to 1.36821, saving model to ./drive/My Drive/train_ckpt/text_gen/exp1
345/345 [==============================] - 18s 46ms/step - loss: 1.3682
Epoch 5/5
344/345 [============================>.] - ETA: 0s - loss: 1.3102
Epoch 5: loss improved from 1.36821 to 1.31024, saving model to ./drive/My Drive/train_ckpt/text_gen/exp1
345/345 [==============================] - 18s 47ms/step - loss: 1.3102

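Generating text

Restoring the latest checkpoint

This prediction step uses a batch size of 1.

Because of the way the RNN state is passed from time step to time step, the model only accepts the fixed batch size it was built with.
To run the model with a different batch size, rebuild the model and restore the weights from the checkpoint.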

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(checkpoint_dir)

model.build(tf.TensorShape([1, None]))

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (1, None, 256)            16640     
                                                                 
 lstm_1 (LSTM)               (1, None, 1024)           5246976   
                                                                 
 dense_1 (Dense)             (1, None, 65)             66625     
                                                                 
=================================================================
Total params: 5330241 (20.33 MB)
Trainable params: 5330241 (20.33 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

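The prediction loop

Generating text:

Choose a start string, initialize the RNN state, and set the number of characters to generate.
Use the start string and the RNN state to get the prediction distribution of the next character.
Then sample from a categorical distribution to compute the index of the predicted character.
Use this predicted character as the next input to the model.
The RNN state returned by the model is fed back into the model, so that it now has more context than just a single character.
After the next character is predicted, the updated RNN state is fed back into the model again, so the model keeps gaining context from the characters it has already predicted.
In other words, the model's output is fed back in as its input to generate text.
Looking at the generated text, you can see that the model knows when to capitalize, how to break the text into paragraphs, and how to imitate a Shakespeare-like vocabulary.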

def generate_text(model, start_string):
  # Evaluation step (generate text using the trained model)

  # Number of characters to generate
  num_generate = 1000

  # Convert the start string to numbers (vectorize)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to collect the generated characters
  text_generated = []

  # A low temperature produces more predictable text
  # A high temperature produces more surprising text
  # Experiment to find the best setting
  temperature = 1.0

  # Here the batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # Remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # Sample the next character from a categorical distribution over the model's predictions
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # Pass the predicted character to the model as the next input,
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])
    #   print(text_generated)
  return (start_string + ''.join(text_generated))

print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: $by hear?
May so conscience fasting of the Lord Gonger!
The wiol-d-joy Servingeromous; childisaped,
Who's fortune's power, wife,
At his praise and damned toking my head,
Who can so lour me so from help! Come, smuty, we would
my death-'twixt her pound and harm,' protisina,
I call them possess'd. Hark thingstant light.
save me, mothanger.

#BRUTUS:
#I'll payient light as find.

#SOMERSET:
#Therefore, as one, sir, but they say stirr.

#LUTENCIS:
#Therefore, Catesby;
#You have bare, like together;
#For York he it Sucjuding him by forget to better brought, aside!
#O keepsier than a womannonseor;
#Lord so my doter-joyful divERY:
#Ever as 'ere makes me, and totates
#Being now so fair as I contented wood,
#Of slave are fly to leavinest at tHe complexions.

#SIT:
#Ay, couns dembs for this:
#If thou do give you.

#BULIThy nobles, mine is should,
#And, with my father Adain, Farror--Collow's pine;
#And that that my contempt both presumpts my pity
#You know no worstion. What, God set them appoint.

#Senators:
#No, that
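The effect of the temperature setting in generate_text can be seen without TensorFlow. The sketch below divides a set of logits by the temperature, applies a softmax, and samples an index from the result; the function names and the three made-up logits are assumptions for illustration, and the sampling uses Python's random module rather than tf.random.categorical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Sample one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]   # made-up logits for a 3-character vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])

rng = random.Random(0)
print([sample_index(toy_logits, 1.0, rng) for _ in range(10)])
```

At a low temperature most of the probability mass sits on the largest logit, so the output is more predictable; at a high temperature the distribution flattens and less likely characters are sampled more often.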
