Codecademy [Learn Natural Language Processing]
GENERATING TEXT WITH DEEP LEARNING
Use TensorFlow's Keras API to build a seq2seq deep learning network of LSTM cells for machine translation.
Neural networks: machine learning framework loosely based on the structure of the human brain
early models were limited and performed poorly on complex tasks
deep-learning architectures
- multi-layer perceptrons (MLP): comprised of layered perceptrons; good at solving simple tasks
- convolutional neural networks (CNN): designed to process image data
- recurrent neural networks (RNN): widely adopted within natural language processing; integrate a loop into the connections between neurons, allowing information to persist across a chain of neurons (see the sketch below)
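A minimal sketch of that loop in plain NumPy (all names and dimensions here are illustrative, not from the lesson): the same weights are reused at every timestep, and the hidden state feeds back in so information persists across the sequence.
import numpy as np
# Toy dimensions, assumed for illustration only
input_dim, hidden_dim, seq_len = 4, 3, 5
W_xh = np.random.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights (the "loop")
inputs = np.random.randn(seq_len, input_dim)          # one sequence of 5 timesteps
hidden_state = np.zeros(hidden_dim)
for x_t in inputs:
  # The previous hidden state is fed back in at every step
  hidden_state = np.tanh(W_xh @ x_t + W_hh @ hidden_state)
print(hidden_state)  # final state summarizes the whole sequence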
Recurrent neural networks (RNN)
the most basic sequence model in deep learning
a model that processes inputs and outputs as sequences
basic RNNs
upgraded by applying LSTMs
sequence-to-sequence (seq2seq)
outputs a sequence in a different domain from the input sequence
ex) chatbots, machine translation, text summarization, speech-to-text (STT)
an architecture made up of two main parts
encoder : accepts language input.
reads every word of the input sentence in order, compresses all of that word information into a single vector (the context vector), and passes it to the decoder
decoder : takes the encoder’s final state (or memory) as its initial state
receives the context vector and outputs the translated words one at a time
TensorFlow
with the Keras API
from tensorflow import keras
import re
# Importing our translations
data_path = "span-eng.txt"
# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')
# Building empty lists to hold sentences
input_docs = []
target_docs = []
# Building empty vocabulary sets
input_tokens = set()
target_tokens = set()
for line in lines:
  # Skip empty lines (e.g. a trailing newline at the end of the file)
  if not line.strip():
    continue
  # Input and target sentences are separated by tabs
  input_doc, target_doc = line.split('\t')
  # Appending each input sentence to input_docs
  input_docs.append(input_doc)
  # Splitting words from punctuation
  target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
  # Redefine target_doc below
  # and append it to target_docs:
  target_doc = '<START> ' + target_doc + ' <END>'
  target_docs.append(target_doc)
  # Now we split up each sentence into words
  # and add each unique word to our vocabulary set
  for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
    print(token)
    # Add your code here:
    if token not in input_tokens:
      input_tokens.add(token)
  for token in target_doc.split():
    print(token)
    # And here:
    if token not in target_tokens:
      target_tokens.add(token)
input_tokens = sorted(list(input_tokens))
target_tokens = sorted(list(target_tokens))
# Create num_encoder_tokens and num_decoder_tokens:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)
max_encoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
max_decoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", target_doc)) for target_doc in target_docs])
Result
We'll
see
.
<START>
Después
veremos
.
<END>
We'll
see
.
<START>
Ya
veremos
.
<END>
import numpy as np
encoder_input_data = np.zeros(
(len(input_docs), max_encoder_seq_length, num_encoder_tokens),
dtype='float32')
Example
from tensorflow import keras
import numpy as np
from preprocessing import input_docs, target_docs, input_tokens, target_tokens, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length
print('Number of samples:', len(input_docs))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
input_features_dict = dict(
[(token, i) for i, token in enumerate(input_tokens)])
# Build out target_features_dict:
target_features_dict = dict(
[(token, i) for i, token in enumerate(target_tokens)])
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_features_dict = dict(
(i, token) for token, i in input_features_dict.items())
# Build out reverse_target_features_dict:
reverse_target_features_dict = dict(
(i, token) for token, i in target_features_dict.items())
encoder_input_data = np.zeros(
(len(input_docs), max_encoder_seq_length, num_encoder_tokens),
dtype='float32')
print("\nHere's the first item in the encoder input matrix:\n", encoder_input_data[0], "\n\nThe number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.")
# Build out the decoder_input_data matrix:
decoder_input_data = np.zeros(
(len(input_docs), max_decoder_seq_length, num_decoder_tokens),
dtype='float32')
# Build out the decoder_target_data matrix:
decoder_target_data = np.zeros(
(len(input_docs), max_decoder_seq_length, num_decoder_tokens),
dtype='float32')
result
Number of samples: 11
Number of unique input tokens: 18
Number of unique output tokens: 27
Max sequence length for inputs: 4
Max sequence length for outputs: 12
Here's the first item in the encoder input matrix:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
The number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.
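A quick sanity check (assuming the matrices above are in scope) is to print the shapes directly; with the sample numbers printed above they come out as in the comments.
# Shapes follow (number of samples, max sequence length, vocabulary size)
print(encoder_input_data.shape)   # (11, 4, 18)
print(decoder_input_data.shape)   # (11, 12, 27)
print(decoder_target_data.shape)  # (11, 12, 27)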
matrix_name[line, timestep, features_dict[token]] = 1.
from tensorflow import keras
import numpy as np
import re
from preprocessing import input_docs, target_docs, input_tokens, target_tokens, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length
input_features_dict = dict(
[(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
[(token, i) for i, token in enumerate(target_tokens)])
reverse_input_features_dict = dict(
(i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
(i, token) for token, i in target_features_dict.items())
encoder_input_data = np.zeros(
(len(input_docs), max_encoder_seq_length, num_encoder_tokens),
dtype='float32')
decoder_input_data = np.zeros(
(len(input_docs), max_decoder_seq_length, num_decoder_tokens),
dtype='float32')
decoder_target_data = np.zeros(
(len(input_docs), max_decoder_seq_length, num_decoder_tokens),
dtype='float32')
for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):
  for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
    print("Encoder input timestep & token:", timestep, token)
    # Assign 1. for the current line, timestep, & word
    # in encoder_input_data:
    encoder_input_data[line, timestep, input_features_dict[token]] = 1.
  for timestep, token in enumerate(target_doc.split()):
    # decoder_target_data is ahead of decoder_input_data by one timestep
    print("Decoder input timestep & token:", timestep, token)
    # Assign 1. for the current line, timestep, & word
    # in decoder_input_data:
    decoder_input_data[line, timestep, target_features_dict[token]] = 1.
    if timestep > 0:
      # decoder_target_data will be ahead by one timestep
      # and will not include the start token.
      print("Decoder target timestep:", timestep)
      # Assign 1. for the current line, timestep, & word
      # in decoder_target_data:
      decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1.
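To confirm the one-hot encoding round-trips, we can map the first encoder sample back to tokens with reverse_input_features_dict (a small check, assuming the arrays above are populated).
# Recover the tokens of the first input sentence from its one-hot rows
first_sample = encoder_input_data[0]
recovered = [reverse_input_features_dict[np.argmax(row)]
             for row in first_sample if row.any()]  # skip all-zero padding rows
print(recovered)  # e.g. ["We'll", 'see', '.']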
deep learning
requires two layer types from Keras:
input layer : defines a matrix to hold all the one-hot vectors that we’ll feed to the model.
LSTM layer : with some output dimensionality
Importing the layers & model
from keras.layers import Input, LSTM
from keras.models import Model
input layer : requires the number of dimensions that we’re providing
encoder_inputs = Input(shape=(None, num_encoder_tokens))
LSTM layer : select the dimensionality and whether to return the state
encoder_lstm = LSTM(100, return_state=True)
using a dimensionality of 100, so any LSTM output matrix will have shape [batch_size, 100]
getting final states : linking our LSTM layer with our input layer:
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
save the states
encoder_states = [state_hidden, state_cell]
ex
from prep import num_encoder_tokens
from tensorflow import keras
from keras.layers import Input, LSTM
from keras.models import Model
# Create the input layer:
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# Create the LSTM layer:
encoder_lstm = LSTM(256, return_state=True)
# Retrieve the outputs and states:
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
# Put the states together in a list:
encoder_states = [state_hidden, state_cell]
decoder layer
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(100, return_sequences=True, return_state=True)
this time we keep the outputs, not just the states
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)
run a final activation layer : using the softmax function, which will give us a probability distribution over the target vocabulary
Build a final Dense layer
from keras.layers import Input, LSTM, Dense
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
Filter outputs through the Dense layer
decoder_outputs = decoder_dense(decoder_outputs)
Model()
function from Keras
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
feed it the encoder and decoder inputs, as well as the decoder output
optimizer : helps minimize our error rate
loss function : to determine the error rate
we also add accuracy to the metrics to track while training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
.fit()
pass this method the encoder and decoder input data and the decoder target data
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=10, epochs=100, validation_split=0.2)
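Not required by the lesson snippet above, but a common follow-up so the trained weights can be reused later without retraining (the filename is just an example).
# Save the trained seq2seq model to disk (HDF5 format)
model.save('training_model.h5')
# ...and reload it later with:
# training_model = keras.models.load_model('training_model.h5')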
before testing the model, we need to redefine the seq2seq architecture in pieces
why? the model we used for training only works when we already know the target sequence
build an encoder model : with our encoder inputs and placeholders for the encoder’s output states
encoder_model = Model(encoder_inputs, encoder_states)
Building the two decoder state input layers
placeholders for the decoder’s input states
latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
Put the state input layers into a list:
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]
create new decoder states and outputs
by using the decoder LSTM and decoder dense layer
Call the decoder LSTM
decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
Saving the new LSTM output states:
decoder_states = [state_hidden, state_cell]
redefine the decoder output
by passing it through the dense layer:
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model( [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
testing our model
to accept a NumPy matrix (representing the test English sentence input) & use the encoder and decoder to generate Spanish output
the encoder takes in the new input (as a NumPy matrix) and gives us output states that we can pass on to the decoder
states = encoder_model.predict(test_input)
build an empty NumPy array for our Spanish translation, giving it three dimensions
target_sequence = np.zeros((1, 1, num_decoder_tokens))
- batch size, number of tokens to start with, number of tokens in our target vocabulary
Populate the first token of target sequence with the start token
target_sequence[0, 0, target_features_dict['<START>']] = 1.
before decoding, create an empty string that we can add the translation to, word by word
decoded_sentence = ''
full function
def decode_sequence(test_input):
  # Encode the input as state vectors:
  encoder_states_value = encoder_model.predict(test_input)
  # Set decoder states equal to encoder final states
  decoder_states_value = encoder_states_value
  # Generate empty target sequence of length 1:
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  # Populate the first token of target sequence with the start token:
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  decoded_sentence = ''
  return decoded_sentence
trying translation
revising the function
def decode_sequence(test_input):
  encoder_states_value = encoder_model.predict(test_input)
  decoder_states_value = encoder_states_value
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  decoded_sentence = ''
  stop_condition = False
  while not stop_condition:
    # Run the decoder model to get possible
    # output tokens (with probabilities) & states
    output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict(
      [target_seq] + decoder_states_value)
    # Choose token with highest probability
    sampled_token_index = np.argmax(
      output_tokens[0, -1, :])
    sampled_token = reverse_target_features_dict[
      sampled_token_index]
    decoded_sentence += " " + sampled_token
    # Exit condition: either hit max length (counted in tokens)
    # or find stop token.
    if (sampled_token == '<END>' or len(decoded_sentence.split()) > max_decoder_seq_length):
      stop_condition = True
    # Update the target sequence (of length 1).
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, sampled_token_index] = 1.
    # Update states
    decoder_states_value = [
      new_decoder_hidden_state,
      new_decoder_cell_state]
  return decoded_sentence
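A quick way to try it (a sketch, assuming encoder_input_data and input_docs from preprocessing are still available): feed one training sample at a time through decode_sequence.
# Translate a few of the training sentences to eyeball the results
for seq_index in range(3):
  test_input = encoder_input_data[seq_index: seq_index + 1]  # keep the batch dimension
  decoded_sentence = decode_sequence(test_input)
  print('Input sentence:', input_docs[seq_index])
  print('Decoded sentence:', decoded_sentence)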
seq2seq : a deep learning model that uses recurrent neural networks such as LSTMs
seq2seq (machine translation) : encoder (takes the input language, outputs state vectors) & decoder (accepts the encoder's final state, outputs the translation)
teacher forcing : a method used to train seq2seq decoders
marking the beginning/end of target sentences -> so the decoder knows where they start and end
one-hot vectors : represent a given word in a set of words, using 1 to indicate the current word (0 everywhere else)
Timesteps : track where we are in a sentence
Softmax function : converts the output of the LSTMs into a probability distribution over words in our vocabulary.
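For reference, softmax itself is easy to sketch in NumPy (a standalone illustration, not lesson code); the outputs are non-negative and sum to 1, which is what lets us treat them as word probabilities.
import numpy as np
def softmax(logits):
  # Subtract the max for numerical stability, then normalize the exponentials
  exps = np.exp(logits - np.max(logits))
  return exps / exps.sum()
print(softmax(np.array([2.0, 1.0, 0.1])))  # approximately [0.659 0.242 0.099]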