Sentiment Analysis | BASE #1

이지수 · February 9, 2022

CNN IMDB Keras RNN Sentiment Analysis python word embedding


1. Mainly Used

Word Embedding (Embedding Layer)
TensorFlow
RNN(LSTM)
CNN

2. Base Info

There are two main approaches to text sentiment analysis:
1. Machine-learning based
2. Sentiment-lexicon based

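Compared with the machine-learning approach, lexicon-based sentiment analysis has two drawbacks:

It cannot easily handle the fact that a word's sentiment score can change depending on what is being analyzed

It struggles with aspect-based sentiment analysis, which goes beyond a simple positive/negative label to identify the attribute of the target that causes the sentiment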
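The first drawback can be sketched with a toy lexicon scorer. Everything here (the `TOY_LEXICON` words, their scores, and the two example reviews) is made up for illustration and is not taken from a real sentiment lexicon.

```python
# Hypothetical toy lexicon: the words and scores are invented for this example
TOY_LEXICON = {"great": 1.0, "boring": -1.0, "unpredictable": 1.0}

def lexicon_score(text, lexicon):
    """Sum the fixed sentiment score of every known word in the text."""
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())

# "unpredictable" is praise for a movie plot but a complaint about steering
movie_review = "the plot was unpredictable"
car_review = "the steering was unpredictable"

movie_score = lexicon_score(movie_review, TOY_LEXICON)
car_score = lexicon_score(car_review, TOY_LEXICON)
print(movie_score, car_score)
```

Both reviews receive the same score because the lexicon assigns one fixed value per word, regardless of the domain being reviewed.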

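Word embedding, which represents the characteristics of a word as a low-dimensional vector, can improve the accuracy of machine-learning-based sentiment analysis

Word Embedding : words with similar meanings are located close to each other in the vector space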
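"Close in the vector space" is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors (the `toy_vectors` values are assumptions, not learned embeddings) only to show what the comparison looks like.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors chosen by hand; a trained model would learn these values
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, -0.5],
}

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print(sim_close, sim_far)
```

With these toy values, "good" and "great" point in nearly the same direction, while "good" and "awful" point in roughly opposite directions.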

3. Word to Index & Index to Word

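To run sentiment analysis on IMDB reviews, the text has to be converted into vectors
Text is only a sequence of symbols; the symbols themselves do not carry the meaning of the text
That is why each word must be matched to a vector
These vectors can be learned with deep learning
Before vectorizing the text, we first convert it to numbers using word indices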

sentences = ['i love pizza', 'i like chicken', 'now i want bread']

word_list = sentences[0].split()
word_list

['i', 'love', 'pizza']

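# To build a vocabulary from the text data, split out every word and store it in a dict

index_to_word = {}  # start from an empty dictionary

# The words are filled in an arbitrary order; the order itself does not matter
# By convention, <PAD>, <BOS>, and <UNK> go at the front of the dictionary

index_to_word[0]='<PAD>'  # padding token
index_to_word[1]='<BOS>'  # beginning of sentence
index_to_word[2]='<UNK>'  # word not in the vocabulary (unknown)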
index_to_word[3]='i'
index_to_word[4]='love'
index_to_word[5]='pizza'
index_to_word[6]='like'
index_to_word[7]='chicken'
index_to_word[8]='now'
index_to_word[9]='want'

print(index_to_word)

{0: '<PAD>', 1: '<BOS>', 2: '<UNK>', 3: 'i', 4: 'love', 5: 'pizza', 6: 'like', 7: 'chicken', 8: 'now', 9: 'want'}

# To turn text into numbers, the dictionary above has to be inverted into a {text: index} structure

word_to_index = {word : index for index, word in index_to_word.items()}
print(word_to_index)

{'<PAD>': 0, '<BOS>': 1, '<UNK>': 2, 'i': 3, 'love': 4, 'pizza': 5, 'like': 6, 'chicken': 7, 'now': 8, 'want': 9}

# Given a word, return its index
print(word_to_index['want'])
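Typing every entry by hand does not scale past a few words. A minimal sketch of building the same two dictionaries automatically is shown below; `build_vocab` and the `auto_` variable names are hypothetical helpers introduced here, and indices are assigned in order of first appearance.

```python
def build_vocab(sentences, special_tokens=('<PAD>', '<BOS>', '<UNK>')):
    """Assign an index to each special token, then to each new word in order of appearance."""
    word_to_index = {}
    for token in special_tokens:
        word_to_index[token] = len(word_to_index)
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
    index_to_word = {index: word for word, index in word_to_index.items()}
    return index_to_word, word_to_index

corpus = ['i love pizza', 'i like chicken', 'now i want']
auto_index_to_word, auto_word_to_index = build_vocab(corpus)
print(auto_word_to_index)
```

For this small corpus the result matches the hand-built dictionaries above, with 'i' at index 3 and 'want' at index 9.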

3-1. Word to Encoded (text to numbers)

# Given one sentence and the vocabulary dictionary, return the sentence as a list of word indices
# Every sentence starts with <BOS>; words missing from the vocabulary become <UNK>
def get_encoded_sentence(sentence, word_to_index):
    unk = word_to_index['<UNK>']
    return [word_to_index['<BOS>']] + [word_to_index.get(word, unk) for word in sentence.split()]

print(get_encoded_sentence('i love pizza', word_to_index))

[1, 3, 4, 5]

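# Encoding multiple sentences
# Encode a list of sentences into lists of indices in one call
# 'pizze', 'chichen', and 'bread' are not in the vocabulary, so they are encoded as <UNK> (2)
sentences = ['i love pizze', 'i like chichen', 'now i want bread']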

def get_encoded_sentences(sentences, word_to_index):
    return [get_encoded_sentence(sentence, word_to_index) for sentence in sentences]

encoded_sentences = get_encoded_sentences(sentences, word_to_index)
encoded_sentences

[[1, 3, 4, 2], [1, 3, 6, 2], [1, 8, 3, 9, 2]]

3-2. Encoded to Word

# Decode an encoded sentence back into text
def get_decoded_sentence(encoded_sentence, index_to_word):
    # [1:] skips the leading <BOS> token
    return ' '.join(index_to_word.get(index, '<UNK>') for index in encoded_sentence[1:])

print(get_decoded_sentence([1,3,4,5], index_to_word))

i love pizza

def get_decoded_sentences(encoded_sentences, index_to_word):
    return [get_decoded_sentence(encoded_sentence, index_to_word) for encoded_sentence in encoded_sentences]

print(get_decoded_sentences(encoded_sentences, index_to_word))

['i love <UNK>', 'i like <UNK>', 'now i want <UNK>']

4. Embedding Layer | Word Embedding

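Word to Index encodes the text as numbers, but these numbers are just arbitrarily assigned word positions, not vectors that correspond to meaning
So the vectors that represent word meaning are treated as trainable parameters and optimized through deep learning
PyTorch and TensorFlow both provide an Embedding layer that implements these meaning-vector parameters
Reference page : how an embedding layer turns words into vectors
Here the Embedding layer is used to re-express the text data from the previous step as a tensor of word vectors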

ATTENTION

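The sentences fed into the Embedding layer must all have the same length

In other words, every sentence in the input data must contain the same number of tokens -> use <PAD>

keras.preprocessing.sequence.pad_sequences pads every sentence to the same length; with padding = 'post' the padding is appended after the encoded sentence (the default, 'pre', puts it in front)

The value option sets which value is written into the padded positions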
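What post-padding does can be sketched in plain Python. `pad_post` is a hand-written stand-in written for this note, not the Keras function; it assumes `maxlen` is positive and, like the Keras default `truncating='pre'`, drops tokens from the front of sequences that are too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

encoded = [[1, 3, 4, 2], [1, 3, 6, 2], [1, 8, 3, 9, 2]]
padded_sentences = pad_post(encoded, maxlen=5, value=0)
print(padded_sentences)
# [[1, 3, 4, 2, 0], [1, 3, 6, 2, 0], [1, 8, 3, 9, 2]]
```

Every row now has five entries, so the list can be turned into a rectangular array.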

import numpy as np 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Sequential
from tensorflow.keras.layers import Embedding

vocab_size = len(word_to_index)
word_vector_dim = 4 # assume 4-dimensional word vectors

embedding = tf.keras.layers.Embedding(input_dim = vocab_size, \
                                    output_dim = word_vector_dim, mask_zero = True)
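# Apply the Embedding layer to the text data that was converted to numbers
# The encoded sentences have different lengths, so the list of lists is passed to pad_sequences as is;
# pad_sequences returns the NumPy array that the layer takes as input
# (calling np.array() on lists of unequal length raises an error in recent NumPy versions)

raw_inputs = get_encoded_sentences(sentences, word_to_index)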
raw_inputs = keras.preprocessing.sequence.pad_sequences(raw_inputs,
                                                    value = word_to_index['<PAD>'], padding = 'post', maxlen = 5)
output = embedding(raw_inputs)
print(output)

tf.Tensor(
[[[ 0.02264932  0.037343   -0.04712098 -0.00348868]
  [-0.04136361 -0.00685711 -0.04964727 -0.02061704]
  [ 0.00404279 -0.01909448 -0.03293613 -0.00015283]
  [ 0.00404279 -0.01909448 -0.03293613 -0.00015283]
  [ 0.04007402 -0.01136142 -0.046102    0.00067567]]

 [[ 0.02264932  0.037343   -0.04712098 -0.00348868]
  [-0.04136361 -0.00685711 -0.04964727 -0.02061704]
  [ 0.00404279 -0.01909448 -0.03293613 -0.00015283]
  [ 0.00404279 -0.01909448 -0.03293613 -0.00015283]
  [ 0.04007402 -0.01136142 -0.046102    0.00067567]]

 [[ 0.02264932  0.037343   -0.04712098 -0.00348868]
  [-0.02836841 -0.02056072 -0.00797039  0.02290538]
  [-0.04136361 -0.00685711 -0.04964727 -0.02061704]
  [ 0.00404279 -0.01909448 -0.03293613 -0.00015283]
  [ 0.00404279 -0.01909448 -0.03293613 -0.00015283]]], shape=(3, 5, 4), dtype=float32)

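shape = (3,5,4) means, in order, the number of input sentences, the maximum sentence length, and the word-vector dimension. The values themselves come from the layer's random initialization, so they differ from run to run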
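Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The NumPy sketch below imitates that lookup; the random table stands in for learned weights, and the uniform range of -0.05 to 0.05 is an assumption about the initializer rather than something stated in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, word_vector_dim = 10, 4

# Stand-in for the trainable weights: one 4-dim row per vocabulary index
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, word_vector_dim))

# The three padded sentences from the encoding step above
padded = np.array([[1, 3, 4, 2, 0],
                   [1, 3, 6, 2, 0],
                   [1, 8, 3, 9, 2]])

# Integer-array indexing replaces every index with its row of the table
vectors = embedding_table[padded]
print(vectors.shape)  # (3, 5, 4)
```

The same index always maps to the same row, so the `<BOS>` vector at position 0 is identical in all three sentences.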

5. RNN | Sequence Data

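A model well suited to processing sequence data
Mainly used for text data
An RNN is designed as a state machine: it describes a current state that changes with each new input arriving over time
A stateful conversation means that the responder remembers what was said earlier in the conversation
That is, when A and B talk and A is stateful, A remembers everything B has said so far and has to respond with something other than what was already said
Reference : Prof. Sunghun Kim (김성훈), RNN lecture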
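The state-machine idea can be sketched with a plain (vanilla) RNN cell in NumPy. This is a simplified recurrence, not the LSTM used in the model below, and the weight names `W_x`, `W_h`, `b` and the random values are assumptions for illustration.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Run h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a sequence and return every state."""
    h = np.zeros(W_h.shape[0])  # initial state: nothing remembered yet
    states = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, state_dim, steps = 4, 8, 5
W_x = rng.normal(scale=0.1, size=(state_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(state_dim, state_dim))
b = np.zeros(state_dim)

sequence = rng.normal(size=(steps, input_dim))  # 5 word vectors of dimension 4
states = simple_rnn(sequence, W_x, W_h, b)
print(states.shape)  # (5, 8)
```

Because each state feeds into the next step, changing an early input also changes the states that come after it, which is the "memory" described above.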

The code below uses an RNN model to process text data

vocab_size = 10
word_vector_dim = 4 # dimension of the embedding vector for each word

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, word_vector_dim, input_shape = (None,)))
# Use LSTM, the most widely used RNN layer. The LSTM state vector dimension is set to 8 here (can be changed)
model.add(keras.layers.LSTM(8))   
model.add(keras.layers.Dense(8, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))  # final output is 1-dim: positive/negative

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 4)           40        
_________________________________________________________________
lstm (LSTM)                  (None, 8)                 416       
_________________________________________________________________
dense (Dense)                (None, 8)                 72        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 9         
=================================================================
Total params: 537
Trainable params: 537
Non-trainable params: 0
_________________________________________________________________
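The parameter counts in the summary can be checked by hand. The helper functions below are hypothetical names written for this note; they use the standard formulas (an LSTM has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias).

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * input_dim + units * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

counts = [
    embedding_params(10, 4),  # 40
    lstm_params(4, 8),        # 416
    dense_params(8, 8),       # 72
    dense_params(8, 1),       # 9
]
total_params = sum(counts)
print(counts, total_params)   # [40, 416, 72, 9] 537
```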

ERROR OCCURRED

With numpy 1.20, building the model failed with "Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a numpy array." Reports said numpy 1.19 works, so I downgraded

However, the downgrade caused a lot of conflicts between packages

Solved. The fix is described in the Faced Error series

6. 1-D CNN

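A model that can be used in place of an RNN for text processing
Images are not sequence data, so an image classifier takes the whole image as input at once
In the same way, a 1-D CNN takes the whole sentence at once and scans it in one direction with filters of length 7 (the kernel size; the number of filters is set separately), extracting features found within a 7-word window and classifying the sentence from them
CNN-based models parallelize more efficiently than RNN-based ones, so they tend to train faster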

# Use Convolution and Pooling layers

vocab_size = 10
word_vector_dim = 4 # dimension of the embedding vector for each word

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, word_vector_dim, input_shape = (None, )))
model.add(keras.layers.Conv1D(16, 7, activation = 'relu')) # 16 filters, kernel size 7
model.add(keras.layers.MaxPooling1D(5))
model.add(keras.layers.Conv1D(16, 7, activation='relu'))
model.add(keras.layers.GlobalMaxPool1D())
model.add(keras.layers.Dense(8, activation='relu'))
model.add(keras.layers.Dense(1, activation = 'sigmoid'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 4)           40        
_________________________________________________________________
conv1d (Conv1D)              (None, None, 16)          464       
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 16)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 16)          1808      
_________________________________________________________________
global_max_pooling1d (Global (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
=================================================================
Total params: 2,457
Trainable params: 2,457
Non-trainable params: 0
_________________________________________________________________
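The Conv1D parameter counts and the sequence length this stack needs can be worked out with simple arithmetic. The helpers below are hypothetical names written for this note and assume the Keras defaults: 'valid' padding with stride 1 for Conv1D, and a MaxPooling1D stride equal to the pool size.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # Each filter has kernel_size * in_channels weights plus one bias
    return (kernel_size * in_channels + 1) * filters

def conv1d_out_len(length, kernel_size):
    return length - kernel_size + 1          # 'valid' padding, stride 1

def maxpool1d_out_len(length, pool_size):
    return (length - pool_size) // pool_size + 1   # stride == pool_size

def stack_out_len(length):
    """Length left after Conv1D(k=7) -> MaxPooling1D(5) -> Conv1D(k=7)."""
    length = conv1d_out_len(length, 7)
    length = maxpool1d_out_len(length, 5)
    return conv1d_out_len(length, 7)

first_conv = conv1d_params(in_channels=4, filters=16, kernel_size=7)    # 464
second_conv = conv1d_params(in_channels=16, filters=16, kernel_size=7)  # 1808

# Shortest input for which the second convolution still has one valid position
min_len = min(n for n in range(1, 200) if stack_out_len(n) >= 1)
print(first_conv, second_conv, min_len)  # 464 1808 41
```

Under these assumptions the stack needs at least 41 tokens per sentence, so the 5-token toy sentences above would be too short for it; full-length reviews are a better fit.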

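A simpler option is to use a single GlobalMaxPooling1D() layer
This takes the maximum over all words for each embedding dimension, so only the most strongly activated word per feature is kept and used to judge whether the sentence is positive or negative
Surprisingly, this can perform well

## Using only GlobalMaxPooling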

vocab_size = 10
word_vector_dim = 4

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, word_vector_dim, input_shape = (None, )))
model.add(keras.layers.GlobalMaxPooling1D())
model.add(keras.layers.Dense(8, activation = 'relu'))
model.add(keras.layers.Dense(1, activation= 'sigmoid'))

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 4)           40        
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 4)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 9         
=================================================================
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
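Global max pooling itself is a one-line reduction over the word axis. The sketch below uses a made-up 3-word sentence with 4-dimensional vectors (the numbers are assumptions, chosen so the result is easy to read).

```python
import numpy as np

# One hypothetical sentence: 3 words, each a 4-dim vector (values invented)
sentence_vectors = np.array([
    [0.1, -0.3,  0.5, 0.0],
    [0.7,  0.2, -0.1, 0.4],
    [-0.2, 0.9,  0.3, 0.1],
])

pooled = sentence_vectors.max(axis=0)            # max over the word axis
winning_word = sentence_vectors.argmax(axis=0)   # which word supplied each max

print(pooled)        # [0.7 0.9 0.5 0.4]
print(winning_word)  # [1 2 0 1]
```

The output has one value per embedding dimension regardless of sentence length, and different dimensions can be won by different words.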

Other options include mixing 1-D CNN and RNN layers,
building the model from FFN (FeedForward Network) layers only,
or using Transformer layers

The full IMDB data analysis continues in the next post :)
