[NLP Practice] 1. Overview of Natural Language Processing - Word Embeddings

jaeyun · January 21, 2021


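This post revisits the embedding methods mentioned in the earlier theory post and implements each of them.

The theory itself is covered in the separate theory post.

Word embedding means converting natural language, which carries linguistic meaning, into numbers that reflect its linguistic properties so that a computer can process it. The methods covered here fall into three broad groups.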

1. One-hot encoding

▶ Implementation points

(1) Remove duplicate words.

(2) Create an array as long as the number of words and fill it with 0s.

(3) Find each word's index and set that position to 1.

The code below implements this without using any library.

## no library
def one_hot(word_list):
  # (1) Remove duplicate words (dict.fromkeys keeps first-seen order, unlike set).
  word_list = list(dict.fromkeys(word_list))
  # (2) Create an array as long as the number of words and fill it with 0s.
  encoding_matrix = [[0 for col in range(len(word_list))] for row in range(len(word_list))]
  # (3) Find each word's index and set that position to 1.
  for index, word in enumerate(word_list):
    encoding_matrix[index][index] = 1
  return word_list, encoding_matrix

labels = ['cat','dog','rabbit','turtle']
words, encodings = one_hot(labels)
for word, encoding in zip(words, encodings):
  print("label : ", word, ", encoding : ", encoding)

'''
label :  cat , encoding :  [1, 0, 0, 0]
label :  dog , encoding :  [0, 1, 0, 0]
label :  rabbit , encoding :  [0, 0, 1, 0]
label :  turtle , encoding :  [0, 0, 0, 1]
'''
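
Once the vocabulary is fixed, encoding a single word is just an index lookup. The sketch below is a minimal illustration and not part of the original code: the helper names `build_index` and `encode` are made up for this example, and it assumes the vocabulary order is the order of first appearance.

```python
def build_index(words):
    """Map each distinct word to an index, in order of first appearance."""
    return {word: i for i, word in enumerate(dict.fromkeys(words))}

def encode(word, word_to_index):
    """Return the one-hot vector for a single word."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

word_to_index = build_index(['cat', 'dog', 'rabbit', 'turtle', 'dog'])
rabbit = encode('rabbit', word_to_index)
dog = encode('dog', word_to_index)
# Distinct one-hot vectors never share a 1, so their dot product is always 0.
dot = sum(a * b for a, b in zip(rabbit, dog))
print(rabbit, dog, dot)
```

Because every pair of different words has a dot product of 0, one-hot vectors carry no information about how similar two words are.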

With the pandas or sklearn libraries, the same thing can be implemented a little more simply.

## using pandas
import pandas as pd

label_dict = {'label':['cat','dog','rabbit','turtle']}
# dtype=int keeps the 0/1 output; recent pandas versions default to True/False
one_hot_encoding = pd.get_dummies(label_dict['label'], dtype=int)
print(one_hot_encoding)

'''
   cat  dog  rabbit  turtle
0    1    0       0       0
1    0    1       0       0
2    0    0       1       0
3    0    0       0       1
'''

## using sklearn
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

label_dict = {'label':['cat','dog','rabbit','turtle']}
df = pd.DataFrame(label_dict)
one_hot = OneHotEncoder()
one_hot_encoding = one_hot.fit_transform(df)
print(one_hot_encoding)

'''
(0, 0)	1.0
(1, 1)	1.0
(2, 2)	1.0
(3, 3)	1.0
'''
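
The sklearn output above is a sparse matrix: only the non-zero cells are printed, each as a (row, column) pair followed by its value. As a rough illustration in plain Python (the `to_dense` helper is hypothetical and is not an sklearn function), the same four entries can be expanded back into the full matrix:

```python
def to_dense(entries, n_rows, n_cols):
    """Expand {(row, col): value} entries into a full list-of-lists matrix."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense

# The four non-zero cells from the printed output above.
entries = {(0, 0): 1.0, (1, 1): 1.0, (2, 2): 1.0, (3, 3): 1.0}
dense = to_dense(entries, 4, 4)
for row in dense:
    print(row)
```

With sklearn itself, calling `.toarray()` on the result of `fit_transform` gives the dense array directly.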

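2. BoW (Bag of Words)

▶ Implementation points

(1) Split the input sentence into words, then remove duplicates.

(2) Create an array as long as the number of words and fill it with 0s.

(3) Count how many times the word at each index appears and update the array.

The code below implements this without using any library.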

## no library
def bow(sentence):
  # (1) Split the input sentence into words, then remove duplicates.
  #     dict.fromkeys keeps first-seen order, so the output is the same on every run.
  tokens = sentence.split()
  word_list = list(dict.fromkeys(tokens))
  # (2) Create an array as long as the number of words and fill it with 0s.
  embedding_matrix = [0 for element in range(len(word_list))]
  # (3) Count how many times the word at each index appears and update the array.
  #     Count whole tokens: sentence.count(word) would also match inside longer words.
  for index, word in enumerate(word_list):
    embedding_matrix[index] = tokens.count(word)
  return word_list, embedding_matrix

sentence = "Suzy is very very pretty woman and YoonA is very pretty woman too"
word_list, bow_embedding = bow(sentence)
print("word_list : ",word_list,", embedding : ",bow_embedding)

'''
word_list :  ['Suzy', 'is', 'very', 'pretty', 'woman', 'and', 'YoonA', 'too'] , embedding :  [1, 2, 3, 2, 2, 1, 1, 1]
'''
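
The same counts can also be obtained with `collections.Counter` from the standard library. This is an alternative sketch rather than the code above, and the function name `bow_counter` is made up for illustration. It counts whole tokens, which matters because `str.count` counts substring matches (for example, `'is'` also occurs inside `'this'`).

```python
from collections import Counter

def bow_counter(sentence):
    """Return (vocabulary, counts) with words in order of first appearance."""
    tokens = sentence.split()
    counts = Counter(tokens)          # token -> number of occurrences
    vocab = list(counts)              # Counter keeps first-seen order
    return vocab, [counts[word] for word in vocab]

text = "this is what it is"
vocab, vector = bow_counter(text)
print(vocab, vector)
# Substring counting also matches the 'is' inside 'this'.
substring_count = text.count("is")
print(substring_count)
```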

The sklearn version looks like this.

## using sklearn
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["Suzy is very very pretty woman and YoonA is very pretty woman too"]
vectorizer = CountVectorizer(min_df = 1, ngram_range = (1,1))
embedding = vectorizer.fit_transform(sentence)
# get_feature_names() was removed in newer sklearn releases; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out().tolist()
print("word_list : ",vocab,", embedding : ",embedding.toarray())

'''
word_list :  ['and', 'is', 'pretty', 'suzy', 'too', 'very', 'woman', 'yoona'] , embedding :  [[1 2 2 1 1 3 2 1]]
'''
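
The sklearn vocabulary differs from the hand-written version in two visible ways: every word is lowercased, and the words come out in alphabetical order. By default `CountVectorizer` also keeps only tokens of two or more word characters. The sketch below imitates that default behavior in plain Python. It is an approximation for illustration (the name `count_vectorize` is made up), not sklearn's actual implementation.

```python
import re
from collections import Counter

def count_vectorize(sentence):
    """Approximate the CountVectorizer defaults for a single sentence."""
    # Lowercase, then keep runs of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    counts = Counter(tokens)
    vocab = sorted(counts)            # alphabetical feature order
    return vocab, [counts[word] for word in vocab]

sentence = "Suzy is very very pretty woman and YoonA is very pretty woman too"
vocab, vector = count_vectorize(sentence)
print(vocab)
print(vector)
```

A one-letter word such as "a" would be dropped by this pattern, so the hand-written and sklearn counts can disagree on other sentences.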

3. Word2Vec

For word2vec, we will implement both variants, CBOW and Skip-Gram.

CBOW

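▶ Implementation points

(1) Split the input text into words and remove duplicates.

(2) Declare two dictionaries, word : index and index : word.

(3) Generate the training data.

 ① Choose the target, the word to predict, and the range of the context, the surrounding words.

 ② Append each context-target pair to the data list.

(4) Define the CBOW model.

(5) Instantiate the model and declare the loss function, optimizer, and so on.

(6) Train the model.

(7) Pick the words to test with and run the test.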
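
Step (3) can be sketched in plain Python before bringing in PyTorch. The snippet below is an illustration under assumptions: the function name `cbow_pairs` and the `window` parameter are hypothetical, with `window=2` meaning two words on each side of the target.

```python
def cbow_pairs(tokens, window=2):
    """Build (context, target) pairs; words too close to either end are skipped."""
    pairs = []
    for i in range(window, len(tokens) - window):
        left = tokens[i - window:i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, tokens[i]))
    return pairs

tokens = "CBOW and Skip-Gram are different.".split()
pairs = cbow_pairs(tokens)
for context, target in pairs:
    print(context, '->', target)
```

This five-word sentence yields a single pair, and its context is the one the CBOW code uses in its test step. In the example text, the word that context surrounds is 'Skip-Gram'.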

## using pytorch
import torch
import torch.nn as nn

EMBEDDING_DIM = 128
EPOCHS = 100

example_sentence = """In the case of CBOW, one word is eliminated, and the word is predicted from surrounding words.
Therefore, it takes multiple input vectors as inputs to the model and creates one output vector.
In contrast, Skip-Gram learns by removing all words except one word and predicting the surrounding words in the context through one word.
So, it takes a vector as input and produces multiple output vectors.
CBOW and Skip-Gram are different.""".split()

# (1) Split the input text into words and remove duplicates.
#     sorted() fixes the word order so the indices are the same on every run.
vocab = sorted(set(example_sentence))
vocab_size = len(vocab)  # number of distinct words, not the length of the text

# (2) Declare the word : index and index : word dictionaries.
word_to_index = {word:index for index, word in enumerate(vocab)}
index_to_word = {index:word for index, word in enumerate(vocab)}

# convert context to index vector
def make_context_vector(context, word_to_ix):
  idxs = [word_to_ix[w] for w in context]
  return torch.tensor(idxs, dtype=torch.long)

# make dataset function: the two words on each side form the context of the middle word
def make_data(sentence):
  data = []
  for i in range(2, len(sentence) - 2):
    context = [sentence[i - 2], sentence[i - 1], sentence[i + 1], sentence[i + 2]]
    target = sentence[i]
    data.append((context, target))
  return data

# (3) Generate the training data (make_data must be defined before this call).
data = make_data(example_sentence)

# (4) Define the CBOW model.
class CBOW(nn.Module):
  def __init__(self, vocab_size, embedding_dim):
    super(CBOW, self).__init__()

    self.embeddings = nn.Embedding(vocab_size, embedding_dim)

    self.layer1 = nn.Linear(embedding_dim, 64)
    self.activation1 = nn.ReLU()

    self.layer2 = nn.Linear(64, vocab_size)
    self.activation2 = nn.LogSoftmax(dim = -1)

  def forward(self, inputs):
    # sum the context word embeddings into a single (1, embedding_dim) vector
    embeded_vector = self.embeddings(inputs).sum(dim=0).view(1,-1)
    output = self.activation1(self.layer1(embeded_vector))
    output = self.activation2(self.layer2(output))
    return output

# (5) Instantiate the model and declare the loss function and optimizer.
model = CBOW(vocab_size, EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# (6) Train. The loss is summed over all pairs and one update is applied per epoch.
for epoch in range(EPOCHS):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_index)
        log_probs = model(context_vector)
        total_loss += loss_function(log_probs, torch.tensor([word_to_index[target]]))
    print('epoch = ',epoch, ', loss = ',total_loss.item())
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

# (7) Pick the context words to test with and run the test.
test_data = ['CBOW','and','are','different.']
test_vector = make_context_vector(test_data, word_to_index)
result = model(test_vector)
print('Prediction : ', index_to_word[torch.argmax(result[0]).item()])
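
The model ends in `LogSoftmax` and is trained with `NLLLoss`. Together these compute the cross-entropy between the predicted distribution and the target word. A small numeric sketch in plain Python, with made-up scores for a three-word vocabulary:

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, target_index):
    """Negative log-likelihood of the target class."""
    return -log_probs[target_index]

scores = [2.0, 0.5, -1.0]             # hypothetical raw outputs of the last layer
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]
print(sum(probs))                     # the probabilities sum to 1, up to rounding
print(nll(log_probs, 0), nll(log_probs, 2))
```

The loss is smaller when the target is the highest-scoring word (index 0 here) than when it is the lowest-scoring one (index 2), which is what pushes the model toward the correct word during training.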

Skip-Gram

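▶ Implementation points

(1) Split the input text into words and remove duplicates.

(2) Declare two dictionaries, word : index and index : word.

(3) Generate the training data.

 ① Choose the target, the surrounding words to predict, and the context, the single word given as input.

 ② Append each context-target pair to the data list.

(4) Define the Skip-Gram model.

(5) Instantiate the model and declare the loss function, optimizer, and so on.

(6) Train the model.

(7) Pick the word to test with and run the test.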
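
Compared with CBOW, step (3) swaps the roles: the center word becomes the input and the surrounding words become the targets. A plain-Python sketch, where the names `skipgram_pairs` and `window` are hypothetical:

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, surrounding words) pairs; edge words are skipped."""
    pairs = []
    for i in range(window, len(tokens) - window):
        surrounding = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
        pairs.append((tokens[i], surrounding))
    return pairs

tokens = "it takes a vector as input".split()
pairs = skipgram_pairs(tokens)
for center, surrounding in pairs:
    print(center, '->', surrounding)
```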

## using pytorch
import torch
import torch.nn as nn

EMBEDDING_DIM = 128
EPOCHS = 200
CONTEXT_SIZE = 4

example_sentence = """In the case of CBOW, one word is eliminated, and the word is predicted from surrounding words.
Therefore, it takes multiple input vectors as inputs to the model and creates one output vector.
In contrast, Skip-Gram learns by removing all words except one word and predicting the surrounding words in the context through one word.
So, it takes a vector as input and produces multiple output vectors.
CBOW and Skip-Gram are different.""".split()

# convert the single input word to its index
def make_context_vector(context, word_to_ix):
  idxs = word_to_ix[context]
  return torch.tensor(idxs, dtype=torch.long)

# make dataset function: the middle word is the input, the two words on each side are the targets
def make_data(sentence):
  data = []
  for i in range(2, len(sentence) - 2):
    context = sentence[i]
    target = [sentence[i - 2], sentence[i - 1], sentence[i + 1], sentence[i + 2]]
    data.append((context, target))
  return data

# (1) Split the input text into words and remove duplicates.
#     sorted() fixes the word order so the indices are the same on every run.
vocab = sorted(set(example_sentence))
vocab_size = len(vocab)  # number of distinct words, not the length of the text

# (2) Declare the word : index and index : word dictionaries.
word_to_index = {word:index for index, word in enumerate(vocab)}
index_to_word = {index:word for index, word in enumerate(vocab)}

# (3) Generate the training data.
data = make_data(example_sentence)

# (4) Define the Skip-Gram model.
class SKIP_GRAM(nn.Module):
  def __init__(self, vocab_size, embedding_dim, context_size):
    super(SKIP_GRAM, self).__init__()
    self.vocab_size = vocab_size
    self.context_size = context_size
    self.embeddings = nn.Embedding(vocab_size, embedding_dim)

    self.layer1 = nn.Linear(embedding_dim, 64)
    self.activation1 = nn.ReLU()

    self.layer2 = nn.Linear(64, vocab_size * context_size)
    self.activation2 = nn.LogSoftmax(dim = -1)

  def forward(self, inputs):
    embeded_vector = self.embeddings(inputs)
    output = self.activation1(self.layer1(embeded_vector))
    # reshape to one row of scores per context position before the softmax,
    # so that each row is its own distribution over the vocabulary
    output = self.layer2(output).view(self.context_size, self.vocab_size)
    return self.activation2(output)

# (5) Instantiate the model and declare the loss function and optimizer.
model = SKIP_GRAM(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# (6) Train. The loss is summed over all pairs and one update is applied per epoch.
for epoch in range(EPOCHS):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_index)
        log_probs = model(context_vector)
        total_loss += loss_function(log_probs, torch.tensor([word_to_index[t] for t in target]))
    print('epoch = ',epoch, ', loss = ',total_loss.item())
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

# (7) Pick the word to test with and run the test.
test_data = 'Skip-Gram'
test_vector = make_context_vector(test_data, word_to_index)
result = model(test_vector)
print('Prediction : ', [index_to_word[torch.argmax(r).item()] for r in result])
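
One detail worth checking in a model shaped like this is where the softmax is applied. The last layer produces `context_size * vocab_size` scores in one flat vector. If each context position is meant to have its own probability distribution over the vocabulary (an assumption about the intent here), the scores need to be split into rows first and normalized row by row. A plain-Python sketch with tiny made-up sizes:

```python
import math

def softmax(scores):
    """Softmax over one list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

context_size, vocab_size = 2, 3       # hypothetical sizes for illustration
flat_scores = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]

# Split the flat vector into one row per context position, then normalize each row.
rows = [flat_scores[i * vocab_size:(i + 1) * vocab_size] for i in range(context_size)]
per_position = [softmax(row) for row in rows]
# Normalizing the flat vector instead yields a single distribution over all 6 scores.
flat = softmax(flat_scores)

print([round(sum(row), 6) for row in per_position])
print(round(sum(flat[:vocab_size]), 6), round(sum(flat[vocab_size:]), 6))
```

With row-wise normalization every position sums to 1. With a single softmax over the flat vector the positions share one unit of probability, so a position with low scores ends up with almost none.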
