Word2vec

홍찬우 · July 23, 2023

Word2Vec implementation and embedding visualization

CBOW & SkipGram

How Word2Vec is trained

CBOW: predicts the center word from its surrounding context words
SkipGram: predicts the surrounding context words from the center word

import torch
from torch.utils.data import Dataset
from tqdm import tqdm

class CBOWDataset(Dataset):
  # w2i: token -> index vocabulary dict, assumed to be built beforehand
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]
      for i, center_id in enumerate(token_ids):
        # Keep only positions that have a full window on both sides
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.x.append(token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1])
          self.y.append(center_id)

    self.x = torch.LongTensor(self.x)  # (number of samples, 2 * window_size)
    self.y = torch.LongTensor(self.y)  # (number of samples)

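Building self.x and self.y:

Because CBOW predicts the center word from its context words,
- self.y receives only the center word's ID
- self.x receives the list of 2 * window_size context word IDs (window_size on each side of the center word)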
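
To make the slicing concrete, here is a small sketch on a toy sentence. The sentence, the `w2i` mapping built from it, and the helper name `cbow_pairs` are illustrative assumptions and not part of the original code; the windowing condition itself mirrors the dataset class above.

```python
import torch

def cbow_pairs(token_ids, window_size=2):
    """Return (contexts, centers) for every position with a full window."""
    contexts, centers = [], []
    for i, center_id in enumerate(token_ids):
        if i - window_size >= 0 and i + window_size < len(token_ids):
            left = token_ids[i - window_size:i]
            right = token_ids[i + 1:i + window_size + 1]
            contexts.append(left + right)
            centers.append(center_id)
    return contexts, centers

# Toy data (hypothetical): a six-token sentence and its vocabulary
tokens = ["the", "quick", "brown", "fox", "jumps", "high"]
w2i = {token: idx for idx, token in enumerate(tokens)}
token_ids = [w2i[t] for t in tokens]  # [0, 1, 2, 3, 4, 5]

contexts, centers = cbow_pairs(token_ids, window_size=2)
x = torch.LongTensor(contexts)  # shape (2, 4) = (number of samples, 2 * window_size)
y = torch.LongTensor(centers)   # shape (2,)

print(contexts)  # [[0, 1, 3, 4], [1, 2, 4, 5]]
print(centers)   # [2, 3]
```

Only positions 2 and 3 have two neighbors on each side, so a six-token sentence yields just two samples with window_size=2.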

import torch
from torch.utils.data import Dataset
from tqdm import tqdm

class SkipGramDataset(Dataset):
  # w2i: token -> index vocabulary dict, assumed to be built beforehand
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]
      for i, center_id in enumerate(token_ids):
        # Keep only positions that have a full window on both sides
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.y += (token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1])
          self.x += [center_id] * 2 * window_size

    self.x = torch.LongTensor(self.x)  # (number of samples)
    self.y = torch.LongTensor(self.y)  # (number of samples)

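Building self.x and self.y:

Because SkipGram predicts the context words from the center word,
- self.x receives the center word's ID repeated 2 * window_size times
- self.y receives the 2 * window_size context word IDs around that center word, so each sample is a single (center, context) pair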
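
The flattening into (center, context) pairs can be sketched the same way. The helper name `skipgram_pairs` and the toy ID sequence are assumptions made for illustration only.

```python
import torch

def skipgram_pairs(token_ids, window_size=2):
    """Return parallel lists (centers, contexts), one entry per (center, context) pair."""
    centers, contexts = [], []
    for i, center_id in enumerate(token_ids):
        if i - window_size >= 0 and i + window_size < len(token_ids):
            neighbors = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
            contexts += neighbors
            # Repeat the center ID once per neighbor so both lists stay aligned
            centers += [center_id] * len(neighbors)
    return centers, contexts

token_ids = [0, 1, 2, 3, 4, 5]  # hypothetical sentence already mapped to IDs
centers, contexts = skipgram_pairs(token_ids, window_size=2)
x = torch.LongTensor(centers)   # shape (8,)
y = torch.LongTensor(contexts)  # shape (8,)

print(list(zip(centers, contexts)))
# [(2, 0), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 4), (3, 5)]
```

Compared with CBOW, the same two valid center positions now produce eight samples, because every neighbor becomes its own training pair.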

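import torch

# B: batch size, W: window size, d_w: word embedding size, V: vocab size
class CBOW(torch.nn.Module):
  def __init__(self, vocab_size, dim):  # not in the original excerpt; layers inferred from the shapes below
    super().__init__()
    self.embedding = torch.nn.Embedding(vocab_size, dim)
    self.linear = torch.nn.Linear(dim, vocab_size)

  def forward(self, x):  # x: (B, 2W)
    embeddings = self.embedding(x)  # (B, 2W, d_w)
    embeddings = torch.sum(embeddings, dim=1)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output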

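CBOW takes as input the list of context words around each center word.
embeddings = torch.sum(embeddings, dim=1)
- Sums the context word embeddings over the window dimension, reducing (B, 2W, d_w) to (B, d_w)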
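
The shape changes described above can be checked on a dummy batch. This is a minimal sketch: the sizes are made up, and the standalone `embedding` and `linear` layers are stand-ins assumed to match the model's layers.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, W, d_w, V = 3, 2, 8, 20  # hypothetical sizes

embedding = nn.Embedding(V, d_w)  # stand-in for self.embedding
linear = nn.Linear(d_w, V)        # stand-in for self.linear

x = torch.randint(0, V, (B, 2 * W))   # (B, 2W) context word IDs
looked_up = embedding(x)              # (B, 2W, d_w)
summed = torch.sum(looked_up, dim=1)  # (B, d_w): window dimension collapsed
logits = linear(summed)               # (B, V): one score per vocabulary word

print(looked_up.shape, summed.shape, logits.shape)

# Summation ignores word order: reversing the context window gives the same vector
summed_reversed = torch.sum(embedding(x.flip(1)), dim=1)
print(torch.allclose(summed, summed_reversed, atol=1e-6))
```

The last check illustrates a side effect of summing: the model sees the context as an unordered bag of words.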

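import torch
# B: batch size, W: window size, d_w: word embedding size, V: vocab size
class SkipGram(torch.nn.Module):
  def __init__(self, vocab_size, dim):  # not in the original excerpt; layers inferred from the shapes below
    super().__init__()
    self.embedding = torch.nn.Embedding(vocab_size, dim)
    self.linear = torch.nn.Linear(dim, vocab_size)

  def forward(self, x): # x: (B)
    embeddings = self.embedding(x)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output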

SkipGram takes a single center word as input and predicts one context word per sample.
It therefore needs no reduction step like the one in CBOW.
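
Since each sample is one center word with one target word, the (B, V) output can be treated as classification logits over the vocabulary. The sketch below runs a single training step under that assumption; the post does not show the loss or optimizer, so `nn.CrossEntropyLoss`, SGD, the learning rate, and all sizes are assumptions. An `nn.Sequential` of an embedding and a linear layer stands in for the model.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, d_w, V = 4, 8, 20  # hypothetical sizes

# Embedding followed by a linear layer: same computation as the forward pass above
model = nn.Sequential(nn.Embedding(V, d_w), nn.Linear(d_w, V))
loss_fn = nn.CrossEntropyLoss()                          # assumed loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed optimizer

x = torch.randint(0, V, (B,))  # center word IDs, shape (B)
y = torch.randint(0, V, (B,))  # one context word ID per center word, shape (B)

logits = model(x)              # (B, V), no reduction over a window needed
loss = loss_fn(logits, y)      # scalar

optimizer.zero_grad()
loss.backward()
optimizer.step()

print(logits.shape, loss.item())
```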

PCA

Reduces high-dimensional embedding vectors to a low-dimensional space for visualization.
Available via from sklearn.decomposition import PCA.
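
A minimal sketch of the projection step, assuming a random matrix in place of the trained embeddings (in practice the matrix would be the model's embedding weights converted to a NumPy array):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 64))  # stand-in for a (V, d_w) embedding matrix

pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)  # (50, 2): one 2-D point per word

print(points_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The two columns of `points_2d` can then be used as x and y coordinates in a scatter plot.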

t-SNE

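Another method for visualizing high-dimensional vectors.
It places vectors that are close together even closer and vectors that are far apart even farther apart.
Available via from sklearn.manifold import TSNE.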
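
The same projection with t-SNE might look like the sketch below. The random matrix again stands in for trained embeddings, and the `perplexity`, `init`, and `random_state` values are illustrative choices, not settings taken from the post.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 64))  # stand-in for a (V, d_w) embedding matrix

# perplexity must be smaller than the number of samples (50 here)
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points_2d = tsne.fit_transform(embeddings)  # (50, 2)

print(points_2d.shape)
```

Unlike PCA, t-SNE has no separate transform step for new points, so the whole set of vectors to plot is embedded in one call.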

Multilingual embeddings

LaBSE

A multilingual sentence embedding model developed by Google AI
Built on BERT

I'd rather die tomorrow than live a hundred years without knowing you.
당신을 알지 못하고 수백 년을 사느니 당장 내일 죽는 게 낫겠어요.
==========================================================================
My dream wouldn't be complete without you in it.
당신이 그 안에 없다면, 내 꿈은 완벽하게 이뤄질 수 없어요.
==========================================================================
A day spent with you is my favorite day. So today is my new favorite day.
너와 보낸 하루는 내가 가장 좋아하는 날이야. 그러니까 오늘은 새로운 내가 제일 좋아하는 날이야.
==========================================================================
Life is a journey to be experienced, not a problem to be solved.
인생은 풀어야 하는 문제가 아니라 경험을 쌓는 여정이야.
==========================================================================
The flower that blooms in adversity is the most rare and beautiful of all.
역경 속에서 피어나는 꽃이 가장 귀하고 아름다운 거란다.
==========================================================================

Given an input sentence and a set of candidates, the model can also be used to return the candidate most similar to the input sentence.
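
Once sentences are embedded, this retrieval step reduces to a nearest-neighbor lookup. The sketch below assumes cosine similarity as the measure and uses small hand-written vectors in place of real LaBSE sentence embeddings; the function name `most_similar` is hypothetical.

```python
import numpy as np

def most_similar(query_vec, candidate_vecs):
    """Return (index, score) of the candidate with the highest cosine similarity."""
    query = query_vec / np.linalg.norm(query_vec)
    candidates = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = candidates @ query  # (num_candidates,) cosine similarities
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Stand-in vectors (hypothetical); real ones would be sentence embeddings
candidate_vecs = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.6, 0.8, 0.0]])
query_vec = np.array([0.5, 0.9, 0.0])

best, score = most_similar(query_vec, candidate_vecs)
print(best)  # 2
```

Because both sides are normalized first, the dot product equals the cosine similarity, and sentences in different languages can be compared as long as they were embedded by the same model.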

※ All code is sourced from the NAVER Connect Foundation boostcamp AI Tech, 5th cohort. ※
